This post is about a disaster-recovery scenario I described in our bi-weekly newsletter a couple of weeks ago, and wanted to make sure it’s out on the web too for people to find and use.
I was helping someone try to recover data from a corrupt database, from an online forum question. They did not have any up-to-date backups without the corruption in, so fixing their backup strategy was a piece of advice they were given by a few people.
The output from DBCC CHECKDB on the database was:
Msg 8921, Level 16, State 1, Line 1
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.
Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: torn page (expected signature: 0x0; actual signature: 0x5555300). It occurred during a read of page (1:58) in database ID 10 at offset 0x00000000074000 in file ‘D:\dbname.mdf:MSSQL_DBCC10’. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
They’d tried running repair, but of course if DBCC CHECKDB says that it has to stop (i.e. error message 8921), then it can’t run repair.
I explained this, and how page 1:58 is a system table page and unrepairable, and so they’d have to script out as much of the database schema as possible, create a new database, and extract as much data as possible from the broken database.
I also explained that the page is part of the sys.syscolpars table, which is the equivalent of the old syscolumns system table, so that approach might not work if the corruption was such that it stopped the Query Processor from being able to use the table metadata.
Unfortunately my suspicions were correct, and the script/extract approach did indeed fail.
On a whim, I suggested trying something radical. A few years ago I blogged about a way to ‘fix’ broken boot pages using a hex editor to overwrite a broken boot page with one from an older copy of the database (see here) and demonstrated it at various conferences. I’d never tried it on a system table page before, but I figured that the page ID was low enough that the page likely hadn’t changed for a while.
What do I mean by that? Well, the sys.syscolpars clustered index is ordered by object ID, so the first few pages in the clustered index (of which page 1:58 is one), have the columns from the system tables, with very low object IDs. There’s never going to be the case where a new user table gets created and causes an insert into one of these low tables.
This means that an older backup of the database would have the current state of page 1:58 in it. So I suggested using the boot page hack on page 1:58 from the person’s older backup.
And it worked!
Luckily there wasn’t any other corruption in the database, so all the person had to do was root-cause analysis and remediation, and fixing the backup strategy so the situation wouldn’t arise in future.
Summary: In a disaster situation, when backups aren’t available; don’t be afraid to try something radical. As long as you try it on a copy of the database, it’s not as if you can make the situation any worse. And if you’re lucky, you’ll be able to make the situation a lot better.
2 thoughts on “Disaster recovery 101: fixing a broken system table page”
101, eh? ;-)
Yup – very easy to do when there’s no alternative.