The Curious Case of… trying to find an MDF file in a RAW disk

(The Curious Case of… used to be part of our bi-weekly newsletter but we decided to make it a regular blog post instead so it can sometimes be more frequent. It covers something interesting one of us encountered when working with a client, doing some testing, or were asked in a random question from the community.)

Continuing on from my theme last time of using a hex editor to find information, this morning I had an email from someone who was trying to find an MDF file in a RAW disk without a file system. Their client had suffered a catastrophic I/O subsystem failure and had ended up with disks that could only be accessed using the RAW file system, and the only valid backups were from 2018.

I explained that I was teaching so unable to consult, but as usual with nasty corruption cases, I’d be happy to swap a few emails to see if I could provide some help anyway.

The first thing is to find the start of the file. My thoughts on this were:

  • Look at the boot page from the old database (dbcc page (name, 1, 9, 3)), and find the offset of the string with the name of the database from the start of the boot page.
  • Above is an screenshot of part of a DBCC PAGE dump of the boot page for a database called salesdb. You can see that the name of the database starts on the third line with ‘7300’, where the cursor is. This is the byte reversed Unicode for lowercase s. So the name starts at offset 0x34 into the boot page record, which itself starts at offset 0x60 into the page. So the name string starts at offset 0x94, or 148, bytes from the start of the boot page.
  • Search from the start of the raw disk for the first occurrence of the pattern of bytes with the name of the database, then go back 148 bytes and look to see if you’ve found the boot page by checking the page ID (the 6 bytes starting at offset 32 from the start of the page; 4 byte page number, byte reversed; 2 byte file number, byte reversed). Now go back 73728 bytes in the disk and hopefully that should be the file header page, page zero.

Then you can continue looking for pages from there, searching for the first two bytes being ‘0100’ and then a valid page ID as described above.

You could even build some automation to look for specific allocation unit IDs in the page header m_objId field (4 bytes, byte reversed at offset 24 in a page) and m_indexId field (2 bytes, byte reversed at offset 6 in a page), using the following code (from a disaster recovery case I worked on a few years ago):

-- Plug in the allocation unit ID you're interested in
DECLARE @alloc BIGINT = requiredallocunitID;
DECLARE @index BIGINT;

SELECT @index =
    CONVERT (BIGINT,
    CONVERT (FLOAT, @alloc)
        * (1 / POWER (2.0, 48)) -- right shift, reciprocal of left shift
    );

SELECT
    CONVERT (BIGINT,
    CONVERT (FLOAT, @alloc - (@index * CONVERT (BIGINT, POWER (2.0, 48))))
        * (1 / POWER (2.0, 16)) -- right shift, reciprocal of left shift
    ) AS [m_objId],
    @index AS [m_indexId];
GO

You might run into false positives, but you should be able to find all the pages. I heard back from the original sender that he’d found several million pages from the MDF so far, using a similar method.

This reminds me of another heroic data recovery case I blogged about back in 2011 – see here – where someone did the same thing to salvage a database whose most recent valid backup was from *SIX* years previously.

Bottom line: up-to-date and valid backups will prevent someone having to do this to get your data back, but, with a little ingenuity and lots of time, nearly anything is possible :-)

Disaster recovery 101: Object ID 0, index ID -1, partition ID 0

Quite often I see questions about DBCC CHECKDB error messages along the lines of:

Msg 8909, Level 16, State 1, Line 2
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 72057594045726720 (type Unknown), page ID (16:1330467) contains an incorrect page ID in its page header. The PageId in the page header = (48:93722876).

Specifically, the part that confuses people is “Object ID 0, index ID -1, partition ID 0”. There isn’t an object ID of 0 in the system, and how can an index have an ID of -1?

This pattern means that DBCC couldn’t work out which object the page is part of.

There’s a data structure I invented when I rewrote DBCC CHECKDB for SQL Server 2005 called a DbccCombinedID, which encapsulates all the relevant IDs that identify where a page belongs. The default values for object, index, and partition IDs are 0, -1, and 0, respectively. When the page is read by DBCC CHECKDB, the allocation unit ID is calculated (see here for the algorithm) from the m_objId and m_indexId fields in the page header, and then DBCC CHECKDB does all the metadata look-ups to figure out the other IDs. If the page header is corrupt such that the calculated allocation unit ID isn’t valid, the other IDs can’t be populated and the ‘owner’ of the page cannot be found. There will likely be other error messages referencing the corrupt page as well.

So what do you do? Follow your usual method for dealing with corruption. See my post SQLskills SQL101: Dealing with SQL Server corruption for lots of advice.

 

Disaster recovery 101: fixing a broken system table page

This post is about a disaster-recovery scenario I described in our bi-weekly newsletter a couple of weeks ago, and wanted to make sure it’s out on the web too for people to find and use.

I was helping someone try to recover data from a corrupt database, from an online forum question. They did not have any up-to-date backups without the corruption in, so fixing their backup strategy was a piece of advice they were given by a few people.

The output from DBCC CHECKDB on the database was:

Msg 8921, Level 16, State 1, Line 1
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.
Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: torn page (expected signature: 0x0; actual signature: 0x5555300). It occurred during a read of page (1:58) in database ID 10 at offset 0x00000000074000 in file ‘D:\dbname.mdf:MSSQL_DBCC10’. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

They’d tried running repair, but of course if DBCC CHECKDB says that it has to stop (i.e. error message 8921), then it can’t run repair.

I explained this, and how page 1:58 is a system table page and unrepairable, and so they’d have to script out as much of the database schema as possible, create a new database, and extract as much data as possible from the broken database.

I also explained that the page is part of the sys.syscolpars table, which is the equivalent of the old syscolumns system table, so that approach might not work if the corruption was such that it stopped the Query Processor from being able to use the table metadata.

Unfortunately my suspicions were correct, and the script/extract approach did indeed fail.

On a whim, I suggested trying something radical. A few years ago I blogged about a way to ‘fix’ broken boot pages using a hex editor to overwrite a broken boot page with one from an older copy of the database (see here) and demonstrated it at various conferences. I’d never tried it on a system table page before, but I figured that the page ID was low enough that the page likely hadn’t changed for a while.

What do I mean by that? Well, the sys.syscolpars clustered index is ordered by object ID, so the first few pages in the clustered index (of which page 1:58 is one), have the columns from the system tables, with very low object IDs. There’s never going to be the case where a new user table gets created and causes an insert into one of these low tables.

This means that an older backup of the database would have the current state of page 1:58 in it. So I suggested using the boot page hack on page 1:58 from the person’s older backup.

And it worked!

Luckily there wasn’t any other corruption in the database, so all the person had to do was root-cause analysis and remediation, and fixing the backup strategy so the situation wouldn’t arise in future.

Summary: In a disaster situation, when backups aren’t available; don’t be afraid to try something radical. As long as you try it on a copy of the database, it’s not as if you can make the situation any worse. And if you’re lucky, you’ll be able to make the situation a lot better.