The Curious Case of… how to find which page an index row is on

(The Curious Case of… used to be part of our bi-weekly newsletter but we decided to make it a regular blog post instead so it can sometimes be more frequent. It covers something interesting one of us encountered when working with a client, doing some testing, or were asked in a random question from the community.)

I was working recently on a client’s corruption issue where I needed to know how to find which page an index row is on. The client was getting weird errors from DBCC CHECKDB around missing nonclustered index records, but SELECTs using that index on the supposedly missing key values would work. I suspected a stale read problem from the I/O subsystem as part of building the database snapshot that DBCC CHECKDB uses, but needed to prove which page the issue was on.

There’s a cool undocumented column that you can select called %%PHYSLOC%% that gives the page:record:slot of the selected record, and I’ve used it many times when working on corruption/data recovery issues for clients. It gives the information back in hex, so there’s a companion function called sys.fn_PhysLocCracker that formats the output nicely. Both of these have been in the product since SQL Server 2005.

Let me show you an example using the old AdventureWorks sample database.

Let’s say there’s a corruption issue where it says there’s a missing row for StateProvinceID = 1 and AddressID = 519 in the IX_Address_StateProvinceID nonclustered index (which is index ID 4) of the Person.Address table. If I want to prove that that row is NOT missing, here’s what I can do:

SELECT
	[StateProvinceID]
	, [AddressID]
	, [physloc].*
FROM [Person].[Address]
WITH (INDEX = 4)
CROSS APPLY sys.fn_PhysLocCracker (%%physloc%%) AS [physloc]
WHERE [StateProvinceID] = 1
	AND [AddressID] = 519;
GO

I’m selecting the index keys for the nonclustered index and using an index hint to force that index to be used.

And if I get any output, the row exists. In this case, I get:

StateProvinceID AddressID   file_id     page_id     slot_id
--------------- ----------- ----------- ----------- -----------
1               519         1           8120        9

Pretty cool!

I’ve also used this to find rows that I need to bit-twiddle using DBCC WRITEPAGE to work around corruptions to allow data recovery from trashed databases – tedious, but possible if you know what you’re doing.

Bottom line: there are quite a few little undocumented columns and functions you can use to easily get internals information about SQL Server, with real-world applications.

The Curious Case of… trying to find an MDF file in a RAW disk

(The Curious Case of… used to be part of our bi-weekly newsletter but we decided to make it a regular blog post instead so it can sometimes be more frequent. It covers something interesting one of us encountered when working with a client, doing some testing, or were asked in a random question from the community.)

Continuing on from my theme last time of using a hex editor to find information, this morning I had an email from someone who was trying to find an MDF file in a RAW disk without a file system. Their client had suffered a catastrophic I/O subsystem failure and had ended up with disks that could only be accessed using the RAW file system, and the only valid backups were from 2018.

I explained that I was teaching so unable to consult, but as usual with nasty corruption cases, I’d be happy to swap a few emails to see if I could provide some help anyway.

The first thing is to find the start of the file. My thoughts on this were:

  • Look at the boot page from the old database (dbcc page (name, 1, 9, 3)), and find the offset of the string with the name of the database from the start of the boot page.
  • Above is an screenshot of part of a DBCC PAGE dump of the boot page for a database called salesdb. You can see that the name of the database starts on the third line with ‘7300’, where the cursor is. This is the byte reversed Unicode for lowercase s. So the name starts at offset 0x34 into the boot page record, which itself starts at offset 0x60 into the page. So the name string starts at offset 0x94, or 148, bytes from the start of the boot page.
  • Search from the start of the raw disk for the first occurrence of the pattern of bytes with the name of the database, then go back 148 bytes and look to see if you’ve found the boot page by checking the page ID (the 6 bytes starting at offset 32 from the start of the page; 4 byte page number, byte reversed; 2 byte file number, byte reversed). Now go back 73728 bytes in the disk and hopefully that should be the file header page, page zero.

Then you can continue looking for pages from there, searching for the first two bytes being ‘0100’ and then a valid page ID as described above.

You could even build some automation to look for specific allocation unit IDs in the page header m_objId field (4 bytes, byte reversed at offset 24 in a page) and m_indexId field (2 bytes, byte reversed at offset 6 in a page), using the following code (from a disaster recovery case I worked on a few years ago):

-- Plug in the allocation unit ID you're interested in
DECLARE @alloc BIGINT = requiredallocunitID;
DECLARE @index BIGINT;

SELECT @index =
    CONVERT (BIGINT,
    CONVERT (FLOAT, @alloc)
        * (1 / POWER (2.0, 48)) -- right shift, reciprocal of left shift
    );

SELECT
    CONVERT (BIGINT,
    CONVERT (FLOAT, @alloc - (@index * CONVERT (BIGINT, POWER (2.0, 48))))
        * (1 / POWER (2.0, 16)) -- right shift, reciprocal of left shift
    ) AS [m_objId],
    @index AS [m_indexId];
GO

You might run into false positives, but you should be able to find all the pages. I heard back from the original sender that he’d found several million pages from the MDF so far, using a similar method.

This reminds me of another heroic data recovery case I blogged about back in 2011 – see here – where someone did the same thing to salvage a database whose most recent valid backup was from *SIX* years previously.

Bottom line: up-to-date and valid backups will prevent someone having to do this to get your data back, but, with a little ingenuity and lots of time, nearly anything is possible :-)

PFS corruption after upgrading from SQL Server 2014

I’m seeing reports from a few people of DBCC CHECKDB reporting PFS corruption after an upgrade from SQL Server 2014 to SQL Server 2016 or later. The symptoms are that you run DBCC CHECKDB after the upgrade and get output similar to this:

Msg 8948, Level 16, State 6, Line 5
Database error: Page (3:3863) is marked with the wrong type in PFS page (1:1). PFS status 0x40 expected 0x60.
Msg 8948, Level 16, State 6, Line 5
Database error: Page (3:3864) is marked with the wrong type in PFS page (1:1). PFS status 0x40 expected 0x60.
CHECKDB found 2 allocation errors and 0 consistency errors not associated with any single object.
CHECKDB found 2 allocation errors and 0 consistency errors in database 'MyProdDB'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (MyProdDB).

I’ve discussed with the SQL Server team and this is a known bug in SQL Server 2014.

The problem can occur if an ALTER INDEX … REORGANIZE is performed in a transaction and then rolled back, one of the affected extents can have some of its pages marked with the wrong PFS status. This state is valid in SQL Server 2014, but if one of the upgrade steps happens to move one of these pages, DBCC CHECKDB on the new version will complain with the errors above.

Note: this is not a bug in DBCC CHECKDB :-)

The fix for this issue is to run DBCC CHECKDB (yourdb, REPAIR_ALLOW_DATA_LOSS) and that will fix the PFS state. From anecdotal evidence, you might need to run repair twice. Repair will simply fix the PFS status, not deallocate/delete anything.

If you experience this issue, the SQL Server team requests that you contact CSS so they know how many people are hitting the issue and they may ask for access to the database to aid with developing a fix.

I’ll update this post when I get more information – at present (9/26/18) there is no fix available apart from running repair.