Over the last few weeks I’ve been investigating a series of customer corruption issues (on one customer’s system) where it appears that the page checksum algorithm is missing a combination of DWORD bit flips, caused by another (Query Processor) bug that’s been introduced in recent builds of SQL Server 2008 R2, SQL Server 2012, and SQL Server 2014 (on 2014, most likely when using a readable AG secondary).

I’ve gone back-and-forth with my good friends in Product Support and the Product Group quite a bit and we’ve decided to publicize the issue before more people find they have unrecoverable corruption because of this. I’ve been told that this will be fixed ASAP for all three versions, but it isn’t going to delay the RTM and general release of 2014. I’ll blog details as soon as I know.

So what’s the bug?

Well, it involves the way the page checksum algorithm aggregates values. It turns out that a certain repeating bit pattern on adjacent words can cause the Fletcher’s Checksum algorithm (the one that’s used – it’s way cheaper than a full CRC-based checksum – see this Wikipedia link) to fail because of the order insensitivity of the algorithm (it’s explained in the Wikipedia link, under the section Weaknesses of Simple Checksums) and the fact that it can’t distinguish between a block of zeros and a block of ones (described in the main Weaknesses section). The SQL Server implementation was supposed to work around that problem (making it more like a traditional Adler’s Checksum, from which the Fletcher’s Checksum is derived), but obviously it doesn’t work entirely correctly.

The bug hasn’t come to light until now because there has never been a possibility of having these bit patterns adjacent in a data value that would have a page checksum on, or at least it’s been highly unlikely and we all just got lucky. Now that the main Query Processor bug exists, it’s able to persist these bad values because of the page checksum hole. It’s just a nasty combination.

To make matters worse, unfortunately DBCC CHECKDB also has an issue where it also fails to spot this corruption. DBCC CHECKDB uses the buffer pool to do all it’s I/O, and because the page checksum comes back clean (from the broken algorithm), DBCC CHECKDB does a reduced-scope page audit, and so misses the new corruption problem. That issue has been there since my rewrite in SQL Server 2005.

So it’s really a trifecta of nasty bugs, only two of which I’m responsible for (the Query Processor bug is new). I can only apologize I guess.

And the final kicker is that because the page checksum algorithm is compromised, backup checksums also don’t find the problem so most people’s 2008 R2/2012/2014 backups are also corrupt. Even restoring the backup and running DBCC CHECKDB won’t find the problem because of the page checksum bug I describe above.

It’s a pretty sucky combination that’s going to cause a lot of hassle for people to figure out whether they’ve been affected or not. From what we can tell, many people will have undetectable corruption that will continue to get worse until it causes query failures or the Storage Engine sets the database offline. It seems to be more common with OLTP query patterns so far, and 7 of our long-term clients have discovered they have this ‘undetectable’ corruption after I checked them all last week.

How can you tell if you have the corruption?

It’s not a simple test to figure out whether you have the problem, but with some luck you may be able to spot it. I know how to find it using a hex editor to search through data files (which I’ll explain below) and I’ll write some code that uses DBCC PAGE … WITH TABLERESULTS and some post-processing once I have a 100% accurate test and then I’ll blog the code.

In the meantime, here’s an example using XVI32 (note that XVI32 will truncate data files larger that 2GB so you’ll need to use my other favorite freeware hex editor HxD).

You need to set the database you suspect (I’d test all of them to be sure, including master, but you don’t need to check msdb) to be offline and then open it with your hex editor. Then you need to search for the bit pattern of all ones followed by all zeroes, for 32-bits each.

Here’s an example from a 2012 database:

fool Nasty day 1 bug causing page checksums to miss corruptions on 2008 R2/2012/2014

Summary

You really think there would be day-1 bugs in DBCC CHECKDB and the page checksum algorithm? Oh ye of little faith!

Have a great day :-)