Nasty day-1 bug causing page checksums to miss corruptions on 2008 R2/2012/2014

Over the last few weeks I’ve been investigating a series of customer corruption issues (on one customer’s system) where it appears that the page checksum algorithm is missing a combination of DWORD bit flips, caused by another (Query Processor) bug that’s been introduced in recent builds of SQL Server 2008 R2, SQL Server 2012, and SQL Server 2014 (on 2014, most likely when using a readable AG secondary).

I’ve gone back-and-forth with my good friends in Product Support and the Product Group quite a bit and we’ve decided to publicize the issue before more people find they have unrecoverable corruption because of this. I’ve been told that this will be fixed ASAP for all three versions. I’ll blog details as soon as I know.

So what’s the bug?

Well, it involves the way the page checksum algorithm aggregates values. It turns out that a certain repeating bit pattern on adjacent words can cause the Fletcher’s Checksum algorithm (the one that’s used – it’s way cheaper than a full CRC-based checksum – see this Wikipedia link) to fail because of the order insensitivity of the algorithm (it’s explained in the Wikipedia link, under the section Weaknesses of Simple Checksums) and the fact that it can’t distinguish between a block of zeros and a block of ones (described in the main Weaknesses section). The SQL Server implementation was supposed to work around that problem (making it more like a traditional Adler’s Checksum, from which the Fletcher’s Checksum is derived), but obviously it doesn’t work entirely correctly.

The bug hasn’t come to light until now because there has never been a possibility of having these bit patterns adjacent in a data value that would have a page checksum on, or at least it’s been highly unlikely and we all just got lucky. Now that the main Query Processor bug exists, it’s able to persist these bad values because of the page checksum hole. It’s just a nasty combination.

To make matters worse, unfortunately DBCC CHECKDB also has an issue where it also fails to spot this corruption. DBCC CHECKDB uses the buffer pool to do all it’s I/O, and because the page checksum comes back clean (from the broken algorithm), DBCC CHECKDB does a reduced-scope page audit, and so misses the new corruption problem. That issue has been there since my rewrite in SQL Server 2005.

So it’s really a trifecta of nasty bugs, only two of which I’m responsible for (the Query Processor bug is new). I can only apologize I guess.

And the final kicker is that because the page checksum algorithm is compromised, backup checksums also don’t find the problem so most people’s 2008 R2/2012/2014 backups are also corrupt. Even restoring the backup and running DBCC CHECKDB won’t find the problem because of the page checksum bug I describe above.

It’s a pretty sucky combination that’s going to cause a lot of hassle for people to figure out whether they’ve been affected or not. From what we can tell, many people will have undetectable corruption that will continue to get worse until it causes query failures or the Storage Engine sets the database offline. It seems to be more common with OLTP query patterns so far, and 7 of our long-term clients have discovered they have this ‘undetectable’ corruption after I checked them all last week.

How can you tell if you have the corruption?

It’s not a simple test to figure out whether you have the problem, but with some luck you may be able to spot it. I know how to find it using a hex editor to search through data files (which I’ll explain below) and I’ll write some code that uses DBCC PAGE … WITH TABLERESULTS and some post-processing once I have a 100% accurate test and then I’ll blog the code.

In the meantime, here’s an example using XVI32 (note that XVI32 will truncate data files larger that 2GB so you’ll need to use my other favorite freeware hex editor HxD).

You need to set the database you suspect (I’d test all of them to be sure, including master, but you don’t need to check msdb) to be offline and then open it with your hex editor. Then you need to search for the bit pattern of all ones followed by all zeroes, for 32-bits each.

Here’s an example from a 2012 database:

fool

Summary

You really think there would be day-1 bugs in DBCC CHECKDB and the page checksum algorithm? Oh ye of little faith!

Have a great day :-)

26 thoughts on “Nasty day-1 bug causing page checksums to miss corruptions on 2008 R2/2012/2014

  1. I got it when you said “DBCC CHECKDB does a reduced-scope page audit” which is probably false. The point of CHECKDB is to do a full audit.

  2. Its 23:47 GMT and i thought i had made it the whole way through the day without falling for one. But you screwed me but good right at the end there!

  3. The things you find till this day continue to amaze me. Going to be interesting to try and check the DBs

  4. And I *so* said to myself “be on the lookout, it’s April 2nd but you didn’t check yesterday’s posts yet”. And still you managed to get me. Well played, sir.

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.