How to tell if the IO subsystem is causing corruptions?

One of the comments I received recently is below:

Hi Paul,

If the corruption happens to be related to I/O Erros and there is nothing in the Event log or anywhere that points to I/O related issues, is there any Trace flag that we can enable when performing checkdb or checktable operations that can show us any information related to I/O Problems, Driver issues etc?. we have table corruptions happening on a regular basis but I need some kind of evidence to show to the SAN guys thats its a disk issue and not necessarily SQL Server. Any ideas or suggestions?

Thanks

Meher

This leads nicely into a blog post/repost about how to tell if your I/O subsystem is causing corruptions. You’ve got recurring corruption – you blame the hardware and the hardware guys blame the software. There’s no smoking gun and the hardware diagnostics come back clean. What can you do?

SQLIOSim

This is the hardware diagnostic you really want to run. It simulates a very heavy SQL Server workload and should do a far better job of discovering flaws in your hardware setup than individual hardware vendors’ diagnostics will. I always recommend that you run it before installing a system, as well as using it to expose hardware as the problem in difficult-to-diagnose corruption problems.

You can find info on it at http://support.microsoft.com/default.aspx?scid=kb;en-us;231619 – this has been heavily updated since I originally publicized it last year. There are also some great resources on how to interpret the results – Kevin Kline wrote a blog post pulling them all together – check it out here.

One thing I like to say at conferences when I’m discussing SQLIOSim is that you’re not just testing the hardware. You’re also testing all the software in between the disk and SQL Server (the OS, 3rd party drivers, RAID controller firmware, disk drive firmware,…)

Page Checksums

This is a cool new feature of 2005. Once page checksums are enabled (at the database level), whenever a database page is written out of SQL Server’s buffer pool, a checksum is calculated over the page’s contents and stamped on the page. This is the very last operation performed on the page before it leaves SQL Server’s control. When a page is read into SQL Server’s buffer pool, if it has a page checksum on it then the checksum is recalculated and verified. If the re-calculated checksum doesn’t match the one stamped on the page, something in the I/O subsystem MUST have changed the page (i.e. in between SQL Server writing and subsequently reading the page, something underneath SQL Server in the I/O stack corrupted the page).

Here are some points to note about page checksums (they debunk a bunch of common misconceptions):

  • Databases that are created on SQL Server 2005 automatically have page checksums turned on
  • Page checksums are a super-set of torn-page detection. Page checksums will also detect torn pages.
  • You cannot enable page checksums and torn-page detection at the same time.
  • Upgrading a database to SQL Server 2005 and turning on the page checksum option does not automatically protect all the pages, as a page has to be changed and written to disk after the database option is enabled to have a checksum written on it. Only when a page has been through this process is it protected. There is no tool or automatic way to force all pages to go through this process – as I mentioned in the last post.
  • In benchmarking tests with a TPC-H workload during SQL Server 2005 development, we measured approx 2% performance degradation as a result of having checksums enabled.
  • The checksum cannot be used for error correction. Generating an error-correcting checksum would be a more complicated algorithm and so would be slower to compute.
  • The checksum is validated when a page is read for checking by any of the DBCC CHECK* commands (regardless of whether the PHYSICAL_ONLY option was used) so all existing page checksums can be checked by issuing a DBCC CHECKDB command.
  • Any existing checksums are checked when pages are read as part of taking a backup. In addition, the restore logic will also verify the page checksums as the pages are restored from the backup media, so there is solid assurance that the data from the backup is consistent.
  • Page checksums do not prevent in-memory corruptions from memory scribblers (where the page is read in, corrupted in memory from some rogue process, and then written out with a new checksum). There is a ‘checksum sniffer’ that runs as part of the lazywriter process, randomly picking unchanged pages in the buffer pool and validating their checksums to see if the page has been scribbled on. There are now documented cases of this process finding memory corruptions on SQL Server 2005 installations.

Bad page checksums will result in IO errors being reported (as I mentioned in the previous post):

  • I/O failures will trigger read-retry logic, which will re-read the page several times to see if the error clears itself (if it does, a message is written to the SQL error log)
  • I/O failures that persist through read-retry are logged in the error log and Windows event log, so monitoring these will allow you to be alerted to hardware problems quickly.
  • If an I/O error is reported, the page is not taken offline in any way. Anything that subsequently touches the page will get another I/O error. The exception to this is if the IO error is encountered during transaction rollback. In this case, the database is taken offline and must be brought back online manually.

Trace flags

There are two trace flags you can use to add some extra auditing – these are documented in the SQL Server 2005 version of the SQL Server I/O whitepaper.

Trace flag 806 will cause ‘DBCC-style’ page auditing to be performed whenever a database page is read into the buffer pool. This is useful to catch cases where pages are being corrupted in memory and then written out to disk with a new page checksum. When they’re read back in the checksum will look correct, but the page is corrupt (because of the previous memory corruption). This page auditing goes someway to catching this – especially on non-Enterprise Edition systems that don’t have the ‘checksum sniffer’ I describe above.

Trace flag 3422 will cause auditing of transaction log records as they’re read (during transaction rollback or log recovery). This is useful because there is no equivalent to page checksums for transaction log records and so no way to detect whether log records are being corrupted.

Be careful with these trace flags – I don’t recommend using them unless your experiencing corruptions that you can’t diagnose. Turning them on will cause a big CPU hit because of the extra auditing that’s happening.

Summary

So, to answer the question in the comment, there are a few things you should do on SQL Server 2005 to help detect I/O subsystem problems. Page checksums in particular have helped to vastly reduce the number of undiagnosed corruption problems (saving time and hassle for customers and Product Support).

2 thoughts on “How to tell if the IO subsystem is causing corruptions?

  1. Hi Paul,

    Great Post Indeed. Will be very much obliged if get to know more about checksum sniffer part of Lazy Writer process is only in EnterPrise Edition and TF 806 is relevant only on non-Enterprise editions.

    Regards,
    Lucy Hawk

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.