One of the comments I received recently is below:
Hi Paul, If the corruption happens to be related to I/O Erros and there is nothing in the Event log or anywhere that points to I/O related issues, is there any Trace flag that we can enable when performing checkdb or checktable operations that can show us any information related to I/O Problems, Driver issues etc?. we have table corruptions happening on a regular basis but I need some kind of evidence to show to the SAN guys thats its a disk issue and not necessarily SQL Server. Any ideas or suggestions? Thanks Meher
This leads nicely into a blog post/repost about how to tell if your IO subsystem is causing corruptions. You've got recurring corruption - you blame the hardware and the hardware guys blame the software. There's no smoking gun and the hardware diagnostics come back clean. What can you do?
SQLIOSim
This is the hardware diagnostic you really want to run. It simulates a very heavy SQL Server workload and should do a far better job of discovering flaws in your hardware setup than individual hardware vendors' diagnostics will. I always recommend that you run it before installing a system, as well as using it to expose hardware as the problem in difficult-to-diagnose corruption problems.
You can find info on it at http://support.microsoft.com/default.aspx?scid=kb;en-us;231619 - this has been heavily updated since I originally publicized it last year. There are also some great resources on how to interpret the results - Kevin Kline wrote a blog post pulling them all together - check it out here.
One thing I like to say at conferences when I'm discussing SQLIOSim is that you're not just testing the hardware. You're also testing all the software in between the disk and SQL Server (the OS, 3rd party drivers, RAID controller firmware, disk drive firmware,...)
Page Checksums
This is a cool new feature of 2005. Once page checksums are enabled (at the database level), whenever a database page is written out of SQL Server's buffer pool, a checksum is calculated over the page's contents and stamped on the page. This is the very last operation performed on the page before it leaves SQL Server's control. When a page is read into SQL Server's buffer pool, if it has a page checksum on it then the checksum is recalculated and verified. If the re-calculated checksum doesn't match the one stamped on the page, something in the IO subsystem MUST have changed the page (i.e. in between SQL Server writing and subsequently reading the page, something underneath SQL Server in the IO stack corrupted the page).
Here are some points to note about page checksums (they debunk a bunch of common misconceptions):
Bad page checksums will result in IO errors being reported (as I mentioned in the previous post):
Trace flags
There are two trace flags you can use to add some extra auditing - these are documented in the SQL Server 2005 version of the SQL Server IO whitepaper.
Trace flag 806 will cause 'DBCC-style' page auditing to be performed whenever a database page is read into the buffer pool. This is useful to catch cases where pages are being corrupted in memory and then written out to disk with a new page checksum. When they're read back in the checksum will look correct, but the page is corrupt (because of the previous memory corruption). This page auditing goes someway to catching this - especially on non-Enterprise Edition systems that don't have the 'checksum sniffer' I describe above.
Trace flag 3422 will cause auditing of transaction log records as they're read (during transaction rollback or log recovery). This is useful because there is no equivalent to page checksums for transaction log records and so no way to detect whether log records are being corrupted.
Be careful with these trace flags - I don't recommend using them unless your experiencing corruptions that you can't diagnose. Turning them on will cause a big CPU hit because of the extra auditing that's happening.
Summary
So, to answer the question in the comment, there are a few things you should do on SQL Server 2005 to help detect IO subsystem problems. Page checksums in particular have helped to vastly reduce the number of undiagnosed corruption problems (saving time and hassle for customers and Product Support).
Remember Me
a@href@title, strike
Theme design by Jelle Druyts
Pick a theme: BlogXP sqlx BlogXP sqlx
Powered by: newtelligence dasBlog 2.0.7226.0
The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.
© Copyright 2008, Paul S. Randal
E-mail