Saturday, April 19, 2008

Many times I've been asked to do a blog post about creating Agent alerts, and given that today I demo'd it as part of our Accidental DBA workshop at Connections, it seemed a good time to do the blog post too!

I demo this in the context of alerting a DBA when an 823 or 824 IO error occurs. One of my early blog posts (see here) explains what these are, as well as providing a corrupt database that you can use to see these errors happening.

The idea is that I want to know as soon as an IO error occurs so I can start recovering and take preventative action to stop it happening again. I don't want to rely on users telling me when a query hits an IO error, and I don't want to have to scan the SQL error logs to find them. So I'm going to create an alert.

The first step is to fire up Management Studio and make sure SQL Server Agent is running. Next we need to make there's actually an Operator defined - so the new alert has someone to actually alert! - so we'll use the New Operator wizard (see below for how to get there).

In the New Operator Wizard that appears, I've created an operator named 'SysAdmin'. There are three Notification Options you can use - email, net send, and pager. I've setup SysAdmin to use net send to my local machine. You need to make sure the Messenger service is enabled otherwise net send will not work. Also, be aware the net sends will NOT work unless the machine has a network connection - even if the net send source and destination are the same machine! Given the various issues with net send, it's better to use email or pager alerts - but for the purposes of this blog post its the easiest option.

Now let's create the new alert - using the New Alert wizard (see the below for how to get there).

In the New Alert Wizard that appears, I've created an alert named 'IO Errors' for all severity 24 errors on all databases. Below is a portion of the General tab of the wizard showing these settings:

I also need to specify what happens. In the Response tab of the wizard I've set the SysAdmin operator to be notified using net send. Again, see below.

In the Options tab I've checked the box to include the error text in the net send.

Now let's test it. Using the database called 'broken' that I provide as an example (see here), I'll force an IO error to occur. In my query window I get:

SELECT * from broken..brokentable;
GO

Msg 824, Level 24, State 2, Line 1

SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x7232c940; actual: 0x720e4940). It occurred during a read of page (1:143) in database ID 10 at offset 0x0000000011e000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\broken.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

And a few seconds later I get the net send:

Pretty cool!

You can also use the WMI Provider to do this - see Creating a SQL Server Agent Alert by Using the WMI Provider for Server Events.

Saturday, April 19, 2008 7:05:23 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Friday, March 21, 2008

This is a really interesting question that came up in the Microsoft Certified Architect class I'm teaching at present - if a database has torn-page protection enabled, and page checksums are enabled, is all the existing torn-page detection lost?

This is an important question, because enabling page checksums doesn't suddenly make all allocated pages be protected by page checksums (it's not until a page is read into the buffer pool, modified, and then written back to disk, that it gets a page checksum). If all the existing torn-page protection is discarded when page checksums are enabled, then the pages would be unprotected until they got page checksums on. I couldn't remember the answer, so I experimented!

My idea was to create a database with torn-page protection, create a table with a simulated torn-page in it, then enable page checksums and see if the torn-page was still reported.

-- Create the test database
USE master;
GO
CREATE DATABASE ChecksumTest;
GO
USE ChecksumTest;
GO

-- Explicitly set the database to have torn-page detection
ALTER DATABASE ChecksumTest SET PAGE_VERIFY TORN_PAGE_DETECTION;
GO

-- Create a test table and insert a row.
CREATE TABLE BrokenTable (c1 INT, c2 CHAR (1000));
INSERT INTO BrokenTable VALUES (1, 'a');
GO

-- Ensure the page is written to disk and then tossed from the buffer pool
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO

Now I'm going to examine the page. There are two bits in the page header that specify whether the page is protected by torn-page detection or with a page checksum. Specifically, the m_flagBits field will have 0x100 set if the page is encoded for torn-page protection, and 0x200 set if the page has a page-checksum stored on it, and the page has not been modified (i.e. the checksum is stillvalid). You should not see the 0x100 bit set as torn-page encoding is removed when the page is read into the buffer pool - UNLESS the page IS actually torn, in which case the encoding is NOT removed.

sp_allocationmetadata 'BrokenTable';
GO
DBCC TRACEON (3604);
GO
DBCC PAGE ('ChecksumTest', 1, 143, 3);
GO

<snip>

m_pageId = (1:143)                   m_headerVersion = 1                  m_type = 1
m_typeFlagBits = 0x4                 m_level = 0                          m_flagBits = 0x8000
m_objId (AllocUnitId.idObj) = 67     m_indexId (AllocUnitId.idInd) = 256 
Metadata: AllocUnitId = 72057594042318848                                
Metadata: PartitionId = 72057594038321152                                 Metadata: IndexId = 0
Metadata: ObjectId = 2073058421      m_prevPage = (0:0)                   m_nextPage = (0:0)
pminlen = 1008                       m_slotCnt = 2                        m_freeCnt = 6070
m_freeData = 2118                    m_reservedCnt = 0                    m_lsn = (28:183:2)
m_xactReserved = 0                   m_xdesId = (0:0)                     m_ghostRecCnt = 0
m_tornBits = 770
      

<snip>     

In this case the torn-page encoding has been removed, and the page is fine. Once I've corrupted the page on disk, it's tricky to be able to see it with DBCC PAGE. I managed to catch it once and saw the following:

m_pageId = (1:143)                   m_headerVersion = 1                  m_type = 1
m_typeFlagBits = 0x4                 m_level = 0                          m_flagBits = 0x8100
m_objId (AllocUnitId.idObj) = 67     m_indexId (AllocUnitId.idInd) = 256 
Metadata: AllocUnitId = 72057594042318848                                
Metadata: PartitionId = 72057594038321152                                 Metadata: IndexId = 0
Metadata: ObjectId = 2073058421      m_prevPage = (0:0)                   m_nextPage = (0:0)
pminlen = 1008                       m_slotCnt = 1                        m_freeCnt = 7083
m_freeData = 1107                    m_reservedCnt = 0                    m_lsn = (28:81:20)
m_xactReserved = 0                   m_xdesId = (0:0)                     m_ghostRecCnt = 0
m_tornBits = 41949233

Now if I try to select from the table I get:         

SELECT * FROM BrokenTable;
GO

Msg 824, Level 24, State 2, Line 1

SQL Server detected a logical consistency-based I/O error: torn page (expected signature: 0xaaaaaaaa; actual signature: 0xaaaaa82a). It occurred during a read of page (1:143) in database ID 8 at offset 0x0000000011e000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\ChecksumTest.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

The crux of the question is whether this will still be reported if the database switches to page checksums - let's try:

ALTER DATABASE checksumtest SET PAGE_VERIFY CHECKSUM;
GO

SELECT * FROM BrokenTable;
GO

Msg 824, Level 24, State 2, Line 1

SQL Server detected a logical consistency-based I/O error: torn page (expected signature: 0xaaaaaaaa; actual signature: 0xaaaaa82a). It occurred during a read of page (1:143) in database ID 8 at offset 0x0000000011e000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\ChecksumTest.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Cool! The answer is YES - the torn-page is still detected, because the bit in the page header specifies which page protection algorithm the page is using. In fact, it even works if you turn off page checksums and torn-page detection completely.

Friday, March 21, 2008 3:23:07 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Wednesday, March 12, 2008

A couple of weeks ago I blogged about the three tracks of the SQL Server 2008 JumpStart course that SQLskills.com taught internally for Microsoft and some MVPs - see here for details. Well, the content is now available to download! Note that this was based on CTP-5 (November 2007 CTP) and there have been *lots* of behavioral changes since then (with more planned for CTP-6 Refresh and RTM), but if you want a high-level overview of a bunch of the features (albeit in bullet-point summaries on slides) then this is a good place to start.

For me, what's *REALLY* cool is that the site also has a downloadable VPC plus lab manuals for all of the AlwaysOn High-Availability hands-on labs that SQLskills.com wrote. We originally wrote these labs for SQL Server 2005 and I updated them all for CTP-5. The VPC has a long lab on each of the following:

  • Database Snapshots
  • Data Recovery and Preventative Techniques
  • Instant Initialization
  • Peer-to-Peer Replication (including the new Topology Wizard I blogged about here)
  • Table and Index Partitioning
  • Snapshot Isolation
  • Online Operations
  • Database Mirroring (including a demo I wrote of Automatic Page Repair, described here)
  • Service Oriented Database Architecture

There is some great depth in each of these - Kimberly blogged more info about the exercises in each lab here. There's also another VPC image with some higher-level labs on a variety of 2008 features and written by a number of different people- including some labs on Policy-Based Management and Performance Data Collection that Kimberly wrote.

So - where can you get these from? Go to http://sqlserver2008jumpstart.microsofttraining.com/ and hit the Download link on the right-hand side. Register and then you can get to the materials. The AlwaysOn VPC image is Collection 2 at the bottom of the page, and you'll see all the slide decks as you scroll down the page.

Enjoy!

Wednesday, March 12, 2008 9:23:23 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Wednesday, December 12, 2007

While we were in Barcelona we sat down with Richard Campbell and Greg Hughes from RunAs Radio to record a 1/2 hour interview on SQL Server 2008. We touch on a ton of different features (look at the number of Categories I've tagged this with!) and have a bunch of laughs along the way - check it out here.

PS There's been a ton of interest in the slide deck idea I had so we'll be going ahead with that. Look for an announcement sometime in the first few months of next year about how to get them. Thanks to everyone that replied!

Wednesday, December 12, 2007 10:07:05 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Thursday, September 27, 2007

One of the hottest features in SQL Server 2005 is database mirroring, and it's helped many companies implement successful and relatively inexpensive high-availability strategies. In SQL Server 2008, Database Mirroring has been enhanced in several ways - one of which is the ability to automatically repair corrupt pages!

This feature is based on the fact that the principal and mirror databases are exactly the same. So, if a page becomes corrupt on the principal, SQL Server should be able to read the page from the mirror and use it to fix the principal. Similarly, if a page becomes corrupt on the mirror, the page can be read from the principal to fix up the mirror. Pretty cool, eh?

Details

  • The feature works for pages that have 824 errors, 823 errors where the OS returns a CRC error while reading the page (to prevent resource issues triggering a page repair), and pages that have 829 errors (where the page is marked as restore pending). See my previous post here for more details on page errors.
  • Pages are fixed asynchronously.
    • If the page is corrupt in the principal, the query that hit the corrupt page will fail. Once discovered, a page is marked as being 829 until its fixed. This prevents an issue where a transient disk error could allow a subsequent update to change the page after it's been queued for being repaired, and then the page is overwritten with a copy from the mirror, losing the update. Nasty.
    • If the page is corrupt in the mirror (which is discovered when the page is read as part of the continual recovery of the log), the mirroring session is suspended. The mirror keeps track of all corrupt pages that need to be repaired with copies from the principal. Once all corrupt pages have been repaired, the mirroring session will be resumed automatically. This means that if a page is corrupt in both the mirror and the principal, manual intervention will be required to resolve the issue.
  • The feature is only available in Enterprise Edition. This means that if the principal is on Enterprise Edition and the mirror is on Standard Edition, then corrupt pages on the principal can repaired from the mirror but not the other way around.
  • There is a new DMV - sys.dm_db_mirroring_auto_page_repair - that allows you to track corrupt pages in mirrored databases 
    • It covers all mirrored databases on a server.
    • It provides info on the last 100 pages that were found in any mirrored database, as well as the status of the automatic page repair operation.
    • This DMV isn't yet included in the SQL Server 2008 Books Online available on TechNet but will have the following info:
      • Database ID the page is in
      • The Page ID, split into file and page-in-file
      • The error type - distinguishing between 823 errors, torn-page errors, page checksum failures, and all-other-824 errors
      • The status of the page repair operation
      • The time that the status was last updated
  • If a page repair fails for any reason (e.g. the mirroring partner couldn't supply the page) then the repair will be marked as failed. If the page is then hit again (by a query on the principal or a recovery operation on the mirror) then it will be re-queued for repair.
  • Not all pages can be repaired - the file header page, database boot page, and allocation bitmap pages (GAM, SGAM, PFS) cannot be repaired this way.

In Action

I've got a system with a few SQL Server 2008 instances running so I decided to give it a try. Here's what I did, using the TicketSalesDB from the Always-On DVDs and labs that SQLskills produces:

  • Changed the database to use page checksums
  • Rebuilt the clustered index of one of the tables (so the pages have page checksums on)
  • Took the initial backup of the database that's needed for mirroring (so that it has no corruptions in)
  • Used DBCC IND to find a page in the leaf level of the clustered index and corrupted the page
  • Setup mirroring (but using the backup I took before introducing the corruption - so the mirror database will have a clean copy of the page I corrupted)
  • Performed a query on the table with the corrupt page to force the page checksum failure and kick-off automatic page repair

And it worked! Going back into the instance with the principal database and querying the DMV gives:

C:\>sqlcmd /E /S.\KATMAI01
1> SELECT * FROM sys.dm_db_mirroring_auto_page_repair;
2> GO
database_id file_id     page_id              error_type page_status modification_time
----------- ----------- -------------------- ---------- ----------- -----------------------
          6           4                 4256         -1           5 2007-09-27 17:23:20.067

(1 rows affected)
1>

The page I corrupted was (4:4256) and page_status of 5 means the repair succeeded. Running the query again confirms that the corruption has been fixed. The page was also logged in the suspect_pages table in msdb:

1> SELECT * FROM msdb..suspect_pages;
2> GO
database_id file_id     page_id              event_type  error_count last_update_date
----------- ----------- -------------------- ----------- ----------- -----------------------
          6           4                 4256           5           1 2007-09-27 17:23:20.407

(1 rows affected)
1>

In Books Online (both 2005 and 2008), event_type of 5 means that the page was repaired.

I checked the SQL Server error log for the principal and this is what I found (the bolding is mine):

2007-09-27 17:17:10.41 spid25s     Database mirroring is active with database 'TicketSalesDB' as the principal copy. This is an informational message only. No user action is required.
2007-09-27 17:23:19.92 spid51      Error: 824, Severity: 24, State: 2.
2007-09-27 17:23:19.92 spid51      SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x55684fbe; actual: 0x16e84fbe). It occurred during a read of page (4:4256) in database ID 6 at offset 0x00000002140000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.6\MSSQL\Data\TicketSalesFG2005Q1.NDF'.  Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2007-09-27 17:23:19.96 spid26s     Database mirroring is attempting to repair physical page (4:4256) in database "TicketSalesDB" by requesting a copy from the partner.
2007-09-27 17:23:20.42 spid26s     Database mirroring successfully repaired physical page (4:4256) in database "TicketSalesDB" by obtaining a copy from the partner.

I also checked the error log for the mirror and there's nothing relevant in there. I guess the same info would be output to the mirror database's error log if the mirror became corrupted. It's reasonably simple to check this using similar steps as above, but corrupting the database before the initial backup is taken, fixing the corruption again after the backup and before mirroring starts (so the corruption is on the mirror but not the principal), and then triggering an update on the corrupt page. When the update is replayed on the mirror, the corruption will be hit and the page repaired in the same way. I'll check later and blog if there's any difference.

Summary

SQL Server 2008 introduces an enhancement to database mirroring that can bi-directionally pull pages between the principal and mirror databases to fix page corruptions. One word of caution I'd give is that this feature doesn't mean you can ignore these errors when they occur - you still need to do root-cause analysis on the corruption and take steps to prevent them happening again before a corruption occurs that automatic page repair cannot fix (as I mentioned above).

Nevertheless, this is a tremendously useful feature that's going to save a lot of downtime. Cool!

Thursday, September 27, 2007 2:53:30 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Wednesday, September 12, 2007

One of the comments I received recently is below:

Hi Paul,

If the corruption happens to be related to I/O Erros and there is nothing in the Event log or anywhere that points to I/O related issues, is there any Trace flag that we can enable when performing checkdb or checktable operations that can show us any information related to I/O Problems, Driver issues etc?. we have table corruptions happening on a regular basis but I need some kind of evidence to show to the SAN guys thats its a disk issue and not necessarily SQL Server. Any ideas or suggestions?

Thanks

Meher

This leads nicely into a blog post/repost about how to tell if your IO subsystem is causing corruptions. You've got recurring corruption - you blame the hardware and the hardware guys blame the software. There's no smoking gun and the hardware diagnostics come back clean. What can you do?

SQLIOSim

This is the hardware diagnostic you really want to run. It simulates a very heavy SQL Server workload and should do a far better job of discovering flaws in your hardware setup than individual hardware vendors' diagnostics will. I always recommend that you run it before installing a system, as well as using it to expose hardware as the problem in difficult-to-diagnose corruption problems.

You can find info on it at http://support.microsoft.com/default.aspx?scid=kb;en-us;231619 - this has been heavily updated since I originally publicized it last year. There are also some great resources on how to interpret the results - Kevin Kline wrote a blog post pulling them all together - check it out here.

One thing I like to say at conferences when I'm discussing SQLIOSim is that you're not just testing the hardware. You're also testing all the software in between the disk and SQL Server (the OS, 3rd party drivers, RAID controller firmware, disk drive firmware,...)

Page Checksums

This is a cool new feature of 2005. Once page checksums are enabled (at the database level), whenever a database page is written out of SQL Server's buffer pool, a checksum is calculated over the page's contents and stamped on the page. This is the very last operation performed on the page before it leaves SQL Server's control. When a page is read into SQL Server's buffer pool, if it has a page checksum on it then the checksum is recalculated and verified. If the re-calculated checksum doesn't match the one stamped on the page, something in the IO subsystem MUST have changed the page (i.e. in between SQL Server writing and subsequently reading the page, something underneath SQL Server in the IO stack corrupted the page).

Here are some points to note about page checksums (they debunk a bunch of common misconceptions):

  • Databases that are created on SQL Server 2005 automatically have page checksums turned on
  • Page checksums are a super-set of torn-page detection. Page checksums will also detect torn pages.
  • You cannot enable page checksums and torn-page detection at the same time.
  • Upgrading a database to SQL Server 2005 and turning on the page checksum option does not automatically protect all the pages, as a page has to be changed and written to disk after the database option is enabled to have a checksum written on it. Only when a page has been through this process is it protected. There is no tool or automatic way to force all pages to go through this process - as I mentioned in the last post.
  • In benchmarking tests with a TPCH workload during SQL Server 2005 development, we measured approx 2% performance degradation as a result of having checksums enabled.
  • The checksum cannot be used for error correction. Generating an error-correcting checksum would be a more complicated algorithm and so would be slower to compute.
  • The checksum is validated when a page is read for checking by any of the DBCC CHECK* commands (regardless of whether the PHYSICAL_ONLY option was used) so all existing page checksums can be checked by issuing a DBCC CHECKDB command.
  •  Any existing checksums are checked when pages are read as part of taking a backup. In addition, the restore logic will also verify the page checksums as the pages are restored from the backup media, so there is solid assurance that the data from the backup is consistent. 
  •  Page checksums do not prevent in-memory corruptions from memory scribblers (where the page is read in, corrupted in memory from some rogue process, and then written out with a new checksum). In Enterprise Edition, there is a ‘checksum sniffer’ that runs constantly as part of the lazywriter process, randomly picking unchanged pages in the buffer pool and validating their checksums to see if the page has been scribbled on. There are now documented cases of this process finding memory corruptions on SQL Server 2005 installations.

Bad page checksums will result in IO errors being reported (as I mentioned in the previous post):

  •  IO failures will trigger read-retry logic, which will re-read the page several times to see if the error clears itself (if it does, a message is written to the SQL error log)
  • IO failures that persist through read-retry are logged in the error log and Windows event log, so monitoring these will allow you to be alerted to hardware problems quickly. 
  •  If an IO error is reported, the page is not taken offline in any way. Anything that subsequently touches the page will get another IO error. The exception to this is if the IO error is encountered during transaction rollback. In this case, the database is taken offline and must be brought back online manually.

Trace flags

There are two trace flags you can use to add some extra auditing - these are documented in the SQL Server 2005 version of the SQL Server IO whitepaper.

Trace flag 806 will cause 'DBCC-style' page auditing to be performed whenever a database page is read into the buffer pool. This is useful to catch cases where pages are being corrupted in memory and then written out to disk with a new page checksum. When they're read back in the checksum will look correct, but the page is corrupt (because of the previous memory corruption). This page auditing goes someway to catching this - especially on non-Enterprise Edition systems that don't have the 'checksum sniffer' I describe above.

Trace flag 3422 will cause auditing of transaction log records as they're read (during transaction rollback or log recovery). This is useful because there is no equivalent to page checksums for transaction log records and so no way to detect whether log records are being corrupted.

Be careful with these trace flags - I don't recommend using them unless your experiencing corruptions that you can't diagnose. Turning them on will cause a big CPU hit because of the extra auditing that's happening.

Summary

So, to answer the question in the comment, there are a few things you should do on SQL Server 2005 to help detect IO subsystem problems. Page checksums in particular have helped to vastly reduce the number of undiagnosed corruption problems (saving time and hassle for customers and Product Support).

 

Wednesday, September 12, 2007 3:09:31 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Tuesday, September 04, 2007

This was originally posted as two posts on the SQL Server Storage Engine site. It was very popular so I've combined the two posts together and added a bunch more commentary - especially on page checksums and IO errors. You may also notice some color differences between the scripts from earlier in the year and today's post - more keywords are recognized in Management Studio in SP2 than before (but still not CHECKDB though...)

It's almost inevitable that at some point every DBA will face dealing with corruption - so it's very important that you know how the server will behave when corruption happens. You also need to make sure that whatever logic you've created to catch corruptions (either through error log parsing, alerts, or Agent jobs - topic for a future post) will actually work.

To do all this you need a corrupt database to play with. Earlier this year I created two corrupt databases - one for 2000 (attached in broken2000.zip) and one for 2005 (attached in broken2005.zip). The two attached files can be restored by unzipping them and then using the following syntax (substituting the correct backup name):

RESTORE DATABASE broken FROM DISK='c:\sqlskills\brokenXXX.bck'

WITH MOVE 'broken' TO 'c:\sqlskills\broken.mdf',

MOVE 'broken_log' TO 'c:\sqlskills\broken_log.ldf';

GO

The databases have the same schema - a table called 'brokentable' (c1 int, c2 varchar(7000)) with one row in it. The table has a single data-page which I've corrupted differently in 2000 and 2005:

  • 2000: The corrupt page has page ID (1:75) and the page header is corrupt so that selecting from the table will result in a 605 error which will kill the connection.
  • 2005: The corrupt page has page ID (1:143) and the page header is corrupt such that the page checksum is bad.

A cautionary note on page checksums - if you upgrade a database from 2000 to 2005 and turn on page checksums, nothing happens! It's not until a database page is read into the buffer pool, changed in some way and then written back out to disk that it will have a page checksum stamped on it. This means that once you turn them on, you need to trigger a page checksum being written to each page in some way - e.g. rebuilding all indexes or forcing an in-place update of all table rows. Neither of these is very palatable and there's no tool to force page checksums in SQL Server 2005. Unfortunately there are no plans to include such a tool in SQL Server 2008 either.

Below I've listed a few things you can try out to see what would happen on your database if a checksum failed. These are all using the 2005 corrupt database.

Query errors

Any query that touches that page is going to fail with an 824 error. The IO errors in 2005 are different from 2000 - they've been split into 3:

  • 823 - a hard IO error. This is where SQL Server has asked the OS to read the page but it just can't.
  • 824 - a soft IO error. This is where the OS could read the page but SQL Server decided that the page was corrupt - for example with a page checksum failure
  • 825 - a read-retry error. This is where either an 823 or 824 occured, SQL server retried the IO automatically and it succeeded. This error is written to the errorlog only - you need to be aware of these as they're a sign of your IO subsystem going awry. There's no way to turn off read-retry and force SQL Server to 'fail-fast' - whether this behavior is a good or bad thing can be argued both ways - personally I don't like it.

SELECT * FROM broken..brokentable;

GO

Msg 824, Level 24, State 2, Line 1

SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x7232c940; actual: 0x720e4940). It occurred during a read of page (1:143) in database ID 8 at offset 0x0000000011e000 in file 'c:\sqlskills\broken.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

DBCC CHECKDB

CHECKDB throws us some nice errors. Note that it doesn't actually mention a page checksum failure. CHECKDB is the only thing in SQL Server that can 'eat' IO errors and convert them into non-fatal corruption errors. Note in the CHECKDB output below that the repair level needed to repair this error is 'repair_allow_data_loss' - this is because the repair for a page with any kind of IO error on it is to delete the page, fix-up all relevant linkages, and rebuild any referencing indexes.

DBCC CHECKDB ('broken') WITH NO_INFOMSGS, ALL_ERRORMSGS;

GO

Msg 8928, Level 16, State 1, Line 1

Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:143) could not be processed. See other errors for details.

Msg 8939, Level 16, State 98, Line 1

Table error: Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data), page (1:143). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 12716041 and -4.

CHECKDB found 0 allocation errors and 2 consistency errors in table 'brokentable' (object ID 2073058421).

CHECKDB found 0 allocation errors and 2 consistency errors in database 'broken'.

repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (broken).

Backup with CHECKSUM

If you have page checksums turned on, you should always use the WITH CHECKSUM option when taking backups. This will cause the page checksums to be checked as they're read into the backup. If a bad page checksum is found, the backup will stop and print a message identifying the bad page. Using the WITH CHECKSUM option on a backup will also generate a checksum over the entire backup stream and store it in the backup. This means we can detect a damaged backup by recalculating the checksum and comparing it against that stored in the backup - in much the same way that page checksums work.

BACKUP DATABASE broken TO DISK='c:\sqlskills\broken2.bck'

WITH CHECKSUM;

GO

Msg 3043, Level 16, State 1, Line 1

BACKUP 'broken' detected an error on page (1:143) in file 'c:\sqlskills\broken.mdf'.

Msg 3013, Level 16, State 1, Line 1

BACKUP DATABASE is terminating abnormally.

The backup has failed because of a bad page checksum. However, we can force it to backup. If this is the only copy of the database we have, and we're being forced to run repair to fix a corruption, for instance, then we want to make sure we have a backup to restore from in case something goes wrong with the repair. Even a backup that contains a corrupt database is better than no backup at all. In this case, we can use the CONTINUE_AFTER_ERROR option which will force the backup to continue when it finds a bad page.

BACKUP DATABASE broken TO DISK='c:\sqlskills\broken2.bck'

WITH CHECKSUM, CONTINUE_AFTER_ERROR;

GO

Processed 160 pages for database 'broken', file 'broken' on file 1.

Processed 1 pages for database 'broken', file 'broken_log' on file 1.

BACKUP WITH CONTINUE_AFTER_ERROR successfully generated a backup of the damaged database. Refer to the SQL Server error log for information about the errors that were encountered.

BACKUP DATABASE successfully processed 161 pages in 2.025 seconds (0.651 MB/sec).

RESTORE VERIFYONLY

You can check the high-level validity of any backup set using the RESTORE VERIFYONLY command.

RESTORE VERIFYONLY FROM DISK='c:\sqlskills\broken2005.bck';

GO

The backup set on file 1 is valid.

What about on the backup that we forced using CONTINUE_AFTER_ERROR?

RESTORE VERIFYONLY FROM DISK='c:\sqlskills\broken2.bck';

GO

The backup set was written with damaged data by a BACKUP WITH CONTINUE_AFTER_ERROR.

Isn't that cool? It tells us that the backup was already corrupt when it was written. Ok - let's ask it to specifically check the checksums in the backup. This will look through all the pages in the backup that have page checksums, check them, and recalculate the backup stream checksum.

RESTORE VERIFYONLY FROM DISK='c:\sqlskills\broken2.bck'

WITH CHECKSUM;

GO

The backup set was written with damaged data by a BACKUP WITH CONTINUE_AFTER_ERROR.

In our case, we get the same as above because the database was known to be corrupt when the backup was taken, so none of the checksum checking is done. What about if we try to check the checksums on the initial backup?

RESTORE VERIFYONLY FROM DISK='c:\sqlskills\broken2005.bck'

WITH CHECKSUM;

GO

Msg 3187, Level 16, State 1, Line 1

RESTORE WITH CHECKSUM cannot be specified because the backup set does not contain checksum information.

Msg 3013, Level 16, State 1, Line 1

VERIFY DATABASE is terminating abnormally.

We can't do that as the backup wasn't taken using the WITH CHECKSUM option in the first place, even though some of the database pages may have page checksums on them.

RESTORE

How about we try to overwrite the existing 'broken' database with the one from the second backup we took?

RESTORE DATABASE broken FROM DISK='c:\sqlskills\broken2.bck'

WITH REPLACE;

GO

Msg 3183, Level 16, State 1, Line 1

RESTORE detected an error on page (1:143) in database "broken" as read from the backup set.

Msg 3013, Level 16, State 1, Line 1

RESTORE DATABASE is terminating abnormally.

It won't let us because the backup contains corrupt data (and it knows that because we forced the backup to complete using the CONTINUE_AFTER_ERROR option). However, there may be cases where you've lost your database and all you have is a corrupt backup. In this case it may be better to restore what data you do have rather than lose everything. You can do it using the CONTINUE_AFTER_ERROR option on the RESTORE command this time:

RESTORE DATABASE broken FROM DISK='c:\sqlskills\broken2.bck'

WITH REPLACE, CONTINUE_AFTER_ERROR;

GO

Processed 160 pages for database 'broken', file 'broken' on file 1.

Processed 1 pages for database 'broken', file 'broken_log' on file 1.

The backup set was written with damaged data by a BACKUP WITH CONTINUE_AFTER_ERROR.

RESTORE WITH CONTINUE_AFTER_ERROR was successful but some damage was encountered. Inconsistencies in the database are possible.

RESTORE DATABASE successfully processed 161 pages in 0.392 seconds (3.364 MB/sec).

Isn't that cool? It works BUT it tells us that the backup set contained corrupt data and that the database was restored but could have corrupt data in.

Summary

Have a play about with these databases to familiarize yourself with the kind of responses you'll get from the various tools when a corruption exists, and how to work around it if need be.

Let me know if you want to see any particular kinds of corruptions explored, or want a database with something specific corrupted in.

broken2000.zip (41 KB)broken2005.zip (149.9 KB)
Tuesday, September 04, 2007 12:31:42 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 

Theme design by Jelle Druyts

Pick a theme: