This was originally posted as two posts on the SQL Server Storage Engine site. It was very popular so I’ve combined the two posts together and added a bunch more commentary – especially on page checksums and IO errors. You may also notice some color differences between the scripts from earlier in the year and today’s post – more keywords are recognized in Management Studio in SP2 than before (but still not CHECKDB though…) It’s almost inevitable that at some point every DBA will face dealing with corruption – so it’s very important that you know how the server will behave when corruption happens. You also need to make sure that whatever logic you’ve created to catch corruptions (either through error log parsing, alerts, or Agent jobs – topic for a future post) will actually work. To do all this you need a corrupt database to play with. Earlier this year I created two corrupt databases – one for 2000 (attached in broken2000.zip) and one for 2005 (attached in broken2005.zip). The two attached files can be restored by unzipping them and then using the following syntax (substituting the correct backup name): RESTORE DATABASE broken FROM DISK=‘c:\sqlskills\brokenXXX.bck’ WITH
MOVE
‘broken_log’ TO ‘c:\sqlskills\broken_log.ldf’;GO
The databases have the same schema – a table called ‘brokentable’ (c1 int, c2 varchar(7000)) with one row in it. The table has a single data-page which I’ve corrupted differently in 2000 and 2005:
2000: The corrupt page has page ID (1:75) and the page header is corrupt so that selecting from the table will result in a 605 error which will kill the connection.
2005: The corrupt page has page ID (1:143) and the page header is corrupt such that the page checksum is bad.
A cautionary note on page checksums – if you upgrade a database from 2000 to 2005 and turn on page checksums, nothing happens! It’s not until a database page is read into the buffer pool, changed in some way and then written back out to disk that it will have a page checksum stamped on it. This means that once you turn them on, you need to trigger a page checksum being written to each page in some way – e.g. rebuilding all indexes or forcing an in-place update of all table rows. Neither of these is very palatable and there’s no tool to force page checksums in SQL Server 2005. Unfortunately there are no plans to include such a tool in SQL Server 2008 either.
Below I’ve listed a few things you can try out to see what would happen on your database if a checksum failed. These are all using the 2005 corrupt database.
Query errors
Any query that touches that page is going to fail with an 824 error. The IO errors in 2005 are different from 2000 – they’ve been split into 3:
823 – a hard IO error. This is where SQL Server has asked the OS to read the page but it just can’t.
824 – a soft IO error. This is where the OS could read the page but SQL Server decided that the page was corrupt – for example with a page checksum failure
825 – a read-retry error. This is where either an 823 or 824 occured, SQL server retried the IO automatically and it succeeded. This error is written to the errorlog only – you need to be aware of these as they’re a sign of your IO subsystem going awry. There’s no way to turn off read-retry and force SQL Server to ‘fail-fast’ – whether this behavior is a good or bad thing can be argued both ways – personally I don’t like it.
* FROM broken..brokentable;
SELECT
GO
Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x7232c940; actual: 0x720e4940). It occurred during a read of page (1:143) in database ID 8 at offset 0x0000000011e000 in file ‘c:\sqlskills\broken.mdf’. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
DBCC CHECKDB
CHECKDB throws us some nice errors. Note that it doesn’t actually mention a page checksum failure. CHECKDB is the only thing in SQL Server that can ‘eat’ IO errors and convert them into non-fatal corruption errors. Note in the CHECKDB output below that the repair level needed to repair this error is ‘repair_allow_data_loss’ – this is because the repair for a page with any kind of IO error on it is to delete the page, fix-up all relevant linkages, and rebuild any referencing indexes.
CHECKDB (‘broken’) WITH NO_INFOMSGS, ALL_ERRORMSGS;
DBCC
GO
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:143) could not be processed. See other errors for details.
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data), page (1:143). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 12716041 and -4.
CHECKDB found 0 allocation errors and 2 consistency errors in table ‘brokentable’ (object ID 2073058421).
CHECKDB found 0 allocation errors and 2 consistency errors in database ‘broken’.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (broken).
Backup with CHECKSUM
If you have page checksums turned on, you should always use the WITH CHECKSUM option when taking backups. This will cause the page checksums to be checked as they’re read into the backup. If a bad page checksum is found, the backup will stop and print a message identifying the bad page. Using the WITH CHECKSUM option on a backup will also generate a checksum over the entire backup stream and store it in the backup. This means we can detect a damaged backup by recalculating the checksum and comparing it against that stored in the backup – in much the same way that page checksums work.
DATABASE broken TO DISK=‘c:\sqlskills\broken2.bck’
BACKUP
WITH
CHECKSUM;GO
Msg 3043, Level 16, State 1, Line 1
BACKUP ‘broken’ detected an error on page (1:143) in file ‘c:\sqlskills\broken.mdf’.
Msg 3013, Level 16, State 1, Line 1
BACKUP DATABASE is terminating abnormally.
The backup has failed because of a bad page checksum. However, we can force it to backup. If this is the only copy of the database we have, and we’re being forced to run repair to fix a corruption, for instance, then we want to make sure we have a backup to restore from in case something goes wrong with the repair. Even a backup that contains a corrupt database is better than no backup at all. In this case, we can use the CONTINUE_AFTER_ERROR option which will force the backup to continue when it finds a bad page.
DATABASE broken TO DISK=‘c:\sqlskills\broken2.bck’
BACKUP
WITH
CHECKSUM, CONTINUE_AFTER_ERROR;GO
Processed 160 pages for database ‘broken’, file ‘broken’ on file 1.
Processed 1 pages for database ‘broken’, file ‘broken_log’ on file 1.
BACKUP WITH CONTINUE_AFTER_ERROR successfully generated a backup of the damaged database. Refer to the SQL Server error log for information about the errors that were encountered.
BACKUP DATABASE successfully processed 161 pages in 2.025 seconds (0.651 MB/sec).
RESTORE VERIFYONLY
You can check the high-level validity of any backup set using the RESTORE VERIFYONLY command.
RESTORE VERIFYONLY FROM DISK=‘c:\sqlskills\broken2005.bck’;
GO
The backup set on file 1 is valid.
What about on the backup that we forced using CONTINUE_AFTER_ERROR?
VERIFYONLY FROM DISK=‘c:\sqlskills\broken2.bck’;
RESTORE
GO
The backup set was written with damaged data by a BACKUP WITH CONTINUE_AFTER_ERROR.
Isn’t that cool? It tells us that the backup was already corrupt when it was written. Ok – let’s ask it to specifically check the checksums in the backup. This will look through all the pages in the backup that have page checksums, check them, and recalculate the backup stream checksum.
VERIFYONLY FROM DISK=‘c:\sqlskills\broken2.bck’
RESTORE
WITH
CHECKSUM;GO
The backup set was written with damaged data by a BACKUP WITH CONTINUE_AFTER_ERROR.
In our case, we get the same as above because the database was known to be corrupt when the backup was taken, so none of the checksum checking is done. What about if we try to check the checksums on the initial backup?
VERIFYONLY FROM DISK=‘c:\sqlskills\broken2005.bck’
RESTORE
WITH
CHECKSUM;GO
Msg 3187, Level 16, State 1, Line 1
RESTORE WITH CHECKSUM cannot be specified because the backup set does not contain checksum information.
Msg 3013, Level 16, State 1, Line 1
VERIFY DATABASE is terminating abnormally.
We can’t do that as the backup wasn’t taken using the WITH CHECKSUM option in the first place, even though some of the database pages may have page checksums on them.
RESTORE
How about we try to overwrite the existing ‘broken’ database with the one from the second backup we took?
DATABASE broken FROM DISK=‘c:\sqlskills\broken2.bck’
RESTORE
WITH
REPLACE;GO
Msg 3183, Level 16, State 1, Line 1
RESTORE detected an error on page (1:143) in database “broken” as read from the backup set.
Msg 3013, Level 16, State 1, Line 1
RESTORE DATABASE is terminating abnormally.
It won’t let us because the backup contains corrupt data (and it knows that because we forced the backup to complete using the CONTINUE_AFTER_ERROR option). However, there may be cases where you’ve lost your database and all you have is a corrupt backup. In this case it may be better to restore what data you do have rather than lose everything. You can do it using the CONTINUE_AFTER_ERROR option on the RESTORE command this time:
DATABASE broken FROM DISK=‘c:\sqlskills\broken2.bck’
RESTORE
WITH
REPLACE, CONTINUE_AFTER_ERROR;GO
Processed 160 pages for database ‘broken’, file ‘broken’ on file 1.
Processed 1 pages for database ‘broken’, file ‘broken_log’ on file 1.
The backup set was written with damaged data by a BACKUP WITH CONTINUE_AFTER_ERROR.
RESTORE WITH CONTINUE_AFTER_ERROR was successful but some damage was encountered. Inconsistencies in the database are possible.
RESTORE DATABASE successfully processed 161 pages in 0.392 seconds (3.364 MB/sec).
Isn’t that cool? It works BUT it tells us that the backup set contained corrupt data and that the database was restored but could have corrupt data in.
Summary
Have a play about with these databases to familiarize yourself with the kind of responses you’ll get from the various tools when a corruption exists, and how to work around it if need be.
Let me know if you want to see any particular kinds of corruptions explored, or want a database with something specific corrupted in.
broken2000.zip (41 KB)broken2005.zip (149.9 KB)
12 Responses to Example 2000/2005 corrupt databases and some more info on backup, restore, page checksums and IO errors
Hi Paul,
Hope you are doing fine. Wonderful post though I am too late to pick this up, I am just now practicing it.
Anyways, I will get to the point. Just for practicing, I repaired the broken database with REPAIR_ALLOW_DATA_LOSS option as that is the minimal option. Before doing this put it on the SINGLE USER mode which I done and it repaired successfully and the database was still in SINGLE user mode.
Now, to practice more I have actually restored the database from the corrupted broken database but what happens is the database goes back to multiuser mode automatically.
I have tested the same repaired database while still in single user mode, restored it from a clean backup and the database still in single user mode.
My question:
Why does the database change to multiuser with a corrupted backup?
Thanks for your time :)
Hey Krisha – a database will always be restored to the state it was when backed up. Thanks
Hey Paul,
Do you have a recommended method for triggering a page checksum to be written to each page? I’m looking at enabling the CHECKSUM PAGE_VERIFY option in my databases and was wondering if either of your recommendations above was more preferred?
Thanks
John
I’d say rebuilding indexes would be the most effective as you can combine it with your regular index maintenance. I’m still working on a way to do this without having to take such (possible drastic) measures.
[...] (if it exists). You can think of this as 'recovery with CONTINUE_AFTER_ERROR' – see this post for more details on the real CONTINUE_AFTER_ERROR option for BACKUP and RESTORE. The idea behind [...]
[...] a corrupt database. The best way to do this is to use an already corrupt database – see my recent post that provides a corrupt 2000 and 2005 database as well as some things to try with them. One of the [...]
Not working for me, when restore with these options: WITH REPLACE,CONTINUE_AFTER_ERROR,
I get this.
Msg 3242, Level 16, State 2, Line 1
The file on device ‘C:\Data\Database\SQLBackup\PTS.bak’ is not a valid Microsoft Tape Format backup set.
Msg 3013, Level 16, State 1, Line 1
RESTORE DATABASE is terminating abnormally.
What isn’t working for you?
[...] Bad page checksums will result in IO errors being reported (as I mentioned in the previous post): [...]
[...] If the page is part of a table with a cluster index you can rebuild the index in case it belongs to a heap then you can force an in place update, for more info check this link. [...]
Hi Paul
First of all thanks for motivating us for R&D , here is my question.
I was just trying different scheneriosn in corruption. I have created one database (5120kb datafile). just afer creating database i stopped sql server and open datafile in editplus. i delete few lines from lower part of datafile and followe these steps : MY Database is coming back to suspected
If Datafile is courrupted
———————————–>
select * from sys.databases — SUSPECT
—————————–>
1. EXEC sp_resetstatus ‘sus’
—-> Warning: You must recover this database prior to access.
2. dbcc checkdb (sus)
—-> Check statement aborted. Database contains deferred transactions.
—-> Datbase is still in suspect mode
3. ALTER DATABASE sus SET EMERGENCY
—-> Datbase is in emergency mode
4. ALTER DATABASE
sus SET SINGLE_USER
WITH ROLLBACK IMMEDIATE
—->
5. DBCC CheckDB (‘sus’, REPAIR_ALLOW_DATA_LOSS)
Msg 5028, Level 16, State 4, Line 1
The system could not activate enough of the database to rebuild the log.
DBCC results for ‘SUS’.
CHECKDB found 0 allocation errors and 0 consistency errors in database ‘SUS’.
Msg 7909, Level 20, State 1, Line 1
The emergency-mode repair failed.You must restore from backup.
6.(Re-running same) DBCC CheckDB (‘sus’, REPAIR_ALLOW_DATA_LOSS)
—-> Database ‘SUS’ cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.
—-> Database goes in suspect mode again
6. ALTER DATABASE sus SET MULTI_USER
———>
Msg 926, Level 14, State 1, Line 1
Database ‘SUS’ cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.
Msg 5069, Level 16, State 1, Line 1
ALTER DATABASE statement failed.
Msg 5125, Level 24, State 2, Line 1
File ‘C:\SUS.mdf’ appears to have been truncated by the operating system. Expected size is 5120 KB but actual size is 5112 KB.
Msg 3414, Level 21, State 1, Line 1
An error occurred during recovery, preventing the database ‘SUS’ (database ID 12) from restarting. Diagnose the recovery errors and fix them, or restore from a known good backup. If errors are not corrected or expected, contact Technical Support.
Ok – so what’s your question? EMERGENCY-mode repair is documented as not being infallible – you corrupted the data file in a way that prevented it working.