This was originally posted as two posts on the SQL Server Storage Engine site. It was very popular so I’ve combined the two posts together and added a bunch more commentary – especially on page checksums and IO errors.
It’s almost inevitable that at some point every DBA will face dealing with corruption – so it’s very important that you know how the server will behave when corruption happens. You also need to make sure that whatever logic you’ve created to catch corruptions (either through error log parsing, alerts, or Agent jobs – topic for a future post) will actually work.
To do all this you need a corrupt database to play with. Earlier this year I created two corrupt databases – one for 2000 (attached in broken2000.zip) and one for 2005 (attached in broken2005.zip). The two attached files can be restored by unzipping them and then using the following syntax (substituting the correct backup name):
RESTORE DATABASE [broken] FROM DISK = N'C:\sqlskills\brokenXXX.bck' WITH MOVE N'broken' TO N'C:\sqlskills\broken.mdf', MOVE N'broken_log' TO N'C:\sqlskills\broken_log.ldf'; GO
The databases have the same schema – a table called brokentable with schema (c1 int, c2 varchar(7000)) with one row in it. The table has a single data-page which I’ve corrupted differently in 2000 and 2005:
- 2000: The corrupt page has page ID (1:75) and the page header is corrupt so that selecting from the table will result in a 605 error which will kill the connection.
- 2005: The corrupt page has page ID (1:143) and the page header is corrupt such that the page checksum is bad.
A cautionary note on page checksums – if you upgrade a database from 2000 to 2005 and turn on page checksums, nothing happens! It’s not until a database page is read into the buffer pool, changed in some way and then written back out to disk that it will have a page checksum stamped on it. This means that once you turn them on, you need to trigger a page checksum being written to each page in some way – e.g. rebuilding all indexes or forcing an in-place update of all table rows. Neither of these is very palatable and there’s no tool to force page checksums in SQL Server 2005. Unfortunately there are no plans to include such a tool in SQL Server 2008 either.
Below I’ve listed a few things you can try out to see what would happen on your database if a checksum failed. These are all using the 2005 corrupt database.
Any query that touches that page is going to fail with an 824 error. The IO errors in 2005 are different from 2000 – they’ve been split into 3:
- 823 – a hard IO error. This is where SQL Server has asked the OS to read the page but it just can’t.
- 824 – a soft IO error. This is where the OS could read the page but SQL Server decided that the page was corrupt – for example with a page checksum failure
- 825 – a read-retry error. This is where either an 823 or 824 occured, SQL server retried the IO automatically and it succeeded. This error is written to the errorlog only – you need to be aware of these as they’re a sign of your IO subsystem going awry. There’s no way to turn off read-retry and force SQL Server to ‘fail-fast’ – whether this behavior is a good or bad thing can be argued both ways – personally I don’t like it.
SELECT * FROM [broken]..[brokentable]; GO
Msg 824, Level 24, State 2, Line 1 SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x7232c940; actual: 0x720e4940). It occurred during a read of page (1:143) in database ID 8 at offset 0x0000000011e000 in file 'c:\sqlskills\broken.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
DBCC CHECKDB throws us some nice errors. Note that it doesn’t actually mention a page checksum failure. DBCC CHECKDB is the only thing in SQL Server that can ‘eat’ IO errors and convert them into non-fatal corruption errors. Note in the DBCC CHECKDB output below that the repair level needed to repair this error is ‘repair_allow_data_loss’ – this is because the repair for a page with any kind of IO error on it is to delete the page, fix-up all relevant linkages, and rebuild any referencing indexes.
DBCC CHECKDB (N'broken') WITH NO_INFOMSGS, ALL_ERRORMSGS; GO
Msg 8928, Level 16, State 1, Line 1 Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:143) could not be processed. See other errors for details. Msg 8939, Level 16, State 98, Line 1 Table error: Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data), page (1:143). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 12716041 and -4. CHECKDB found 0 allocation errors and 2 consistency errors in table 'brokentable' (object ID 2073058421). CHECKDB found 0 allocation errors and 2 consistency errors in database 'broken'. repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (broken).
Backup with CHECKSUM
If you have page checksums turned on, you should always use the WITH CHECKSUM option when taking backups. This will cause the page checksums to be checked as they’re read into the backup. If a bad page checksum is found, the backup will stop and print a message identifying the bad page. Using the WITH CHECKSUM option on a backup will also generate a checksum over the entire backup stream and store it in the backup. This means we can detect a damaged backup by recalculating the checksum and comparing it against that stored in the backup – in much the same way that page checksums work.
BACKUP DATABASE [broken] TO DISK = N'c:\sqlskills\broken2.bck' WITH CHECKSUM; GO
Msg 3043, Level 16, State 1, Line 1 BACKUP 'broken' detected an error on page (1:143) in file 'c:\sqlskills\broken.mdf'. Msg 3013, Level 16, State 1, Line 1 BACKUP DATABASE is terminating abnormally.
The backup has failed because of a bad page checksum. However, we can force it to backup. If this is the only copy of the database we have, and we’re being forced to run repair to fix a corruption, for instance, then we want to make sure we have a backup to restore from in case something goes wrong with the repair. Even a backup that contains a corrupt database is better than no backup at all. In this case, we can use the CONTINUE_AFTER_ERROR option which will force the backup to continue when it finds a bad page.
BACKUP DATABASE [broken] TO DISK = N'c:\sqlskills\broken2.bck' WITH CHECKSUM, CONTINUE_AFTER_ERROR; GO
Processed 160 pages for database 'broken', file 'broken' on file 1. Processed 1 pages for database 'broken', file 'broken_log' on file 1. BACKUP WITH CONTINUE_AFTER_ERROR successfully generated a backup of the damaged database. Refer to the SQL Server error log for information about the errors that were encountered. BACKUP DATABASE successfully processed 161 pages in 2.025 seconds (0.651 MB/sec).
You can check the high-level validity of any backup set using the RESTORE VERIFYONLY command.
RESTORE VERIFYONLY FROM DISK = N'c:\sqlskills\broken2005.bck'; GO
The backup set on file 1 is valid.
What about on the backup that we forced using CONTINUE_AFTER_ERROR?
RESTORE VERIFYONLY FROM DISK = N'c:\sqlskills\broken2.bck'; GO
The backup set was written with damaged data by a BACKUP WITH CONTINUE_AFTER_ERROR.
Isn’t that cool? It tells us that the backup was already corrupt when it was written. Ok – let’s ask it to specifically check the checksums in the backup. This will look through all the pages in the backup that have page checksums, check them, and recalculate the backup stream checksum.
RESTORE VERIFYONLY FROM DISK = N'c:\sqlskills\broken2.bck' WITH CHECKSUM; GO
The backup set was written with damaged data by a BACKUP WITH CONTINUE_AFTER_ERROR.
In our case, we get the same as above because the database was known to be corrupt when the backup was taken, so none of the checksum checking is done. What about if we try to check the checksums on the initial backup?
RESTORE VERIFYONLY FROM DISK = N'c:\sqlskills\broken2005.bck' WITH CHECKSUM; GO
Msg 3187, Level 16, State 1, Line 1 RESTORE WITH CHECKSUM cannot be specified because the backup set does not contain checksum information. Msg 3013, Level 16, State 1, Line 1 VERIFY DATABASE is terminating abnormally.
We can’t do that as the backup wasn’t taken using the WITH CHECKSUM option in the first place, even though some of the database pages may have page checksums on them.
How about we try to overwrite the existing ‘broken’ database with the one from the second backup we took?
RESTORE DATABASE [broken] FROM DISK = N'c:\sqlskills\broken2.bck' WITH REPLACE; GO
Msg 3183, Level 16, State 1, Line 1 RESTORE detected an error on page (1:143) in database "broken" as read from the backup set. Msg 3013, Level 16, State 1, Line 1 RESTORE DATABASE is terminating abnormally.
It won’t let us because the backup contains corrupt data (and it knows that because we forced the backup to complete using the CONTINUE_AFTER_ERROR option). However, there may be cases where you’ve lost your database and all you have is a corrupt backup. In this case it may be better to restore what data you do have rather than lose everything. You can do it using the CONTINUE_AFTER_ERROR option on the RESTORE command this time:
RESTORE DATABASE [broken] FROM DISK = N'c:\sqlskills\broken2.bck' WITH REPLACE, CONTINUE_AFTER_ERROR; GO
Processed 160 pages for database 'broken', file 'broken' on file 1. Processed 1 pages for database 'broken', file 'broken_log' on file 1. The backup set was written with damaged data by a BACKUP WITH CONTINUE_AFTER_ERROR. RESTORE WITH CONTINUE_AFTER_ERROR was successful but some damage was encountered. Inconsistencies in the database are possible. RESTORE DATABASE successfully processed 161 pages in 0.392 seconds (3.364 MB/sec).
Isn’t that cool? It works BUT it tells us that the backup set contained corrupt data and that the database was restored but could have corrupt data in.
Have a play about with these databases to familiarize yourself with the kind of responses you’ll get from the various tools when a corruption exists, and how to work around it if need be.
Let me know if you want to see any particular kinds of corruptions explored, or want a database with something specific corrupted in.
16 thoughts on “Example 2000/2005 corrupt databases and some more info on backup, restore, page checksums and IO errors”
Hope you are doing fine. Wonderful post though I am too late to pick this up, I am just now practicing it.
Anyways, I will get to the point. Just for practicing, I repaired the broken database with REPAIR_ALLOW_DATA_LOSS option as that is the minimal option. Before doing this put it on the SINGLE USER mode which I done and it repaired successfully and the database was still in SINGLE user mode.
Now, to practice more I have actually restored the database from the corrupted broken database but what happens is the database goes back to multiuser mode automatically.
I have tested the same repaired database while still in single user mode, restored it from a clean backup and the database still in single user mode.
Why does the database change to multiuser with a corrupted backup?
Thanks for your time :)
Hey Krisha – a database will always be restored to the state it was when backed up. Thanks
Do you have a recommended method for triggering a page checksum to be written to each page? I’m looking at enabling the CHECKSUM PAGE_VERIFY option in my databases and was wondering if either of your recommendations above was more preferred?
I’d say rebuilding indexes would be the most effective as you can combine it with your regular index maintenance. I’m still working on a way to do this without having to take such (possible drastic) measures.
Not working for me, when restore with these options: WITH REPLACE,CONTINUE_AFTER_ERROR,
I get this.
Msg 3242, Level 16, State 2, Line 1
The file on device ‘C:DataDatabaseSQLBackupPTS.bak’ is not a valid Microsoft Tape Format backup set.
Msg 3013, Level 16, State 1, Line 1
RESTORE DATABASE is terminating abnormally.
What isn’t working for you?
First of all thanks for motivating us for R&D , here is my question.
I was just trying different scheneriosn in corruption. I have created one database (5120kb datafile). just afer creating database i stopped sql server and open datafile in editplus. i delete few lines from lower part of datafile and followe these steps : MY Database is coming back to suspected
If Datafile is courrupted
select * from sys.databases — SUSPECT
1. EXEC sp_resetstatus ‘sus’
—-> Warning: You must recover this database prior to access.
2. dbcc checkdb (sus)
—-> Check statement aborted. Database contains deferred transactions.
—-> Datbase is still in suspect mode
3. ALTER DATABASE sus SET EMERGENCY
—-> Datbase is in emergency mode
4. ALTER DATABASE
sus SET SINGLE_USER
WITH ROLLBACK IMMEDIATE
5. DBCC CheckDB (‘sus’, REPAIR_ALLOW_DATA_LOSS)
Msg 5028, Level 16, State 4, Line 1
The system could not activate enough of the database to rebuild the log.
DBCC results for ‘SUS’.
CHECKDB found 0 allocation errors and 0 consistency errors in database ‘SUS’.
Msg 7909, Level 20, State 1, Line 1
The emergency-mode repair failed.You must restore from backup.
6.(Re-running same) DBCC CheckDB (‘sus’, REPAIR_ALLOW_DATA_LOSS)
—-> Database ‘SUS’ cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.
—-> Database goes in suspect mode again
6. ALTER DATABASE sus SET MULTI_USER
Msg 926, Level 14, State 1, Line 1
Database ‘SUS’ cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.
Msg 5069, Level 16, State 1, Line 1
ALTER DATABASE statement failed.
Msg 5125, Level 24, State 2, Line 1
File ‘C:\SUS.mdf’ appears to have been truncated by the operating system. Expected size is 5120 KB but actual size is 5112 KB.
Msg 3414, Level 21, State 1, Line 1
An error occurred during recovery, preventing the database ‘SUS’ (database ID 12) from restarting. Diagnose the recovery errors and fix them, or restore from a known good backup. If errors are not corrected or expected, contact Technical Support.
Ok – so what’s your question? EMERGENCY-mode repair is documented as not being infallible – you corrupted the data file in a way that prevented it working.
If we change page_verify to None, can we read page failed with error 824.
No – as the buffer pool won’t test for page checksum or torn page failures.
I have set up alerts to pick up any 3043 error thrown by the backup jobs (which run WITH CHECKSUM). Early today we received one from a transaction log backup and a few minutes ago ran DBCC CHECKDB WITH NO_INFOMSGS on the database that had complained. It did not return any errors.
Could you explain what might have happened with the bad checksum that triggered the error?
Thanks in advance,
It’s likely that whatever corruption existed at the time the backup failed was no longer part of the database when you ran the DBCC CHECKDB. A page might have been deallocated, so DBCC CHECKDB wouldn’t have any cause to read it.
inin my case, I have a 580 GB .Bak file (.mdf, .ndf, ldf) corrupted.
do you know of a way to extract ndf only?
No, you can’t restore an NDF on it’s own from a backup without restoring the entire primary filegroup first.
Respected Paul sir,
Please I faced the issue where drive got full because of long running transaction in simple recovery model in sql 2012 and dB went into Recovery.(how the dB changed its state itself means it went offline and then online )
Please wnat to know when drive space was not there and it needed to rollback, it should have gone into recovery pending state , while it went into Recovery state.
What would have happened if the sufficient space wouldn’t be there while it started Recovery, it was waste of resources.
The log manager reserves enough space in the log to always be able to rollback transactions if the log runs out of space.