These next few posts are based off part of my Secrets of Fast Detection and Recovery from Database Corruptions session from TechEd and various user groups around the world (Edit: see here for a video recording from TechEd 2008).
In this post I want to describe the two worst things I think you can do to your database – rebuilding a transaction log and running REPAIR_ALLOW_DATA_LOSS – that people often try doing instead of restoring from their backups.
Rebuilding a transaction log
It’s pretty well known that in SQL Server 2000 (and before) there’s an undocumented and unsupported command called DBCC REBUILD_LOG. It deletes the transaction log file(s) and creates a new one. It completely disregards any uncommitted transactions that may exist – it just deletes them. This means that these uncommitted transactions don’t get a chance to roll back.
What does this mean? Well, in the best case, the only in-flight transactions were altering user data – so your business logic, inherent and constraint-enforced relationships between tables, and the basic logical integrity of your user data are all broken. In the worst case, the in-flight transactions were altering the structure of the database (e.g. doing a page split) so that fact that they didn’t get a chance to rollback means the database may be structurally corrupt!
Here are two examples (somewhat contrived) that illustrate the possible consequences of rebuilding a transaction log.
1: Logical data loss
Imagine you’re at an ATM transferring $1000 from your checking account to your savings account. On the SQL Server in the bank’s data-center, the transaction happens in two parts – update the record in the checking accounts table with the balance minus $1000 and then update the record in the savings account table with the balance plus $1000. The process is half-way through – $1000 has been debited from your checking account, but not yet credited to your savings account, when disaster strikes! A work crew outside the data-center accidentally cuts the power and the machine hosting SQL Server powers down. The ATM displays one of those infuriating “We’re sorry, our computers are unavailable at present” and you walk away grumbling but think nothing more about it.
Meanwhile, the power’s been restored to the data-center and SQL Server is going through crash recovery. The partially-completed transaction on your account should rollback and credit back the $1000 to your checking account. But the new DBA at the bank decides that its taking too long for the system to come back up. He deletes the transaction log and rebuilds it to get the system up faster.
Unfortunately, the portion of the transaction log that had not had a chance to recover included the transaction involving your bank account. Even more unfortunately, a checkpoint occurred right before the power loss, and the database page containing the updated checking account balance was flushed to disk. When the transaction log is deleted and rebuilt, your transaction can’t rollback – because it’s simply gone. So the $1000 debit from your checking account is not rolled back – you’ve lost $1000!!
2: Physical database corruption.
Imagine an insert into a table with a single non-clustered index. Under the covers, the insert happens in two parts – insert the record into the table and then insert the corresponding non-clustered index record. Imagine a similar disaster-recovery situation as I described above occurring after the table insert has been done but not the non-clustered index insert. If the log is rebuilt, then the table and its index are not in sync – actual physical corruption!
Whenever a transaction log is rebuilt, a message is output to the SQL Server error log and the Windows event log. In SQL Server 2005 onwards, the message is:
Warning: The log for database 'test' has been rebuilt. Transactional consistency has been lost. The RESTORE chain was broken, and the server no longer has context on the previous log files, so you will need to know what they were. You should run DBCC CHECKDB to validate physical consistency. The database has been put in dbo-only mode. When you are ready to make the database available for use, you will need to reset database options and delete any extra log files.
In SQL Server 2005, I removed the old DBCC REBUILD_LOG command and it has been replaced with undocumented (and still unsupported) syntax. However, there is now a fully documented and supported way to do this using EMERGENCY-mode repair.
Sometimes on SQL Server 2000 there’s no alternative to rebuilding a transaction log – when the log is physically damaged and there’s no backup. Unfortunately, all too often I see people simply rebuilding the transaction log and continuing with regular operations – no checks, no repairs, no root-cause analysis. Nowadays you can use EMERGENCY-mode repair, but you still need to do root-cause analysis, and you’re still going to lose data and who-knows what else.
REPAIR_ALLOW_DATA_LOSS is the repair level that DBCC CHECKDB recommends when it finds corruptions. This is because fixing nearly anything that’s not a minor non-clustered index issue requires deleting something to repair it. So, REPAIR_ALLOW_DATA_LOSS will delete things. This means it will probably delete some of your data as well. If, for instance it finds a corrupt record on a data page, it may end up having to delete the entire data page, including all the other records on the page, to fix the corruption. That could be a lot of data. For this reason, the repair level name was carefully chosen. You can’t type in REPAIR_ALLOW_DATA_LOSS without realizing that you’re probably going to lose some data as part of the operation.
I’ve been asked why this is. The purpose of repair is not to save user data. The purpose of repair is to make the database structurally consistent as fast as possible (to limit downtime) and correctly (to avoid making things worse). This means that repairs have to be engineered to be fast and reliable operations that will work in every circumstance. The simple way to do this is to delete what’s broken and fix up everything that linked to (or was linked from) the thing being deleted – whether a record or page. Trying to do anything more complicated would increase the chances of the repair not working, or even making things worse.
The ramifications of this are that running REPAIR_ALLOW_DATA_LOSS can lead to the same effect on your data as rebuilding a transaction log with in-flight transactions altering user data – your business logic, inherent and constraint-enforced relationships between tables, and the basic logical integrity of your user data could all be broken. BUT, the database is now structurally consistent and SQL Server can run on it without fear of hitting a corruption that could cause a crash.
To continue the contrived example from above, imagine your bank checking and savings accounts just happen to be stored on the same data page in the bank’s SQL Server database. The new DBA doesn’t realize that backups are necessary for disaster recovery and data preservation and so isn’t taking any. Disaster strikes again in the form of the work crew outside the data-center accidentally cutting the power and the machine hosting SQL Server powers down. This time, one of the drives has a problem while powering down and a page write doesn’t complete – causing a torn page. Unfortunately, it’s the page holding your bank accounts. As the DBA doesn’t have any backups, the only alternative to fix the torn-page is to run REPAIR_ALLOW_DATA_LOSS. For this error, it will delete the page, and does so. In the process, everything else on the page is also lost, including your bank accounts!!
So you can see how these two operations are really very, very bad things to do to a database and can cause havoc with your data. And yet people still have to use these operations because they don’t have valid backups…
In the next post I’ll introduce EMERGENCY mode and how to use it.