This is a true story, and unfolded over the last few days. It’s deliberately written this way, I’m not trying to be patronizing – just illustrating the mistakes people can make if they don’t know what not to do.
Once upon a time (well, a few days ago), there was a security person who had access to a SQL Server 2000 instance. He saw that the server had run out of disk space, but had no SQL server knowledge. He decided that the best thing to do was to delete the log file, instead of contact the real DBA to take corrective action. This person should not have had any access to SQL Server, or the ability to delete the log. Nevertheless, the log was deleted. And then the database became suspect. Eventually, with no other option, someone decided to create a new transaction log file using DBCC REBUILD_LOG. This was the right thing to do. The database wasn’t being backed up so restore wasn’t an option. Some of the details of who did what are unclear, as it so often the case. Needless to say, the real DBA was kept in the dark.
Unfortunately, whoever rebuilt the log didn’t run DBCC CHECKDB afterwards to find out what corruption had been caused in the database by having the transaction log unceremoniously ripped out from under the database’s feet. Rebuilding a transaction log is one of the worst things that can happen to a database, as all active transactions are thrown away, with no chance of rolling back. This leaves the database in a transactionally inconsistent, and potentially structurally inconsistent state (see Corruption: Last resorts that people try first… filed under my Bad Advice category, and Q4 from TechNet Magazine: February 2009 SQL Q&A column: Is it ever safe to rebuild a transaction log?).
Instead they carried on running. The real DBA thought everything was cool and even upgraded the instance from SQL 2000 to SQL 2005. To be honest, after seeing the 19000+ corruptions they had in the database, I’m surprised that the database upgraded successfully. But it did. Then he tried to rebuild some indexes, which failed with corruption errors. Which brought us to this morning, when DBCC CHECKDB was run. It reported a lot of corruption, but running from SSMS only showed the first 1000 errors (which was the inspiration for one of my previous posts today How to get all the corruption messages from CHECKDB. So they started to run REPAIR_ALLOW_DATA_LOSS, which will cause data loss – we chose the name of the repair option well. I got involved at this point, and saw from the list of errors that massive data loss would be the result.
Once they had re-run DBCC CHECKDB through the command line (luckily only 20 minutes) I could see all the errors. And I must say I was astounded. It was the *worst* case of multiply-allocated extents (see Inside the Storage Engine: Anatomy of an extent) that I’ve ever seen. Not only were the extents allocated by multiple IAM pages (see Inside the Storage Engine: IAM pages, IAM chains, and allocation units) – in other words, two objects thought they had the same pages in the database allocated, but the two objects had actually each proceeded to overwrite the other’s updates in the pages – getting the two clustered indexes hopelessly interlinked. Lots of errors (1,000s) like:
Msg 8904, Level 16, State 1, Server NAMESCHANGEDTOPROTECTTHEINNOCENT, Line 1
Extent (1:9528) in database ID 5 is allocated by more than one allocation object.
Msg 8978, Level 16, State 1, Server NAMESCHANGEDTOPROTECTTHEINNOCENT, Line 1
Table error: Object ID 1445580188, index ID 1, partition ID 376212519911424, alloc unit ID 94737543200768 (type In-row data). Page (1:52696) is missing a reference from previous page (1:427112). Possible chain linkage problem.
Msg 8977, Level 16, State 1, Server NAMESCHANGEDTOPROTECTTHEINNOCENT, Line 1
Table error: Object ID 1445580188, index ID 1, partition ID 376212519911424, alloc unit ID 94737543200768 (type In-row data). Parent node for page (1:143210) was not encountered.
Msg 2533, Level 16, State 1, Server NAMESCHANGEDTOPROTECTTHEINNOCENT, Line 1
Table error: page (1:405139) allocated to object ID 1445580188, index ID 1, partition ID 376212519911424, alloc unit ID 94737543200768 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
CHECKDB found 1653 allocation errors and 17646 consistency errors in database 'NAMESCHANGEDTOPROTECTTHEINNOCENT'.
The repair for all of this? Delete it all and fix up the links. MASSIVE data loss.
Lessons to learn from this:
- Don’t give people with no clue about SQL Server access to SQL Server.
- Don’t delete a transaction log to reclaim space. Cardinal sin.
- Don’t rebuild a transaction log UNLESS you run a full DBCC CHECKDB afterwards and satisfy yourself that you know the extent of the damage.
- Don’t upgrade a database without running DBCC CHECKDB first. Best case – the upgrade fails. Worst case – it upgrades and then you might not be able to fix the corruptions.
- Don’t just run repair without doing a little analysis on the errors reported, especially with a *huge* number of errors reported.
Finally, I’m really grateful to the DBA in question for letting me help him out with this, and to post this blog post – we all learn from our own and others’ mistakes.
PS And I got involved in this from Twitter – I just *love* it. After initially being skeptical of how much time I’d spend on it, I’m finding the benefits of connecting to the SQL community in ‘real-time’ vastly outweigh the time I’m putting into it. Follow me on http://twitter.com/paulrandal and you’ll see a bunch of other SQL MVPs on there too.