This is a true story, and unfolded over the last few days. It’s deliberately written this way, I’m not trying to be patronizing – just illustrating the mistakes people can make if they don’t know what not to do.
Once upon a time (well, a few days ago), there was a security person who had access to a SQL Server 2000 instance. He saw that the server had run out of disk space, but had no SQL server knowledge. He decided that the best thing to do was to delete the log file, instead of contact the real DBA to take corrective action. This person should not have had any access to SQL Server, or the ability to delete the log. Nevertheless, the log was deleted. And then the database became suspect. Eventually, with no other option, someone decided to create a new transaction log file using DBCC REBUILD_LOG. This was the right thing to do. The database wasn’t being backed up so restore wasn’t an option. Some of the details of who did what are unclear, as it so often the case. Needless to say, the real DBA was kept in the dark.
Unfortunately, whoever rebuilt the log didn’t run DBCC CHECKDB afterwards to find out what corruption had been caused in the database by having the transaction log unceremoniously ripped out from under the database’s feet. Rebuilding a transaction log is one of the worst things that can happen to a database, as all active transactions are thrown away, with no chance of rolling back. This leaves the database in a transactionally inconsistent, and potentially structurally inconsistent state (see Corruption: Last resorts that people try first… filed under my Bad Advice category, and Q4 from TechNet Magazine: February 2009 SQL Q&A column: Is it ever safe to rebuild a transaction log?).
Instead they carried on running. The real DBA thought everything was cool and even upgraded the instance from SQL 2000 to SQL 2005. To be honest, after seeing the 19000+ corruptions they had in the database, I’m surprised that the database upgraded successfully. But it did. Then he tried to rebuild some indexes, which failed with corruption errors. Which brought us to this morning, when DBCC CHECKDB was run. It reported a lot of corruption, but running from SSMS only showed the first 1000 errors (which was the inspiration for one of my previous posts today How to get all the corruption messages from CHECKDB. So they started to run REPAIR_ALLOW_DATA_LOSS, which will cause data loss – we chose the name of the repair option well. I got involved at this point, and saw from the list of errors that massive data loss would be the result.
Once they had re-run DBCC CHECKDB through the command line (luckily only 20 minutes) I could see all the errors. And I must say I was astounded. It was the *worst* case of multiply-allocated extents (see Inside the Storage Engine: Anatomy of an extent) that I’ve ever seen. Not only were the extents allocated by multiple IAM pages (see Inside the Storage Engine: IAM pages, IAM chains, and allocation units) – in other words, two objects thought they had the same pages in the database allocated, but the two objects had actually each proceeded to overwrite the other’s updates in the pages – getting the two clustered indexes hopelessly interlinked. Lots of errors (1,000s) like:
Msg 8904, Level 16, State 1, Server NAMESCHANGEDTOPROTECTTHEINNOCENT, Line 1 Extent (1:9528) in database ID 5 is allocated by more than one allocation object. ... Msg 8978, Level 16, State 1, Server NAMESCHANGEDTOPROTECTTHEINNOCENT, Line 1 Table error: Object ID 1445580188, index ID 1, partition ID 376212519911424, alloc unit ID 94737543200768 (type In-row data). Page (1:52696) is missing a reference from previous page (1:427112). Possible chain linkage problem. ... Msg 8977, Level 16, State 1, Server NAMESCHANGEDTOPROTECTTHEINNOCENT, Line 1 Table error: Object ID 1445580188, index ID 1, partition ID 376212519911424, alloc unit ID 94737543200768 (type In-row data). Parent node for page (1:143210) was not encountered. ... Msg 2533, Level 16, State 1, Server NAMESCHANGEDTOPROTECTTHEINNOCENT, Line 1 Table error: page (1:405139) allocated to object ID 1445580188, index ID 1, partition ID 376212519911424, alloc unit ID 94737543200768 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header. ... CHECKDB found 1653 allocation errors and 17646 consistency errors in database 'NAMESCHANGEDTOPROTECTTHEINNOCENT'.
The repair for all of this? Delete it all and fix up the links. MASSIVE data loss.
Lessons to learn from this:
- Don’t give people with no clue about SQL Server access to SQL Server.
- Don’t delete a transaction log to reclaim space. Cardinal sin.
- Don’t rebuild a transaction log UNLESS you run a full DBCC CHECKDB afterwards and satisfy yourself that you know the extent of the damage.
- Don’t upgrade a database without running DBCC CHECKDB first. Best case – the upgrade fails. Worst case – it upgrades and then you might not be able to fix the corruptions.
- Don’t just run repair without doing a little analysis on the errors reported, especially with a *huge* number of errors reported.
Finally, I’m really grateful to the DBA in question for letting me help him out with this, and to post this blog post – we all learn from our own and others’ mistakes.
PS And I got involved in this from Twitter – I just *love* it. After initially being skeptical of how much time I’d spend on it, I’m finding the benefits of connecting to the SQL community in ‘real-time’ vastly outweigh the time I’m putting into it. Follow me on http://twitter.com/paulrandal and you’ll see a bunch of other SQL MVPs on there too.
8 thoughts on “A sad tale of mis-steps and corruption (from today)”
I’ve been burned by this so many times…
I’ve always cursed the person who decided to call a transaction log a LOG. For most people, for most applications, deleting the log is one of the safest things to do to reclaim space in a "system is down" scenario. Of course, people should back up before deleting anything in production, but faced with a panic situation, they often don’t.
I so wish trans logs were called "transaction details" or "never delete this file" or "this file is where we store all our money". Even then we will have to deal with the occasional obliviot, but it would be the exception.
I especially agree with your statement that the guy never should have had the authority to delete the file (system acesss), but I’d also throw in the need for a "shouldn’t have the autority to delete the file (policy) recommendation, too. Sometimes we do need help from non-traditional resources. But they should not be allowed to destroy anything without approval.
Glad you took the time to write this up.
To be fair to our security guys, this server is an antivirus server which they manage for the most part. The SQL Server was stood up before they or I were ever hired here. By "security person" he’s not referring to a night guard, its our IT staff responsible for network security. Yes, this mess was mine and I’ll blog about it in an upcoming post but right now we’re still trying to salvage what we can from this. For the record this database contains no health/personal data so those who freak out over any potential HIPAA violations can calm down. A HUGE thank you to Paul for being kind enough to help me out with this and continuing to help out. Situations like this a shining examples of how well the SQL Community works and how great it is to get involved (which kind of lead to Paul’s pro-Twitter post today as well).
"there was a security person who had access to a 2000 SQL Server instance"
This stopped me in my tracks. A security person had this level of access? And didn’t say anything? This violates the Principle of Least Privilege. But then said person went and stepped outside his/her area of responsibility/accountability? A clear violation of the separation of duties ideal in security…
*shakes his head*
If the security folks aren’t following the rules, how can we expect anyone else to?
Paul, It would be interesting to know under what conditions can one delete a log file. In other words, most of the time (assuming you have the right permissions) you will get the ‘File being used’ message and the delete won’t go through. Most likely (didn’t get a chance to test this) the database has to be ‘closed’ or ‘offline’ for the delete to go through. Can you confirm this? Another reason why you want want to set’Auto Close’ to ‘False’.
Basically the answer is that you should never delete the log file. The only time when it’s marginally acceptable is when transporting a cleanly shutdown database with a very large log file. In that case, you can transport just the data files and SQL Server 2005/2008 will create new log files on attach. This is only when the database is cleanly shutdown (i.e. no active transactions), but even then I don’t recommend doing it.
Theoretically, what would be the damage if a T-Log was deleted during a time of no activity ? Would there be corruption ? Wouldn’t everything have been written to the database ?
If there are no active transactions and a checkpoint has occured, then no damage should occur (i.e. if there’s nothing that has to be replayed or undone)
The fact that the security person did not even bother to shoot the DBA an email saying, “btw I deleted your T-log last night and recreated it. My bad.” is what really bothers me the most about this story. So much time was spent on trying to figure out root cause when a quick email could have shortened root cause analysis.
Don’t get me wrong the security guy deleting the transaction log isn’t acceptable but we have all done stupid things. Owning your mistakes and learning from them outweighs trying to hide the fact that you made a boo-boo.