Unbelievable tale of disaster and recovery

A few days ago one of my new blog readers (a pretty smart cookie, as you'll see) sent me a tale of database catastrophe and an excellent recovery that I’d like to share with you. The story’s been made anonymous and is published with full permission of the author (highlights in bold are mine).

Hey Paul, I was, out of necessity, recently introduced into the mysterious internals of SQL Server through your most-helpful blog. I thought you might be interested in my tale of database death and rebirth, if only as a cautionary tale for others.

I'm an <censored> for the <censored>; I'm a CSci student gone astray. <Censored> are, of course, required by law to retain various kinds of information. One such instance is our collection of <censored>, all of which is stored in a single SQL server DB. The last 10 years of information were in this database; if lost, we would face various fines and legal liabilities.

We had a drive fail in our SQL server's RAID array. We popped in the replacement… and somehow the RAID controller got confused, thinking the array was rebuilt when it wasn't. So whenever it got a request for a block that happened to reside on the new disk, it didn't bother to reconstruct it, but just read back a block of 0s. Our entire 260g array now had 16k blocks of nothingness scattered all over it. One such block happened to hit pages 0 and 1 of our <censored-but-very-important> database. As well as pages 16 & 17, 32 & 33, etc. (with some gaps depending on the parity layout). (I had copied the log file over to a new server before we did the drive swap, so it hadn't suffered this fate.)

And we had no backups, because the backups guy let our SQL Server backup agent license expire, and switched us to _file_ backups of the databases, without checking to see whether those were actually backing anything up (they weren't). The most recent backup was from 2005. There are no words for how screwed we were. We copied the database file to the new server where it obviously would not attach, and could not be repaired through any of the normal SQL Server means (all of which I was completely ignorant of, and learned about through your blog).

While the main IT guy was looking into data recovery services and drinking heavily, I read up on the MDF file format (again, thanks) and figured out with a hex editor where in the page header the page index was stored. I knew that the database file had been contiguous on the disk, and figured that if I could extract an image of the failed drive (which luckily hadn't _physically_ failed) and find those blocks of the file that were on that drive, I could re-integrate them into the main database file. Thanks to some napkin math, I was able to find the appropriate offsets within the image file. I wrote a little thing in Java to scan through that region image for valid pages (skipping the parity blocks) and build a table mapping page indexes to file offsets. Then, scan through the main database file looking for invalid pages, look up the index in the table, and (if it existed) copy in the page from the drive image. (Of course, towards the end of the file were quite a few unallocated pages, which were invalid but had no corresponding page in the image; if I had been smarter I would have worked out the RAID layout exactly so that I could have predicted which pages would be invalid due to the RAID failure and ignored the unallocated pages.)

15 minutes later I had a hopefully-rebuilt database file which, miraculously, attached successfully! Almost… CHECKDB revealed that I had done my math wrong and skipped the last page. But after fixing that and re-running the rebuild, we had a working, consistent database. Which I immediately backed up. Twice.

So anyway, thanks for all the useful information you've put out there. I doubt I would have been able to pull this off without it. 

Wow – that’s some pretty impressive stuff. And not just the cool way he recovered, but the fact that the last SQL backup was SIX YEARS OLD!!! That has to be a new record…

And by the way, if you didn't realize from the story, this guy knew *nothing* about SQL Server before having to fix this problem. Very impressive.

Anyway <censored>, thanks for sharing! And to everyone else, you really don't want this to happen to you…

PS Join our new community to get our monthly newsletter!

40 thoughts on “Unbelievable tale of disaster and recovery

  1. I am so impressed about this post and this DBA I would call him an extra ordinary DBA!!! The method he came up to recover is unbelievable and it was all possible because of SQLSkills and their excellent blogs. Kudos to this DBA and Paul and his team.

  2. That should be required reading by everyone touching SQL Server (Well maybe hide the paragraph about the reconstruction, that shouldn’t be attempted by just anyone!) When I was an independent consultant on the back of my business cards I had a few sayings, my favorite was "Test your restores!!!" This should be a great reminder to do just that as Kimberly alluded to above.

    Thanks for sharing <censored>, great job on the quick thinking and recovery, you deserve a raise for that one (though now you know to at least trust only yourself for the backups and use a "trust but verify" mentality.) And thanks for sharing the letter, Paul.

  3. Imagine your whole business is in that database. Your state transitions from "profitable business" to err_company_does_not_exist in a second.

  4. I’ve seen a similar whole-array failure at at least one of my own <censored> employers, and seen it at numerous clients that I’ve supported – but I have never seen – nor attempted – a recovery without backups. Then again, I didn’t have 10 years of irreplaceable data at stake.

    To the <censored> employer of <censored> : KEEP THIS GUY. At any cost. He demonstrated initiative and resourcefulness that is in the top 0.001% of IT.

    Oh, and one more thing to said employer: Fire your backup guy. He didn’t test recovery. That is all.

  5. This is a truly amazing story. As a Storage & Database Administrator I’m interested in the array that failed. Any chance that you could share the brand and model of that array? :)

  6. Pretty hardcore stuff and its amazing how he is trying to play down the complexity of what he did without any SQL Server knowledge before. I have to say the best automated job I ever built is "testing my restores every day".

  7. Major credit deserved! The backup guy is eternal in debt to the SQL guy and should start each day with a sorry, fresh pastries and a great cup of coffee to the SQL guy.

  8. It’s a fact, the better way to learn is to be in the middle of a problem, and that was a BIG ONE.

    Good bless the mathematics!!!!

  9. That takes me back… to when I still had a Commodore 64. Files on the C64 are stored as singly-linked lists of blocks, with the pointer to the first block stored in the directory (there are no subdirectories, so there’s only the one).

    One day I formatted a disk I shouldn’t have formatted, and there was a file of crucial importance on it. Unformatting tools didn’t exist in those days — or maybe they did, but the Internet didn’t, so I had no way of finding one. What I did was read all individual blocks from the disk, construct the chain of blocks for each file and infer which blocks were the first (the only blocks having no other blocks pointing to them). Having found the right one, I constructed a directory entry and marked the blocks allocated in the allocation map.

    Pretty mundane, except for the fact that I did all the necessary administrative work *by hand*, instead of writing a program to do it. Talk about a waste of time! This was back when I was a teenager and hadn’t yet learned to be constructively lazy. Fortunately, Commodore 64 disks weren’t all that big (there were 683 sectors to a disk, and mine was nowhere near full). A 260G database is quite another matter…

  10. You have to be pretty darn smart to pull that off. You sure that wasn’t an article you wrote in yourself Paul? LOL That is REALLY impressive. Wish I was that smart.

  11. While it is very impressive for a CSci student, it is quite common work for data recovery companies.
    From a MS-SQL database hit by severe damage, and that cannot be fixed to attach, one can expect to recover almost all the data content not damaged. By this I mean all records with data fields in data pages and/or outside record in text tree and text mix pages. There might be some issues with some rare/exotic field types.
    The recovery result would be standard: text files with “insert into table” commands.
    There are several commercially available tools to do this kind of work, with variable results, and commonly used for low cost recoveries.
    Although information is scarce, it looks like several recovery companies have developed own tools to get data out from serious damage where commercial tools cannot do anything.
    An example of what can be done now, consider following.
    With virtualization trend, can be seen two, or even more stacked file systems like ufs at base, then vmfs, and at top ntfs. With damage like the ufs metadata 100% unusable, there are still good chances for an ms-sql database in ntfs to be recovered almost completely.

  12. an extraordinary task he pulled together, he demonstrated Java, RAID(IT) and SQL Server skills through unbelievable data recovery!! I wonder how he kept him self calm to even think about this route.

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.