A few days ago one of my new blog readers (a pretty smart cookie, as you'll see) sent me a tale of database catastrophe and an excellent recovery that I’d like to share with you. The story’s been made anonymous and is published with full permission of the author (highlights in bold are mine).
Hey Paul, I was, out of necessity, recently introduced into the mysterious internals of SQL Server through your most-helpful blog. I thought you might be interested in my tale of database death and rebirth, if only as a cautionary tale for others.
I'm an <censored> for the <censored>; I'm a CSci student gone astray. <Censored> are, of course, required by law to retain various kinds of information. One such instance is our collection of <censored>, all of which is stored in a single SQL server DB. The last 10 years of information were in this database; if lost, we would face various fines and legal liabilities.
We had a drive fail in our SQL server's RAID array. We popped in the replacement… and somehow the RAID controller got confused, thinking the array was rebuilt when it wasn't. So whenever it got a request for a block that happened to reside on the new disk, it didn't bother to reconstruct it, but just read back a block of 0s. Our entire 260g array now had 16k blocks of nothingness scattered all over it. One such block happened to hit pages 0 and 1 of our <censored-but-very-important> database. As well as pages 16 & 17, 32 & 33, etc. (with some gaps depending on the parity layout). (I had copied the log file over to a new server before we did the drive swap, so it hadn't suffered this fate.)
And we had no backups, because the backups guy let our SQL Server backup agent license expire, and switched us to _file_ backups of the databases, without checking to see whether those were actually backing anything up (they weren't). The most recent backup was from 2005. There are no words for how screwed we were. We copied the database file to the new server where it obviously would not attach, and could not be repaired through any of the normal SQL Server means (all of which I was completely ignorant of, and learned about through your blog).
While the main IT guy was looking into data recovery services and drinking heavily, I read up on the MDF file format (again, thanks) and figured out with a hex editor where in the page header the page index was stored. I knew that the database file had been contiguous on the disk, and figured that if I could extract an image of the failed drive (which luckily hadn't _physically_ failed) and find those blocks of the file that were on that drive, I could re-integrate them into the main database file. Thanks to some napkin math, I was able to find the appropriate offsets within the image file. I wrote a little thing in Java to scan through that region image for valid pages (skipping the parity blocks) and build a table mapping page indexes to file offsets. Then, scan through the main database file looking for invalid pages, look up the index in the table, and (if it existed) copy in the page from the drive image. (Of course, towards the end of the file were quite a few unallocated pages, which were invalid but had no corresponding page in the image; if I had been smarter I would have worked out the RAID layout exactly so that I could have predicted which pages would be invalid due to the RAID failure and ignored the unallocated pages.)
15 minutes later I had a hopefully-rebuilt database file which, miraculously, attached successfully! Almost… CHECKDB revealed that I had done my math wrong and skipped the last page. But after fixing that and re-running the rebuild, we had a working, consistent database. Which I immediately backed up. Twice.
So anyway, thanks for all the useful information you've put out there. I doubt I would have been able to pull this off without it.
Wow – that’s some pretty impressive stuff. And not just the cool way he recovered, but the fact that the last SQL backup was SIX YEARS OLD!!! That has to be a new record…
And by the way, if you didn't realize from the story, this guy knew *nothing* about SQL Server before having to fix this problem. Very impressive.
Anyway <censored>, thanks for sharing! And to everyone else, you really don't want this to happen to you…
PS Join our new community to get our monthly newsletter!