A few days ago one of my new blog readers (a pretty smart cookie, as you'll see) sent me a tale of database catastrophe and an excellent recovery that I’d like to share with you. The story’s been made anonymous and is published with full permission of the author (highlights in bold are mine).
Hey Paul, I was, out of necessity, recently introduced to the mysterious internals of SQL Server through your most helpful blog. I thought you might be interested in my tale of database death and rebirth, if only as a cautionary tale for others.
I'm an <censored> for the <censored>; I'm a CSci student gone astray. <Censored> are, of course, required by law to retain various kinds of information. One such instance is our collection of <censored>, all of which is stored in a single SQL Server DB. The last 10 years of information were in this database; if lost, we would face various fines and legal liabilities.
We had a drive fail in our SQL Server's RAID array. We popped in the replacement… and somehow the RAID controller got confused, thinking the array was rebuilt when it wasn't. So whenever it got a request for a block that happened to reside on the new disk, it didn't bother to reconstruct it, but just read back a block of 0s. Our entire 260GB array now had 16KB blocks of nothingness scattered all over it. One such block happened to hit pages 0 and 1 of our <censored-but-very-important> database. As well as pages 16 & 17, 32 & 33, etc. (with some gaps depending on the parity layout). (I had copied the log file over to a new server before we did the drive swap, so it hadn't suffered this fate.)
And we had no backups, because the backups guy let our SQL Server backup agent license expire, and switched us to _file_ backups of the databases, without checking to see whether those were actually backing anything up (they weren't). The most recent backup was from 2005. There are no words for how screwed we were. We copied the database file to the new server where it obviously would not attach, and could not be repaired through any of the normal SQL Server means (all of which I was completely ignorant of, and learned about through your blog).
While the main IT guy was looking into data recovery services and drinking heavily, I read up on the MDF file format (again, thanks) and figured out with a hex editor where in the page header the page index was stored. I knew that the database file had been contiguous on the disk, and figured that if I could extract an image of the failed drive (which luckily hadn't _physically_ failed) and find those blocks of the file that were on that drive, I could re-integrate them into the main database file. Thanks to some napkin math, I was able to find the appropriate offsets within the image file. I wrote a little thing in Java to scan through that region of the image for valid pages (skipping the parity blocks) and build a table mapping page indexes to file offsets. Then it scanned through the main database file looking for invalid pages, looked up the index in the table, and (if it existed) copied in the page from the drive image. (Of course, towards the end of the file were quite a few unallocated pages, which were invalid but had no corresponding page in the image; if I had been smarter I would have worked out the RAID layout exactly so that I could have predicted which pages would be invalid due to the RAID failure and ignored the unallocated pages.)
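For the curious, here's a minimal Java sketch of what such a two-pass patch tool could look like. It's a reconstruction of the idea, not the author's actual code: the file names ("drive.img", "database.mdf"), the brute-force scan of the whole image (rather than the napkin-math offsets), and the all-zeros validity test are assumptions for illustration. The page number is read from the m_pageId field, which SQL Server's documented page header layout places at byte offset 32 of each 8KB page.

```java
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Hypothetical two-pass patcher. File names, the whole-image scan, and the
// crude validity test are illustrative assumptions, not the author's code.
public class PagePatcher {
    static final int PAGE_SIZE = 8192;     // SQL Server pages are 8KB
    static final int PAGE_ID_OFFSET = 32;  // documented m_pageId field in the page header

    // Read the 4-byte little-endian page number from the page header.
    static int pageNumber(byte[] page) {
        return (page[PAGE_ID_OFFSET] & 0xFF)
                | (page[PAGE_ID_OFFSET + 1] & 0xFF) << 8
                | (page[PAGE_ID_OFFSET + 2] & 0xFF) << 16
                | (page[PAGE_ID_OFFSET + 3] & 0xFF) << 24;
    }

    // An all-zero page is one the confused RAID controller "rebuilt".
    static boolean isZeroed(byte[] page) {
        for (byte b : page) {
            if (b != 0) return false;
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        byte[] buf = new byte[PAGE_SIZE];

        // Pass 1: scan the drive image and map page number -> image offset.
        // A real tool would also sanity-check the header (page type, file ID)
        // to skip the parity blocks, as the author did.
        Map<Integer, Long> pageOffsets = new HashMap<>();
        try (RandomAccessFile image = new RandomAccessFile("drive.img", "r")) {
            for (long off = 0; off + PAGE_SIZE <= image.length(); off += PAGE_SIZE) {
                image.seek(off);
                image.readFully(buf);
                if (!isZeroed(buf)) {
                    pageOffsets.put(pageNumber(buf), off);
                }
            }
        }

        // Pass 2: walk the damaged MDF; wherever a page is zeroed, copy the
        // matching page back in from the image. Unallocated zeroed pages at
        // the end of the file simply won't have a match, as in the story.
        try (RandomAccessFile mdf = new RandomAccessFile("database.mdf", "rw");
             RandomAccessFile image = new RandomAccessFile("drive.img", "r")) {
            long pageCount = mdf.length() / PAGE_SIZE;  // off-by-one territory!
            for (long p = 0; p < pageCount; p++) {
                mdf.seek(p * PAGE_SIZE);
                mdf.readFully(buf);
                if (!isZeroed(buf)) continue;
                Long src = pageOffsets.get((int) p);
                if (src != null) {
                    image.seek(src);
                    image.readFully(buf);
                    mdf.seek(p * PAGE_SIZE);
                    mdf.write(buf);
                }
            }
        }
    }
}
```

Note that the loop bound in pass 2 is exactly the kind of place where a "skipped the last page" bug lives (see the next paragraph), which is why re-running DBCC CHECKDB until it comes back clean is essential.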
15 minutes later I had a hopefully-rebuilt database file which, miraculously, attached successfully! Almost… CHECKDB revealed that I had done my math wrong and skipped the last page. But after fixing that and re-running the rebuild, we had a working, consistent database. Which I immediately backed up. Twice.
So anyway, thanks for all the useful information you've put out there. I doubt I would have been able to pull this off without it.
Wow – that’s some pretty impressive stuff. And not just the cool way he recovered, but the fact that the last SQL backup was SIX YEARS OLD!!! That has to be a new record…
And by the way, if you didn't realize from the story, this guy knew *nothing* about SQL Server before having to fix this problem. Very impressive.
Anyway <censored>, thanks for sharing! And to everyone else, you really don't want this to happen to you…
40 thoughts on “Unbelievable tale of disaster and recovery”
Pure genius.
I am so impressed by this post and this DBA that I would call him an extraordinary DBA!!! The method he came up with to recover is unbelievable, and it was all possible because of SQLskills and their excellent blogs. Kudos to this DBA and Paul and his team.
That should be required reading for everyone touching SQL Server. (Well, maybe hide the paragraph about the reconstruction; that shouldn't be attempted by just anyone!) When I was an independent consultant I had a few sayings on the back of my business cards; my favorite was "Test your restores!!!" This should be a great reminder to do just that, as Kimberly alluded to above.
Thanks for sharing, <censored>; great job on the quick thinking and recovery. You deserve a raise for that one (though now you know at least to trust only yourself with the backups, and to use a "trust but verify" mentality). And thanks for sharing the letter, Paul.
How hardcore is that?! And described in such a nonchalant way as if anyone would have done the same. <censored>, I take my hat off.
Wow. Just… wow.
Holy Cow! Just friggin’ awesome. This guy deserves a medal, a pay raise, and an all-expenses-paid vacation.
Yes, that is one smart cookie and, yes, I would not want to be in his shoes.
I sense a new SQL Server database recovery company starting up in the next few weeks…
Imagine your whole business is in that database. Your state transitions from "profitable business" to err_company_does_not_exist in a second.
So is this the level of knowledge they would expect for the SQL Master certification??
Wow! That’s amazing stuff. Ship that wizard a bucket of loot! I sure hope he got a raise, a bonus, a week in Tahiti or something else substantial for that.
Above and beyond any standards I can think of!
It must feel good to know you posted something *so* helpful.
I think congratulations are in order Paul.
Very impressive story!
This is certainly not Joe Average.
Must be great to receive this kind of feedback!
I’ve seen a similar whole-array failure at at least one of my own <censored> employers, and seen it at numerous clients that I’ve supported – but I have never seen – nor attempted – a recovery without backups. Then again, I didn’t have 10 years of irreplaceable data at stake.
To the <censored> employer of <censored> : KEEP THIS GUY. At any cost. He demonstrated initiative and resourcefulness that is in the top 0.001% of IT.
Oh, and one more thing to said employer: Fire your backup guy. He didn’t test recovery. That is all.
This is a truly amazing story. As a Storage & Database Administrator I’m interested in the array that failed. Any chance that you could share the brand and model of that array? :)
Pretty hardcore stuff, and it's amazing how he plays down the complexity of what he did without any prior SQL Server knowledge. I have to say the best automated job I ever built is "testing my restores every day".
That’s awesome!
Test those backups NOW folks!!! ;-)
Cheers,
kt
Holy Failed Processes, Batman!! That man deserves a pay increase. I wonder if the company understands what he actually did for them?
Pretty amazing story, showing that this guy was very persistent. Saved the day and the company.
No – this is way beyond!
As Kim says, ‘You don’t have a backup unless you’ve restored it’.
An impressive feat indeed!
I hope I never have to try it in a must-do situation like this. I'd like to do it at least once to see if I could.
That is truly amazing! Thanks! I feel TOTALLY inadequate! :-)
Now this is a story that beats everything else.
A very smart approach. I have to say that I am not that creative.
Chapeau to the unknown guy who saved the day!
WOW! The way he recovered data over a messed up storage and lack of recent backup is jaw-dropping… Well done!
Major credit deserved! The backup guy is eternally in debt to the SQL guy, and should start each day with an apology, fresh pastries, and a great cup of coffee for the SQL guy.
This guy is awesome. <censored> you sir are one smart cookie.
It’s a fact: the best way to learn is to be in the middle of a problem, and that was a BIG ONE.
God bless mathematics!!!!
Very impressed… nice work!
Wow… excellent. What presence of mind!
Thanks for sharing, Paul. Your blog helped all the folks (as usual).
Incredible story – I was thinking the same thing that was mentioned by a previous commenter. I wonder if the company realizes the scope of what this guy pulled off? I sure hope so.
Wow. Consider us all schooled.
Except Paul and Kimberly, of course :-).
That takes me back… to when I still had a Commodore 64. Files on the C64 are stored as singly-linked lists of blocks, with the pointer to the first block stored in the directory (there are no subdirectories, so there’s only the one).
One day I formatted a disk I shouldn’t have formatted, and there was a file of crucial importance on it. Unformatting tools didn’t exist in those days — or maybe they did, but the Internet didn’t, so I had no way of finding one. What I did was read all individual blocks from the disk, construct the chain of blocks for each file and infer which blocks were the first (the only blocks having no other blocks pointing to them). Having found the right one, I constructed a directory entry and marked the blocks allocated in the allocation map.
Pretty mundane, except for the fact that I did all the necessary administrative work *by hand*, instead of writing a program to do it. Talk about a waste of time! This was back when I was a teenager and hadn’t yet learned to be constructively lazy. Fortunately, Commodore 64 disks weren’t all that big (there were 683 sectors to a disk, and mine was nowhere near full). A 260G database is quite another matter…
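For anyone who wants to see the shape of that inference, here's a tiny hypothetical Java sketch. On a real 1541 disk the link is a track/sector pair in the first two bytes of each 256-byte block; this simplifies it to a flat block index, with -1 meaning end-of-file, and it will also surface free blocks with garbage links as bogus chains (hence "having found the right one").

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the recovery trick above: a block that no other
// block points to must be the head of a file's chain.
public class ChainFinder {
    // next[b] is the block that block b links to, or -1 at end of file.
    public static List<List<Integer>> findFiles(int[] next) {
        boolean[] pointedTo = new boolean[next.length];
        for (int n : next) {
            if (n >= 0 && n < next.length) pointedTo[n] = true;
        }
        List<List<Integer>> files = new ArrayList<>();
        for (int head = 0; head < next.length; head++) {
            if (pointedTo[head]) continue;  // something links here: not a head
            List<Integer> chain = new ArrayList<>();
            // Bounds checks guard against garbage links and cycles, since
            // free blocks on a formatted disk contain junk.
            for (int b = head;
                 b >= 0 && b < next.length && chain.size() < next.length;
                 b = next[b]) {
                chain.add(b);
            }
            files.add(chain);
        }
        return files;
    }
}
```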
You have to be pretty darn smart to pull that off. You sure that wasn't a tale you wrote in to yourself, Paul? LOL That is REALLY impressive. Wish I was that smart.
Awesome work!
Wow, that’s absolutely incredible!
While it is very impressive for a CSci student, it is quite common work for data recovery companies.
From an MS-SQL database hit by severe damage, one that cannot be repaired enough to attach, you can expect to recover almost all of the undamaged data content. By this I mean all records with data fields in data pages, and/or stored outside the record in text tree and text mix pages. There might be some issues with rare/exotic field types.
The recovery result would be standard: text files with “insert into table” commands.
There are several commercially available tools for this kind of work, with variable results, and they are commonly used for low-cost recoveries.
Although information is scarce, it looks like several recovery companies have developed their own tools to get data out of serious damage where commercial tools cannot do anything.
As an example of what can be done now, consider the following.
With the virtualization trend, you can see two or even more stacked file systems: UFS at the base, then VMFS, and NTFS on top. Even with damage like the UFS metadata being 100% unusable, there is still a good chance of an MS-SQL database in NTFS being recovered almost completely.
OMG… he is sheer genius.
An extraordinary task he pulled off: he demonstrated Java, RAID (IT), and SQL Server skills through unbelievable data recovery!! I wonder how he kept himself calm enough to even think of this route.
I know – hugely impressive!