As Kimberly mentioned last week, SQLskills is embarking on a new initiative to blog about basic topics, which we’re calling SQL101. We’ll all be blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.
For my first SQL101 post, I’d like to touch on a subject that has been core to my work since I graduated in 1994: dealing with corruption. You may not know that before joining the SQL Server engineering team at Microsoft in early 1999, I worked for the file system group at DEC (Digital Equipment), where among other things I was responsible for the VMS equivalent of the Windows chkdsk (called ANAL/DISK). It was this expertise with corruption and repairing it that led me to work on DBCC, rewriting much of the DBCC CHECKDB check and repair code for SQL Server 2005.
All through my professional career I’ve seen people make mistakes when they encounter corruption, so here I’d like to offer some quick guidelines for how to approach SQL Server corruption.
When corruption appears, it can be scary. Suddenly your main database has all these errors and you don’t know what to do. The absolute best thing you can do is to keep calm and make rational decisions about how to proceed. If you react with a knee-jerk response, jump to conclusions, or let someone pressure you into making a snap decision, the odds are you will make a mistake and make the situation worse.
Make use of the run book
Check to see if your team or department has a disaster recovery handbook (often called a run book). This should give you useful information like:
- How to access the backups
- How to access Windows and SQL Server installation media and product keys
- Who to call in various other departments for assistance with infrastructure
- Who to call for help in your department
- Who to notify of the problem (think CIO, CTO, I.T. Director)
- How to proceed with various scenarios (e.g. restoring the main production database, or performing a bare-metal install of a new server)
Your run book might say to immediately fail over to a synchronous Availability Group replica, or some other redundant copy of the database, no matter what the problem is and then figure out the problem on the main production database afterwards. If that’s the case, that’s what you do.
And if you’re reading this and thinking ‘Hmm – we don’t have one of those…’, then that’s a big problem that should be addressed, as well as making sure that even the most junior DBA can follow the various procedures in it.
Consult my comprehensive flow chart
A few years ago I wrote a large flow chart for SQL Server Magazine, and it’s available in PDF poster form here (archived on a friend’s blog).
This can also form the basis of a run book if you don’t have one.
Understand the extent of the corruption
It is a very good idea to run DBCC CHECKDB on the database (if you haven’t already) to determine the extent of the corruption.
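As a minimal sketch of such a check, assuming a database called CorruptDB (a placeholder name, not one from this post), the command could look like this:

```sql
-- CorruptDB is a placeholder; substitute the name of your own database.
-- NO_INFOMSGS suppresses the informational messages so only errors are shown.
-- ALL_ERRORMSGS asks for every error to be reported rather than a truncated list.
DBCC CHECKDB (N'CorruptDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```

Save the complete output somewhere safe. The object IDs, index IDs, and page IDs in the error messages are what tell you which of the options discussed here are open to you.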
Depending on where the corruption is, you may be able to restore in a lot less time than restoring the entire database. For instance, if only a single page is damaged, you might be able to do a single-page restore. If only a single filegroup is damaged, you might be able to do a single filegroup restore.
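To illustrate the single-page case, here is a hedged sketch of a page restore. It assumes the database is called CorruptDB, uses the FULL recovery model with an unbroken chain of log backups, and that the damaged page is 1:57 (file ID 1, page 57). The database name, page ID, and backup paths are all placeholders.

```sql
-- All names, the page ID, and the backup paths below are placeholders.

-- 1. Restore only the damaged page from the most recent full backup.
RESTORE DATABASE CorruptDB
    PAGE = '1:57'
    FROM DISK = N'D:\Backups\CorruptDB_Full.bak'
    WITH NORECOVERY;

-- 2. Restore each existing log backup in sequence (repeat for every log backup).
RESTORE LOG CorruptDB
    FROM DISK = N'D:\Backups\CorruptDB_Log1.trn'
    WITH NORECOVERY;

-- 3. Back up the current tail of the log so the page can be brought fully up to date.
BACKUP LOG CorruptDB
    TO DISK = N'D:\Backups\CorruptDB_Tail.trn';

-- 4. Restore that final log backup and complete the restore.
RESTORE LOG CorruptDB
    FROM DISK = N'D:\Backups\CorruptDB_Tail.trn'
    WITH RECOVERY;
```

This follows the sequence for restoring a page while the rest of the database stays online, which depends on your edition of SQL Server. A single filegroup restore has the same overall shape, with FILEGROUP = N'YourFilegroupName' in place of the PAGE clause.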
Depending on what the corruption is, you may not even have to restore. For instance, if the corruption is confined to nonclustered indexes (all the corruption messages list index IDs higher than 1), you can rebuild the corrupt indexes manually with code like the following:
```sql
BEGIN TRANSACTION;
GO
ALTER INDEX CorruptIndexName ON TableName DISABLE;
GO
ALTER INDEX CorruptIndexName ON TableName REBUILD WITH (ONLINE = ON);
GO
COMMIT TRANSACTION;
GO
```
That means you don’t have to restore or use repair, both of which incur downtime.
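The corruption messages identify indexes by object ID and index ID rather than by name. As a small sketch, a query like the following can translate those IDs into the table and index names needed for the rebuild. The two ID values are placeholders standing in for whatever DBCC CHECKDB reported.

```sql
-- Run this in the context of the affected database.
-- 1977058079 and 2 are placeholder values; use the object ID and index ID
-- from your own DBCC CHECKDB error messages.
SELECT
    OBJECT_SCHEMA_NAME(i.object_id) AS SchemaName,
    OBJECT_NAME(i.object_id)        AS TableName,
    i.name                          AS IndexName,
    i.index_id,
    i.type_desc
FROM sys.indexes AS i
WHERE i.object_id = 1977058079
  AND i.index_id = 2;
```

An index_id of 0 (heap) or 1 (clustered index) means the table data itself is affected, so the rebuild approach does not apply.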
Consider the ramifications of the actions you’re planning
If you’ve never dealt with corruption before and you’re not an experienced DBA, there are actions that might be tempting that could cause you bigger headaches than just having corruption.
- If you have a corrupt database, don’t try to detach it from the instance as you likely won’t be able to attach it again because of the corruption. This especially applies if the database is marked as SUSPECT. If you ever have this scenario, you can reattach the database using the steps in my post Disaster recovery 101: hack-attach a damaged database.
- If your SQL Server instance is damaged, and the database is corrupt, don’t try to attach it to a newer version of SQL Server, as the upgrade might fail and leave the database in a state where it can’t be attached to either the old or new versions of SQL Server.
- If crash recovery is running, don’t ever be tempted to shut down SQL Server and delete the log file. That is guaranteed to cause at least data inconsistencies and at worst corruption. Crash recovery can sometimes take a long time, depending on the length of open transactions at the time of the crash that must be rolled back.
If you’re planning or have been told to do something, make sure you understand what the ramifications of that thing are.
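Before taking any of these actions, it helps to see what state the database is actually in. The following is a minimal sketch, assuming a placeholder database name of CorruptDB. The 'DB STARTUP' filter in the second query is an assumption about how the recovery task is labeled on your version of SQL Server.

```sql
-- CorruptDB is a placeholder database name.
-- state_desc shows values such as ONLINE, RECOVERING, RECOVERY_PENDING,
-- SUSPECT, or EMERGENCY.
SELECT name, state_desc
FROM sys.databases
WHERE name = N'CorruptDB';

-- While a database is starting up, its recovery task may appear here with a
-- progress estimate. The command filter is an assumption; remove it and look
-- through all requests if nothing is returned.
SELECT session_id, command, database_id, percent_complete, estimated_completion_time
FROM sys.dm_exec_requests
WHERE command = N'DB STARTUP';
```

If crash recovery is still making progress, the safest course is usually to let it finish.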
Don’t just jump to repair
The repair option is called REPAIR_ALLOW_DATA_LOSS because you’re telling DBCC CHECKDB that it can lose data to perform repairs. The repairs (with a few exceptions) are written as ‘delete what’s broken and fix up all the links’. That’s usually the only way to write a repair algorithm for a certain corruption that fixes it in 100% of cases without making things worse. After running repair, you will likely have lost some data, and DBCC CHECKDB can’t tell you what it was. You really don’t want to run repair if you can avoid it.
Also, there are some cases of corruption that absolutely cannot be repaired (like corrupt table metadata) so then you *have* to have backups or a redundant copy to use.
There is a last resort that we made a documented feature back in SQL Server 2005, EMERGENCY-mode repair, for when the transaction log is damaged. That will try to get as much data out of the transaction log as possible and then run a regular repair. Although that may get the database back online, you’ll likely have data loss and data inconsistencies. It really is a last resort, and it’s not infallible either.
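For reference, an EMERGENCY-mode repair could be sketched as follows, assuming a placeholder database name of CorruptDB. Treat it as an outline of the steps rather than a procedure to run without preparation.

```sql
-- CorruptDB is a placeholder name. Last resort only: expect data loss,
-- and take a copy of the database files first if at all possible.

-- Put the database into EMERGENCY mode and restrict it to one connection.
ALTER DATABASE CorruptDB SET EMERGENCY;
ALTER DATABASE CorruptDB SET SINGLE_USER;

-- In EMERGENCY mode, REPAIR_ALLOW_DATA_LOSS performs the emergency-mode repair.
DBCC CHECKDB (N'CorruptDB', REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- If the repair succeeds, allow normal connections again.
ALTER DATABASE CorruptDB SET MULTI_USER;
```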
You really want to have backups to use or a redundant copy to fail over to instead.
But if you *have* to use repair, try to do it on a copy of the corrupt database. And then go fix your backup strategy so you aren’t forced to use repair again in future.
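One way to get a copy to work on is to back up the corrupt database and restore it under a different name. This is a sketch under stated assumptions: the database names, logical file names, and paths are all placeholders, and the backup may still fail if the damage is severe.

```sql
-- All database names, logical file names, and paths are placeholders.

-- 1. Back up the corrupt database, continuing past any page errors encountered.
BACKUP DATABASE CorruptDB
    TO DISK = N'D:\Backups\CorruptDB_Corrupt.bak'
    WITH CONTINUE_AFTER_ERROR, COPY_ONLY;

-- 2. Restore it under a new name, again continuing past errors.
RESTORE DATABASE CorruptDB_Copy
    FROM DISK = N'D:\Backups\CorruptDB_Corrupt.bak'
    WITH CONTINUE_AFTER_ERROR,
         MOVE N'CorruptDB_Data' TO N'D:\Data\CorruptDB_Copy.mdf',
         MOVE N'CorruptDB_Log'  TO N'D:\Data\CorruptDB_Copy.ldf';

-- 3. Run the repair against the copy. Repair requires single-user mode.
ALTER DATABASE CorruptDB_Copy SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB (N'CorruptDB_Copy', REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS, ALL_ERRORMSGS;
ALTER DATABASE CorruptDB_Copy SET MULTI_USER;
```

Working on the copy lets you see what the repair removes before deciding whether to accept the same loss on the real database.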
Be very careful with 3rd-party tools
There are some 3rd-party tools that will try to do repairs or extract data out. I’ve seen them work sometimes and I’ve seen them spectacularly fail and totally trash a database at other times. If you’re going to try one of these out, do it on a copy of the corrupt database in case something goes wrong.
Ask for help (but be careful)
If you don’t know what to do and you’re concerned that you’ll make things worse or make a wrong decision, try asking for help. For free, you could try using the #sqlhelp hashtag on Twitter, or you could try posting to a forum like http://dba.stackexchange.com/ or one of the forums at https://www.sqlservercentral.com/Forums/. Sometimes I’ll have time to respond to a quick email giving some direction, and sometimes I’ll recommend that you get some consulting help to work on data recovery.
You can also call Microsoft Customer Support for assistance, but you’ll always need to pay for that unless the source of the corruption turns out to be a SQL Server bug.
Wherever you get the help from though, be careful that the advice seems sound and you can verify the suggestion with well-known and reputable sources.
Do root cause analysis
After you’ve recovered from the corruption, try to figure out why it happened in the first place, as the odds are that it will happen again. The overwhelmingly vast majority of corruptions are caused by the I/O subsystem (including all the software under SQL Server), with a very small percentage being caused by memory chip problems, and a smaller percentage being caused by SQL Server bugs. Look in the SQL Server error log and the Windows event logs, ask the Storage Admin if anything happened, and so on.
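SQL Server also keeps some evidence of its own that can feed into this analysis. As a small sketch, these two queries show the pages SQL Server has recorded as damaged and whether each database has page checksums enabled:

```sql
-- Pages that SQL Server has recorded as damaged. event_type values 1 to 3
-- indicate I/O errors, bad checksums, and torn pages; higher values record
-- pages that were later restored, repaired, or deallocated.
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;

-- Page checksums help SQL Server detect damage introduced by the I/O subsystem.
-- Databases not showing CHECKSUM here are worth a closer look.
SELECT name, page_verify_option_desc
FROM sys.databases;
```

The timestamps in suspect_pages give you a window to line up against the error log, the event logs, and anything the storage team reports.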
Practice and research
It’s a *really* good idea to practice recovering from corruption before you have to do it for real. You’ll be more comfortable with the procedures involved and you’ll be more confident. I have some corrupt databases that you can download and practice with here.
There’s also a lot of instructional information on my blog.
And there are two Pluralsight online training courses I’ve recorded which will give you an enormous boost in practical knowledge:
- SQL Server: Detecting and Correcting Database Corruption
- SQL Server: Advanced Corruption Recovery Techniques
Ok – so it turned out to not be quite as quick as I thought! However, this is all 101-level information that will help you work through a corruption problem or exercise. I’ll be blogging a lot more of these 101-level posts this year. If there’s anything in particular you’d like to see us cover at that level, please leave a comment.