SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.
Since my colleague Paul Randal wrote DBCC CHECKDB while he was on the SQL Server Product team at Microsoft, he is an acknowledged expert on SQL Server database corruption and repair techniques. Because of this well-earned reputation, we typically get multiple e-mails each week asking for Paul’s advice and assistance dealing with database corruption and repair issues.
A typical pattern for these e-mails is that a production SQL Server database has become suspect, and running DBCC CHECKDB fails with some specific series of errors. Depending on exactly what errors are being returned from DBCC CHECKDB, it may be a situation where DBCC CHECKDB cannot do anything to resolve the corruption. In some cases, Paul can go in and do some manual repair work (at his regular consulting rate) to help resolve the issue, but in other cases, even Paul cannot fix the corruption (or he is not immediately available to do any work).
That leaves restoring from your last set of known-good database backups as the last line of defense. Unfortunately, in many cases, it turns out that there are no good database backups available that can actually be restored. If this happens, it is likely to be resume/CV updating time for the DBA, and it may even threaten the existence of your entire organization. So what can you do to minimize the chance of this happening to you or your organization?
Here are a few steps that you can take:
Keep your main system BIOS and all storage-related firmware and drivers up to date
One of the leading causes of database corruption (and backup corruption) is problems with your storage subsystem. These are often caused by out-of-date versions of your main system BIOS, storage firmware, or storage drivers. The server and component vendors don’t typically go to the trouble of issuing these types of updates unless they are correcting significant issues.
When these types of updates are available, they are often labeled as critical or urgent updates. Reading the release notes for these updates can often give you more information about the issue and the fix for the issue. As a DBA, you want to make sure someone (perhaps you) is monitoring this situation for your database servers.
Use SQL Server Agent Alerts to detect important errors on your SQL Server instance
Many novice DBAs have never even heard of SQL Server Agent Alerts. In a nutshell, they can be used to more quickly detect and possibly react to some types of hardware and software issues and errors that may happen on a SQL Server instance (or its underlying hardware and storage).
Normally, these types of errors will just get logged to the SQL Server Error Log, where they might not be noticed in a timely manner. Fortunately, I have a T-SQL script that can create a set of SQL Server Agent Alerts for many common issues. I also have a blog post with more details here.
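As a rough illustration of what such alerts look like (this is a minimal sketch, not the full script mentioned above), the following T-SQL creates alerts for errors 823, 824, and 825 and for severity 24 errors, and asks SQL Server Agent to e-mail an operator. The operator name DBA Team and the alert names are placeholders I made up for the example; the operator must already exist, and Database Mail must be configured for the e-mail to be sent.

```sql
USE msdb;
GO

-- Hypothetical operator name; create it first with msdb.dbo.sp_add_operator
DECLARE @OperatorName sysname = N'DBA Team';

-- Error 823: the operating system reported an I/O error to SQL Server
EXEC dbo.sp_add_alert
    @name = N'Error 823 - OS-level I/O error',
    @message_id = 823,
    @severity = 0,                       -- must be 0 when @message_id is used
    @enabled = 1,
    @delay_between_responses = 900,      -- seconds between repeat notifications
    @include_event_description_in = 1;   -- include the error text in the e-mail

-- Error 824: logical consistency I/O error (for example, a bad page checksum)
EXEC dbo.sp_add_alert
    @name = N'Error 824 - Logical consistency I/O error',
    @message_id = 824,
    @severity = 0,
    @enabled = 1,
    @delay_between_responses = 900,
    @include_event_description_in = 1;

-- Error 825: a read succeeded only after being retried
EXEC dbo.sp_add_alert
    @name = N'Error 825 - Read-retry required',
    @message_id = 825,
    @severity = 0,
    @enabled = 1,
    @delay_between_responses = 900,
    @include_event_description_in = 1;

-- Severity 24: fatal error, hardware or media failure
EXEC dbo.sp_add_alert
    @name = N'Severity 24 - Hardware error',
    @message_id = 0,                     -- must be 0 when @severity is used
    @severity = 24,
    @enabled = 1,
    @delay_between_responses = 900,
    @include_event_description_in = 1;

-- Notify the operator by e-mail (notification method 1) for each alert
EXEC dbo.sp_add_notification
    @alert_name = N'Error 823 - OS-level I/O error',
    @operator_name = @OperatorName, @notification_method = 1;
EXEC dbo.sp_add_notification
    @alert_name = N'Error 824 - Logical consistency I/O error',
    @operator_name = @OperatorName, @notification_method = 1;
EXEC dbo.sp_add_notification
    @alert_name = N'Error 825 - Read-retry required',
    @operator_name = @OperatorName, @notification_method = 1;
EXEC dbo.sp_add_notification
    @alert_name = N'Severity 24 - Hardware error',
    @operator_name = @OperatorName, @notification_method = 1;
```

The same pattern extends to the other high-severity levels (19 through 25); only the alert name and the @severity value change.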
Make sure all of your databases are using CHECKSUM for their Page_Verify option
CHECKSUM is the default PAGE_VERIFY setting for new databases since SQL Server 2005, but you might have older databases that have been upgraded over the years where the PAGE_VERIFY setting was never changed. You also might have a situation where someone has purposely switched the PAGE_VERIFY setting to TORN_PAGE_DETECTION or NONE for some strange reason.
When CHECKSUM is enabled for the PAGE_VERIFY database option, the SQL Server Database Engine calculates a checksum over the contents of the whole page, and stores the value in the page header when a page is written to disk. When the page is read from disk, the checksum is recomputed and compared to the checksum value that is stored in the page header. I previously wrote about this issue here.
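A quick way to check this is to query sys.databases. The sketch below lists any database that is not using CHECKSUM and then shows the ALTER DATABASE command to fix one of them. YourDatabase is a placeholder name, not a real database.

```sql
-- List databases whose PAGE_VERIFY option is not CHECKSUM
SELECT name, page_verify_option_desc
FROM sys.databases
WHERE page_verify_option_desc <> N'CHECKSUM'
ORDER BY name;

-- Switch one database to CHECKSUM (placeholder database name)
ALTER DATABASE [YourDatabase] SET PAGE_VERIFY CHECKSUM;
```

Keep in mind that changing the setting does not immediately add checksums to existing pages. A page only gets a checksum the next time it is modified and written back to disk.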
Make sure you are using the CHECKSUM option with your database backups
You can (and should) add the CHECKSUM option whenever you run any type of database backup. Since SQL Server 2014, you have had the ability to set an instance-level setting (with sp_configure) to add this option to backup commands by default, just in case someone (or a third-party backup solution) does not add the option in the actual backup command. With older versions of SQL Server, you can get the same effect by adding Trace Flag 3023 as a start-up trace flag. You can also enable or disable TF 3023 dynamically.
Adding the CHECKSUM syntax to the backup command forces SQL Server to verify any existing page checksums as it reads pages for the backup, and it calculates a checksum over the entire backup. Adding the CHECKSUM option is not a replacement for actually restoring a database backup to see if it is good or not, but it is a good intermediate step in the process.
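Here is a minimal sketch of these options in T-SQL. The database name and the backup path are placeholders, so adjust them for your environment. The RESTORE VERIFYONLY step at the end is my own addition; it rechecks the checksums in the backup file, but it is still not a substitute for a real restore.

```sql
-- SQL Server 2014 and newer: make CHECKSUM the default for all backups
EXEC sys.sp_configure N'backup checksum default', 1;
RECONFIGURE;

-- Older versions: enable Trace Flag 3023 globally instead.
-- This lasts until the next restart unless you also add -T3023
-- as a start-up parameter.
-- DBCC TRACEON (3023, -1);

-- Explicitly request CHECKSUM in the backup command itself
-- (placeholder database name and path)
BACKUP DATABASE [YourDatabase]
TO DISK = N'X:\Backups\YourDatabase_Full.bak'
WITH CHECKSUM, STATS = 10;

-- Re-read the backup and verify its checksums without restoring it
RESTORE VERIFYONLY
FROM DISK = N'X:\Backups\YourDatabase_Full.bak'
WITH CHECKSUM;
```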
Actually restore your database backups on a regular basis to verify that they are good
This is the only way to be absolutely sure that your database backups are good. These other steps will increase the chances that your database backups are good, but an actual database restore is the acid test. You should be doing this on a regular basis, in an automated fashion.
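As a rough sketch of what one automated restore test might run (all database names, logical file names, and paths here are placeholders), you can restore the backup under a different database name on a test server and then run DBCC CHECKDB against the restored copy. The DBCC CHECKDB step is my suggestion as a follow-up check rather than part of the restore itself.

```sql
-- Find the logical file names stored in the backup (needed for MOVE)
RESTORE FILELISTONLY
FROM DISK = N'X:\Backups\YourDatabase_Full.bak';

-- Restore under a different name so the original database is untouched.
-- The logical names below are assumptions; use the FILELISTONLY output.
RESTORE DATABASE [YourDatabase_RestoreTest]
FROM DISK = N'X:\Backups\YourDatabase_Full.bak'
WITH CHECKSUM,
     MOVE N'YourDatabase' TO N'X:\RestoreTest\YourDatabase_RestoreTest.mdf',
     MOVE N'YourDatabase_log' TO N'X:\RestoreTest\YourDatabase_RestoreTest.ldf',
     RECOVERY,
     STATS = 10;

-- Check the restored copy for corruption
DBCC CHECKDB (N'YourDatabase_RestoreTest') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Clean up once the test has passed
DROP DATABASE [YourDatabase_RestoreTest];
```

Wrapping something like this in a SQL Server Agent job, with the results logged somewhere you will actually look, is one way to make the test routine.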
Microsoft has some foundational guidance about backup and restore operations here. Paul Randal has a Pluralsight course called SQL Server: Understanding and Performing Backups.
The whole subject of avoiding database corruption and having an effective database backup and restore strategy to meet your RPO and RTO goals is far more extensive than I want to cover in a single SQL101 blog post. Hopefully the information in this post has been a good starting point.