SQL 2005 SP2 maintenance plan bug masking corruption

Not much to do here in Tokyo while waiting for the Bangkok flight except read and blog, and I've already done enough reading for one day on the flight here – so that leaves blogging!

Here's an interesting case that I got involved in on SQLServerCentral (here's the original thread, linked here with permission of the original poster). The problem was that every night the maintenance job would run and end up producing a stack dump, from a 211 error (which always indicates metadata corruption causing the Engine to trip over). I guessed system table corruption but the consistency checking portion of the maintenance job was apparently running fine. So I asked for one of the stack dumps, the output from the maintenance plan, and the latest SQL error log to have a look at (I have a weakness for these things, as you well know).

After having a look, I saw some strange behavior in the error log – every night there 10 or so DBCC CHECKDBs of master within about 2 minutes – corresponding to the consistency checking part of the maintenance job that was clearing running DBCC CHECKDB against all the system and user databases. Then I remembered that vanilla 2005 SP2 has two nasty maintenance plan bugs – one of which caused the consistency checking part of the maintenance plan to malfunction when run, and run DBCC CHECKDB against master every time instead of the database specified. Fixing that would then lead to the being able to properly run consistency checks and find the corrupt database.

So – moral of the story is to make sure you're not running vanilla 2005 SP2 (or one of the builds close to it, 3150 to 3158) otherwise you could get into problems with maintenance plans – worse still, you may think you've been running consistency checks all this time on your user databases, but if no-one's checking the SQL error logs, you could just be repeatedly consistency checking master.

The KB that describes the bug is 934459 (available at http://support.microsoft.com/kb/934459), or you can install SP3 (available at http://support.microsoft.com/?kbid=955706).

Is running repair on msdb safe?

Just saw this on a forum – running REPAIR_ALLOW_DATA_LOSS on msdb to cope with corruption. Yes, this will work but it's certainly not safe and something I'd only recommend as a last resort.

Here's why: msdb stores your backup history and your Agent jobs. REPAIR_ALLOW_DATA_LOSS works by usually "delete what's broken and fix up all the links" (quoting myself). If you run repair on msdb, you need to pay close attention to what was deleted by repair, as it could lead to big problems later on. Some examples:

  1. What if repair deletes a record which just happened to store the Agent job that runs the transaction log backups for your main production database? Suddenly your log isn't being backed up and you don't know about it. The log starts to grow and eventually runs out of space. The database stops and your application is down until you figure out what's wrong.
  2. What if repair deletes a record which just happened to store the details of a log backup of the production database? Your have a disaster and run the script that looks through the backup history tables and auto-generates RESTORE statements to get you up and running again with up-to-the minute recovery. Because of the missing record, there's a missing RESTORE LOG statement in the middle of the restores of the log-backup-chain for the production database. So the restores fail, and you're down until you figure out what's wrong.

Bottom line, it's not safe to run repair on msdb – proceed with extreme caution. Not that repair won't work, or will cause corruption, but that it's far better to have backups of msdb that you can restore from.

Easy monitoring of high-severity errors: create Agent alerts

In my previous post today I talked about error 825 and how you should have an Agent alert set up to catch it when it happens. There are two ways you can set up Agent alerts – using SSMS or using the stored-proc sp_add_alert. In a post from April 2008 I showed how to create an Operator and alerts using the SSMS wizards (see that post here). To use the SP to create the alert for message 825, the code would be as follows:

USE msdb;
GO

EXEC msdb.dbo.sp_add_alert @name = N'825 – Read-Retry Required',
    @message_id = 825,
    @severity = 0,
    @enabled = 1,
    @delay_between_responses = 0,
    @include_event_description_in = 1;
GO

Note the @include_event_description parameter. This is a tinyint that says which of the notification mechanisms should include the description. The list of possibilities is:

  • 0 – None
  • 1 – Email
  • 2 – Pager
  • 4 – Net send

So my value of 1 only includes the description in email, even if I've set up the Operator to receive email and a net send. You can get more info about this SP from the following link: http://msdn.microsoft.com/en-us/library/ms189531.aspx.

As to which alerts you should have defined, my advice would be to have one for message 825 and a separate alert for each severity 19, 20, 21, 22, 23, 24, and 25. These are the high-severity errors and are also written to the error log. You can get a complete description of all error severities at http://msdn.microsoft.com/en-us/library/ms164086.aspx. Note that if you specify @severity, the @message_id must be 0 (and vice-versa).

Happy alerting!

PS And of course you'll need to use the SP sp_add_notification (see http://technet.microsoft.com/en-us/library/ms173843.aspx) to hook the alert to an Operator. Thanks Tibor (who just blogged about how to determine which errors to alert on) – got distracted by email while writing this and forgot to include it!