This week's been a busy one on the forums and Twitter, with lots of interesting problems people are hitting. One of the things I've noticed is that there are lot's of misconceptions about running repair, so to round out Friday I'm going to run through a list of them for you. Here are the misconceptions, some of which I've had to argue several times with people and eventually resort to 'Look, I wrote the repair code, I'm sorry but you're wrong', which I hate doing:

  • Repair will not cause data loss. It depends. If you have to use REPAIR_ALLOW_DATA_LOSS, then you're going to lose data. That's why the option is named that - seriously.
  • Repair should be run as the default. No. Figure out what's wrong first before deciding what to do about it. If you've got a damaged 1TB clustered index, it's going to get rebuilt by repair. If you don't have an extra 1TB of disk space, it will fail, and then you're back to square one after hours of fruitless effort. You might be able to get away with doing something that doesn't involve taking the database (essentially) offline.
  • You can run repair without running DBCC CHECKDB. No. Repair is an option to one of the consistency-checking commands (DBCC CHECKALLOC, DBCC CHECKTABLE, or DBCC CHECKDB - note that DBCC CHECKFILEGROUP and DBCC CHECKCATALOG don't support repair).
  • As soon as you've run repair, everything's fine. No. You should always run DBCC CHECKDB a second time after running repair, to make sure the first repair fixed everything. Sometimes a corruption prevents some deeper checks being done, and when it's fixed, the next DBCC CHECKDB can run the deeper check and find more corruptions. I call this 'corruption masking'. The other reason is that repair probably just deleted some of your data. What effect is that going to have on the application? What if that's medical patient data? Or insurance records? Or back account details?
  • Repair can always fix everything. No. There are some things that DBCC CHECKDB cannot fix. See CHECKDB From Every Angle: Can CHECKDB repair everything? for the list.
  • Repair is safe to use on system databases. No. It cannot be used on master or tempdb because they cannot be put into single-user mode. You can do it on model, but it's unlikely to have an effect as there aren't any user tables in model (unless you create them) and system tables generally don't get repaired. You can run it on msdb, but it might have strange side-effects. See Is running repair on msdb safe?.
  • You can run repairs online. No. Repairs are always offline, in that the database must be in single-user mode.
  • REPAIR_REBUILD will fix everything. No. REPAIR_REBUILD only fixes problem in nonclustered indexes. In 2005 onwards, REPAIR_FAST does nothing at all.
  • Repairs on a repl Publisher propagate to the Subscribers. No. Anything done by repair is NOT marked for replication. You must reinitialize your Subscribers if you repair a Publisher.
  • Repairs always fix-up constraints. No. It has no idea that constraints exist. After repairing a database with constraints, you should run DBCC CHECKCONSTRAINT to make sure they're still valid.
  • Repairs try to save data. No. It doesn't go out of it's way to delete data, but it doesn't go out of it's way to save data in most cases. 'Delete what's broken and fix up all the links' is my sound-bite explanation of what repair does. Fix things as fast as possible and as provably correct as possible.
  • EMERGENCY mode repair will always work. No. I've seen cases where something broken in the file-system caused it to fail. Don't rely on repair.
  • You can undo repairs. It depends. If you started an explicit transaction, then you can roll everything back if you don't like what it did. People rarely do this though. EMERGENCY mode repair can never be rolled back.

Repair's a dangerous beast and should only be used as a last resort, or to bring a VVVLDB online again potentially much faster than a full restore, when a small amount of data loss can be tolerated. This is just a quick brain-dump of things people get wrong about repair.

Have a great weekend - tomrorow I'll report on last week's survey and kick off a new one. Cheers!

This is a true story, and unfolded over the last few days. It's deliberately written this way, I'm not trying to be patronizing - just illustrating the mistakes people can make if they don't know what not to do.

Once upon a time (well, a few days ago), there was a security person who had access to a 2000 SQL Server instance. He saw that the server had run out of disk space, but had no SQL server knowledge. He decided that the best thing to do was to delete the log file, instead of contact the real DBA to take corrective action. This person should not have had any access to SQL Server, or the ability to delete the log. Nevertheless, the log was deleted. And then the database became suspect. Eventually, with no other option, someone decided to create a new transaction log file using DBCC REBUILD_LOG. This was the right thing to do. The database wasn't being backed up so restore wasn't an option. Some of the details of who did what are unclear, as it so often the case. Needless to say, the real DBA was kept in the dark.

Unfortunately, whoever rebuilt the log didn't run DBCC CHECKDB afterwards to find out what corruption had been caused in the database by having the transaction log unceremoniously ripped out from under the database's feet. Rebuilding a transaction log is one of the worst things that can happen to a database, as all active transactions are thrown away, with no chance of rolling back. This leaves the database in a transactionally inconsistent, and potentially structurally inconsistent state (see Corruption: Last resorts that people try first... filed under my Bad Advice category, and Q4 from TechNet Magazine: February 2009 SQL Q&A column: Is it ever safe to rebuild a transaction log?).

Instead they carried on running. The real DBA thought everything was cool and even upgraded the instance from SQL 2000 to SQL 2005. To be honest, after seeing the 19000+ corruptions they had in the database, I'm surprised that the database upgraded successfully. But it did. Then he tried to rebuild some indexes, which failed with corruption errors. Which brought us to this morning, when DBCC CHECKDB was run. It reported a lot of corruption, but running from SSMS only showed the first 1000 errors (which was the inspiration for one of my previous posts today How to get all the corruption messages from CHECKDB. So they started to run REPAIR_ALLOW_DATA_LOSS, which will cause data loss - we chose the name of the repair option well. I got involved at this point, and saw from the list of errors that massive data loss would be the result.

Once they had re-run DBCC CHECKDB though the command line (luckily only 20 minutes) I could see all the errors. And I must say I was astounded. It was the *worst* case of multiply-allocated extents (see Inside the Storage Engine: Anatomy of an extent) that I've ever seen. Not only were the extents allocated by multiple IAM pages (see Inside the Storage Engine: IAM pages, IAM chains, and allocation units) - in other words, two objects thought they had the same pages in the database allocated, but the two objects had actually each proceeded to overwite the other's updates in the pages - getting the two clustered indexes hopelessly interlinked. Lots of errors (1000s) like:

Msg 8904, Level 16, State 1, Server NAMESCHANGEDTOPROTECTTHEINNOCENT, Line 1
Extent (1:9528) in database ID 5 is allocated by more than one
allocation object.
...
Msg 8978, Level 16, State 1, Server NAMESCHANGEDTOPROTECTTHEINNOCENT, Line 1
Table error: Object ID 1445580188, index ID 1, partition ID
376212519911424, alloc unit ID 94737543200768 (type In-row data). Page
(1:52696) is missing a reference from previous page (1:427112).
Possible chain linkage problem.
...
Msg 8977, Level 16, State 1, Server NAMESCHANGEDTOPROTECTTHEINNOCENT, Line 1
Table error: Object ID 1445580188, index ID 1, partition ID
376212519911424, alloc unit ID 94737543200768 (type In-row data).
Parent node for page (1:143210) was not encountered.
...
Msg 2533, Level 16, State 1, Server NAMESCHANGEDTOPROTECTTHEINNOCENT, Line 1
Table error: page (1:405139) allocated to object ID 1445580188, index
ID 1, partition ID 376212519911424, alloc unit ID 94737543200768 (type
In-row data) was not seen. The page may be invalid or may have an
incorrect alloc unit ID in its header.
...
CHECKDB found 1653 allocation errors and 17646 consistency errors in
database 'NAMESCHANGEDTOPROTECTTHEINNOCENT'.

The repair for all of this? Delete it all and fix up the links. MASSIVE data loss.

Lessons to learn from this:

  • Don't give people with no clue about SQL Server access to SQL Server. 
  • Don't delete a transaction log to reclaim space. Cardinal sin.
  • Don't rebuild a transaction log UNLESS you run a full DBCC CHECKDB afterwards and satify yourself that you know the extent of the damage.
  • Don't upgrade a database without running DBCC CHECKDB first. Best case - the upgrade fails. Worst case - it upgrades and then you might not be able to fix the corruptions.
  • Don't just run repair without doing a little analysis on the errors reported, especially with a *huge* number of errors reported.

Finally, I'm really grateful to the DBA in question for letting me help him out with this, and to post this blog post - we all learn from our own and others' mistakes.

PS And I got involved in this from Twitter - I just *love* it. After initially being skeptical of how much time I'd spend on it, I'm finding the benefits of connecting to the SQL community in 'real-time' vastly outweigh the time I'm putting into it. Follow me on http://twitter.com/paulrandal and you'll see a bunch of other SQL MVPs on there too.

Just saw this on a forum - running REPAIR_ALLOW_DATA_LOSS on msdb to cope with corruption. Yes, this will work but it's certainly not safe and something I'd only recommend as a last resort.

Here's why: msdb stores your backup history and your Agent jobs. REPAIR_ALLOW_DATA_LOSS works by usually "delete what's broken and fix up all the links" (quoting myself). If you run repair on msdb, you need to pay close attention to what was deleted by repair, as it could lead to big problems later on. Some examples:

  1. What if repair deletes a record which just happened to store the Agent job that runs the transaction log backups for your main production database? Suddenly your log isn't being backed up and you don't know about it. The log starts to grow and eventually runs out of space. The database stops and your application is down until you figure out what's wrong.
  2. What if repair deletes a record which just happened to store the details of a log backup of the production database? Your have a disaster and run the script that looks through the backup history tables and auto-generates RESTORE statements to get you up and running again with up-to-the minute recovery. Because of the missing record, there's a missing RESTORE LOG statement in the middle of the restores of the log-backup-chain for the production database. So the restores fail, and you're down until you figure out what's wrong.

Bottom line, it's not safe to run repair on msdb - proceed with extreme caution. Not that repair won't work, or will cause corruption, but that it's far better to have backups of msdb that you can restore from.

This blog post describes the demo "2 - NC Indexes" from my Corruption Survival Techniques conference session from various conferences in 2008. The links to the scripts and databases to use are in this blog post.

The aim of this demo is to show that sometimes its just redundant data (i.e. nonclustered indexes) that get corrupted, and so you don't have to do anything that takes the actual data offline - like restoring from a full backup or running one of the repair options (both of which require the database to be in single-user mode).

Let's look at an example. Extract and restore the DemoNCIndex database, and the NCIndexCorruption.sql script. What do we get from running DBCC CHECKDB on the DemoNCIndex database (lines 39-42)?

DBCC CHECKDB (DemoNCIndex) WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO

Msg 8951, Level 16, State 1, Line 1
Table error: table 'Customers' (ID 453576654). Data row does not have a matching index row in the index 'CustomerName' (ID 2). Possible missing or invalid keys for the index row matching:
Msg 8955, Level 16, State 1, Line 1
Data row (1:45:28) identified by (CustomerID = 29) with index values 'LastName = 'Adams' and CustomerID = 29'.
Msg 8951, Level 16, State 1, Line 1
Table error: table 'Customers' (ID 453576654). Data row does not have a matching index row in the index 'CustomerName' (ID 2). Possible missing or invalid keys for the index row matching:
Msg 8955, Level 16, State 1, Line 1
Data row (1:180:164) identified by (CustomerID = 2118) with index values 'LastName = 'Adams' and CustomerID = 2118'.

<snip - removed for brevity>

Msg 8952, Level 16, State 1, Line 1
Table error: table 'Customers' (ID 453576654). Index row in index 'CustomerName' (ID 2) does not match any data row. Possible extra or invalid keys for:
Msg 8956, Level 16, State 1, Line 1
Index row (1:24482:16) with values (LastName = 'Andersen' and CustomerID = 18718) pointing to the data row identified by (CustomerID = 18718).
Msg 8952, Level 16, State 1, Line 1
Table error: table 'Customers' (ID 453576654). Index row in index 'CustomerName' (ID 2) does not match any data row. Possible extra or invalid keys for:
Msg 8956, Level 16, State 1, Line 1
Index row (1:24482:127) with values (LastName = 'Arthur' and CustomerID = 9758) pointing to the data row identified by (CustomerID = 9758).
CHECKDB found 0 allocation errors and 26 consistency errors in table 'Customers' (object ID 453576654).
CHECKDB found 0 allocation errors and 26 consistency errors in database 'DemoNCIndex'
repair_rebuild is the minimum repair level for the errors found by DBCC CHECKDB (DemoNCIndex).
 

Lots of errors. Now, in this example there are only 26 errors, but in cases where there are hundreds of errors it can be really hard to tell whether all the corruptions are in nonclustered indexes (i.e. indexes with IDs > 1). Luckily, there's an undocumented option to all the DBCC CHECK* commands - WITH TABLERESULTS. The option is undocumented because the output can change from release to release, but basically this converts the DBCC output into tabular form. Try running lines 48-50 in the script and you'll see what I mean. One of the columns in the output is IndexId - so you can easily see whether all the errors are in nonclustered indexes. In this case, they are, and all in one index of the Customers table.

You could run lines 55-57 of the script to prove to yourself that repairs can't be run online, and then realize that we can address the problem without having to run repair or restore. First off we need to figure out the name of the index to fix - index ID 2 of the Customers table. Lines 77-80 run sp_HelpIndex on the table (although I should really be using Kimberly's sp_HelpIndex2):

USE DemoNCIndex
GO
EXEC sp_HelpIndex 'Customers';
GO

index_name     index_description                                   index_keys
-------------- --------------------------------------------------- ------------
CustomerName   nonclustered located on PRIMARY                     LastName
CustomerPK     clustered, unique, primary key located on PRIMARY   CustomerID

The nonclustered index is called CustomerName. Plug in the index name to line 82, then try fixing the index by doing an online index rebuild, and run DBCC CHECKDB afterwards (lines 82-89). The corruption hasn't been fixed! Online index rebuild reads the old index to build the new one so the new index has the same missing rows as the old one. We need to do an offline index rebuild - with lines 110-115. After the last DBCC CHECKDB, the index is fixed up. Now, on SQL Server 2008, you may or may not get a query plan for the index rebuild that doesn't use the old index, because the query optimizer has some more plan choices available to it - so on SQL Server 2008 you may need to do an actual drop and create of the broken index (carefully, if its enforcing a constraint).

So - just because DBCC CHECKDB reports a ton of errors, that doesn't necessarily mean that the database needs to be taken (essentially) offline to repair it - check through the errors to see if it's just nonclustered indexes that are affected.

For those of you who couldn't make it to a conference this year where I presented my Corruption Survival Techniques session, the folks at TechEd EMEA have just posted an 80 minute long video of the presentation I did in Barcelona in early November. It walks through I/O errors, what CHECKDB does, how it works, how to run it, CHECKDB FAQ, how to interpret the output, choosing between repair and restore and has a bunch of demos of recovering from corruptions. Lots of fun stuff!

The video is available at http://www.microsoft.com/emea/teched2008/itpro/tv/default.aspx?vid=78. The accompanying scripts and corrupt databases are all posted on our website - see this blog post for details.

Enjoy!

Now, I'm very thick-skinned and I know there are always some people in a conference session who don't agree with everything I say (that's human nature, and I'm totally cool with that) but this one I just couldn't pass up mentioning here on the blog as I *utterly* disagree with the advice in that post, and suspect that the poster didn't "get" what I was trying to explain in the session.

I came across an interesting blog post from someone who attended PASS, describing my Corruption Survival Techniques session as really interesting and fun, but basically useless. The advice was that there are only a handful of people in the world who can run things like single-page restore and emergency mode repair, and as soon as corruption is suspected, the DBA should just call Product Support for help.

The point of my session is to explain two things - that you should pro-actively be looking for corruption, and you should know what to do when corruption occurs. Both of these enable your business to experience less down-time and data-loss when corruption does occur. So turning on page checksums and running DBCC CHECKDB regularly are easy. So is planning a decent backup strategy (based on what you want to be able to restore - see my previous post on this - Planning a backup strategy - where to start?).

The more tricky part is knowing what to do when corruption does occur. That's why I discuss some of the output of DBCC CHECKDB, in terms of high-level tips and tricks rather than what each and every error means (see my previous post on this - Tips and tricks for interpreting CHECKDB output). I also recommend backups as the best way to limit data-loss, but not necessarily down-time - depending on the backups you have available. The last part of the session shows some tricks for getting around worst-case scenarios, like someone detaching a suspect database or needing to run emergency mode repair. I don't expect everyone to run off and start hacking the 2005 system tables with a single-user booted server and using the DAC (but if you do, see this post Wink) but having some of this knowledge can make DBAs more confident to tackle problems themselves and increase their skills.

Since I've been blogging about this stuff and presenting it at conferences, I've heard from *countless* people who've used these techniques themselves to recover from disasters, and learned a ton of information and good practices in the process. Any production DBA with half a brain (a great Scottish expression Smile) should be able to use restore, single-page restore, or run a repair - otherwise, with all due respect, they shouldn't be running a production system. Now, for "involuntary" DBAs, who (through no fault of their own) may not know anything about backups, restores, or repairs - it's a totally different story, and help should be sought through Product Support or forums.

But to come out with a blanket statement that knowing how to run restores, repairs and do first-level interpretation of DBCC CHECKDB output is useless? And that potentially wasting time and money with front-line Product Support is the best course of action when corruption occurs, when you can work out most of it for yourself? That's *bad advice* as far as I'm concerned.

Maybe I'm just cranky as I'm sitting here with a very sore mouth after getting a filling at the dentist this morning Cry

What do you think? Comments please!

(PS I'm not fishing for praise - I want to know what you think of the argument)

This blog post explains the demo scripts and databases I've posted to cover all the Corruption Survival Techniques and DBCC CHECKDB sessions I've presented at conferences this year. There are two zip files you need to download: the example corrupt databases (36-MB zip) and the demo scripts. These are also both available through our past events pages.

The databases zip contains the following databases:

  • DemoDataPurity
    • 192-MB SQL Server 2005 database with a single 2570 (data purity) error
  • DemoFatalCorruption1
    • 1-MB SQL Server 2005 database with a corrupt system table (that allows CHECKDB to complete)
  • DemoFatalCorruption2
    • 1-MB SQL Server 2005 database with a corrupt system table (that terminates CHECKDB)
  • DemoNCIndex
    • 192-MB SQL Server 2005 database with a bunch of nonclustered index corruptions
  • DemoRestoreOrRepair
    • 1-MB SQL Server 2005 database with a page checksum failure (in fact a zero'd out page)
  • DemoCorruptMetadata
    • 1-MB SQL Server *2000* database with corrupt syscolumns table

The scripts zip contains the following directories:

  • 1 - Fatal Errors
    • This makes use of the DemoFatalCorruption1 and DemoFatalCorruption2 databases. The FatalErrors.sql script has the steps to follow and see this blog post for a complete walk-through.
  • 2 - NC Indexes
    • This makes use of the DemoNCIndex database. The NCIndexCorruption.sql script has the steps to follow and see this blog post for a complete walk-through.
  • 3 - Data Purity
    • This makes use of the DemoDataPurity database. The DataPurityCorruption.sql script has the steps to follow. I'll blog more details on this script sometime in the next week (and then update this post).
  • 4 - Metadata
    • This makes use of the DemoCorruptMetadata database. The CorruptMetadata.sql script has the steps to follow and see this blog post for a complete walk-through.
  • 5 - Restore or Repair
    • This makes use of the DemoRestoreOrRepair database. There's a setup script (01CreateRestoreOrRepair.sql) and two demo scripts - FixUsingPageRestore.sql and FixUsingRepair.sql. I'll blog more details on these scripts sometime in the next week (and then update this post).
  • 6 - Suspect Database
    • This demo doesn't come with a corrupt database - you create your own one. The SuspectDatabase.sql script has the steps to follow and see this blog post for a complete walk-through.

Enjoy!

It's been almost two weeks since my last post as we've been offline in Scotland (another photo post to follow - I owe you two now...) but now we're back for the crazy Fall conference and teaching season.

The latest installment of my regular Q&A column in TechNet Magazine is available at http://technet.microsoft.com/en-us/magazine/cc895648.aspx. This month I cover the following topics:

  • How backups and restores work and why the times for each may differ
  • The difference between log shipping and database mirroring around BULK_LOGGED operations
  • How other factors apart from log backups can contribute to excessively large transaction logs
  • Why database repair exists and why it shouldn't be used
  • Tracking index usage with sys.dm_db_index_usage_stats

Enjoy!

This is one I've been trying to get to since I started blogging a couple of years ago: how to re-attach a detached suspect database. This is a pretty common scenario I see on the forums - a database goes suspect so the DBA tries to detach/attach, which fails. I wrote a demo for my corruption session at TechEd this year that shows how to create a suspect database with hex editor, then detaches it and shows how to re-attach and fix it. It's going to be a long blog post, but I've got everything wrapped up before we fly out to the UK tomorrow so I've got a bit of spare time.

Creating a Suspect Database

First off I'm going to create a simple database to use, called DemoSuspect with a table and some random data.

USE MASTER
GO

CREATE DATABASE DemoSuspect
GO

USE DemoSuspect;
GO

CREATE TABLE Employees (FirstName VARCHAR (20), LastName VARCHAR (20), YearlyBonus INT);
GO
INSERT INTO Employees VALUES ('Paul', 'Randal', 10000);
INSERT INTO Employees VALUES ('Kimberly', 'Tripp', 10000);
GO

Now I'll perform an update in an explicit transaction and force it to be written out to disk with a CHECKPOINT. I've accidentally deleted Kimberly's bonus!

-- Simulate an in-flight transaction
BEGIN TRAN;
UPDATE Employees SET YearlyBonus = 0 WHERE LastName = 'Tripp';
GO

CHECKPOINT;
GO

Then in another window, I'll simulate a crash using:

SHUTDOWN WITH NOWAIT;
GO

Now that SQL Server is shutdown, I'm going to simulate an I/O failure that corrupts the log file. I'm going to use a hex editor to do this - my editor of choice is the immensely popular and useful XVI32, written by Christian Maas. I opened the log file, filled the first section with zeroes, and then saved it again. See the screenshot below.

When I startup SQL Server again, it will try to run recovery on the DemoSuspect database and fail. This will put the database into the SUSPECT state.

So I restarted SQL Server, let's try getting in to the DemoSuspect database.

USE DemoSuspect;
GO

Msg 945, Level 14, State 2, Line 1
Database 'DemoSuspect' cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details.

Now let's check the database status:

SELECT DATABASEPROPERTYEX ('DemoSuspect', 'STATUS') AS 'Status';
GO

Status
--------

SUSPECT

Now at this point, the correct procedure is to restore from backups. If there are no backups available, then the next best thing is to get the database into EMERGENCY mode and extract as much data as possible, or run emergency-mode repair. However, I'm going to try the detach/attach route instead.

Detaching the Database

I'll try detaching the database using sp_detach_db:

EXEC sp_detach_db 'DemoSuspect';
GO

Msg 947, Level 16, State 1, Line 1
Error while closing database 'DemoSuspect'. Check for previous additional errors and retry the operation.

Hmm - did it work or didn't it?

SELECT * FROM sys.databases WHERE NAME = 'DemoSuspect';
GO

And there are no results, so the detach must have succeeded.

Re-attaching the Database

Let's try the obvious sp_attach_db:

EXEC sp_attach_db @dbname = N'DemoSuspect', 
   
@filename1 = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect.mdf', 
   
@filename2 = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect_LOG.ldf';
GO

Msg 5172, Level 16, State 15, Line 1
The header for file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect_LOG.ldf' is not a valid database file header. The PageAudit property is incorrect.

Hmm. How about using the ATTACH_REBUILD_LOG option on CREATE DATABASE? That should create a new log file for me:

CREATE DATABASE DemoSuspect ON
   (NAME = DemoSuspect, FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect.mdf')
FOR ATTACH_REBUILD_LOG;
GO

Msg 5172, Level 16, State 15, Line 1
The header for file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect_LOG.ldf' is not a valid database file header. The PageAudit property is incorrect.
File activation failure. The physical file name "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect_LOG.ldf" may be incorrect.
The log cannot be rebuilt because the database was not cleanly shut down.
Msg 1813, Level 16, State 2, Line 1
Could not open new database 'DemoSuspect'. CREATE DATABASE is aborted.

Hmm. The database knows that there was an active transaction. Using the ATTACH_REBUILD_LOG command only works if the database was cleanly shut down and the log is missing. What about if I actually remove the log file and try that - maybe I could fool it? I took a copy of the data and log file, and then deleted the original log file (this is getting more and more dodgy...). Let's try again:

CREATE DATABASE DemoSuspect ON
   
(NAME = DemoSuspect, FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect.mdf')
FOR ATTACH_REBUILD_LOG;
GO

File activation failure. The physical file name "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect_LOG.ldf" may be incorrect.
The log cannot be rebuilt because the database was not cleanly shut down.
Msg 1813, Level 16, State 2, Line 1
Could not open new database 'DemoSuspect'. CREATE DATABASE is aborted.

Well, that was a long-shot but SQL Server's smarter.

Basically the problem is that the database wasn't cleanly shutdown, which means that recovery HAS to run and complete before the database can be attached again. Given that our log file is corrupt, that's impossible. So, never detach a suspect database.

The only way to get the database back into SQL Server is to use a hack. I'm going to create a new dummy database with the exact same file layout and as close as possible to the file sizes of the detached database. Then I'm going to shutdown SQL Server, swap in the corrupt database files, and re-start SQL Server. If all goes well, the corrupt, suspect database will be attached again.

The one major downside of this is that if the SQL Server instance doesn't have instant initialization enabled (see How to tell if you have instant initialization enabled?), then creating the dummy database could take a long time if the data files are very big. This means that your application is offline while the files are created and zero'd out.

So, I've already got a copy of the corrupt database files, so now I need to delete the data file too. Before doing this you want to make absolutely sure you've got multiple copies of the corrupt database files... just in case. After deleting the data file, I can create my dummy database.

CREATE DATABASE DemoSuspect
GO

If you forget to delete the existing corrupt files first, you'll get the following error:

Msg 1802, Level 16, State 4, Line 1
CREATE DATABASE failed. Some file names listed could not be created. Check related errors.
Msg 5170, Level 16, State 1, Line 1
Cannot create file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect.mdf' because it already exists. Change the file path or the file name, and retry the operation.

Ok - delete the data file and try again. Now I need to check the dummy database's files are there (at this point I'm totally paranoid), shutdown SQL Server, delete the dummy databases' files (this just seems wrong!), and swap in the corrupt files. I took another copy of the corrupt files before swapping them back in, just in case something goes wrong.

After restarting SQL Server, I can check the database state:

SELECT DATABASEPROPERTYEX ('DemoSuspect', 'STATUS') AS 'Status';
GO

Status
--------
SUSPECT

Woo-hoo - I'm back to having a suspect database attached again - after having to shutdown the server and mess about deleting and copying files around. Not good. Now I can actually fix it.

Fixing the Database

If you don't have any backups, then the only way to get into the database is to use EMERGENCY mode. This lets you into the database but you need to be aware that recovery has not completed so the contents of the database are transactionally (and possibly structurally) inconsistent. I'm going to choose to repair the database using emergency-mode repair. See CHECKDB From Every Angle: EMERGENCY mode repair - the very, very last resort for a detailed description of this tool.

ALTER DATABASE DemoSuspect SET EMERGENCY;
GO
ALTER DATABASE DemoSuspect SET SINGLE_USER;
GO
DBCC CHECKDB (DemoSuspect, REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO

Msg 5172, Level 16, State 15, Line 1
The header for file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\DemoSuspect_log.LDF' is not a valid database file header. The PageAudit property is incorrect.
File activation failure. The physical file name "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\DemoSuspect_log.LDF" may be incorrect.
The log cannot be rebuilt because the database was not cleanly shut down.
The Service Broker in database "DemoSuspect" will be disabled because the Service Broker GUID in the database (9E879BFC-B742-4A69-AB14-4D6BD6F99E02) does not match the one in sys.databases (B4568D23-7018-40CF-B189-9C29DE697C09).
Warning: The log for database 'DemoSuspect' has been rebuilt. Transactional consistency has been lost. The RESTORE chain was broken, and the server no longer has context on the previous log files, so you will need to know what they were. You should run DBCC CHECKDB to validate physical consistency. The database has been put in dbo-only mode. When you are ready to make the database available for use, you will need to reset database options and delete any extra log files.

First off it tries to do the regular ATTACH_REBUILD_LOG. When that fails, DBCC CHECKDB takes over and forces the log to be rebuilt, after trying to force as much of the damaged log to be processed for recovery as it can. It then runs a full repair, in case there's anything corrupt in the database - in this case there isn't so there are no corruption messages in the output.

Notice the line about the Service Broker GUID being wrong. I had to use the hack method to get the database attached again, but when I created the dummy database, it created a Service Broker GUID for the DemoSuspect database in master.sys.databases. When I swapped in the corrupt database, it has a different GUID - so now I can't use Service Broker! And all because I detached the suspect database instead of repairing it.

So what's the state of the data?

USE DemoSuspect;
GO

SELECT * FROM Employees;
GO

FirstName   LastName   YearlyBonus
----------  ---------  ------------
Paul        Randal     10000
Kimberly    Tripp      0

Kimberly doesn't get a bonus this year - she won't be happy! This is contrived and flippant, of course, but it illustrates the point that after doing an emergency-mode repair, transactions that were active at the time the log was damaged will not get a chance to roll-back, most likely. In this case, I know what was going on when the crash occured, but what about on a busy OLTP system with hundreds or thousands of active transactions? What state will the data be in?

Summary

Yes, you can recover from a detached suspect database, but it's not pretty and you have to be very careful. The best course of action is always to have a comprehensive backup strategy that allows you to restore as quickly as possible. If you do have a suspect database and no backups, use EMERGENCY mode to access and/or repair the database. Hopefully this article will help people that find themselves in these situations.

Every so often I'll see posts on the various data corruption forums discussing causes of corruption. In this post I want to debunk some of the myths around what can cause corruption. There are really two types of corruption to deal with, physical corruption and logical corruption.

Physical corruption

This is where something has altered the contents of a data or log file sector with no regard for what is being stored there. Possible causes of physical corruption are:

  • Problem with the I/O subsystem (99.8% of all cases I've ever seen - only 3 nines as I'd estimate I've seen around about a thousand corruption cases). Remember the I/O subsystem is everything underneath SQL Server in the I/O stack - including the OS, 3rd-party file system filter drivers, device drivers, RAID controllers, SAN controllers, network hardware, drives themselves, and so on. Millions of lines of code and lots of moving parts spinning very fast, very close to very fragile pieces of metal oxide (I once heard Jim Gray liken a disk drive head to a 747 jumbo jet flying at 500 mph at a height of 1/4 inch from the ground...)
  • Problem with the host machine hardware (0.1% of cases). Most of the time this is a memory error.
  • SQL Server bugs (0.1% of cases). Yes, there have been corruption bugs. Every piece of software has bugs. There are KB articles describing bugs.
  • Deliberate introduction of corruption using a hex editor or other means.

Physical corruption is what DBCC CHECKDB usually reports and the majority of cases are caused by a physical failures of some kind, with the minority caused by humans - software bugs.

Logical corruption

This is where something has altered some data so that a data relationship is broken. Possible causes of logical corruption are:

  • Humans

:-) Okay...

  • Application bug. The application deletes one part of an inherent data relationship but not the other. Or the application designer doesn't implement a constraint properly. Or the application designer doesn't cope with a transaction roll-back properly. You get the idea.
  • Accidental update/delete. Someone deletes or updates some data incorrectly.
  • SQL Server bug. See above.
  • DBCC CHECKDB when using the REPAIR_ALLOW_DATA_LOSS option. As is documented in Books Online, and I've blogged about and mentioned when lecturing, if you run repair, it doesn't take into account any inherent or explicit constraints on the data.

The point here is that a physical failure of a component does not cause logical corruption, it causes physical corruption. Conversely, application errors cause logical corruption, not physical corruption. DBCC CHECKDB errors are about physical corruption (okay, with the inclusion of DBCC CHECKCATALOG code in 2005, it will find cases where the DBA has manually altered the system tables, causing logical corruption) and applications cannot cause physical corruption as they can only manipulate data through SQL Server. If an application hits a SQL Server bug which causes physical corruption, that's still not the application causing physical corruption, it's SQL Server.

So - on to the myths.

  • Can an application cause physical corruption? No.
  • Can stopping a shrink operation cause corruption of any kind? No.
  • Can stopping an index rebuild cause corruption of any kind? No.
  • Can running DBCC CHECKDB without repair cause corruption of any kind? No.
  • Can creating a database snapshot cause corruption of any kind? No.

Hope this helps.

Today I presented my brand new session Surviving Corruption: From Detection to Recovery at TechEd. I had a lot of fun putting together the demos, presenting the session, and talking to people afterwards. During the session, I promised to blog each of the demos so that everyone can run through them - here's the first one.

On SQL 2000, it was pretty easy to get into the system tables and manually change them - all you had to do was:

EXEC sp_configure 'allow updates', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO

And then you could insert, update, and delete whatever you wanted in the all the system tables, including the critical three - sysindexes, sysobjects, and syscolumns. The problem was that sometimes people actually did this and messed things up - for instance, by manually deleting an object from sysobjects, but leaving around all the other info about the object - such as indexes and columns. DBCC CHECKCATALOG in SQL 2000 would find this, but DBCC CHECKDB would not - as it didn't run the DBCC CHECKCATALOG code - any most people do not run DBCC CHECKCATALOG at all. Many times now, I've seen databases upgraded to 2005 and suddenly DBCC CHECKDB is reporting metadata corruption errors - all because someone had manually changed the system tables on 2000, and I changed DBCC CHECKDB in 2005 to include the DBCC CHECKCATALOG checks.

This demo is all about that. I created a 2000 database, manually deleted a row in sysobjects and then upgraded the database to 2005. The corrupt database is available in a zip file - DemoCorruptMetadata.zip. If you unzip it into a folder C:\SQLskills then you can attach it using:

RESTORE DATABASE DemoCorruptMetadata FROM DISK = 'C:\SQLskills\DemoCorruptMetadata.bak'
   
WITH MOVE 'DemoCorruptMetadata' TO 'C:\SQLskills\DemoCorruptMetadata.mdf',
   
MOVE 'DemoCorruptMetadata_log' TO 'C:\SQLskills\DemoCorruptMetadata_log.ldf',
   
REPLACE;
GO

So what does the corruption look like on 2005?

DBCC CHECKDB (DemoCorruptMetadata) WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3853, State 1: Attribute (object_id=1977058079) of row (object_id=1977058079,column_id=1) in sys.columns does not have a matching row (object_id=1977058079) in sys.objects.
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3853, State 1: Attribute (object_id=1977058079) of row (object_id=1977058079,column_id=2) in sys.columns does not have a matching row (object_id=1977058079) in sys.objects.
CHECKDB found 0 allocation errors and 2 consistency errors not associated with any single object.
CHECKDB found 0 allocation errors and 2 consistency errors in database 'DemoCorruptMetadata'.

This is what we expect. Notice that there's no recommended repair level at the end of the output - this is because CHECKDB can't repair metadata corruptions. We can't fix this with a backup - unless we have a backup from 2000 from before the manual delete in the system tables. To fix this we'd need to go back to 2000, fix the corruption, and then upgrade again - usually not feasible.

Instead, we're going to fix it by manually altering the system tables in 2005 - something that's purportedly not possible. First let's see what tables there are that could include column information (remembering that the system catalogs were completely rewritten between 2000 and 2005):

SELECT [name] FROM DemoCorruptMetadata.sys.objects WHERE [name] LIKE '%col%';
GO

name
------------------
sysrowsetcolumns
syshobtcolumns
syscolpars
sysiscols

I know that sysrowsetcolumns and syshobtcolumns are involved at low-levels of the Storage Engine and don't contain relational metadata, so let's try syscolpars. I want to see what columns there are to see if one of the looks like an object ID, and another looks like a column ID. This query will just return the table columns, with no rows (because the condition 1=0 is always false:

SELECT * FROM DemoCorruptMetadata.sys.syscolpars WHERE 1 = 0;
GO

Msg 208, Level 16, State 1, Line 1
Invalid object name 'DemoCorruptMetadata.sys.syscolpars'.

I can't bind to internal system tables in 2005. But - I can bind to internal system tables using the Dedicated Admind Connection (or DAC for short). This is documented in Books Online at http://msdn.microsoft.com/en-us/library/ms179503.aspx. You can get to the DAC through SQLCMD using the /A switch. So - assuming I'm now connected through the DAC, I'll try that command again:

C:\Documents and Settings\paul>sqlcmd /A
1> USE DemoCorruptMetadata;
2> GO
Changed database context to 'DemoCorruptMetadata'.
1> SELECT * FROM sys.syscolpars WHERE 1=0;
2> GO
id          number colid       name

xtype utype       length prec scale collationid status      maxinrow xmlns
 dflt        chk         idtval

----------- ------ ----------- -------------------------------------------------
-------------------------------------------------------------------------------
----- ----------- ------ ---- ----- ----------- ----------- -------- -----------
 ----------- ----------- -------------------------------------------------------
-----------

(0 rows affected)
1>

This looks like the table. Now I'll query against it using the object ID from the original corruption message:

1> SELECT colid, name FROM sys.syscolpars WHERE id = 1977058079;
2> GO
colid       name
----------- --------------------------------------------------------------------
------------------------------------------------------------
          1 SalesID
         
2 CustomerID
(2 rows affected)
1>

Cool. So I'll try deleting the orphaned columns:

1> DELETE FROM sys.syscolpars WHERE id = 1977058079;
2> GO
Msg 259, Level 16, State 1, Server ROADRUNNERPR, Line 1
Ad hoc updates to system catalogs are not allowed.
1>

Hmm. And it doesn't help if I set 'allow updates' to 1, or try putting the database into single-user mode.

There IS a way though. You can put the SERVER into single-user mode, then connect with the DAC and you can then update the system tables. This particular twist on using the DAC isn't documented anywhere except in an MSDN forum thread answered by someone from Microsoft (see here).

BEWARE (if I could put little flashing lights around this too then I would...) that this is undocumented and unsupported - misuse will lead to unrepairable corruption of your databases.

The sequence of events to follow is:

  • make a backup of the database just in case something goes wrong
  • shutdown the server
  • go to the binaries directory (e.g. C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Binn) and start the server in single-user mode using 'sqlservr -m'
  • connect back in using SQLCMD /A, and run the deleta again. This time it will work, but will give an error about metadata cache consistency:

C:\Documents and Settings\paul>sqlcmd /A
1> USE DemoCorruptMetadata;
2> GO
Changed database context to 'DemoCorruptMetadata'.
1> DELETE FROM sys.syscolpars WHERE id = 1977058079;
2> GO

(2 rows affected)
Warning: System table ID 41 has been updated directly in database ID 12 and cache coherence may not have been maintained. SQL Server should be restarted.
1>

  • The system table has been updated, but the in-memory cache of metadata is now out-of-sync with the system tables. So, shutdown the server again as the message suggests and restart it normally
  • run CHECKDB again and you'll see the corruption has been fixed.

Hope this helps some of you. Watch this space for the next demo from TechEd of repairing corruption when no backup is available.

(I'm actually on-stage here at TechEd doing the  DAT track pre-con with Kimberly - she's on now until lunch so I'm catching up on forum problems...)

Here's a question that came up on of the SQLServerCentral.com corruption forums I monitor that I think is worth blogging about. To paraphrase:

I have a bunch of corruptions in a database, that look like they've been there for a while. Repair is my only option - it works but I'd like to know what data is being deleted. How can I do that? Here are some of the errors:

Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168576) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168577) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168578) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168579) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168580) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168581) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168582) could not be processed. See other errors for details.
Server: Msg 8976, Level 16, State 1, Line 2
Table error: Object ID 645577338, index ID 1. Page (1:168576) was not seen in the scan although its parent (1:165809) and previous (1:168575) refer to it. Check any previous errors.
Server: Msg 8978, Level 16, State 1, Line 2
Table error: Object ID 645577338, index ID 1. Page (1:168583) is missing a reference from previous page (1:168582). Possible chain linkage problem.

This is a clustered index that CHECKDB  will repair by deleting pages at the leaf-level - essentially deleting a bunch of records. The pages look to be trashed (there were a bunch more errors that I didn't include here that said the page headers were all corrupted - looked like the IO subsystem trashde a whole 64KB chunk of the disk) so there's nothing much else you can do. As the table has a clustered index, you can use the error messages to find the pages on either 'logical' side of the pages being deleted - and hence figure out the range of records that have been deleted.

The errors show that pages 168576 through 168582 in file 1 are corrupt. There are also errors that say the previous page of 168576 is 168575, and the next page of 168582 is 168583. If you do a DBCC PAGE of these two pages, you can find the lower and upper bound of the clustered index key values that have been lost. Think of three ranges:

  • the lower range of records that are intact, logically before the corrupt pages in the index
  • the range of records that will be deleted by repair
  • the upper range of records that are intact, logically after the corrupt pages in the index

To find the upper bound of the lower range:

DBCC TRACEON (3604); -- allows the output to come to the console
DBCC PAGE ('dbname', 1, 168575, 3);
GO

The key value in the slot at the end of output is the upper bound of the bottom range that's intact.

Then do:

DBCC PAGE ('dbname', 1, 168583, 3);
GO

The key value in the slot at the beginning of the output is the lower bound of the upper range that's intact.

Everything in the middle will be deleted. You could also try a DBCC PAGE on the corrupt pages themselves too - you might be able to see some data in them.

I'll be blogging a bunch more about repair after my corruption session this week at TechEd - watch this space!

In my previous post on interpreting CHECKDB output, plus in my DBCC Internals session at TechEd IT Forum yesterday, I mentioned there are some things that CHECKDB can’t repair. In this post I want to go into a bit more detail – based on a post from my old Storage Engine blog.

Before anyone takes this the wrong way - what do I mean by "can't be repaired"? Remember that that purpose of repair is to make the database structurally consistent, and that to do this usually means deleting the corrupt data/structure (that's why the option to do this was aptly named REPAIR_ALLOW_DATA_LOSS – see this post for more explanation on why repair can be bad). A corruption is deemed unrepairable when it doesn't make sense to repair it given the damage the repair would cause, or the corruption is so rare and so complicated to repair correctly that it's not worth the engineering effort to provide a repair. Remember also that recovery from corruptions should be based on a sound backup strategy, not on running repair, so making this trade-off in functionality makes sense.

Here's a few of the more common unrepairable corruptions that people run into along with the reasons they can't be repaired by DBCC.

PFS page header corruption

An example of this is on SQL Server 2005:

Msg 8946, Level 16, State 12, Line 1
Table error: Allocation page (1:13280496) has invalid PFS_PAGE page header values.
Type is 0. Check type, alloc unit ID and page ID on the page.

CHECKDB uses the PFS pages to determine which pages are allocated - and so which pages to read to drive the various consistency checks. The only repair for a PFS page is to reconstruct it - they can't simply be deleted as they're a fixed part of the fabric of the database. PFS pages cannot be rebuilt because there is no infallible way to determine which pages are allocated or not. There are various algorithms I've experimented with to rebuild them, with optimistic or pessimistic setting of page allocation statuses and then re-running the various consistency checks to try to sort out the incorrect choices, but they all require very long run-times. Given the frequency with which these corruptions are seen, and the engineering effort required to come up with an (imperfect) solution, I made the choice to leave this as unrepairable, and I don’t think that will change in future.

Critical system table clustered-index leaf-page corruption

An example of this is on SQL Server 2000:

Server: Msg 8966, Level 16, State 1, Line 1
Could not read and latch page (1:18645) with latch type SH. sysindexes failed.

And on SQL Server 2005:

Msg 7985, Level 16, State 2, Server SUNART, Line 1
System table pre-checks: Object ID 4. Could not read and latch page (1:51) with
latch type SH. Check statement terminated due to unrepairable error.

In a previous post in the series I described why how and why we do special checks of the clustered indexes of the critical system tables. If any of the pages at the leaf-level of these indexes are corrupt, we cannot repair them. Repairing would mean deallocating the page, wiping out the most important metadata for potentially hundreds of user tables and so effectively deleting all of these tables. That's obviously an unpalatable repair for anyone to allow and so CHECKDB doesn't do it.

Column value corruption

Here's an example of this on SQL Server 2005:

Msg 2570, Level 16, State 3, Line 1
Page (1:152), slot 0 in object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type "In-row data"). Column "c1" value is out of range for data type "datetime".  Update column to a legal value.

This is where a column has a stored value that is outside the valid range for the column type. There are a couple of repairs we could do for this:

  1. delete the entire record
  2. insert a dummy value

#1 isn't very palatable because then data is lost and it’s not a structural problem in the database so doesn't have to be repaired. #2 is dangerous - what value should be chosen as the dummy value? Any value put in may adversely affect business logic, or fire a trigger, or have some unwelcome meaning in the context of the table - even a NULL. Given these problems, I chose to allow people to fix the corrupt values themselves.

Metadata corruption

Here's an example of this on SQL Server 2005:

Msg 3854, Level 16, State 1, Line 2
Attribute (referenced_major_id=2089058478) of row (class=0,object_id=2105058535,column_id=0,referenced_major_id=2089058478,referenced_minor_id=0) in sys.sql_dependencies has a matching row (object_id=2089058478) in sys.objects (type=SN) that is invalid.

This example is relatively benign. There are other examples that will cause CHECKDB to terminate - not as bad as the critical system table corruption example above, but enough that CHECKDB doesn't trust the metadata enough to use it to drive consistency checks. Repairing metadata corruption has the same problems as repairing critical system table corruption - any repair means deleting metadata about one or more tables, and hence deleting the tables themselves. It's far better to leave the corruption unrepaired so that as much data as possible can be extracted from the remaining tables.

Summary

Repair can't fix everything. You may end up having to perform manual and time-consuming data extraction from the corrupt database and losing lots of data because of, say, a critical system table corruption. Bottom line (as usual) - make sure you have valid backups so you don't get into this state!

Last week at SQL Connections someone said that CHECKDB’s output is ‘useless’.  Given that I wrote CHECKDB for SQL Server 2005 I was only mildly offended :-) But there’s a lot of truth in that statement – the error messages from CHECKDB do tell you exactly what’s corrupt in the database but they’re not exactly easy to read and understand unless you’re intimately familiar with the on-disk structures used to store a SQL Server database. It’s pretty easy to understand the output if there’s only one or two error messages, but once you get past 10 or so it becomes extremely difficult to figure out what the actual problem is for normal people (i.e. not CHECKDB geeks like me, Ryan or a handful of senior people in Product Support)

So, how are you supposed to figure out what’s wrong? In this post I want to run through a few tips and tricks you can use.

  • If the recommended repair level (at the very bottom of the output) is REPAIR_REBUILD then only non-clustered indexes are damaged. This meanscyou don’t need to run repair (which needs the database in single-user mode) or restore from backups. You can just manually rebuild the damaged indexes to fix the problem, possibly even online if you're running Enterprise Edition and the index doesn't have any LOB columns.
  • There are a couple of cases where even if it’s only non-clustered indexes that are damaged, REPAIR_ALLOW_DATA_LOSS may still be the recommended repair option. In this case, check the index ID in all the error messages – if all index IDs are 2 or higher, then it’s only non-clustered indexes that are damaged and you can rebuild them yourself instead of running repair/restore.
  • Figure out a list of tables that are affected by looking at the summary messages about how many corruptions were found per table. If each of the tables is isolated on a separate filegroup then these filegroups can be taken offline and restored individually, possibly without interrupting the application workload. If there's no backup, then you can run DBCC CHECKTABLE with repair, for a faster repair operation than running a full CHECKDB.
  • Figure out the list of damaged pages. There are a number of error messages that could point to a damaged page – usually 8928 or 2537 – and then you can choose to do single-page restores from your backups to minimize downtime, possibly even online in Enterprise Edition.
  • Are there any errors that CHECKDB can’t repair?  If so, you have no choice but to restore from backups or to extract as much info as possible into a new database. Things to look for that mean repair won’t be able to fix everything are:
    • CHECKDB stops early and complains about system table pre-checks failing (errors 7984 – 7988 inclusive)
    • CHECKDB reports any metadata corruption (8992, 8995 errors)
    • CHECKDB reports any errors on PFS page headers (8939, 8946 errors with a possible 8998 error as well)

While I was at Microsoft I wrote two 200-page+ documents detailing all the CHECKDB error messages for SQL Server 2000 and 2005 – the team says they should all be translated into Books Online entries by next summer – a bunch of them are there already. In the meantime, I hope the hints above will take a little of the mystery out of CHECKDB’s output.

This is a follow-on article from two posts:

People get themselves into situations where they have no backups (or damaged backups) and the data or log files are damaged such that the only way to access the database is with EMERGENCY mode. In these situations, prior to SQL Server 2005, there was no documented or supported way to fix a database while in EMERGENCY mode - the only guidance could be found on the Internet or from calling Product Support and paying for help. The sequence of events was:

  1. Hack the system tables to get the database into 'emergency' mode.
  2. Use the undocumented and unsupported DBCC REBUILD_LOG command to build a new transaction log.
  3. Run DBCC CHECKDB with the REPAIR_ALLOW_DATA_LOSS option to fix up corruptions in the data files - both those that may have caused the issue, and those caused by rebuilding the transaction log (e.g. because an active transaction altering the database structure was lost).
  4. Figure out what data was lost or is transactionally inconsistent (e.g. because a transaction altering multiple tables was lost) as far as your business logic is concerned
  5. Take the database out of emergency mode
  6. And then all the other stuff like root-cause analysis and getting a better backup strategy

I decided to add a new feature to SQL Server 2005 called EMERGENCY mode repair that will do steps 2 and 3 as an atomic operation. The reasons for this were:

  • Much of the advice of how to do this on the Internet missed steps out (particularly missing step 3!)
  • The DBCC REBUILD_LOG command was unsupported and undocumented and we didn't like advising customers to use it
  • Adding a documented last-resort method of recovering from this situation would reduce calls to Product Support - saving time and money for customers and Microsoft.

So, when in EMERGENCY mode, you can use DBCC CHECKDB to bring the database back online again. The only repair option allowed in EMERGENCY mode is REPAIR_ALLOW_DATA_LOSS and it does a lot more than usual:

  • Forces recovery to run on the transaction log (if it exists). You can think of this as 'recovery with CONTINUE_AFTER_ERROR' - see this post for more details on the real CONTINUE_AFTER_ERROR option for BACKUP and RESTORE. The idea behind this is that the database is already inconsistent because either the transaction log is corrupt or something in the database is corrupt in such a way that recovery cannot complete. So, given that the database is inconsistent and we're about to rebuild the transaction log, it makes sense to salvage as much transactional information as possible from the log before we throw it away and build a new one.
  • Rebuild the transaction log - but only if the transaction log is corrupt.
  • Run DBCC CHECKDB with the REPAIR_ALLOW_DATA_LOSS option.
  • Set the database state to ONLINE.

It's a one-way operation and can't be rolled back. I always advise taking a copy of the database files before doing this in case something goes wrong or there are unrepairable errors. And if it does? Probably time to update your resume for not having a water-tight backup and disaster-recovery strategies in place. Saying that, I've never seen it fail. I can think of some pathalogical cases where it would fail though (involving the file system itself having problems) but that's really unlikely.

Let's walk-through an example of using it. I'm assuming there's a database called emergencydemo that's in the same state as at the end of the Search Engine Q&A #4 blog post - the database has no log file, is in EMERGENCY mode and the salaries table is corrupt.

First off I'll try bringing the database online, just to see what happens:

ALTER DATABASE emergencydemo SET ONLINE;
GO

File activation failure. The physical file name "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\emergencydemo_log.LDF" may be incorrect.
The log cannot be rebuilt because the database was not cleanly shut down.
Msg 945, Level 14, State 2, Line 1
Database 'emergencydemo' cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details.
Failed to restart the current database. The current database is switched to master.
Msg 5069, Level 16, State 1, Line 1
ALTER DATABASE statement failed.

The first message makes sense - the database knows it needs to be recovered because it wasn't cleanly shut down, but the log file simply isn't there. The second message is from the new feature in 2005 that will automatically create a log file if one is missing on startup or attach - as long as the database was cleanly shut down. The 945 and 5069 errors are self-explanatory but notice that the database gets switched to master underneath us. This has bitten me several times in the past.

Well, I expected that not to work. Let's run emergency-mode repair:

DBCC CHECKDB (emergencydemo, REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS;
GO

Msg 945, Level 14, State 2, Line 1
Database 'emergencydemo' cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details.

Hmm - looks like the failed ALTER DATABASE statement did change the state - but what to?

SELECT state_desc FROM sys.databases WHERE name='emergencydemo';
GO

state_desc
------------------------------------------------------------
RECOVERY_PENDING

That makes sense I guess. Ok - back to emergency mode and run repair:

ALTER DATABASE emergencydemo SET EMERGENCY;
GO
DBCC CHECKDB (emergencydemo, REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS;
GO

Msg 7919, Level 16, State 3, Line 1
Repair statement not processed. Database needs to be in single user mode.

EMERGENCY mode is not SINGLE_USER mode - a database needs to be in SINGLE_USER mode for repair to run and EMERGENCY mode allows multiple connections from members of the sysadmin role. You can set SINGLE_USER mode as well as EMERGENCY mode - however, the sys.databases field state_desc will still just say EMERGENCY.

ALTER DATABASE emergencydemo SET SINGLE_USER;
GO
DBCC CHECKDB (emergencydemo, REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS;
GO

File activation failure. The physical file name "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\emergencydemo_log.LDF" may be incorrect.
The log cannot be rebuilt because the database was not cleanly shut down.
Warning: The log for database 'emergencydemo' has been rebuilt. Transactional consistency has been lost. The RESTORE chain was broken, and the server no longer has context on the previous log files, so you will need to know what they were. You should run DBCC CHECKDB to validate physical consistency. The database has been put in dbo-only mode. When you are ready to make the database available for use, you will need to reset database options and delete any extra log files.

This time it worked. First of all we get the same error as if we tried to bring the database online - that's from the code that's trying to run 'recovery with CONTINUE_AFTER_ERROR' on the transaction log. Next we get a nice long warning that the transaction log has been rebuilt and the consequences of doing that (basically that you need to start a new log backup chain by taking a full backup). If there had been any corruptions we'd have seen the usual output from DBCC CHECKDB about what errors it found and fixed. There's also a bunch of stuff in the error log

2007-10-02 17:21:20.95 spid51      Starting up database 'emergencydemo'.
2007-10-02 17:21:20.96 spid51      Error: 17207, Severity: 16, State: 1.
2007-10-02 17:21:20.96 spid51      FileMgr::StartLogFiles: Operating system error 2(The system cannot find the file specified.) occurred while creating or opening file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\emergencydemo_log.LDF'. Diagnose and correct the operating system error, and retry the operation.
2007-10-02 17:21:20.96 spid51      Starting up database 'emergencydemo'.
2007-10-02 17:21:21.10 spid51      Starting up database 'emergencydemo'.
2007-10-02 17:21:21.18 spid51      Warning: The log for database 'emergencydemo' has been rebuilt. Transactional consistency has been lost. The RESTORE chain was broken, and the server no longer has context on the previous log files, so you will need to know what they were. You should run DBCC CHECKDB to validate physical consistency. The database has been put in dbo-only mode. When you are ready to make the database available for use, you will need to reset database options and delete any extra log files.
2007-10-02 17:21:21.18 spid51      Warning: The log for database 'emergencydemo' has been rebuilt. Transactional consistency has been lost. The RESTORE chain was broken, and the server no longer has context on the previous log files, so you will need to know what they were. You should run DBCC CHECKDB to validate physical consistency. The database has been put in dbo-only mode. When you are ready to make the database available for use, you will need to reset database options and delete any extra log files.
2007-10-02 17:21:21.99 spid51      EMERGENCY MODE DBCC CHECKDB (emergencydemo, repair_allow_data_loss) WITH no_infomsgs executed by ROADRUNNERPR\paul found 0 errors and repaired 0 errors. Elapsed time: 0 hours 0 minutes 1 seconds.

Note that the usual error log entry from running DBCC CHECKDB is preceded by 'EMERGENCY MODE' this time.

Checking the database state:

SELECT state_desc FROM sys.databases WHERE name='emergencydemo';
GO

state_desc
------------------------------------------------------------
ONLINE

we find that it's been brought back online again because everything worked. It's still SINGLE_USER though so let's make it MULTI_USER and see what happened to our table:

ALTER DATABASE emergencydemo SET MULTI_USER;
GO
USE EMERGENCYDEMO;
GO
SELECT * FROM salaries;
GO

FirstName            LastName             Salary
-------------------- -------------------- -----------
John                 Williamson           10000
Stephen              Brown                0
Jack                 Bauer                10000

(3 row(s) affected)

And of course its still corrupt - because even though the transaction log was rebuilt and repaired, the original transaction that changed the salary to 0 never got a chance to rollback becuase I deleted the transaction log (in the previous post).

Now remember, you should only use this as a last resort, but if you do get yourself into trouble, you know there's a command that should be able to help you.

These next few posts are based off part of my Secrets of Fast Detection and Recovery from Database Corruptions session from TechEd and various user groups around the world (see here for a video recording from TechEd). I'll also be doing this session at ITForum in Barcelona in November, and as a live webcast for Microsoft sometime over the next month or so (I'll publicize the date nearer the time). I did a few posts on these subjects last year but now I want to reorder them, add a post about using emergency mode and walk you through some emergency mode demo scripts.

In this post I want to describe the two worst things I think you can do to your database - rebuilding your transaction log and running REPAIR_ALLOW_DATA_LOSS - that people often try doing instead of restoring from their backups.

Rebuilding your transaction log

It's pretty well known that in SQL Server 2000 (and before) there's an undocumented and unsupported command called DBCC REBUILD_LOG. What this does is pretty simple - it deletes the transaction log file(s) and creates a new one. It completely disregards any uncommitted transactions that may exist - it just deletes them. This means that these uncommitted transactions don't get a chance to roll back.

What does this mean? Well, in the best case, the only in-flight transactions were altering user data - so your business logic, inherent and constraint-enforced relationships between tables, and the basic logical integrity of your user data are all broken. In the worst case, the in-flight transactions were altering the structure of the database (e.g. doing a page split) so that fact that they didn't get achance to rollback means the databse may be structurally corrupt!

Here's are two examples (somewhat contrived) that illustrate the possible consequences of rebuilding a transaction log.

  1. Logical data loss.

    Imagine you're at an ATM transferring $1000 from your checking account to your savings account. On the SQL Server in the bank's datacenter, the transaction happens in two parts - update the record in the checking accounts table with the balance minus $1000 and then update the record in the savings account table with the balance plus $1000. The process is half-way through - $1000 has been debited from your checking account, but not yet credited to your savings account, when disaster strikes! A work crew outside the datacenter accidentally cuts the power and the machine hosting SQL Server powers down. The ATM displays one of those infuriating "We're sorry, our computers are unavailable at present" and you walk away grumbling but think nothing more about it.

    Meanwhile, the power's been restored to the datacenter and SQL Server is going through crash recovery. The partially completed transaction on your account should rollback and credit back the $1000 to your checking account. But the new DBA at the bank decides that its taking too long for the system to come back up. He delets the transaction log and rebuilds it to get the system up faster. Unfortunately, the portion of the transaction log that had not had a chance to recover included the transaction involving your bank account. Even more unfortunately, a checkpoint occured right before the power loss, and the database page containing the updated checking account balance was flushed to disk. Now, when the transaction log is deleted and rebuilt, your transaction can't rollback - because it's simply gone. So the $1000 debit from your checking account is not rolled back - you've lost $1000!!

  2. Physical database corruption.

Imagine an insert into a table with a single non-clustered index. Under the covers, the insert happens in two parts - insert the record into the table and then insert the corresponding non-clustered index record. Imagine a similar disaster recovery situation as I described above occuring after the table insert has been done but not the non-clustered index insert. If the log is rebuilt, then the table and its index are not in sync - actual phsyical corruption!

Whenever a transaction log is rebuilt, a message is output to the SQL Server error log and the Windows event log. In SQL Server 2005, the message is:

2007-09-14 15:50:48.82 spid52      Warning: The log for database 'test' has been rebuilt. Transactional consistency has been lost. The RESTORE chain was broken, and the server no longer has context on the previous log files, so you will need to know what they were. You should run DBCC CHECKDB to validate physical consistency. The database has been put in dbo-only mode. When you are ready to make the database available for use, you will need to reset database options and delete any extra log files.

In SQL Server 2005, the old DBCC command has been removed and replaced with undocumented (and still unsupported) syntax. However, there is now a fully documented and supported way to do this - which I'll get to in a couple of posts.

Sometimes there's no alternative to rebuilding a transaction log - when the log is physically damaged and there's no backup. In this case, Product Support will walk you through the process of correctly rebuilding the transaction log, running repair to fix up any corruptions, and ensuring you know that the logical integrity of your data could be broken. Unfortunately, all too often I see people simply rebuilding the transaction log and continuing with regular operations - no checks, no repairs, no root-cause analysis.

REPAIR_ALLOW_DATA_LOSS

The vast majority of the time, REPAIR_ALLOW_DATA_LOSS is the repair level that CHECKDB recommends when it finds corruptions. This is because fixing nearly anything that's not a minor non-clustered index issue requires deleting something to repair it. So, REPAIR_ALLOW_DATA_LOSS will delete things. This means it will probably delete some of your data as well. If, for instance it finds a corrupt record on a data page, it may end up having to delete the entire data page, including all the other records on the page, to fix the corruption. That could be a lot of data. For this reason, the repair level name was carefully chosen. You can't type in REPAIR_ALLOW_DATA_LOSS without realizing that you're probably going to lose some data as part of the operation.

I've been asked why this is. The purpose of repair is not to save user data. The purpose of repair is to make the database structurally consistent as fast as possible (to limit downtime) and correctly (to avoid making things worse). This means that repairs have to be engineered to be fast and reliable operations that will work in every circumstance. The simple way to do this is to delete what's broken and fix up everything that linked to (or was linked from) the thing being deleted - whether a record or page. Trying to do anything more complicated would increase the chances of the repair not working, or even making things worse.

The ramifications of this are that running REPAIR_ALLOW_DATA_LOSS can lead to the same effect on your data as rebuilding a transaction log with in-flight transactions altering user data - your business logic, inherent and constraint-enforced relationships between tables, and the basic logical integrity of your user data could all be broken. BUT, the database is now structurally consistent and SQL Server can run on it without fear of hitting a corruption that could cause a crash.

To continue the contrived example from above, imagine your bank checking and savings accounts just happen to be stored on the same data page in the bank's SQL Server database. The new DBA doesn't realize that backups are necessary for disaster recovery and data preservation and so isn't taking any. Disaster strikes again in the form of the work crew outside the datacenter accidentally cutting the power and the machine hosting SQL Server powers down. This time, one of the drives has a problem while powering down and a page write doesn't complete - causing a torn page. Unfortunately, it's the page holding your bank accounts. As the DBA doesn't have any backups, the only alternative to fix the torn-page is to run REPAIR_ALLOW_DATA_LOSS. For this error, it will delete the page, and does so. In the process, everything else on the page is also lost, including your bank accounts!!

Summary

So, you can see how these two operations are really very, very bad things to do to a database and can cause havoc with your data. And yet people still have to  use these operations because they don't have valid backups...

In the next post I'll introduce EMERGENCY mode and how to use it.

Now that I have a little more time on my hands I've been jumping back into some of the online forums. Last summer I posted on a few bits of bad advice I've seen in the forums but yesterday I was stunned by some of the terrible advice I saw being given out. So, I'd like to post a new and longer collection of some of the bad advice I've seen over the last couple of years (and yesterday!) with some reasoning and better advice.

Run CHECKALLOC then CHECKDB then CHECKTABLES...

There's a common misconception around what CHECKDB actually does. According to the Books Online entry I wrote for SQL Server 2005 (see http://msdn2.microsoft.com/en-us/library/ms176064.aspx), it does the following:

  • Runs DBCC CHECKALLOC on the database.
  • Runs DBCC CHECKTABLE on every table and view in the database.
  • Runs DBCC CHECKCATALOG on the database.
  • Validates the contents of every indexed view in the database.
  • Validates the Service Broker data in the database.

This means that the DBCC CHECKALLOC, DBCC CHECKTABLE, or DBCC CHECKCATALOG commands do not have to be run separately from DBCC CHECKDB in SQL Server 2005. Note that in SQL Server 2000, CHECKCATALOG wasn't part of CHECKDB.

Run CHECKALLOC to find out what's wrong...

Scenario: Any kind of corruption, but commonly advised for 823/824 errors. CHECKALLOC only checks the allocation bitmaps so it won't discover corruptions in the vast majority of the database. You should always run a DBCC CHECKDB (with the PHYSICAL_ONLY option if need be to save time/resources) to find out the full extent of corruption in the database.

Just restore from your backups and carry on...

Scenario: corruption. The theory is that restoring from your backups fixes the problem and you'll be fine. In reality this only fixes the symptom of the problem - the real problem is what caused the corruption in the first place. Its perfectly ok to restore from a backup to get the database back up and running again quickly, but you have to make sure that you investigate the cause of the corruption and take steps to make sure it won't happen again - root-cause analysis.

Just run repair...

Most of the time running repair means losing data. Running repair should only really be done when you don't have any valid backups. I can also see a case where you've got corruption and your maximum allowable downtime doesn't allow you to restore from your backups because your backup strategy doesn't allow for quick, fine-grained restores. In that case you may be forced to run repair (and then fix your backup strategy), but really it should always be your last resort as you'll most likely lose data.

Also, if you do end up having to run repair, make a backup of your database first - just in case something goes wrong. You can also wrap the repair statement in an explicit transaction so you can rollback the whole repair operation if you think its losing too much data - but then you're out of options. Once you're done, don't forget to do root-cause analysis too.

Only restore from a backup if repair doesn't work...

This is one I saw yesterday and I couldn't believe it. The advice was to run all the various types of repair and only resort to restoring from a backup if the repairs didn't work. I really don't get this one - why bother taking backups at all if you're always going to exhaust repair options first? Repair should be your last option, not your first.

Just rebuild your transaction log...

Scenario: log corruption. Rebuilding a transaction log is almost guaranteed to lose data and cause corruption so you need to be in really dire need to do this. Again, the solution here should be to restore from your backups. If you don't have backups then you're going to have to do this. For anything before SQL Server 2005, you should contact Product Support to have them walk you through the correct set of steps to do this. For SQL Server 2005, you should use Emergency Mode Repair. I'll discuss this in the next post.

Unbelievably, I've seen this recommended when recovery is taking too long - shut the server down, delete the transaction log and then rebuild it. Yikes!

Try running the other repair options before REPAIR_ALLOW_DATA_LOSS...

Scenario: CHECKDB says to use REPAIR_ALLOW_DATA_LOSS but would like to avoid it. At the bottom of CHECKDB's output is the minimum repair option needed to fix all the corruptions that CHECKDB found. If it says you need to use REPAIR_ALLOW_DATA_LOSS, then that's the only option that will fix all the errors. There's no point trying REPAIR_FAST (which I only left in SQL Server 2005 for backwards compatibility - it does nothing) or REPAIR_REBUILD. What you really should do is restore from your backups, but if you don't have any then you're going to have to bite the bullet and run REPAIR_ALLOW_DATA_LOSS.

Drop all the indexes and create them all again...

Scenario: corruption in non-clustered indexes. The theory is that dropping and recreating the indexes will fix the corruption. Sure - it'll probably get rid of the corruption for a while at least (until whatever caused the corruption happens again). The problem here is that if any of the indexes are enforcing constraints, dropping them means that something could happen that breaks the constraint (e.g. someone inserting a duplicate value). This means that you won't be able to recreate the constraint-enforcing index again.

At the very least try simply rebuilding the indexes. If that doesn't work, you may need to resort to rebuilding them using CHECKDB and the REPAIR_REBUILD option. Regular index rebuilds can read the data from the index using a logical-order scan. However, if the index b-tree structure is corrupted then that won't work. CHECKDB always forces the query processor to ask for an allocation-order scan to avoid corruption problems.

Detach then re-attach your database...

Scenario: suspect database. The theory is that recovery will run again and somehow work the next time and fix whatever caused the database to go suspect in the first place. Here's the bad part - once you detach a suspect database, it's almost inevitable that the attach process will fail because of the original problem. This means you're then stuck with a detached database and you need to resort to hacks to get the database attached again.

The database either went suspect because:

  1. the transaction log ran out of space (either the drive ran out of space or the log was not set to autogrow)
  2. a page was corrupted that was required for transaction rollback/recovery
  3. an internal system operation (e.g. allocating a page) came across a corrupt page (e.g. an allocation bitmap)

None of these things can be fixed by a simple detach/attach. In the first case, you need to give the log more space. Either grow the log file and bring the database online, or detach the database and move it to a new location with more space. See http://support.microsoft.com/kb/224071/ for how to move databases around using detach/attach.

The last two cases require restoration from backups or running some form of repair - no amount of rerunning recovery will fix a corruption.

Options for creating a corrupt database...

Scenario: people want to test their disaster recovery plans and so need to know how to create a corrupt database. The best way to do this is to use an already corrupt database - see my recent post that provides a corrupt 2000 and 2005 database as well as some things to try with them. One of the most common suggestions I've seen for corrupting a SQL Server 2000 database is to manually delete something from the sysindexes or sysobjects tables. I've also described how to do it using a hex editor:

Shutdown the database so the data files aren't locked (don't detach the database because if you corrupt the 'wrong' page you may not be able to attach it again). Pick an offset more than, say, 100 pages into the file (at least 819200 bytes) but make sure its aligned on an 8192 byte boundary (a page boundary). This avoids critical metadata pages and allocation bitmaps so that you'll be able to start the database and run DBCC CHECKDB on it. Write a few bytes of zeroes into the file at the chosen offset and you're almost guaranteed some page header corruption errors.

However, the very worst piece of advice I've ever seen on the Internet was another method that I don't recommend. The advice was to go into the data-center (scary as this shows the poster works for a fairly large company), go up to one of the hard-drives and flick the power switch on and off a few times. Wow! Not only will that cause corruptions but it will also fry the hard-drive...

Summary

The bottom-line is that you need to be very careful when following anyone's advice on the Internet. Many people out there DO know exactly what they're talking about, but many also do not and may help you get more deeply into trouble.

Theme design by Nukeation based on Jelle Druyts