Here's another corruption bug that people are hitting on 2005 SP2 - something I didn't know about until today.

The situation is this: a table with a non-unique clustered index (i.e. so a hidden uniquifier column is created), and then rebuilt using DBCC DBREINDEX. Sometimes an incorrect query plan is chosen so the rows in the index get sorted incorrectly. A subsequent DBCC CHECKDB (or DBCC CHECKTABLE) will return errors like:

Msg 2511, Level 16, State 1, Line 1
Table error: Object ID 2073058421, index ID 1, partition ID 72057594044678144, alloc unit ID 72057594048348160 (type In-row data). Keys out of order on page (1:16042), slots 47 and 48.

The workaround is to use the new ALTER INDEX command in 2005 to do the rebuild (remember that DBCC DBREINDEX is deprecated). There's also a hotfix available in CU9 (and the upcoming SP3 I guess). Checkout KB 954734 here for details.

Hopefully you won't hit this, but if you do, knowing this may save you a lot of head-scratching.

PS Last post from our 22-day road-trip. Tonight's a date-night in Las Vegas (dinner at Le Cirque, then tickets to Cirque du Soleil's "O") before heading back to our (vacation!) home in Redmond and PASS next week.

(Posted with permission of the dev team)

Here's an interesting bug that people are hitting. I found out about this while here in Barcelona at TechEd when I got roped into a discussion with a couple of Microsoft colleagues. Their customer was seeing errors like the following:

Msg 8914, Level 16, State 1, Line 1
Incorrect PFS free space information for page (1:35244) in object ID 1683128146, index ID 1, partition ID 223091033422352, alloc unit ID 81405523118118176 (type LOB data). Expected value 0_PCT_FULL, actual value 100_PCT_FULL

This error says that the PFS page (see this blog post) has the wrong free-space tracking bits for a text page. In SQL Server 2000, the algorithm to keep track of free space in the PFS pages wasn't perfect so CHECKDB never reported these errors - it would silently fix them. In 2005 we fixed the algorithm (supposedly), so I turned on the reporting of these errors again in CHECKDB.

When I first saw the description from the customer, my first reaction was that it was an I/O subsystem problem causing corruption, but the customer has page checksums turned on, so a corruption would result in an 824 error before an 8914 error. My conclusion then was that there's a bug in the new free-space tracking algorithm. After checking with the dev team, it turns out my suspicions were correct - there is a bug in 2005 SP2. It's fixed in 2005 SP3 and in 2008, but you may see these 8914 errors if you're not running one of those.

Here are the technical details of the problem (slightly edited from the dev team explanation):

The issue was that when minimal logging for LOBs was used (under the SIMPLE recovery model, during BULK INSERT/BCP/Large insert with TABLOCK), extents are being pre-allocated and the pages were being marked 100% full in the PFS page when the pages were allocated. The idea was that all pages eventually will be filled up with LOB data, and by marking them full during allocation we avoid an extra update to the PFS pages when the data is actually put on the page.

Suppose now that 64 pages are pre-allocated, and only 40 or so pages are used and have rows on them. When the transaction commits, the Storage Engine is supposed to deallocate the 24 extra pages that were pre-allocated, and marked 100% full, even though they don’t have any rows on them yet. There was a bug where in a certain case the deallocation wouldn't happen, so you end up with empty pages that have a PFS state of 100%, but don’t have any rows on them.

In general, even though this is a bug in the code, functionality wise, there is nothing wrong with the database, besides a number of additional pages that are empty and allocated to the LOB tree, so as long as the message tells you that the page is supposed to be empty, but is actually marked 100%, nothing can really go wrong with that page. Unfortunately, DBCC will keep reporting these errors.

Although I haven't tried this, my guess is that you can get rid of the empty LOB pages using ALTER INDEX ... REORGANIZE WITH (LOB_COMPACTION = ON). So, if you see some of these errors with no other errors, you may have hit this bug and have nothing really to worry about.

(Quickie post #3 while it's Kimberly's turn to lecture this morning - better pay attention now before I get into trouble)

Gail asked a (paraphrased) question about the trick to hacking a detached suspect database into the server again - will it work for a detached database with multiple data files in the primary filegroup where one of the secondary data files has a corrupt file header page?

Well, based on my experiences and investigations of file header corruption (see my previous post here) my initial reaction was to say "it will attach but you won't be able to access the database". However, I'd like to try it first - so I'm trying it while writing this post. I've taken the DemoSuspect and added a file, then detached the database and corrupted the first page in the ndf file using my trusty hex editor. Trying to attach the database again gives:

EXEC sp_attach_db @dbname = N'DemoSuspect', 
   
@filename1 = N'c:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect.mdf', 
   
@filename2 = N'c:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect_log.ldf',
   
@filename3 = N'c:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect.ndf';
GO

Msg 5172, Level 16, State 15, Line 1
The header for file 'c:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect.ndf' is not a valid database file header. The PageAudit property is incorrect.

Ok - just what I expected. Now I'll copy off the files, create the dummy database, shutdown the server, swap in the corrupt files, and restart the server. So what happens?

USE DemoSuspect;
GO

Msg 945, Level 14, State 2, Line 1
Database 'DemoSuspect' cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details.

SELECT DATABASEPROPERTYEX ('DemoSuspect', 'STATUS') AS Status;
GO

Status
--------
SUSPECT

ALTER DATABASE DemoSuspect SET EMERGENCY;
GO

Msg 5172, Level 16, State 15, Line 1
The header for file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect.ndf' is not a valid database file header. The PageAudit property is incorrect.
Msg 5120, Level 16, State 9, Line 1
Unable to open the physical file "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect.ndf". Operating system error 0: "(null)".

Well, the database gets hacked back into the server, but the corrupt secondary file header trumps everything else - the whole database is unusable because that data file in the primary filegroup is essentially offline. This is because an offline (or inaccessible) file means the filegroup it is part of is offline (or inaccessible) - and an offline primary filegroup means the whole database is offline.

If the secondary file is in a non-primary filegroup, then that filegroup can be manually set offline and the rest of the database is available, in Enterprise Edition - partial database availability.

So, the hack method will work, but whether the database is accessible or not depends on how the database is corrupt in the first place.

(Quickie post #1 while it's Kimberly's turn to lecture this morning...)

I had a comment on my post dealing with suspect databases, saying that the 2008 behavior is much better. I hadn't tried this so it was (really good) news to me.

So, trying the same steps from that blog post on 2008 - everything's the same up to the point where the server has been restarted and the database is suspect. Now, if I try to detach the corrupt database, I get the following:

EXEC sp_detach_db 'DemoSuspect';
GO

Msg 3707, Level 16, State 2, Line 1
Cannot detach a suspect or recovery pending database. It must be repaired or dropped.

VERY COOL!!!! This should *really* cut down the number of times people accidentally get themselves into trouble by detaching the database.

For the sake of completeness, if I really want to detach the suspect database, I can do it in 2008 if I put the database into EMERGENCY mode first (doesn't mean I would though!)

This is one I've been trying to get to since I started blogging a couple of years ago: how to re-attach a detached suspect database. This is a pretty common scenario I see on the forums - a database goes suspect so the DBA tries to detach/attach, which fails. I wrote a demo for my corruption session at TechEd this year that shows how to create a suspect database with hex editor, then detaches it and shows how to re-attach and fix it. It's going to be a long blog post, but I've got everything wrapped up before we fly out to the UK tomorrow so I've got a bit of spare time.

Creating a Suspect Database

First off I'm going to create a simple database to use, called DemoSuspect with a table and some random data.

USE MASTER
GO

CREATE DATABASE DemoSuspect
GO

USE DemoSuspect;
GO

CREATE TABLE Employees (FirstName VARCHAR (20), LastName VARCHAR (20), YearlyBonus INT);
GO
INSERT INTO Employees VALUES ('Paul', 'Randal', 10000);
INSERT INTO Employees VALUES ('Kimberly', 'Tripp', 10000);
GO

Now I'll perform an update in an explicit transaction and force it to be written out to disk with a CHECKPOINT. I've accidentally deleted Kimberly's bonus!

-- Simulate an in-flight transaction
BEGIN TRAN;
UPDATE Employees SET YearlyBonus = 0 WHERE LastName = 'Tripp';
GO

CHECKPOINT;
GO

Then in another window, I'll simulate a crash using:

SHUTDOWN WITH NOWAIT;
GO

Now that SQL Server is shutdown, I'm going to simulate an I/O failure that corrupts the log file. I'm going to use a hex editor to do this - my editor of choice is the immensely popular and useful XVI32, written by Christian Maas. I opened the log file, filled the first section with zeroes, and then saved it again. See the screenshot below.

When I startup SQL Server again, it will try to run recovery on the DemoSuspect database and fail. This will put the database into the SUSPECT state.

So I restarted SQL Server, let's try getting in to the DemoSuspect database.

USE DemoSuspect;
GO

Msg 945, Level 14, State 2, Line 1
Database 'DemoSuspect' cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details.

Now let's check the database status:

SELECT DATABASEPROPERTYEX ('DemoSuspect', 'STATUS') AS 'Status';
GO

Status
--------

SUSPECT

Now at this point, the correct procedure is to restore from backups. If there are no backups available, then the next best thing is to get the database into EMERGENCY mode and extract as much data as possible, or run emergency-mode repair. However, I'm going to try the detach/attach route instead.

Detaching the Database

I'll try detaching the database using sp_detach_db:

EXEC sp_detach_db 'DemoSuspect';
GO

Msg 947, Level 16, State 1, Line 1
Error while closing database 'DemoSuspect'. Check for previous additional errors and retry the operation.

Hmm - did it work or didn't it?

SELECT * FROM sys.databases WHERE NAME = 'DemoSuspect';
GO

And there are no results, so the detach must have succeeded.

Re-attaching the Database

Let's try the obvious sp_attach_db:

EXEC sp_attach_db @dbname = N'DemoSuspect', 
   
@filename1 = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect.mdf', 
   
@filename2 = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect_LOG.ldf';
GO

Msg 5172, Level 16, State 15, Line 1
The header for file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect_LOG.ldf' is not a valid database file header. The PageAudit property is incorrect.

Hmm. How about using the ATTACH_REBUILD_LOG option on CREATE DATABASE? That should create a new log file for me:

CREATE DATABASE DemoSuspect ON
   (NAME = DemoSuspect, FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect.mdf')
FOR ATTACH_REBUILD_LOG;
GO

Msg 5172, Level 16, State 15, Line 1
The header for file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect_LOG.ldf' is not a valid database file header. The PageAudit property is incorrect.
File activation failure. The physical file name "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect_LOG.ldf" may be incorrect.
The log cannot be rebuilt because the database was not cleanly shut down.
Msg 1813, Level 16, State 2, Line 1
Could not open new database 'DemoSuspect'. CREATE DATABASE is aborted.

Hmm. The database knows that there was an active transaction. Using the ATTACH_REBUILD_LOG command only works if the database was cleanly shut down and the log is missing. What about if I actually remove the log file and try that - maybe I could fool it? I took a copy of the data and log file, and then deleted the original log file (this is getting more and more dodgy...). Let's try again:

CREATE DATABASE DemoSuspect ON
   
(NAME = DemoSuspect, FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect.mdf')
FOR ATTACH_REBUILD_LOG;
GO

File activation failure. The physical file name "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect_LOG.ldf" may be incorrect.
The log cannot be rebuilt because the database was not cleanly shut down.
Msg 1813, Level 16, State 2, Line 1
Could not open new database 'DemoSuspect'. CREATE DATABASE is aborted.

Well, that was a long-shot but SQL Server's smarter.

Basically the problem is that the database wasn't cleanly shutdown, which means that recovery HAS to run and complete before the database can be attached again. Given that our log file is corrupt, that's impossible. So, never detach a suspect database.

The only way to get the database back into SQL Server is to use a hack. I'm going to create a new dummy database with the exact same file layout and as close as possible to the file sizes of the detached database. Then I'm going to shutdown SQL Server, swap in the corrupt database files, and re-start SQL Server. If all goes well, the corrupt, suspect database will be attached again.

The one major downside of this is that if the SQL Server instance doesn't have instant initialization enabled (see How to tell if you have instant initialization enabled?), then creating the dummy database could take a long time if the data files are very big. This means that your application is offline while the files are created and zero'd out.

So, I've already got a copy of the corrupt database files, so now I need to delete the data file too. Before doing this you want to make absolutely sure you've got multiple copies of the corrupt database files... just in case. After deleting the data file, I can create my dummy database.

CREATE DATABASE DemoSuspect
GO

If you forget to delete the existing corrupt files first, you'll get the following error:

Msg 1802, Level 16, State 4, Line 1
CREATE DATABASE failed. Some file names listed could not be created. Check related errors.
Msg 5170, Level 16, State 1, Line 1
Cannot create file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\DemoSuspect.mdf' because it already exists. Change the file path or the file name, and retry the operation.

Ok - delete the data file and try again. Now I need to check the dummy database's files are there (at this point I'm totally paranoid), shutdown SQL Server, delete the dummy databases' files (this just seems wrong!), and swap in the corrupt files. I took another copy of the corrupt files before swapping them back in, just in case something goes wrong.

After restarting SQL Server, I can check the database state:

SELECT DATABASEPROPERTYEX ('DemoSuspect', 'STATUS') AS 'Status';
GO

Status
--------
SUSPECT

Woo-hoo - I'm back to having a suspect database attached again - after having to shutdown the server and mess about deleting and copying files around. Not good. Now I can actually fix it.

Fixing the Database

If you don't have any backups, then the only way to get into the database is to use EMERGENCY mode. This lets you into the database but you need to be aware that recovery has not completed so the contents of the database are transactionally (and possibly structurally) inconsistent. I'm going to choose to repair the database using emergency-mode repair. See CHECKDB From Every Angle: EMERGENCY mode repair - the very, very last resort for a detailed description of this tool.

ALTER DATABASE DemoSuspect SET EMERGENCY;
GO
ALTER DATABASE DemoSuspect SET SINGLE_USER;
GO
DBCC CHECKDB (DemoSuspect, REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO

Msg 5172, Level 16, State 15, Line 1
The header for file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\DemoSuspect_log.LDF' is not a valid database file header. The PageAudit property is incorrect.
File activation failure. The physical file name "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\DemoSuspect_log.LDF" may be incorrect.
The log cannot be rebuilt because the database was not cleanly shut down.
The Service Broker in database "DemoSuspect" will be disabled because the Service Broker GUID in the database (9E879BFC-B742-4A69-AB14-4D6BD6F99E02) does not match the one in sys.databases (B4568D23-7018-40CF-B189-9C29DE697C09).
Warning: The log for database 'DemoSuspect' has been rebuilt. Transactional consistency has been lost. The RESTORE chain was broken, and the server no longer has context on the previous log files, so you will need to know what they were. You should run DBCC CHECKDB to validate physical consistency. The database has been put in dbo-only mode. When you are ready to make the database available for use, you will need to reset database options and delete any extra log files.

First off it tries to do the regular ATTACH_REBUILD_LOG. When that fails, DBCC CHECKDB takes over and forces the log to be rebuilt, after trying to force as much of the damaged log to be processed for recovery as it can. It then runs a full repair, in case there's anything corrupt in the database - in this case there isn't so there are no corruption messages in the output.

Notice the line about the Service Broker GUID being wrong. I had to use the hack method to get the database attached again, but when I created the dummy database, it created a Service Broker GUID for the DemoSuspect database in master.sys.databases. When I swapped in the corrupt database, it has a different GUID - so now I can't use Service Broker! And all because I detached the suspect database instead of repairing it.

So what's the state of the data?

USE DemoSuspect;
GO

SELECT * FROM Employees;
GO

FirstName   LastName   YearlyBonus
----------  ---------  ------------
Paul        Randal     10000
Kimberly    Tripp      0

Kimberly doesn't get a bonus this year - she won't be happy! This is contrived and flippant, of course, but it illustrates the point that after doing an emergency-mode repair, transactions that were active at the time the log was damaged will not get a chance to roll-back, most likely. In this case, I know what was going on when the crash occured, but what about on a busy OLTP system with hundreds or thousands of active transactions? What state will the data be in?

Summary

Yes, you can recover from a detached suspect database, but it's not pretty and you have to be very careful. The best course of action is always to have a comprehensive backup strategy that allows you to restore as quickly as possible. If you do have a suspect database and no backups, use EMERGENCY mode to access and/or repair the database. Hopefully this article will help people that find themselves in these situations.

Every so often I'll see posts on the various data corruption forums discussing causes of corruption. In this post I want to debunk some of the myths around what can cause corruption. There are really two types of corruption to deal with, physical corruption and logical corruption.

Physical corruption

This is where something has altered the contents of a data or log file sector with no regard for what is being stored there. Possible causes of physical corruption are:

  • Problem with the I/O subsystem (99.8% of all cases I've ever seen - only 3 nines as I'd estimate I've seen around about a thousand corruption cases). Remember the I/O subsystem is everything underneath SQL Server in the I/O stack - including the OS, 3rd-party file system filter drivers, device drivers, RAID controllers, SAN controllers, network hardware, drives themselves, and so on. Millions of lines of code and lots of moving parts spinning very fast, very close to very fragile pieces of metal oxide (I once heard Jim Gray liken a disk drive head to a 747 jumbo jet flying at 500 mph at a height of 1/4 inch from the ground...)
  • Problem with the host machine hardware (0.1% of cases). Most of the time this is a memory error.
  • SQL Server bugs (0.1% of cases). Yes, there have been corruption bugs. Every piece of software has bugs. There are KB articles describing bugs.
  • Deliberate introduction of corruption using a hex editor or other means.

Physical corruption is what DBCC CHECKDB usually reports and the majority of cases are caused by a physical failures of some kind, with the minority caused by humans - software bugs.

Logical corruption

This is where something has altered some data so that a data relationship is broken. Possible causes of logical corruption are:

  • Humans

:-) Okay...

  • Application bug. The application deletes one part of an inherent data relationship but not the other. Or the application designer doesn't implement a constraint properly. Or the application designer doesn't cope with a transaction roll-back properly. You get the idea.
  • Accidental update/delete. Someone deletes or updates some data incorrectly.
  • SQL Server bug. See above.
  • DBCC CHECKDB when using the REPAIR_ALLOW_DATA_LOSS option. As is documented in Books Online, and I've blogged about and mentioned when lecturing, if you run repair, it doesn't take into account any inherent or explicit constraints on the data.

The point here is that a physical failure of a component does not cause logical corruption, it causes physical corruption. Conversely, application errors cause logical corruption, not physical corruption. DBCC CHECKDB errors are about physical corruption (okay, with the inclusion of DBCC CHECKCATALOG code in 2005, it will find cases where the DBA has manually altered the system tables, causing logical corruption) and applications cannot cause physical corruption as they can only manipulate data through SQL Server. If an application hits a SQL Server bug which causes physical corruption, that's still not the application causing physical corruption, it's SQL Server.

So - on to the myths.

  • Can an application cause physical corruption? No.
  • Can stopping a shrink operation cause corruption of any kind? No.
  • Can stopping an index rebuild cause corruption of any kind? No.
  • Can running DBCC CHECKDB without repair cause corruption of any kind? No.
  • Can creating a database snapshot cause corruption of any kind? No.

Hope this helps.

I've had a few follow-ups on my two posts about boot page and file header page corruption - asking if its possible to do single-page restore operations for these pages. Let's try:

CREATE DATABASE BootPageTest;
GO

-- Single page restore is only possible using the FULL recovery model
ALTER DATABASE BootPageTest SET RECOVERY FULL;
GO

BACKUP DATABASE BootPageTest TO DISK = 'C:\sqlskills\BootPageTest.bck';
GO
BACKUP LOG BootPageTest TO DISK = 'C:\sqlskills\BootPageTest.trn';
GO

RESTORE DATABASE BootPageTest PAGE = '1:9' FROM DISK = 'C:\sqlskills\BootPageTest.bck';
GO

Msg 3111, Level 16, State 1, Line 2
Page (1:9) is a control page which cannot be restored in isolation. To repair this page, the entire file must be restored.
Msg 3013, Level 16, State 1, Line 2
RESTORE DATABASE is terminating abnormally.

The answer is no. The following page types cannot be restored using single-page restore:

  • File header pages (see here)
  • Boot page (see here)
  • GAM, SGAM, DIFF map, ML map pages (see here)

Following on from my previous post on boot pages and boot page corruption, I've been asked about file header pages - and I was already planning this post as the next in the series.

So what's a file header page? Every data file in a database has the very first 8kb page (i.e. page 0 in the file) set aside as the place to store all the metadata info about the file. As with the boot page, you can look at the contents with DBCC PAGE and it will interpret all the fields for you, or you can use the DBCC FILEHEADER command, which does a better job. This is undocumented and unsupported, just like DBCC DBINFO for looking at the database boot page, but it's been discussed and posted about on the Internet before so it's existence is no secret.

The command take a database name or database ID plus the file ID to dump. Here I've created a database called FileHeaderTest and used SSMS with results-to-text, plus a bunch of editing of the results to make it blog-able:

DBCC FILEHEADER ('FileHeaderTest', 1);
GO

FileId                : 1
LogicalName           : FileHeaderTest
BindingId             : D30AE3EF-14A6-47D5-B267-96F38238D882
FileGroup             : 1
Size                  : 152
MaxSize               : -1
MinSize               : 152
UserShrinkSize        : -1
Growth                : 128
BackupLSN             : 0
RedoStartLSN          : 0
FirstLSN              : 0
MaxLSN                : 0
FirstUpdateLSN        : 0
CreateLSN             : 0
SectorSize            : 512
RecoveryForkGUID      : 00000000-0000-0000-0000-000000000000
RecoveryForkLSN       : 0
DifferentialBaseLsn   : 19000000048800037
DifferentialBaseGuid  : 279A8EF4-4431-4CA5-8939-F613E5BC3033
Status                : 2
RestoreStatus         : 0
ReadOnlyLsn           : 0
ReadWriteLsn          : 0
MaxLsnBranchId        : 00000000-0000-0000-0000-000000000000
RedoTargetPointLsn    : 0
RedoTargetPointGuid   : 00000000-0000-0000-0000-000000000000
RestoreDiffBaseLsn    : 0
RestoreDiffBaseGuid   : 00000000-0000-0000-0000-000000000000
RestorePathOriginLsn  : 0
RestorePathOriginGuid : 00000000-0000-0000-0000-000000000000
OldestRestoredLsn     : 0

Lots of interesting stuff in here, such as:

  • BindingId: used to make sure a file is really part of this database
  • SectorSize: the disk sector size
  • Status: what kind of file and what state is it in (e.g. 2 = regular disk file)
  • Various sizes in number-of-8kb-pages (e.g. MaxSize of -1 means file growth is unlimited)
  • Growth: the number of pages to grow the file by if the 0x100000 bit is NOT set in the Status field. If it is set, the Growth is in percent.

And you can watch things change. For instance, if I change the file growth to 10%:

ALTER DATABASE FileHeaderTest MODIFY FILE (NAME = FileHeaderTest, FILEGROWTH = 10%);
GO

And then dump the file header page contents again, the Status and Growth fields have changed to:

.
.
Growth                : 10
.
.
Status                : 1048578
.
.

So what if a file header page is corrupt? I corrupted the file header page of my database and then started up SQL Server.

USE FileHeaderTest;
GO

Msg 945, Level 14, State 2, Line 1
Database 'FileHeaderTest' cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details.

Let's try EMERGENCY mode:

ALTER DATABASE FileHeaderTest SET EMERGENCY;
GO

Msg 5172, Level 16, State 15, Line 1
The header for file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\FileHeaderTest.mdf' is not a valid database file header. The PageAudit property is incorrect.

In this case, there's nothing to do except restore from backups, and the database is inaccessible because the PRIMARY filegroup could not be brought online. If the corruption were in a file in a secondary filegroup, things are a little bit different. Now I've added a secondary filegroup with a single file (called CorruptFile) and corrupted it's file header page. After starting up SQL Server we get the same behavior - but this time we can set the file to be offline and access the rest of the database. This is called partial database availability and works in Enterprise (and Developer) Edition only.

ALTER DATABASE FileHeaderTest MODIFY FILE (NAME = CorruptFile, OFFLINE);
GO

Note that the only way to bring an offline file back online is to restore it from a backup - see this post from my old Storage Engine blog for more details. Another corruption that can only be repaired using backups...

 

Now that I've done all the business-related blog posts, back to the good stuff to stop people complaining!

Something that's cropped up a few times over the summer so far is people trying to repair boot page corruptions.

First off, what's a boot page? Every database has a single page that stores critical information about the database itself. It's always page 9 in file 1 (the first file in the PRIMARY filegroup). You can examine the page using DBCC PAGE and it will interpret all the fields for you, but there's another command, DBCC DBINFO, that also dumps all this info (in fact the DBCC PAGE code calls the same underlying dumping code). This command is undocumented and unsupported but widely known and 'documented' in lots of places on the web - given that it uses the same code as DBCC PAGE, it's just as safe to use IMHO.

So what's on the boot page?

DBCC DBINFO ('BootPageTest');
GO

DBINFO STRUCTURE:


DBINFO @0x5BF6EF84

dbi_dbid = 19                        dbi_status = 65536                   dbi_nextid = 2073058421
dbi_dbname = BootPageTest            dbi_maxDbTimestamp = 2000            dbi_version = 611
dbi_createVersion = 611              dbi_ESVersion = 0                   
dbi_nextseqnum = 1900-01-01 00:00:00.000                                  dbi_crdate = 2008-07-10 15:53:18.843
dbi_filegeneration = 0              
dbi_checkptLSN

m_fSeqNo = 41                        m_blockOffset = 29                   m_slotId = 55
dbi_RebuildLogs = 0                  dbi_dbccFlags = 2                   
dbi_dbccLastKnownGood = 1900-01-01 00:00:00.000                          
dbi_dbbackupLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_oldestBackupXactLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_LastLogBackupTime = 1900-01-01 00:00:00.000                          
dbi_differentialBaseLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_createIndexLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_versionChangeLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_familyGUID = a4e88c13-b4cf-4320-834e-92b237244d4b                    
dbi_recoveryForkNameStack


entry 0

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
m_guid = a4e88c13-b4cf-4320-834e-92b237244d4b                            

entry 1

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
m_guid = 00000000-0000-0000-0000-000000000000                            
dbi_differentialBaseGuid = 00000000-0000-0000-0000-000000000000           dbi_firstSysIndexes = 0001:00000014
dbi_collation = 872468488            dbi_category = 0                     dbi_maxLogSpaceUsed = 231936
dbi_localState = 0                   dbi_roleSequence = 0                
dbi_failoverLsn

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_dbmRedoLsn

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_dbmOldestXactLsn

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_dbMirrorId = 00000000-0000-0000-0000-000000000000                    
dbi_pageUndoLsn

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_disabledSequence = 0            
dbi_dvSplitPoint

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_CloneCpuCount = 0                dbi_CloneMemorySize = 0             
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

There's all kinds on interesting things in there, for instance:

  • dbi_version and dbi_createversion: the physical version number of the database (and when it was created). See question 1 in the August 2008 SQL Q&A column in TechNet Magazine for an explanation (see here).
  • dbi_RebuildLogs: a count of the number of times the transaction log has been rebuilt for the database. PSS can use this to tell whether corruption problems could have been caused by DBAs rebuilding the log
  • dbi_dbccLastKnownGood: the completion time of the last 'clean' run of DBCC CHECKDB
  • a bunch of different LSNs related to checkpoint, backups, database mirroring
  • dbi_LastLogBackupTime: self-explanatory
  • dbi_differentialBaseGuid: the GUID generated by the last full database backup. Differential backups can only be restored on top of a matching full backup - so an out-of-band full backup could screw-up your disaster recovery - see this blog post for more info.

Now, what about if this page is corrupt in some way? I corrupted the BootPageTest database to have a corrupt boot page. Let's see what happens:

USE BootPagetest;
GO

Msg 913, Level 16, State 4, Line 1
Could not find database ID 19. Database may not be activated yet or may be in transition. Reissue the query once the database is available. If you do not think this error is due to a database that is transitioning its state and this error continues to occur, contact your primary support provider. Please have available for review the Microsoft SQL Server error log and any additional information relevant to the circumstances when the error occurred.

Okay - let's try setting the database into EMERGENCY mode:

ALTER DATABASE BootPageTest SET EMERGENCY;
GO

Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0xcdee22fa; actual: 0xcb6ea2fa). It occurred during a read of page (1:9) in database ID 19 at offset 0x00000000012000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\BootPageTest.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Hmm. What about running DBCC CHECKDB?

DBCC CHECKDB ('BootPageTest') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO

Msg 922, Level 14, State 1, Line 1
Database 'BootPageTest' is being recovered. Waiting until recovery is finished.

It's not looking good. Obviously the change to EMERGENCY mode couldn't complete properly. What's the database status?

SELECT [state_desc] FROM sys.databases WHERE [name] = 'BootPageTest';
GO

state_desc
------------------------------------------------------------
RECOVERY_PENDING

The boot page is inaccessible so in effect the database is inaccessible too, this is what the database state means in this case.

So what does this mean? If the boot page is corrupt, you can't run DBCC CHECKDB so you can't possibly run repair, and you can't put the database into EMERGENCY mode so you can't extract data into a new database. It means that there's NO WAY to recover from a corrupt boot page EXCEPT to restore from backups. One more reason to have backups...

Well, we're just back from vacation (photo blog post to follow) and I've heard that the feature article on Effective Database Maintenance I wrote for the August issue of TechNet Magazine is live on the web. It also includes a 5 minute long screencast I recorded where I demo the effect of database shrink on index fragmentation.

You can get to the article at http://technet.microsoft.com/en-us/magazine/cc671165.aspx. The topics covered are:

  • Managing data and transaction log files
  • Eliminating index fragmentation
  • Ensuring accurate, up-to-date statistics
  • Detected corrupted database pages
  • Establishing an effective backup strategy

It's written around 2-300 level and presents a good overview (well, at least I think so :-)) of the concepts involved.

Also, the August SQL Q&A column is available at http://technet.microsoft.com/en-us/magazine/cc671180(TechNet.10).aspx. This month's topics on the web (more in the print magazine) are:

  • Database version changes with upgrades
  • Benefits of partitioning
  • Consistency checking options for VLDBs

Enjoy!

Today I presented my brand new session Surviving Corruption: From Detection to Recovery at TechEd. I had a lot of fun putting together the demos, presenting the session, and talking to people afterwards. During the session, I promised to blog each of the demos so that everyone can run through them - here's the first one.

On SQL 2000, it was pretty easy to get into the system tables and manually change them - all you had to do was:

EXEC sp_configure 'allow updates', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO

And then you could insert, update, and delete whatever you wanted in the all the system tables, including the critical three - sysindexes, sysobjects, and syscolumns. The problem was that sometimes people actually did this and messed things up - for instance, by manually deleting an object from sysobjects, but leaving around all the other info about the object - such as indexes and columns. DBCC CHECKCATALOG in SQL 2000 would find this, but DBCC CHECKDB would not - as it didn't run the DBCC CHECKCATALOG code - any most people do not run DBCC CHECKCATALOG at all. Many times now, I've seen databases upgraded to 2005 and suddenly DBCC CHECKDB is reporting metadata corruption errors - all because someone had manually changed the system tables on 2000, and I changed DBCC CHECKDB in 2005 to include the DBCC CHECKCATALOG checks.

This demo is all about that. I created a 2000 database, manually deleted a row in sysobjects and then upgraded the database to 2005. The corrupt database is available in a zip file - DemoCorruptMetadata.zip. If you unzip it into a folder C:\SQLskills then you can attach it using:

RESTORE DATABASE DemoCorruptMetadata FROM DISK = 'C:\SQLskills\DemoCorruptMetadata.bak'
   
WITH MOVE 'DemoCorruptMetadata' TO 'C:\SQLskills\DemoCorruptMetadata.mdf',
   
MOVE 'DemoCorruptMetadata_log' TO 'C:\SQLskills\DemoCorruptMetadata_log.ldf',
   
REPLACE;
GO

So what does the corruption look like on 2005?

DBCC CHECKDB (DemoCorruptMetadata) WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3853, State 1: Attribute (object_id=1977058079) of row (object_id=1977058079,column_id=1) in sys.columns does not have a matching row (object_id=1977058079) in sys.objects.
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3853, State 1: Attribute (object_id=1977058079) of row (object_id=1977058079,column_id=2) in sys.columns does not have a matching row (object_id=1977058079) in sys.objects.
CHECKDB found 0 allocation errors and 2 consistency errors not associated with any single object.
CHECKDB found 0 allocation errors and 2 consistency errors in database 'DemoCorruptMetadata'.

This is what we expect. Notice that there's no recommended repair level at the end of the output - this is because CHECKDB can't repair metadata corruptions. We can't fix this with a backup - unless we have a backup from 2000 from before the manual delete in the system tables. To fix this we'd need to go back to 2000, fix the corruption, and then upgrade again - usually not feasible.

Instead, we're going to fix it by manually altering the system tables in 2005 - something that's purportedly not possible. First let's see what tables there are that could include column information (remembering that the system catalogs were completely rewritten between 2000 and 2005):

SELECT [name] FROM DemoCorruptMetadata.sys.objects WHERE [name] LIKE '%col%';
GO

name
------------------
sysrowsetcolumns
syshobtcolumns
syscolpars
sysiscols

I know that sysrowsetcolumns and syshobtcolumns are involved at low-levels of the Storage Engine and don't contain relational metadata, so let's try syscolpars. I want to see what columns there are to see if one of the looks like an object ID, and another looks like a column ID. This query will just return the table columns, with no rows (because the condition 1=0 is always false:

SELECT * FROM DemoCorruptMetadata.sys.syscolpars WHERE 1 = 0;
GO

Msg 208, Level 16, State 1, Line 1
Invalid object name 'DemoCorruptMetadata.sys.syscolpars'.

I can't bind to internal system tables in 2005. But - I can bind to internal system tables using the Dedicated Admind Connection (or DAC for short). This is documented in Books Online at http://msdn.microsoft.com/en-us/library/ms179503.aspx. You can get to the DAC through SQLCMD using the /A switch. So - assuming I'm now connected through the DAC, I'll try that command again:

C:\Documents and Settings\paul>sqlcmd /A
1> USE DemoCorruptMetadata;
2> GO
Changed database context to 'DemoCorruptMetadata'.
1> SELECT * FROM sys.syscolpars WHERE 1=0;
2> GO
id          number colid       name

xtype utype       length prec scale collationid status      maxinrow xmlns
 dflt        chk         idtval

----------- ------ ----------- -------------------------------------------------
-------------------------------------------------------------------------------
----- ----------- ------ ---- ----- ----------- ----------- -------- -----------
 ----------- ----------- -------------------------------------------------------
-----------

(0 rows affected)
1>

This looks like the table. Now I'll query against it using the object ID from the original corruption message:

1> SELECT colid, name FROM sys.syscolpars WHERE id = 1977058079;
2> GO
colid       name
----------- --------------------------------------------------------------------
------------------------------------------------------------
          1 SalesID
         
2 CustomerID
(2 rows affected)
1>

Cool. So I'll try deleting the orphaned columns:

1> DELETE FROM sys.syscolpars WHERE id = 1977058079;
2> GO
Msg 259, Level 16, State 1, Server ROADRUNNERPR, Line 1
Ad hoc updates to system catalogs are not allowed.
1>

Hmm. And it doesn't help if I set 'allow updates' to 1, or try putting the database into single-user mode.

There IS a way though. You can put the SERVER into single-user mode, then connect with the DAC and you can then update the system tables. This particular twist on using the DAC isn't documented anywhere except in an MSDN forum thread answered by someone from Microsoft (see here).