Before I start, I want to make it clear that you can only hit this bug if you ALREADY have corruption, that it's quite rare, and that there is a workaround.

I've noticed a few more people in the forums having CHECKDB fail with this particular error in the last month

Msg 8967, Level 16, State 216, Line 1
An internal error occured in DBCC which prevented further processing. Please contact Product Support.

instead of completing properly and listing the corruptions in the database.

Whenever CHECKDB is using a database snapshot, it must check that the page it read through the snapshot does not have an LSN (Log Sequence Number) higher than that when the snapshot was created. If it did, this would mean that the page was modified AFTER the snapshot was created and hence CHECKDB would be working from an inconsistent view of the database. If this case is discovered, CHECKDB stops immediately. When I rewrote CHECKDB for SQL Server 2005, I changed a bunch of code assertions into seperate states of the 8967 error, so that CHECKDB would fail gracefully if some condition occured that indicates a bug or something that should never happen. State 216 is for the bad LSN condition I've just described.

I used to think it was caused by a race condition with the NTFS code that implements sparse files, which is used by the hidden database snapshot that CHECKDB uses by default. However, I've come to learn that this is a bug in CHECKDB (not one of mine I should say :-)) that causes this behavior under certain circumstances when corruption is present. The bug is that if a corrupt page fails auditing inside CHECKDB, the LSN check is still performed. If the corruption affects the LSN stamped in the page header, the 8967 error could be triggered. I've seen this a handful of times in the last few weeks - hence the need for a blog post. I've discussed this with the dev team and hopefully the fix will make it into the next SPs for 2005 and 2008 (too late to fix such a rare problem in such a critical component at this stage of 2008 development). They're going to put a KB article together too - but in the meantime, I wanted to get this on the Internet so Google/Live Search pick it up.

Now let's repro the problem. Starting with a simple database and table, I'll find the first page so I can corrupt it.

CREATE DATABASE TestCheckdbBug;
GO
USE TestCheckdbBug;
GO
CREATE TABLE test (c1 INT, c2 CHAR (5000));
INSERT INTO test VALUES (1, 'a');
GO
EXEC sp_AllocationMetadata 'test';
GO

Object Name  Index ID  Alloc Unit ID      Alloc Unit Type  First Page  Root Page  First IAM Page
-----------  --------  -----------------  ---------------  ----------  ---------  --------------
test         0         72057594042318848  IN_ROW_DATA      (1:143)     (0:0)      (1:152)

Now I'm going to corrupt the page type on page (1:143) to be 255 (an invalid page type), which will guarantee the page fails the audit inside CHECKDB.

DBCC CHECKDB ('TestCheckdbBug') WITH ALL_ERRORMSGS, NO_INFOMSGS;
GO

Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:143) could not be processed. See other errors for details.
Msg 8939, Level 16, State 6, Line 1
Table error: Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data), page (1:143). Test ((m_type >= DATA_PAGE && m_type <= UNDOFILE_HEADER_PAGE) || (m_type == UNKNOWN_PAGE && level == BASIC_HEADER)) failed. Values are 255 and 255.
CHECKDB found 0 allocation errors and 2 consistency errors in table 'test' (object ID 2073058421).
CHECKDB found 0 allocation errors and 2 consistency errors in database 'TestCheckdbBug'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (TestCheckdbBug).

Now I'm going to corrupt the LSN on that page such that it's guaranteed to be higher than the creation LSN of the database snapshot (basically by filling the first part of the page header LSN field with 0xFF).

DBCC CHECKDB ('TestCheckdbBug') WITH ALL_ERRORMSGS, NO_INFOMSGS;
GO

Msg 8967, Level 16, State 216, Line 1
An internal error occurred in DBCC that prevented further processing. Contact Customer Support Services.
Msg 8921, Level 16, State 1, Line 1
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.

Bingo! And in the error log, there's some diagnostic information so we can tell which page caused the problem:

2008-05-22 14:55:01.95 spid53   DBCC encountered a page with an LSN greater than the current end of log LSN (31:0:1) for its internal database snapshot. Could not read page (1:143), database 'TestCheckdbBug' (database ID 15), LSN = (-1:65535:18), type = 255, isInSparseFile = 0. Please re-run this DBCC command.
2008-05-22 14:55:01.95 spid53   DBCC CHECKDB (TestCheckdbBug) WITH all_errormsgs, no_infomsgs executed by ROADRUNNERPR\paul terminated abnormally due to error state 1. Elapsed time: 0 hours 0 minutes 0 seconds.

Note the page ID (in black bold above) tells us the bad page and the LSN (in blue bold above) reflects the corruption that I caused. If the page ID field of the header was corrupt, it wouldn't be possible to tell from these diagnostics which page is corrupt.

However, all is not lost. This bug means that under these circumstances the default online behavior of CHECKDB can't run. The workaround is to use the WITH TABLOCK option of CHECKDB, which does offline checking and doesn't need the snapshot - but the trade-off is that an exclusive database lock is required for a short time and then shared table locks for all tables in the database (this is why online is the default). Running this option on my corrupt database gives:

DBCC CHECKDB ('TestCheckdbBug') WITH TABLOCK, ALL_ERRORMSGS, NO_INFOMSGS;
GO

Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:143) could not be processed. See other errors for details.
Msg 8939, Level 16, State 6, Line 1
Table error: Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data), page (1:143). Test ((m_type >= DATA_PAGE && m_type <= UNDOFILE_HEADER_PAGE) || (m_type == UNKNOWN_PAGE && level == BASIC_HEADER)) failed. Values are 255 and 255.
CHECKDB found 0 allocation errors and 2 consistency errors in table 'test' (object ID 2073058421).
CHECKDB found 0 allocation errors and 2 consistency errors in database 'TestCheckdbBug'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (TestCheckdbBug).

Which are the exact same results we had before I corrupted the LSN field (this is expected, as there is no check of a page's LSN field EXCEPT when running from a database snapshot). Now we can proceed to restore/repair as appropriate.

So - a scary little bug that has caused some people headaches, but I want to stress again - this can only happen if the database is ALREADY corrupt, and that it's quite rare. Hope this helps some of you picking this up from search engines in the future.

A couple of weeks ago I blogged about the three tracks of the SQL Server 2008 JumpStart course that SQLskills.com taught internally for Microsoft and some MVPs - see here for details. Well, the content is now available to download! Note that this was based on CTP-5 (November 2007 CTP) and there have been *lots* of behavioral changes since then (with more planned for CTP-6 Refresh and RTM), but if you want a high-level overview of a bunch of the features (albeit in bullet-point summaries on slides) then this is a good place to start.

For me, what's *REALLY* cool is that the site also has a downloadable VPC plus lab manuals for all of the AlwaysOn High-Availability hands-on labs that SQLskills.com wrote. We originally wrote these labs for SQL Server 2005 and I updated them all for CTP-5. The VPC has a long lab on each of the following:

  • Database Snapshots
  • Data Recovery and Preventative Techniques
  • Instant Initialization
  • Peer-to-Peer Replication (including the new Topology Wizard I blogged about here)
  • Table and Index Partitioning
  • Snapshot Isolation
  • Online Operations
  • Database Mirroring (including a demo I wrote of Automatic Page Repair, described here)
  • Service Oriented Database Architecture

There is some great depth in each of these - Kimberly blogged more info about the exercises in each lab here. There's also another VPC image with some higher-level labs on a variety of 2008 features and written by a number of different people- including some labs on Policy-Based Management and Performance Data Collection that Kimberly wrote.

So - where can you get these from? Go to http://sqlserver2008jumpstart.microsofttraining.com/ and hit the Download link on the right-hand side. Register and then you can get to the materials. The AlwaysOn VPC image is Collection 2 at the bottom of the page, and you'll see all the slide decks as you scroll down the page.

Enjoy!

Here's an issue that I thought was a one-off but it just popped up again over the weekend so I want to publicize it.

DBA runs a DBCC CHECKDB and gets output like the following:

Msg 5172, Level 16, State 15, Line 1
The header for file 'E:\Data\namechanged.mdf:MSSQL_DBCC14' is not a valid database file header. The PageAudit property is incorrect.
Msg 5120, Level 16, State 9, Line 1
Unable to open the physical file "E:\Data\namechanged.mdf:MSSQL_DBCC14". Operating system error 0: "(null)".
Msg 1823, Level 16, State 2, Line 1
A database snapshot cannot be created because it failed to start.
Msg 7928, Level 16, State 1, Line 1
The database snapshot for online checks could not be created. Either the reason is given in a previous error or one of the underlying volumes does not support sparse files or alternate streams. Attempting to get exclusive access to run checks offline.

etc

Wow! Looks like something's seriously wrong with that database. Until you find that this happens on *all* the databases on the server. And on all the other servers in the data center too. What on earth's going on? I saw a similar issue once before, back in 2004 or 2005 while customers were testing pre-RTM IDW builds of SQL Server 2005. After running a test and doing some investigation for me, the person who just hit this issue over the weekend confirmed my hypothesis that he'd hit the problem I saw a few years ago.

The problem is a 3rd-party file encryption solution. It installs a file-system filter driver that filters all reads and writes to the database files so it can perform the encryption/decryption. Unfortunately it doesn't cope with NTFS alternate streams - which is how the internal database snapshots that DBCC CHECKDB uses are created - and so returns garbage for all reads from the database snapshot. The simple workaround is to create your own database snapshot (whose constituent files will exist in their own right, rather than being alternate streams of the source database's files) and then run DBCC CHECKDB against that.

Now - I don't want to cry 'wolf' here, but be careful of attributing massive corruption problems to the real IO subsystem if there's any kind of file-system filter driver installed.

This is a post I've been meaning to do for a while - detailing some of the problems you can run into when using database snapshots.

Reverting to a snapshot

Many people use snapshots as a way of protecting against mistakes during a complicated set of changes to a database - you can just revert to the snapshot if something goes wrong. However, reverting to a snapshot has a little-known (but documented) problem. Let's see. I'm going to create a database plus a snapshot of it and then start taking backups.

CREATE DATABASE SnapshotTest;
GO
ALTER DATABASE SnapshotTest SET RECOVERY FULL;
GO
BACKUP DATABASE SnapshotTest TO DISK = 'C:\SQLskills\SnapshotTest.bak' WITH INIT;
GO

CREATE TABLE SnapshotTest.dbo.MyTable (c1 INT);
GO

BACKUP LOG SnapshotTest TO DISK = 'C:\SQLskills\SnapshotTest_log.bak' WITH INIT;
GO

-- Imagine a bunch of things happen here

-- Create the snapshot database, by first checking which files exist
SELECT * FROM SnapshotTest.sys.database_files;
GO

CREATE DATABASE ST_Snap ON
(NAME = SnapshotTest, FILENAME = 'C:\SQLskills\SnapshotTest_snap.snp')
AS SNAPSHOT OF SnapshotTest;
GO

Now I'm going to do some stuff that's a mistake:

DROP TABLE SnapshotTest.dbo.MyTable;
GO

Not a problem as I can revert to my database snapshot:

RESTORE DATABASE SnapshotTest FROM DATABASE_SNAPSHOT = 'ST_Snap';
GO

Oops! I forgot to take a log backup to capture everything that happened since the last log backup...

BACKUP LOG SnapshotTest TO DISK = 'C:\SQLskills\SnapshotTest_log.bak';
GO
Msg 4214, Level 16, State 1, Line 1
BACKUP LOG cannot be performed because there is no current database backup.
Msg 3013, Level 16, State 1, Line 1
BACKUP LOG is terminating abnormally.

Reverting to a snapshot rebuilds the transaction log and breaks the log backup chain. The only thing I can do is take a full or differential backup and then start taking log backups again. So - I lost the ability to do point-in-time restores in the period from the last log backup to the time when I reverted to the snapshot.

So how can I tell when the database was reverted? Books Online documents that the restorehistory table in msdb should have an entry with restore_type = 'R'. Let's try:

SELECT * FROM msdb.dbo.restorehistory WHERE [destination_database_name] = 'SnapshotTest';
GO

Nope - nothing. That functionality seems to be broken. The only way I could find to trace reverting was in the error log:

2008-01-30 11:09:21.73 spid53      Reverting database 'SnapshotTest' to the point in time of database snapshot 'ST_Snap' with split point LSN 26000000013800001 (0x0000001a:0000008a:0001). This is an informational message only. No user action is required.
2008-01-30 11:09:21.74 spid53      Starting up database 'SnapshotTest'.
2008-01-30 11:09:21.74 spid53      The database 'SnapshotTest' is marked RESTORING and is in a state that does not allow recovery to be run.
2008-01-30 11:09:21.85 spid53      Starting up database 'SnapshotTest'.
2008-01-30 11:09:21.87 spid53      Starting up database 'ST_Snap'.

Note that the snapshot database is still there and continues to work.

Running out of space in a snapshot

The next problem that can happen is if you create a database snapshot on a volume that doesn't have much disk space. In that case you can actually run out of space and the snapshot will go suspect and be unusable until it's dropped and recreated. Let's see what happens:

-- Create snapshot of large database on a volume with not much space
SELECT * FROM SalesDB.sys.database_files;
GO

CREATE DATABASE SalesDB_Snap ON
(NAME = SalesDBData, FILENAME = 'D:\sqlskills\SalesDB_snap.snp')
AS SNAPSHOT OF SalesDB;
GO

ALTER INDEX ALL ON SalesDB.dbo.Sales REBUILD;
GO
Msg 5128, Level 17, State 2, Line 1
Write to sparse file 'D:\sqlskills\SalesDB_snap.snp' failed due to lack of disk space.

The query that causes the sparse file to run out of space doesn't fail, but the user running the query will see this error. Note that it doesn't say the snapshot has gone suspect. If you try to use it then it will tell you, plus there info written to the error log:

2008-01-30 11:50:29.14 spid53      Error: 17053, Severity: 16, State: 1.
2008-01-30 11:50:29.14 spid53      D:\sqlskills\SalesDB_snap.snp: Operating system error 112(There is not enough space on the disk.) encountered.
2008-01-30 11:50:29.15 spid53      Error: 3420, Severity: 21, State: 1.
2008-01-30 11:50:29.15 spid53      Database snapshot 'SalesDB_Snap' has failed an IO operation and is marked suspect.  It must be dropped and recreated.

Running out of space while running DBCC CHECKDB

One thing that isn't common knowledge is that DBCC CHECKDB uses database snapshots as it's mechanism for running online in SQL Server 2005. You can't control where the snapshot is created (it's actually created in the same location as the files comprising the database being checked - see my previous post on the DBCC CHECKDB stages here for more info) but you can create your own snapshot and check that. Anyway - it's possible to run out of space in the snapshot while DBCC CHECKDB is running. I setup a situation with a large database on a drive with only 1MB of space left, started a large index rebuild in the database and then tried to run DBCC CHECKDB. Let's see what happened:

The index rebuild statement:

ALTER INDEX ALL ON SalesDB2.dbo.Sales REBUILD;
GO
Msg 5128, Level 17, State 2, Line 1
Write to sparse file 'd:\sqlskills\salesdbdata.mdf:MSSQL_DBCC20' failed due to lack of disk space.

So some poor unsuspecting user gets this error and has no idea why. For DBAs its a bit obscure too. The syntax is describing an alternate stream on the existing data file and the snapshot in question can't be accessed at all.

The DBCC CHECKDB statement:

DBCC CHECKDB (salesdb2) WITH NO_INFOMSGS;
GO
Msg 1823, Level 16, State 1, Line 1
A database snapshot cannot be created because it failed to start.
Msg 1823, Level 16, State 2, Line 1
A database snapshot cannot be created because it failed to start.
Msg 7928, Level 16, State 1, Line 1
The database snapshot for online checks could not be created. Either the reason is given in a previous error or one of the underlying volumes does not support sparse files or alternate streams. Attempting to get exclusive access to run checks offline.
Msg 5128, Level 17, State 2, Line 1
Write to sparse file 'd:\sqlskills\salesdbdata.mdf:MSSQL_DBCC20' failed due to lack of disk space.
Msg 3313, Level 21, State 2, Line 1
During redoing of a logged operation in database 'salesdb2', an error occurred at log record ID (1628:252:1). Typically, the specific failure is previously logged as an error in the Windows Event Log service. Restore the database from a full backup, or repair the database.
Msg 0, Level 20, State 0, Line 0
A severe error occurred on the current command.  The results, if any, should be discarded.

Wow - major amounts of errors - and it all boils down to running out of space.

One of the sessions we did yesterday at IT Forum was an Instructor-Led Lab on database snapshots, an Enterprise Edition feature of SQL Server 2005. Database snapshots are not particularly well known in the DBA community and there are many misconceptions about them. The session generated a bunch of questions, some of which I'll answer here.

Q1) Can you change the data in table in a database snapshot?

A1) No.

Q2) Can you change the permissions in a database snapshot?

A2) Nope, sorry.

Q3) Can you backup a database snapshot?

A3) Afraid not.

A4) Can you detach a database snapshot?

A4) Err, nope.

You may have noticed a pattern here :-) Basically, the only things you can do with a database snapshot are select from it, and revert to it - i.e. rollback the entire database on which it is based to the point-in-time at which the database snapshot was created. Database snapshots are not updateable in any way, and at the time I left MS, there were no plans to change that in the future. Regardless of these limitations, database snapshots are very useful. I searched both our blogs for some examples to link to and couldn't find any so I'll put together a post on using them sometime over the next week or so.

The final question I *was* able to answer successfully for a very happy conference attendee:

Q5) Can you create a partial snapshot on a database, say a single filegroup, for reporting?

A5) No - HOWEVER, if you're on Enterprise Edition, you can make use of partial database availability. Hopefully you have your database structured so the primary filegroup has nothing but system tables in, and you have the table you're interested in isolated in a seperate filegroup - let's call it filegroup X. (He answered 'yes' to all three!). If that is that case, backup the primary filegroup and filegroup X. Then restore the primary filegroup followed by filegroup X in a separate location. As long as the primary filegroup of a database is online in SQL Server 2005 Enterprise Edition, then the database is online and any other online filegroups are available - partial database availability. This means you've effectively created a point-in-time snapshot of a single filegroup, albeit at the expense of having to have a full copy of the filegroup.

Theme design by Nukeation based on Jelle Druyts