Backups, trains, birds, and painfully contrived movie plots


Totally off-topic blog post this time. I haven’t blogged in a week as we’ve been in Chicago and Illinois for the memorial for Kimberly’s Father (see here). Everything went really well at the memorial and then the ashes scattering in Lake Michigan – perfect sailing weather! The only fly in the ointment came back to backups again. I bought a very cool video camera to make sure we captured the memorial for posterity (actually I probably went a little over-the-top but the HD picture quality is awesome – Canon XA H1). I video’d the whole memorial, and then out on the boat the next day. The only problem was that I didn’t check the tapes before taping on the second day and managed to overwrite half of the memorial video. Should have taken a backup onto my laptop in the evening on the first day but too much rum was drunk in the Columbia Yacht Club in Chicago and I didn’t think to check in the morning. Oops. After we’ve been burned so badly with Kimberly’s computer mishaps (see my diatribe here), you’d think we’d have learned by now…


The last few days we’ve been in Galesburg, IL visiting Kimberly’s Mom and Fort Madison, IA visiting her Grandmother. As a bird-watcher, this was paradise as I managed to pick-up eight new bird species for my life-list. Galesburg is the home of the largest railroad switch-yard in the world, and much as I like trains, it seems like most of the 150+ trains per day that go through Galesburg actually go through at night, making lots of noise as the do so – which doesn’t make for the best sleep.


Here are some pictures from Galesburg… (click for larger versions)



Okay – so why does the title mention movie plots? And why is this blog post filed under the Bad Advice tag? Well, it would be bad advice for me to recommend you go to see Indiana Jones and the Kingdom of the Crystal Skull, which we saw this evening. We’re both *huge* Indiana Jones fans, but this movie was pretty bad. Contrived plot, boring dialog, wooden characters, and a predictable ending. Without giving anything away, the refrigerator scene is totally unbelievable, the accents of the baddies are cliched and awful, and what’s with the cutsie gophers at the start? Well, I enjoyed a few bits here and there but I was ready to leave after about half an hour. I can’t believe Harrison Ford made this movie… Oh well – I’m sure opinions will vary but I think they should have left the series to end with The Last Crusade back in 1989.


Tomorrow we fly home and next week we’re back to work for a little bit before flying out again to TechEd on Friday. And I’ll be finishing up some exciting 2008 whitepapers for the SQL teaam and back to blogging about technical stuff.

CHECKDB bug that people are hitting – Msg 8967, Level 16, State 216

Before I start, I want to make it clear that you can only hit this bug if you ALREADY have corruption, that it’s quite rare, and that there is a workaround.

I’ve noticed a few more people in the forums having DBCC CHECKDB fail with this particular error in the last month:

Msg 8967, Level 16, State 216, Line 1
An internal error occured in DBCC which prevented further processing. Please contact Product Support.

Whenever DBCC CHECKDB is using a database snapshot, it must check that the page it read through the snapshot does not have an LSN (Log Sequence Number) higher than that when the snapshot was created. If it did, this would mean that the page was modified AFTER the snapshot was created and hence DBCC CHECKDB would be working from an inconsistent view of the database. If this case is discovered, DBCC CHECKDB stops immediately. When I rewrote DBCC CHECKDB for SQL Server 2005, I changed a bunch of code assertions into separate states of the 8967 error, so that DBCC CHECKDB would fail gracefully if some condition occurred that indicates a bug or something that should never happen. State 216 is for the bad LSN condition I’ve just described.

I used to think it was caused by a race condition with the NTFS code that implements sparse files, which is used by the hidden database snapshot that DBCC CHECKDB uses by default. However, I’ve come to learn that this is a bug in DBCC CHECKDB (not one of mine I should say :-)) that causes this behavior under certain circumstances when corruption is present. The bug is that if a corrupt page fails auditing inside DBCC CHECKDB, the LSN check is still performed. If the corruption affects the LSN stamped in the page header, the 8967 error could be triggered. I’ve seen this a handful of times in the last few weeks – hence the need for a blog post. I’ve discussed this with the dev team and hopefully the fix will make it into the next SPs for 2005 and 2008 (too late to fix such a rare problem in such a critical component at this stage of 2008 development). They’re going to put a KB article together too – but in the meantime, I wanted to get this on the Internet so Google/Live Search pick it up.

[Edit: June 24 2010 – this still hasn’t been fixed AFAIK – see KB 960791 which just discusses the workaround.)

Now let’s repro the problem. Starting with a simple database and table, I’ll find the first page so I can corrupt it.

CREATE DATABASE [TestCheckdbBug];
GO
USE [TestCheckdbBug];
GO
CREATE TABLE [test] ([c1] INT, [c2] CHAR (5000));
INSERT INTO [test] VALUES (1, 'a');
GO
EXEC sp_AllocationMetadata 'test';
GO
Object Name  Index ID  Alloc Unit ID      Alloc Unit Type  First Page  Root Page  First IAM Page
-----------  --------  -----------------  ---------------  ----------  ---------  --------------
test         0         72057594042318848  IN_ROW_DATA      (1:143)     (0:0)      (1:152)

Now I’ve corrupted the page type on page (1:143) to be 255 (an invalid page type), which will guarantee the page fails the audit inside CHECKDB.

DBCC CHECKDB ('TestCheckdbBug') WITH ALL_ERRORMSGS, NO_INFOMSGS;
GO
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:143) could not be processed. See other errors for details.
Msg 8939, Level 16, State 6, Line 1
Table error: Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data), page (1:143). Test ((m_type >= DATA_PAGE && m_type <= UNDOFILE_HEADER_PAGE) || (m_type == UNKNOWN_PAGE &amp;&amp; level == BASIC_HEADER)) failed. Values are 255 and 255.
CHECKDB found 0 allocation errors and 2 consistency errors in table 'test' (object ID 2073058421).
CHECKDB found 0 allocation errors and 2 consistency errors in database 'TestCheckdbBug'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (TestCheckdbBug).

Now I’m going to corrupt the LSN on that page such that it’s guaranteed to be higher than the creation LSN of the database snapshot (basically by filling the first part of the page header LSN field with 0xFF).

DBCC CHECKDB ('TestCheckdbBug') WITH ALL_ERRORMSGS, NO_INFOMSGS;
GO
Msg 8967, Level 16, State 216, Line 1
An internal error occurred in DBCC that prevented further processing. Contact Customer Support Services.
Msg 8921, Level 16, State 1, Line 1
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.

Bingo! And in the error log, there’s some diagnostic information so we can tell which page caused the problem:

2008-05-22 14:55:01.95 spid53   DBCC encountered a page with an LSN greater than the current end of log LSN (31:0:1) for its internal database snapshot. Could not read page (1:143), database 'TestCheckdbBug' (database ID 15), LSN = (-1:65535:18), type = 255, isInSparseFile = 0. Please re-run this DBCC command.
2008-05-22 14:55:01.95 spid53   DBCC CHECKDB (TestCheckdbBug) WITH all_errormsgs, no_infomsgs executed by ROADRUNNERPR\paul terminated abnormally due to error state 1. Elapsed time: 0 hours 0 minutes 0 seconds.

Note the page ID tells us the bad page and the LSN reflects the corruption that I caused. If the page ID field of the header was corrupt, it wouldn’t be possible to tell from these diagnostics which page is corrupt. However, all is not lost. This bug means that under these circumstances the default online behavior of DBCC CHECKDB can’t run. The workaround is to use the WITH TABLOCK option, which does offline checking and doesn’t need the snapshot – but the trade-off is that an exclusive database lock is required for a short time and then shared table locks for all tables in the database (this is why online is the default). Running this option on my corrupt database gives:

DBCC CHECKDB ('TestCheckdbBug') WITH ALL_ERRORMSGS, NO_INFOMSGS, TABLOCK;
GO
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:143) could not be processed. See other errors for details.
Msg 8939, Level 16, State 6, Line 1
Table error: Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data), page (1:143). Test ((m_type >= DATA_PAGE && m_type <= UNDOFILE_HEADER_PAGE) || (m_type == UNKNOWN_PAGE && level == BASIC_HEADER)) failed. Values are 255 and 255.
CHECKDB found 0 allocation errors and 2 consistency errors in table 'test' (object ID 2073058421).
CHECKDB found 0 allocation errors and 2 consistency errors in database 'TestCheckdbBug'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (TestCheckdbBug).

Which are the exact same results we had before I corrupted the LSN field (this is expected, as there is no check of a page’s LSN field EXCEPT when running from a database snapshot). Now we can proceed to restore/repair as appropriate.

So – a scary little bug that has caused some people headaches, but I want to stress again – this can only happen if the database is ALREADY corrupt, and that it’s quite rare. Hope this helps some of you picking this up from search engines in the future.

In Print: TechNet Magazine June 2008 SQL Q&A Column


My first magazine article is in print! I’ve taken over the bi-monthly SQL Q&A column for TechNet Magazine and I just received the June magazine in the mail today with my first column in it. Topics covered are:



  • Creating corruption and using page checksums
  • The shrink-grow-shrink-grow trap
  • How many databases can be mirrored per instance
  • A tip on changing the default server port, from Jens Suessmeyer

I’ve also just completed a feature article for either the July or August issue dealing with database maintenance for the ‘involuntary’ DBA – more details when it gets published.


If you don’t get the print version of TechNet Magazine, you can get to this month’s SQL Q&A column at http://technet.microsoft.com/en-us/magazine/cc510328.aspx. There may not be anything new if you’ve been following my blog for a while, but if you’ve just started, it’s worth a quick look.


Enjoy!


PS Let me know if you’ve got any good questions – I’ve already completed the August column but I’d like to hear of any questions you may have for later columns.