Wednesday, August 27, 2008

Every so often I'll see posts on the various data corruption forums discussing causes of corruption. In this post I want to debunk some of the myths around what can cause corruption. There are really two types of corruption to deal with, physical corruption and logical corruption.

Physical corruption

This is where something has altered the contents of a data or log file sector with no regard for what is being stored there. Possible causes of physical corruption are:

  • Problem with the I/O subsystem (99.8% of all cases I've ever seen - only 3 nines as I'd estimate I've seen around about a thousand corruption cases). Remember the I/O subsystem is everything underneath SQL Server in the I/O stack - including the OS, 3rd-party file system filter drivers, device drivers, RAID controllers, SAN controllers, network hardware, drives themselves, and so on. Millions of lines of code and lots of moving parts spinning very fast, very close to very fragile pieces of metal oxide (I once heard Jim Gray liken a disk drive head to a 747 jumbo jet flying at 500 mph at a height of 1/4 inch from the ground...)
  • Problem with the host machine hardware (0.1% of cases). Most of the time this is a memory error.
  • SQL Server bugs (0.1% of cases). Yes, there have been corruption bugs. Every piece of software has bugs. There are KB articles describing bugs.
  • Deliberate introduction of corruption using a hex editor or other means.

Physical corruption is what DBCC CHECKDB usually reports and the majority of cases are caused by a physical failures of some kind, with the minority caused by humans - software bugs.

Logical corruption

This is where something has altered some data so that a data relationship is broken. Possible causes of logical corruption are:

  • Humans

:-) Okay...

  • Application bug. The application deletes one part of an inherent data relationship but not the other. Or the application designer doesn't implement a constraint properly. Or the application designer doesn't cope with a transaction roll-back properly. You get the idea.
  • Accidental update/delete. Someone deletes or updates some data incorrectly.
  • SQL Server bug. See above.
  • DBCC CHECKDB when using the REPAIR_ALLOW_DATA_LOSS option. As is documented in Books Online, and I've blogged about and mentioned when lecturing, if you run repair, it doesn't take into account any inherent or explicit constraints on the data.

The point here is that a physical failure of a component does not cause logical corruption, it causes physical corruption. Conversely, application errors cause logical corruption, not physical corruption. DBCC CHECKDB errors are about physical corruption (okay, with the inclusion of DBCC CHECKCATALOG code in 2005, it will find cases where the DBA has manually altered the system tables, causing logical corruption) and applications cannot cause physical corruption as they can only manipulate data through SQL Server. If an application hits a SQL Server bug which causes physical corruption, that's still not the application causing physical corruption, it's SQL Server.

So - on to the myths.

  • Can an application cause physical corruption? No.
  • Can stopping a shrink operation cause corruption of any kind? No.
  • Can stopping an index rebuild cause corruption of any kind? No.
  • Can running DBCC CHECKDB without repair cause corruption of any kind? No.
  • Can creating a database snapshot cause corruption of any kind? No.

Hope this helps.

Wednesday, August 27, 2008 9:11:27 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Monday, August 11, 2008

I've had a few follow-ups on my two posts about boot page and file header page corruption - asking if its possible to do single-page restore operations for these pages. Let's try:

CREATE DATABASE BootPageTest;
GO

-- Single page restore is only possible using the FULL recovery model
ALTER DATABASE BootPageTest SET RECOVERY FULL;
GO

BACKUP DATABASE BootPageTest TO DISK = 'C:\sqlskills\BootPageTest.bck';
GO
BACKUP LOG BootPageTest TO DISK = 'C:\sqlskills\BootPageTest.trn';
GO

RESTORE DATABASE BootPageTest PAGE = '1:9' FROM DISK = 'C:\sqlskills\BootPageTest.bck';
GO

Msg 3111, Level 16, State 1, Line 2
Page (1:9) is a control page which cannot be restored in isolation. To repair this page, the entire file must be restored.
Msg 3013, Level 16, State 1, Line 2
RESTORE DATABASE is terminating abnormally.

The answer is no. The following page types cannot be restored using single-page restore:

  • File header pages (see here)
  • Boot page (see here)
  • GAM, SGAM, DIFF map, ML map pages (see here)
Monday, August 11, 2008 11:32:36 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Friday, July 25, 2008

Following on from my previous post on boot pages and boot page corruption, I've been asked about file header pages - and I was already planning this post as the next in the series.

So what's a file header page? Every data file in a database has the very first 8kb page (i.e. page 0 in the file) set aside as the place to store all the metadata info about the file. As with the boot page, you can look at the contents with DBCC PAGE and it will interpret all the fields for you, or you can use the DBCC FILEHEADER command, which does a better job. This is undocumented and unsupported, just like DBCC DBINFO for looking at the database boot page, but it's been discussed and posted about on the Internet before so it's existence is no secret.

The command take a database name or database ID plus the file ID to dump. Here I've created a database called FileHeaderTest and used SSMS with results-to-text, plus a bunch of editing of the results to make it blog-able:

DBCC FILEHEADER ('FileHeaderTest', 1);
GO

FileId                : 1
LogicalName           : FileHeaderTest
BindingId             : D30AE3EF-14A6-47D5-B267-96F38238D882
FileGroup             : 1
Size                  : 152
MaxSize               : -1
MinSize               : 152
UserShrinkSize        : -1
Growth                : 128
BackupLSN             : 0
RedoStartLSN          : 0
FirstLSN              : 0
MaxLSN                : 0
FirstUpdateLSN        : 0
CreateLSN             : 0
SectorSize            : 512
RecoveryForkGUID      : 00000000-0000-0000-0000-000000000000
RecoveryForkLSN       : 0
DifferentialBaseLsn   : 19000000048800037
DifferentialBaseGuid  : 279A8EF4-4431-4CA5-8939-F613E5BC3033
Status                : 2
RestoreStatus         : 0
ReadOnlyLsn           : 0
ReadWriteLsn          : 0
MaxLsnBranchId        : 00000000-0000-0000-0000-000000000000
RedoTargetPointLsn    : 0
RedoTargetPointGuid   : 00000000-0000-0000-0000-000000000000
RestoreDiffBaseLsn    : 0
RestoreDiffBaseGuid   : 00000000-0000-0000-0000-000000000000
RestorePathOriginLsn  : 0
RestorePathOriginGuid : 00000000-0000-0000-0000-000000000000
OldestRestoredLsn     : 0

Lots of interesting stuff in here, such as:

  • BindingId: used to make sure a file is really part of this database
  • SectorSize: the disk sector size
  • Status: what kind of file and what state is it in (e.g. 2 = regular disk file)
  • Various sizes in number-of-8kb-pages (e.g. MaxSize of -1 means file growth is unlimited)
  • Growth: the number of pages to grow the file by if the 0x100000 bit is NOT set in the Status field. If it is set, the Growth is in percent.

And you can watch things change. For instance, if I change the file growth to 10%:

ALTER DATABASE FileHeaderTest MODIFY FILE (NAME = FileHeaderTest, FILEGROWTH = 10%);
GO

And then dump the file header page contents again, the Status and Growth fields have changed to:

.
.
Growth                : 10
.
.
Status                : 1048578
.
.

So what if a file header page is corrupt? I corrupted the file header page of my database and then started up SQL Server.

USE FileHeaderTest;
GO

Msg 945, Level 14, State 2, Line 1
Database 'FileHeaderTest' cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details.

Let's try EMERGENCY mode:

ALTER DATABASE FileHeaderTest SET EMERGENCY;
GO

Msg 5172, Level 16, State 15, Line 1
The header for file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\FileHeaderTest.mdf' is not a valid database file header. The PageAudit property is incorrect.

In this case, there's nothing to do except restore from backups, and the database is inaccessible because the PRIMARY filegroup could not be brought online. If the corruption were in a file in a secondary filegroup, things are a little bit different. Now I've added a secondary filegroup with a single file (called CorruptFile) and corrupted it's file header page. After starting up SQL Server we get the same behavior - but this time we can set the file to be offline and access the rest of the database. This is called partial database availability and works in Enterprise (and Developer) Edition only.

ALTER DATABASE FileHeaderTest MODIFY FILE (NAME = CorruptFile, OFFLINE);
GO

Note that the only way to bring an offline file back online is to restore it from a backup - see this post from my old Storage Engine blog for more details. Another corruption that can only be repaired using backups...

 

Friday, July 25, 2008 6:10:50 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1]  | 
Thursday, July 10, 2008

Now that I've done all the business-related blog posts, back to the good stuff to stop people complaining!

Something that's cropped up a few times over the summer so far is people trying to repair boot page corruptions.

First off, what's a boot page? Every database has a single page that stores critical information about the database itself. It's always page 9 in file 1 (the first file in the PRIMARY filegroup). You can examine the page using DBCC PAGE and it will interpret all the fields for you, but there's another command, DBCC DBINFO, that also dumps all this info (in fact the DBCC PAGE code calls the same underlying dumping code). This command is undocumented and unsupported but widely known and 'documented' in lots of places on the web - given that it uses the same code as DBCC PAGE, it's just as safe to use IMHO.

So what's on the boot page?

DBCC DBINFO ('BootPageTest');
GO

DBINFO STRUCTURE:


DBINFO @0x5BF6EF84

dbi_dbid = 19                        dbi_status = 65536                   dbi_nextid = 2073058421
dbi_dbname = BootPageTest            dbi_maxDbTimestamp = 2000            dbi_version = 611
dbi_createVersion = 611              dbi_ESVersion = 0                   
dbi_nextseqnum = 1900-01-01 00:00:00.000                                  dbi_crdate = 2008-07-10 15:53:18.843
dbi_filegeneration = 0              
dbi_checkptLSN

m_fSeqNo = 41                        m_blockOffset = 29                   m_slotId = 55
dbi_RebuildLogs = 0                  dbi_dbccFlags = 2                   
dbi_dbccLastKnownGood = 1900-01-01 00:00:00.000                          
dbi_dbbackupLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_oldestBackupXactLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_LastLogBackupTime = 1900-01-01 00:00:00.000                          
dbi_differentialBaseLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_createIndexLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_versionChangeLSN

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_familyGUID = a4e88c13-b4cf-4320-834e-92b237244d4b                    
dbi_recoveryForkNameStack


entry 0

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
m_guid = a4e88c13-b4cf-4320-834e-92b237244d4b                            

entry 1

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
m_guid = 00000000-0000-0000-0000-000000000000                            
dbi_differentialBaseGuid = 00000000-0000-0000-0000-000000000000           dbi_firstSysIndexes = 0001:00000014
dbi_collation = 872468488            dbi_category = 0                     dbi_maxLogSpaceUsed = 231936
dbi_localState = 0                   dbi_roleSequence = 0                
dbi_failoverLsn

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_dbmRedoLsn

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0

dbi_dbmOldestXactLsn

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_dbMirrorId = 00000000-0000-0000-0000-000000000000                    
dbi_pageUndoLsn

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_disabledSequence = 0            
dbi_dvSplitPoint

m_fSeqNo = 0                         m_blockOffset = 0                    m_slotId = 0
dbi_CloneCpuCount = 0                dbi_CloneMemorySize = 0             
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

There's all kinds on interesting things in there, for instance:

  • dbi_version and dbi_createversion: the physical version number of the database (and when it was created). See question 1 in the August 2008 SQL Q&A column in TechNet Magazine for an explanation (see here).
  • dbi_RebuildLogs: a count of the number of times the transaction log has been rebuilt for the database. PSS can use this to tell whether corruption problems could have been caused by DBAs rebuilding the log
  • dbi_dbccLastKnownGood: the completion time of the last 'clean' run of DBCC CHECKDB
  • a bunch of different LSNs related to checkpoint, backups, database mirroring
  • dbi_LastLogBackupTime: self-explanatory
  • dbi_differentialBaseGuid: the GUID generated by the last full database backup. Differential backups can only be restored on top of a matching full backup - so an out-of-band full backup could screw-up your disaster recovery - see this blog post for more info.

Now, what about if this page is corrupt in some way? I corrupted the BootPageTest database to have a corrupt boot page. Let's see what happens:

USE BootPagetest;
GO

Msg 913, Level 16, State 4, Line 1
Could not find database ID 19. Database may not be activated yet or may be in transition. Reissue the query once the database is available. If you do not think this error is due to a database that is transitioning its state and this error continues to occur, contact your primary support provider. Please have available for review the Microsoft SQL Server error log and any additional information relevant to the circumstances when the error occurred.

Okay - let's try setting the database into EMERGENCY mode:

ALTER DATABASE BootPageTest SET EMERGENCY;
GO

Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0xcdee22fa; actual: 0xcb6ea2fa). It occurred during a read of page (1:9) in database ID 19 at offset 0x00000000012000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\BootPageTest.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Hmm. What about running DBCC CHECKDB?

DBCC CHECKDB ('BootPageTest') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO

Msg 922, Level 14, State 1, Line 1
Database 'BootPageTest' is being recovered. Waiting until recovery is finished.

It's not looking good. Obviously the change to EMERGENCY mode couldn't complete properly. What's the database status?

SELECT [state_desc] FROM sys.databases WHERE [name] = 'BootPageTest';
GO

state_desc
------------------------------------------------------------
RECOVERY_PENDING

The boot page is inaccessible so in effect the database is inaccessible too, this is what the database state means in this case.

So what does this mean? If the boot page is corrupt, you can't run DBCC CHECKDB so you can't possibly run repair, and you can't put the database into EMERGENCY mode so you can't extract data into a new database. It means that there's NO WAY to recover from a corrupt boot page EXCEPT to restore from backups. One more reason to have backups...

Thursday, July 10, 2008 4:01:42 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 
Tuesday, July 08, 2008

Well, we're just back from vacation (photo blog post to follow) and I've heard that the feature article on Effective Database Maintenance I wrote for the August issue of TechNet Magazine is live on the web. It also includes a 5 minute long screencast I recorded where I demo the effect of database shrink on index fragmentation.

You can get to the article at http://technet.microsoft.com/en-us/magazine/cc671165.aspx. The topics covered are:

  • Managing data and transaction log files
  • Eliminating index fragmentation
  • Ensuring accurate, up-to-date statistics
  • Detected corrupted database pages
  • Establishing an effective backup strategy

It's written around 2-300 level and presents a good overview (well, at least I think so :-)) of the concepts involved.

Also, the August SQL Q&A column is available at http://technet.microsoft.com/en-us/magazine/cc671180(TechNet.10).aspx. This month's topics on the web (more in the print magazine) are:

  • Database version changes with upgrades
  • Benefits of partitioning
  • Consistency checking options for VLDBs

Enjoy!

Tuesday, July 08, 2008 9:35:10 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Wednesday, June 11, 2008

Today I presented my brand new session Surviving Corruption: From Detection to Recovery at TechEd. I had a lot of fun putting together the demos, presenting the session, and talking to people afterwards. During the session, I promised to blog each of the demos so that everyone can run through them - here's the first one.

On SQL 2000, it was pretty easy to get into the system tables and manually change them - all you had to do was:

EXEC sp_configure 'allow updates', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO

And then you could insert, update, and delete whatever you wanted in the all the system tables, including the critical three - sysindexes, sysobjects, and syscolumns. The problem was that sometimes people actually did this and messed things up - for instance, by manually deleting an object from sysobjects, but leaving around all the other info about the object - such as indexes and columns. DBCC CHECKCATALOG in SQL 2000 would find this, but DBCC CHECKDB would not - as it didn't run the DBCC CHECKCATALOG code - any most people do not run DBCC CHECKCATALOG at all. Many times now, I've seen databases upgraded to 2005 and suddenly DBCC CHECKDB is reporting metadata corruption errors - all because someone had manually changed the system tables on 2000, and I changed DBCC CHECKDB in 2005 to include the DBCC CHECKCATALOG checks.

This demo is all about that. I created a 2000 database, manually deleted a row in sysobjects and then upgraded the database to 2005. The corrupt database is available in a zip file - DemoCorruptMetadata.zip. If you unzip it into a folder C:\SQLskills then you can attach it using:

RESTORE DATABASE DemoCorruptMetadata FROM DISK = 'C:\SQLskills\DemoCorruptMetadata.bak'
   
WITH MOVE 'DemoCorruptMetadata' TO 'C:\SQLskills\DemoCorruptMetadata.mdf',
   
MOVE 'DemoCorruptMetadata_log' TO 'C:\SQLskills\DemoCorruptMetadata_log.ldf',
   
REPLACE;
GO

So what does the corruption look like on 2005?

DBCC CHECKDB (DemoCorruptMetadata) WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3853, State 1: Attribute (object_id=1977058079) of row (object_id=1977058079,column_id=1) in sys.columns does not have a matching row (object_id=1977058079) in sys.objects.
Msg 8992, Level 16, State 1, Line 1
Check Catalog Msg 3853, State 1: Attribute (object_id=1977058079) of row (object_id=1977058079,column_id=2) in sys.columns does not have a matching row (object_id=1977058079) in sys.objects.
CHECKDB found 0 allocation errors and 2 consistency errors not associated with any single object.
CHECKDB found 0 allocation errors and 2 consistency errors in database 'DemoCorruptMetadata'.

This is what we expect. Notice that there's no recommended repair level at the end of the output - this is because CHECKDB can't repair metadata corruptions. We can't fix this with a backup - unless we have a backup from 2000 from before the manual delete in the system tables. To fix this we'd need to go back to 2000, fix the corruption, and then upgrade again - usually not feasible.

Instead, we're going to fix it by manually altering the system tables in 2005 - something that's purportedly not possible. First let's see what tables there are that could include column information (remembering that the system catalogs were completely rewritten between 2000 and 2005):

SELECT [name] FROM DemoCorruptMetadata.sys.objects WHERE [name] LIKE '%col%';
GO

name
------------------
sysrowsetcolumns
syshobtcolumns
syscolpars
sysiscols

I know that sysrowsetcolumns and syshobtcolumns are involved at low-levels of the Storage Engine and don't contain relational metadata, so let's try syscolpars. I want to see what columns there are to see if one of the looks like an object ID, and another looks like a column ID. This query will just return the table columns, with no rows (because the condition 1=0 is always false:

SELECT * FROM DemoCorruptMetadata.sys.syscolpars WHERE 1 = 0;
GO

Msg 208, Level 16, State 1, Line 1
Invalid object name 'DemoCorruptMetadata.sys.syscolpars'.

I can't bind to internal system tables in 2005. But - I can bind to internal system tables using the Dedicated Admind Connection (or DAC for short). This is documented in Books Online at http://msdn.microsoft.com/en-us/library/ms179503.aspx. You can get to the DAC through SQLCMD using the /A switch. So - assuming I'm now connected through the DAC, I'll try that command again:

C:\Documents and Settings\paul>sqlcmd /A
1> USE DemoCorruptMetadata;
2> GO
Changed database context to 'DemoCorruptMetadata'.
1> SELECT * FROM sys.syscolpars WHERE 1=0;
2> GO
id          number colid       name

xtype utype       length prec scale collationid status      maxinrow xmlns
 dflt        chk         idtval

----------- ------ ----------- -------------------------------------------------
-------------------------------------------------------------------------------
----- ----------- ------ ---- ----- ----------- ----------- -------- -----------
 ----------- ----------- -------------------------------------------------------
-----------

(0 rows affected)
1>

This looks like the table. Now I'll query against it using the object ID from the original corruption message:

1> SELECT colid, name FROM sys.syscolpars WHERE id = 1977058079;
2> GO
colid       name
----------- --------------------------------------------------------------------
------------------------------------------------------------
          1 SalesID
         
2 CustomerID
(2 rows affected)
1>

Cool. So I'll try deleting the orphaned columns:

1> DELETE FROM sys.syscolpars WHERE id = 1977058079;
2> GO
Msg 259, Level 16, State 1, Server ROADRUNNERPR, Line 1
Ad hoc updates to system catalogs are not allowed.
1>

Hmm. And it doesn't help if I set 'allow updates' to 1, or try putting the database into single-user mode.

There IS a way though. You can put the SERVER into single-user mode, then connect with the DAC and you can then update the system tables. This particular twist on using the DAC isn't documented anywhere except in an MSDN forum thread answered by someone from Microsoft (see here).

BEWARE (if I could put little flashing lights around this too then I would...) that this is undocumented and unsupported - misuse will lead to unrepairable corruption of your databases.

The sequence of events to follow is:

  • make a backup of the database just in case something goes wrong
  • shutdown the server
  • go to the binaries directory (e.g. C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Binn) and start the server in single-user mode using 'sqlservr -m'
  • connect back in using SQLCMD /A, and run the deleta again. This time it will work, but will give an error about metadata cache consistency:

C:\Documents and Settings\paul>sqlcmd /A
1> USE DemoCorruptMetadata;
2> GO
Changed database context to 'DemoCorruptMetadata'.
1> DELETE FROM sys.syscolpars WHERE id = 1977058079;
2> GO

(2 rows affected)
Warning: System table ID 41 has been updated directly in database ID 12 and cache coherence may not have been maintained. SQL Server should be restarted.
1>

  • The system table has been updated, but the in-memory cache of metadata is now out-of-sync with the system tables. So, shutdown the server again as the message suggests and restart it normally
  • run CHECKDB again and you'll see the corruption has been fixed.

Hope this helps some of you. Watch this space for the next demo from TechEd of repairing corruption when no backup is available.

Wednesday, June 11, 2008 5:42:23 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 
Monday, June 09, 2008

(I'm actually on-stage here at TechEd doing the  DAT track pre-con with Kimberly - she's on now until lunch so I'm catching up on forum problems...)

Here's a question that came up on of the SQLServerCentral.com corruption forums I monitor that I think is worth blogging about. To paraphrase:

I have a bunch of corruptions in a database, that look like they've been there for a while. Repair is my only option - it works but I'd like to know what data is being deleted. How can I do that? Here are some of the errors:

Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168576) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168577) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168578) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168579) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168580) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168581) could not be processed. See other errors for details.
Server: Msg 8928, Level 16, State 1, Line 2
Object ID 645577338, index ID 0: Page (1:168582) could not be processed. See other errors for details.
Server: Msg 8976, Level 16, State 1, Line 2
Table error: Object ID 645577338, index ID 1. Page (1:168576) was not seen in the scan although its parent (1:165809) and previous (1:168575) refer to it. Check any previous errors.
Server: Msg 8978, Level 16, State 1, Line 2
Table error: Object ID 645577338, index ID 1. Page (1:168583) is missing a reference from previous page (1:168582). Possible chain linkage problem.

This is a clustered index that CHECKDB  will repair by deleting pages at the leaf-level - essentially deleting a bunch of records. The pages look to be trashed (there were a bunch more errors that I didn't include here that said the page headers were all corrupted - looked like the IO subsystem trashde a whole 64KB chunk of the disk) so there's nothing much else you can do. As the table has a clustered index, you can use the error messages to find the pages on either 'logical' side of the pages being deleted - and hence figure out the range of records that have been deleted.

The errors show that pages 168576 through 168582 in file 1 are corrupt. There are also errors that say the previous page of 168576 is 168575, and the next page of 168582 is 168583. If you do a DBCC PAGE of these two pages, you can find the lower and upper bound of the clustered index key values that have been lost. Think of three ranges:

  • the lower range of records that are intact, logically before the corrupt pages in the index
  • the range of records that will be deleted by repair
  • the upper range of records that are intact, logically after the corrupt pages in the index

To find the upper bound of the lower range:

DBCC TRACEON (3604); -- allows the output to come to the console
DBCC PAGE ('dbname', 1, 168575, 3);
GO

The key value in the slot at the end of output is the upper bound of the bottom range that's intact.

Then do:

DBCC PAGE ('dbname', 1, 168583, 3);
GO

The key value in the slot at the beginning of the output is the lower bound of the upper range that's intact.

Everything in the middle will be deleted. You could also try a DBCC PAGE on the corrupt pages themselves too - you might be able to see some data in them.

I'll be blogging a bunch more about repair after my corruption session this week at TechEd - watch this space!

Monday, June 09, 2008 7:54:32 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Thursday, May 22, 2008

Before I start, I want to make it clear that you can only hit this bug if you ALREADY have corruption, that it's quite rare, and that there is a workaround.

I've noticed a few more people in the forums having CHECKDB fail with this particular error in the last month

Msg 8967, Level 16, State 216, Line 1
An internal error occured in DBCC which prevented further processing. Please contact Product Support.

instead of completing properly and listing the corruptions in the database.

Whenever CHECKDB is using a database snapshot, it must check that the page it read through the snapshot does not have an LSN (Log Sequence Number) higher than that when the snapshot was created. If it did, this would mean that the page was modified AFTER the snapshot was created and hence CHECKDB would be working from an inconsistent view of the database. If this case is discovered, CHECKDB stops immediately. When I rewrote CHECKDB for SQL Server 2005, I changed a bunch of code assertions into seperate states of the 8967 error, so that CHECKDB would fail gracefully if some condition occured that indicates a bug or something that should never happen. State 216 is for the bad LSN condition I've just described.

I used to think it was caused by a race condition with the NTFS code that implements sparse files, which is used by the hidden database snapshot that CHECKDB uses by default. However, I've come to learn that this is a bug in CHECKDB (not one of mine I should say :-)) that causes this behavior under certain circumstances when corruption is present. The bug is that if a corrupt page fails auditing inside CHECKDB, the LSN check is still performed. If the corruption affects the LSN stamped in the page header, the 8967 error could be triggered. I've seen this a handful of times in the last few weeks - hence the need for a blog post. I've discussed this with the dev team and hopefully the fix will make it into the next SPs for 2005 and 2008 (too late to fix such a rare problem in such a critical component at this stage of 2008 development). They're going to put a KB article together too - but in the meantime, I wanted to get this on the Internet so Google/Live Search pick it up.

Now let's repro the problem. Starting with a simple database and table, I'll find the first page so I can corrupt it.

CREATE DATABASE TestCheckdbBug;
GO
USE TestCheckdbBug;
GO
CREATE TABLE test (c1 INT, c2 CHAR (5000));
INSERT INTO test VALUES (1, 'a');
GO
EXEC sp_AllocationMetadata 'test';
GO

Object Name  Index ID  Alloc Unit ID      Alloc Unit Type  First Page  Root Page  First IAM Page
-----------  --------  -----------------  ---------------  ----------  ---------  --------------
test         0         72057594042318848  IN_ROW_DATA      (1:143)     (0:0)      (1:152)

Now I'm going to corrupt the page type on page (1:143) to be 255 (an invalid page type), which will guarantee the page fails the audit inside CHECKDB.

DBCC CHECKDB ('TestCheckdbBug') WITH ALL_ERRORMSGS, NO_INFOMSGS;
GO

Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:143) could not be processed. See other errors for details.
Msg 8939, Level 16, State 6, Line 1
Table error: Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data), page (1:143). Test ((m_type >= DATA_PAGE && m_type <= UNDOFILE_HEADER_PAGE) || (m_type == UNKNOWN_PAGE && level == BASIC_HEADER)) failed. Values are 255 and 255.
CHECKDB found 0 allocation errors and 2 consistency errors in table 'test' (object ID 2073058421).
CHECKDB found 0 allocation errors and 2 consistency errors in database 'TestCheckdbBug'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (TestCheckdbBug).

Now I'm going to corrupt the LSN on that page such that it's guaranteed to be higher than the creation LSN of the database snapshot (basically by filling the first part of the page header LSN field with 0xFF).

DBCC CHECKDB ('TestCheckdbBug') WITH ALL_ERRORMSGS, NO_INFOMSGS;
GO

Msg 8967, Level 16, State 216, Line 1
An internal error occurred in DBCC that prevented further processing. Contact Customer Support Services.
Msg 8921, Level 16, State 1, Line 1
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.

Bingo! And in the error log, there's some diagnostic information so we can tell which page caused the problem:

2008-05-22 14:55:01.95 spid53   DBCC encountered a page with an LSN greater than the current end of log LSN (31:0:1) for its internal database snapshot. Could not read page (1:143), database 'TestCheckdbBug' (database ID 15), LSN = (-1:65535:18), type = 255, isInSparseFile = 0. Please re-run this DBCC command.
2008-05-22 14:55:01.95 spid53   DBCC CHECKDB (TestCheckdbBug) WITH all_errormsgs, no_infomsgs executed by ROADRUNNERPR\paul terminated abnormally due to error state 1. Elapsed time: 0 hours 0 minutes 0 seconds.

Note the page ID (in black bold above) tells us the bad page and the LSN (in blue bold above) reflects the corruption that I caused. If the page ID field of the header was corrupt, it wouldn't be possible to tell from these diagnostics which page is corrupt.

However, all is not lost. This bug means that under these circumstances the default online behavior of CHECKDB can't run. The workaround is to use the WITH TABLOCK option of CHECKDB, which does offline checking and doesn't need the snapshot - but the trade-off is that an exclusive database lock is required for a short time and then shared table locks for all tables in the database (this is why online is the default). Running this option on my corrupt database gives:

DBCC CHECKDB ('TestCheckdbBug') WITH TABLOCK, ALL_ERRORMSGS, NO_INFOMSGS;
GO

Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:143) could not be processed. See other errors for details.
Msg 8939, Level 16, State 6, Line 1
Table error: Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data), page (1:143). Test ((m_type >= DATA_PAGE && m_type <= UNDOFILE_HEADER_PAGE) || (m_type == UNKNOWN_PAGE && level == BASIC_HEADER)) failed. Values are 255 and 255.
CHECKDB found 0 allocation errors and 2 consistency errors in table 'test' (object ID 2073058421).
CHECKDB found 0 allocation errors and 2 consistency errors in database 'TestCheckdbBug'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (TestCheckdbBug).

Which are the exact same results we had before I corrupted the LSN field (this is expected, as there is no check of a page's LSN field EXCEPT when running from a database snapshot). Now we can proceed to restore/repair as appropriate.

So - a scary little bug that has caused some people headaches, but I want to stress again - this can only happen if the database is ALREADY corrupt, and that it's quite rare. Hope this helps some of you picking this up from search engines in the future.

Thursday, May 22, 2008 2:20:58 PM (Pacific Standard Time, UTC-08:00)  #    Comments [1]  | 
Monday, May 19, 2008

My first magazine article is in print! I've taken over the bi-monthly SQL Q&A column for TechNet Magazine and I just received the June magazine in the mail today with my first column in it. Topics covered are:

  • Creating corruption and using page checksums
  • The shrink-grow-shrink-grow trap
  • How many databases can be mirrored per instance
  • A tip on changing the default server port, from Jens Suessmeyer

I've also just completed a feature article for either the July or August issue dealing with database maintenance for the 'involuntary' DBA - more details when it gets published.

If you don't get the print version of TechNet Magazine, you can get to this month's SQL Q&A column at http://technet.microsoft.com/en-us/magazine/cc510328.aspx. There may not be anything new if you've been following my blog for a while, but if you've just started, it's worth a quick look.

Enjoy!

PS Let me know if you've got any good questions - I've already completed the August column but I'd like to hear of any questions you may have for later columns.

Monday, May 19, 2008 11:06:59 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4]  | 
Saturday, April 19, 2008

Many times I've been asked to do a blog post about creating Agent alerts, and given that today I demo'd it as part of our Accidental DBA workshop at Connections, it seemed a good time to do the blog post too!

I demo this in the context of alerting a DBA when an 823 or 824 IO error occurs. One of my early blog posts (see here) explains what these are, as well as providing a corrupt database that you can use to see these errors happening.

The idea is that I want to know as soon as an IO error occurs so I can start recovering and take preventative action to stop it happening again. I don't want to rely on users telling me when a query hits an IO error, and I don't want to have to scan the SQL error logs to find them. So I'm going to create an alert.

The first step is to fire up Management Studio and make sure SQL Server Agent is running. Next we need to make there's actually an Operator defined - so the new alert has someone to actually alert! - so we'll use the New Operator wizard (see below for how to get there).

In the New Operator Wizard that appears, I've created an operator named 'SysAdmin'. There are three Notification Options you can use - email, net send, and pager. I've setup SysAdmin to use net send to my local machine. You need to make sure the Messenger service is enabled otherwise net send will not work. Also, be aware the net sends will NOT work unless the machine has a network connection - even if the net send source and destination are the same machine! Given the various issues with net send, it's better to use email or pager alerts - but for the purposes of this blog post its the easiest option.

Now let's create the new alert - using the New Alert wizard (see the below for how to get there).

In the New Alert Wizard that appears, I've created an alert named 'IO Errors' for all severity 24 errors on all databases. Below is a portion of the General tab of the wizard showing these settings:

I also need to specify what happens. In the Response tab of the wizard I've set the SysAdmin operator to be notified using net send. Again, see below.

In the Options tab I've checked the box to include the error text in the net send.

Now let's test it. Using the database called 'broken' that I provide as an example (see here), I'll force an IO error to occur. In my query window I get:

SELECT * from broken..brokentable;
GO

Msg 824, Level 24, State 2, Line 1

SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x7232c940; actual: 0x720e4940). It occurred during a read of page (1:143) in database ID 10 at offset 0x0000000011e000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\broken.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

And a few seconds later I get the net send:

Pretty cool!

You can also use the WMI Provider to do this - see Creating a SQL Server Agent Alert by Using the WMI Provider for Server Events.

Saturday, April 19, 2008 7:05:23 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Friday, March 21, 2008

This is a really interesting question that came up in the Microsoft Certified Architect class I'm teaching at present - if a database has torn-page protection enabled, and page checksums are enabled, is all the existing torn-page detection lost?

This is an important question, because enabling page checksums doesn't suddenly make all allocated pages be protected by page checksums (it's not until a page is read into the buffer pool, modified, and then written back to disk, that it gets a page checksum). If all the existing torn-page protection is discarded when page checksums are enabled, then the pages would be unprotected until they got page checksums on. I couldn't remember the answer, so I experimented!

My idea was to create a database with torn-page protection, create a table with a simulated torn-page in it, then enable page checksums and see if the torn-page was still reported.

-- Create the test database
USE master;
GO
CREATE DATABASE ChecksumTest;
GO
USE ChecksumTest;
GO

-- Explicitly set the database to have torn-page detection
ALTER DATABASE ChecksumTest SET PAGE_VERIFY TORN_PAGE_DETECTION;
GO

-- Create a test table and insert a row.
CREATE TABLE BrokenTable (c1 INT, c2 CHAR (1000));
INSERT INTO BrokenTable VALUES (1, 'a');
GO

-- Ensure the page is written to disk and then tossed from the buffer pool
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO

Now I'm going to examine the page. There are two bits in the page header that specify whether the page is protected by torn-page detection or with a page checksum. Specifically, the m_flagBits field will have 0x100 set if the page is encoded for torn-page protection, and 0x200 set if the page has a page-checksum stored on it, and the page has not been modified (i.e. the checksum is stillvalid). You should not see the 0x100 bit set as torn-page encoding is removed when the page is read into the buffer pool - UNLESS the page IS actually torn, in which case the encoding is NOT removed.

sp_allocationmetadata 'BrokenTable';
GO
DBCC TRACEON (3604);
GO
DBCC PAGE ('ChecksumTest', 1, 143, 3);
GO

<snip>

m_pageId = (1:143)                   m_headerVersion = 1                  m_type = 1
m_typeFlagBits = 0x4                 m_level = 0                          m_flagBits = 0x8000
m_objId (AllocUnitId.idObj) = 67     m_indexId (AllocUnitId.idInd) = 256 
Metadata: AllocUnitId = 72057594042318848                                
Metadata: PartitionId = 72057594038321152                                 Metadata: IndexId = 0
Metadata: ObjectId = 2073058421      m_prevPage = (0:0)                   m_nextPage = (0:0)
pminlen = 1008                       m_slotCnt = 2                        m_freeCnt = 6070
m_freeData = 2118                    m_reservedCnt = 0                    m_lsn = (28:183:2)
m_xactReserved = 0                   m_xdesId = (0:0)                     m_ghostRecCnt = 0
m_tornBits = 770
      

<snip>     

In this case the torn-page encoding has been removed, and the page is fine. Once I've corrupted the page on disk, it's tricky to be able to see it with DBCC PAGE. I managed to catch it once and saw the following:

m_pageId = (1:143)                   m_headerVersion = 1                  m_type = 1
m_typeFlagBits = 0x4                 m_level = 0                          m_flagBits = 0x8100
m_objId (AllocUnitId.idObj) = 67     m_indexId (AllocUnitId.idInd) = 256 
Metadata: AllocUnitId = 72057594042318848                                
Metadata: PartitionId = 72057594038321152                                 Metadata: IndexId = 0
Metadata: ObjectId = 2073058421      m_prevPage = (0:0)                   m_nextPage = (0:0)
pminlen = 1008                       m_slotCnt = 1                        m_freeCnt = 7083
m_freeData = 1107                    m_reservedCnt = 0                    m_lsn = (28:81:20)
m_xactReserved = 0                   m_xdesId = (0:0)                     m_ghostRecCnt = 0
m_tornBits = 41949233

Now if I try to select from the table I get:         

SELECT * FROM BrokenTable;
GO

Msg 824, Level 24, State 2, Line 1

SQL Server detected a logical consistency-based I/O error: torn page (expected signature: 0xaaaaaaaa; actual signature: 0xaaaaa82a). It occurred during a read of page (1:143) in database ID 8 at offset 0x0000000011e000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\ChecksumTest.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

The crux of the question is whether this will still be reported if the database switches to page checksums - let's try:

ALTER DATABASE checksumtest SET PAGE_VERIFY CHECKSUM;
GO

SELECT * FROM BrokenTable;
GO

Msg 824, Level 24, State 2, Line 1

SQL Server detected a logical consistency-based I/O error: torn page (expected signature: 0xaaaaaaaa; actual signature: 0xaaaaa82a). It occurred during a read of page (1:143) in database ID 8 at offset 0x0000000011e000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\ChecksumTest.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Cool! The answer is YES - the torn-page is still detected, because the bit in the page header specifies which page protection algorithm the page is using. In fact, it even works if you turn off page checksums and torn-page detection completely.

Friday, March 21, 2008 3:23:07 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Thursday, March 13, 2008

This came up several times during the week so I thought it was about time to blog about it. One of the new features we put into SQL Server 2005 was storing the last time that DBCC CHECKDB completed successfully (called the last-known good time). What does successfully mean? This is the catch - if DBCC CHECKDB runs to completion then it considers that a successful run - EVEN if it found some corruptions. However, the last-known good time is ONLY updated if there were NO corruptions found. Slightly confusing I know.

Cool - but how can you see it? Well, the only time it's ever reported is when the database starts up. Not too useful if the database has been running for months. So how to see it??? The trick is to use DBCC PAGE. The last-known good time is stored in the boot page of the database - page 9. The following code will dump page 9 for you:

-- you need this to get the DBCC PAGE output to your console
DBCC TRACEON (3604);
GO

-- page 9 is the boot page
DBCC PAGE (dbname, 1, 9, 3);
GO

You need to look for the dbi_dbccLastKnownGood field. That was the last time that DBCC CHECKDB ran without finding any corruptions.

Now - what about if you're trusting your SQL Agent jobs to run DBCC CHECKDB for you every so often (how often is a whole other discussion...) and relying on the Agent job failing if DBCC CHECKDB finds corruptions. Well, if all you do is run the DBCC command, you're never going to know about corruption unless DBCC CHECKDB itself fails for some reason - remember, it returns successfully even it it found corruptions.The key is to add another statement after you run DBCC that checks the value of @@ERROR (the last error code reported by SQL Server). Although DBCC CHECKDB doesn't stop with an error when it finds corruption, it will set @@ERROR if there are any error messages in its output.

Edit: The stuff I've struck-through above is not true. I thought I'd remembered it from testing previously and I've heard plenty of anecdotal evidence from customers too BUT an Agent job WILL fail if a DBCC CHECKDB within it fails. I wen