CHECKDB From Every Angle: Complete description of all CHECKDB stages

On the Storage Engine blog last year I started two series that I got distracted from – one on DBCC CHECKDB and one on fragmentation. With the benefit of hindsight, I’m going to start 3 series on my new blog here – one on DBCC CHECKDB (‘CHECKDB from every angle’), one on indexes (‘Indexes from every angle’), and one on internals (‘Inside the Storage Engine’). The first few posts of each will be updated reposts of a few from the previous blog, just for completeness.

I realize that I promised to finalize the DBCC whitepaper this summer – well, that’s been delayed a little by the wedding and other stuff. Once I’ve done a few blog posts in this series, I’ll tie everything up into a PDF and add a link to it. This post is a good start – 6500 words. Phew.

First up is describing just what DBCC CHECKDB does. As with all things related to DBCC, this topic has its share of misinformation. In this post I’ll set the record straight by running through all the stages of DBCC CHECKDB in SQL Server 2000 and 2005. This was originally split into 5 posts but I’ll bring them all together today, add some more info, and make things a little clearer. There are a lot of terms in this post that may be new to some of you – I’ll cover them in the internals series over time.

1. Get a transactionally consistent view of the database

DBCC CHECKDB needs a consistent view of the database. Why? Well, usually it’s running on a live database with in-flight transactions. It needs to read and analyze the whole database, but it can’t do that instantaneously, so it has to take steps to ensure that what it reads is transactionally consistent.

Here’s an example. Consider a transaction to insert a record into a table that is a heap and has a non-clustered index, with a concurrent DBCC CHECKDB that doesn’t enforce a consistent view. The table record is inserted first, and then the non-clustered index record is inserted (that’s just the way the operations are split up in the database engine). Because this hypothetical DBCC CHECKDB doesn’t have a consistent view, it could read the record in the table but not that in the index, conclude that the non-clustered index is out of sync with the table and flag an 8951 missing-index-record error.

How could this happen? Depending on the order in which the pages are read by the DBCC CHECKDB process, the page on which the new non-clustered index record should go could be read before the page on which the new heap record should go. If the index page read happens before the insert transaction, and the table page read happens after the insert transaction, then we see the inconsistent state.
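
The race above can be sketched in a few lines of Python. This is a toy model (hypothetical page structures, not SQL Server code) just to show how an inconsistent pair of reads manufactures a phantom 8951:

```python
# Toy model: a CHECKDB-like scan with no consistent view reads the index page
# *before* an insert commits and the heap page *after* it, so the structures
# look out of sync even though the database is perfectly healthy.

heap_page = set()       # rows on the table (heap) page
index_page = set()      # rows on the non-clustered index page

def insert_row(key):
    # The engine inserts the heap record first, then the index record.
    heap_page.add(key)
    index_page.add(key)

seen_index = set(index_page)   # scan reads the index page first...
insert_row(42)                 # ...a concurrent transaction commits...
seen_heap = set(heap_page)     # ...then the scan reads the heap page

# Cross-check: every heap row should have a matching index row.
missing = seen_heap - seen_index
# missing == {42}: a false missing-index-record (8951-style) error
```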

The easy way to get the consistent state is through locking, which is what SQL Server 7.0 did. You can still do that in SQL Server 2000 and 2005 using the WITH TABLOCK option. Another way to do it is to put the database into single-user or read-only mode. However, locking causes workload blocking, and setting a database to read-only or single-user mode essentially takes it (and your application) offline.

For SQL Server 2000 we came up with a neat way to get the consistent view and be able to run DBCC CHECKDB online – post-read log analysis. In a nutshell, after we’ve read through all the database pages, we read the transaction log to make sure we didn’t miss any changes that occurred while the full-database read was happening. This is more complex than it sounds – here’s what it does:

  • When DBCC CHECKDB starts, it turns on ‘replication-style’ logging – where everything is fully logged – and prevents log truncation. Even if you’re in full recovery mode and you’re taking log backups – the log will not truncate until DBCC CHECKDB releases it. This can be a problem in high-volume systems with limited log space.
  • Once DBCC CHECKDB has read through all the allocated pages in the database, it reads through all the active transaction log – from the LSN of the ‘BEGIN TRAN’ log record of the oldest transaction that is active at the time the database scan started, to the LSN at the time the database scan stops.
  • Log records from transactions that commit during that time are used to generate REDO facts. (A DBCC CHECKDB ‘fact’ is a piece of information about something in the database – e.g. page (1:345) is allocated to IAM chain 865398993 – I’ll explain how and why we use these later in the series.) A REDO fact either reinforces something we’ve already seen (in which case it is ignored) or provides information on something we haven’t seen. For example:
    • A page allocation log record would produce a REDO fact of ‘page X is allocated to IAM chain Y’
    • A row insertion record (such as from the index example above) would produce a REDO fact of ‘a row with these index keys was inserted into page A of table B, index C at slot position S’
  • Log records from transactions that roll back or don’t commit during that time are used to generate UNDO facts. An UNDO fact either cancels something that we’ve already seen (e.g. the first half of the index example above, if it didn’t commit while DBCC CHECKDB was doing the database scan) or references something we haven’t seen (in which case we ignore it). For example:
    • A page allocation log record would produce an UNDO fact of ‘page X was deallocated from IAM chain Y’. So, if we’d seen the actual page after it was allocated, we would have generated a ‘page X is allocated to IAM chain Y’ fact, and so this UNDO fact would cancel out that piece of information.
    • A row insert record would produce an UNDO fact of ‘a row with these index keys was removed from page A of table B, index C at slot position S’
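
The REDO/UNDO fact mechanism above can be modeled as a multiset of facts that log analysis adjusts after the database scan. Here is a hedged Python sketch (the fact tuples are invented for illustration; the real fact format is internal to DBCC):

```python
from collections import Counter

# Toy model of fact reconciliation: the database scan collects facts, then
# log analysis applies REDO facts (fill in what the scan missed) and UNDO
# facts (cancel what the scan saw from uncommitted transactions).
facts = Counter()

def scan_saw(fact):
    facts[fact] += 1

def redo(fact):
    # Reinforces something already seen (ignored) or supplies something missed.
    if facts[fact] == 0:
        facts[fact] += 1

def undo(fact):
    # Cancels something already seen; ignored if the scan never saw it.
    if facts[fact] > 0:
        facts[fact] -= 1

# The scan saw a heap row whose insert later rolled back...
scan_saw(('row inserted', 'page A', 'key 42'))
undo(('row inserted', 'page A', 'key 42'))     # cancelled out

# ...and missed an index row whose insert committed after its page was read.
redo(('row inserted', 'page B', 'key 42'))     # supplied by log analysis
# Net result: the facts are consistent, so no spurious corruption is reported.
```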

As you may have realized, what we’re essentially doing is our own log recovery, inside DBCC CHECKDB, but without actually affecting the database. This can get excruciatingly complicated (e.g. having to generate UNDO facts from the compensation log records that wrap sections of a cancelled index rebuild transaction…). I spent too many days in 2000 working out what was going on in the log and making tweaks to this code. However, it worked really well and we finally had online DBCC CHECKDB. The kudos for writing most of that stuff goes to Steve Lindell – while he was busy writing the online code I was up to my eyes writing DBCC INDEXDEFRAG (another story).

There are some problems with this mechanism however.

  • As I mentioned above, the log is switched to everything being logged and is prevented from being truncated until DBCC CHECKDB has read and processed the active portion.
  • The log-reading part of DBCC CHECKDB is single-threaded. On a large, heavily-loaded multi-proc system, this can mean that the log-reading portion of the DBCC CHECKDB process takes a very long time as the single CPU struggles to catch up with all the other CPUs generating lots of transaction log.
  • There are a few, rare pathological cases where not absolutely everything I needed to work out what had happened was logged in the transaction log. This meant I essentially had to make an educated guess – and the guess isn’t always right. So – sometimes DBCC CHECKDB could give you a false indication of corruption. This led to a certain level of paranoia about DBCC CHECKDB in SQL Server 2000 – people sometimes ran it twice if any errors were reported the first time round.

Back in late 2000, it became apparent that with all the new features we were planning for SQL Server 2005, including some changes to the transaction log to allow for fast recovery and deferred transactions and stuff like versioning and online index build, the transaction log analysis wasn’t going to work any more. While it had given us the holy-grail of online consistency checks, with all the added complications of SQL Server 2005 features, it would become unfeasible to maintain and get right.

For SQL Server 2005, DBCC CHECKDB uses an internal database snapshot to provide the required transactional consistency. The in-depth details of database snapshots are beyond the scope of this post. A few things to be aware of:

  • Database snapshots use NTFS sparse-file technology (so that a database snapshot does not occupy the same amount of disk space as the source database)
  • Database snapshots use copy-on-write technology. A page is only copied from the source database to the database snapshot when it changes in the source database – and it only needs to be copied once – the first time it’s changed. The only pages stored in a database snapshot are those that have changed since the snapshot was created.
  • Database recovery is run when the snapshot is created, but recovery is run into the snapshot, not the source database. This means any pages that need to be changed as part of recovery are read from the source database, altered by recovery, and then written into the snapshot.
  • Books Online has some more info about their use by DBCC – look up ‘DBCC Statements’ in the index, and go to the ‘DBCC Internal Database Snapshot Usage’ section.
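
The copy-on-write behavior in the second bullet is easy to model. A toy Python sketch (dictionaries standing in for data files; not how the sparse files are actually organized):

```python
# Copy-on-write: a page is copied into the snapshot only on its *first*
# change in the source database after the snapshot is created.

source = {1: 'original-1', 2: 'original-2', 3: 'original-3'}
snapshot = {}                    # starts empty, like an NTFS sparse file

def write_page(page_id, new_contents):
    if page_id not in snapshot:              # first change since the snapshot?
        snapshot[page_id] = source[page_id]  # save the pre-change image once
    source[page_id] = new_contents

def snapshot_read(page_id):
    # Snapshot readers see the copied page if one exists, otherwise they
    # fall through to the (unchanged) page in the source database.
    return snapshot.get(page_id, source[page_id])

write_page(2, 'changed')
write_page(2, 'changed again')   # already copied: no second copy made

# snapshot_read(1) == 'original-1' (read from the source database)
# snapshot_read(2) == 'original-2' (pre-change image from the snapshot)
```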

By moving to database snapshots we changed to using mainline server code to get our transactionally consistent view, rather than our own version of the transaction log recovery code. This vastly reduced the complexity of the code and meant that someone else was responsible for finding and fixing its bugs during development.

There are a few slight gotchas (all documented) with this approach:

  • Sparse files are only available with NTFS so online checks can’t be run on databases stored on FAT or FAT32 volumes (in fact database snapshots cannot be created on these databases either)
  • Recovery cannot be run on tempdb, so online checks can’t be run on tempdb (DBCC CHECKDB automatically switches to locking in that case)

So when DBCC CHECKDB starts in SQL Server 2005, the first thing it does is work out whether it can run online – if so it creates a hidden database snapshot of the source database (i.e. DBCC CHECKDB‘s target database). As the database snapshot is hidden, there is no control over where the snapshot files are placed – in fact they are created as alternate streams of the files comprising the source database. That could cause a problem – depending on the transaction load concurrent with DBCC CHECKDB, the hidden database snapshot can grow in size. This means it’s possible for the disk to run out of space on high-volume systems, which means the database snapshot (or database files) cannot grow – bringing your workload and DBCC CHECKDB to a halt. If this does become a problem, you can easily create your own database snapshot and run DBCC CHECKDB on that – it’s exactly the same as running DBCC CHECKDB normally and letting it create its own database snapshot.

[Edit 6/19/14: From SQL Server 2014 onward, alternate streams are not used, although the database snapshot is created in the same location as the existing database.]

Once the database snapshot is created, DBCC CHECKDB is guaranteed a transactionally consistent view of the database and can run the various check algorithms against the database snapshot. Although by creating a database snapshot DBCC CHECKDB is checking the database as it was at some point in the past, that point is the start time of the DBCC CHECKDB, just as it was with the log analysis mechanism in SQL Server 2000 – so there’s no real difference.

2. Primitive checks of critical system tables

This stage is in SQL Server 2000 and 2005. First of all, what are critical system tables? These are the system tables that hold Storage Engine metadata. Without them, DBCC CHECKDB would have no idea where any data is stored in the database files or how to interpret records.

In SQL Server 2000, the critical system tables are:

  • sysindexes
  • sysobjects
  • syscolumns

These tables have to be checked first because DBCC CHECKDB uses the metadata they contain to access all the other tables and indexes in the database. These tables are freely queryable so poke about and see what’s stored in there. In SQL Server 2005 these table names still exist, but they’re actually views over new system tables, maintained for backwards compatibility.

In SQL Server 2005, the metadata layer has been rewritten and the critical system tables are:

  • sys.sysallocunits
  • sys.syshobts
  • sys.syshobtcolumns
  • sys.sysrowsets
  • sys.sysrowsetcolumns

In SQL Server 2008 R2 and higher, these are now just:

  • sys.sysallocunits
  • sys.sysrowsets
  • sys.sysrscols

More on allocation units, hobts, and rowsets in the internals series – for now you can assume they serve the same function as the three critical system tables in SQL Server 2000. You can’t see these new system tables because they’re ‘hidden’ – the parser won’t allow them to be bound to in a query (except in some DBCC commands like CHECKTABLE). Try running the following to see what I mean:

SELECT * FROM sys.sysallocunits;

The primitive checks are designed to check that internal queries on the metadata tables won’t throw errors. Each of the critical system tables has a clustered index. The primitive checks verify that the leaf-level data pages of the clustered indexes are valid. For every one of these pages, the following is done:

  • Read and latch the page (a latch is a lightweight internal version of a lock).  This makes sure that there aren’t any IO problems with the page such as a torn-page or bad page checksum and ensures that the page can be read into the buffer pool correctly. This is the most common cause of failure of the primitive system table checks and results in error 8966, which in SQL Server 2000 could look something like:
Server: Msg 8966, Level 16, State 1, Line 1
Could not read and latch page (1:33245) with latch type SH. sysobjects failed.
  • Audit the page. This is a series of checks of the page structures which I’ll cover in a separate post. If these pass, the page looks like a SQL Server page of the type it’s supposed to be.
  • Check the basic page linkage. Pages in each level of a clustered index are linked together in a doubly-linked list to allow range scans to work. At this stage we only check the left-to-right linkage to ensure the next page in the logical ordering actually exists.
  • Check the page linkage for loops. This is simple to do – have two pointers into the page linked-list, with one advancing at every step and one advancing at every second step. If they ever point to the same thing before the faster-advancing pointer reaches the right-hand side of the leaf level then there’s a loop. It’s important that there are no linkage loops, otherwise a range scan may turn into an infinite loop. I’ve never seen this occur in the field.
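
That two-pointer technique is the classic tortoise-and-hare cycle check. A small Python sketch over a hypothetical page chain:

```python
# Loop detection over a page-linkage chain: 'slow' advances one page per
# step, 'fast' advances two. If they ever meet before 'fast' falls off the
# right-hand end of the level, the linked list contains a loop.

def has_linkage_loop(next_page, head):
    """next_page maps page -> next page in the level (None at the right edge)."""
    slow = fast = head
    while fast is not None and next_page.get(fast) is not None:
        slow = next_page[slow]
        fast = next_page[next_page[fast]]
        if slow == fast:
            return True          # pointers met: linkage loop
    return False                 # chain terminates cleanly

clean  = {'A': 'B', 'B': 'C', 'C': None}
looped = {'A': 'B', 'B': 'C', 'C': 'B'}   # page C points back to page B

# has_linkage_loop(clean, 'A')  -> False
# has_linkage_loop(looped, 'A') -> True
```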

Any error found at this stage cannot be repaired so you must restore from a backup. This is because the repair would have to deallocate the page, effectively deleting metadata for a bunch of tables and indexes. As databases get larger and more complex (thousands of tables and indexes), the percentage of pages that comprise these critical system tables rises and so the chance of a hardware problem corrupting one of these pages also rises – I see several of these a month on the forums. Without a backup, the only alternative is to try to export as much data as you can – not good.

If all the critical system table clustered index leaf-level pages are ok then DBCC CHECKDB knows it has solid enough metadata on which to base the next set of checks.

3. Allocation checks

This stage is in SQL Server 2000 and 2005. These checks verify the various structures (IAM pages, IAM chains/allocation units, GAM/SGAM pages, PFS pages) that track the status of pages and extents that have been allocated within a database. I’ll describe how we go about collecting information from the various pages and then describe what some of the actual checks are.

The allocation checks are very fast (orders of magnitude faster than the logical checks that follow in the next stage) because the number of database pages that have to be read is very small.  The algorithm for gathering allocation data is as follows:

  • For each file in each online filegroup in the database (except transaction log files):
    • Read all PFS pages (this provides a bitmap showing all IAM pages, plus another one showing all mixed pages).
    • Read the GAM pages (this provides bitmaps of all allocated extents).
    • Read the SGAM pages (this provides bitmaps of all mixed extents with at least one unallocated page).
    • Read the DIFF_MAP pages (a ‘differential bitmap’ page shows which extents in the GAM interval have been modified since the last full or differential backup – a differential backup only needs to backup those extents marked modified in the various DIFF_MAP pages). This is just to make sure the pages can be read.
    • Read the ML_MAP pages (a ‘minimally-logged bitmap’ page shows which extents in the GAM interval have been modified in bulk-logged recovery mode since the last log backup – a log backup must also back up all such extents to ensure that all changes to the database have been backed up. This can make the log backup quite large – although the log itself stays much smaller – but that’s a topic for another blog post). Again, this is just to make sure the pages can be read.
    • Read all IAM pages. This provides:
      • A list of all the mixed pages in the file, and by derivation, a list of all mixed extents in the file (remember that the first IAM page in an IAM chain/allocation unit contains an array to hold up to 8 mixed pages for the IAM chain/allocation unit it represents)
      • A list of all the valid IAM pages in the file
      • A list of all the allocated dedicated extents in the file
      • Linkage information for IAM chains
  • After all the per-file stuff, read the Storage Engine metadata. This provides:
    • Information about the root of each IAM chain (the first IAM page in the IAM chain). In SQL Server 2000, this is stored in sysindexes. In SQL Server 2005, this is stored in the sys.sysallocunits hidden system table.
    • Information about IAM chains currently waiting to be ‘deferred-dropped’. Deferred-drop is a SQL Server 2005 optimization that prevents a transaction from running out of lock memory while dropping an IAM chain. It is the process by which an IAM chain with > 128 extents that is dropped – by dropping/rebuilding an index or dropping/truncating a table – does not have its actual pages and extents deallocated until after the transaction has committed. The IAM chain is unhooked from sys.sysallocunits though and hooked into an internal queue – if DBCC CHECKDB didn’t scan that queue too as part of the allocation checks, it might see all kinds of inconsistencies with the various allocation bitmaps.

So, now DBCC CHECKDB has a whole bunch of allocation data that it has collected, and it needs to make sense of it all to ensure the allocation structures are correct. Here’s a non-exhaustive list of checks that are done with this data:

  • Check that each extent is either allocated to:
    • the GAM page for the GAM interval, or
    • the SGAM page for the GAM interval, as a non-full mixed extent, or
    • exactly one IAM page that covers the GAM interval, or
    • to none of the bitmap pages, but all pages in that extent must be allocated to IAM pages as mixed pages
      • This could result in an 8903 (GAM and SGAM), 8904 (multiple IAMs), or 8905 (no page) error, depending on the combination of bitmaps that have the extent allocated
  • Check that all pages marked as being IAM pages in PFS pages really are IAM pages when they’re read
  • Check that all pages marked as being mixed pages in PFS pages appear somewhere in a mixed page array on an IAM page
  • Check that each mixed page is only allocated in a single IAM page
  • Check that the IAM pages in an IAM chain have monotonically increasing sequence numbers
  • Check that the first IAM page in an IAM chain has a reference from a row in sys.sysallocunits (or sysindexes if on SQL Server 2000)
  • Check that no two IAM pages within the same IAM chain map the same GAM interval
  • Check that all IAM pages within an IAM chain belong to the same object/index/partition
  • Check that the linkages within an IAM chain are correct (no missing pages for instance)
  • Check that all IAM/GAM/SGAM pages that map the final GAM interval in a file do not have extents marked allocated that are beyond the physical end of the file
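
The first of those checks (each extent claimed by exactly one allocation source) can be sketched as a simple classification. This Python toy drops the “all pages allocated as mixed pages” case and invents its own inputs, but shows where the 8903/8904/8905 distinctions come from:

```python
# Simplified extent-ownership cross-check: every extent should be claimed by
# exactly one source - the GAM (free), the SGAM (non-full mixed extent), or
# a single IAM page (dedicated extent).

def check_extent(extent, gam_free, sgam_mixed, iam_claims):
    claims = iam_claims.get(extent, [])
    if extent in gam_free and extent in sgam_mixed:
        return 8903              # allocated in both GAM and SGAM
    if len(claims) > 1:
        return 8904              # allocated to more than one IAM page
    if extent not in gam_free and extent not in sgam_mixed and not claims:
        return 8905              # allocated in no bitmap at all
    return None                  # consistent

gam_free   = {0}                           # extent 0 is free
sgam_mixed = {1}                           # extent 1 is a non-full mixed extent
iam_claims = {2: ['IAM-A'], 3: ['IAM-A', 'IAM-B']}

# check_extent(2, ...) -> None (exactly one IAM page claims it)
# check_extent(3, ...) -> 8904 (two IAM pages claim it)
# check_extent(4, ...) -> 8905 (nothing claims it)
```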

Any errors found here will require REPAIR_ALLOW_DATA_LOSS to repair and some of the repairs are very complicated (e.g. multiply-allocated extents) – topic for a future blog post.

So, the allocation checks lay the next foundation level over the system table primitive checks and then DBCC CHECKDB is ready to move on to the logical checks.

4. Logical checks

This section really has two parts – perform all the logical checks on the critical system tables, and then perform all the logical checks on all the other tables in the database.

If any errors are found in the critical system tables and repair is not specified, or repair is specified, but not all the errors can be repaired, then the DBCC CHECKDB finishes. An example of an unrepairable system table error is something that would require deleting data from one of the system tables – e.g. a corrupt key value in a clustered index data page of sys.sysallocunits (remember that this is the actual hidden table I’m talking about, not the sys.allocation_units catalog view you may have seen or used).

So if the critical system tables are clean, all the tables in the database are checked. This includes indexed views and primary XML indexes (which are both stored as clustered indexes – and as far as the Storage Engine is concerned are objects in their own right – it’s the Relational Engine that knows they’re not really separate objects).

The following checks are performed:

  1. Validate each table’s storage engine metadata
  2. Read and check all data, index and text pages, depending on the page type
  3. Check all inter-page relationships
  4. Check the page header counts in each page
  5. Perform any necessary repairs (if a repair level was specified)

Let’s look at the first four stages in more detail – there’s too much to go into about repairs in this post.

4.1 Validate each table’s storage engine metadata

The Storage Engine metadata is what tells DBCC CHECKDB how to crack open records on pages in particular rowsets – so if there’s something wrong with it then DBCC CHECKDB will generate a bunch of misleading errors or, worse yet, miss some errors. Here’s roughly what happens while the metadata is parsed for a single table (for SQL Server 2005):

  • Loop through all the indexes, and then all the rowsets of each index, building a list of known allocation units for the index (including all the DATA, LOB, and SLOB allocation units). This allows reverse lookups when DBCC CHECKDB reads a page – because the information stamped on the page is the allocation unit, DBCC CHECKDB needs a very fast way to convert between an allocation unit and an index/object ID.
  • Make sure DBCC CHECKDB skips any indexes that are in the middle of being built/rebuilt online, because they will not have a complete set of data.
  • Build a list of computed columns and generate the necessary code to enable the column values to be recomputed.
  • Build a list of columns that are used in non-clustered indexes
  • Make sure the various USED, DATA, RSVD counts are not negative (see the Books Online entry for DBCC CHECKDB for an explanation of how this can happen in versions prior to SQL Server 2005)
  • Figure out what each kind of column is (e.g. a regular column, a generated uniquifier, a dropped column)
  • Build a series of mappings to allow conversion between column IDs at different levels of Storage Engine abstraction
  • Check that the Relational Engine and Storage Engine nullability flags for a column agree
  • Make sure the column counters in metadata match what has just been seen

4.2 Read and check all data, index and text pages

No matter what type a page is, the page is audited and then all the records on it are audited.

Page audit first checks for IO errors when the page is read (e.g. page checksums). If there are some, then the page is not processed any further. This is what can lead to errors like 8976 being reported, for example:

Table error: Object ID 132765433, index ID 1, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type DATA). Page (1:3874) was not seen in the scan although its parent (1:3999) and previous (1:3873) refer to it. Check any previous errors.

Then it checks for page header correctness and that the page has an appropriate type for the allocation unit it’s in (e.g. a DATA page should not be found in an allocation unit for a non-clustered index).

Record audits include checking the various fields in the record header and that the various offsets in the record make sense (e.g. the offset to the variable-length columns section of the record should not point off the end of the record). I’ll do a detailed post on record formats and record cracking at the start of the internals series.

The more complex checks that are done per-page depend on what type the page is. As an example, here’s what is done for a DATA page in the leaf level of a clustered index (excluding the inter-page relationships – I’ll list those in the next section):

  • Records in the page must be strictly ordered by the defined keys of the index (although the records themselves aren’t necessarily stored in sorted order in the data portion of the page, accessing the records through the slot array must yield them in the correct order)
  • No two records can have duplicate key values (remember that non-unique indexes have a hidden, automatically-generated uniquifier column added to the key – to make or extend the composite key – so that record uniqueness is guaranteed)
  • If the index is partitioned, each record is run through the user-defined partitioning function to ensure it’s stored in the correct partition
  • All the complex columns in each record are checked:
    • Complex columns are those storing legacy text or LOB values (text, ntext, image, XML, nvarchar(max), varchar(max), varbinary(max)) or in-row pointers to variable-length columns that have been pushed off-row in rows that are longer than 8060 bytes
    • The column is checked to make sure it’s storing the right kind of data – either the value itself or a text pointer or some kind of in-row root containing pointers to portions of an off-row value.
    • The linkages between what is stored in-row and the off-row values stored in other pages are eventually checked too
  • Check computed columns:
    • If the column is persisted (either because it’s defined as a persisted computed column or because it’s used as a non-clustered index key), its value is recomputed and checked against the persisted value
    • This is also important when we come to do the non-clustered index cross-checks (see below) – as any discrepancy in the stored column values will cause mismatches
  • Data purity checks
    • The column value is checked to ensure it’s within the bounds for its data type (e.g. the minutes-past-midnight portion of the internal representation of a datetime value cannot be greater than 1440 – 24 hours x 60 minutes)
  • Non-clustered index cross-checks
      • This is probably my favorite part of DBCC CHECKDB and is one of the most complicated bits of code – I’ve rewritten it twice (for 2000 to make the locking work in the final deep-dive part of the algorithm, and for 2005 to remove the locking and cope with all the new features that can affect indexes). In fact, it’s been rewritten again for SQL Server 2008 to change the record matching algorithm – more details later.
      • What DBCC CHECKDB is trying to do is make sure that each record in a heap or clustered index has exactly one matching record in each non-clustered index, and vice-versa. The brute-force (N² complexity) way to do this is to do a physical lookup of all the matching rows whenever one row in the relationship is read – but that’s incredibly time-consuming, so instead there’s a fast algorithm to detect problems.
      • Imagine a table defined by the following DDL and with a single row:
    CREATE TABLE t (c1 INT, c2 char(10), c3 varchar(max))
    CREATE INDEX index1 ON t (c1)
    CREATE INDEX index2 ON t (c2) INCLUDE(c3)
    • Each row in the heap has to have two matching non-clustered index rows, one in each of index1 and index2. But how can DBCC CHECKDB tell without doing the direct lookup or having to store tremendous amounts of state? It uses a hash-bitmap algorithm.
      • Imagine a large bitmap – say half a million bits. Initially all the bits are 0.
      • DBCC CHECKDB happens to read the heap page first, with one record in it. It knows exactly what the two non-clustered index records should look like, so it generates them – exact byte-for-byte matches of what the real non-clustered index records should be. Then, for each generated non-clustered index record, it computes a hash value of the record, maps the hash value to a bit in the bitmap, and flips the bit. So we now have two bits set to 1 in the bitmap.
      • DBCC CHECKDB then reads the two non-clustered index pages, each with a single record on. For each record, just hash the record and flip the corresponding bit in the bitmap.
      • The idea is that the bit-flips should cancel each other out (as the real and generated non-clustered index records should be exactly the same and hash to the same value) and the bitmap should be left with all zeroes at the end of the checks.
    • Taking the view that corruptions are magnitudes rarer than clean databases, DBCC CHECKDB went a step further and allowed the checks for multiple tables to use the same bitmap. If you think about it, this won’t cause any problems, even if two sets of records map to the same bit in the bitmap – as long as the number of bit-flips on a bit is even (i.e. each record really does have its correct matching record) then there should be no problem.
    • Here’s the catch – what happens when there’s a bit left on in the bitmap? Well, this is where the trade-off comes into play. If there’s a bit left on, DBCC CHECKDB can’t tell which records in which table or index mapped to it, so it has to re-scan the tables and indexes that used the bitmap to see which records map to the bit. For every one it finds, it actually does the physical lookup of the matching record and then does a comparison of all the columns, including any LOB columns used as INCLUDE‘d columns in non-clustered indexes. This process is called the deep-dive and can add a significant amount to the run-time of DBCC CHECKDB if it occurs.
    • Prior to SQL Server 2005 SP2 there was no way to tell whether DBCC CHECKDB was going to do a potentially long-running deep-dive. From 2005 SP2, error 5268 was added and will be printed to the SQL Server error log when this happens
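
The hash-bitmap idea is worth seeing in miniature. A hedged Python sketch (byte strings stand in for the exact index record images; the real hash function and bitmap sizing are internal to DBCC):

```python
import hashlib

# Toy hash-bitmap cross-check: pass 1 flips a bit for each *generated*
# non-clustered index record (built from the heap rows); pass 2 flips a bit
# for each index record actually read. Matched pairs cancel to zero.

BITMAP_BITS = 4096
bitmap = bytearray(BITMAP_BITS // 8)

def flip(record_bytes):
    h = int(hashlib.sha256(record_bytes).hexdigest(), 16) % BITMAP_BITS
    bitmap[h // 8] ^= 1 << (h % 8)

heap_rows = [b'key=1', b'key=2', b'key=3']

for row in heap_rows:               # pass 1: generated index records
    flip(row)

for row in [b'key=1', b'key=2']:    # pass 2: actual index records (key=3 lost!)
    flip(row)

# A bit left set means some record had no partner - but we can't tell which,
# so a re-scan with physical lookups (the deep-dive) is needed.
assert any(bitmap)                  # corruption suspected
flip(b'key=3')                      # had the missing index record existed...
assert not any(bitmap)              # ...every flip would have cancelled out
```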

4.3 Check all inter-page relationships

Inter-page relationships are relevant for:

  • Pages in heaps that have forwarding or forwarded records
  • Pages in indexes
  • Pages that have records with LOB columns that have their data stored off-row (these can be heap pages, clustered index data pages or non-clustered index leaf pages in SQL Server 2005)
  • Text pages that have records with child nodes on other pages

Here are the checks that are done for index pages, for example:

  • All pages in an index level should be pointed to by a page in the next level higher in the b-tree, and also by the left-hand and right-hand neighboring pages in the same level of the b-tree. Exceptions are made for the left-most and right-most pages in a level (where the m_prevPage and m_nextPage fields are (0:0)) and for the root page of the b-tree, where the parent page link comes from the storage-engine metadata. The 8976 error message that I referenced above is generated from this set of checks.
  • Key ranges in neighboring pages in a level should not overlap
  • Key ranges of pages should be correctly defined by the parent pages in the next level up in the b-tree (the parent pages contain the minimum key value that can exist on a child page)
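As a rough illustration of the linkage and key-range rules for one index level, here is a toy model (hypothetical page structure – not SQL Server’s actual single-pass implementation):

```python
# Toy model of a b-tree level check (illustrative only): each page knows
# its neighbors (m_prevPage / m_nextPage) and its key range; neighbors
# must link to each other both ways, and key ranges must not overlap.

class Page:
    def __init__(self, page_id, prev_page, next_page, min_key, max_key):
        self.page_id = page_id
        self.prev_page = prev_page    # (0, 0) means left-most in the level
        self.next_page = next_page    # (0, 0) means right-most in the level
        self.min_key = min_key
        self.max_key = max_key

def check_level(pages_in_order):
    errors = []
    for left, right in zip(pages_in_order, pages_in_order[1:]):
        # Neighboring pages must point at each other.
        if left.next_page != right.page_id or right.prev_page != left.page_id:
            errors.append(("bad linkage", left.page_id, right.page_id))
        # Key ranges in neighboring pages must not overlap.
        if left.max_key >= right.min_key:
            errors.append(("key overlap", left.page_id, right.page_id))
    return errors

p1 = Page((1, 100), (0, 0), (1, 101), 1, 10)
p2 = Page((1, 101), (1, 100), (0, 0), 11, 20)
assert check_level([p1, p2]) == []
```

The parent-page check (that each page is referenced by the level above, with the correct minimum key) follows the same pattern, one level up.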

I’ll go into details of how the inter-page checks are done in a future post. For now, it’s enough to say that it’s not an O(n²) algorithm.

4.4 Check the page header counts in each page

The page header contains a bunch of counters – the ones that need to be checked are:

  • slot count
  • ghost record count
  • free space count

The first two are obvious – count the rows as they’re processed and make sure page header counts are valid. The free space count is only checked for text pages and data pages in a heap (the only pages for which free space is tracked in a PFS page).
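The counter checks themselves are conceptually simple – something like this sketch (with an invented record layout, purely for illustration):

```python
# Illustrative page-header counter check (hypothetical record layout):
# count the slots and ghost records actually present on the page and
# compare against what the page header claims.

def check_page_counts(header, records):
    errors = []
    if header["slot_count"] != len(records):
        errors.append("slot count mismatch")
    ghosts = sum(1 for r in records if r.get("ghost"))
    if header["ghost_count"] != ghosts:
        errors.append("ghost record count mismatch")
    return errors

records = [{"key": 1}, {"key": 2, "ghost": True}]
assert check_page_counts({"slot_count": 2, "ghost_count": 1}, records) == []
assert check_page_counts({"slot_count": 3, "ghost_count": 1}, records) == ["slot count mismatch"]
```

The free-space comparison against the PFS byte works the same way, but only for the page types where free space is tracked.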

So, that’s the majority of the work in running a DBCC CHECKDB.

5. Higher-level logical checks

The next set of checks are higher-level checks involving multiple structures and are only present in SQL Server 2005:

  1. Service Broker checks
  2. Metadata cross-checks
  3. Indexed view and XML index checks

It’s important that these checks are done after the per-table logical checks, because any repairs done as part of the per-table checks may substantially affect the outcome of these checks. Imagine the case where an indexed view is based on a join between two tables, foo and bar. Table foo has a page that is damaged in a way that reading the page as part of a query would not detect – because full page audits are not done as part of regular page reads. The damage is such that the page has to be deleted as part of repair – thereby changing the results of the view. If the indexed view was checked before tables foo and bar, the repair would not be reflected in the indexed view and so the indexed view would essentially be corrupt. The same logic holds for checking XML indexes and Service Broker tables.

Let’s look at these final checks in more detail.

5.1 Service Broker checks

The Service Broker development team wrote a comprehensive set of checks of the data stored in their internal on-disk structures. The checks validate the relationships between conversations, endpoints, messages and queues. For example:

  • A conversation must have two endpoints
  • A service must be related to a valid contract
  • A service must be related to a valid queue
  • A message must have a valid message type

What’s even cooler is that they also implemented a set of logical repairs, so if one of their internal tables was damaged, and repaired by the earlier logical checks, then the Service Broker repair code can clean up any Service Broker entities and entity relationships that were damaged too.
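The flavor of these relationship checks can be sketched like this (invented structures and field names – the real Service Broker metadata is more involved):

```python
# Toy version of Service Broker relationship checks (hypothetical
# structures): every conversation must have exactly two endpoints, and
# every service must reference a queue that actually exists.

def check_broker(conversations, endpoints, services, queues):
    errors = []
    for conv in conversations:
        n = sum(1 for e in endpoints if e["conversation"] == conv)
        if n != 2:
            errors.append(("conversation endpoint count", conv, n))
    for svc, queue in services.items():
        if queue not in queues:
            errors.append(("service without valid queue", svc))
    return errors

eps = [{"conversation": "c1"}, {"conversation": "c1"}]
assert check_broker(["c1"], eps, {"svc": "q1"}, {"q1"}) == []
assert check_broker(["c2"], eps, {"svc": "qX"}, {"q1"}) == [
    ("conversation endpoint count", "c2", 0),
    ("service without valid queue", "svc"),
]
```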

5.2 Metadata cross-checks

The Metadata team also wrote a great set of checks for the relational metadata stored in the system tables. The checks run here are the same code that’s run for DBCC CHECKCATALOG. The checks themselves in SQL Server 2005 are far more involved than in SQL Server 2000, and they’re done far more efficiently too. However, they’re not comprehensive by any means – some more checks were added in SQL Server 2008.

The checks only cover the Relational Engine metadata – i.e. the relationships between system tables storing relational metadata. There are no such explicit checks for the tables storing the Storage Engine metadata – at present the relationships between those tables are only checked implicitly by the other DBCC CHECKDB checks I described above.
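The essence of a metadata cross-check is an existence/relationship test between catalogs – something like this sketch (invented table shapes, not the real system tables):

```python
# Rough idea of a DBCC CHECKCATALOG-style relationship check (invented
# shapes, not the real system tables): every row in a dependent catalog
# must reference a parent row that exists.

def check_catalog(objects, columns):
    object_ids = {o["object_id"] for o in objects}
    # Return orphaned column rows whose parent object is missing.
    return [c for c in columns if c["object_id"] not in object_ids]

objs = [{"object_id": 10, "name": "foo"}]
cols = [{"object_id": 10, "name": "c1"}, {"object_id": 99, "name": "orphan"}]
assert check_catalog(objs, cols) == [{"object_id": 99, "name": "orphan"}]
```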

There are no repairs for metadata corruptions. Metadata corruptions are extremely difficult to deal with because changing/deleting metadata affects the entire table the corrupt metadata describes and could potentially be as bad as deleting the table – a table without metadata is just a collection of pages with indecipherable rows (well, not quite true – it’s possible to decipher any record without metadata but it requires human intervention and is incredibly hard).

It’s possible the team may put in some limited metadata repairs in a future release, but the frequency of their occurrence in the field is so low that I decided the engineering investment was not justified for SQL Server 2005 or SQL Server 2008, and so didn’t push it. So – if you get any metadata corruption, you need to restore from your backups.

5.3 Indexed view and XML index checks

These are very cool and many thanks to Conor Cunningham (former Dev Lead of one of the two Query Optimizer teams) for helping work out how to do this. The indexed view contains the persisted results of a query, and the query itself is persisted in metadata – so the easiest way to check whether the indexed view is accurate is to recalculate the view results into a temporary table in tempdb and then compare the calculated values with the values persisted in the indexed view. The view is regenerated into a temporary table and then two concurrent left-anti-semi-joins are run, which return all the rows in the indexed view that are not in the recalculated view results, and vice-versa. This gives DBCC CHECKDB all the rows that are extraneous in the indexed view and all the rows that are missing from it.
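Conceptually, the two anti-semi-joins amount to a two-way set difference – a sketch, using in-memory sets in place of the tempdb temporary table:

```python
# Sketch of the indexed-view comparison (conceptual only): recompute the
# view's results, then find rows present on one side but not the other,
# mimicking the two left-anti-semi-joins DBCC CHECKDB runs.

def compare_indexed_view(persisted_rows, recomputed_rows):
    persisted = set(persisted_rows)
    recomputed = set(recomputed_rows)
    extra = persisted - recomputed     # extraneous rows: deleted one by one
    missing = recomputed - persisted   # missing rows: fixed by a rebuild
    return extra, missing

extra, missing = compare_indexed_view(
    [("a", 1), ("stale", 9)],
    [("a", 1), ("b", 2)],
)
assert extra == {("stale", 9)}
assert missing == {("b", 2)}
```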

Indexed-view problems can also be repaired. The repair for extra rows is to delete them one by one (using internal query syntax that only works from DBCC), and the repair for missing rows is to rebuild the indexed view. This is done by simply disabling the indexed view and then bringing it back online (which rebuilds it).

There are two drawbacks to the indexed view checks if the views are large – it can take up a lot of space in tempdb and it can take a lot of time to run the regeneration of the indexed views and to run the left-anti-semi-joins. So if you upgraded your database from SQL Server 2000 and you regularly run full DBCC CHECKDBs (i.e. without using WITH PHYSICAL_ONLY), then you may see a run-time increase for DBCC CHECKDB on SQL Server 2005 – this is documented in BOL and the README.

XML index checks work in a similar way. An XML index is an index over all the nodes in a shredded version of the XML blob. The XML blobs in the table are re-shredded and checked against the shredding that’s persisted in the primary XML index. If anything is wrong, the primary XML index is recreated.
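The re-shred comparison follows the same recompute-and-compare pattern – a simplified sketch, with an invented shredding format standing in for the real node table:

```python
# Conceptual re-shred check for a primary XML index (simplified): shred
# each XML value into (row, path, value) nodes and compare with the
# persisted node rows. Here each "XML blob" is just a dict of node
# path -> value, which stands in for real shredding.

def shred(xml_rows):
    nodes = set()
    for row_id, blob in xml_rows:
        for path, value in blob.items():
            nodes.add((row_id, path, value))
    return nodes

def check_xml_index(xml_rows, persisted_nodes):
    # True if the persisted index matches a fresh shredding of the data.
    return shred(xml_rows) == set(persisted_nodes)

rows = [(1, {"/a": "x", "/a/b": "y"})]
good = [(1, "/a", "x"), (1, "/a/b", "y")]
assert check_xml_index(rows, good)
assert not check_xml_index(rows, good[:1])
```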


And that’s it. Now you have a complete picture of what’s going on with DBCC CHECKDB when it runs. In the next few posts I want to give some insight into how DBCC CHECKDB manages to do all these checks while only making a single pass through the database and shed some light on how repair works.

23 thoughts on “CHECKDB From Every Angle: Complete description of all CHECKDB stages”

  1. What happens if the index cross-check produces a hash collision? That might leave an inconsistency undetected (in case that (hash(correctRecord) == hash(sameRecordWithCorruption)). Is this acceptable?

    1. It could only happen before 2008 and only then if two corrupt records produced a hash collision. Very rare, but still possible. From 2008 onwards, it’s incredibly unlikely given the way the hash algorithm changed.

      Acceptable? No choice – that’s the way it was written.

  2. Date 1/31/2015 7:30:01 AM
    Log Job History (xyz- Weekly Maintenance (Full) – (Sat-0730 hrs) – (14 Day Cleanup))

    Step ID 1
    Server xyz
    Job Name xyz – Weekly Maintenance (Full) – (Sat-0730 hrs) – (14 Day Cleanup)
    Step Name Check Database Integrity
    Duration 05:25:43
    Sql Severity 0
    Sql Message ID 0
    Operator Emailed
    Operator Net sent
    Operator Paged
    Retries Attempted 0

    Msg 5269, Level 16, State 1, Server xyz, Line 1
    Check terminated. The transient database snapshot for database ‘xyz’ (database ID x) has been marked suspect due to an IO operation failure. Refer to the SQL Server error log for details.
    Sqlcmd: Error: Microsoft SQL Server Native Client 10.0 : Unspecified error.

    What does this mean? Since this is the first step in the maintenance job nothing else runs. The user is concerned about performance degradation.

    1. It means your I/O subsystem caused corruption in the database snapshot that DBCC CHECKDB created. This is 100% a corruption problem with your I/O subsystem.

  3. Hi Paul, when CHECKDB uses a snapshot is it actually a “correct” view of the source at the time it “finishes” the creation of the snapshot? When I say “finishes” I am thinking that it means recovery in the snapshot has been fully completed? thanks.

    1. Hey Paul I found the answer in an old whitepaper – data is consistent at the start of analysis phase of snapshot recovery. Sorry for wasting your time. cheers.

  4. Hi Paul,

    I’d like to know: is it possible for CHECKDB to lead to a SQL Server shutdown while auditing is ongoing and there is a lot of free disk space?

    Thank you

  5. Hi Paul
    I have a question and I’ve been searching online but haven’t been able to find any good resource – but I’m sure if anyone knows this, it’s you.

    I have a database that’s not that big (55GB) and it’s a vendor DB with over 1500 tables. tempdb has 35 GB (and space is limited on this server so I can’t give it more). CHECKDB fails with tempdb getting full after CHECKDB completes 80% or so based on percent_complete in dm_exec_requests.
    So I’ve been running CHECKDB for this database using PHYSICAL_ONLY. I was wondering if it’s possible to run a set of commands that would be equal to running CHECKDB, but broken into steps, so perhaps tempdb won’t get filled up. I could loop through all tables and run DBCC CHECKTABLE – what else should I run to compensate for not being able to run a full CHECKDB?

    1. Yes you can, but it’s likely there’s one large table which is what’s causing the problem. Your best bet is to add more space, or backup-copy-restore-checkdb on another system with more space.

  6. Paul – Do you know of any way to change the directory location where the snapshot files are created? We tried changing the default data directory, but the snapshot files were still created in the same directory as the associated database files. Thanks in advance for your time.

    1. Not when running DBCC CHECKDB, but you can just create your own database snapshot and run DBCC CHECKDB on that instead of the real database – it’s the same thing really.

  7. Hi Paul, I was messing around with restore testing and I found the issue where if you restore master as a user DB, CheckDB won’t come back “clean”. (Errors like: Check Catalog Msg 3851….An invalid row…was found in system table…etc. 45 consistency errors by the end…SQL 2016 SP2) I guess because master has some uniqueness to it that is ignored when CheckDB knows that it’s checking the master db.

    Is there any way to tell CheckDB that it’s checking a restored copy of master so it won’t throw the consistency check errors?

  8. If a database file’s growth is stopped (not restricted), does CHECKDB still generate a snapshot file for it? I’m guessing it does, but do you know how big a snapshot file could be? (For instance, could a snapshot file be 10% of the actual data file size, etc.?)

    Also, I would like to know how CHECKDB treats a database running in SQL Server 2000 compatibility mode on SQL Server 2008 R2. Will it use the SQL Server 2000 strategy or generate a snapshot file for a consistent view?

    1. Database size is irrelevant to that – it will always generate a snapshot unless the database is: tempdb, read-only, single-user, a snapshot. The snapshot file size is related to how much uncommitted transaction activity there is. Compatibility mode only affects query operator behavior – nothing to do with the Storage Engine.

  9. Is the size of the snapshot, as shown in Windows Explorer, the true size?

    The snapshots I have seen look to be the same size as the actual database.

    1. In the Explorer details for the file, look at the Size on disk – that shows the true size. The file will list itself as the same size as the database, but use less space actually on disk.

  10. Great post Paul.

    I’ve got a question, my company has a 7 TB database and the CHECKDB is taking a long time to complete. What’s the risk (i.e. undetected corruption) if I only run PHYSICAL_ONLY on the primary, and full CHECKDB on secondary?
    My thinking is below, and I’m not sure about the drawbacks of this method:
    – If the data pages on the primary are good, and the full check on the secondary confirms the logical integrity (which should confirm the primary, because all the system tables and their metadata are verified)

    Thank you.

    1. The risk is that you miss corruption from a bug or bad memory that causes a corrupt page to be written to disk with a good checksum. Physical-only checks won’t find bad data values from that. Thanks
