DBCC CHECKDB performance and computed-column indexes

[Edit 2016: The team ‘fixed’ the problem in SQL Server 2016 by skipping consistency checking these indexes unless WITH EXTENDED_LOGICAL_CHECKS is used.]

It’s no secret that DBCC CHECKDB has some performance quirks based on the schema of the database being checked and various kinds of corruptions. I was recently doing some scalability testing of DBCC CHECKDB for a blog post and discovered quite a nasty performance issue that exists in all versions of SQL Server back to SQL Server 2005. This isn’t a bug, it’s just the way things work.

The issue occurs when a nonclustered index exists that has a computed column as part of the index key or as one of the INCLUDEd columns and affects DBCC CHECKDB, DBCC CHECKFILEGROUP, and DBCC CHECKTABLE.

As a bit of background, here’s an excerpt from the SQL Server 2012 Internals book where I describe one of the ways nonclustered indexes are checked for consistency:

The nonclustered index cross-checks verify that

  • Every record in a nonclustered index (whether filtered or nonfiltered) must map to a valid record in the base table (that is, the heap or clustered index).
  • Every record in the base table must map to exactly one record in each nonfiltered, nonclustered index, and one record in each filtered index, where the filter allows.

The mechanism to carry out these checks efficiently has changed in every release since SQL Server 7.0—becoming progressively more and more efficient. In SQL Server 2012, two hash tables are created for each partition of each nonclustered index—one hash table is for the actual records in that partition of the nonclustered index, and the other is for the records that should exist in that partition of the nonclustered index (as calculated from the existing data records in the table).

When a nonclustered index record is processed, all columns in the record are hashed together into a BIGINT value. This includes

  • The physical or logical link back to the base table (known as the base table RID)
  • All included columns—even LOB and FILESTREAM values) are hashed together into a BIGINT value

The resulting value is added to the master hash value for actual records for the nonclustered index partition of which the record is part.

DBCC CHECKDB knows which nonclustered indexes exist for the table and what the complete nonclustered index record composition should be for each. When a data record is processed, the following algorithm is run for each matching nonclustered index record that should exist for the data record (taking into account any filter predicates for filtered nonclustered indexes):

  1. Create the nonclustered index record in memory (again, including the base table RID, plus included columns).
  2. Hash all columns in the index record together into a BIGINT value.
  3. Add the resulting value to the “should exist” master hash value for the relevant nonclustered index partition of which the index record is part.

The premise that this algorithm works on is that if no corruptions exist, the master hash values for the actual records and “should exist” records for each nonclustered index partition should match exactly at the end of the DBCC CHECKDB batch.

When a nonclustered index uses a computed column, the value of the computed column has to be computed based on the column definition. To do that, an internal mechanism called an ‘expression evaluator’ is created. The expression evaluator is provided by the Query Processor code and its behavior is entirely outside the control of DBCC CHECKDB. The drawback of the expression evaluator is that every time it is used, an exclusive latch must be held by the thread using it. This creates an incredible bottleneck and drastically affects performance.

My test system is the Dell R720 we bought last year, with 64GB of memory and 2 E5-2670 CPUs with 8 physical cores and hyperthreading enabled. It’s running SQL Server 2012 SP1 CU3.

The test database is AdventureWorks that Jonathan expanded to 500GB for me using his cool scripts. The database is split over 8 data files stored on two 320GB Fusion-io drives, with tempdb and its log placed on two more 320GB Fusion-io drives. I set things up this way to remove I/O waits from the test.

The first few runs of my testing, with unlimited parallelism produced DBCC CHECKDB elapsed times of around 340 minutes. This seemed incredibly slow to me, so I ran another test and looked at the output of sys.dm_os_waiting_tasks. I discovered that half the threads at any time were waiting for exclusive access to the DBCC_OBJECT_METADATA latch. Adding in some diagnostics to capture waits and latches gave the data below:

WaitType        Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
--------------- ---------- ----------- --------- ---------- ----------- ---------- --------- ---------
CXPACKET        684482.45  682667.13   1815.32   60294236   34.90       0.0114     0.0113    0.0000
OLEDB           659325.08  659325.08   0.00      207661462  33.62       0.0032     0.0032    0.0000
LATCH_EX        615168.39  605357.28   9811.11   798224634  31.37       0.0008     0.0008    0.0000

LatchClass               Wait_S     WaitCount  Percentage  AvgWait_S
------------------------ ---------- ---------- ----------- ----------
DBCC_OBJECT_METADATA     611768.00  764845636  99.16       0.0008

This didn’t make any sense to me as I couldn’t believe that such an incredible bottleneck existed – almost 1ms per latch wait and 800 million latch waits! – surely we’d have heard about it by now? I pinged my friend Ryan Stonecipher who inherited the DBCC CHECKDB code from me and he reminded me about the expression evaluator behavior.

I went back to my test system and looked for nonclustered indexes using computed columns using the code below:

SELECT
    [s].[name],
    [o].[name],
    [i].[name],
    [co].[name],
    [ic].*
FROM sys.columns [co]
JOIN sys.index_columns [ic]
    ON [ic].[object_id] = [co].[object_id]
    AND [ic].[column_id] = [co].[column_id]
JOIN sys.indexes [i]
    ON [i].[object_id] = [ic].[object_id]
    AND [i].[index_id] = [ic].[index_id]
JOIN sys.objects [o]
    ON [i].[object_id] = [o].[object_id]
JOIN sys.schemas [s]
    ON [o].[schema_id] = [s].[schema_id]
WHERE [co].[is_computed] = 1;
GO

And I found 6 nonclustered indexes with computed columns in, on some of the largest tables in the database. I then disabled those nonclustered indexes and re-ran the DBCC CHECKDB tests.

17-18 minutes per run.

Wow! The bottleneck from the expression evaluator was causing DBCC CHECKDB to run about 20 times slower.

The wait and latch details for the test runs with the indexes disabled are as below, which is what I expected to see from running DBCC CHECKDB:

WaitType        Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
--------------- ---------- ----------- --------- ---------- ----------- ---------- --------- ---------
CXPACKET        33064.56   29331.74    3732.83   42034533   48.14       0.0008     0.0007    0.0001
OLEDB           28883.78   28883.78    0.00      173940154  42.05       0.0002     0.0002    0.0000
LATCH_EX        5021.20    4605.50     415.70    30340659   7.31        0.0002     0.0002    0.0000

LatchClass               Wait_S     WaitCount  Percentage  AvgWait_S
------------------------ ---------- ---------- ----------- ----------
DBCC_CHECK_AGGREGATE     5039.36    30267451   98.82       0.0002

There’s nothing we can do about this bottleneck except to disable nonclustered indexes on computed columns while DBCC CHECKDB is running and then rebuild them afterwards, which is not a very palatable solution. I wanted to blog about this just to make the issue known and help any of you who are beating your head against a wall trying to figure out poor DBCC CHECKDB performance.

[Edit 12/2/2014: Here’s another set of performance tests using a kCura database schema of various sizes.]

15 thoughts on “DBCC CHECKDB performance and computed-column indexes

  1. One could well consider this a performance bug. The fix would be to have one evaluator per thread or at least a pool of them.

    I hope such bottlenecks do not exist in the normal query pipeline…

  2. Seems to be just the filtered index that is the culprit on my server.
    Simple Repro :

    USE [master]
    SET NOCOUNT ON
    DROP DATABASE [Playground]
    GO
    CREATE DATABASE [Playground];
    GO

    USE [Playground];
    GO

    CREATE TABLE [dbo].[FilteredIndexTest]
    (
    IsProcessed BIT DEFAULT ( 0 )
    )
    GO

    INSERT INTO [dbo].[FilteredIndexTest]
    ( IsProcessed )
    VALUES
    ( 0 );
    GO 10000

    INSERT INTO [dbo].[FilteredIndexTest]
    (
    IsProcessed
    )
    SELECT
    [FilteredIndexTest].[IsProcessed]
    FROM
    [dbo].[FilteredIndexTest]
    GO 7

    DECLARE @StartDateTime DATETIME
    DECLARE @EndDateTime DATETIME

    SELECT
    @StartDateTime = GETDATE()
    DBCC CHECKDB('Playground') WITH ALL_ERRORMSGS,NO_INFOMSGS;
    SELECT
    @EndDateTime = GETDATE()

    SELECT
    DATEDIFF(ms, @StartDateTime, @EndDateTime) TimeTaken

    CREATE NONCLUSTERED INDEX [IXF_FilteredIndexTest_IsProcessed]
    ON [dbo].[FilteredIndexTest] (IsProcessed)
    WHERE IsProcessed = 1;

    SELECT
    @StartDateTime = GETDATE()
    DBCC CHECKDB('Playground') WITH ALL_ERRORMSGS,NO_INFOMSGS;
    SELECT
    @EndDateTime = GETDATE()

    SELECT
    DATEDIFF(ms, @StartDateTime, @EndDateTime) TimeTakenWithFilteredIndex

  3. I didn’t see mention of whether this was for all computed columns, or if there was a difference if the column was persisted? thanks, great article.

  4. When I came across this issue (a table with a computed column using a UDF and then also included in an index), I took a copy of the database and tested on another server with the culprit index enabled, disabled, and removed entirely.

    I only saw the performance benefit when it was removed entirely – disabling it wasn’t enough. I’m not sure what the difference is between yours and mine but just in case someone runs into the same thing, try it.

  5. I have an issue with ‘not so big’ a database. Although I inherited it, it needs some work. Although it does have a computed column it is not part of an index. The CHECKDB took longer and longer in production so I just made it Physical_only and restored the backup to another server. I easily determined the table and it takes the 1.1B row table 11 hours for a TABLECHECK. I am seeing the DBCC_OBJECT_METADATA latch 100% on CXPACKET waits. The DDL is as follows:

    CREATE TABLE [dbo].[fCMPPDistributorShipment](

    [dmDateID] [int] NOT NULL,

    [dmCustomerID] [int] NOT NULL,

    [dmStoreID] [int] NOT NULL,

    [dmItemID] [int] NOT NULL,

    [dmItemSaleableProductID] [int] NOT NULL,

    [dmSourceSystemID] [int] NOT NULL,

    [SalesRecordNumber] [int] NOT NULL,

    [QuantitySalesVolume] [decimal](18, 2) NULL,

    [AmountIndirectReported] [decimal](18, 2) NULL,

    [LoadBatchID] [bigint] NULL,

    [HistoryRecId] [bigint] NOT NULL,

    [DateTimeInsertedUTC] [datetime] NULL,

    [DatetimeUpdatedUTC] [datetime] NULL,

    [DatetimeDeletedUTC] [datetime] NULL,

    [ModifiedUTC] AS (case when [DatetimeDeletedUTC]>=isnull([DatetimeUpdatedUTC],CONVERT([datetime],’1900-01-01′,(20))) AND [DatetimeDeletedUTC]>=isnull([DatetimeInsertedUTC],CONVERT([datetime],’1900-01-01′,(20))) then [DatetimeDeletedUTC] when [DatetimeUpdatedUTC]>=isnull([DatetimeDeletedUTC],CONVERT([datetime],’1900-01-01′,(20))) AND [DatetimeUpdatedUTC]>=isnull([DatetimeInsertedUTC],CONVERT([datetime],’1900-01-01′,(20))) then [DatetimeUpdatedUTC] else [DatetimeInsertedUTC] end) PERSISTED,

    CONSTRAINT [PK_fCMPPDistributorShipment] PRIMARY KEY NONCLUSTERED

    (

    [dmSourceSystemID] ASC,

    [HistoryRecId] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    ) ON [PRIMARY]

    GO

    SET ANSI_PADDING OFF

    GO

    /****** Object: Index [CIX_fCMPPDistributorShipment_dmDateID] Script Date: 7/27/2015 2:06:50 PM ******/

    CREATE CLUSTERED INDEX [CIX_fCMPPDistributorShipment_dmDateID] ON [dbo].[fCMPPDistributorShipment]

    (

    [dmDateID] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    GO

    Anything obvious to you?

    1. It’s a scalability regression that will be fixed in future builds – I don’t know which exact builds.

      [Edit – my mistake – see comment below.]

  6. So just for my reference, you are saying that for now I cannot do integrity checks (logical) other than ones that take inordinate amounts of time? Will this be cited or is it now listed as a bug. If so, do you know which one it is? I am at 11.0.5669.

    Thanks for your many articles and insights since I use them and those by Brent as my first sources.

    1. Actually – I made a mistake. It’s the computed column thing in the blog post because you have the computed column as PERSISTED, so the value has to be checked. Sorry for the misunderstanding.

  7. Hi Paul,

    we also have problems using DBCC CHECKDB on computed column (persisted) which is on top using managed code.
    We can only work around the problem with adding the column using non-persisted.
    Unfortunately this is not a valid solution because of the heavy performance impact.
    Do you have an idea what we could do instead?

    Thank you,
    Regina

    1. If you can’t temporarily remove those nonclustered indexes, you should backup, copy, restore, and run CHECKDB on the restored copy of the database, without those nonclustered indexes.

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.