DBCC CHECKDB scalability and performance benchmarking on SSDs

Back in February I ran a bunch of performance tests of DBCC CHECKDB on SSDs, to evaluate the effects of degree-of-parallelism (DOP) and various options and traceflags, and now I’m finally getting around to presenting the results. Make sure to also read the recent post where I talk about the detrimental effect of computed-column indexes on DBCC CHECKDB performance.

[Edit: I’ve heard anecdotal evidence that reducing DOP for DBCC CHECKDB is even more important on 2012 and later versions. YMMV.]

The two variables I altered during my tests were:

  • DOP – 32, 16, 8, 4, 2, 1
  • Configuration – full DBCC CHECKDB, using WITH PHYSICAL_ONLY (PO), using PO and TF2549, using PO and TF2562, using PO and both trace flags

Trace flags 2549 and 2562 are described in KB 2634571, and can be summarized as:

  • 2549: treat each database file as if it’s on a separate physical drive for the purposes of driving readahead
  • 2562: perform the entire set of checks in a single ‘batch’ instead of multiple batches, and increase the efficiency of per-thread calls to get a new list of pages to read

These two trace flags really only apply to DBCC CHECKDB when the WITH PHYSICAL_ONLY option is used.
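For reference, here's a minimal sketch of a physical-only run with both trace flags enabled (the database name matches my test database; adjust as needed):

-- Enable the trace flags globally (-1) so the DBCC CHECKDB worker threads pick them up
DBCC TRACEON (2549, -1);
DBCC TRACEON (2562, -1);
GO

DBCC CHECKDB (N'AdventureWorks') WITH PHYSICAL_ONLY, NO_INFOMSGS;
GO

-- Turn the trace flags off again once the check completes
DBCC TRACEOFF (2549, -1);
DBCC TRACEOFF (2562, -1);
GO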

My test system is as follows:

  • DELL R720 with two 8-core E5-2670 CPUs and hyper-threading enabled
  • 64GB of memory
  • The test database is AdventureWorks that Jonathan expanded to 500GB for me using his cool scripts. The database is split over 8 data files stored on two 320GB Fusion-io drives, with tempdb and its log placed on two more 320GB Fusion-io drives. I set things up this way to remove I/O waits from the test.
  • There was no additional load running concurrently with DBCC CHECKDB.

Complete Results

The complete results are shown below:

[Chart: DBCC CHECKDB elapsed times for each configuration at DOP 32, 16, 8, 4, 2, and 1]

There are some very interesting results from this graph, for the test I’m doing using SSDs:

  • For a full DBCC CHECKDB, there is a 70-80% performance gain from DOP 1 to DOP 2, from DOP 2 to DOP 4, and from DOP 4 to DOP 8.
  • For a full DBCC CHECKDB run, there is really no scalability gain in performance for a DOP greater than 8, and in fact a slight performance degradation for higher DOPs.
  • As you can see for the results of a full DBCC CHECKDB with DOP 8 and higher, DBCC CHECKDB on that server without I/O waits can process 0.5GB of the database per second. That’s pretty cool and is down to the fast SSDs (there were no PAGEIOLATCH_SH waits) and fast processors. Back in SQL Server 2000, our performance benchmark was about 1GB per minute. See here for some example SQL Server 2000 benchmark times that show around 3-5GB/minute, but not on SSDs.
  • The main boost in performance when running on SSDs comes from using the WITH PHYSICAL_ONLY option. This is, of course, to be expected as it cuts out all of the logical checks that make up the bulk of the CPU usage of DBCC CHECKDB.

For the scalability results running a full DBCC CHECKDB, some latch contention on structures inside DBCC CHECKDB starts to show up at DOP 8, but not enough on its own to account for the flat scalability at higher DOPs. That comes from much higher CXPACKET wait times, which roughly double from DOP 8 to DOP 16, and again from DOP 16 to DOP 32. I blame the query processor for that :-) In all seriousness, I didn’t expect to see a scalability gain with DOP higher than 8, based on what I’ve seen in the past.

PHYSICAL_ONLY Results

The results without DOP 1 and DOP 2 are shown below, which allows us to focus on the PHYSICAL_ONLY results:

[Chart: DBCC CHECKDB elapsed times for each configuration at DOP 32, 16, 8, and 4]

Again, for the test I’m doing using SSDs, there are some interesting results:

  • DOP 4 seems to give the best performance overall, apart from the use of both trace flags with DOP 8.
  • Higher DOPs give progressively worse performance. Analysis of wait and latch statistics shows that this is caused by increasing amounts of latch contention at higher DOPs (there's a capture sketch after this list).
  • The two trace flags don’t have a pronounced effect on performance in my situation.
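If you want to capture the same kind of latch data on your own system, a simple approach is to clear the cumulative latch statistics, run the check, and then see what accumulated. A minimal sketch, again using my test database (note that DBCC SQLPERF clears server-wide counters, so only do this on a test system):

-- Clear the cumulative latch statistics (server-wide!)
DBCC SQLPERF ('sys.dm_os_latch_stats', CLEAR);
GO

DBCC CHECKDB (N'AdventureWorks') WITH PHYSICAL_ONLY, NO_INFOMSGS;
GO

-- Latch classes with the most accumulated wait time during the run
SELECT TOP (10)
    [latch_class],
    [waiting_requests_count],
    [wait_time_ms]
FROM sys.dm_os_latch_stats
ORDER BY [wait_time_ms] DESC;
GO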

Summary

When running with a database on SSDs and very fast processors, full DBCC CHECKDB performance scales to DOP 8, and the main performance gain comes from using the WITH PHYSICAL_ONLY option. The two trace flags introduced in late 2011 don’t have a pronounced effect on PHYSICAL_ONLY performance. This is all to be expected, as the SSDs really remove the I/O component of DBCC CHECKDB performance and scalability from consideration.

Although these tests are not conclusive until I’ve run similar tests on a slower I/O subsystem, I’m tempted to recommend limiting DBCC CHECKDB's DOP to 8 if you can (maybe using Resource Governor, as sketched below) to limit its resource drain on your system.
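Here's a minimal sketch of capping DOP with Resource Governor. The pool, group, and login names are hypothetical, and it assumes DBCC CHECKDB is run under a dedicated maintenance login:

USE [master];
GO

-- Resource pool and a workload group that caps DOP at 8
CREATE RESOURCE POOL [MaintenancePool];
GO
CREATE WORKLOAD GROUP [CheckDBGroup] WITH (MAX_DOP = 8)
USING [MaintenancePool];
GO

-- Classifier: route the dedicated maintenance login into the capped group
CREATE FUNCTION [dbo].[fn_CheckDBClassifier] ()
RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @GroupName sysname = N'default';

    -- Hypothetical login used only for running DBCC CHECKDB
    IF SUSER_NAME() = N'DBCCRunner'
        SET @GroupName = N'CheckDBGroup';

    RETURN @GroupName;
END;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = [dbo].[fn_CheckDBClassifier]);
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO

Note that later versions (SQL Server 2014 SP2 onward) also added a MAXDOP option to DBCC CHECKDB itself, which makes this simpler.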

I obviously also have to recommend using SSDs to vastly reduce or effectively eliminate (in my case) I/O waits.

During April and May I’ll perform the same set of tests using 15k SCSI drives and also 7.2k SATA drives to see if and how the results differ.

Hope you found this interesting!

DBCC CHECKDB performance and computed-column indexes

[Edit 2016: The team ‘fixed’ the problem in SQL Server 2016 by skipping consistency checking of these indexes unless WITH EXTENDED_LOGICAL_CHECKS is used.]

It’s no secret that DBCC CHECKDB has some performance quirks based on the schema of the database being checked and various kinds of corruptions. I was recently doing some scalability testing of DBCC CHECKDB for a blog post and discovered quite a nasty performance issue that exists in all versions of SQL Server back to SQL Server 2005. This isn’t a bug, it’s just the way things work.

The issue occurs when a nonclustered index exists that has a computed column as part of the index key or as one of the INCLUDEd columns, and it affects DBCC CHECKDB, DBCC CHECKFILEGROUP, and DBCC CHECKTABLE.

As a bit of background, here’s an excerpt from the SQL Server 2012 Internals book where I describe one of the ways nonclustered indexes are checked for consistency:

The nonclustered index cross-checks verify that

  • Every record in a nonclustered index (whether filtered or nonfiltered) must map to a valid record in the base table (that is, the heap or clustered index).
  • Every record in the base table must map to exactly one record in each nonfiltered, nonclustered index, and one record in each filtered index, where the filter allows.

The mechanism to carry out these checks efficiently has changed in every release since SQL Server 7.0—becoming progressively more and more efficient. In SQL Server 2012, two hash tables are created for each partition of each nonclustered index—one hash table is for the actual records in that partition of the nonclustered index, and the other is for the records that should exist in that partition of the nonclustered index (as calculated from the existing data records in the table).

When a nonclustered index record is processed, all columns in the record are hashed together into a BIGINT value. This includes

  • The physical or logical link back to the base table (known as the base table RID)
  • All included columns (even LOB and FILESTREAM values)

The resulting value is added to the master hash value for actual records for the nonclustered index partition of which the record is part.

DBCC CHECKDB knows which nonclustered indexes exist for the table and what the complete nonclustered index record composition should be for each. When a data record is processed, the following algorithm is run for each matching nonclustered index record that should exist for the data record (taking into account any filter predicates for filtered nonclustered indexes):

  1. Create the nonclustered index record in memory (again, including the base table RID, plus included columns).
  2. Hash all columns in the index record together into a BIGINT value.
  3. Add the resulting value to the “should exist” master hash value for the relevant nonclustered index partition of which the index record is part.

The premise that this algorithm works on is that if no corruptions exist, the master hash values for the actual records and “should exist” records for each nonclustered index partition should match exactly at the end of the DBCC CHECKDB batch.
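To make that concrete, here's a very loose analogy in T-SQL (this is NOT how DBCC CHECKDB implements it internally, and the table and index names are just AdventureWorks examples): hash the relevant columns of each row as read through the nonclustered index, aggregate the hashes, do the same for the rows as derived from the base table, and compare the two aggregates.

-- 'Actual' side: hash aggregate over rows read from the nonclustered index
SELECT CHECKSUM_AGG(CHECKSUM([CustomerID], [SalesOrderID])) AS [ActualHash]
FROM [Sales].[SalesOrderHeader] WITH (INDEX ([IX_SalesOrderHeader_CustomerID]));

-- 'Should exist' side: the same hash aggregate derived from the base table
-- (the clustered index). With no corruption, the two values match.
SELECT CHECKSUM_AGG(CHECKSUM([CustomerID], [SalesOrderID])) AS [ShouldExistHash]
FROM [Sales].[SalesOrderHeader] WITH (INDEX (1));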

When a nonclustered index uses a computed column, the value of the computed column has to be computed based on the column definition. To do that, an internal mechanism called an ‘expression evaluator’ is created. The expression evaluator is provided by the Query Processor code and its behavior is entirely outside the control of DBCC CHECKDB. The drawback of the expression evaluator is that every time it is used, an exclusive latch must be held by the thread using it. This creates an incredible bottleneck and drastically affects performance.

My test system is the Dell R720 we bought last year, with 64GB of memory and two 8-core E5-2670 CPUs with hyperthreading enabled. It’s running SQL Server 2012 SP1 CU3.

The test database is AdventureWorks that Jonathan expanded to 500GB for me using his cool scripts. The database is split over 8 data files stored on two 320GB Fusion-io drives, with tempdb and its log placed on two more 320GB Fusion-io drives. I set things up this way to remove I/O waits from the test.

The first few runs of my testing, with unlimited parallelism, produced DBCC CHECKDB elapsed times of around 340 minutes. This seemed incredibly slow to me, so I ran another test and looked at the output of sys.dm_os_waiting_tasks. I discovered that half the threads at any time were waiting for exclusive access to the DBCC_OBJECT_METADATA latch. Adding some diagnostics to capture waits and latches gave the data below:

WaitType        Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
--------------- ---------- ----------- --------- ---------- ----------- ---------- --------- ---------
CXPACKET        684482.45  682667.13   1815.32   60294236   34.90       0.0114     0.0113    0.0000
OLEDB           659325.08  659325.08   0.00      207661462  33.62       0.0032     0.0032    0.0000
LATCH_EX        615168.39  605357.28   9811.11   798224634  31.37       0.0008     0.0008    0.0000

LatchClass               Wait_S     WaitCount  Percentage  AvgWait_S
------------------------ ---------- ---------- ----------- ----------
DBCC_OBJECT_METADATA     611768.00  764845636  99.16       0.0008

This didn’t make any sense to me: surely such an incredible bottleneck (almost 1ms per latch wait, and 800 million latch waits!) would have been noticed by now? I pinged my friend Ryan Stonecipher, who inherited the DBCC CHECKDB code from me, and he reminded me about the expression evaluator behavior.
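If you suspect you're hitting the same thing, a quick way to see what the DBCC threads are currently waiting on is a query like this, run from another session while the check is executing (just a sketch):

SELECT
    [owt].[session_id],
    [owt].[exec_context_id],
    [owt].[wait_type],
    [owt].[wait_duration_ms],
    [owt].[resource_description]
FROM sys.dm_os_waiting_tasks AS [owt]
JOIN sys.dm_exec_requests AS [er]
    ON [er].[session_id] = [owt].[session_id]
WHERE [er].[command] LIKE N'DBCC%'  -- the DBCC CHECK* commands report a command starting with 'DBCC'
ORDER BY [owt].[session_id], [owt].[exec_context_id];
GO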

I went back to my test system and looked for nonclustered indexes that use computed columns, using the code below:

-- Find all indexes that have a computed column as a key or INCLUDEd column
SELECT
    [s].[name]  AS [SchemaName],
    [o].[name]  AS [TableName],
    [i].[name]  AS [IndexName],
    [co].[name] AS [ComputedColumnName],
    [ic].*
FROM sys.columns [co]
JOIN sys.index_columns [ic]
    ON [ic].[object_id] = [co].[object_id]
    AND [ic].[column_id] = [co].[column_id]
JOIN sys.indexes [i]
    ON [i].[object_id] = [ic].[object_id]
    AND [i].[index_id] = [ic].[index_id]
JOIN sys.objects [o]
    ON [i].[object_id] = [o].[object_id]
JOIN sys.schemas [s]
    ON [o].[schema_id] = [s].[schema_id]
WHERE [co].[is_computed] = 1;
GO

I found 6 nonclustered indexes with computed columns, on some of the largest tables in the database. I then disabled those nonclustered indexes and re-ran the DBCC CHECKDB tests.

17-18 minutes per run.

Wow! The bottleneck from the expression evaluator was causing DBCC CHECKDB to run about 20 times slower.

The wait and latch details for the test runs with the indexes disabled are as below, which is what I expected to see from running DBCC CHECKDB:

WaitType        Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
--------------- ---------- ----------- --------- ---------- ----------- ---------- --------- ---------
CXPACKET        33064.56   29331.74    3732.83   42034533   48.14       0.0008     0.0007    0.0001
OLEDB           28883.78   28883.78    0.00      173940154  42.05       0.0002     0.0002    0.0000
LATCH_EX        5021.20    4605.50     415.70    30340659   7.31        0.0002     0.0002    0.0000

LatchClass               Wait_S     WaitCount  Percentage  AvgWait_S
------------------------ ---------- ---------- ----------- ----------
DBCC_CHECK_AGGREGATE     5039.36    30267451   98.82       0.0002

There’s nothing we can do about this bottleneck except to disable nonclustered indexes on computed columns while DBCC CHECKDB is running and then rebuild them afterwards, which is not a very palatable solution. I wanted to blog about this just to make the issue known and help any of you who are beating your head against a wall trying to figure out poor DBCC CHECKDB performance.
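If you do go the disable-and-rebuild route, it's just an ALTER INDEX before and after the check. A minimal sketch (the index and table names here are hypothetical):

-- Disable the computed-column nonclustered index before the check
ALTER INDEX [IX_MyTable_ComputedColumn] ON [dbo].[MyTable] DISABLE;
GO

DBCC CHECKDB (N'AdventureWorks') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO

-- Rebuilding re-enables the index (and repopulates it, so expect I/O and log cost)
ALTER INDEX [IX_MyTable_ComputedColumn] ON [dbo].[MyTable] REBUILD;
GO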

[Edit 12/2/2014: Here’s another set of performance tests using a kCura database schema of various sizes.]

Make life easier on yourself, get a baseline!

At the SQL Connections conference earlier this month, at the start of my talk on Making SQL Server Faster, Part 1: Simple Things, I talked about the importance of having a performance baseline so you can measure the effect of any changes made to your environment. A month ago I kicked off a survey about performance baselines, see here, to see how many of you do or do not have baselines, and why.

Here are the results:

The 'Other' values for the first survey question are:

  • 10 x "We use a performance monitoring tool to view live trends."
  • 8 x "Not enough time or resources or practical methods in place to baseline."
  • 5 x "I'm trying to create one."
  • 5 x "Not yet."
  • 3 x "We have one, but it's pretty old."
  • 2 x "New job. Created one my first week on systems I had access to. Will continue for the enterprise."
  • 2 x "Our environment is seasonal and the data volumes will be varying from year on year. Hence the baseline was not useful and sometimes the baselines pointed to a resource where it's not an issue."
  • 2 x "Too much stuff changes.
  • 1 x "Clients are not providing details as what an ideal workload is."
  • 1 x "I work for multiple companies and some of them require baselines, others do not know what baselining is. The latter is more."

The 'Other' values for the second survey question are:

  • 10 x "All of the above."
  • 6 x "Any of the above if needed!"
  • 4 x "Checking the effect of a change to the system, Performance troubleshooting, Proactive tuning, Capacity planning."
  • 1 x "If I weren't too busy, all the above."

Information on baselines and benchmarks

(Adapted from one of my recent Insider newsletters)

One of the sentiments that we stress several times during our IE2 course on Performance Tuning is that it's far easier to troubleshoot a performance problem when you have historical data compared to when you don't. This is called having a baseline – a set of measurements of performance metrics when the overall system is performing normally (i.e. satisfactorily for users). Some examples are:

  • The I/O load during regular and peak operations, from the SQL Server and Windows perspectives
  • Buffer pool health (e.g. Page Life Expectancy and Lazywriter activity)
  • Run-time characteristics of important queries
  • Wait statistics
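As a starting point, here's a minimal sketch of capturing a couple of these metrics into a history table on a schedule (the table name is hypothetical; the articles linked below cover this far more thoroughly):

-- One-time setup: a simple history table for baseline snapshots
IF OBJECT_ID(N'dbo.BaselineSnapshots', N'U') IS NULL
    CREATE TABLE [dbo].[BaselineSnapshots] (
        [CaptureTime]  DATETIME2(0)  NOT NULL DEFAULT SYSDATETIME(),
        [MetricName]   NVARCHAR(256) NOT NULL,
        [MetricValue]  BIGINT        NOT NULL
    );
GO

-- Buffer pool health: Page Life Expectancy
INSERT INTO [dbo].[BaselineSnapshots] ([MetricName], [MetricValue])
SELECT N'Page life expectancy', [cntr_value]
FROM sys.dm_os_performance_counters
WHERE [counter_name] = N'Page life expectancy'
    AND [object_name] LIKE N'%Buffer Manager%';

-- Wait statistics: top 10 waits by accumulated wait time
INSERT INTO [dbo].[BaselineSnapshots] ([MetricName], [MetricValue])
SELECT TOP (10) [wait_type], [wait_time_ms]
FROM sys.dm_os_wait_stats
ORDER BY [wait_time_ms] DESC;
GO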

Having a baseline means that when performance appears to be suffering, measurements can be taken and compared against the baseline, to identify areas that have changed. This allows you to target further investigations rather than do what I call 'flailing' or 'knee-jerk' performance tuning, where you have no idea where to start or pick a random symptom and focus on that, wasting precious time.

I could write a whole bunch about baselines, how to collect them, and all kinds of other info, but Erin Stellato has just started a multi-part article series over on SQL Server Central, so I won't duplicate that. Check out her first article: Back to Basics: Capturing Baselines on Production SQL Servers. She's also just published a 3-hour online course through Pluralsight called SQL Server: Baselining and Benchmarking, with all kinds of demos and tons of practical how-to advice.

Bottom line: your performance investigations and tuning will be a LOT easier if you have a baseline to compare against. If you don't have one, get one.