CHECKDB From Every Angle: How long will CHECKDB take to run?


This is a subject I posted about last year on my old blog but it came up at SQL Connections last week several times so I want to repost it for those who’ve just started following my blog.


There’s only one time when you should be trying to work out how long a CHECKDB is going to take – when you’re planning your regular database maintenance. If you’re faced with a corrupt (or suspected corrupt) database and you’re only just starting to think about how long a CHECKDB is going to take – you’ve made a mistake while planning your disaster recovery strategy. You always need to know how long CHECKDB takes (on average) to run for your database so:



  • You can tell whether a particular run of CHECKDB is taking longer than usual – a sign that it’s found some corruption
  • You know how long it will take to get results in a disaster recovery situation

At every conference I go to, someone asks me how long CHECKDB will take to run on their database. There are several ways I could answer this:



  • The unhelpful answer – I’ve got no idea.
  • The almost-helpful answer – how long did it take to run last time and are the conditions exactly the same?
  • The answer I usually give – it depends.

Now, many people would see the third answer as being somewhat equivalent to the first answer – unhelpful. The problem is that there are many factors which influence how long CHECKDB will take to run. Let me explain the ten most important factors so you get an idea why this is actually a helpful answer. These aren’t in any particular order of importance.


1) The size of the database


Pretty obvious… CHECKDB has to read every allocated page in the database so the bigger it is, the longer it will take to read all the pages.


2) Concurrent IO load on the server


At the simplest level, what is CHECKDB going to do? It reads every allocated page in the database. That’s a lot of IO. CHECKDB takes great pains to do the most efficient IO it can and read the database pages in their physical order with plenty of readahead so that the disk heads move smoothly across the disks (rather than jumping around randomly and incurring disk head seek delays). If there’s no concurrent IO load on the server, then the IOs will be as efficient as CHECKDB can make them. However, introducing any additional IO from SQL Server means that the disk heads will be jumping around – slowing down the CHECKDB IOs. If the IO subsystem is at capacity already from CHECKDB’s IO demands, any additional IO is going to reduce the IO bandwidth available to CHECKDB – slowing it down.


3) Concurrent CPU activity on the server


At the next level of simplicity, CHECKDB is going to process every page it reads in some way. Depending on the various options you’ve specified and the database schema (details below), that’s going to use a lot of CPU – it’s possible that the server may be pegged at 100% CPU when CHECKDB is running. If there’s any additional workload on the server, that’s going to take CPU cycles away from CHECKDB and it going to slow it down.


Basically what points #2 and #3 are saying is that CHECKDB is very resource intensive! It’s probably one of the most resource intensive things you can ask SQL Server to do and so it’s usually a good idea to not run it during peak workload times, as you’ll not only cause CHECKDB to take longer to run, you will slowdown the concurrent workload, possibly unacceptably.


4) Concurrent update activity on the database


This is relevant for both SQL 2000 and SQL 2005, but for different reasons.


In SQL 2000, CHECKDB gets its consistent view of the database from transaction log analysis of concurrent DML transactions (see here for details). The more concurrent DML there is while CHECKDB is running, the more transaction log will be generated – and so the longer it will take for CHECKDB to analyze that transaction log. It’s possible that on a large multi-CPU box with a ton of concurrent DML and CHECKDB limited to a single CPU that this phase of CHECKDB could take several times longer than the reading and processing of the database pages! (I’ve seen this in real-life several times.)


In SQL 2005, CHECKDB gets its consistent view of the database from a database snapshot, which is stored on the same disk volumes as the database itself. If there are a lot of changes in the database while CHECKDB is running, the changed pages are pushed to the snapshot so that it remains consistent. As the snapshot files are stored in the same location as the database files, every time a page is pushed to the snapshot, the disk head has to move, which interrupts the efficient IO described in #2. Also, whenever CHECKDB goes to read a page and it needs to read the page from the snapshot files instead of the database files, that’s another disk head move, and another efficient IO interruption. The more concurrent changes to the database, the more interruptions to efficient IO and the slower that CHECKDB runs.


5) Throughput capabilities of the IO subsystem


This one’s simple. CHECKDB is going to do a boat-load of IOs and it could even end up being IO-bound (meaning that the CPUs are idle periodically waiting for IOs to complete) depending on the options specified and the database schema. This means that the throughput of the IO subsystem is going to have a direct effect on the run-time of CHECKDB. so, if you have a 1TB database and the IO subsystem can only manage 100MB/sec, it’s going to take almost 3 hours just to read the database (1TB / 100MB / 3600 secs) and there’s nothing you can do to speed that up except upgrade the IO subsystem.


I’ve lost count of the number of times I’ve heard customers complain that CHECKDB (or index rebuilds or other IO-heavy operations) are running sloooowly only to find that the disk queue lengths are enormous and the IO subsystem it entirely unmatched to the server and workload.


6) The number of CPUs (processing cores) on the box


This also really encompasses the Edition of SQL Server that’s being run. In Enterprise Edition, CHECKDB can run in parallel across all the CPUs in the box (or as many as the query processor decides to parallelize over when the CHECKDB internal queries are compiled). Running in parallel can give a significant performance boost to CHECKDB and lower run times, as long as the database is also spread over multiple files too (so the IOs can be parallelized). There’s a nifty algorithm that’s used that allows CHECKDB to run in parallel which I’ll explain in detail in a future post.


On the other hand, the fact that CHECKDB can run in parallel in Enterprise Edition can be bad for some scenarios, and so some DBAs chose to force CHECKDB to be single-threaded. SAP typically recommends this to help with user query predictability. The way to do this is to turn on the documented trace flag 2528.


7) The speed of the disks where tempdb is placed


Running CHECKDB against a VLDB uses lots of memory for internal state and for VLDBs the memory requirement usually exceeds the amount of memory available to SQL Server. In this case, the state is spooled out to tempdb and so the performance of tempdb can be a critical factor in CHECKDB performance. See this post for more details of this and how CHECKDB can run out of disk space if tempdb is too small.


8) The complexity of the database schema


This can have a really high impact on the run-time of CHECKDB because it impacts the amount of CPU that CHECKDB requires. For example, the most expensive checks that CHECKDB does are for non-clustered indexes. It needs to check that each row in a non-clustered index maps to exactly one row in the heap or clustered index for the table, and that every heap/clustered index row has exactly one matching row in each non-clustered index. Although there’s a highly efficient algorithm for doing this, it still takes around 30% of the total CPU that CHECKDB uses!


There are a bunch of other checks that are only done if the features have been used in the database – e.g. computed column evaluation, links between off-row LOB values, Service Broker, XML indexes, indexed views – so you can see that empirical factors along aren’t enough to determine the run-time.


9) Which options are specified


This is almost the same as #7 in that by specifying various options you’re limiting what checks CHECKDB actually performs. For instance, using the WITH NOINDEX option will turn off the non-clustered index checks that I described in #7 and using the WITH PHYSICAL_ONLY option will turn off all logical checks, vastly decreasing the run-time of CHECKDB and making it nearly always IO-bound rather than CPU-bound (in fact this is the most common option that DBAs of VLDBs use to make the run-time of CHECKDB manageable).


One thing to be aware of – if you specify any repair options, CHECKDB always runs single-threaded, even on a multi-proc box on Enterprise Edition.


10) The number and type of corruptions that exist in the database


Again, this is similar to #7 and #8. If there are any corruptions present, there may be extra checks triggered to try to figure out more details of the corruptions. For instance, for the non-clustered index checks, the algorithm is tuned very heavily for the case when there are no corruptions present (the overwhelming majority of cases considering the millions of times CHECKDB is run every day around the world). When a non-clustered index corruption is detected, a more in-depth algorithm has to be used to figure out exactly where the corruption is, which involves re-scanning a bunch of data and so taking a bunch more time. There are a few other algorithms like this too.


Summary


So you can see that there’s no simple answer.

14 thoughts on “CHECKDB From Every Angle: How long will CHECKDB take to run?

  1. Perfect timing for me.

    I’ve got a stubborn client and a DB that must be taken offline to be fixed (VLDB with no DR and no maintenance). Hopefully this will convince them.

  2. Paul,

    Following up on item 7) The speed of the disks where tempdb is placed, would being able to add enough RAM to the DBCC platform mitigate the need for a fast disk subsystem for tempdb? I have a dedicated R720 for DBCC testing with a 24-spindle DAS but only 32 GB of RAM to test DBs up to 2 TB in size with tables in the 150-200 GB range. I have an option to add RAM up to 256 GB but would moving TempDB from the DAS to SSDs be more effective?

    Thank you,
    Matt

  3. Hi Paul,
    I am trying to setup Check Table in a semi-automated process and I am getting confusing results. I use sp_executeSQL to put the results into a temp table and this is a typical result set.

    Error Level State MessageText RepairLevel
    2593 10 1 There are 55296983 rows in 1160959 pages for object “dbo.MyTable”. NULL

    When I check the SQL Error log the corresponding log entry is:
    DBCC CHECKTABLE (MyDB.dbo.MyTable) WITH all_errormsgs, tableresults executed by MyDomain\ray.herring found 0 errors and repaired 0 errors.

    The only results I find on Google say Error 2593 indicates Corruption. I do have corruption (CRC Error) in one of the files in the file group. I am just trying to isolate the particular table. There are only about 15 but each is around 50GB and 300M rows.

    SQL2008R2 Developer on Windows 2008R2 (sigh)

    1. Nope – whatever you’re reading is wrong. 2593 is an informational message about row counts, not an error message. You should just run DBCC CHECKDB and it’ll tell you which table is affected.

  4. Hi Paul,

    I have three AlwaysOn clusters on SQL 2016 SP2 CU2 with 12 DBs each. The nodes have 204 GB RAM and maxserver memory set to 180GB, and maxdop 8. When I run DBCC on two of the three clusters, I get an error saying “Failed to allocate BUFs: FAIL_BUFFER_ALLOCATION 1” and either the SQL Service stops altogether, or DBCC gets stuck in Suspended state and does not change percent_complete in sys.dm_exec_requests, so I need to reboot to stop the query.
    I have ran DBCC with maxdop=4 and maxdop = 6 also, and that doesn’t help either. It doesn’t always crash however and I could not pinpoint the problem. It looks like a memory leak in SQL Server 2016, as I run the same on SQL Server 2014 and this doesn’t reproduce. Is there something I should check for in particular? So far I have only encountered the issue when running DBCC checkdb, no other process.
    Thank you!

    1. There’s a memory bug (exposed, but not caused, by CHECKDB’s memory usage) that’s fixed in the latest builds of 2012, 2014, 2016 – you’ll need to patch your servers to get rid of it.

      1. Hi Paul,
        Thanks so much for your answer! Is it a reported bug anywhere? I could not find it. I did upgrade one of the clusters to SQL Server 2016 SP2 CU3 and I still experienced the problem. I could see a new CU came out for 2014, but nothing in SQL Server 2016 after CU3. Are you referring to the CU3 version for 2016 having the issue fixed?

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.