CHECKDB From Every Angle: Consistency Checking Options for a VLDB

This is a question that comes up a lot – how to run consistency checks on a VLDB?

We’re talking hundreds of GBs or 1 TB or more. These databases are now common on SQL Server – there are many customers with multi-TB databases. Any experienced DBA knows the value of running consistency checks, even when the system is behaving perfectly and the hardware is rock-solid. The two problems that people have with running a full DBCC CHECKDB on their VLDB are:

It takes a long time to run (based on many factors – see my previous post here for details).
It uses lots of resources – memory, CPU, IO bandwidth, tempdb space.

So it uses lots of resources for a long time. Even with a decent sized maintenance window, the DBCC CHECKDB may run over into normal operations. There’s also the case of a system that’s already pegged in more or more resource dimensions. Whatever the case, there are a number of options:

Don’t run consistency checks
Run DBCC CHECKDB using the WITH PHYSICAL_ONLY option
Use the partitioning feature and devise a consistency checking plan around that
Figure out your own scheme to divide up the consistency checking work over several days
Offload the consistency checks to a separate system

Let’s look at each in turn.

Don’t run consistency checks

A lot of people end up doing this as they can’t figure out how to run consistency checks on their DB. Don’t even think about using this option – there’s always a way to do it. If you absolutely cannot figure out a way to get consistency checks on your system, send me email and I’ll help you – I’ve helped many customers with all the schemes below, both in and out of Microsoft.

Use WITH PHYSICAL_ONLY

A full DBCC CHECKDB does a lot of stuff – see previous posts in this series for more details. You can vastly reduce the run-time and resource usage of DBCC CHECKDB by using the WITH PHYSICAL_ONLY option. With this option, DBCC CHECKDB will:

Run the equivalent of DBCC CHECKALLOC (i.e. check all the allocation structures)
Read and audit every allocated page in the database

So it skips all the logical checks, inter-page checks, and things like DBCC CHECKCATALOG. The fact that all allocated pages are read means that:

Any pages that cannot be read at all (i.e. 823 errors) will be discovered
If page checksums are enabled, any corruptions caused by the IO subsystem will be discovered as the page checksum will be checked as part of reading the page into the buffer pool

So there’s a trade-off of consistency checking depth against run-time and resource usage – but this option will pick up problems caused by the IO subsystem as long as page checksums are enabled and present.

Use the partitioning feature

One of the obvious ways to reduce the time/resources issue is to partition the load. If you’re using the partitioning feature then you’re already setup for this. Given that you’ve hopefully got your partitions stored on separate filegroups, you can use the DBCC CHECKFILEGROUP command.

Consider this example – you have the database partitioned by date such that the current month is on a read-write filegroup and the past 11 months are on 11 read-only filegroups (data from more than a year ago is on some offline storage medium). The prior months also have multiple backups on various media so are considered much ‘safer’ than the current month. It makes sense that you don’t need to check the read-only filegroups as often as the current month’s filegroup so an example consistency checking scheme could be:

Run a DBCC CHECKFILEGROUP on each read-only filegroup every week or two
Run a DBCC CHECKFILEGROUP on the read-write filegroup every day or two (depending on the stability of the hardware, how critical the datais, and the frequency and comprehensiveness of your backup strategy).

I know of several companies who’ve made the decision to move to partitioning in part because of this capability to easily divide up the consistency checking.

Figure out your own way to partition the checks

If you haven’t partitioned your tables, then there are ways you can split up the consistency checking workload so that it fits within a maintenance window. You basically need to simulate what DBCC CHECKDB does in a staggered manner. Here’s one scheme that I’ve recommended to several customers:

Figure out your largest tables (by number of pages) and split the total number into 7 buckets, such that there are a roughly equal number of database pages in each bucket.
Take all the remaining tables in the database and divide them equally between the 7 buckets (using number of pages again)
On Sunday:
- Run a DBCC CHECKALLOC
- Run a DBCC CHECKCATALOG
- Run a DBCC CHECKTABLE on each table in the first bucket
On Monday, Tuesday, Wednesday:
- Run a DBCC CHECKTABLE on each table in the 2nd, 3rd, 4th buckets, respectively
On Thursday:
- Run a DBCC CHECKALLOC
- Run a DBCC CHECKTABLE on each table in the 5th bucket
On Friday and Saturday:
- Run a DBCC CHECKTABLE on each table in the 6th and 7th buckets, respectively

There’s one drawback to this method – a new internal database snapshot is created each time you start a new DBCC command, even for a DBCC CHECKTABLE. If the update workload on the database is significant, then there could be a lot of transaction log to recover each time the database snapshot is created – leading to a long total run-time. In this case, you may need to alter the number of buckets you use to make the total operation fit within your available window.

Use a separate system

This alternative is relatively simple – restore your backup (you are taking regular backups, right?) on another system and run a full DBCC CHECKDB on the restored database. This offloads the consistency checking burden from the production system and also allows you to check that your backups are valid. There are some drawbacks to this however:

You need to have sufficient disk space on the spare system to be able to restore the backup onto. If the production database is several TB, you need the same several TB on the spare box. This equates to a non-trivial amount of money – initial capital investment plus ongoing storage management costs. (Hopefully a future release will alleviate this – while at Microsoft I invented and patented a mechanism for consistency checking a database in a backup without restoring it.)
If the consistency checks find an error, you don’t know for sure that the database is corrupt on the production system. It could be a problem with the spare box that’s caused the corruption. The only way to know for sure is to run a consistency check on the production system. This is a small price to pay though, because most of the time the consistency checks on the spare system will be ok, so you know the production database was clean at the time the backup was taken.

Summary

You’ve got a bunch of choices to allow you to run consistency checks, so there’s really no excuse for not knowing (within a reasonable time-frame) that something’s gone wrong with your database.

40 thoughts on “CHECKDB From Every Angle: Consistency Checking Options for a VLDB”

Ryan Skinner says:

April 23, 2009 at 10:29 am

Would another option be to take a SAN level snapshot of the database and then do a DBCC CheckDB on that copy of the database?

Reply
paul says:

April 27, 2009 at 3:15 am

Yup – that would work too.

Reply
Wolf Goebel says:

June 17, 2013 at 10:31 am

I have a SQLServer 2012 server that is performing very strangely. I was running full CheckDB every day, for about a month or so, without any problems. Then, one day, it took about 12 hours to run it against a very small database. From this day on, I keep getting varying times to run it. Today, in one of the instances, it took my job 6 hours and a half to run CheckDB against the msdb database, which is 64Mb, with 40% unallocated space. What could there possibly be wrong?

Reply
1. Paul Randal says:
  
  June 18, 2013 at 1:28 am
  
  Gross I/O subsystem perf issue or something preventing CHECKDB creating its database snapshot.
  
  Reply
SQL DBA says:

July 5, 2013 at 10:24 am

Paul, when you say – Figure out your largest tables (by number of pages) and split the total number into 7 buckets, such that there are a roughly equal number of database pages in each bucket.

What do you mean by a ‘bucket’?

Reply
1. Paul Randal says:
  
  August 13, 2013 at 2:05 pm
  
  Group, set of tables to operate on per day.
  
  Reply
BradC says:

August 19, 2013 at 1:21 pm

Great article, Paul.

In your “table buckets” scenario, does CHECKALLOC + CHECKCATALOG + CHECKTABLE cover everything that a CHECKDB does? The MSDB documentation says that CHECKDB also:

Validates the contents of every indexed view in the database.

Validates link-level consistency between table metadata and file system directories and files when storing varbinary(max) data in the file system using FILESTREAM.

Validates the Service Broker data in the database.

Are these important? Are there ways to do them separately? (I’m mostly worried about the indexed view validation, I don’t believe we are using FILESTREAM or Service Broker currently).

Reply
1. Paul Randal says:
  
  August 22, 2013 at 9:43 am
  
  Don’t worry about the indexed view validation. It’s off by default from 2008 onward because it wasn’t finding problems.
  
  Reply
JerryJ says:

February 12, 2014 at 2:01 pm

Thanks for the great options. Is there recommended minimum frequency to running DBCC’s? Our policy has been to run “Physical only” nightly and full on weekends, but have timing issues on some servers. We anticipate partitioning the workload over a week’s time. Is that typical? Or could we extend over 2 weeks or a month? I know that’s a “it depends” question, but would like best practice recommendation.

Reply
1. Paul Randal says:
  
  February 13, 2014 at 5:42 am
  
  I always like to recommend a full CHECKDB every week, unless there are extenuating circumstances (e.g. you have to split the checks up rather than using CHECKDB).
  
  Reply
arun says:

February 6, 2017 at 11:51 pm

Hi Paul – PHYSICAL ONLY option does CHECKALLOC and skips CHECKCATALOG, when you go on say it still reads every page I assume that CHECKTABLE runs then too? thanks.

Reply
1. Paul Randal says:
  
  February 7, 2017 at 10:03 am
  
  Yes – but only doing the page-reading checks.
  
  Reply
Anthony Curreri says:

March 15, 2017 at 12:53 pm

I read through BOL and all your recommendations multiple times. I’ve been doing some testing also. I’m trying to determine if NOINDEX checking is a better approach then PHYSICAL_ONLY. They seem very similar and INDEXES are logical structures that are also skipped during PHYSICAL_ONLY from what I can tell.

My questions are:

1. Because PHYSICAL_ONLY skips checkcatalog, and checkcatalog seems important because the errors it detects require a restore, isn’t it less risk to run NOINDEX instead of PHYSICAL_ONLY as a best practice for production and VLDBs?

2. Because indexes can always be rebuild isn’t running DBCC with no NOINDEX no real risk of data loss, because the worst case is I have to rebuild an index?

3. If I was to separate and schedule the workload of DBCC, by large tables and different nights, it would be nice if DBCC CHECKDB had an omit tables option. It would allow me to do most of the critical databases on the maint schedule, and then run selected checktables (maybe with NOINDEX) as need on other nights.

Please let me know what you think about using NOINDEX instead of PHYSICAL_ONLY. Thanks

Reply
1. Paul Randal says:
  
  March 15, 2017 at 9:08 pm
  
  1) No – you shouldn’t use NOINDEX as it skips indexes completely. PHYSICAL_ONLY still reads the index pages and checks their page checksums.
  2) Correct. But you still want to know if there’s corruption in the database as soon as possible. Just because it hit a nonclustered index (which will cause query failures), next time it might hit a clustered index or system table.
  3) Please submit a request on connect.microsoft.com
  
  Don’t use NOINDEX. I’ve never recommended it but never got around to pressing to have it removed back when I was on the SQL team.
  
  Reply
Sourav says:

April 10, 2017 at 5:07 am

I have two servers (SQL 2008 SP4 Enterprise) with database mirroring configured across them. If I run CHECKDB WITH PHYSICAL_ONLY on principal database and normal DBCC CHECKDB () on mirrored database snapshot, will this option cover all consistency check that normal CHECKDB performs on principal database.

Note: page_verify is configured as CHECKSUM for database on Principal database. From 2008 onward, we also have auto page repair option with database mirroring.

Reply
1. Paul Randal says:
  
  April 11, 2017 at 2:01 am
  
  No – you need to run full checkdb on them both. Running it on the mirror doesn’t tell you anything about the state of the principal.
  
  Reply
PaulG says:

August 23, 2017 at 4:04 am

Hi Paul

I’m a huge fan of you and Kimberly’s work.

Regarding this: “while at Microsoft I invented and patented a mechanism for consistency checking a database in a backup without restoring it.”

This is very interesting. Is this functionality publicly available now or are there plans for it to be so?

thanks
Paul.

Reply
1. Paul Randal says:
  
  August 23, 2017 at 5:01 am
  
  Thanks!
  
  No – it wasn’t implemented and isn’t ever likely to be. See https://www.sqlskills.com/blogs/paul/patents-and-parties/
  
  Reply
Michael DeFehr says:

June 29, 2018 at 10:29 am

Hey Paul

SQL 2008 installation – CheckDB is running on a secondary system, so there is lots of everything *except* disk space – assuming checkdb is using what it needs and not more (DB data file is 293GB, tempdb grows to 34GB, largest table is 97GB) – I’m guessing all of the partitioning options don’t help me and the only way to make CheckDB use less tempdb (potentially) is PHYSICAL_ONLY – can’t seem to find any info on what impact (if any) that might have on tempdb space usage.

Any other suggestions? I’ve asked management for more disk – they’re thinking about it (~2 years now…) – I could run checkdb in production – more space there and a reasonable maintenance window

Thanks

Reply
1. Paul Randal says:
  
  June 29, 2018 at 2:11 pm
  
  Correct – PHYSICAL_ONLY should use profoundly less space in tempdb. In your situation, I’d be tempted to run it in production every so often too so you get the full checks.
  
  Reply
Damodhar Marella says:

August 27, 2018 at 1:20 pm

Hello Paul, we have a Sql 2014 instance with latest patch level. These days we are facing logical inconsistency issues on tempdb when ever we ran some queries and these are intermittent. Actually it says incorrect page I’d detected on one of the tempdb file while trying to sort particular operation. What could be the reason for this? There are few procedures written with no lock and row lock hints and we have good amount of tempdb files based on the processors. There is no TDE enabled.

Reply
1. Paul Randal says:
  
  August 27, 2018 at 1:27 pm
  
  Either an I/O subsystem problem or bad memory would be my guesses. I don’t know of any current bugs that could cause this.
  
  Reply
Jeremy says:

June 6, 2019 at 9:19 pm

Paul, thanks for the help already given and this wonderful article. I’m finally implementing a kind of “bucket” system- I have four hours a night that the system can take a performance hit so I’ve got a procedure that, each night, will checkalloc and checkcatalog (these take under a minute) and then proceed to checktable on every table, stopping after four hours. The next night, it runs checkalloc, checkcatalog, and continues on the next table in the list. If there aren’t any tables in the checktable queue, it fills up the queue again with 3900 tables or so (JD Edwards, ugh) and starts at the top.

In an email you mentioned, “You’re probably missing system tables.” Can you elaborate? select * from databasename.sys.tables where type ‘U’ results in zero rows… should I be running checktables on some other tables that aren’t in sys.tables?

Thank you

Jeremy

Reply
1. Paul Randal says:
  
  June 9, 2019 at 12:00 pm
  
  You’re welcome! You want to be doing everything in sys.objects with type = S or U.
  
  Reply
Anand Thool says:

September 5, 2019 at 10:40 pm

Hi Paul,

Is it better option to devide the Full dbcc checkdb into three sepearte checks to reduce the overhead like

Run a DBCC CHECKALLOC
Run a DBCC CHECKCATALOG
Run a DBCC CHECKTABLE

Reply
1. Paul Randal says:
  
  September 9, 2019 at 7:41 am
  
  There’s no right answer to this – it depends on what restrictions you have. And remember it’s not three parts if you do this, it’s two parts plus one part for each table in your database, as CHECKTABLE only checks one table.
  
  Reply
Mehdi says:

November 15, 2019 at 10:34 pm

Hi Paul!

Does executing the DBCC checkdb increase the size of full or differential backups?

Reply
1. Paul Randal says:
  
  November 16, 2019 at 1:21 pm
  
  No – why would you think that?
  
  Reply
Elvin says:

November 18, 2019 at 6:24 am

Hi Paul.

Thank you for such a great posts about CHECKDB and for great feature at all.

I want to ask you about using separate system option for CHECKDB. If I’ll restore FULL+DIFF+LOG backups and run CHECKDB,is it would be OK or I must restore FULL+DIFF without LOG?

Reply
1. Paul Randal says:
  
  November 18, 2019 at 7:43 am
  
  If you’re just doing it do run CHECKDB, then the full backup is enough for that (assuming you’re doing it, say, once per week).
  
  Reply
  1. Elvin says:
    
    November 19, 2019 at 5:49 am
    
    Thank you for quick reply.
    
    What if I want to run CHECKDB everyday after DIFF backup, would it be correct solution to run CHECKDB on restored FULL+DIFF? Or it wouldn’t give me any advantage?
    
    Reply
    1. Paul Randal says:
      
      November 19, 2019 at 3:10 pm
      
      You could absolutely do that – the more frequently the better, if you can, is always good.
      
      Reply
Jitanshu Srivastava says:

December 18, 2019 at 1:49 pm

Hi Paul,

We are facing multipe checkdb failures every weekend on our VLDB.By the time i came across this Physical_only feature.

Can you please suggest if it is fine to replace checkdb with checkdb Physical_only and is it safe if my db or any page gets corrupted,does it will be able to get that corruption information as error message like as it does on normal Checkdb?

Thanks,
Jitanshu

Reply
1. Paul Randal says:
  
  December 27, 2019 at 8:49 am
  
  You need to fix the root cause of the corruption – likely hardware. No – you can’t *never* run a full DBCC CHECKDB as there are a lot of checks it does that a PHYSICAL_ONLY does not.
  
  Reply
srinivas says:

February 12, 2020 at 2:43 am

Hi Paul,
Thanks for your wonderful post.
If we do Check table for every table in the database individually, Does it creates complete database snapshot for every DBCC CheckTable call?
If it so, Is it preferrable to run for all the tables everytime which will create multiple DB snapshots?

Reply
1. Paul Randal says:
  
  February 18, 2020 at 11:11 pm
  
  Yes it does, so I wouldn’t do that unless you have a good reason you can’t run DBCC CHECKDB, even on a backup-copy-restored copy of the database.
  
  Reply
Dao says:

February 21, 2020 at 4:13 am

Hi Paul.
Thank you for this article.

I was hoping you could give an advise on this.

I am running a checkdb(olla Hollagren) with noindex . It does not check the non clustered indexes only.

I am worried we could miss a corruption of an index.

The reason why I chose this solution is because of the time it takes to run

What do you think if i decide to run a full check with index every 2 weeks and one without index every 2 weeks as well.

Reply
1. Paul Randal says:
  
  February 21, 2020 at 11:17 am
  
  Sounds ok to me.
  
  Reply
ParanoidDBA says:

June 25, 2024 at 7:03 am

Hi Paul, thanks so much for this article. Please may I ask a question about a situation I am facing. We run checkdb against our alwayson replica, it normally completes without error. However, on 2 occasions in the last 4 months, it has reported a corruption:

The operating system returned error 1392(The file or directory is corrupted and unreadable.) to SQL Server during a write at offset 0x00000d0a690000 in file ‘E:\xxx\data\xxx.ndf_MSSQL_DBCC9’. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Subsequent executions of checkdb report no corruption. I have logged an incident with MS and they told me to run chkdsk, which came back with zero errors, then told me it was likely a transient write error and likely safe to ignore. I’m comfortable ignoring this, but I am coming close to being out of ideas on how to troubleshoot it. So my question, Is there any chance that checkdb could run out of space, and report a corruption rather than an out of space error? (we are getting close to disk space limits).

Reply
1. Paul Randal says:
  
  June 27, 2024 at 1:54 pm
  
  No – it’s definitely an I/O subsystem error with the sparse file CHECKDB uses for its snapshot. Apart from chkdsk, about the only thing you can do is run whatever diagnostics are provided by your I/O subsystem vendor.
  
  Reply

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.

CHECKDB From Every Angle: Consistency Checking Options for a VLDB

40 thoughts on “CHECKDB From Every Angle: Consistency Checking Options for a VLDB”

Leave a Reply Cancel reply

Other articles

2024: the year in books

2023 (not a typo!): the year in books

Black Friday Super Sale!

Summer School Savings SALE is live!

The Curious Case of… missing log information

The Curious Case of… missing 832 errors during memory corruption

Imagine feeling confident enough to handle whatever your database throws at you.