This is a question that comes up a lot – how to run consistency checks on a VLDB?
We’re talking hundreds of GBs or 1 TB or more. These databases are now common on SQL Server – there are many customers with multi-TB databases. Any experienced DBA knows the value of running consistency checks, even when the system is behaving perfectly and the hardware is rock-solid. The two problems that people have with running a full CHECKDB on their VLDB are:
- It takes a long time to run (based on many factors – see my previous post here for details).
- It uses lots of resources – memory, CPU, IO bandwidth, tempdb space.
So it uses lots of resources for a long time. Even with a decent sized maintenance window, the CHECKDB may run over into normal operations. There’s also the case of a system that’s already pegged in more or more resource dimensions. Whatever the case, there are a number of options:
- Don’t run consistency checks
- Run CHECKDB using the WITH PHYSICAL_ONLY option
- Use SQL Server 2005’s partitioning feature and devise a consistency checking plan around that
- Figure out your own scheme to divide up the consistency checking work over several days
- Offload the consistency checks to a separate system
Let’s look at each in turn.
Don’t run consistency checks
A lot of people end up doing this as they can’t figure out how to run consistency checks on their DB. Don’t even think about using this option – there’s always a way to do it. If you absolutely cannot figure out a way to get consistency checks on your system, send me email and I’ll help you – I’ve helped many customers with all the schemes below, both in and out of Microsoft.
Use WITH PHYSICAL_ONLY
A full CHECKDB does a lot of stuff – see previous posts in this series for more details. You can vastly reduce the run-time and resource usage of CHECKDB by using the WITH PHYSICAL_ONLY option. With this option, CHECKDB will:
- Run the equivalent of DBCC CHECKALLOC (i.e. check all the allocation structures)
- Read and audit every allocated page in the database
So it skips all the logical checks, inter-page checks, and things like DBCC CHECKCATALOG. The fact that all allocated pages are read means that:
- Any pages that cannot be read at all (i.e. 823 errors) will be discovered
- If page checksums are enabled in SQL Server 2005, any corruptions caused by the IO subsystem will be discovered as the page checksum will be checked as part of reading the page into the buffer pool
So there’s a trade-off of consistency checking depth against run-time and resource usage – but this option will pick up problems caused by the IO subsystem as long as page checksums are enabled and present.
Use the SQL Server 2005 partitioning feature
One of the obvious ways to reduce the time/resources issue is to partition the load. If you’re using the partitioning feature in SQL Server 2005 then you’re already setup for this. Given that you’ve hopefully got your partitions stored on separate filegroups, you can use the DBCC CHECKFILEGROUP command.
Consider this example – you have the database partitioned by date such that the current month is on a read-write filegroup and the past 11 months are on 11 read-only filegroups (data from more than a year ago is on some offline storage medium). The prior months also have multiple backups on various media so are considered much ‘safer’ than the current month. It makes sense that you don’t need to check the read-only filegroups as often as the current month’s filegroup so an example consistency checking scheme could be:
- Run a DBCC CHECKFILEGROUP on each read-only filegroup every week or two
- Run a DBCC CHECKFILEGROUP on the read-write filegroup every day or two (depending on the stability of the hardware, the criticality of the data, and the frequency and comprehensiveness of your backup strategy).
I know of several companies who’ve made the decision to move to SQL Server 2005 in part because of this capability to easily divide up the consistency checking.
Beware that until SP2 of SQL Server 2005, DBCC CHECKFILEGROUP would not check a table at all if it was split over multiple filegroups. This is now fixed and DBCC CHECKFILEGROUP will check partitions on the specified filegroup even if the table is now completely contained on the filegroup.
Figure out your own way to partition the checks
If you’re on SQL Server 2000, or you just haven’t partitioned your database on SQL Server 2005, then there are ways you can split up the consistency checking workload so that it fits within a maintenance window. You basically need to simulate what CHECKDB does in a staggered manner. Here’s one scheme that I’ve recommended to several customers:
- Figure out your largest tables (by number of pages) and split the total number into 7 buckets, such that there are a roughly equal number of database pages in each bucket.
- Take all the remaining tables in the database and divide them equally between the 7 buckets (using number of pages again)
- On Sunday:
- Run a DBCC CHECKALLOC
- Run a DBCC CHECKCATALOG
- Run a DBCC CHECKTABLE on each table in the first bucket
- On Monday, Tuesday, Wednesday:
- Run a DBCC CHECKTABLE on each table in the 2nd, 3rd, 4th buckets, respectively
- On Thursday:
- Run a DBCC CHECKALLOC
- Run a DBCC CHECKTABLE on each table in the 5th bucket
- On Friday and Saturday:
- Run a DBCC CHECKTABLE on each table in the 6th and 7th buckets, respectively
There’s one drawback to this method – a new internal database snapshot is created each time you start a new DBCC command, even for a DBCC CHECKTABLE. If the update workload on the database is significant, then there could be a lot of transaction log to recover each time the database snapshot is created – leading to a long total run-time. In this case, you may need to alter the number of buckets you use to make the total operation fit within your available window.
Use a separate system
This alternative is relatively simple – restore your backup (you are taking regular backups, right?) on another system and run a full CHECKDB on the restored database. This offloads the consistency checking burden from the production system and also allows you to check that your backups are valid. There are some drawbacks to this however:
- You need to have sufficient disk space on the spare system to be able to restore the backup onto. If the production database is several TB, you need the same several TB on the spare box. This equates to a non-trivial amount of money – initial capital investment plus ongoing storage management costs. (Hopefully a future release will alleviate this – while at Microsoft I invented and patented a mechanism for consistency checking a database in a backup without restoring it.)
- If the consistency checks find an error, you don’t know for sure that the database is corrupt on the production system. It could be a problem with the spare box that’s caused the corruption. The only way to know for sure is to run a consistency check on the production system. This is a small price to pay though, because most of the time the consistency checks on the spare system will be ok, so you know the production database was clean at the time the backup was taken.
You’ve got a bunch of choices to allow you to run consistency checks, so there’s really no excuse for not knowing (within a reasonable timeframe) that something’s gone wrong with your database.