This has come up a few times now, most recently in an email question this morning – successive runs of DBCC CHECKDB show varying numbers of corruptions, and sometimes no corruptions at all – what’s going on? Even stranger – a maintenance job runs a DBCC CHECKDB, which shows errors, but then in the morning – no consistency errors. What?
I answered this back in the April 2009 SQL Q&A column in TechNet Magazine, but I want to get it here on the blog too in a bit more detail. The answer has to do with the way the database is consistency checked, and how corruptions are detected.
From SQL Server 2005 onward, you’re going to be using page checksums to help detect corruption. If you created the database on 2005 onward, page checksums are enabled by default and every allocated page will have one. If you upgraded a database from 2000 or before, then you need to manually enable page checksums with ALTER DATABASE. Then nothing happens. Until a page is read in, changed, and then written back out. So your upgraded database will have a mixture of nothing/page checksums, or torn-page detection/page checksums. Note: torn-page protected pages remain torn-page protected, even with page checksums enabled, until the next time they’re altered. Then they get a page checksum. See Inside The Storage Engine: Does turning on page checksums discard any torn-page protection? for an explanation and examples.
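To make the "then nothing happens" point concrete, here is a minimal toy model in Python. It is only a sketch: `ToyDatabase`, `set_page_verify`, and `write_page` are hypothetical names invented for illustration, not SQL Server APIs, and the model assumes nothing beyond what is described above (a page picks up the current protection only when it is written back out).

```python
from dataclasses import dataclass


@dataclass
class Page:
    protection: str  # "none", "torn_page", or "checksum"


class ToyDatabase:
    """Hypothetical model: page protection is applied only when a page is written."""

    def __init__(self, page_verify, page_count):
        self.page_verify = page_verify
        self.pages = [Page(protection=page_verify) for _ in range(page_count)]

    def set_page_verify(self, option):
        # Stands in for changing the database option: no page is touched.
        self.page_verify = option

    def write_page(self, page_id):
        # A page changed and written back out gets the current protection.
        self.pages[page_id].protection = self.page_verify

    def protection_counts(self):
        counts = {}
        for page in self.pages:
            counts[page.protection] = counts.get(page.protection, 0) + 1
        return counts


db = ToyDatabase("torn_page", page_count=4)   # an upgraded database
db.set_page_verify("checksum")                # enable page checksums
before = db.protection_counts()               # every page is still torn-page protected
db.write_page(0)                              # one page is altered and written out
after = db.protection_counts()                # now a mixture of the two protections
print(before, after)
```

In this model, enabling checksums changes nothing on its own; only the page that gets written afterwards ends up with a checksum, which is the mixture an upgraded database will have.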
Once you’ve got page checksums enabled, how can you tell if there are corruptions in the database? Well, there are a number of ways corruptions will show up:
1. You run an operation that hits a page that has been corrupted, and the page checksum test fails
2. You run a BACKUP … WITH CHECKSUM and it finds a page with a bad checksum
3. You run a DBCC CHECKDB and it finds a page with a bad checksum
That’s all very well, but what if a page *doesn’t* have a page checksum on it (because it hasn’t been changed since page checksums were enabled)? None of #1 to #3 will fail because of a bad page checksum, as there isn’t a page checksum to check. #1 might fail, depending on how corrupt the page is, and it will likely fail with an obscure message that doesn’t immediately scream ‘corruption’. #2 won’t fail, as the only time BACKUP examines what it’s backing up is when WITH CHECKSUM is specified and a page has a page checksum on it. #3 might find the corruption, depending on how the page is corrupted. If the corruption is in the middle of a large varchar field, for instance, it probably won’t. Your best bet is to have page checksums enabled and regularly run DBCC CHECKDB.
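The difference a stored checksum makes can be sketched in a few lines of Python. This is an illustration only: `zlib.crc32` stands in for the real page checksum (it is not the algorithm SQL Server uses), and `stamp` and `verify` are hypothetical helper names.

```python
import zlib


def stamp(data):
    # Stand-in for the checksum stored on a page when it is written.
    return zlib.crc32(data)


def verify(data, stored_checksum):
    """Return 'ok', 'bad checksum', or 'not checked' for a page image."""
    if stored_checksum is None:
        # No checksum on the page, so there is nothing to compare against.
        return "not checked"
    return "ok" if zlib.crc32(data) == stored_checksum else "bad checksum"


original = b"A" * 8192                                # an 8192-byte page image
corrupted = b"A" * 4000 + b"\x00" + b"A" * 4191       # one byte flipped mid-page

protected = verify(corrupted, stamp(original))        # page carried a checksum
unprotected = verify(corrupted, None)                 # page never got a checksum
print(protected, unprotected)
```

The same one-byte corruption is caught on the page that carries a checksum and sails through on the page that doesn't, which is why detection on unprotected pages depends entirely on whether the damage happens to break something else that gets validated.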
That’s how corruptions are detected. So what about the disappearing corruptions? This gets into how consistency checks work. Consistency checks only run on the pages in the database that are allocated. If a page isn’t allocated to anything, then the 8192 bytes of it are meaningless and can’t be interpreted. Don’t get confused between reserved and allocated – I explain that in the first misconceptions post here. As long as a page is allocated, it will be consistency checked by DBCC CHECKDB, including testing the page checksum, if it exists. A corruption can seem to ‘disappear’ if a corrupt page is allocated at the time a DBCC CHECKDB runs, but is then deallocated by the time the next DBCC CHECKDB runs. The first time it will be reported as corrupt, but the second time it’s not allocated, so it isn’t consistency checked and won’t be reported as corrupt. The corruption looks like it’s mysteriously vanished. But it hasn’t – it’s just that the corrupt page is no longer allocated. There’s nothing stopping SQL Server deallocating a corrupt page – in fact, that’s what many of the DBCC CHECKDB repairs do – deallocate what’s broken, and fix up all the links.
The maintenance job phenomenon can occur because of the order of operations in the job. If the DBCC CHECKDB is first, and then there’s an index rebuild, and the index rebuild happens to rebuild an index that DBCC CHECKDB had found a corruption in, then the *new* index will have a completely different set of database pages, and won’t contain the corrupt page. Bingo – disappearing corruption. A subsequent DBCC CHECKDB might not find any corruption, because the previously corrupt pages are no longer allocated.
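Here is a toy Python sketch of that sequence: check, rebuild, check again. Everything in it is hypothetical (`check_db` and `rebuild_index` are illustrative names, and a page is reduced to a single `corrupt` flag); it also assumes the rebuild itself succeeds. The only behavior it models is the one described above: the check looks at allocated pages only, and a rebuild moves the index onto a fresh set of pages.

```python
def check_db(pages, allocated):
    """Return IDs of allocated pages flagged corrupt; unallocated pages are skipped."""
    return sorted(pid for pid in allocated if pages[pid]["corrupt"])


def rebuild_index(pages, allocated, index_pages):
    """Build the index on new pages, then deallocate the old ones.

    Returns the new allocated set and the new index's page IDs.
    """
    next_id = max(pages) + 1
    new_index_pages = set()
    for _ in index_pages:
        pages[next_id] = {"corrupt": False}   # freshly written page
        new_index_pages.add(next_id)
        next_id += 1
    new_allocated = (allocated - index_pages) | new_index_pages
    return new_allocated, new_index_pages


# Pages 2 and 3 belong to one index; page 2 is corrupt on disk.
pages = {pid: {"corrupt": pid == 2} for pid in (1, 2, 3, 4)}
allocated = {1, 2, 3, 4}
index_pages = {2, 3}

first_check = check_db(pages, allocated)                # run before the rebuild
allocated, index_pages = rebuild_index(pages, allocated, index_pages)
second_check = check_db(pages, allocated)               # run after the rebuild

print(first_check, second_check, pages[2]["corrupt"])
```

The first check reports page 2, the second reports nothing, and yet page 2 is exactly as corrupt as it was before. It simply isn't allocated any more, so nothing looks at it.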
Bottom line – any time you get corruption error messages, 99.99% of the time it’s your I/O subsystem that’s got problems, even if the corruptions ‘disappear’.