SQLskills SQL101: How can corruptions disappear?

As Kimberly blogged about earlier this year, SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.

Every so often I get an email question from a confused DBA: why is it that sometimes corruptions seem to disappear?

The situation is commonly as follows:

  • There is a regular SQL Agent job that runs DBCC CHECKDB
  • One morning the DBA finds that the job failed, reporting corruptions in one of the databases
  • The DBA runs DBCC CHECKDB on that database again, but this time there are no reported corruptions

This can lead the DBA to mistrust DBCC CHECKDB. Remember the SQL Server 2000 days where sometimes DBCC CHECKDB occasionally reported corruptions when there weren’t any? Those days are long gone now: if DBCC CHECKDB reports corruption, then at that time that it ran there was definitely corruption.

Think about what DBCC CHECKDB is doing: it reads and processes all the allocated pages in the database – all the pages that are part of tables and indexes at the time that DBCC CHECKDB runs. It doesn’t check all the pages in the data files; only those that are currently being used. The pages that are not currently allocated to an object cannot be checked as there’s no “page history” maintained. There’s really no way for DBCC CHECKDB to tell if they have ever been used before or not and since they’re not currently allocated there’s no valid page structure on them and no past to verify.

And if your database is still being accessed then the set of allocated pages can change after DBCC CHECKDB runs. A simple example of this occurring is:

  • Nonclustered index X of table Y has some corrupt pages in, which the DBCC CHECKDB (being run by a SQL Agent job) reports
  • Another SQL Agent job runs and performs index maintenance where it rebuilds index X (the rebuild operation always builds a new index and then drops the old index)
  • The DBA runs DBCC CHECKDB manually and there are no corruptions reported in the new index structure

Nonclustered index corruption is the best kind of corruption to have. The rebuild operation rewrote the index to a new set of pages and deallocated the pages that had corruption. When DBCC CHECKDB is run manually, those new pages are not corrupt and the old pages are not checked, as they are no longer in use.

These kind of ‘disappearing’ corruptions are a problem because it’s almost impossible to investigate them further. However, they could indicate a problem with your I/O subsystem. If you find that they’re occurring repeatedly, consider briefly preventing the process that causes the corrupt pages to be deallocated so you can investigate the corruption.

Another cause of disappearing corruptions can be transient I/O subsystem problems, where page reads sometimes fail outright and then succeed after that.

And yet one more cause could be that the database is mirrored or is part of an availability group and the page was fixed by automatic page repair before the second DBCC CHECKDB occurred. You can look in the msdb.dbo.suspect_pages table (more details here) for an entry for the broken page(s) with event_type of 4.

Bottom line: From SQL Server 2005 onward, if DBCC CHECKDB reports corruption, then at the time that it ran there definitely was corruption. Make sure you don’t just ignore the problem as next time the corruption occurs, you may not be so ‘lucky’ that it just seemed to disappear.

6 thoughts on “SQLskills SQL101: How can corruptions disappear?

  1. Hmm – I’m wondering about the Agent job reindexing scenario. I thought SQL Server used the existing nonclustered index to build a new one during a rebuild? Here’s a repro script:

    CREATE DATABASE [CarDealer] /* Assuming you have checksums on model, of course */
    GO
    USE [CarDealer];
    GO
    CREATE TABLE dbo.Cars (ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, Make VARCHAR(50), Model VARCHAR(50));
    CREATE NONCLUSTERED INDEX IX_Model ON dbo.Cars(Model);
    INSERT INTO dbo.Cars (Make, Model)
    VALUES (‘Porsche’, ‘911’),
    (‘Aston Martin’, ‘DB11’),
    (‘Ford’, ‘Pinto’);
    GO

    /* Turn on actual plans, and run this: */
    ALTER INDEX IX_Model ON dbo.Cars REBUILD;

    /* SQL Server uses the existing nonclustered index to build the new one:
    https://www.brentozar.com/pastetheplan/?id=ryB0qs4aZ

    Another way to test it – take the database offline, and use a hex editor
    change ‘Pinto’ to ‘Minto’ or whever. Then bring it back online, and: */

    ALTER INDEX IX_Model ON dbo.Cars REBUILD;

    /* And you get:
    The statement has been terminated.
    Msg 824, Level 24, State 2, Line 15
    SQL Server detected a logical consistency-based I/O error: incorrect checksum
    (expected: 0xe9599c38; actual: 0xe95992b8). It occurred during a read of page (1:232)
    in database ID 12 at offset 0x000000001d0000 in file ‘M:\MSSQL\Data\CarDealer.mdf’.

    Additional messages in the SQL Server error log or operating system error log may provide more detail.
    This is a severe error condition that threatens database integrity and must be corrected immediately.
    Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors;
    for more information, see SQL Server Books Online.
    */

    1. Yup – certain corruptions, in certain versions of SQL Server will cause an index rebuild to fail. So the corruption is still there and doesn’t ‘disappear’. A bunch of other corruptions won’t affect the index rebuild. Try whacking a page in the middle of a large b-tree. CHECKDB finds it but an index rebuild doesn’t care. It really depends on how the index is corrupted, what kind of index rebuild, and what version you’re using, whether SQL Server will fail the operation or not. But that discussion is a little beyond SQL101, so the simple example is enough to illustrate the point.

      And this is just one example. What about a case where tables are truncated and reloaded every night? (several students in class last week had exactly that scenario). Plenty of ways that pages can be deallocated and detected corruptions ‘disappear’.

      Thanks for the comment.

  2. I remember a couple companies ago, we had point of sales machines with local SQL 2000 DBs in all our stores (they didn’t want to upgrade, even though SQL 2000 support had long ended), and every once in a while, we’d get a backup failure due to corruption. 9 times out of 10, it was a faulty index and rebuilding it, usually with a larger amount of free space, fixed the problem. If it kept repeating, we’d tell our retail team to change the hardware because the I/O was most likely shot.

    As a pivot to this topic, I’ve noticed SQL acts as a pretty interesting early warning system for larger problems in the server (and even network).

  3. Good Morning Paul,

    Question: In the case of an 824 error disappearing, does it make any sense to run DBCC CHECKDB REPAIR_REBUILD or REPAIR_ALLOW_ DATA_LOSS?

    In other words is it possible that DBCC CHECKDB WITH ALL_ERRORMSGS, DATA_PURITY,EXTENDED_LOGICAL_CHECKS would miss errors, but the repair options would find them?

    Thanks!
    Michael

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.