SQLskills SQL101: Why DBCC CHECKDB can miss memory corruption

As Kimberly blogged about earlier this year, SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.

An interesting situation was discussed online recently which prompted me to write this post. A fellow MVP was seeing periodic corruption messages in his error log, but DBCC CHECKDB on all databases didn’t find any corruptions. A subsequent restart of the instance caused the problem to go away.

My diagnosis? Memory corruption. Something had corrupted a page in memory – maybe it was bad memory chips or a memory scribbler (something that writes into SQL Server’s buffer pool, like a poorly-written extended stored procedure), or maybe a SQL Server bug. Whatever it was, restarting the instance wiped the buffer pool clean, removing the corrupt page.

So why didn’t DBCC CHECKDB encounter the corrupt page?

The answer is to do with DBCC CHECKDB‘s use of database snapshots (and all other DBCC CHECK* commands). It creates a database snapshot and then runs the consistency-checking algorithms on the database snapshot. The database snapshot is a transactionally-consistent, unchanging view of the database, which is what DBCC CHECKDB requires.

More info on DBCC CHECKDB’s use of snapshots, and potential problems can be found at:

A database snapshot is a separate database as far as the buffer pool is concerned, with its own database ID. A page in the buffer pool is owned by exactly one database ID, and cannot be shared by any other databases. So when DBCC CHECKDB reads a page in the context of the database snapshot, that page must be read from the source database on disk; it cannot use the page from the source database if it’s already in memory, as that page has the wrong database ID.

This means that DBCC CHECKDB reads the entire source database from disk when it uses a database snapshot. This is not a bad thing.

This also means that if there’s a page in the source database that’s corrupt in memory but not corrupt on disk, DBCC CHECKDB will not encounter it if it uses a database snapshot (the default).

If you suspect that a database has some corruption in memory, the only way to have DBCC CHECKDB use the in-memory pages, is to use the WITH TABLOCK option, which skips using a database snapshot and instead uses locks to quiesce changes in the database.

Hope this helps clear up any confusion!

10 thoughts on “SQLskills SQL101: Why DBCC CHECKDB can miss memory corruption

  1. Hi Paul,

    Thanks for other great post.
    We are struggling here to find which object the CHECKDB is processing. If there is a latch, we can get the waitresource and find out, but if there isn’t, it has shown to be quite difficult.
    I have tried 2 jobs in parallel writing to the same file, one writes a timestamp every 5 seconds the other runs CHECKDB with ALLERRORMSG. What happens is that the checkdb just get written to the file when all the process is finished.
    Did some research also using XE and profiler, but I haven’t got anything.
    Do you know any other (accurate) way to know what object is being processed?

    Thanks a million :)

      1. What do you think about this script Paul? (Assuming we know the snapshot db id) How accurate you think this can be?

        DECLARE @snapshot_dbid BIGINT = 15
        DECLARE @schema_name NVARCHAR(200)
        DECLARE @object_name NVARCHAR(200)

        DECLARE @object_id INT
        DECLARE @page_id INT
        DECLARE @file_id INT
        DECLARE @dbcc_page_output table (

        parent_obj VARCHAR (MAX)
        , obj VARCHAR(MAX)
        , field varchar(max)
        , value varchar(max)

        )

        SELECT TOP 1
        @page_id = page_id
        , @file_id = file_id
        FROM sys.dm_os_buffer_descriptors
        WHERE database_id = @snapshot_dbid
        ORDER BY
        allocation_unit_id DESC
        , page_id DESC

        INSERT INTO @dbcc_page_output
        EXEC (‘DBCC PAGE (‘ + @snapshot_dbid+ ‘, ‘ + @file_id + ‘, ‘ + @page_id + ‘, 3 ) WITH TABLERESULTS’)

        SELECT @object_id = value
        FROM @dbcc_page_output
        WHERE field = ‘Metadata: ObjectId’

        SELECT @schema_name = SCHEMA_NAME(schema_id)
        , @object_name = name
        FROM sys.objects
        WHERE object_id = @object_id

        SELECT @schema_name + ‘.’ + @object_name as OBJECT_being_checked

        SELECT field as [column]
        , value
        FROM @dbcc_page_output
        WHERE obj LIKE ‘Slot 0 Column%’

          1. Thanks, Paul. We had a huge DWH release where we partitioned few tables containing computed columns. Since then the checkdb just doubled the time. I read your other article about how bad a checkdb can be on computed columns and I was trying to find which table would be the most disruptive in our case and if the computed column + table partitioning could make the checkdb even worse.

  2. Please it means to check pages in memory we need to run on source database with tablock. So read-only will also not work as it will flush the memory. Here single user will work too. Please correct me if otherwise

    1. Well, that will only check the pages from the database that are already in memory. I would not advise using WITH TABLOCK instead of the default, as memory corruption is pretty rare.

  3. Hi Paul,

    Is in-memory corruption a possible reason that a particular page cannot successfully be latched?

    We have a page that, following AG failover, is persistently experiencing pagelatch_ex time-outs, as shown in the sql server error log.

    A time-out occurred while waiting for buffer latch — type 4, bp 0000030379B20B40, page 9:5378401, stat 0x109, database id: 16, allocation unit Id: 72057669309628416, task 0x0000009D1A447C28 : 0, waittime 300 seconds, flags 0x10b8, owning task 0x0000000000000000. Not continuing to wait. 11739

    The index can be scanned so i assume pagelatch_sh is not a problem.

    The page is an index leaf level page. It’s the same page over and over again, 100+ times in 10 days. We aren’t getting other latch time-outs with the exception of a few occasions where other pages for the same index (some level 0, some level 1) are affected at the same time as our frequently affected page.

    We have performed DBCC CHECKDB WITH PHYSICAL_ONLY on a restored database, per our usual approach, and this reveals no corruption.

    There aren’t any corruption errors in the sql server error log.

    We’re on SQL Server 2016 SP2 CU6.

    At a forthcoming maintenance window, we should get the chance to perform a DBCC CHECKTABLE WITH TABLOCK and/or an index rebuild.

    I’m curious, though, might i able to flush this individual page out of memory before that?

    Many thanks.
    James

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Some thoughts on courage

(This is also the Ponderings – editorial – in today’s SQLskills newsletter.) I want to start out this post by sincerely thanking everyone who emailed

Explore

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.