As Kimberly blogged about earlier this year, SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.
An interesting situation was discussed online recently which prompted me to write this post. A fellow MVP was seeing periodic corruption messages in his error log, but DBCC CHECKDB on all databases didn’t find any corruptions. A subsequent restart of the instance caused the problem to go away.
My diagnosis? Memory corruption. Something had corrupted a page in memory – maybe it was bad memory chips or a memory scribbler (something that writes into SQL Server’s buffer pool, like a poorly-written extended stored procedure), or maybe a SQL Server bug. Whatever it was, restarting the instance wiped the buffer pool clean, removing the corrupt page.
So why didn’t DBCC CHECKDB encounter the corrupt page?
The answer is to do with DBCC CHECKDB‘s use of database snapshots (and all other DBCC CHECK* commands). It creates a database snapshot and then runs the consistency-checking algorithms on the database snapshot. The database snapshot is a transactionally-consistent, unchanging view of the database, which is what DBCC CHECKDB requires.
More info on DBCC CHECKDB’s use of snapshots, and potential problems can be found at:
- The first section of CHECKDB From Every Angle: Complete description of all CHECKDB stages
- CHECKDB From Every Angle: Why would CHECKDB run out of space?
- Database snapshots – when things go wrong
A database snapshot is a separate database as far as the buffer pool is concerned, with its own database ID. A page in the buffer pool is owned by exactly one database ID, and cannot be shared by any other databases. So when DBCC CHECKDB reads a page in the context of the database snapshot, that page must be read from the source database on disk; it cannot use the page from the source database if it’s already in memory, as that page has the wrong database ID.
This means that DBCC CHECKDB reads the entire source database from disk when it uses a database snapshot. This is not a bad thing.
This also means that if there’s a page in the source database that’s corrupt in memory but not corrupt on disk, DBCC CHECKDB will not encounter it if it uses a database snapshot (the default).
If you suspect that a database has some corruption in memory, the only way to have DBCC CHECKDB use the in-memory pages, is to use the WITH TABLOCK option, which skips using a database snapshot and instead uses locks to quiesce changes in the database.
Hope this helps clear up any confusion!
10 thoughts on “SQLskills SQL101: Why DBCC CHECKDB can miss memory corruption”
Hi Paul,
Thanks for other great post.
We are struggling here to find which object the CHECKDB is processing. If there is a latch, we can get the waitresource and find out, but if there isn’t, it has shown to be quite difficult.
I have tried 2 jobs in parallel writing to the same file, one writes a timestamp every 5 seconds the other runs CHECKDB with ALLERRORMSG. What happens is that the checkdb just get written to the file when all the process is finished.
Did some research also using XE and profiler, but I haven’t got anything.
Do you know any other (accurate) way to know what object is being processed?
Thanks a million :)
There isn’t any way to tell which object is being processed.
What do you think about this script Paul? (Assuming we know the snapshot db id) How accurate you think this can be?
DECLARE @snapshot_dbid BIGINT = 15
DECLARE @schema_name NVARCHAR(200)
DECLARE @object_name NVARCHAR(200)
DECLARE @object_id INT
DECLARE @page_id INT
DECLARE @file_id INT
DECLARE @dbcc_page_output table (
parent_obj VARCHAR (MAX)
, obj VARCHAR(MAX)
, field varchar(max)
, value varchar(max)
)
SELECT TOP 1
@page_id = page_id
, @file_id = file_id
FROM sys.dm_os_buffer_descriptors
WHERE database_id = @snapshot_dbid
ORDER BY
allocation_unit_id DESC
, page_id DESC
INSERT INTO @dbcc_page_output
EXEC (‘DBCC PAGE (‘ + @snapshot_dbid+ ‘, ‘ + @file_id + ‘, ‘ + @page_id + ‘, 3 ) WITH TABLERESULTS’)
SELECT @object_id = value
FROM @dbcc_page_output
WHERE field = ‘Metadata: ObjectId’
SELECT @schema_name = SCHEMA_NAME(schema_id)
, @object_name = name
FROM sys.objects
WHERE object_id = @object_id
SELECT @schema_name + ‘.’ + @object_name as OBJECT_being_checked
SELECT field as [column]
, value
FROM @dbcc_page_output
WHERE obj LIKE ‘Slot 0 Column%’
Sure – that’ll give you an idea, but it’ll fluctuate widely when the batch contains a bunch of different tables. Why do you care btw?
Thanks, Paul. We had a huge DWH release where we partitioned few tables containing computed columns. Since then the checkdb just doubled the time. I read your other article about how bad a checkdb can be on computed columns and I was trying to find which table would be the most disruptive in our case and if the computed column + table partitioning could make the checkdb even worse.
Please it means to check pages in memory we need to run on source database with tablock. So read-only will also not work as it will flush the memory. Here single user will work too. Please correct me if otherwise
Well, that will only check the pages from the database that are already in memory. I would not advise using WITH TABLOCK instead of the default, as memory corruption is pretty rare.
Hi Paul,
Is in-memory corruption a possible reason that a particular page cannot successfully be latched?
We have a page that, following AG failover, is persistently experiencing pagelatch_ex time-outs, as shown in the sql server error log.
A time-out occurred while waiting for buffer latch — type 4, bp 0000030379B20B40, page 9:5378401, stat 0x109, database id: 16, allocation unit Id: 72057669309628416, task 0x0000009D1A447C28 : 0, waittime 300 seconds, flags 0x10b8, owning task 0x0000000000000000. Not continuing to wait. 11739
The index can be scanned so i assume pagelatch_sh is not a problem.
The page is an index leaf level page. It’s the same page over and over again, 100+ times in 10 days. We aren’t getting other latch time-outs with the exception of a few occasions where other pages for the same index (some level 0, some level 1) are affected at the same time as our frequently affected page.
We have performed DBCC CHECKDB WITH PHYSICAL_ONLY on a restored database, per our usual approach, and this reveals no corruption.
There aren’t any corruption errors in the sql server error log.
We’re on SQL Server 2016 SP2 CU6.
At a forthcoming maintenance window, we should get the chance to perform a DBCC CHECKTABLE WITH TABLOCK and/or an index rebuild.
I’m curious, though, might i able to flush this individual page out of memory before that?
Many thanks.
James
Hard to say without being on the system, but my guess would be a bug/problem where something is holding the SH latch rather than corruption. See https://troubleshootingsql.com/2011/08/26/debugging-that-latch-timeout/ for general guideliness.