(The Curious Case of… used to be part of our bi-weekly newsletter but we decided to make it a regular blog post instead so it can sometimes be more frequent. It covers something interesting one of us encountered when working with a client, doing some testing, or were asked in a random question from the community.)
Last week I was asked to explain an unstoppable DBCC CHECKDB process. The client regularly runs DBCC CHECKDB at night during off-peak load times and has no problems with it but last week they ran it during the day because of a corruption indication, then decided to stop it and couldn’t. They killed the spid and nothing happened except the spid showed as being in the KILLED/ROLLBACK state.
Let’s back up a bit and explain what DBCC CHECKDB is doing under the covers.
It needs to see a transactionally-consistent, unchanging view of the database. Prior to SQL Server 2005 it did this by essentially running its own version of crash recovery inside itself, by analyzing the database’s transaction log. This was pretty tortuous code that I helped write for SQL Server 2000, there were some cases where it caused false positives, and I had great fun one week in 2001 or 2002 removing all that code forever. The replacement for that code was to instead use a private database snapshot, because a database snapshot gives a transactionally-consistent, unchanging view of a database.
So, the first thing DBCC CHECKDB does in SQL Server 2005 and later is create a database snapshot (unless you specified WITH TABLOCK, or the target database is read-only, single-user, or already a database snapshot).
A database snapshot runs crash recovery of the target database *into* the database snapshot, and herein lies the problem: crash recovery cannot be interrupted, and there’s no check in the crash recovery code to tell whether it’s *real* crash recovery, or crash recovery into a database snapshot (or into the DBCC CHECKDB private snapshot).
If there’s a lot of transaction log to be recovered as part of the initial crash recovery into the database snapshot, that could take a long time. And if someone tries to kill the DBCC CHECKDB while the database snapshot is still running crash recovery, nothing will happen until the crash recovery finishes. It’s not a DBCC shortcoming, it’s a database snapshot shortcoming, and you just have to let it finish.
So there you go – mystery explained!
8 thoughts on “The Curious Case of… unstoppable DBCC CHECKDB”
I used to see the ‘unkillable CHECKDB’ problem when I had a database restore scheduled to happen over the weekend into a non-production environment, and a CHECKDB scheduled to run on the same non-production database. If the CHECKDB took longer than expected, some ALTER DATABASE SET OFFLINE or KILL commands in the restore script would kill the CHECKDB, which went into a rollback that still hadn’t finished over a day later. Normally I wouldn’t restart SQL Server because of slow rollback due to the risk of getting a much worse problem, but in this case a service restart appeared to allow the snapshot to be dropped very quickly. Does this behavior match what you would expect to see based on recent (2016+) code?
Yes. If there was a ton of transaction log to roll back, snapshot creation can take a very long time – but a day seems excessive so I’d check that the database doesn’t have thousands of VLFs in its log. And yes, restarting SQL Server will kill the snapshot creation.
Wow, dropped in to read your blog and found out about a bug I wrote 15 years ago! I’d feel worse except Microsoft hasn’t fixed it yet.
Does DBCC Checktable works the same way?
I am using “DBCC CHECKTABLE(‘sys.sysdbfiles’) WITH PHYSICAL_ONLY;” in one place and currently looking if it is safe to stop/kill this command.
And now after going through this blog, I am wondering if its actually possible to stop it soon after I hit a timeout.
Also, Can there be an impact of killing the command on source database (in case it also uses a database snapshot) ?
Yes, CHECKTABLE works the same way. It will always use a database snapshot unless a) it’s tempdb b) the database is read-only or single-user c) you’re already running it against a manually-created snapshot.
Thanks for another interesting post! I wasn’t aware crash recovery was used to populate the private snapshot, I’d always assumed it was just another snapshot. Does that mean that ADR could have a positive effect on checkdb times, as it could speed up the that crash recovery process, in certain scenarios?
I don’t think so, as the snapshot has to be fully recovered before CHECKDB can start reading through it.