The Curious Case of… unstoppable DBCC CHECKDB

(The Curious Case of… used to be part of our bi-weekly newsletter but we decided to make it a regular blog post instead so it can sometimes be more frequent. It covers something interesting one of us encountered when working with a client, doing some testing, or answering a random question from the community.)

Last week I was asked to explain an unstoppable DBCC CHECKDB process. The client regularly runs DBCC CHECKDB at night during off-peak times and has no problems with it, but last week they ran it during the day because of an indication of corruption, then decided to stop it and couldn't. They killed the spid and nothing happened, except that the spid showed as being in the KILLED/ROLLBACK state.
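If you hit this yourself, a quick way to see what the killed spid is doing is to look at sys.dm_exec_requests and ask for a rollback progress estimate. Here's a minimal sketch, assuming the DBCC CHECKDB session is session 57 (a made-up session id):

-- Check what the killed session is doing (session 57 is hypothetical)
SELECT session_id, command, status, wait_type, wait_time, percent_complete
FROM sys.dm_exec_requests
WHERE session_id = 57;

-- Ask for a rollback progress estimate for the killed session
KILL 57 WITH STATUSONLY;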

Let’s back up a bit and explain what DBCC CHECKDB is doing under the covers.

It needs to see a transactionally-consistent, unchanging view of the database. Prior to SQL Server 2005 it did this by essentially running its own version of crash recovery inside itself, by analyzing the database’s transaction log. This was pretty tortuous code that I helped write for SQL Server 2000, there were some cases where it caused false positives, and I had great fun one week in 2001 or 2002 removing all that code forever. The replacement for that code was to instead use a private database snapshot, because a database snapshot gives a transactionally-consistent, unchanging view of a database.

So, the first thing DBCC CHECKDB does in SQL Server 2005 and later is create a database snapshot (unless you specified WITH TABLOCK, or the target database is read-only, single-user, or already a database snapshot).
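For example, here's a sketch using a hypothetical database name: the first command gets the default snapshot behavior, and the second uses WITH TABLOCK to skip the snapshot, at the cost of taking locks in the database.

-- Default behavior: an internal database snapshot is created to provide the consistent view
DBCC CHECKDB (N'MyDatabase') WITH NO_INFOMSGS;

-- WITH TABLOCK: no snapshot is created; locks are used to get the consistent view instead
DBCC CHECKDB (N'MyDatabase') WITH NO_INFOMSGS, TABLOCK;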

A database snapshot runs crash recovery of the target database *into* the database snapshot, and herein lies the problem: crash recovery cannot be interrupted, and there’s no check in the crash recovery code to tell whether it’s *real* crash recovery, or crash recovery into a database snapshot (or into the DBCC CHECKDB private snapshot).

If there's a lot of transaction log to be recovered as part of the initial crash recovery into the database snapshot, that could take a long time. And if someone tries to kill the DBCC CHECKDB while the database snapshot is still running crash recovery, nothing will happen until the crash recovery finishes. It's not a DBCC shortcoming; it's a database snapshot shortcoming, and you just have to let it finish.

So there you go – mystery explained!

The Curious Case of… faster disks causing more WRITELOG waits

(The Curious Case of… used to be part of our bi-weekly newsletter but we decided to make it a regular blog post instead so it can sometimes be more frequent. It covers something interesting one of us encountered when working with a client, doing some testing, or answering a random question from the community.)

I was contacted last week by someone asking how faster disks could be causing more WRITELOG waits. They were seeing lots of these waits, with an average wait time of 18ms. The log was stored on a RAID-1 array of locally-attached spinning disks in the server. They figured that by moving the log to a RAID-1 array of SSDs, they'd reduce the WRITELOG wait time and get better workload throughput.
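(The 18ms figure came from their monitoring; if you want to check your own numbers, a minimal sketch of the usual query against sys.dm_os_wait_stats is below.)

-- Count and average duration of WRITELOG waits since the last restart or wait stats clear
SELECT
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    wait_time_ms * 1.0 / NULLIF (waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'WRITELOG';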

They did so and got better performance, but were very surprised to now see WRITELOG as the most frequent wait type on the server, even though the average wait time was less than 1ms, and asked me to explain.

Let me start by saying this is not unusual, and I’ve seen this sequence of events many times with clients and students.

A WRITELOG wait occurs when a log block is being flushed to disk, and the two most common causes are that the log block is full (60KB) or a transaction committed. If a workload consists of lots of very small transactions, then reducing the log block flush time from 18ms to less than 1ms is very likely to result in higher throughput. That means more transactions happening and hence more log block flushes when those transactions commit, so *more* WRITELOG waits.
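One way to corroborate this (a sketch, not what the client did) is to watch the log flush counters before and after the change. They're cumulative, so take two samples a known interval apart and difference them to get a rate; MyDatabase is a placeholder name:

-- Cumulative log flush counters for one database; sample twice to compute a per-second rate
SELECT instance_name AS [database], counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE N'%:Databases%'
    AND counter_name IN (N'Log Flushes/sec', N'Log Bytes Flushed/sec')
    AND instance_name = N'MyDatabase';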

Bottom line: what confused them is that performance tuning is usually about reducing the incidence of waits and/or the individual wait times. This is about the only example I can think of where you do something that increases performance and the wait type you were concerned about becomes *more* prevalent.

PREEMPTIVE_OS_FLUSHFILEBUFFERS waits on Linux

Last week I blogged about Using the Wait Statistics Report in Azure Data Studio, using a Linux VM in Azure as an example host. That was my first time running wait stats analysis on a Linux host, and I found that PREEMPTIVE_OS_FLUSHFILEBUFFERS shows up as the top wait on SQL Server 2017 CU6 and higher with default options.
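For context, the kind of ad-hoc check that surfaces this is just ordering sys.dm_os_wait_stats by total wait time; a minimal sketch:

-- Top waits by total wait time since the last restart or wait stats clear
SELECT TOP (10)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;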

The simple reason for this is that on some Linux servers, where the underlying I/O subsystem cannot guarantee durable writes when the power fails (i.e. it's using a volatile write cache), a power failure could lead to data loss within SQL Server. To work around this, SQL Server 2017 CU6 introduced forced-flush behavior for log and data writes, including things like backups and writes to FILESTREAM files. This means the writes are done as normal, and then a call is made to the Windows FlushFileBuffers function, which is translated via the SQLPAL/HostExtension wrapper on Linux into an fdatasync call. There's a very deep explanation of all the file system internals in Bob Dorr's excellent post SQL Server On Linux: Forced Unit Access (Fua) Internals.

If you know your I/O subsystem is resilient, you can disable this behavior, as described in KB 4131496.
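From memory, the change the KB describes is a startup trace flag plus a couple of mssql-conf settings; treat the sketch below as a reminder of the shape of it, and verify the exact settings against the KB for your build and I/O configuration before changing anything:

# Disable the forced-flush behavior (only if your I/O subsystem guarantees durable writes)
sudo /opt/mssql/bin/mssql-conf traceflag 3979 on
sudo /opt/mssql/bin/mssql-conf set control.writethrough 1
sudo /opt/mssql/bin/mssql-conf set control.alternatewritethrough 0
sudo systemctl restart mssql-server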

However, as this will be by far the highest wait on many Linux systems, and as it’s benign, I’ve added it to the list of waits I ignore and updated the Server Reports extension for Azure Data Studio to filter it out as well (this is now live as v1.4).
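If you roll your own wait stats queries rather than using Server Reports, the same idea is just a filter; a minimal sketch (extend the list with whatever other benign waits you already ignore):

-- Top waits, ignoring the benign wait discussed above
SELECT TOP (10)
    wait_type,
    waiting_tasks_count,
    wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'PREEMPTIVE_OS_FLUSHFILEBUFFERS')
ORDER BY wait_time_ms DESC;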

Enjoy!

PS I also just added the VDI_CLIENT_OTHER wait to the ignore list (it's common during Availability Group seeding), and submitted GitHub changes to Server Reports yesterday, which will be v1.5.