The Curious Case of… faster disks causing more WRITELOG waits

(The Curious Case of… used to be part of our bi-weekly newsletter but we decided to make it a regular blog post instead so it can sometimes be more frequent. It covers something interesting one of us encountered when working with a client, doing some testing, or were asked in a random question from the community.)

I was contacted last week by someone who was asking how are faster disks causing more WRITELOG waits. They were seeing lots of these waits, with an average wait time of 18ms. The log was stored on a Raid-1 array, using locally-attached spinning disks in the server. They figured that by moving the log to Raid-1 array of SSDs, they’d reduce the WRITELOG wait time and get better workload throughput.

They did so and got better performance, but were very surprised to now see WRITELOG as the most frequent wait type on the server, even though the average wait time was less than 1ms, and asked me to explain.

Let me start by saying this is not unusual, and I’ve seen this sequence of events many times with clients and students.

WRITELOG wait occurs when a log block is being flushed to disk, and the two most common causes of this are that the log block is full (60KB) or a transaction committed. If it’s the case that a workload is comprised of lots of very small transactions, then a reduction in wait time for the log block flush from 18ms down to less than 1ms is very likely to result in a higher throughput. This means more transactions happening and hence more log block flushes when the transactions commit, so *more* WRITELOG waits.

Bottom line: What was confusing to them was that performance tuning is usually about reducing the incidence of waits and/or the individual wait times. This is about the only example I can think of where you do something that increases performance and the wait type you were concerned about becomes more prevalent.

PREEMPTIVE_OS_FLUSHFILEBUFFERS waits on Linux

Last week I blogged about Using the Wait Statistics Report in Azure Data Studio, using a Linux VM in Azure as an example host. As that was my first time running wait stats analysis on a Linux host, I found that the PREEMPTIVE_OS_FLUSHFILEBUFFERS waits on Linux will show up as the top wait on SQL Server 2017 CU6 and higher with default options.

The simple reason for this is that on some Linux servers where the underling I/O subsystem cannot guarantee durable writes when the power fails (i.e. it’s using a volatile write cache), a power failure could lead to data loss within SQL Server. To work around this, 2017 CU6 introduced forced-flush behavior for log and data writing, including things like backups and writes to FILESTREAM files. This means that the writes will be done as normal, and then a call is made to the Windows FlushFileBuffers function, which is translated via the SQLPAL/HostExtension wrapper on Linux into an fdatasync call. There’s a very deep explanation of all the file system internals in Bob Dorr’s excellent post SQL Server On Linux: Forced Unit Access (Fua) Internals.

If you know your I/O subsystem is resilient, you can disable this behavior, as described in KB 4131496.

However, as this will be by far the highest wait on many Linux systems, and as it’s benign, I’ve added it to the list of waits I ignore and updated the Server Reports extension for Azure Data Studio to filter it out as well (this is now live as v1.4).

Enjoy!

PS I also just added the VDI_CLIENT_OTHER wait to the ignore list as well (common from Availability Group seeding), and submitted GitHub changes to Server Reports yesterday, which will be v1.5.

Using the Wait Statistics Report in Azure Data Studio

Azure Data Studio, which used to be called SQL Operations Studio, has a bunch of extensions available, including one called Server Reports from Microsoft. Last year they took my wait stats query (from here) and made it into a report as part of Server Reports extension. In this quick post I’ll show you how to install that extension and look at the report.

I’ve been using Azure Data Studio a lot as part of working with SQL Server 2017 on Linux in Azure and I like it a lot. It’s obviously not as comprehensive as SSMS, but for simple stuff it does the job.

You can see the pretty big list of available extensions by hitting the Extensions icon:

In the list that appears, find the Server Reports extension and click the Install button:

Once an extension is installed, you need to activate it. Without reading any documentation, the first time I used Azure Data Studio I couldn’t figure out why the extension wasn’t there once I’d installed it. Click the Reload button to active the extension:

The extension will now show up in the list of enabled extensions:

If the Server Dashboard screen isn’t showing, right-click on one of the servers you’re connected to and select Manage. All the active extensions will show as tabs at the top of the screen, so select Server Reports and then click the heart-shaped icon and you’ll see the current aggregate wait stats for the instance:

Note that the x-axis is percentage of all waits, not wait count. You’ll see that PREEMPTIVE_OS_FLUSHFILEBUFFERS is the top wait on my Linux instance – that’s by design and I’ll blog about that next. I’ve also submitted a GitHub change to add that wait to the list of waits filtered out by script the extension uses.

Anyway, you can drill in to the details by clicking the ellipsis at the top-right of the graph and selecting ‘Show Details’. That’ll give all the waits and by selecting each one you can see the usual output from my waits script. To get more information on what each wait means, select the bottom cell, right-click on the URL to copy it, and paste into your favorite browser to go to my waits library. And of course, you can refresh the results via the ellipsis as well.

There are many more useful extensions that you should check out too – enjoy!