Most common wait stats over 24 hours and changes since 2010

[Edit 2016: Check out my new resource – a comprehensive library of all wait types and latch classes – see here.]

Back in February I kicked off a survey asking you to run code that created a 24-hour snapshot of the most prevalent wait statistics. It’s taken me a few months to provide detailed feedback to everyone who responded and to correlate all the information. Thanks to everyone who responded!
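
(If you’re curious about the general approach, here’s a minimal sketch of how a 24-hour snapshot like that can be captured – it’s not the actual survey code, and the table and column names below are my own.)

-- Capture sys.dm_os_wait_stats into a table, once now and once 24 hours later,
-- then diff the two captures to get the waits accumulated over that period.
CREATE TABLE dbo.WaitStatsSnapshot (
    capture_time   DATETIME2     NOT NULL DEFAULT SYSDATETIME (),
    wait_type      NVARCHAR (60) NOT NULL,
    waiting_tasks  BIGINT        NOT NULL,
    wait_time_ms   BIGINT        NOT NULL,
    signal_wait_ms BIGINT        NOT NULL);
GO

-- Run this once now and again 24 hours later
INSERT INTO dbo.WaitStatsSnapshot (wait_type, waiting_tasks, wait_time_ms, signal_wait_ms)
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats;
GO

-- Assuming exactly two captures, the per-wait deltas over the 24 hours are:
SELECT s2.wait_type,
    s2.wait_time_ms  - s1.wait_time_ms  AS wait_time_ms_delta,
    s2.waiting_tasks - s1.waiting_tasks AS waiting_tasks_delta
FROM dbo.WaitStatsSnapshot AS s1
JOIN dbo.WaitStatsSnapshot AS s2
    ON s2.wait_type = s1.wait_type AND s2.capture_time > s1.capture_time
ORDER BY wait_time_ms_delta DESC;
GO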

I did this survey because I wanted to see how the results had changed since my initial wait statistics survey back in 2010.

The results are interesting!

2010 Survey Results

Results from 1823 servers, top wait type since server last restarted (or waits cleared). The blog post for this survey (Wait statistics, or please tell me where it hurts) has a ton of information about what these common wait types mean, and I’m not going to repeat all that in this blog post.

2014 Survey Results

Results from 1708 servers, top wait type over 24 hours

[Chart: 2014 survey results – distribution of top wait types]

The distribution of the top waits has changed significantly over the last four years, even when taking into account that in the 2010 survey I didn’t filter out BROKER_RECEIVE_WAITFOR.

  • CXPACKET is still the top wait type, which is unsurprising
  • OLEDB has increased to being the top wait type roughly 17% of the time compared to roughly 4% in 2010
  • WRITELOG has increased to being the top wait 10% of the time compared with 6% in 2010
  • ASYNC_NETWORK_IO has decreased to being the top wait 8% of the time compared with 15% in 2010
  • PAGEIOLATCH_XX has decreased to being the top wait 7% of the time compared with 18% in 2010

These percentages remain the same even when I ignore the BROKER_RECEIVE_WAITFOR waits in the 2010 results.

Now I’m going to speculate as to what could have caused the change in results. I have no evidence that supports most of what I’m saying below, just gut feel and supposition – you might disagree. Also, even though the people reading my blog and responding to my surveys are likely paying more attention to performance and performance tuning than the general population of people managing SQL Server instances, I think these results are representative of what’s happening on SQL Server instances across the world.

I think that OLEDB waits have increased in general due to more and more people using 3rd-party performance monitoring tools that make extensive, repeated use of DMVs. Most DMVs are implemented as OLE-DB rowsets and will cause many tiny OLEDB waits (1-2 milliseconds on average, or smaller). This hypothesis is actually borne out by the data I received and confirmation from many people who received my detailed analyses of results they sent me. If you see hundreds of millions or billions of tiny OLEDB waits, this is likely the cause.
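
If you want to check for this pattern on your own servers, something like the following (my query, not part of the survey code) shows whether OLEDB waits fit that profile: a huge waiting_tasks_count with an average wait of only a millisecond or two.

SELECT wait_type,
    waiting_tasks_count,
    wait_time_ms,
    wait_time_ms * 1.0 / NULLIF (waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'OLEDB';
GO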

I think WRITELOG waits being the top wait have increased partly because other bottlenecks have become less prevalent, and so the next highest bottleneck is the transaction log, and partly because more workloads are hitting logging bottlenecks inside SQL Server that are alleviated starting in SQL Server 2012 (blog post coming next week!). I also think that WRITELOG waits have been prevented from becoming even more prevalent because of the increased use of solid-state disks for transaction log storage mitigating the increased logging from higher workloads.
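
If WRITELOG is high on your list, a quick sanity check is to look at the write latency of the log files themselves. Here’s a simple sketch (my own, with no claim to being the definitive query) using sys.dm_io_virtual_file_stats:

-- Average log write latency per database; high values point at log I/O as the bottleneck
SELECT DB_NAME (vfs.database_id) AS database_name,
    vfs.num_of_writes,
    vfs.io_stall_write_ms,
    vfs.io_stall_write_ms * 1.0 / NULLIF (vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats (NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
WHERE mf.type_desc = N'LOG'
ORDER BY avg_write_latency_ms DESC;
GO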

Now it could be that the drop in PAGEIOLATCH_XX and ASYNC_NETWORK_IO waits being the top wait is just an effect caused by the increase in OLEDB and WRITELOG waits. It could also be because of environmental changes…

PAGEIOLATCH_XX waits being the top wait might have decreased because of:

  • Increased memory on servers meaning that buffer pools are larger and more of the workload fits in memory, so fewer read I/Os are necessary (a quick check of this is sketched after this list).
  • Increased usage of solid-state disks meaning that individual I/Os are faster, so when I/Os do occur, the PAGEIOLATCH_XX wait times are shorter, the aggregate wait time is smaller, and it’s no longer the top wait.
  • More attention being paid to indexing strategies and buffer pool usage.
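
As a rough check of that first point, the following (my own sketch) shows how much of each database is currently cached in the buffer pool:

-- Each row in sys.dm_os_buffer_descriptors is one 8KB page in the buffer pool
-- (database_id 32767 is the resource database, for which DB_NAME returns NULL)
SELECT DB_NAME (database_id) AS database_name,
    COUNT (*) * 8 / 1024 AS cached_mb
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY cached_mb DESC;
GO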

ASYNC_NETWORK_IO waits being the top wait might have decreased because of fewer poorly written applications, or fixes to applications that previously were poorly written. This supposition is the most tenuous of the four and I really have no evidence for this at all. I suspect it’s more likely the change is an effect of the changes in prevalence of the other wait types discussed above.

Summary

I think it’s interesting how the distribution of top waits has changed over the last four years, and I hope my speculation above rings true with many of you. I’d love to hear your thoughts on all of this in the post comments.

It’s not necessarily bad to have any particular wait type as the most prevalent one in your environment; waits always happen, so there has to be *something* that’s the top wait on your system. What’s useful though is to trend your wait statistics over time and notice how code/workload/server/schema changes are reflected in the distribution of wait statistics.
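
One simple trick that helps with trending (my suggestion, not a requirement): clear the cumulative counters right after a significant code or configuration change, so the next snapshot only reflects the new behavior. Note that this resets the counters for the whole instance.

DBCC SQLPERF (N'sys.dm_os_wait_stats', CLEAR);
GO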

There is lots of information about wait statistics in my Wait Statistics blog category, and there’s a new whitepaper on wait statistics (SQL Server Performance Tuning Using Wait Statistics: A Beginner’s Guide) written by Jonathan and Erin in conjunction with Red Gate, which you can download from our website here.

Enjoy!

Survey: target uptime – planned and actual

It’s been five years(!) since the last time I asked about your target uptimes for your critical SQL Server instances and I think we’d all be interested to see how things have changed.

Edit 6/2/14: The survey is closed now – see here for the results.

So I present four surveys to you. For your most critical SQL Server instance:

  • If it’s a 24×7 system, what’s the target uptime?
  • If it’s a 24×7 system, what’s your measured uptime over the last year?
  • If it’s not a 24×7 system, what’s the target uptime?
  • If it’s not a 24×7 system, what’s your measured uptime over the last year?

You’ll notice that the surveys are termed in percentages. Here’s what the percentages mean for a 24×7 system:

  • 99.999% = 5.26 minutes of downtime per year
  • 99.99% = 52.56 minutes of downtime per year
  • 99.9% = 8.76 hours of downtime per year
  • 99.5% = 1.825 days of downtime per year
  • 99% = 3.65 days of downtime per year
  • 98.5% = 5.475 days of downtime per year
  • 98% = 7.3 days of downtime per year
  • 95% = 18.25 days of downtime per year
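
The arithmetic behind the list is simply (1 – uptime fraction) multiplied by the hours in a year (using a 365-day year). For example, for 99.9%:

-- (1 - 0.999) * 8760 hours = 8.76 hours of allowed downtime per year
SELECT (1 - 0.999) * 365 * 24 AS allowed_downtime_hours_per_year;
GO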

If your target uptime allows for planned maintenance downtime, then that doesn’t count as unplanned downtime, as long as your system was only down for the length of time allowed. But don’t cheat yourself by retroactively classifying unplanned downtime as planned just so it doesn’t affect your actual, measured uptime.

For instance, if you have a 99.9% uptime goal for a 24×7 system, with a quarterly 4-hour maintenance window, then I would select 99.9% in the 24×7 target survey. For that same system, if the downtime was limited to the prescribed 4-hour window each quarter, and there was no other downtime *at all*, I would select 99.999% on the 24×7 measured uptime survey.

Basic advice is to use common sense in how you answer. If you say you have a 24×7 system but you have a 12-hour maintenance window each week, I wouldn’t classify that as a 24×7 system.

24×7 Systems

Survey 1: 24×7 system target uptime


Survey 2: 24×7 system measured uptime
Please be honest. Remember if you choose 99.999% that means you’re saying your system was up for all but 5 minutes in the last year.


Non-24×7 Systems

Survey 3: Non-24×7 system target uptime
Use ‘Other’ if your answer is ‘No target or target unknown’.


Survey 4: Non-24×7 system measured uptime
Please be honest.



I’ll editorialize the results in a week or two.

Thanks!

Causes of IO_COMPLETION and WRITE_COMPLETION SQL Server wait types

(Check out my Pluralsight online training course: SQL Server: Performance Troubleshooting Using Wait Statistics and my comprehensive library of all wait types and latch classes.)

In many of the sets of wait statistics I’ve been analyzing, the IO_COMPLETION and WRITE_COMPLETION waits show up (but never as the most prevalent wait type).

The official definitions of these wait types are:

  • IO_COMPLETION: Occurs while waiting for I/O operations to complete. This wait type generally represents non-data page I/Os. Data page I/O completion waits appear as PAGEIOLATCH_* waits.
  • WRITE_COMPLETION: Occurs when a write operation is in progress.

I promised many of the people who sent me wait statistics recently that I would write a blog post giving more detailed information on when these wait types occur, so here it is.

I used the Extended Events code in my post How to determine what causes a particular wait type to watch for these wait types occurring and then ran a variety of operations and analyzed the call stacks. There are way too many occurrences to document them all here, so I’ll summarize my findings below.
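
For reference, here’s the general shape of such an event session – a sketch only, not the exact code from that post, and using the SQL Server 2012+ event_file target. The numeric wait_type value is a map_key that changes between SQL Server versions (look it up in sys.dm_xe_map_values where name = N'wait_types'), and the file path is a placeholder.

CREATE EVENT SESSION [InvestigateWaits] ON SERVER
ADD EVENT sqlos.wait_info (
    ACTION (package0.callstack)
    WHERE wait_type = 98) -- placeholder map_key for the wait type being investigated
ADD TARGET package0.event_file (SET filename = N'C:\Temp\InvestigateWaits.xel')
WITH (MAX_DISPATCH_LATENCY = 1 SECONDS);
GO

ALTER EVENT SESSION [InvestigateWaits] ON SERVER STATE = START;
GO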

Note that these lists are not exhaustive, but they give you an idea of the kinds of operations where these wait types occur.

IO_COMPLETION

  • Reading log blocks from the transaction log (during any operation that causes the log to be read from disk – e.g. recovery)
  • Reading allocation bitmaps from disk (e.g. GAM, SGAM, PFS pages) during many operations (e.g. recovery, DB startup, restore)
  • Writing intermediate sort buffers to disk (these are called ‘Bobs’)
  • Reading and writing merge results from/to disk during a merge join
  • Reading and writing eager spools to disk
  • Reading VLF headers from the transaction log

WRITE_COMPLETION

  • Writing any page to a database snapshot (e.g. while running DBCC CHECK*, which is often the most common cause of this wait type)
  • Writing VLF headers while creating or growing a transaction log file
  • Writing a file’s header page to disk
  • Writing portions of the transaction log during database startup
  • Writing allocation pages to disk when creating or growing a data file

These aren’t waits that I’d generally be concerned about, and I’d expect the individual resource wait times to be in line with the read and write latencies of the I/O subsystem for the instance.

Enjoy!