Target and actual SQL Server uptime survey results

Exactly five years ago I published survey results showing target uptime SLAs and actual uptime measurements. I re-ran the survey a few weeks ago to see what’s changed, if anything, in the space of five years, and here are the results.

24×7 Systems

[Chart: 24×7 systems, target uptime survey results]

[Chart: 24×7 systems, actual uptime survey results]

Other responses:

  • 1 x 99.95%

Non-24×7 Systems

[Chart: non-24×7 systems, target uptime survey results]

Other responses:

  • 7 x “No target or target unknown”
  • 1 x “0830 – 1730 M-Sat”

[Chart: non-24×7 systems, actual uptime survey results]

Other values:

  • 1 x “n/a”

Summary

Well, the good thing is that this survey had almost twice the number of respondents as the 2009 survey, but that could just be that a lot more people read my blog now than five years ago.

My takeaway from the data is that nothing has really changed over the last five years. Given the really low response rate to this survey (I usually get 200-300 or more responses for a typical survey), my inference is that the majority of you out there don’t have well-defined uptime targets (or recovery time objective service level agreements, RTO SLAs, or whatever you want to call them) and so didn’t respond. The same thing happens when surveying something like backup testing frequency: you *know* you’re supposed to do it, but you don’t do it enough, so you feel guilty and don’t respond to the survey.

For those of you who responded, or didn’t respond but do have targets, well done! For those of you who don’t have targets, I don’t blame you, I blame the environment you’re in. From talking to a bunch of you, most DBAs I know who *want* to do something about HA/DR are prevented from doing so by management not placing enough importance on the subject. This is also shown by the demand for our various in-person training classes: IE2 on Performance Tuning is usually over-subscribed even though it runs 3-4 times per year, but IE3 on HA/DR has only sold out once even though we generally run it only once per year.

Performance is the number one thing on the collective minds of most I.T. management, not HA/DR planning, and that’s just wrong. Business continuity is so crucial, especially in this day and age of close competition where being down can cause fickle customers to move to a different store/service provider.

If you’re reading this and you know you don’t have well-defined uptime targets then I strongly encourage you to raise the issue with your management, as it’s likely that your entire HA/DR strategy is lacking too. For more information, you can read the results post from the survey five years ago (Importance of defining and measuring SLAs).

Don’t wait until disaster strikes to make HA/DR a priority.

Most common wait stats over 24 hours and changes since 2010

Back in February I kicked off a survey asking you to run code that created a 24-hour snapshot of the most prevalent wait statistics. It’s taken me a few months to provide detailed feedback to everyone who responded and to correlate all the information together. Thanks to everyone who responded!

I did this survey because I wanted to see how the results had changed since my initial wait statistics survey back in 2010.

The results are interesting!

2010 Survey Results

Results from 1823 servers, top wait type since server last restarted (or waits cleared). The blog post for this survey (Wait statistics, or please tell me where it hurts) has a ton of information about what these common wait types mean, and I’m not going to repeat all that in this blog post.

[Chart: 2010 survey, most prevalent wait types]

2014 Survey Results

Results from 1708 servers, top wait type over 24 hours.

[Chart: 2014 survey, most prevalent wait types]

The distribution of the top waits has changed significantly over the last four years, even when taking into account that in the 2010 survey I didn’t filter out BROKER_RECEIVE_WAITFOR.

  • CXPACKET is still the top wait type, which is unsurprising
  • OLEDB has increased to being the top wait type roughly 17% of the time compared to roughly 4% in 2010
  • WRITELOG has increased to being the top wait 10% of the time compared with 6% in 2010
  • ASYNC_NETWORK_IO has decreased to being the top wait 8% of the time compared with 15% in 2010
  • PAGEIOLATCH_XX has decreased to being the top wait 7% of the time compared with 18% in 2010

These percentages remain the same even when I ignore the BROKER_RECEIVE_WAITFOR waits in the 2010 results.

Now I’m going to speculate about what could have caused the change in results. I have no evidence supporting most of what I’m saying below, just gut feel and supposition, so you might disagree. Also, even though the people reading my blog and responding to my surveys are likely paying more attention to performance and performance tuning than the general population of people managing SQL Server instances, I think these results are representative of what’s happening on SQL Server instances across the world.

I think that OLEDB waits have increased in general due to more and more people using 3rd-party performance monitoring tools that make extensive, repeated use of DMVs. Most DMVs are implemented as OLE-DB rowsets and will cause many tiny OLEDB waits (1-2 milliseconds on average, or smaller). This hypothesis is actually borne out by the data I received and confirmation from many people who received my detailed analyses of results they sent me. If you see hundreds of millions or billions of tiny OLEDB waits, this is likely the cause.
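
If you want a quick check for this pattern on your own servers, the aggregate OLEDB numbers in sys.dm_os_wait_stats are enough. Here’s a minimal sketch (times are in milliseconds):

-- Check the count and average duration of OLEDB waits since the last
-- restart (or since the wait statistics were last cleared).
SELECT
    [wait_type],
    [waiting_tasks_count],
    [wait_time_ms],
    [wait_time_ms] * 1.0 / NULLIF ([waiting_tasks_count], 0) AS [avg_wait_ms]
FROM sys.dm_os_wait_stats
WHERE [wait_type] = N'OLEDB';
GO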

I think WRITELOG waits being the top wait have increased partly because other bottlenecks have become less prevalent, and so the next highest bottleneck is the transaction log, and partly because more workloads are hitting logging bottlenecks inside SQL Server that are alleviated starting in SQL Server 2012 (blog post coming next week!). I also think that WRITELOG waits have been prevented from becoming even more prevalent because of the increased use of solid-state disks for transaction log storage mitigating the increased logging from higher workloads.

Now it could be that the drop in PAGEIOLATCH_XX and ASYNC_NETWORK_IO waits being the top wait is just an effect caused by the increase in OLEDB and WRITELOG waits. It could also be because of environmental changes…

PAGEIOLATCH_XX waits being the top wait might have decreased because of:

  • Increased memory on servers meaning that buffer pools are larger and more of the workload fits in memory, so fewer read I/Os are necessary.
  • Increased usage of solid-state disks meaning that individual I/Os are faster, so when I/Os do occur, the PAGEIOLATCH_XX wait time is smaller and so the aggregate wait time is smaller and it is no longer the top wait.
  • More attention being paid to indexing strategies and buffer pool usage.

ASYNC_NETWORK_IO waits being the top wait might have decreased because of fewer poorly written applications, or fixes to applications that previously were poorly written. This supposition is the most tenuous of the four and I really have no evidence for this at all. I suspect it’s more likely the change is an effect of the changes in prevalence of the other wait types discussed above.

Summary

I think it’s interesting how the distribution of top waits has shifted over the last four years and I hope my speculation above rings true with many of you. I’d love to hear your thoughts on all of this in the post comments.

It’s not necessarily bad to have any particular wait type as the most prevalent one in your environment, as waits always happen, so there has to be *something* that’s the top wait on your system. What’s useful though is to trend your wait statistics over time and notice how code/workload/server/schema changes are reflected in the distribution of wait statistics.
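
If you’re not already trending your wait statistics, a simple approach is to snapshot sys.dm_os_wait_stats into a table on a schedule and compare successive snapshots. Here’s a minimal sketch (the table name and schedule are just placeholders):

-- One-time setup: a table to hold periodic wait statistics snapshots.
CREATE TABLE [dbo].[WaitStatsSnapshots] (
    [SnapshotTime]        DATETIME2     NOT NULL DEFAULT SYSDATETIME (),
    [wait_type]           NVARCHAR (60) NOT NULL,
    [waiting_tasks_count] BIGINT        NOT NULL,
    [wait_time_ms]        BIGINT        NOT NULL,
    [signal_wait_time_ms] BIGINT        NOT NULL
);
GO

-- Run this on a schedule (e.g. from an Agent job) to capture a snapshot,
-- then diff successive snapshots to see how the distribution changes.
INSERT INTO [dbo].[WaitStatsSnapshots]
    ([wait_type], [waiting_tasks_count], [wait_time_ms], [signal_wait_time_ms])
SELECT [wait_type], [waiting_tasks_count], [wait_time_ms], [signal_wait_time_ms]
FROM sys.dm_os_wait_stats;
GO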

There is lots of information about wait statistics in my Wait Statistics blog category and there’s a new whitepaper (SQL Server Performance Tuning Using Wait Statistics: A Beginners Guide) on wait statistics written by Jonathan and Erin in conjunction with Red Gate which you can download from our website here.

Enjoy!

Survey: target uptime – planned and actual

It’s been five years(!) since the last time I asked about your target uptimes for your critical SQL Server instances and I think we’d all be interested to see how things have changed.

Edit 6/2/14: The survey is closed now – see here for the results.

So I present four surveys to you. For your most critical SQL Server instance:

  • If it’s a 24×7 system, what’s the target uptime?
  • If it’s a 24×7 system, what’s your measured uptime over the last year?
  • If it’s not a 24×7 system, what’s the target uptime?
  • If it’s not a 24×7 system, what’s your measured uptime over the last year?

You’ll notice that the surveys are termed in percentages. Here’s what the percentages mean for a 24×7 system:

  • 99.999% = 5.26 minutes of downtime per year
  • 99.99% = 52.56 minutes of downtime per year
  • 99.9% = 8.76 hours of downtime per year
  • 99.5% = 1.825 days of downtime per year
  • 99% = 3.65 days of downtime per year
  • 98.5% = 5.475 days of downtime per year
  • 98% = 7.3 days of downtime per year
  • 95% = 18.25 days of downtime per year
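
Those numbers are just the downtime percentage applied to a year. Here’s a quick sketch of the arithmetic, assuming a 365-day (525,600 minute) year:

-- Convert an uptime percentage into allowed downtime per year.
DECLARE @UptimePercent DECIMAL (7, 4) = 99.99;
SELECT
    @UptimePercent AS [UptimePercent],
    (100 - @UptimePercent) / 100 * 525600 AS [DowntimeMinutesPerYear],
    (100 - @UptimePercent) / 100 * 525600 / 60 AS [DowntimeHoursPerYear];
GO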

If your target uptime allows for planned maintenance downtime, then that doesn’t count as unplanned downtime, as long as your system was only down for the length of time allowed. But don’t cheat yourself by retroactively classifying unplanned downtime as planned just so it doesn’t affect your actual, measured uptime.

For instance, if you have a 99.9% uptime goal for a 24×7 system, with a quarterly 4-hour maintenance window, then I would select 99.9% in the 24×7 target survey. For that same system, if the downtime was limited to the prescribed 4-hour window each quarter, and there was no other downtime *at all*, I would select 99.999% on the 24×7 measured uptime survey.

Basic advice is to use common sense in how you answer. If you say you have a 24×7 system but you have a 12-hour maintenance window each week, I wouldn’t classify that as a 24×7 system.

24×7 Systems

Survey 1: 24×7 system target uptime


Survey 2: 24×7 system measured uptime
Please be honest. Remember if you choose 99.999% that means you’re saying your system was up for all but 5 minutes in the last year.


Non-24×7 Systems

Survey 3: Non-24×7 system target uptime
Use ‘Other’ to answer if your answer is ‘No target or target unknown’.


Survey 4: Non-24×7 system measured uptime
Please be honest.



I’ll editorialize the results in a week or two.

Thanks!

Causes of IO_COMPLETION and WRITE_COMPLETION SQL Server wait types

In many of the sets of wait statistics I’ve been analyzing, the IO_COMPLETION and WRITE_COMPLETION waits show up (but never as the most prevalent wait type).

The official definitions of these wait types are:

  • IO_COMPLETION: Occurs while waiting for I/O operations to complete. This wait type generally represents non-data page I/Os. Data page I/O completion waits appear as PAGEIOLATCH_* waits.
  • WRITE_COMPLETION: Occurs when a write operation is in progress.

I promised many of the people who sent me wait statistics recently that I would write a blog post giving more detailed information on when these wait types occur, so here it is.

I used the Extended Events code in my post How to determine what causes a particular wait type to watch for these wait types occurring and then ran a variety of operations and analyzed the call stacks. There are way too many occurrences to document them all here, so I’ll summarize my findings below.

Note that these lists are not exhaustive, but you get the idea of the kinds of operations where these wait types occur.

IO_COMPLETION

  • Reading log blocks from the transaction log (during any operation that causes the log to be read from disk – e.g. recovery)
  • Reading allocation bitmaps from disk (e.g. GAM, SGAM, PFS pages) during many operations (e.g. recovery, DB startup, restore)
  • Writing intermediate sort buffers to disk (these are called ‘Bobs’)
  • Reading and writing merge results from/to disk during a merge join
  • Reading and writing eager spools to disk
  • Reading VLF headers from the transaction log

WRITE_COMPLETION

  • Writing any page to a database snapshot (e.g. while running DBCC CHECK*, which is often the most common cause of this wait type)
  • Writing VLF headers while creating or growing a transaction log file
  • Writing a file’s header page to disk
  • Writing portions of the transaction log during database startup
  • Writing allocation pages to disk when creating or growing a data file

These aren’t waits that I’d generally be concerned about, and I’d expect the individual resource wait times to be in line with those of the read and write latencies of the instance.
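
As a sanity check, you can compare those resource wait times against the average read and write latencies per file from sys.dm_io_virtual_file_stats. A minimal sketch:

-- Average read and write latencies (in ms) per database file since startup.
SELECT
    DB_NAME ([vfs].[database_id]) AS [DatabaseName],
    [mf].[physical_name],
    [vfs].[io_stall_read_ms] * 1.0 / NULLIF ([vfs].[num_of_reads], 0) AS [AvgReadLatencyMs],
    [vfs].[io_stall_write_ms] * 1.0 / NULLIF ([vfs].[num_of_writes], 0) AS [AvgWriteLatencyMs]
FROM sys.dm_io_virtual_file_stats (NULL, NULL) AS [vfs]
INNER JOIN sys.master_files AS [mf]
    ON [vfs].[database_id] = [mf].[database_id]
    AND [vfs].[file_id] = [mf].[file_id];
GO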

Enjoy!

SQLskills community mentoring – round 6

We’ve had a mentoring program here at SQLskills for a few years (the brainchild of Jonathan), where each of us can pick someone who’s attended one of our Immersion Events in the past and offer to be their mentor for six months.

It’s time to kick off the sixth mentoring session. You can read about the previous mentees in the earlier announcement posts in this series.

I’ll be the only one mentoring this time. I continue to find it interesting helping my mentees with a variety of personal and professional goals (whatever they want help with) and getting to know them better. This time I’ll be mentoring Dainius Sutkevicius who’s taken IE1, IE2, and IE3 previously. A brief resume of Dainius in his own words:

Dainius Sutkevicius has been working as a Sr. SQL Server DBA for an international natural and organic foods retailer for the past several years.

He is a highly motivated, well-versed IT professional with over 15 years of experience in the logistics, telecommunications, trading, and retail industries. He possesses diverse skills and experience in troubleshooting, optimizing, maintaining, and developing secure, highly available, high-performance database solutions.

When he is not tinkering with technology he enjoys spending time with his family, digital imaging, javelin throwing and reading about quantum physics, cosmology and philosophy.

He blogs at http://www.SqlSorcerer.com and is on Twitter @SqlSorcerer.

Congratulations to Dainius – I’m looking forward to working with him over the next six months.

PS I’m often asked how to get into our mentoring program – it’s simple: attend one of our Immersion Events and get to know one of us, and then ask and see if there’s a match that works.

Shrinking tempdb no longer prohibited

For the longest time the guidance around tempdb has been that if you shrink it on a live system then it could cause tempdb corruption.

A few months ago I was discussing this with my good friend Bob Ward from Product Support and neither of us could remember the last time we’d seen a case of tempdb corruption that had been caused by shrinking. So we both did some investigations, including looking through the internal bug databases, to find the root cause of the long-running advice.

The bottom line is that tempdb corruption hasn’t been a problem with shrink since early builds of SQL Server 2000. There was also some extensive testing done to verify this.

As such, the KB article that discusses shrinking tempdb has been updated and I got notification last night from the author that it’s been published.

KB 307487 (How to shrink the tempdb database in SQL Server) now explains that even though you may see messages from shrink that look like corruption, they’re not.

Remember though, shrinking should be a rare operation, whether data or log file shrinking – and never a regular operation.
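
If you do need to do a one-off shrink of tempdb, it’s a regular DBCC SHRINKFILE. A minimal sketch (‘tempdev’ and ‘templog’ are the default logical file names, so adjust to match yours, and the target sizes are just examples):

USE [tempdb];
GO
-- Shrink the primary data file to 1 GB and the log file to 512 MB (sizes in MB).
DBCC SHRINKFILE (N'tempdev', 1024);
GO
DBCC SHRINKFILE (N'templog', 512);
GO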

Enjoy!

Nasty day-1 bug causing page checksums to miss corruptions on 2008 R2/2012/2014

Over the last few weeks I’ve been investigating a series of customer corruption issues (on one customer’s system) where it appears that the page checksum algorithm is missing a combination of DWORD bit flips, caused by another (Query Processor) bug that’s been introduced in recent builds of SQL Server 2008 R2, SQL Server 2012, and SQL Server 2014 (on 2014, most likely when using a readable AG secondary).

I’ve gone back-and-forth with my good friends in Product Support and the Product Group quite a bit and we’ve decided to publicize the issue before more people find they have unrecoverable corruption because of this. I’ve been told that this will be fixed ASAP for all three versions, but it isn’t going to delay the RTM and general release of 2014. I’ll blog details as soon as I know.

So what’s the bug?

Well, it involves the way the page checksum algorithm aggregates values. It turns out that a certain repeating bit pattern on adjacent words can cause the Fletcher’s Checksum algorithm (the one that’s used – it’s way cheaper than a full CRC-based checksum – see this Wikipedia link) to fail because of the order insensitivity of the algorithm (it’s explained in the Wikipedia link, under the section Weaknesses of Simple Checksums) and the fact that it can’t distinguish between a block of zeros and a block of ones (described in the main Weaknesses section). The SQL Server implementation was supposed to work around that problem (making it more like a traditional Adler’s Checksum, from which the Fletcher’s Checksum is derived), but obviously it doesn’t work entirely correctly.

The bug hasn’t come to light until now because there has never been a possibility of having these bit patterns adjacent in a data value that would have a page checksum on, or at least it’s been highly unlikely and we all just got lucky. Now that the main Query Processor bug exists, it’s able to persist these bad values because of the page checksum hole. It’s just a nasty combination.

To make matters worse, DBCC CHECKDB also has an issue where it fails to spot this corruption. DBCC CHECKDB uses the buffer pool to do all its I/O, and because the page checksum comes back clean (from the broken algorithm), DBCC CHECKDB does a reduced-scope page audit and so misses the new corruption problem. That issue has been there since my rewrite in SQL Server 2005.

So it’s really a trifecta of nasty bugs, only two of which I’m responsible for (the Query Processor bug is new). I can only apologize I guess.

And the final kicker is that because the page checksum algorithm is compromised, backup checksums also don’t find the problem so most people’s 2008 R2/2012/2014 backups are also corrupt. Even restoring the backup and running DBCC CHECKDB won’t find the problem because of the page checksum bug I describe above.

It’s a pretty sucky combination that’s going to cause a lot of hassle for people to figure out whether they’ve been affected or not. From what we can tell, many people will have undetectable corruption that will continue to get worse until it causes query failures or the Storage Engine sets the database offline. It seems to be more common with OLTP query patterns so far, and 7 of our long-term clients have discovered they have this ‘undetectable’ corruption after I checked them all last week.

How can you tell if you have the corruption?

It’s not a simple test to figure out whether you have the problem, but with some luck you may be able to spot it. I know how to find it using a hex editor to search through data files (which I’ll explain below) and I’ll write some code that uses DBCC PAGE … WITH TABLERESULTS and some post-processing once I have a 100% accurate test and then I’ll blog the code.

In the meantime, here’s an example using XVI32 (note that XVI32 will truncate data files larger than 2GB so you’ll need to use my other favorite freeware hex editor HxD).

You need to set the database you suspect (I’d test all of them to be sure, including master, but you don’t need to check msdb) to be offline and then open it with your hex editor. Then you need to search for the bit pattern of all ones followed by all zeroes, for 32-bits each.

Here’s an example from a 2012 database:

[Image: ‘fool’, hex editor screenshot of a 2012 database]

Summary

You really think there would be day-1 bugs in DBCC CHECKDB and the page checksum algorithm? Oh ye of little faith!

Have a great day :-)

A cause of high-duration ASYNC_IO_COMPLETION waits

In some of the wait statistics data I’ve been analyzing, some servers have been showing very long ASYNC_IO_COMPLETION waits, which I had a hunch about but wanted proof.

The official definition of ASYNC_IO_COMPLETION is ‘Occurs when a task is waiting for I/Os to finish.’ Very helpful – NOT!

Using the code I blogged yesterday (How to determine what causes a particular wait type), I set up an XEvent session to track all the ASYNC_IO_COMPLETION waits and their durations, changing to a ring_buffer target so I get the wait durations too:

CREATE EVENT SESSION [InvestigateWaits] ON SERVER
ADD EVENT [sqlos].[wait_info]
(
    ACTION ([package0].[callstack])
    WHERE [wait_type] = 98 -- ASYNC_IO_COMPLETION only, same on all versions
    AND [opcode] = 1 -- Just the end wait events
)
ADD TARGET [package0].[ring_buffer]
WITH
(
    MAX_MEMORY = 50 MB,
    MAX_DISPATCH_LATENCY = 5 SECONDS
)
GO
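
To start the session and later pull the raw event data (including the durations) out of the ring_buffer target, something like this works; a minimal sketch:

-- Start the session.
ALTER EVENT SESSION [InvestigateWaits] ON SERVER
STATE = START;
GO

-- Later, grab the raw XML from the ring_buffer target for inspection.
SELECT
    CAST ([xst].[target_data] AS XML) AS [RawEventData]
FROM sys.dm_xe_session_targets [xst]
INNER JOIN sys.dm_xe_sessions [xs]
    ON [xst].[event_session_address] = [xs].[address]
WHERE [xs].[name] = N'InvestigateWaits'
    AND [xst].[target_name] = N'ring_buffer';
GO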

And then I ran some commands. Although ASYNC_IO_COMPLETION pops up in various locations, the long-running wait happens during a data backup. In fact, three ASYNC_IO_COMPLETION waits occurred during each full backup I ran:

.
<snip>
.
    <data name="duration">
      <type name="uint64" package="package0" />
      <value>7</value>
    </data>
.
<snip>
.
XeSosPkg::wait_info::Publish+138 [ @ 0+0x0
SOS_Scheduler::UpdateWaitTimeStats+30c [ @ 0+0x0
SOS_Task::PostWait+90 [ @ 0+0x0
EventInternal<SuspendQueueSLock>::Wait+1f9 [ @ 0+0x0
AsynchronousDiskPool::WaitUntilDoneOrTimeout+fb [ @ 0+0x0
CheckpointDB2+2cd [ @ 0+0x0
BackupDatabaseOperation::PerformDataCopySteps+c2 [ @ 0+0x0
BackupEntry::BackupDatabase+2ca [ @ 0+0x0
CStmtDumpDb::XretExecute+ef [ @ 0+0x0
CMsqlExecContext::ExecuteStmts<1,1>+400 [ @ 0+0x0
CMsqlExecContext::FExecute+a33 [ @ 0+0x0
CSQLSource::Execute+866 [ @ 0+0x0
process_request+73c [ @ 0+0x0
process_commands+51c [ @ 0+0x0
SOS_Task::Param::Execute+21e [ @ 0+0x0
SOS_Scheduler::RunTask+a8 [ @ 0+0x0
SOS_Scheduler::ProcessTasks+29a [ @ 0+0x0
SchedulerManager::WorkerEntryPoint+261 [ @ 0+0x0
SystemThread::RunWorker+8f [ @ 0+0x0
SystemThreadDispatcher::ProcessWorker+3c8 [ @ 0+0x0
SchedulerManager::ThreadEntryPoint+236 [ @ 0+0x0
BaseThreadInitThunk+d [ @ 0+0x0
RtlUserThreadStart+21 [ @ 0+0x0</value>
.
<snip>
.
    <data name="duration">
      <type name="uint64" package="package0" />
      <value>0</value>
    </data>
.
<snip>
.
SOS_Scheduler::UpdateWaitTimeStats+30c [ @ 0+0x0
SOS_Task::PostWait+90 [ @ 0+0x0
EventInternal<SuspendQueueSLock>::Wait+1f9 [ @ 0+0x0
AsynchronousDiskPool::WaitUntilDoneOrTimeout+fb [ @ 0+0x0
BackupOperation::GenerateExtentMaps+34c [ @ 0+0x0
BackupDatabaseOperation::PerformDataCopySteps+179 [ @ 0+0x0
BackupEntry::BackupDatabase+2ca [ @ 0+0x0
CStmtDumpDb::XretExecute+ef [ @ 0+0x0
CMsqlExecContext::ExecuteStmts<1,1>+400 [ @ 0+0x0
CMsqlExecContext::FExecute+a33 [ @ 0+0x0
CSQLSource::Execute+866 [ @ 0+0x0
process_request+73c [ @ 0+0x0
process_commands+51c [ @ 0+0x0
SOS_Task::Param::Execute+21e [ @ 0+0x0
SOS_Scheduler::RunTask+a8 [ @ 0+0x0
SOS_Scheduler::ProcessTasks+29a [ @ 0+0x0
SchedulerManager::WorkerEntryPoint+261 [ @ 0+0x0
SystemThread::RunWorker+8f [ @ 0+0x0
SystemThreadDispatcher::ProcessWorker+3c8 [ @ 0+0x0
SchedulerManager::ThreadEntryPoint+236 [ @ 0+0x0
BaseThreadInitThunk+d [ @ 0+0x0
RtlUserThreadStart+21 [ @ 0+0x0</value>
.
<snip>
.
    <data name="duration">
      <type name="uint64" package="package0" />
      <value>1958</value>
    </data>
.
<snip>
.
XeSosPkg::wait_info::Publish+138 [ @ 0+0x0
SOS_Scheduler::UpdateWaitTimeStats+30c [ @ 0+0x0
SOS_Task::PostWait+90 [ @ 0+0x0
EventInternal<SuspendQueueSLock>::Wait+1f9 [ @ 0+0x0
AsynchronousDiskPool::WaitUntilDoneOrTimeout+fb [ @ 0+0x0
BackupOperation::BackupData+272 [ @ 0+0x0
BackupEntry::BackupDatabase+2ca [ @ 0+0x0
CStmtDumpDb::XretExecute+ef [ @ 0+0x0
CMsqlExecContext::ExecuteStmts<1,1>+400 [ @ 0+0x0
CMsqlExecContext::FExecute+a33 [ @ 0+0x0
CSQLSource::Execute+866 [ @ 0+0x0
process_request+73c [ @ 0+0x0
process_commands+51c [ @ 0+0x0
SOS_Task::Param::Execute+21e [ @ 0+0x0
SOS_Scheduler::RunTask+a8 [ @ 0+0x0
SOS_Scheduler::ProcessTasks+29a [ @ 0+0x0
SchedulerManager::WorkerEntryPoint+261 [ @ 0+0x0
SystemThread::RunWorker+8f [ @ 0+0x0
SystemThreadDispatcher::ProcessWorker+3c8 [ @ 0+0x0
SchedulerManager::ThreadEntryPoint+236 [ @ 0+0x0
BaseThreadInitThunk+d [ @ 0+0x0
RtlUserThreadStart+21 [ @ 0+0x0

One wait for each of the following:

  1. The start of the initial checkpoint at the start of the backup
  2. Reading the GAM pages from the data files to figure out what to back up
  3. Reading the actual data from the data files (the writes to the backup files are tracked by BACKUPIO waits)

The third wait is for almost the full duration of the backup. As there are three of these per data backup, the per wait average will be a bit skewed.

The aggregate waits for the single backup operation are as follows, showing the BACKUPIO and BACKUPBUFFER waits each adding up to almost the total backup run time of about 2 seconds:

WaitType                  Wait_S   Resource_S   Signal_S   WaitCount   Percentage   AvgWait_S   AvgRes_S   AvgSig_S
BACKUPBUFFER              2.69     2.68         0.01       505         37.23        0.0053      0.0053     0.0000
ASYNC_IO_COMPLETION       2.33     2.33         0.00       3           32.29        0.7773      0.7773     0.0000
BACKUPIO                  1.96     1.96         0.00       171         27.08        0.0114      0.0114     0.0000
PREEMPTIVE_OS_WRITEFILE   0.20     0.20         0.00       2           2.78         0.1005      0.1005     0.0000
BACKUPTHREAD              0.04     0.04         0.00       15          0.50         0.0024      0.0024     0.0000
PREEMPTIVE_OS_FILEOPS     0.00     0.00         0.00       5           0.06         0.0008      0.0008     0.0000
WRITELOG                  0.00     0.00         0.00       11          0.06         0.0004      0.0004     0.0000

So there you have it. XEvents and a little internals knowledge let us understand what could otherwise be seen as a worryingly high-duration wait type. Long-duration ASYNC_IO_COMPLETION waits are commonly from data backups (they don’t happen for regular log backups).

Enjoy!

How to determine what causes a particular wait type

Wait statistics, as you know, are one of my favorite things to do with SQL Server, along with corruption, the transaction log, and Kimberly (but not necessarily in that order :-)

One of the things that really frustrates me about wait statistics is that there is hardly any documentation about what the various wait types actually mean. For example, the Microsoft documentation for the WRITE_COMPLETION wait is ‘Occurs when a write operation is in progress.’ That’s not very illuminating. What kind of writes? Under what circumstances?

There is a relatively easy way to figure out when particular wait types are occurring: use Extended Events to track the SQL Server code call stacks when the wait occurs. Also, this information and methodology are needed for quite a few of the people I’m working with to analyze their 24 hours of wait statistics (see this post).

Here’s what you need to do:

  1. Download the symbol files for the SQL Server instance you’re interested in
  2. Create an Extended Event session to track the wait type you’re interested in
  3. Run your workload
  4. Examine the call stacks that you’ve collected

And in this post I’ll show you how to do these steps.

Symbol files and call stacks

Whenever an executable is compiled, you can optionally have the compiler generate symbols that can help with debugging. The symbols effectively correlate offsets in the compiled code with the human-readable names of code functions, class methods, and variables. This allows us to look at what’s called a call stack and figure out what SQL Server is doing.

As an example of a call stack, consider the following C++ pseudo-code:

bool Thing::ValidateThing (ThingPtr thing)
{
    // blah
    if (thing->m_Eggs != GREEN && thing->m_side != HAM)
    {
        __ThrowError (THING_NOT_VALID);
    }
    return TRUE;
}

void CountManager::AddNextThingToTotal (ThingPtr thing)
{
    // blah
    if (TRUE == thing->ValidateThing (thing))
    {
        m_ThingCount++;
    }
}

int CountManager::CountListOfThings (ThingListPtr things)
{
    // blah: for each thing in the list
    AddNextThingToTotal (thing);
    // blah
    return m_ThingCount;
}

And if we wanted to see all the call stacks that end up with an error being thrown, one such call stack might look like this:

__ThrowError
Thing::ValidateThing+0x26
CountManager::AddNextThingToTotal+0x441
CountManager::CountListOfThings+0x104

It lets us see at what point in the executable something happens, and we can make sense of the call stack if the words in there make sense in our context. You might be concerned that you don’t know the internals of SQL Server, but most of the time the names of the classes and methods have enough information for you to be able to work out what’s happening.

We need the SQL Server symbol files for this to work. You can get them freely from Microsoft and I have a blog post with instructions to do it: How to download a sqlserver.pdb symbol file. If you have trouble with this, let me know as it can be tricky.

Make sure your call stacks look correct – see the example in the ‘how to’ post.

Extended Event Session

The Extended Event session to use is pretty simple. It uses the histogram target (called the ‘asynchronous bucketizer’ in earlier versions) to capture unique call stacks when the wait type we’re interested in occurs.

We do have to make sure that we use the correct wait type value in the session though, as some of the wait type values changed in Extended Events from version to version. The code to get the correct value to use is as follows, using the WRITE_COMPLETION wait type as the example:

-- Figure out the wait type that we need
SELECT
    [map_key]
FROM sys.dm_xe_map_values
WHERE [name] = N'wait_types'
    AND [map_value] = N'WRITE_COMPLETION';
GO

One problem you may have is that some wait types aren’t listed by the name that shows up in sys.dm_os_wait_stats. Jonathan has a handy blog post that does the mapping for these – see here if you run the query above and don’t get a result.

On 2012 I get the following result:

map_key
-----------
628

(Just FYI – the value is 577 on SQL Server 2008 R2!)

We then plug that value into an Extended Event session that will give us all the call stacks and how many times each was hit:

-- Drop the session if it exists.
IF EXISTS (
    SELECT * FROM sys.server_event_sessions
        WHERE [name] = N'InvestigateWaits')
    DROP EVENT SESSION [InvestigateWaits] ON SERVER
GO

-- Create the event session
-- Note that before SQL 2012, the wait_type to use may be
-- a different value.
-- On SQL 2012 the target name is 'histogram' but the old
-- name still works.
CREATE EVENT SESSION [InvestigateWaits] ON SERVER
ADD EVENT [sqlos].[wait_info]
(
    ACTION ([package0].[callstack])
    WHERE [wait_type] = 628 -- WRITE_COMPLETION only
    AND [opcode] = 1 -- Just the end wait events
    --AND [duration] > X-milliseconds
)
ADD TARGET [package0].[asynchronous_bucketizer]
(
    SET filtering_event_name = N'sqlos.wait_info',
    source_type = 1, -- source_type = 1 is an action
    source = N'package0.callstack' -- bucketize on the callstack
)
WITH
(
    MAX_MEMORY = 50 MB,
    MAX_DISPATCH_LATENCY = 5 SECONDS)
GO

-- Start the session
ALTER EVENT SESSION [InvestigateWaits] ON SERVER
STATE = START;
GO

-- TF to allow call stack resolution
DBCC TRACEON (3656, -1);
GO

If the trace flag isn’t enabled, the call stacks will not be resolved by SQL Server. The trace flag also pins dbghelp.dll in memory – don’t worry about it causing a perf issue.
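
When you’ve finished investigating, you can turn the trace flag off again:

-- Turn off call stack resolution when you're done.
DBCC TRACEOFF (3656, -1);
GO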

If you want to grab the call stacks and the wait duration for every wait that occurs (e.g. so you can identify the cause of long-duration waits), change the session to:

CREATE EVENT SESSION [InvestigateWaits] ON SERVER
ADD EVENT [sqlos].[wait_info]
(
    ACTION ([package0].[callstack])
    WHERE [wait_type] = 628 -- WRITE_COMPLETION only
    AND [opcode] = 1 -- Just the end wait events
    --AND [duration] > X-milliseconds
)
ADD TARGET [package0].[ring_buffer]
WITH
(
    MAX_MEMORY = 50 MB,
    MAX_DISPATCH_LATENCY = 5 SECONDS)
GO

Workload

Now that the Extended Event session exists, you need to run your workload. This may be just your regular workload, or a few example commands that you think may be involved.

As an example, for the WRITE_COMPLETION wait type, I’ll do something simple like creating a database.

Be careful: depending on the wait type you’re investigating (e.g. a locking wait, CXPACKET wait, or PAGEIOLATCH_XX wait), the Extended Event session may cause a small performance issue, so be prepared to stop the session. If you stop the session though, the information in the session’s histogram target disappears, so grab the data from it (see the Analysis section below) before stopping the session using the following code:

-- Stop the event session
ALTER EVENT SESSION [InvestigateWaits] ON SERVER
STATE = STOP;
GO

But the longer you can run the session for, the more likely you’ll get the causes of the wait type you’re interested in.

Analysis

To get the data out of the histogram target, use the following code:

-- Get the callstacks from the bucketizer target
SELECT
    [event_session_address],
    [target_name],
    [execution_count],
    CAST ([target_data] AS XML)
FROM sys.dm_xe_session_targets [xst]
INNER JOIN sys.dm_xe_sessions [xs]
    ON [xst].[event_session_address] = [xs].[address]
WHERE [xs].[name] = N'InvestigateWaits';
GO
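
If you’d rather not read the raw XML by hand, you can shred the histogram target’s slots into rows. Here’s a sketch, assuming the histogram target’s usual structure of Slot elements with a count attribute and a value child:

-- Shred the histogram target data into one row per unique call stack,
-- with the number of times each was hit.
SELECT
    [slots].[slot].value ('./@count', 'int') AS [HitCount],
    [slots].[slot].value ('(./value)[1]', 'nvarchar(max)') AS [CallStack]
FROM (
    SELECT CAST ([xst].[target_data] AS XML) AS [target_data]
    FROM sys.dm_xe_session_targets [xst]
    INNER JOIN sys.dm_xe_sessions [xs]
        ON [xst].[event_session_address] = [xs].[address]
    WHERE [xs].[name] = N'InvestigateWaits'
) AS [x]
CROSS APPLY [x].[target_data].nodes ('//HistogramTarget/Slot') AS [slots] ([slot])
ORDER BY [HitCount] DESC;
GO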

In the example I’ve created, where I’m looking at WRITE_COMPLETION waits occurring, I get back a bunch of call stacks. Here are the first few call stacks collected on SQL Server 2012 SP1:


    XeSosPkg::wait_info::Publish+138 [ @ 0+0x0
SOS_Scheduler::UpdateWaitTimeStats+30c [ @ 0+0x0
SOS_Task::PostWait+90 [ @ 0+0x0
EventInternal::Wait+1f9 [ @ 0+0x0
FCB::SyncWrite+104 [ @ 0+0x0
DBMgr::CopyModel+fe [ @ 0+0x0
DBMgr::CreateAndFormatFiles+966 [ @ 0+0x0
CStmtCreateDB::CreateLocalDatabaseFragment+682 [ @ 0+0x0
DBDDLAgent::CreateDatabase+f7 [ @ 0+0x0
CStmtCreateDB::XretExecute+8fc [ @ 0+0x0
CMsqlExecContext::ExecuteStmts<1,1>+400 [ @ 0+0x0
CMsqlExecContext::FExecute+a33 [ @ 0+0x0
CSQLSource::Execute+866 [ @ 0+0x0
process_request+73c [ @ 0+0x0
process_commands+51c [ @ 0+0x0
SOS_Task::Param::Execute+21e [ @ 0+0x0
SOS_Scheduler::RunTask+a8 [ @ 0+0x0
SOS_Scheduler::ProcessTasks+29a [ @ 0+0x0
SchedulerManager::WorkerEntryPoint+261 [ @ 0+0x0
SystemThread::RunWorker+8f [ @ 0+0x0
SystemThreadDispatcher::ProcessWorker+3c8 [ @ 0+0x0
SchedulerManager::ThreadEntryPoint+236 [ @ 0+0x0
BaseThreadInitThunk+d [ @ 0+0x0
RtlUserThreadStart+21 [ @ 0+0x0

    XeSosPkg::wait_info::Publish+138 [ @ 0+0x0
SOS_Scheduler::UpdateWaitTimeStats+30c [ @ 0+0x0
SOS_Task::PostWait+90 [ @ 0+0x0
EventInternal::Wait+1f9 [ @ 0+0x0
FCB::SyncWrite+104 [ @ 0+0x0
FCB::PageWriteInternal+55 [ @ 0+0x0
InitGAMIntervalPages+4cb [ @ 0+0x0
InitDBAllocPages+d0 [ @ 0+0x0
FileMgr::CreateNewFile+137 [ @ 0+0x0
AsynchronousDiskAction::ExecuteDeferredAction+8f [ @ 0+0x0
AsynchronousDiskWorker::ThreadRoutine+15c [ @ 0+0x0
SubprocEntrypoint+a21 [ @ 0+0x0
SOS_Task::Param::Execute+21e [ @ 0+0x0
SOS_Scheduler::RunTask+a8 [ @ 0+0x0
SOS_Scheduler::ProcessTasks+29a [ @ 0+0x0
SchedulerManager::WorkerEntryPoint+261 [ @ 0+0x0
SystemThread::RunWorker+8f [ @ 0+0x0
SystemThreadDispatcher::ProcessWorker+3c8 [ @ 0+0x0
SchedulerManager::ThreadEntryPoint+236 [ @ 0+0x0
BaseThreadInitThunk+d [ @ 0+0x0
RtlUserThreadStart+21 [ @ 0+0x0

    XeSosPkg::wait_info::Publish+138 [ @ 0+0x0
SOS_Scheduler::UpdateWaitTimeStats+30c [ @ 0+0x0
SOS_Task::PostWait+90 [ @ 0+0x0
EventInternal::Wait+1f9 [ @ 0+0x0
FCB::SyncWrite+104 [ @ 0+0x0
FCB::PageWriteInternal+55 [ @ 0+0x0
DirectlyMarkPFSPage+154 [ @ 0+0x0
InitGAMIntervalPages+52a [ @ 0+0x0
InitDBAllocPages+d0 [ @ 0+0x0
FileMgr::CreateNewFile+137 [ @ 0+0x0
AsynchronousDiskAction::ExecuteDeferredAction+8f [ @ 0+0x0
AsynchronousDiskWorker::ThreadRoutine+15c [ @ 0+0x0
SubprocEntrypoint+a21 [ @ 0+0x0
SOS_Task::Param::Execute+21e [ @ 0+0x0
SOS_Scheduler::RunTask+a8 [ @ 0+0x0
SOS_Scheduler::ProcessTasks+29a [ @ 0+0x0
SchedulerManager::WorkerEntryPoint+261 [ @ 0+0x0
SystemThread::RunWorker+8f [ @ 0+0x0
SystemThreadDispatcher::ProcessWorker+3c8 [ @ 0+0x0
SchedulerManager::ThreadEntryPoint+236 [ @ 0+0x0
BaseThreadInitThunk+d [ @ 0+0x0
RtlUserThreadStart+21 [ @ 0+0x0

    XeSosPkg::wait_info::Publish+138 [ @ 0+0x0
SOS_Scheduler::UpdateWaitTimeStats+30c [ @ 0+0x0
SOS_Task::PostWait+90 [ @ 0+0x0
EventInternal::Wait+1f9 [ @ 0+0x0
FCB::SyncWrite+104 [ @ 0+0x0
SQLServerLogMgr::FormatVirtualLogFile+175 [ @ 0+0x0
SQLServerLogMgr::FormatLogFile+c3 [ @ 0+0x0
FileMgr::CreateNewFile+106 [ @ 0+0x0
AsynchronousDiskAction::ExecuteDeferredAction+8f [ @ 0+0x0
AsynchronousDiskWorker::ThreadRoutine+15c [ @ 0+0x0
SubprocEntrypoint+a21 [ @ 0+0x0
SOS_Task::Param::Execute+21e [ @ 0+0x0
SOS_Scheduler::RunTask+a8 [ @ 0+0x0
SOS_Scheduler::ProcessTasks+29a [ @ 0+0x0
SchedulerManager::WorkerEntryPoint+261 [ @ 0+0x0
SystemThread::RunWorker+8f [ @ 0+0x0
SystemThreadDispatcher::ProcessWorker+3c8 [ @ 0+0x0
SchedulerManager::ThreadEntryPoint+236 [ @ 0+0x0
BaseThreadInitThunk+d [ @ 0+0x0
RtlUserThreadStart+21 [ @ 0+0x0

    XeSosPkg::wait_info::Publish+138 [ @ 0+0x0
SOS_Scheduler::UpdateWaitTimeStats+30c [ @ 0+0x0
SOS_Task::PostWait+90 [ @ 0+0x0
EventInternal::Wait+1f9 [ @ 0+0x0
FCB::SyncWrite+104 [ @ 0+0x0
FCB::PageWriteInternal+55 [ @ 0+0x0
GlobalFileHeader::CreateInitialPage+395 [ @ 0+0x0
GlobalFileHeader::WriteInitialPage+50 [ @ 0+0x0
FCB::InitHeaderPage+25c [ @ 0+0x0
FileMgr::CreateNewFile+144 [ @ 0+0x0
AsynchronousDiskAction::ExecuteDeferredAction+8f [ @ 0+0x0
AsynchronousDiskWorker::ThreadRoutine+15c [ @ 0+0x0
SubprocEntrypoint+a21 [ @ 0+0x0
SOS_Task::Param::Execute+21e [ @ 0+0x0
SOS_Scheduler::RunTask+a8 [ @ 0+0x0
SOS_Scheduler::ProcessTasks+29a [ @ 0+0x0
SchedulerManager::WorkerEntryPoint+261 [ @ 0+0x0
SystemThread::RunWorker+8f [ @ 0+0x0
SystemThreadDispatcher::ProcessWorker+3c8 [ @ 0+0x0
SchedulerManager::ThreadEntryPoint+236 [ @ 0+0x0
BaseThreadInitThunk+d [ @ 0+0x0
RtlUserThreadStart+21 [ @ 0+0x0

.
.
.

In this example, we can see from the collected call stacks that a WRITE_COMPLETION wait occurs during the following operations (and there are many more, of course):

  • Copying the pages from the model database into our new database (call stack 1)
  • Creating and formatting the GAM, SGAM, DIFF_MAP, and ML_MAP allocation bitmaps (call stack 2)
  • Creating and formatting the PFS allocation byte-maps (call stack 3)
  • Creating and formatting the transaction log VLF headers (call stack 4)
  • Creating and formatting the data and log file header pages (call stack 5)

How cool is that? :-)

Summary

Now you have a method to investigate any wait type that you’re seeing in your workload. I’ll be posting a bunch of information about wait types and when they occur through the year.

If you have call stacks that you’d like help figuring out, feel free to send me an email and I’ll respond within a week or so.

Enjoy!

What is the most worrying cause of log growth (log_reuse_wait_desc)?

Two weeks ago I kicked off a survey that presented a scenario and asked you to vote for the log_reuse_wait_desc value you’d be most worried to see on a critical database with a 24×7 workload.

Here are the results:

[Chart: survey results for the most worrying log_reuse_wait_desc value]

Another very interesting spread of responses – as always, thanks to everyone who took the time to respond.

Remember that you have no knowledge about anything on the system except what the log_reuse_wait_desc is for the database with the constantly growing log.

Before I start discussing them, I’m going to explain the term log truncation that I’ll be using. This is one of the two terms used (the other is log clearing) to describe the mechanism that marks currently-active VLFs (virtual log files) in the log as inactive so they can be reused. When there are inactive VLFs, the log can wrap around (see this post for more details) and doesn’t have to grow. When there are no available VLFs because log truncation isn’t possible, the log has to grow. When log truncation can’t make any VLFs inactive, it records the reason why, and that’s what the log_reuse_wait_desc value in sys.databases gives us. You can read more about how the log works in this TechNet Magazine article I wrote back in 2009, and get all the log info you could ever want in my logging and recovery Pluralsight course.

We also need to understand how the log_reuse_wait_desc reporting mechanism works. It gives the reason why log truncation couldn’t happen the last time log truncation was attempted. This can be confusing – for instance if you see ACTIVE_BACKUP_OR_RESTORE and you know there isn’t a backup or restore operation running, this just means that there was one running the last time log truncation was attempted. You can also see some weird effects – for instance if you do a log backup, and then see LOG_BACKUP immediately as the log_reuse_wait_desc value. I blogged about the reason for that phenomenon here.
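
For reference, checking the current value for all databases is trivial:

-- Check the current log_reuse_wait_desc for every database.
SELECT
    [name],
    [recovery_model_desc],
    [log_reuse_wait_desc]
FROM sys.databases;
GO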

Let’s quickly consider what each of the log_reuse_wait_desc values in the list above means (and this isn’t an exhaustive analysis of each one):

  • NOTHING: Just as it looks, this value means that SQL Server thinks there is no problem with log truncation. In our case though, the log is clearly growing, so how could we see NOTHING? Well, for this we have to understand how the log_reuse_wait_desc reporting works. The value is reporting what stopped log truncation last time it was attempted. If the value is NOTHING, it means that at least one VLF was marked inactive the last time log truncation occurred. We could have a situation where the log has a huge number of VLFs, and there are a large number of active transactions, with each one having its LOP_BEGIN_XACT log record in successive VLFs. If each time log truncation happens, a single transaction has committed, and it only manages to clear one VLF, it could be that the speed of log truncation is just waaaay slower than the speed of log record generation, and so the log is having to grow to accommodate it. I’d use my script here to see how many active transactions there are, monitor VLFs with DBCC LOGINFO, and also track the Log Truncations and Log Growths counters for the database in the Databases perfmon object. I need to see if I can engineer this case and repeatably see NOTHING.
  • CHECKPOINT: This value means that a checkpoint hasn’t occurred since the last time log truncation occurred. In the simple recovery model, log truncation only happens when a checkpoint completes, so you wouldn’t normally see this value. When it can happen is if a checkpoint is taking a long time to complete and the log has to grow while the checkpoint is still running. I’ve seen this on one client system with a very poorly performing I/O subsystem and a very large buffer pool with a lot of dirty pages that needed to be flushed when the checkpoint occurred.
  • LOG_BACKUP: This is one of the most common values to see, and says that you’re in the full or bulk_logged recovery model and a log backup hasn’t occurred. In those recovery models, it’s a log backup that performs log truncation. Simple stuff. I’d check to see why log backups are not being performed (disabled Agent job or changed Agent job schedule? backup failure messages in the error log?)
  • ACTIVE_BACKUP_OR_RESTORE: This means that there’s a data backup running or any kind of restore running. The log can’t be truncated during a restore, and is required for data backups so can’t be truncated there either.
  • ACTIVE_TRANSACTION: This means that there is a long-running transaction that is holding all the VLFs active. The way log truncation works is that it goes to the next VLF (#Y) after the last one (#X) made inactive the last time log truncation ran, and looks at that. If VLF #Y can’t be made inactive, then log truncation fails and the log_reuse_wait_desc value is recorded. If a long-running transaction has its LOP_BEGIN_XACT log record in VLF #Y, then no other VLFs can be made inactive either, even if all the other VLFs after VLF #Y have nothing to do with our long-running transaction; there’s no selectively marking VLFs active vs. inactive. You can use this script to see all the active transactions, or see the quick sketch after this list.
  • DATABASE_MIRRORING: This means that the database mirroring partnership has some latency in it and there are log records on the mirroring principal that haven’t yet been sent to the mirroring mirror (called the send queue). This can happen if the mirror is configured for asynchronous operation, where transactions can commit on the principal before their log records have been sent to the mirror. It can also happen in synchronous mode, if the mirror becomes disconnected or the mirroring session is suspended. The amount of log in the send queue can be equated to the expected amount of data (or work) loss in the event of a crash of the principal.
  • REPLICATION: This value shows up when there are committed transactions that haven’t yet been scanned by the transaction replication Log Reader Agent job for the purpose of sending them to the replication distributor or harvesting them for Change Data Capture (which uses the same Agent job as transaction replication). The job could have been disabled, could be broken, or could have had its SQL Agent schedule changed.
  • DATABASE_SNAPSHOT_CREATION: When a database snapshot is created (either manually or automatically by DBCC CHECKDB and other commands), the database snapshot is made transactionally consistent by using the database’s log to perform crash recovery into the database snapshot. The log obviously can’t be truncated while this is happening and this value will be the result. See this blog post for a bit more info.
  • LOG_SCAN: This value shows up if a long-running call to fn_dblog (see here) is under way when the log truncation is attempted, or when the log is being scanned during a checkpoint.
  • AVAILABILITY_REPLICA: This is the same thing as DATABASE_MIRRORING, but for an availability group (2012 onward) instead of database mirroring.
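
As a quick alternative to the script referenced in the ACTIVE_TRANSACTION item above, DBCC OPENTRAN will show you the oldest active transaction in a database (a minimal sketch; unlike the script, it only reports the oldest one):

-- Report the oldest active transaction (if any) in the current database,
-- including its session ID and start time.
USE [YourDatabase]; -- hypothetical database name
GO
DBCC OPENTRAN;
GO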

Ok – so maybe not such a quick consideration of the various values :-) Books Online has basic information about these values here.

So which one would *I* be the most worried to see for a 24 x 7, critical database?

Let’s think about some of the worst case scenarios for each of the values:

  • NOTHING: It could be the scenario I described in the list above, but that would entail a workload change having happened for me to be surprised by it. Otherwise it could be a SQL Server bug, which is unlikely.
  • CHECKPOINT: For a critical, 24×7 database, I’m likely using the full recovery model, so it’s unlikely to be this one unless someone’s switched to simple without me knowing…
  • LOG_BACKUP: This would mean something had happened to the log backup job so it either isn’t running or it’s failing. Worst case here would be data loss if a catastrophic failure occurred, plus the next successful log backup is likely to be very large.
  • ACTIVE_BACKUP_OR_RESTORE: As the log is growing, if this value shows up then it must be a long-running data backup. Log backups can run concurrently, so I’m not worried about data loss if a problem occurs.
  • ACTIVE_TRANSACTION: Worst case here is that the transaction needs to be killed and then will take a long time to roll back, producing a lot more transaction log before the log stops growing.
  • DATABASE_MIRRORING: Worst case here is that a crash occurs and data/work loss occurs because of the log send queue on the principal, but only if I don’t have log backups that I can restore from. So maybe we’re looking at a trade-off between some potential data loss and some potential downtime (to restore from backups).
  • REPLICATION: The worst case here is that replication’s got itself badly messed up for some reason and has to be fully removed from the database with sp_removedbreplication, and then reconfigured again.
  • DATABASE_SNAPSHOT_CREATION: The worst case here is that there are some very long running transactions that are still being crash recovered into the database snapshot and that won’t finish for a while. It’s not possible to interrupt database snapshot creation, but it won’t affect anything in the source database apart from log truncation.
  • LOG_SCAN: This is very unlikely to be a problem.
  • AVAILABILITY_REPLICA: Same as for database mirroring, but we may have another replica that’s up-to-date (the availability group send queue could be for one of the asynchronous replicas that is running slowly, and we have a synchronous replica that’s up-to-date).

The most worrying value is going to depend on what is likely to be the biggest problem for my environment: potential data loss, reinitialization of replication, very large log backup, or lots *more* log growth from waiting for a long-running transaction to commit or roll back.

I think I’d be most worried to see ACTIVE_TRANSACTION, as it’s likely that the offending transaction will have to be killed which might generate a bunch more log, cause more log growth, and more log to be backed up, mirroring, replicated, and so on. This could also lead to disaster recovery problems too – if a long-running transaction exists and a failover occurs, the long-running transaction has to be rolled back before the database is fully available (even with fast recovery in Enterprise Edition).

So there you have it. 37% of you agreed with me. I discussed this with Glenn just after I posted it and we agreed on ACTIVE_TRANSACTION as the most worrying type (so I didn’t just pick this because the largest proportion of respondents did).

I’m interested to hear your thoughts on my pick and the scenario, but please don’t rant about how I’m wrong or the scenario is bogus, as there is no ‘right’ answer, just opinions based on experience.

And that’s the trick when it comes to performance troubleshooting – although it’s an art and a science, so much of how you approach it is down to your experience.

Happy troubleshooting!