[Edit 2016: Check out my new resource – a comprehensive library of all wait types and latch classes – see here.]
Two weeks ago I kicked off a survey that presented a scenario and asked you to vote for the wait type you’d be most worried to see after a code roll-out to production.
Here are the results:
A very interesting spread of responses – thank you to all who thought about it and responded.
Remember, you have no other information apart from what the top wait type is over the week since the new code was rolled out. No knowledge of the I/O subsystem, memory, workload, code, or anything else. And that makes it a bit hard to pick what the most worrying wait type would be.
Let’s quickly consider what each of the wait types listed above could mean (and this is by no means an exhaustive examination of each wait type):
- ASYNC_NETWORK_IO: the classic cause of this wait type is RBAR (Row-By-Agonizing-Row) processing of results in a client, instead of caching the results client-side and telling SQL Server to send more. A common misconception is that this wait type is usually caused by network problems – that’s rarely the case in my experience.
- CXPACKET: this wait type always accrues when parallelism happens, as the control thread in a parallel operation waits until all threads have completed. However, when parallel threads are given unbalanced amounts of work to do, the threads that finish early also accrue this wait type, leading to it maybe becoming the most prevalent. So this one could be benign, as the workload has lots of good parallelism, but could be malignant if there’s unwanted parallelism or problems causing skewed distribution of work among parallel threads.
- LCK_M_IX: this wait type occurs when a thread is waiting for a table or page IX lock so that a row insert or update can occur. It could be from lock escalation to a table X or S lock causing all other threads to wait to be able to insert/update.
- LCK_M_X: this wait type commonly occurs when lock escalation is happening. It could also be caused by using a restrictive isolation level like REPEATABLE_READ or SERIALIZABLE that requires S and IS locks to be held until the end of a transaction. Note that distributed transactions change the isolation level to SERIALIZABLE under the covers – something that’s bitten several of our clients before we helped them. Someone could also have inhibited row locks on a clustered index causing all inserts to acquire page X locks – this is very uncommon though.
- PAGEIOLATCH_SH: this wait type occurs when a thread is waiting for a data file page to be read into memory. Common causes of this wait being the most prevalent are when the workload doesn’t fit in memory and the buffer pool has to keep evicting pages and reading others in from disk, or when query plans are using table scans instead of index seeks, or when the buffer pool is under memory pressure which reduces the amount of space available for data.
- PAGELATCH_EX: the two classic causes of this wait type are tempdb allocation bitmap contention (from lots of concurrent threads creating and dropping temp tables combined with a small number of tempdb files and not having TF1118 enabled) and an insert hotspot (from lots of concurrent threads inserting small rows into a clustered index with an identity value, leading to contention on the index leaf-level pages). There are plenty of other causes of this wait type too, but none that would commonly lead to it being the leading wait type over the course of a week.
- SOS_SCHEDULER_YIELD: the most common cause of this wait type is that the workload is memory resident and there is no contention for resources, so threads are able to repeatedly exhaust their scheduling quanta (4ms), registering SOS_SCHEDULER_YIELD when they voluntarily yield the processor. An example would be scanning through a large number of pages in an index. This may or may not be a good thing.
- WRITELOG: this wait type is common to see in the first few top waits on servers as the transaction log is often one of the chief bottlenecks on a busy server. This could be caused by the I/O subsystem not being able to keep up with the rate of log flushing combined with lots of tiny transactions forcing frequent flushes of minimal-sized log blocks.
So which one would *I* be most worried about seeing after a week of running the new code? The biggest problem is that we don’t know what the pattern of wait statistics was before the code roll-out, so we don’t know what’s normal for this production system, and it’s always better to have a baseline (or multiple snapshots) of wait statistics to compare with. We also don’t know what *other* wait types were the next few most prevalent types.
I’d be unsurprised to see ASYNC_NETWORK_IO, CXPACKET, PAGEIOLATCH_SH, or WRITELOG as the top wait, and I’d be intrigued to see SOS_SCHEDULER_YIELD as the top wait as that’s very uncommon. None of these would be the most worrying IMHO. PAGELATCH_EX would mean there’s a big bottleneck somewhere, which may be easy to alleviate (if in tempdb) or not, but queries would still be progressing.
And that leaves the two locking waits. When I designed the survey and scenario, I had it in mind that LCK_M_IX would be the most worrying as it would likely be from a bunch of threads waiting for other long-running, many-row updates to complete under an escalated table X lock. Now, thinking more about LCK_M_X, there would have to be a lot of blocked lock escalations happening to cause this – but in the meantime other queries are progressing.
I sat back for ten minutes after writing the last paragraph and thought about it some more, and came to a conclusion: LCK_M_IX is what I’d be most worried about. Most of the other waits are likely to be very small in duration, and although they’d make the overall system be slower, the locking waits can easily cause individual queries to time out and fail, leading to a very poor user experience. Although the fix for the locking waits could be as simple as enabling read-committed snapshot isolation (and paying attention to tempdb configuration and performance), the overall user experience is my primary concern, hence my biggest worry is the wait that could have the biggest effect on someone’s use of the system.
Now remember, this thought experiment was conducted with a very contrived scenario, and my reasoning is based on that. And it’s just my reasoning. More than 90% of you thought differently about your answer, which doesn’t mean either of us is wrong, just that we analyzed the scenario in a different way. A lot of it comes down to deciding what it is that makes you worried: user impact, overall workload performance, cost/complexity to alleviate, or something else. I picked user impact.
Some of you may cynically think I picked LCK_M_IX as it’s the answer that had the fewest responses, but if you ask Erin or Jon, they’ll tell you that I ran the survey past them (at lunch during our IE2 class two weeks ago) and told them my answer would be LCK_M_IX.
I’m interested to hear your thoughts on my pick and the scenario, but please don’t rant about how I’m wrong or the scenario is bogus, as there is no ‘right’ answer, just opinions based on experience.
And that’s the trick when it comes to performance troubleshooting – although it’s an art and a science, so much of how you approach it is down to your experience.
PS Don’t forget to run the code from my previous post and get some free insight from me into your production wait statistics.