How many times have you walked up to a SQL Server that has a performance problem and wondered where to start looking?
One of the most under-utilized performance troubleshooting methodologies in the SQL Server world is one called “waits and queues” (also known simply as “wait stats”). The basic premise is that SQL Server is permanently tracking why execution threads have to wait. You can ask SQL Server for this information and then use the results to narrow down where to start digging to unearth the cause of performance issues. The “waits” are what SQL Server tracks. The “queues” are the resources that the threads are waiting for. There are a myriad of waits in the system and they all indicate different resources being waited for. For example, a PAGEIOLATCH_EX wait means a thread is waiting for a data page to be read into the buffer pool from disk. A LCK_M_X wait means a thread is waiting to be granted an exclusive lock on something.
The great thing about all of this is the SQL Server *knows* where the performance issues are, and you just need to ask it…. and then interpret what it tells you, which can be a little tricky.
Now – where people sometimes get hung up is trying to track down every last wait and figure out what’s causing it. Waits *always* occur. It’s the way SQL Server’s scheduling system works.
A thread is using the CPU (called RUNNING) until it needs to wait for a resource. It then moves to an unordered list of threads that are SUSPENDED. In the meantime, the next thread on the FIFO (first-in-first-out) queue of threads waiting for the CPU (called being RUNNABLE) is given the CPU and becomes RUNNING. If a thread on the SUSPENDED list is notified that it’s resource is available, it becomes RUNNABLE and is put on the bottom of the RUNNABLE queue. Threads continue this clockwise movement from RUNNING to SUSPENDED to RUNNABLE to RUNNING again until the task is completed. You can see processes in these states using the sys.dm_exec_requests DMV.
SQL Server keeps track of the time that elapses between leaving the RUNNING state and becoming RUNNING again (called the “wait time”) and the time spent on the RUNNABLE queue (called the “signal wait time” – i.e. how long does the thread need to wait for the CPU after being signaled that its resource is available). We need to work out the time spent waiting on the SUSPENDED list (called the “resource wait time”) by subtracting the signal wait time from the overall wait time.
A great resource is the 2005 whitepaper on ‘waits and queues’ which I encourage you to read: Performance Tuning Using Waits and Queues. Books Online has a section on the sys.dm_os_wait_stats DMV that gives info on some of the newer wait types, and PSS is (very slowly) putting together a repository of info on all the wait types but not much progress has been made. The most in-depth coverage of understanding and using wait statistics is in my Pluralsight online training course, and I spend several hours on it in our IE2 class on Performance Tuning.
You can ask SQL Server for the cumulative wait statistics using the sys.dm_os_wait_stats DMV, and many people prefer to wrap the DMV call in some aggregation code. I use code based on a query that I got from fellow-MVP Glenn Berry (blog|twitter) and then modified quite a bit. See below for the version updated to take account of the results discussed below:
WITH [Waits] AS
(SELECT
[wait_type],
[wait_time_ms] / 1000.0 AS [WaitS],
([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS],
[signal_wait_time_ms] / 1000.0 AS [SignalS],
[waiting_tasks_count] AS [WaitCount],
100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage],
ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum]
FROM sys.dm_os_wait_stats
WHERE [wait_type] NOT IN (
N'CLR_SEMAPHORE', N'LAZYWRITER_SLEEP',
N'RESOURCE_QUEUE', N'SQLTRACE_BUFFER_FLUSH',
N'SLEEP_TASK', N'SLEEP_SYSTEMTASK',
N'WAITFOR', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION',
N'CHECKPOINT_QUEUE', N'REQUEST_FOR_DEADLOCK_SEARCH',
N'XE_TIMER_EVENT', N'XE_DISPATCHER_JOIN',
N'LOGMGR_QUEUE', N'FT_IFTS_SCHEDULER_IDLE_WAIT',
N'BROKER_TASK_STOP', N'CLR_MANUAL_EVENT',
N'CLR_AUTO_EVENT', N'DISPATCHER_QUEUE_SEMAPHORE',
N'TRACEWRITE', N'XE_DISPATCHER_WAIT',
N'BROKER_TO_FLUSH', N'BROKER_EVENTHANDLER',
N'FT_IFTSHC_MUTEX', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
N'DIRTY_PAGE_POLL', N'SP_SERVER_DIAGNOSTICS_SLEEP')
)
SELECT
[W1].[wait_type] AS [WaitType],
CAST ([W1].[WaitS] AS DECIMAL(14, 2)) AS [Wait_S],
CAST ([W1].[ResourceS] AS DECIMAL(14, 2)) AS [Resource_S],
CAST ([W1].[SignalS] AS DECIMAL(14, 2)) AS [Signal_S],
[W1].[WaitCount] AS [WaitCount],
CAST ([W1].[Percentage] AS DECIMAL(4, 2)) AS [Percentage],
CAST (([W1].[WaitS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgWait_S],
CAST (([W1].[ResourceS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgRes_S],
CAST (([W1].[SignalS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgSig_S]
FROM [Waits] AS [W1]
INNER JOIN [Waits] AS [W2]
ON [W2].[RowNum] <= [W1].[RowNum]
GROUP BY [W1].[RowNum], [W1].[wait_type], [W1].[WaitS],
[W1].[ResourceS], [W1].[SignalS], [W1].[WaitCount], [W1].[Percentage]
HAVING SUM ([W2].[Percentage]) - [W1].[Percentage] < 95; -- percentage threshold
GO
This will show the waits grouped together as a percentage of all waits on the system, in decreasing order. The waits to be concerned about (potentially) are those at the top of the list as this represents the majority of where SQL Server is spending it’s time waiting. You can see that a bunch of waits are being filtered out of consideration – as I said above, waits happen all the time and these are the benign ones we can usually ignore.
You can also reset the aggregated statistics using this code:
DBCC SQLPERF (N'sys.dm_os_wait_stats', CLEAR); GO
And of course you can very easily come up with a way to persist the results every few hours or every day and do some time-series analysis to figure out trends or automatically spot problems as they start to happen. You can also use Performance Dashboard to see these graphically in 2005 and Data Collector in 2008. On SQL Server 2000 you can use DBCC SQLPERF (N’waitstats’).
Once you get the results, you then start figuring out how to interpret them and where to go looking. The whitepaper I referenced above has a ton of good info on most of the wait types (except those added in 2008). There are various ways you can dig in deeper to this information that I’ll go into in later posts.
I’m going to start blogging about wait stats analysis, either as standalone posts or as part of other things – and I’ve already done so in the last post (at time of writing this) in my benchmarking series.
For now, I want to report on the results of the wait stats survey I posted a couple of months back. I asked people to run the code above and let me know the results. I received results for a whopping 1823 SQL Servers out there – thank you!
Here’s a graphical view of the results:

I’m not surprised at all by the top four results as I see these over and over on client systems.
For the remainder of this post, I’m going to list all the top wait types reported by survey respondents, in descending order, and give a few words about what they might mean if they are the most prevalent wait on your system. The list format shows the number of systems with that wait type as the most prevalent, and then the wait type.
- 505: CXPACKET
- This indicates parallelism, not necessarily that there’s a problem. The coordinator thread in a parallel query always accumulates these waits. If the parallel threads are not given equal amounts of work to do, or one thread blocks, the waiting threads will also accumulate CXPACKET waits, which will make them aggregate a lot faster – this is a problem. One thread may have a lot more to do than the others, and so the whole query is blocked while the long-running thread completes. If this is combined with a high number of PAGEIOLATCH_XX waits, it could be large parallel table scans going on because of incorrect non-clustered indexes, or a bad query plan. If neither of these are the issue, you might want to try setting MAXDOP to 4, 2, or 1 for the offending queries (or possibly the whole instance). Make sure that if you have a NUMA system that you try setting MAXDOP to the number of cores in a single NUMA node first to see if that helps the problem. You also need to consider the MAXDOP effect on a mixed-load system. To be honest, I’d play with the cost threshold for parallelism setting before reducing the MAXDOP of the whole instance. And don’t forget Resource Governor in SQL Server 2008 onwards that allows DOP governing for a particular group of connections to the server.
- 304: PAGEIOLATCH_XX
- This is where SQL Server is waiting for a data page to be read from disk into memory. It commonly indicates a bottleneck at the I?O subsystem level, but why is the I/O subsystem havign to service so many reads? It could be buffer pool/memory pressure (i.e. not enough memory for the workload), a sudden change in query plans, plan cache bloat or a number of other things. Don’t assume the root cause is the I/O subsystem.
- 275: ASYNC_NETWORK_IO
- This is commonly where SQL Server is waiting for a client to finish consuming data. It could be that the client has asked for a very large amount of data or just that it’s consuming it reeeeeally slowly because of poor programming – I rarely see this being a network issue.
- 112: WRITELOG
- This is the log management system waiting for a log flush to disk. It commonly indicates a problem with the I/O subsystem where the log is, but on very high-volume systems it could also be caused by internal log flush limits, that may mean you have to split your workload over multiple databases or even make your transactions a little longer to reduce log flushes. To be sure it’s the I/O subsystem, use the DMV sys.dm_io_virtual_file_stats to examine the I/O latency for the log file.
- 109: BROKER_RECEIVE_WAITFOR
- This is just Service Broker waiting around for new messages to receive. I would add this to the list of waits to filter out and re-run the wait stats query.
- 086: MSQL_XP
- This is SQL Server waiting for an extended stored-proc to finish. This could indicate a problem in your XP code.
- 074: OLEDB
- As its name suggests, this is a wait for something communicating using OLEDB – e.g. a linked server. However, OLEDB is also used by all DMVs and by DBCC CHECKDB, so don’t assume linked servers are the problem – it could be a third-party monitoring tool making excessive DMV calls.
- 054: BACKUPIO
- This shows up when you’re backing up directly to tape, which is slooooow. I’d be tempted to filter this out.
- 041: LCK_M_XX
- This is simply the thread waiting for a lock to be granted and indicates blocking problems. These could be caused by unwanted lock escalation or bad programming, but could also be from I/Os taking a long time causing locks to be held for longer than usual. Look at the resource associated with the lock using the DMV sys.dm_os_waiting_tasks. Don’t assume that locking is the root cause.
- 032: ONDEMAND_TASK_QUEUE
- This is normal and is part of the background task system (e.g. deferred drop, ghost cleanup). I would add this to the list of waits to filter out and re-run the wait stats query.
- 031: BACKUPBUFFER
- This shows up when you’re backing up directly to tape, which is slooooow. I’d be tempted to filter this out.
- 027: IO_COMPLETION
- This is SQL Server waiting for I/Os to complete and can be an indication of I/O subsystem problems.
- 024: SOS_SCHEDULER_YIELD
- This is most likely to be code running that doesn’t hit any resource waits, but it could be spinlock contention. More on this in my Spinlocks category.
- 022: DBMIRROR_EVENTS_QUEUE
- 022: DBMIRRORING_CMD
- These two are database mirroring just sitting around waiting for something to do. I would add these to the list of waits to filter out and re-run the wait stats query.
- 018: PAGELATCH_XX
- This is contention for access to in-memory copies of pages. The most well-known cases of these are the PFS, SGAM, and GAM contention that can occur in tempdb under certain workloads. To find out what page the contention is on, you’ll need to use the DMV sys.dm_os_waiting_tasks to figure out what page the latch is for. For tempdb issues, Robert Davis (blog|twitter) has a good post showing how to do this. Another common cause I’ve seen is an index hot-spot with concurrent inserts into an index with an identity value key.
- 016: LATCH_XX
- This is contention for some non-page structure inside SQL Server – so not related to IO or data at all. These can be hard to figure out and you’re going to be using the DMV sys.dm_os_latch_stats. More on this in my Latches category.
- 013: PREEMPTIVE_OS_PIPEOPS
- This is SQL Server switching to pre-emptive scheduling mode to call out to Windows for something. These were added for 2008 and haven’t been documented yet (anywhere). The easiest way to figure out what they mean is to remove the PREEMPTIVE_OS_ and then search for what’s left on MSDN – it’ll be the name of a Windows API.
- 013: THREADPOOL
- This says that there aren’t enough worker threads on the system to satisfy demand. Commonly this is large numbers of high-DOP queries trying to execute.
- 009: BROKER_TRANSMITTER
- This is just Service Broker waiting around for new messages to send. I would add this to the list of waits to filter out and re-run the wait stats query.
- 006: SQLTRACE_WAIT_ENTRIES
- Part of SQL Trace. I would add this to the list of waits to filter out and re-run the wait stats query./li>
- 005: DBMIRROR_DBM_MUTEX
- This one is undocumented and is contention for the send buffer that database mirroring shares between all the mirroring sessions on a server. It could indicate that you’ve got too many mirroring sessions.
- 005: RESOURCE_SEMAPHORE
- This is queries waiting for execution memory (the memory used to process the query operators – like a sort). This could be memory pressure or a very high concurrent workload.
- 003: PREEMPTIVE_OS_AUTHENTICATIONOPS
- 003: PREEMPTIVE_OS_GENERICOPS
- These are SQL Server switching to pre-emptive scheduling mode to call out to Windows for something. These were added for 2008 and haven’t been documented yet (anywhere). The easiest way to figure out what they mean is to remove the PREEMPTIVE_OS_ and then search for what’s left on MSDN – it’ll be the name of a Windows API.
- 003: SLEEP_BPOOL_FLUSH
- This is normal to see and indicates that checkpoint is throttling itself to avoid overloading the IO subsystem. I would add this to the list of waits to filter out and re-run the wait stats query.
- 002: MSQL_DQ
- This is SQL Server waiting for a distributed query to finish. This could indicate a problem with the distributed query, or it could just be normal.
- 002: RESOURCE_SEMAPHORE_QUERY_COMPILE
- When there are too many concurrent query compilations going on, SQL Server will throttle them. I don’t remember the threshold, but this can indicate excessive recompilation, or maybe single-use plans.
- 001: DAC_INIT
- I’ve never seen this one before and BOL says it’s because the dedicated admin connection is initializing. I can’t see how this is the most common wait on someone’s system…
- 001: MSSEARCH
- This is normal to see for full-text operations. If this is the highest wait, it could mean your system is spending most of its time doing full-text queries. You might want to consider adding this to the filter list.
- 001: PREEMPTIVE_OS_FILEOPS
- 001: PREEMPTIVE_OS_LIBRARYOPS
- 001: PREEMPTIVE_OS_LOOKUPACCOUNTSID
- 001: PREEMPTIVE_OS_QUERYREGISTRY
- These are SQL Server switching to pre-emptive scheduling mode to call out to Windows for something. These were added for 2008 and haven’t been documented yet (anywhere). The easiest way to figure out what they mean is to remove the PREEMPTIVE_OS_ and then search for what’s left on MSDN – it’ll be the name of a Windows API.
- 001: SQLTRACE_LOCK
- Part of SQL Trace. I would add this to the list of waits to filter out and re-run the wait stats query.
I hope you found this interesting! Let me know if there’s anything in particular you’re interested in seeing or just that you’re following along and enjoying the ride!
48 Responses to Wait statistics, or please tell me where it hurts
WoW … amazing post … This makes me feel allot better about my efforts at performance tuning over the past couple of months, I fall pretty much in line with these wait states. This also clarified allot of little points that the tech net descriptions kind of glosses over. I’ve been thinking that a server I’ve been working on was under CPU pressure, the tech net article says of SOS_SCHEDULER_YIELD "Occurs when a task voluntarily yields the scheduler for other tasks to execute. During this wait the task is waiting for its quantum to be renewed." ~ errr (sounds like a fortune cookie) but after working out allot of this is one of my top waits states.
This is very useful information. I’ll be sharing this article with my team tomorrow and look forward to the future wait stat analysis posts. Thanks!
BROKER_RECEIVE_WAITFOR: probably some of the surveyed sites were using SqlDependency and/or SqlCacheDependency.
Hi Jonathan – I’d remove those from consideration as well. Cheers
This is a great summary. I look forward to reading your future blogs on LATCH_XX wait tracing specifically the cause of Latch_EX waits under heavy load as this has been a thorn in my side for some time.
Another nice post about a very relevant subject that seems to still be a mystery to so many people.
Thanks for the mention!
Great Article.
Hi Paul, thanks for explaining this as you have, its great to be able to get some background on this sort of topic. I have tried the script on some of my servers here and have at the top of my list on one of them a Wait Type of DBMIRROR_DBM_EVENT which is explained in BOL as "Identified for informational purposes only. Not supported. Future compatibility is not guaranteed.". 4th in order on the same server is DBMIRROR_WORKER_QUEUE
with BOL saying "Indicates that the database mirroring worker task is waiting for more work." from that explanation I think I should be adding that to the excluded list but what should I do about the DBMIRROR_DBM_EVENT waits, is there any cause for concern or should I exclude that from the results too?
Great post Paul. Time series analysis really helps to understand the baseline of the system. Looking forward to read the post about it
Thanks for another great labor of love, Paul. You explain these things in a way that even I can understand!
Good Information on waits Statistics..
Fantastic post, Paul – that’s the best presentation of wait types I have seen and should give us all no excuses for not applying this troubleshooting methodology more rigorously.
I have one query about the DBMIRROR_DBM_EVENT wait type. This showed up on one of our systems last year as the lead wait type. After an exhaustive Premier Support Call, we were directed to two issues: this wait type can be symptomatic of an issue where the DBM thread is blocked because the checkpoint process does not yield sufficiently frequently (hence the DBMIRROR_DBM_EVENT wait) – this was a problem apparently fixed in SQL Server 2005 SP3 CU10; alternatively, it can also be symptomatic of throttling of the send rate for the log via the DbmFlowControl::ControlFlow function. Either way, it might be useful not to filter out DBMIRROR_DBM_EVENT as it can indirectly point to system bottlenecks.
Thanks a million for this post. I’ve often searched around for descriptions of the various core elements but it is rare you find something described as accessible as this.
Thank you.
@Donald Interesting info – it’s similar in that respect to the DBMIRROR_DBM_MUTEX counter. Thanks
DBMIRROR_DBM_EVENT waits can normally be ignored on systems where Database Mirroring is not configured. But if you have mirroring configured and you are seeing the waittime showing abnormally high values on the Principal or Mirror instances, then it could point to database mirroring synchronization taking a long time due to slow IO, other blocking tasks preventing the mirroring threads from continuing further or network latency. The issues that we addressd for Database Mirroring in SQL 2005 SP3 CU10 were for some issues that we identified and additioanl diagnostic information to be printed in the SQL errorlog for mirroring related issues. The manifestation for all these issues would show up as one of the database mirroring waittypes. The common one is: DBMIRROR_DBM_EVENT.
Interesting. Thanks.
Hi Paul, best article about wait stats I ever read. The reference links are also very helpful.
I have seen CXPACKET as the biggest wait on some of our servers. Does anyone has a good method to find out the offending parallel queries? Thanks!
I’d encourage the SQL Server folks to read up on Cary Millsap’s Method R work. All about constraint theory and waits.
Very nice article, though i read this article very late it was interesting and given me alot more indepth knowledge about my system lock waits… Thanks alot
Paul,
Thanks for an excellent blog post! There’s a ton of awesome information in here and I’ll definitely be refering to this in the future.
– Adam
Do you have any sort of rule of thumb for when a wait should be considered troublesome or when you should start looking deeper? Running your code on my server shows the biggest wait (60%) to be dbmirror_worker_queue, which can be ignored. The next two waits constitute 9% and 8% of the total wait time. (And the 8% one is dbmirror_send, which I would expect, since I pause mirroring on a 9 GB db while doing DBCC index maintenance, then resume mirroring and it takes about 3 hours for the mirror to become synchronized again). The others are 3% or less of total wait times. Server is running fine and I have no performance complaints from my users, but I’m just trying to be proactive. I would guess anything over 10%-15% of total wait time would warrant a closer look..
Hi Shaun – anything that’s at the top of the list warrants investigation, but bear in mind that waits are normal and there may be waits you can do nothing about – especially if the system is performing well like you say. Filter out the DBM ones and re-run the query – see what you find then. Cheers
As usual nice survey paul .good collection thanks.
Thanks For the Article. I have a server with waits on PREEMPTIVE_OS_WAITFORSINGLEOBJECT. Can you please guide me on this wait.
A quick Google search shows this: http://social.msdn.microsoft.com/Forums/en/sqldatabaseengine/thread/1bf2c509-568e-47fd-8413-67770ecda025
I had similar CXPACKET wait issues about a year ago. A senior co-worker looked at my code, and mentioned the waits were most likely being generated by my joining of large table variables (> 1000 rows) because the plan views a table variable as a single row, which could cause the optimizer to give unequal amounts of work to parallel processes. When I switched to using temp tables for larger sets of generated data, the CXPACKET waits dissapeared for the particular query.
Check the code running behind the CXPACKET wait. Look for joins on table variables, and see if you can analyze their approximate row count.
Paul, I love this post
No one SQL Server tuner in this universe cant escape from your blogs especially this one
[...] The test harness takes care of all of this except reformatting the drives, and also captures the wait stats for each test, making note of the most prevalent waits that make up the top 95% of all waits during the test. This uses the script from this blog post. [...]
[...] If you’re not familiar with wait statistics, I suggest starting with Paul’s post, Wait statistics, or please tell me where it hurts. Whether you have performance issues on your server or not, I highly recommend looking at [...]
[...] Last December I wrote a long blog post introducing wait statistics, showing how to use the sys.dm_os_wait_stats DMV, giving links to resources, and explaining the most common ones that people see in the field based on data from more than 1800 SQL Servers - see Wait statistics, or please tell me where it hurts. [...]
[...] the misconception that if you have wait stats, then you must have a performance problem. See my long blog post on wait stats for more [...]
[...] wait statistics analysis (see here), if you're seeing LATCH_EX, LATCH_SH, or LATCH_UP wait types are one of the top 3-4 prevalent [...]
[...] Wait statistics, or please tell me where it hurts [...]
[...] through my bookmarks, this is a good in-depth article SQL Server Wait Statistics: Tell me where it hurts – Paul Randal Reply With [...]
[...] find on internet about the “waits and queues” Troubleshooting methodology is Paul Randal’s “wait-statistics-or-please-tell-me-where-it-hurts”. I check it very often and whenever you google a waittype you are most likely to get the link to [...]
[...] [...]
Hi Paul. Thanks for this great post.
Could a fragmented log file, with a lot of VLFs, be related to a high writelog wait?
Regards
Possibly a contributor but unlikely to be the root cause.
Hi Paul,
I want to know, are Inner join and cross apply are RBAR or not? Please justify.
How is this relevant to wait stats?
Yeah obviously it is not relevant to wait stats.
By mistake I have posted my question here, sorry for that.
But still, can you answer it Because I didn’t get any of your articles about RBAR.
I haven’t written any articles about query operator internals – that’s not my area. The term RBAR only applies to applications or T-SQL code that process rows one-by-set instead of either caching results or using set-based operations, respectively.
Paul, I am curious about your statement on a BACKUPIO wait. “This shows up when you’re backing up directly to tape, which is slooooow. I’d be tempted to filter this out.” If this is a significant percentage of wait time, wouldn’t it make sense to look into disk backup or faster tape drives? Thanks.
You could, but if you’re stuck with direct-to-tape, I’d filter it out. Even fast tape drives are still going to be slower than going to disk, and many people don’t have the disk space to go two-step to disk and then off-disk. But a fair point.
Paul I just recently started seeing high OLE DB wait types. Now during those times we have our index rebuild script running as maintenance job. Could that be what’s causing the OLEDB waits?
No – it’s the sys.dm_db_index_physical_stats DMV. DMVs use OLE DB under the covers. Cheers
[...] run one of my favorite [stolen] [...]
[...] find on internet about the “waits and queues” Troubleshooting methodology is Paul Randal’s “wait-statistics-or-please-tell-me-where-it-hurts”. I check it very often and whenever you google a waittype you are most likely to get the link to [...]