Announcing the comprehensive SQL Server Wait Types and Latch Classes Library

It’s finally ready!

For the last two years, I’ve been working on-and-off on a new community resource. It was postponed during 2015 while I mentored 50+ people, but this year I’ve had a bunch of time to work on it.

I present to the community a comprehensive library of all wait types and latch classes that have existed since SQL Server 2005 (yes, it includes 2016 waits and latches).

The idea is that over time, this website will have the following information about all wait types and latch classes:

  • What they mean
  • When they were added
  • How they map into Extended Events (complete for all entries already)
  • Troubleshooting information
  • Example call stacks of where they occur inside SQL Server
  • Email link for feedback and questions

As of today, I have complete information for more than 225 common wait types (897 entries in the library) and 26 common latch classes (185 entries in the library), and I’m adding more information constantly.

If there’s one that I haven’t done yet, and you’re really interested in it, click the email link on the page to let me know. Also, if you’ve seen something be a bottleneck that I haven’t, let me know so I can update the relevant page too.

I’m doing this because there is so little information about waits and latches available online and I know that people are constantly searching for information about them. This is very much a labor of love for me as wait statistics are now my favorite area to teach.

Waiting until it’s complete isn’t feasible, so it’s open and available for use now!

Check it out at https://www.SQLskills.com/help/waits and https://www.SQLskills.com/help/latches, or jump to some examples PAGELATCH_EX, SOS_SCHEDULER_YIELD, WRITELOG, CXPACKET.

Enjoy!

Most common latch classes and what they mean

(Check out my Pluralsight online training course: SQL Server: Performance Troubleshooting Using Wait Statistics and my comprehensive library of all wait types and latch classes.)

Back in May I kicked off a survey about prevalent latches on SQL Server instances across the world (see here). It’s taken me quite a while to get around to collating and blogging about the results, but here they are at last!

I got information back for almost 600 servers, and if you remember, I gave you some code to run that would output the top non-page latches that are being waited on during LATCH_XX waits. Non-page latches are those that are neither PAGELATCH_XX (waiting for access to an in-memory copy of a data-file page) nor PAGEIOLATCH_XX (waiting for a data file page to be read into memory from disk).

Every non-page data structure in SQL Server that must be thread-safe (i.e. can be accessed by multiple threads) must have a synchronization mechanism associated with it – either a latch or a spinlock. I blogged about spinlocks last year – see  – and they’re used for some data structures that are accessed so frequently, and for such a short time, that even the expense of acquiring a latch is too high. Access to all other data structures is controlled through a latch.

Using wait statistics analysis (see here), if you’re seeing LATCH_EX or LATCH_SH wait types are one of the top 3-4 prevalent wait types, contention for a latch might be contributing to performance issues. If that is the case, you’ll need to use sys.dm_os_latch stats to figure out what the most prevalent latch is – using code similar to that below (with some example output):

/*============================================================================
  File:     LatchStats.sql

  Summary:  Snapshot of Latch stats

  SQL Server Versions: 2005 onwards
------------------------------------------------------------------------------
  Written by Paul S. Randal, SQLskills.com

  (c) 2015, SQLskills.com. All rights reserved.

  For more scripts and sample code, check out 
    http://www.SQLskills.com

  You may alter this code for your own *non-commercial* purposes. You may
  republish altered code as long as you include this copyright and give due
  credit, but you must obtain prior permission before blogging this code.
  
  THIS CODE AND INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF 
  ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED 
  TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A
  PARTICULAR PURPOSE.
============================================================================*/

WITH [Latches] AS
    (SELECT
        [latch_class],
        [wait_time_ms] / 1000.0 AS [WaitS],
        [waiting_requests_count] AS [WaitCount],
        100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage],
        ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum]
    FROM sys.dm_os_latch_stats
    WHERE [latch_class] NOT IN (
        N'BUFFER')
    AND [wait_time_ms] > 0
)
SELECT
    MAX ([W1].[latch_class]) AS [LatchClass], 
    CAST (MAX ([W1].[WaitS]) AS DECIMAL(14, 2)) AS [Wait_S],
    MAX ([W1].[WaitCount]) AS [WaitCount],
    CAST (MAX ([W1].[Percentage]) AS DECIMAL(14, 2)) AS [Percentage],
    CAST ((MAX ([W1].[WaitS]) / MAX ([W1].[WaitCount])) AS DECIMAL (14, 4)) AS [AvgWait_S]
FROM [Latches] AS [W1]
INNER JOIN [Latches] AS [W2]
    ON [W2].[RowNum] <= [W1].[RowNum]
GROUP BY [W1].[RowNum]
HAVING SUM ([W2].[Percentage]) - MAX ([W1].[Percentage]) < 95; -- percentage threshold
GO
LatchClass                        Wait_S  WaitCount  Percentage  AvgWait_S
——————————— ——- ———- ———– ———-
LOG_MANAGER                       221.43  4659       45.81       0.0475
ACCESS_METHODS_HOBT_VIRTUAL_ROOT  199.56  7017       41.28       0.0284
FGCB_ADD_REMOVE                   35.17   1047       7.27        0.0336
DBCC_OBJECT_METADATA              26.85   256490     5.55        0.0001

You can also see the latch class of the latches being waited for in the resource_description column of sys.dm_os_waiting_tasks DMV when the wait_type column is LATCH_XX.

The survey results are based on running this code, for latch classes that are the top latch class on 10 or more servers.

Just like the wait statistics survey results from 2010, I’m not surprised by these results as we see these over and over on client systems.

For the remainder of this post I’m going to list the 10 latch classes represented in the data above in descending order and explain what they mean. Some of this information is from my knowledge of the internals and some of it is from catching the point that latches are acquired using Extended Events and callstack dumping.

Note that when I’m describing a latch class, there may be many instances of that particular latch class. For instance, an FGCB_ADD_REMOVE latch exists for each filegroup in each database attached to the instance.

  • 164: ACCESS_METHODS_DATASET_PARENT
  • 83: ACCESS_METHODS_SCAN_RANGE_GENERATOR
    • These two latches are used during parallel scans to give each thread a range of page IDs to scan. The LATCH_XX waits for these latches will typically appear with CXPACKET waits and PAGEIOLATCH_XX waits (if the data being scanned is not memory-resident). Use normal parallelism troubleshooting methods to investigate further (e.g. is the parallelism warranted? maybe increase ‘cost threshold for parallelism’, lower MAXDOP, use a MAXDOP hint, use Resource Governor to limit DOP using a workload group with a MAX_DOP limit. Did a plan change from index seeks to parallel table scans because a tipping point was reached or a plan recompiled with an atypical SP parameter or poor statistics? Do NOT knee-jerk and set server MAXDOP to 1 – that’s some of the worst advice I see on the Internet.)
  • 80: ACCESS_METHODS_HOBT_COUNT
    • This latch is used to flush out page and row count deltas for a HoBt (Heap-or-B-tree) to the Storage Engine metadata tables. Contention would indicate *lots* of small, concurrent DML operations on a single table.
  • 72: LOG_MANAGER
    • If you see this latch it is almost certainly because a transaction log is growing because it could not clear/truncate for some reason. Find the database where the log is growing and then figure out what’s preventing log clearing using: SELECT [log_reuse_wait_desc] FROM sys.databases WHERE [name] = N’youdbname’
  • 34: TRACE_CONTROLLER
    • This latch is used by SQL Trace for myriad different things, including just generating trace events. Contention on this latch would imply that there are multiple traces on the server tracing lots of stuff – i.e. you’re over-tracing.
  • 33: DBCC_MULTIOBJECT_SCANNER
    • This latch appears on Enterprise Edition when DBCC CHECK* commands are allowed to run in parallel. It is used by threads to request the next data file page to process. Late last year this was identified as a major contention point inside DBCC CHECK* and there was work done to reduce the contention and make DBCC CHECK* run faster. See KB article 2634571 and Bob Ward’s write-up for more details.
  • 16: ACCESS_METHODS_HOBT_VIRTUAL_ROOT
    • This latch is used to access the metadata for an index that contains the page ID of the index’s root page. Contention on this latch can occur when a B-tree root page split occurs (requiring the latch in EX mode) and threads wanting to navigate down the B-tree (requiring the latch in SH mode) have to wait. This could be from very fast population of a small index using many concurrent connections, with or without page splits from random key values causing cascading page splits (from leaf to root).
  • 16: FGCB_ADD_REMOVE
    • FGCB stands for File Group Control Block. This latch is required whenever a file is added or dropped from the filegroup, whenever a file is grown (manually or automatically), when recalculating proportional-fill weightings, and when cycling through the files in the filegroup as part of round-robin allocation. If you’re seeing this, the most common cause is that there’s a lot of file auto-growth happening. It could also be from a filegroup with lots of file (e.g. the primary filegroup in tempdb) where there are thousands of concurrent connections doing allocations. The proportional-fill weightings are recalculated every 8192 allocations, so there’s the possibility of a slowdown with frequent recalculations over many files.
  • 15: DATABASE_MIRRORING_CONNECTION
    • This latch is involved in controlling the message flow for database mirroring sessions on a server. If this latch is prevalent, I would suspect there are too many busy database mirroring sessions on the server.
  • 10: NESTING_TRANSACTION_FULL
    • This latch, along with NESTING_TRANSACTION_READONLY, is used to control access to transaction description structures (called an XDES) for parallel nested transactions. The _FULL is for a transaction that’s ‘active’, i.e. it’s changed the database (usually for an index build/rebuild), and that makes the _READONLY description obvious. A query that involves a parallel operator must start a sub-transaction for each parallel thread that is used – these transactions are sub-transactions of the parallel nested transaction. For contention on these, I’d investigate unwanted parallelism but I don’t have a definite “it’s usually this problem”. Also check out the comments for some info about these also sometimes being a problem when RCSI is used.

I hope you found this interesting and it helps you out with your performance troubleshooting!

Survey: most prevalent latch waits (code to run)

I first started blogging about latches and some of the deeper parts of SQL Server internals last year (see Advanced performance troubleshooting: waits, latches, spinlocks) and now I’d like to pick up that thread (no scheduling pun intended :-)) and blog some more about some of the common latches that could be a performance bottleneck.

To that end, I’ve got some code below (plus example output) that will show the most common latch waits that have occurred on your system.

WITH Latches AS
(SELECT
latch_class,
wait_time_ms / 1000.0 AS WaitS,
waiting_requests_count AS WaitCount,
100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
FROM sys.dm_os_latch_stats
WHERE latch_class NOT IN (
‘BUFFER’)
AND wait_time_ms > 0

)
SELECT
W1.latch_class AS LatchClass,
CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
W1.WaitCount AS WaitCount,
CAST (W1.Percentage AS DECIMAL(14, 2)) AS Percentage,
CAST ((W1.WaitS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgWait_S
FROM Latches AS W1
INNER JOIN Latches AS W2
ON W2.RowNum <= W1.RowNum
WHERE W1.WaitCount > 0
GROUP BY W1.RowNum, W1.latch_class, W1.WaitS, W1.WaitCount, W1.Percentage
HAVING SUM (W2.Percentage) – W1.Percentage < 95; — percentage threshold
GO

LatchClass                        Wait_S  WaitCount  Percentage  AvgWait_S
——————————— ——- ———- ———– ———-
LOG_MANAGER                       221.43  4659       45.81       0.0475
ACCESS_METHODS_HOBT_VIRTUAL_ROOT  199.56  7017       41.28       0.0284
FGCB_ADD_REMOVE                   35.17   1047       7.27        0.0336
DBCC_OBJECT_METADATA              26.85   256490     5.55        0.0001

I’d like you to run the code and send me the output (either as a comment or in email). I’ll collate all your output and do some blogging for your enjoyment. [Update: I don’t need any more data – thanks!]

Thanks!