This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!
For the last set of posts in our Accidental DBA series we’re going to focus on troubleshooting, and I want to start with Wait Statistics. When SQL Server executes a task, if it has to wait for anything (a lock to be released from a page, a page to be read from disk into memory, a write to the transaction log to complete), then SQL Server records that wait and the time it had to wait. This information accumulates, and can be queried using the sys.dm_os_wait_stats DMV, which was first available in SQL Server 2005. Since then, the waits and queues troubleshooting methodology has been a technique DBAs can use to identify problems, and areas for optimization, within an environment.
If you haven’t worked with wait statistics, I recommend starting with Paul’s wait stats post, and then working through the SQL Server Performance Tuning Using Wait Statistics: A Beginners Guide whitepaper.
Viewing Wait Statistics
If you run the following query:
SELECT * FROM sys.dm_os_wait_stats ORDER BY wait_time_ms DESC;
You will get back output that isn’t that helpful, as you can see below:
It looks like FT_IFTS_SCHEDULER_IDLE_WAIT is the biggest wait, and SQL Server’s waited for 1930299679 ms total. This is kind of interesting, but not what I really need to know. How do I really use this data? It needs some filtering and aggregation. There are some waits that aren’t going to be of interest because they occur all the time and are irrelevant for our purposes; we can filter out those wait types. To make the most of our wait stats output, I really want to know the highest wait based on the percentage of time spent waiting overall, and the average wait time for that wait. The query that I use to get this information is the one from Paul’s post (mentioned above). I won’t paste it here (you can get it from his post) but if I run that query against my instance, now I get only three rows in my output:
If we reference the various wait types listed in the MSDN entry for sys.dm_os_wait_stats, we see that the SQLTRACE_WAIT_ENTRIES wait type “Occurs while a SQL Trace event queue waits for packets to arrive on the queue.”
Well, this instance is on my local machine and isn’t very active, so that wait is likely due to the default trace that’s always running. In a production environment, I probably wouldn’t see that wait, and if I did, I’d check to see how many SQL Traces were running. But for our purposes, I’m going to add that as a wait type to filter out, and then re-run the query. Now there are more rows in my output, and the percentages for the PAGEIOLATCH_SH and LCK_M_X waits have changed:
If you review the original query, you will see that the percentage calculation for each wait type uses the wait_time_ms for the wait divided by the SUM of wait_time_ms for all waits. But “all waits” means only those wait types that the query does not filter out. Therefore, as you change which wait types you exclude, the percentages will change. Keep this in mind when you compare data over time or with other DBAs in your company: it’s a good idea to make sure you’re always running the same query that filters out the same wait types.
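To make the effect of the filter concrete, here is a minimal sketch of the percentage and average wait calculation. It is not Paul’s query, which filters many more wait types and does more work; the short exclusion list below is an assumption chosen only for illustration.

```sql
-- Simplified sketch: percentage and average wait over a filtered set of waits.
-- The exclusion list is illustrative and far shorter than a real one.
WITH FilteredWaits AS
(
    SELECT
        wait_type,
        waiting_tasks_count,
        wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN
    (
        N'FT_IFTS_SCHEDULER_IDLE_WAIT',
        N'SQLTRACE_WAIT_ENTRIES',
        N'LAZYWRITER_SLEEP',
        N'SLEEP_TASK',
        N'REQUEST_FOR_DEADLOCK_SEARCH',
        N'XE_TIMER_EVENT',
        N'CHECKPOINT_QUEUE'
    )
    AND waiting_tasks_count > 0
)
SELECT
    wait_type,
    wait_time_ms,
    waiting_tasks_count,
    -- The denominator is the total of the rows that survived the filter,
    -- so adding or removing an excluded wait type changes every percentage.
    CAST(100.0 * wait_time_ms / NULLIF(SUM(wait_time_ms) OVER (), 0) AS DECIMAL(5, 2)) AS pct_of_filtered_waits,
    CAST(1.0 * wait_time_ms / waiting_tasks_count AS DECIMAL(18, 2)) AS avg_wait_ms
FROM FilteredWaits
ORDER BY wait_time_ms DESC;
```

If you remove one entry from the NOT IN list and run it again, every value in pct_of_filtered_waits shifts, even though the underlying wait_time_ms values have not changed.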
Capturing Wait Statistics
So far I’ve talked about looking at wait statistics at a point in time. As a DBA, you want to know what waits are normal for each instance. And there will be waits for every instance; even if it’s highly tuned or incredibly low volume, there will be waits. You need to know what’s normal, and then use those values as a point of comparison when the system is not performing well.
The easiest way to capture wait statistics is to snapshot the data to a table on a regular basis, and you can find queries for this process in my Capturing Baselines for SQL Server: Wait Statistics article on SQLServerCentral.com. Once you have your methodology in place to capture the data, review it on a regular basis to understand your typical waits, and identify potential issues before they escalate. When you do discover a problem, then you can use wait statistics to aid in your troubleshooting.
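As a rough illustration of the snapshot idea, the sketch below copies the DMV into a table along with a capture time. These are not the queries from the SQLServerCentral article; the table name dbo.WaitStatsSnapshot and its layout are assumptions made for this example, and it should be created in whichever database you use for baseline data.

```sql
-- Hypothetical baseline table; the name and columns are assumptions for this sketch.
IF OBJECT_ID(N'dbo.WaitStatsSnapshot', N'U') IS NULL
BEGIN
    CREATE TABLE dbo.WaitStatsSnapshot
    (
        capture_time        DATETIME     NOT NULL,
        wait_type           NVARCHAR(60) NOT NULL,
        waiting_tasks_count BIGINT       NOT NULL,
        wait_time_ms        BIGINT       NOT NULL,
        max_wait_time_ms    BIGINT       NOT NULL,
        signal_wait_time_ms BIGINT       NOT NULL
    );
END;

-- Run this INSERT on a schedule (for example, from a SQL Server Agent job).
INSERT INTO dbo.WaitStatsSnapshot
    (capture_time, wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms)
SELECT
    GETDATE(),
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    max_wait_time_ms,
    signal_wait_time_ms
FROM sys.dm_os_wait_stats;
```

Because the DMV values are cumulative since the instance last started (or since the statistics were last cleared), it is the difference between two snapshots, rather than any single snapshot, that describes what happened in a given period.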
Using the Data
At the time that you identify a problem in your environment, a good first step is to run your wait statistics query and compare the output to your baseline numbers. If you see something out of the ordinary, you have an idea where to begin your investigation. But that’s it; wait statistics simply tell you where to start searching for your answer. Do not assume that your highest wait is the problem, or even that it’s a problem at all. For example, a common top wait is CXPACKET, and CXPACKET waits indicate that parallelism is used, which is expected in a SQL Server environment. If that’s your top wait, does that mean you should immediately change the MAXDOP setting for the instance? No. You may end up changing it down the road, but a better direction is to understand why that’s the highest wait. You may have CXPACKET waits because you’re missing some indexes and there are tons of table scans occurring. You don’t need to change MAXDOP, you need to start tuning.
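When a problem is happening right now, one way to see which waits are accumulating at this moment, as opposed to since the last restart, is to take two readings and subtract them. The following is a sketch under that assumption; the temporary table name and the five-minute interval are arbitrary choices, and you would normally apply the same wait type filter you use for your baseline.

```sql
-- Capture a starting snapshot into a temporary table.
IF OBJECT_ID(N'tempdb..#waits_before') IS NOT NULL
    DROP TABLE #waits_before;

SELECT wait_type, waiting_tasks_count, wait_time_ms
INTO #waits_before
FROM sys.dm_os_wait_stats;

-- Sample interval: 5 minutes here, an arbitrary choice.
WAITFOR DELAY '00:05:00';

-- Waits accumulated during the interval only.
SELECT
    a.wait_type,
    a.waiting_tasks_count - b.waiting_tasks_count AS waits_in_interval,
    a.wait_time_ms - b.wait_time_ms AS wait_ms_in_interval
FROM sys.dm_os_wait_stats AS a
JOIN #waits_before AS b
    ON b.wait_type = a.wait_type
WHERE a.wait_time_ms - b.wait_time_ms > 0
ORDER BY wait_ms_in_interval DESC;

DROP TABLE #waits_before;
```

The output is still only a pointer to where to look next. A high CXPACKET delta, for example, says that parallel queries ran during the interval, not that MAXDOP is wrong.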
Another good example is the WRITELOG wait type. WRITELOG waits occur when SQL Server is waiting for a log flush to complete. A log flush occurs when information needs to be written to the database’s transaction log. A log flush should complete quickly, because when there is a delay in a log write, then the task that initiated the modification has to wait, and tasks may be waiting behind that. But a log flush doesn’t happen instantaneously every single time, so you will have WRITELOG waits. If you see WRITELOG as your top wait, don’t immediately assume you need new storage. You should only assume that you need to investigate further. A good place to start would be looking at read and write latencies, and since I’ll be discussing monitoring IO more in tomorrow’s post we’ll shelve that discussion until then.
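Before drawing any conclusion from a large WRITELOG total, it can help to look at how long each wait takes on average, since a busy system can accumulate a large total from many short waits. The query below is a minimal sketch of that check; the column aliases are my own, and what counts as an acceptable average depends on your workload and storage.

```sql
-- Average WRITELOG wait, split into resource wait and signal wait.
-- wait_time_ms includes signal_wait_time_ms, so subtracting gives the resource portion.
SELECT
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_ms,
    CAST(1.0 * wait_time_ms / NULLIF(waiting_tasks_count, 0) AS DECIMAL(18, 2)) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'WRITELOG';
```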
As you can see from these two examples, wait statistics are a starting point. They are very valuable – it’s easy to think of them as “the answer”, but they’re not. Wait statistics do not tell you the entire story about a SQL Server implementation. There is no one “thing” that tells you the entire story, which is why troubleshooting can be incredibly frustrating, yet wonderfully satisfying when you find the root of a problem. Successfully troubleshooting performance issues in SQL Server requires an understanding of all the data available to aid in your discovery and investigation, understanding where to start, and what information to capture to correlate with other findings.
Our online training (Pluralsight) courses that can help you with this topic: