This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!
So far this week, Erin has covered the importance of having a baseline of your SQL Server performance and some of the tools for collecting and analyzing that information, and Kimberly has talked about some of the important DMVs. In this post we’ll take a look at some of the most common performance counters for SQL Server, what they mean, and which ones are the most important for routine monitoring to identify problems with SQL Server.
Erin talked about the PAL tool earlier this week and mentioned the template that is provided by the tool for creating a Data Collector Set in Performance Monitor to capture SQL Server-specific performance counters. You can see how to do this in the article that Erin linked to, Free Tools for the DBA: PAL Tool, if you haven’t already looked at this. As a DBA, I loved the PAL tool for its simplicity and the time it saved me configuring performance counter collections on each of my SQL Servers, and then the time it saved me analyzing the data that was collected.
However, the PAL tool has its limitations, and one of them is that it applies fixed thresholds to certain performance counters, using threshold values that Microsoft published back in the SQL Server 2000 era and has not significantly updated since. Until just last week, the SQL Server template applied a fixed threshold of 300 to the analysis of Page Life Expectancy, which doesn’t fit the growth in server memory sizes that has occurred since that value was first published. Now the template analyzes Page Life Expectancy based on the value you specify as being normal for the server, or a default of 700 if no value is defined.
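To make the scaling idea concrete, here is a minimal Python sketch of the commonly cited community rule of thumb for a memory-adjusted PLE threshold, (buffer pool GB / 4) * 300. The function name is illustrative, and the formula is a guideline, not an official Microsoft number:

```python
# Hypothetical helper: scale the PLE threshold with buffer pool size instead
# of using the legacy fixed value of 300. The (GB / 4) * 300 formula is a
# widely cited community guideline, not an official Microsoft threshold.
def ple_threshold(buffer_pool_gb: float) -> int:
    """Return a PLE threshold (seconds) scaled to the buffer pool size."""
    return int((buffer_pool_gb / 4) * 300)

print(ple_threshold(4))   # a 4GB buffer pool matches the old fixed 300
print(ple_threshold(64))  # a 64GB buffer pool warrants a much higher value
```

At 4GB of buffer pool the formula reproduces the legacy threshold of 300, which is why the old fixed value made sense on SQL Server 2000-era hardware but not on modern servers.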
While the PAL template for SQL Server is very comprehensive, I also maintain a short list of performance counters that I use for spot-checking server performance periodically. To simplify covering each counter, why it’s important, and what to look for, we’ll look at the counters for CPU, memory, and disk separately, starting with CPU.
- Processor
  - %Processor Time
  - %Privileged Time
- Process (sqlservr.exe)
  - %Processor Time
  - %Privileged Time
The Processor\%Processor Time counter is probably the most familiar counter in Performance Monitor, and it gives us the total CPU usage for the server we are looking at, unless that server is a virtual machine. For virtual machines (VMs), Processor\%Processor Time is not a reliable measure of actual CPU usage; instead it shows the percentage of the allocated CPU resources that the VM is actually using at a given point in time. High values for %Processor Time in a VM need to be checked against the host’s performance counters for the VM to determine whether the VM is contending for physical CPU allocations with other guest VMs on the same host. In addition to the Processor counter, we also want to collect Process (sqlservr.exe)\%Processor Time so we can determine whether a high-CPU condition is actually the result of SQL Server utilization or of another application running on the server. For most SQL Server workloads the average processor usage over time should be fairly consistent, and unexplained increases from the baseline trend should be investigated further. Values consistently greater than 80% should also be investigated, to determine whether the workload is exceeding the current hardware capabilities or a problem is increasing the overall CPU usage for the instance.
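As a rough illustration of weighing the two counters against each other, here is a hypothetical Python sketch; the function name, the 80% threshold, and the attribution heuristic are illustrative, not part of PerfMon or any official guidance:

```python
# Hypothetical sketch: given aligned samples of Processor\%Processor Time
# and Process(sqlservr.exe)\%Processor Time, flag sustained high CPU and
# attribute it to SQL Server or to some other process on the server.
def diagnose_cpu(total_pct, sql_pct, threshold=80.0):
    """Return a rough diagnosis from two aligned lists of %CPU samples."""
    avg_total = sum(total_pct) / len(total_pct)
    avg_sql = sum(sql_pct) / len(sql_pct)
    if avg_total < threshold:
        return "ok"
    # If sqlservr.exe accounts for most of the total, SQL Server is the
    # consumer; otherwise another application is driving the CPU usage.
    return "sql-server" if avg_sql >= avg_total * 0.8 else "other-process"

print(diagnose_cpu([50, 60, 55], [45, 50, 48]))  # ok
print(diagnose_cpu([90, 95, 92], [85, 90, 88]))  # sql-server
print(diagnose_cpu([90, 95, 92], [20, 25, 22]))  # other-process
```

Remember that inside a VM these percentages are relative to the VM’s allocation, so a real diagnosis would also consult the host’s counters as described above.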
- Memory
  - Available MBytes
- SQL Server:Buffer Manager
  - Lazy writes/sec
  - Page life expectancy
  - Page reads/sec
  - Page writes/sec
- SQL Server:Memory Manager
  - Total Server Memory (KB)
  - Target Server Memory (KB)
The first counter I start with when looking at memory usage on a SQL Server is Memory\Available MBytes, to determine whether Windows is experiencing memory pressure, which could in turn affect the SQL Server performance counter values. The Memory\Available MBytes counter should ideally stay above 150-300MB for the server, which leaves memory available for other applications to run without pushing Windows into a low-memory condition. On most servers, when the Available MBytes counter drops below 64MB, Windows signals a low-memory notification that SQL Server monitors for, and the SQLOS (the ‘operating system’ inside of SQL Server) will reduce memory usage as a result of this notification occurring.
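The thresholds above can be sketched as a simple classification. This is a hypothetical helper, assuming the 64MB low-memory notification point and the 150-300MB cushion from the paragraph:

```python
def memory_state(available_mb: float) -> str:
    """Classify the Memory -> Available MBytes counter value."""
    if available_mb < 64:
        # Windows raises a low-memory notification; SQLOS will shrink
        return "low"
    if available_mb < 150:
        # Below the recommended 150-300MB cushion for other applications
        return "warning"
    return "ok"

print(memory_state(32))   # low
print(memory_state(100))  # warning
print(memory_state(500))  # ok
```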
After confirming that Windows has available memory to prevent SQL Server from shrinking memory usage, the next counters I generally look at are Buffer Manager\Page life expectancy and Lazy writes/sec. The Page life expectancy (PLE) should be generally consistent on average, but may fluctuate with changes in the server workload, with lower values at peak periods. In general, the more memory allocated to a SQL Server instance, the higher I’d like to see this counter’s normal value. For example, a server with 230GB RAM allocated to the SQL Server buffer pool and a PLE of 300 would equate roughly to 785MB/sec of I/O activity to maintain the page churn inside of the buffer pool. While it might be possible for the I/O subsystem to keep up with this demand, this represents a significant amount of page churn in the buffer pool and may be a sign of missing indexes, implicit conversions due to mismatched data types, and all kinds of other problems that can be fixed if identified as the root cause.
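The 785MB/sec figure above comes from simple arithmetic: the buffer pool size divided by the number of seconds a page is expected to live. A small Python sketch (the function name is illustrative):

```python
def buffer_churn_mb_per_sec(buffer_pool_gb: float, ple_seconds: float) -> float:
    """Approximate MB/sec of page churn implied by a given PLE."""
    # If every page in the pool turns over once per PLE interval, the I/O
    # needed to sustain that churn is pool size / PLE.
    return (buffer_pool_gb * 1024) / ple_seconds

# 230GB buffer pool with a PLE of 300 => roughly 785MB/sec of I/O
print(round(buffer_churn_mb_per_sec(230, 300)))  # 785
```

This is why a PLE of 300 that was tolerable on a small SQL Server 2000-era box represents an enormous amount of I/O on a modern server with hundreds of gigabytes of RAM.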
If you find a low PLE for the server, look at the other performance counters in the Buffer Manager category for correlation of overall memory pressure inside of SQL Server. If Lazy writes/sec consistently has non-zero values along with a low PLE and elevated values for Page reads/sec and Page writes/sec, the server is experiencing buffer pool contention and you will need to troubleshoot the problem further. If the Memory\Available MBytes counter was fluctuating and under the 64MB low-memory threshold, the Memory Manager\Total Server Memory (KB) and Target Server Memory (KB) counters will tell you whether that has caused SQL Server to reduce the size of the buffer pool in response. On a stable system Total Server Memory (KB) will be lower than Target Server Memory (KB) only during the initial buffer pool ramp-up, and will then remain equal to it under normal operation unless a low-memory condition occurs. If the server is a VM running on VMware, check the VM Memory\Memory Ballooned (MB) counter for non-zero values to see whether memory ballooning by the host is causing the low-memory condition.
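The Total vs. Target comparison can be expressed as a tiny decision helper. This is a hypothetical sketch, assuming you know from context whether the instance is still in its initial ramp-up:

```python
def memory_manager_state(total_kb: int, target_kb: int,
                         ramping_up: bool = False) -> str:
    """Interpret Total Server Memory (KB) vs. Target Server Memory (KB)."""
    if total_kb >= target_kb:
        # On a stable system the two counters are equal under normal operation
        return "stable"
    # Total below Target is expected only during the initial buffer pool
    # ramp-up; otherwise it suggests SQL Server has shrunk its memory usage.
    return "ramp-up" if ramping_up else "possible external memory pressure"

print(memory_manager_state(8388608, 8388608))                   # stable
print(memory_manager_state(4194304, 8388608, ramping_up=True))  # ramp-up
```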
- Physical Disk
  - Avg. Disk sec/Read
  - Avg. Disk Bytes/Read
  - Avg. Disk sec/Write
  - Avg. Disk Bytes/Write
- Paging File
- SQL Server:Access Methods
  - Forwarded Records/sec
  - Full Scans/sec
  - Index Searches/sec
The Avg. Disk sec/Read and Avg. Disk sec/Write counters provide the current average latency for each of the disks on the server. Latency is one of the most important metrics for SQL Server I/O performance, but it should be compared to the size of the I/Os being performed before deciding whether a specific value indicates a problem. As the size of the I/O operations increases, so does the latency of the operation, so it is normal to see higher latency values for a reporting workload doing large table scans than for a transactional workload with smaller I/O operations. Microsoft’s general recommendations for I/O latency for SQL Server are:
- < 8ms: excellent
- < 12ms: good
- < 20ms: fair
- > 20ms: poor
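The tiers above translate directly into code. Note that PerfMon reports Avg. Disk sec/Read and Avg. Disk sec/Write in seconds, so the sketch converts to milliseconds first (the function name is illustrative):

```python
def latency_rating(avg_disk_sec: float) -> str:
    """Rate an Avg. Disk sec/Read or sec/Write value (in seconds)."""
    ms = avg_disk_sec * 1000.0  # PerfMon reports this counter in seconds
    if ms < 8:
        return "excellent"
    if ms < 12:
        return "good"
    if ms < 20:
        return "fair"
    return "poor"

print(latency_rating(0.005))  # excellent
print(latency_rating(0.015))  # fair
print(latency_rating(0.035))  # poor
```

As the surrounding text stresses, a "poor" rating on its own is not a verdict; large I/O sizes (for example, paging file writes or big scans) legitimately produce higher latencies.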
Over the last two years of consulting, only a few of the servers that I’ve looked at during health checks meet the <20ms latency numbers for data and log files; most tend to fall into the <30ms range for I/O latency per disk on average. SSDs are changing this rapidly, and we’re seeing more and more disk configurations that include SSDs, with very low I/O latency as a result. When looking at the overall I/O for the server, reviewing the Access Methods counters Full Scans/sec, Forwarded Records/sec, and Index Searches/sec can give us a clue to the type of workload generating the I/O.
An important consideration when looking at performance counters, or any monitoring data from SQL Server for that matter, is that no single data point will tell you the root cause of a problem, if one occurs. For performance counters, you need to look across multiple counters for correlating information to pinpoint the root of problems.
For example, if we look at disk I/O and see high latency numbers with lower I/O sizes on average, we could infer that we have a disk bottleneck and need to improve our I/O performance. However, if we then look at Buffer Manager\Page life expectancy and find that it is lower than our normal baseline for the server, and see that Buffer Manager\Page reads/sec is higher than usual, the data instead points to memory contention in the buffer pool, which drives extra I/O to keep up with the page churn occurring there.
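The correlation logic in this example can be sketched as a hypothetical helper; the names and thresholds are illustrative, and real troubleshooting would weigh many more counters:

```python
def likely_bottleneck(read_latency_ms: float, ple: float,
                      page_reads_sec: float,
                      baseline_ple: float,
                      baseline_page_reads_sec: float) -> str:
    """Very rough correlation of disk latency with buffer pool counters."""
    if read_latency_ms > 20:  # "poor" per the latency tiers above
        if ple < baseline_ple and page_reads_sec > baseline_page_reads_sec:
            # Low PLE plus elevated physical reads: the buffer pool churn
            # is generating the I/O, so memory is the real bottleneck.
            return "memory"
        return "disk"
    return "none"

print(likely_bottleneck(30, 200, 5000, 1000, 1000))  # memory
print(likely_bottleneck(30, 1200, 800, 1000, 1000))  # disk
```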
Understanding the counters and how they relate makes it much easier to spot when problems are occurring and then pinpoint where the actual root of the problem might be.
3 thoughts on “The Accidental DBA (Day 21 of 30): Essential PerfMon counters”
Thank you for the feedback on the PAL tool. I’m the author of the tool and I work with experts in the field who own and defend the threshold files. David Pless owns the SQL Server related threshold files. With that in mind, anyone (including you) can easily create your own threshold files, optionally based off of other threshold files. There are no limitations on the analysis you can do. It is all PowerShell driven and you can code it how you see fit. If you need assistance with an *ultimate* SQL Server threshold file, or just want to update the existing one, then let me know at email@example.com. Thank you!
Keep in mind that page file writes are done at 1 MB I/O sizes, which can easily take 35 ms or more on average and is acceptable, so I/O size and queue length also need to be considered in disk analysis. Disk analysis is quite complicated, but it’s important to give people an easy indicator. In the next minor update to PAL, I have a disk overwhelmed analysis and a physical memory overwhelmed analysis. These take all of the counters and patterns into consideration and simplify them. Also, I am covering these details in depth in my upcoming book, “The Microsoft Windows Performance Analysis Field Guide”, due out May 2014.
Hi Jonathan – First of all THANK YOU for this great in depth series of posts here (from all of you).
I am a bit confused by, or may not clearly understand, your point mentioning “On a stable system Total Server Memory (KB) will be lower than Target Server Memory (KB)”.
If Total Server Memory < Target Server Memory, does that mean it can be a sign of memory pressure? Please correct me if I am wrong; am I interpreting this incorrectly?