In the last month I have had to explain how to interpret CPU Ready Time information for SQL Server VMs running on VMware to a number of people. The first time was on Twitter, and the topic is really too big for a 140-character discussion, but I went ahead and gave it a shot. It is rare in my experience for SQL professionals to know anything about virtualization under VMware or Hyper-V, which is why we spend half a day or more on virtualization during IE3: High Availability & Disaster Recovery. Given that, I was surprised to see the question being asked on #sqlhelp at all. It turns out that Idera Diagnostics Manager has been updated to include information about VM performance that is collected from Virtual Center if you have it installed in your data center.
Overall this is a good thing, because DBAs now have information that they previously could not see unless they had negotiated access to Virtual Center with their VM administrator. My experience consulting is that few DBAs have access to, or even know about, Virtual Center, even though it is one of the critical tools for troubleshooting performance problems. Another thing most DBAs don’t know is that their VM administrator can configure roles with read-only access that let anyone see performance and configuration information inside Virtual Center without being able to make changes. There really is no good reason for a DBA not to have access to the performance data available in Virtual Center.
Now that DBAs have access to the data, it is important to know what is being displayed and what it really means. Unfortunately, the way the information is presented makes it easy to misinterpret, and the end result is confusion for DBAs who don’t know a lot about VMware. An example of the VM CPU Ready information from Diagnostics Manager is shown in the chart in Figure 1.
Figure 1 – VM CPU Ready graph
Looking at this chart on the report, what would your first interpretation of it be? So far, 3 out of 3 people, including the one on Twitter, have thought that there was a problem with this VM. The reality is that there is absolutely nothing wrong with this VM; it is doing just fine. The scale of this graph makes it easy to assume that it is showing a percentage value, and we all know that anything performance related that is over 80% can’t be good. The chart doesn’t say it is a percentage, but it doesn’t provide any context for what the values mean either. To make matters even more confusing, CPU Ready in VMware is available both as a summation value in milliseconds, which is what is shown here, and as a percentage (RDY% in esxtop) of the time the VM spent with work to do while waiting to be scheduled by the hypervisor. We can confirm that the information presented is the summation value and not the percentage by looking at the real-time information available in Virtual Center for the same server, as shown in the chart in Figure 2.
Figure 2 – CPU Ready real-time summation from Virtual Center
So now that we know it’s a summation represented in milliseconds, what exactly does that tell us? Unfortunately, it doesn’t tell us anything on its own.
What is CPU Ready Time and why do we even care?
CPU Ready Time is the time that the VM waits in a ready-to-run state (meaning it has work to do) to be scheduled on one or more of the physical CPUs by the hypervisor. It is generally normal for VMs to accumulate small values for CPU Ready Time even if the hypervisor is not over subscribed or under heavy activity; it’s just the nature of shared scheduling in virtualization. For SMP VMs with multiple vCPUs, the amount of ready time will generally be higher than for VMs with fewer vCPUs, since it requires more resources to schedule/co-schedule the VM when necessary and each of the vCPUs accumulates ready time separately.
At what point does CPU Ready Time start to affect performance?
To be honest it is always having some minimal effect, but it really depends on a lot of different factors, for example which CPU Ready value you are looking at and where you are getting the information. If you are looking at raw RDY% values from esxtop, the value has a completely different meaning than the summation values that are available from Virtual Center. Inside of Virtual Center, the level of summation you are looking at when reading the values also affects the meaning that the value has, and you have to perform calculations to convert the summation into a percentage to know the effect as documented in the VMware Knowledge Base. In this case, for a real-time summation, the data point is actually a 20 second summation of ready time accumulation.
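To illustrate why the summation level matters, here is a minimal Python sketch of the conversion. The function name and the dictionary are made up for this example, and the interval lengths are the default chart intervals as I recall them from the VMware Knowledge Base article, so treat them as assumptions and confirm them against the statistics settings in your own Virtual Center.

```python
# Chart interval lengths in seconds for each summation level.
# ASSUMPTION: these are the defaults as recalled from the VMware KB article
# on converting CPU summation to CPU % ready; verify them in your environment.
CHART_INTERVAL_SECONDS = {
    "realtime": 20,
    "past_day": 300,
    "past_week": 1800,
    "past_month": 7200,
    "past_year": 86400,
}


def ready_percent(summation_ms: float, level: str) -> float:
    """Convert a CPU Ready summation (milliseconds) to a percent of the interval."""
    interval_ms = CHART_INTERVAL_SECONDS[level] * 1000
    return summation_ms / interval_ms * 100


if __name__ == "__main__":
    # The same raw data point means very different things at each level.
    for level in CHART_INTERVAL_SECONDS:
        print(f"{level:>10}: {ready_percent(1000, level):.4f}%")
```

With these intervals, a 1,000 ms data point works out to 5% on the real-time chart but only about 0.33% on the past-day chart, which is why the raw number can’t be read without knowing which level it came from.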
At one point VMware had a recommendation that anything over 5% ready time per vCPU was something to monitor. In my experience for a SMP SQL VM, anything over 5% per vCPU is typically a warning level and anything over 10% per vCPU is critical. The reason this specifically says per vCPU is that each vCPU contributes 100% to the VM’s scheduling total, so a 4 vCPU VM would have a scheduling total of 400%. A 10% CPU Ready on a 4 vCPU VM only equates to 2.5% per vCPU. If this isn’t already complex enough, it gets worse.
This makes providing a general recommendation impossible for this counter, because it depends on a lot of different factors. For example, if the VM has had a CPU Limit placed on it, whenever the VM exceeds its allocated limit it will accumulate CPU Ready time while it waits to be allowed to execute again. If the CPU Limit is being enforced under business SLAs or a chargeback system, the VM could easily have high CPU Ready values that fit what is required for the configuration. Using the formula from the KB article to convert a summation value to percent, if we round the average of 81.767 down to 80 for simple math, this results in:
(80 / (20s * 1000)) * 100 = 0.4% CPU ready
That is four tenths of a percent CPU Ready time, which is not going to negatively impact the performance of the VM. The example VM shown above also has 8 vCPUs allocated to it, so after taking this into account, it really only has 0.05% per vCPU, well below the older recommended value.
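The arithmetic above can be sketched in a few lines of Python. The function names are hypothetical, and the 5% warning and 10% critical defaults are only the per-vCPU rules of thumb mentioned earlier in this post, not hard limits. As discussed, a VM with a CPU Limit can legitimately sit above them.

```python
def ready_percent(summation_ms: float, interval_s: float = 20) -> float:
    """Convert a CPU Ready summation in ms to a percent of the chart interval.

    The default of 20 seconds matches the real-time chart used in this post.
    """
    return summation_ms / (interval_s * 1000) * 100


def per_vcpu_ready(total_percent: float, vcpus: int) -> float:
    """Spread the VM-level CPU Ready percent across its vCPUs."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return total_percent / vcpus


def classify(per_vcpu_percent: float,
             warning: float = 5.0,
             critical: float = 10.0) -> str:
    """Apply rule-of-thumb thresholds (assumed defaults, adjust as needed)."""
    if per_vcpu_percent > critical:
        return "critical"
    if per_vcpu_percent > warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    total = ready_percent(80)            # average of 81.767 rounded to 80
    per_vcpu = per_vcpu_ready(total, 8)  # the example VM has 8 vCPUs
    print(f"{total:.2f}% total, {per_vcpu:.2f}% per vCPU: {classify(per_vcpu)}")
```

For the example VM this reproduces the 0.4% total and 0.05% per vCPU figures, both of which land comfortably in the "ok" range.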
What scenarios cause high CPU Ready times?
While there are a number of scenarios where high CPU Ready times can occur, there are generally two common scenarios that I see when I am consulting. The most common reason tends to be host over subscription, where the ratio of allocated vCPUs to physical CPUs (pCPUs) is too high. While ESX 5 supports a maximum of 25 vCPUs per physical CPU, this is definitely a case where just because you can doesn’t mean you should. As always, your mileage may vary based on your specific VM workloads, but typically I start to see problems when a host is in the range of 2-2.5X over subscribed for server workloads.
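As a rough illustration, the over subscription check is just a ratio of total allocated vCPUs to physical CPUs on the host. The sketch below uses hypothetical function names and assumes that "2-2.5X" refers to the vCPU:pCPU ratio. The 2.0 default is only the low end of that range, a starting point for a closer look rather than a limit. The sample numbers come from the worst case described at the end of this post: an 8 vCPU VM plus 14 other vCPUs on a host with 8 physical CPUs.

```python
def vcpu_to_pcpu_ratio(total_vcpus: int, physical_cpus: int) -> float:
    """Return the host's vCPU:pCPU ratio."""
    if physical_cpus <= 0:
        raise ValueError("physical_cpus must be positive")
    return total_vcpus / physical_cpus


def needs_review(total_vcpus: int, physical_cpus: int,
                 threshold: float = 2.0) -> bool:
    """Flag hosts at or above a rule-of-thumb ratio (assumed default of 2.0)."""
    return vcpu_to_pcpu_ratio(total_vcpus, physical_cpus) >= threshold


if __name__ == "__main__":
    total_vcpus = 8 + 14   # the large SQL Server VM plus the other VMs
    physical_cpus = 8
    ratio = vcpu_to_pcpu_ratio(total_vcpus, physical_cpus)
    print(f"{ratio:.2f}:1 vCPU:pCPU, review: {needs_review(total_vcpus, physical_cpus)}")
```

A flagged host is not automatically a problem. The ratio only tells you where to look, and the CPU Ready percentages per vCPU are what confirm whether the VMs are actually waiting.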
The second common scenario that I see where CPU Ready times are high is when a larger SMP VM for SQL Server, for example one with 4-8 vCPUs, is running on a host that has a lot of smaller VMs with 1-2 vCPUs for application servers. Depending on the number of physical processors and the total number of vCPUs allocated on the host, the larger resource allocation for the SQL Server VM results in it having to wait longer for the hypervisor to preempt the necessary physical CPUs to schedule/co-schedule the workload. Often in cases where this occurs, after asking some questions I find that the number of vCPUs for the SQL Server was increased from four to eight due to performance problems for the VM. Unfortunately, if CPU Ready time was the original problem, increasing the vCPUs doesn’t improve performance; it generally makes things worse.
What do I do if this is actually a problem?
If you have gone through the information and you can see that CPU Ready is really a problem for your VMs, there are a couple of different things that can be done. The correct one depends on your virtual infrastructure. If the problem is purely host over subscription in terms of the vCPU to pCPU ratio, start by evaluating whether each VM really needs its configured number of vCPUs, and reduce any that don’t to lower the ratio. If this can’t be done, the only real answer is to add additional hosts to allow the load to be balanced better and reduce the over subscription rates. If the problem is specific to the larger SMP VMs for applications like SQL Server, evaluate whether you can consolidate the larger VMs onto one or more hosts and move the smaller VMs to the other hosts to separate the VMs based on their sizes. This has worked well for a number of clients that I have worked with where they truly needed eight or more vCPUs for their workload.
Understanding the data that you are looking at and what it actually means is critical to making the right decisions about what is happening in a virtualized environment. CPU Ready time specifically requires a good understanding of what the value is showing and how it relates to the configuration of the VM, the other VMs on the host, and the physical host resources. If you are looking at summation data for the CPU Ready time, converting it to a CPU Ready percent value is what gives the data the proper meaning for understanding whether or not it is actually a problem. However, keep in mind that other configuration options like CPU Limits can affect the accumulated CPU Ready time and must be checked as well.
Whenever I am performing a health check of a SQL Server VM on VMware, I make sure that I get screenshots of the CPU Ready information from Virtual Center for each of the summation levels available so that I can determine whether or not it is affecting the performance of the VM. I am always careful to calculate what the percentage value actually works out to using the correct formula, and then to review the rest of the VM configuration before drawing any conclusions.
In the worst case I’ve seen, for one client the CPU Ready time was roughly 63% per vCPU, and you could visibly see the VM freeze while moving the mouse in an RDP session. Reviewing the configuration showed that the VM had 8 vCPUs on a host with 8 physical CPUs that was also running 10 other VMs with a total of 14 additional vCPUs. Moving that VM back down to 2 vCPUs was instant relief to their biggest bottleneck, and then we started talking about hardware changes to fit their increased virtualization usage.
If you’d like expert assistance with implementing, configuring, troubleshooting, or understanding SQL Server on VMware, we have a number of different services to fit your needs.