A few months ago while I was teaching wait statistics, I was asked whether there’s any expected differences with waits stats when SQL Server is running in a virtual machine.
My answer was yes – there’s a possibility of seeing longer wait times if something prevents the VM from running, as the wait times are based on the __rdtsc counter difference (essentially the processor clock tick count) between the wait starting and ending.
In VMware or Hyper-V, if a thread inside of SQLOS is waiting for a resource, and the VM has to wait to be scheduled to execute by the hypervisor due to the host being oversubscribed with vCPUs based on the hardware pCPUs, then the actual resource wait time noted in SQL Server will include that time that the VM was unable to run, and so the wait time will appear to be longer than it would have been had the VM not been delayed.
It’s an interesting discussion on whether this is problematic or not, but my view is that it could lead to someone chasing a SQL Server performance problem that’s actually a VM performance problem. Note: this isn’t a problem with the hypervisor, this is because of a misconfiguration of the virtual environment.
Anyway, after the class I got to thinking about thread scheduling in general on a VM that is periodically delayed from running and whether it could cause any other interesting effects around wait statistics.
Specifically, I was concerned about SOS_SCHEDULER_YIELD waits. This is a special wait type that occurs when a thread is able to run for 4ms of CPU time (called the thread quantum) without needing to get suspended waiting for an unavailable resource. In a nutshell, a thread must call into the SQLOS layer every so often to see whether it has exhausted its thread quantum, and if so it must voluntarily yield the processor. When that happens, a context switch occurs, and so a wait type must be registered: SOS_SCHEDULER_YIELD. A deeper explanation of this wait type is in my waits library here.
My theory was this: if a VM is prevented from running for a few milliseconds or more, that could mean that a thread that’s executing might exhaust its thread quantum without actually getting 4ms of CPU time, and so yield the processor causing an SOS_SCHEDULER_YIELD wait to be registered. If this happened a lot, it could produce a set of wait statistics for a virtualized workload that appears to have lots of SOS_SCHEDULER_YIELDs, when in fact it’s actually a VM performance problem and the SOS_SCHEDULER_YIELD waits are really ‘fake’.
I discussed this with my good friend Bob Ward from the SQL Product Group and after some internal discussions, they concurred that it’s a possibility because the thread quantum exhaustion time is calculated using the __rdtsc intrinsic when the thread starts executing, so any delay in the VM running could produce the effect I proposed.
Given that I’m a virtual machine neophyte, I asked Jonathan to run some tests inside of our VMware lab environment to see if he could show the issue happening. He ran a known workload that we use in our Immersion Events to demonstrate the performance impact of host oversubscription, causing a VM to be delayed, and lo and behold, he saw a substantially elevated level of SOS_SCHEDULER_YIELD waits (around 20x more) for the workload, compared to running the same workload on the same VM without any delays.
These same tests were repeated in our Hyper-V lab environment that is identical in hardware and VM configuration to the VMware environment and similar levels of elevated SOS_SCHEDULER_YIELD waits were also seen, so the issue is definitely not specific to any given hypervisor or virtual platform, it’s purely related to the host being oversubscribed for the workloads being run and the SQL Server VM having to wait for CPU resources to continue execution.
I’m deliberately not presenting Jonathan’s test results here because I’m not qualified to explain VMware esxtop output or Hyper-V performance counter values and how they correlate to the SOS_SCHEDULER_YIELD numbers to show the problem occurring. Jonathan will do a follow-up post in the next week or two that explains the results from a virtualization perspective.
However, with a simple set of tests we were able to show that with a VM that gets delayed from running, a SQL Server workload can show a much higher level of SOS_SCHEDULER_YIELD waits because of the use of the __rdtsc intrinsic to calculate thread quantum exhaustion times.
This is really interesting because this is a VM performance issue *causing* a wait type to appear, not just causing waits to be longer.
You should definitely consider this phenomenon if you’re investigating a high number of SOS_SCHEDULER_YIELD waits, a workload performance problem, and your workload is running in a VM. In the next post on this, which I’ll link to here when it’s published, Jonathan will explain how to correlate these waits with signs of a VM performance problem.
Hope this helps!