CPU Ready Impact on SOS_SCHEDULER_YIELD

At the end of May 2017, Paul and I had a discussion about SOS_SCHEDULER_YIELD waits and whether or not they could become skewed in SQL Server as a result of hypervisor scheduling issues for a VM running SQL Server.  I did some initial testing of this in my VM lab using both Hyper-V and VMware and was able to create the exact scenario that Paul was asking about, which he explains in this blog post.  I’ve recently been working on reproducing another VM scheduling based issue that shows up in SQL Server, where again the behaviors being observed are misleading as a result of the hypervisor being oversubscribed, only this time it was not for the SQL Server VM, it was instead for the application servers, and while I was building a repro of that in my lab, I decided to take some time and rerun the SOS_SCHEDULER_YIELD tests and write a blog post to show the findings in detail.

The test environment that I used for this is a portable lab I’ve used for demos for VM content over the last eight years teaching our Immersion Events. The ESX host has 4 cores and 8GB RAM and hosts three virtual machines, a 4vCPU SQL Server with 4GB RAM, and two 2vCPU Windows Server VM’s with 2GB RAM that are used strictly to run Geekbench to produce load.  Within SQL Server, I a reproducible workload that drives parallelism and is repeatable consistently, that I have also used for years in teaching classes.

For the tests, I first ran a baseline where the SQL Server VM is the only machine executing any tasks/workload at all, but the other VMs are powered on and just sitting there on the hose.  This establishes a base metric in the host and had 1% RDY time average during the workload execution, after which I collected the wait stats for SOS_SCHEDULER_YIELD.

clip_image002

Then I reran the tests but ran four copies of GeekBench on one of the application VM’s to drive the CPU usage for that VM and keep it having to be scheduled by the hypervisor, and then reran the SQL Server workload.  This put the SQL Server VM at 5% RDY time in the hypervisor during the tests, which is my low watermark for where you would expect performance issues to start showing.  When the SQL workload completed, I recorded the waits again.

clip_image004

Then I repeated the test again, but with both of the application VM’s running four copies of Geekbench.  The SQL Server VM had an average of 28% RDY time, and when the workload completed the waits were recorded a final time.

clip_image006

As you can see below there is a direct increase in the wait occurrence with the increase in RDY%.

TestAVG RDY%wait_typewaiting_tasks_countwait_time_msmax_wait_time_mssignal_wait_time_msDuration
2 APP VM28.4SOS_SCHEDULER_YIELD1045522893124228033:59
1 APP VM5.1SOS_SCHEDULER_YIELD251436186236081:57
Baseline1.2SOS_SCHEDULER_YIELD1392974129631:32

High CPU Ready time is something that should be consistently monitored for with SQL Server VM’s where low latency response times are required.  It also should be monitored for application VMs where latency is important as well.  One thing to be aware of with any of the wait types on a virtualized SQL Server, not just the SOS_SCHEDULER_YIELD wait, is that the time spent waiting is going to also include time the VM has to wait to be scheduled by the hypervisor as well.  The SOS_SCHEDULER_YIELD wait type is unique in that the number of occurrences of the wait type increases because of the time shifts that happen causing quantum expiration internally for a thread inside of SQLOS forcing it to yield, even if it actually didn’t get to make progress due to the hypervisor not scheduling the VM.

Simplifying Availability Group Troubleshooting

The AlwaysOn_health event session in Extended Events is intended to make analyzing problems with Availability Groups possible after they have occurred.  While this event session goes a long way towards making it possible to piece together the puzzle of what went wrong in a lot of situations, it can still be a difficult task.  One of the things I wish Microsoft had included in the AlwaysON_health event session definition is the sqlserver.server_instance_name action for every event, and this is something that I usually recommend clients add to each of their AG servers using a script after I work with them the first time.  Why would this be useful, since if the files come from a specific server we should know the events are for that server right? Well, when we are working with AG configurations with more than two servers and trying to see the big picture of what is happening across the servers, it can be difficult to follow timelines with multiple files from multiple servers open.  It’s not impossible, but it does make things more difficult.

Merging Extended Events Files

When the UI for Extended Events was introduced in SSMS 2012, Microsoft included the ability to merge multiple XEL files into a single view in the UI, which can be really useful during AG problem analysis. All you have to do is grab the AlwaysOn_health*.xel files from each of the replica servers, copy them into a folder and point the UI to the folder using the File > Open > Merge Extended Event Files… menu item and you can see all of the events from all of the servers in a single place.

image

Adding server_instance_name to Events

EXCEPT…. there is no server instance name contained in the events by default so this becomes of limited use without modifying the AlwaysOn_health event session to add the sqlserver.server_instance_name action to all of the events.  This is easily accomplished in the UI using the multi-select options of the Configure tab on the event session properties as shown below, or may also be accomplished using the simple DDL script included at the bottom of this post.

image

ALTER EVENT SESSION [AlwaysOn_health] ON SERVER 
DROP EVENT sqlserver.alwayson_ddl_executed, 
DROP EVENT sqlserver.availability_group_lease_expired, 
DROP EVENT sqlserver.availability_replica_automatic_failover_validation, 
DROP EVENT sqlserver.availability_replica_manager_state_change, 
DROP EVENT sqlserver.availability_replica_state, 
DROP EVENT sqlserver.availability_replica_state_change, 
DROP EVENT sqlserver.error_reported, 
DROP EVENT sqlserver.hadr_db_partner_set_sync_state, 
DROP EVENT sqlserver.lock_redo_blocked;
GO

ALTER EVENT SESSION [AlwaysOn_health] ON SERVER 
ADD EVENT sqlserver.alwayson_ddl_executed(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_group_lease_expired(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_automatic_failover_validation(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_manager_state_change(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_state(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_state_change(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.error_reported(
    ACTION(sqlserver.server_instance_name)
    WHERE ([error_number]=(9691) OR [error_number]=(35204) OR [error_number]=(9693) OR [error_number]=(26024) OR [error_number]=(28047) OR [error_number]=(26023) OR [error_number]=(9692) OR [error_number]=(28034) OR [error_number]=(28036) OR [error_number]=(28048) OR [error_number]=(28080) OR [error_number]=(28091) OR [error_number]=(26022) OR [error_number]=(9642) OR [error_number]=(35201) OR [error_number]=(35202) OR [error_number]=(35206) OR [error_number]=(35207) OR [error_number]=(26069) OR [error_number]=(26070) OR [error_number]>(41047) AND [error_number]<(41056) OR [error_number]=(41142) OR [error_number]=(41144) OR [error_number]=(1480) OR [error_number]=(823) OR [error_number]=(824) OR [error_number]=(829) OR [error_number]=(35264) OR [error_number]=(35265) OR [error_number]=(41188) OR [error_number]=(41189))), 
ADD EVENT sqlserver.hadr_db_partner_set_sync_state(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.lock_redo_blocked(
    ACTION(sqlserver.server_instance_name))
GO

Once the event session is modified, the files can easily be merged together in the UI for all future troubleshooting and each event will have the originating server_instance_name attached to it by the action.

Database Master Keys and Availability Groups

Recently I received an email from a client experimenting with Availability Groups for the first time in a development environment that had run into an issue with adding one database out of fifteen to the Availability Group in the UI.  The database in question had Password required beside it in the UI instead of the Meets prerequisites link all the other databases had. Clicking the Password required link produced a non-informative popup dialog saying “This database is encrypted by database master key, you need to provide valid password when adding it to the availability group.”

image

This was something I ran into a couple of years ago, and I did an Insider Demo video of this UI behavior for our SQLskills Insider Newsletter to demonstrate how the UI changed in SSMS 2016 to support encrypted databases and databases with encrypted content not using TDE as well. This is not caused by TDE, but by a database master key (DMK) that is protecting something like a certificate or an encrypted column in a table. The UI actually has a Password column next to the Status column, though it’s not very intuitive or clear that there are text boxes available for each database that you can click on to provide the password for the database master key.

clip_image002

In the screenshot above, the SSISDB requires a password to be added to the Availability Group, and the password has been provided in the Password column of the UI.  To do this, actually requires a series of steps that aren’t exactly intuitive. 

  1. First, double click in the Password column and it will become a text box that is editable. 
  2. Type the password in.
  3. Click Refresh at the bottom of the screen which will make the check box for the database enabled.
  4. Check the checkbox for the database.
  5. Click Refresh a second time, which will enable the Next button at the bottom.
  6. Click Next to progress to the next screen in the wizard.

This took me 20 minutes to figure out in the back of the room during one of our Immersion Events after SQL Server 2016 released so I could answer a question for someone that wasn’t even related to this UI functionality about the SSISDB catalog and how it behaved in an Availability Group if an failover occurs during package execution.