CPU Ready Impact on SOS_SCHEDULER_YIELD

At the end of May 2017, Paul and I had a discussion about SOS_SCHEDULER_YIELD waits and whether or not they could become skewed in SQL Server as a result of hypervisor scheduling issues for a VM running SQL Server.  I did some initial testing of this in my VM lab using both Hyper-V and VMware and was able to create the exact scenario that Paul was asking about, which he explains in this blog post.  I’ve recently been working on reproducing another VM scheduling based issue that shows up in SQL Server, where again the behaviors being observed are misleading as a result of the hypervisor being oversubscribed, only this time it was not for the SQL Server VM, it was instead for the application servers, and while I was building a repro of that in my lab, I decided to take some time and rerun the SOS_SCHEDULER_YIELD tests and write a blog post to show the findings in detail.

The test environment that I used for this is a portable lab I’ve used for demos for VM content over the last eight years teaching our Immersion Events. The ESX host has 4 cores and 8GB RAM and hosts three virtual machines, a 4vCPU SQL Server with 4GB RAM, and two 2vCPU Windows Server VM’s with 2GB RAM that are used strictly to run Geekbench to produce load.  Within SQL Server, I a reproducible workload that drives parallelism and is repeatable consistently, that I have also used for years in teaching classes.

For the tests, I first ran a baseline where the SQL Server VM is the only machine executing any tasks/workload at all, but the other VMs are powered on and just sitting there on the hose.  This establishes a base metric in the host and had 1% RDY time average during the workload execution, after which I collected the wait stats for SOS_SCHEDULER_YIELD.

clip_image002

Then I reran the tests but ran four copies of GeekBench on one of the application VM’s to drive the CPU usage for that VM and keep it having to be scheduled by the hypervisor, and then reran the SQL Server workload.  This put the SQL Server VM at 5% RDY time in the hypervisor during the tests, which is my low watermark for where you would expect performance issues to start showing.  When the SQL workload completed, I recorded the waits again.

clip_image004

Then I repeated the test again, but with both of the application VM’s running four copies of Geekbench.  The SQL Server VM had an average of 28% RDY time, and when the workload completed the waits were recorded a final time.

clip_image006

As you can see below there is a direct increase in the wait occurrence with the increase in RDY%.

TestAVG RDY%wait_typewaiting_tasks_countwait_time_msmax_wait_time_mssignal_wait_time_msDuration
2 APP VM28.4SOS_SCHEDULER_YIELD1045522893124228033:59
1 APP VM5.1SOS_SCHEDULER_YIELD251436186236081:57
Baseline1.2SOS_SCHEDULER_YIELD1392974129631:32

High CPU Ready time is something that should be consistently monitored for with SQL Server VM’s where low latency response times are required.  It also should be monitored for application VMs where latency is important as well.  One thing to be aware of with any of the wait types on a virtualized SQL Server, not just the SOS_SCHEDULER_YIELD wait, is that the time spent waiting is going to also include time the VM has to wait to be scheduled by the hypervisor as well.  The SOS_SCHEDULER_YIELD wait type is unique in that the number of occurrences of the wait type increases because of the time shifts that happen causing quantum expiration internally for a thread inside of SQLOS forcing it to yield, even if it actually didn’t get to make progress due to the hypervisor not scheduling the VM.

Downgrading SQL Server Editions

At some point in your career working with SQL Server, you will run into a situation where the wrong edition of SQL Server has been installed on a server and will need to change the edition for licensing reasons.  Whether it is Enterprise Edition where Standard Edition should have been installed, Enterprise Edition where Developer Edition should have been used, or my favorite, Evaluation Edition where the 180 day trial has expired and Enterprise Edition isn’t going to be used, the only route available for downgrading the edition is to uninstall and reinstall SQL Server entirely.  SQL Server Setup makes upgrading editions a piece of cake with SKUUPGRADE as a command line option for going from Standard/Developer/Evaluation to Enterprise, but anything else requires a full uninstall and reinstall to change the SKU/Edition and then restore all of the system and user databases to the new instance, which typically means a lot of work.  I hate having to restore system databases and avoid having to do it if possible, so here is how I do this process and minimize the work required:

No matter what you are going to have to do an uninstall and reinstall of the SQL Server instance to downgrade the SKU.  However, you can save yourself some time and the headache of trying to restore the system databases if you are careful about what you do.  I have done a plenty of SKU downgrades in the past and the easiest way to do it, and I am not saying this is the Microsoft supported way but that it works if done correctly, is to:

  1. Take a good backup of all of your databases (system and user).  
  2. Run SELECT @@VERSION and note the specific build number of SQL Server that you are currently on.
  3. Shut down your existing instance of SQL Server.  
  4. Copy the master, model, and msdb database files (both mdf and ldf), don’t move them copy them, from the current location to a new folder that you mark as readonly. 
  5. Uninstall SQL Server from the system.
  6. Reboot the server.
  7. Install SQL Server Standard Edition.
  8. Apply the necessary Service Pack and/or Cumulative Updates to bring the instance up to your previous build number.
  9. Shutdown SQL Server.
  10. Copy the master, model, and msdb database files (both mdf and ldf) from the folder you saved them in to the correct location for the new install and remove the readonly flag from the files, and change the file ACL’s so that the SQL Service account has Full Control over the files.
  11. Startup SQL Server and if you did it correctly it will startup and be exactly where you were before you made any changes, with all of your user databases online and you should be ready to let applications connect and resume operations.

If you screw something up in the above, you still have your backups and you can run setup to rebuild the system databases and then go about following the Microsoft supported path for restoring the system databases and then user databases into the system to bring it online.  Essentially the file copy is no different that what would occur through attach/detach you are just doing it with system databases which is not explicitly supported, but it does work.  The key is to have your backups from before you do anything so you have the supported route available if you encounter an issue.  The only issue I have ever had doing this set of steps is that I didn’t set the file ACL’s correctly and the database engine threw Access Denied errors and failed to start until I fixed the ACL’s correctly.  This can save you many hours of frustration and downtime trying to restore everything since the database files are already there and it is just some small copy operations to put them where you need them to be.

ALTER DATABASE failed. The default collation of database ‘%.*ls’ cannot be set to %.*ls.

Last week I was working with a client on upgrading one of their systems from SQL Server 2000 to SQL Server 2012, while also performing a collation change of the database and all of the table columns from Latin1_General_BIN to SQL_Latin1_General_CP1_CI_AS.  What started out as a straight forward upgrade, this actually became quite a challenge.  After upgrading to SQL Server 2008R2 since SQL Server doesn’t support direct upgrades from SQL Server 2000 to SQL Server 2012, I found metadata corruption.  We’ve seen and dealt with this before, so back to SQL Server 2000 to fix the orphaned entries, and then another upgrade attempt to SQL Server 2008R2.

At this point I had a corruption free database and started running the scripts I had generated to migrate from Latin1_General_BIN to SQL_Latin1_General_CP1_CI_AS.  When I got to the point of changing the database default collation I was dismayed to get the following error back from SQL Server:

Msg 1505, Level 16, State 1, Line 1
The CREATE UNIQUE INDEX statement terminated because a duplicate key was found for the object name ‘dbo.sysschobjs’ and the index name ‘nc1’. The duplicate key value is (0, 1, person).
Msg 5072, Level 16, State 1, Line 1
ALTER DATABASE failed. The default collation of database ‘TestCollationChange’ cannot be set to SQL_Latin1_General_CP1_CI_AS.

Thinking about the previous metadata corruption, I was certain that there was something wrong with the database still, but I couldn’t find anything with CHECKDB or CHECKCATALOG.  It turns out, there is nothing wrong with the database, there is something wrong with my expectations and assumptions.  To demonstrate this, consider the following example:

CREATE DATABASE [TestCollationChange]
ON  PRIMARY
( NAME = N'TestCollationChange', FILENAME = N'C:\SQLData\TestCollationChange.mdf')
LOG ON
( NAME = N'TestCollationChange_log', FILENAME = N'C:\SQLData\TestCollationChange_log.ldf')
COLLATE Latin1_General_BIN;
GO
USE [TestCollationChange];
GO
CREATE TABLE dbo.Person
(    RowID int NOT NULL IDENTITY (1, 1),
FirstName varchar(30) NOT NULL,
LastName varchar(30) NOT NULL);
GO
ALTER TABLE dbo.Person ADD CONSTRAINT PK_Person PRIMARY KEY CLUSTERED (RowID);
GO
CREATE TABLE dbo.person
(    RowID int NOT NULL IDENTITY (1, 1),
FirstName varchar(30) NOT NULL,
LastName varchar(30) NOT NULL);
GO
ALTER TABLE dbo.person ADD CONSTRAINT PK_person PRIMARY KEY CLUSTERED (RowID);
GO

Under the Latin1_General_BIN collation, this is a completely valid schema because case sensitivity is applied.  However, when you try and switch to a case insensitive collation with:

ALTER DATABASE [TestCollationChange] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
ALTER DATABASE [TestCollationChange] COLLATE SQL_Latin1_General_CP1_CI_AS;

these immediately become duplicate objects.  So where do we go from here?  First, the error message tells us that the object name is ‘person’, so you might consider doing a query against sys.objects:

SELECT * FROM sys.objects where name = 'person';

The only problem is that this will return 1 row, remember we are still in Latin1_General_BIN so case sensitivity is being applied.  To get around this, we need to change our query to collate the name column using our new collation:

SELECT * FROM sys.objects where name COLLATE SQL_Latin1_General_CP1_CI_AS = 'person';

This will show us both of the objects and it becomes immediately clear why we have a duplication issue, different cases.In the case of the actual database I was working on, the duplicate objects were two stored procedures (actually four if you think about), and the duplicates had object definitions similar to the following:

CREATE PROCEDURE [TestProcedure] AS
BEGIN
SELECT …….
-- Lots more logic, etc
END
GO

CREATE PROCEDURE [TESTProcedure] AS RETURN;
GO

I have no idea what the intent of the second procedure was, but after consulting with the client, it was determined that these duplicate stubs could be dropped, which then allowed the database collation change to SQL_Latin1_General_CP1_CI_AS. This might not be a viable solution if the application actually relies on the case sensitive nature of the naming convention, though I wouldn’t personally ever build a database with duplicate object names regardless of the collation.