SQLskills SQL101: Practicing disaster recovery

As Kimberly blogged about earlier this year, SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.

At the weekend Kimberly and I attended our first-ever SQLSaturday (in Dublin), where, as well as a workshop on wait stats, I presented a session on Advanced Data Recovery Techniques. The contents of that session are way too advanced for a 101-level post (you can watch a video of it from the PASS Summit 2014 here if you’re interested), but one of the things I stressed at the start was that practicing disaster recovery techniques is crucial for success when a disaster happens for real.

It doesn’t matter how experienced you are with SQL Server: if you’re responsible for a SQL Server instance, you have to know the basics of how to recover when disaster strikes (I touched on that earlier in the series in the post SQLskills SQL101: Dealing with SQL Server corruption), and you have to have practiced.

In this post I want to pose a short (and by no means exhaustive) series of questions about how much practicing you have (or haven’t) done, and make you think about an honest answer to each one.

Part 1: Information

  • Do you know where the latest copy of the disaster recovery handbook/run book is? Does everyone else know? (Here’s an example template.)
  • Do you know where the scripts are for automating restores of your backups? (A starting point is sketched after this list.)
  • Do you know where the installation media for Windows and SQL Server are kept in your environment?
  • Do you know where the Windows and SQL Server product keys are?
  • Do you know how you’ll be able to get new servers if your data center is destroyed?
  • And do you know where they will be installed? What about network? Power? HVAC?
  • Do you know who to call when/if you get stuck during the disaster recovery process?
  • Do you know the priority order for restoring databases/instances in your environment?
  • Do you know where the various SQL Server passwords and encryption keys are stored?
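
As a starting point for that kind of restore automation, here’s a minimal sketch (not a production script) that uses the standard msdb backup history tables to list the most recent full backup of each database and where the backup file lives:

  -- Most recent full backup per database, from the msdb backup history tables.
  WITH LatestFull AS
  (
      SELECT
          bs.database_name,
          bs.backup_finish_date,
          bmf.physical_device_name,
          ROW_NUMBER() OVER (PARTITION BY bs.database_name
                             ORDER BY bs.backup_finish_date DESC) AS rn
      FROM msdb.dbo.backupset AS bs
      JOIN msdb.dbo.backupmediafamily AS bmf
          ON bs.media_set_id = bmf.media_set_id
      WHERE bs.type = 'D'   -- 'D' = full database backup
  )
  SELECT database_name, backup_finish_date, physical_device_name
  FROM LatestFull
  WHERE rn = 1
  ORDER BY database_name;

Being able to find this information quickly is no substitute for having the real, tested restore scripts in your run book, but it’s a useful check that backups are actually going where you think they are.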

Part 2: Techniques

  • When was the last time you performed a full restore sequence, including tail-of-the-log backups? (See the sketch after this list.)
  • When was the last time you performed a failover to your secondary servers/data center?
  • When was the last time you performed a bare-metal install?
  • When was the last time you rebuilt or restored master on a server?
  • When was the last time you practiced a recovery as if your main server was completely dead? (And did you successfully recover?)
  • When was the last time you practiced a recovery as if your main SAN was completely dead? (And did you successfully recover?)
  • When was the last time you practiced a recovery as if you didn’t have onsite backups and the SAN was dead? (And did you successfully recover?)
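
For the first of those questions, here’s a minimal sketch of what a practice restore sequence with a tail-of-the-log backup looks like. The database name and file paths are hypothetical, and a real sequence would include whatever differential and log backups your backup strategy produces:

  -- 1) Back up the tail of the log (NO_TRUNCATE allows this even if the
  --    data files are damaged) so no committed work is lost.
  BACKUP LOG [SalesDB]
  TO DISK = N'D:\Backups\SalesDB_tail.trn'
  WITH NO_TRUNCATE, INIT;
  GO

  -- 2) Restore the most recent full backup, leaving the database restoring.
  RESTORE DATABASE [SalesDB]
  FROM DISK = N'D:\Backups\SalesDB_full.bak'
  WITH NORECOVERY, REPLACE;
  GO

  -- 3) Restore subsequent log backups in order, still with NORECOVERY.
  RESTORE LOG [SalesDB]
  FROM DISK = N'D:\Backups\SalesDB_log1.trn'
  WITH NORECOVERY;
  GO

  -- 4) Finish with the tail-of-the-log backup and bring the database online.
  RESTORE LOG [SalesDB]
  FROM DISK = N'D:\Backups\SalesDB_tail.trn'
  WITH RECOVERY;
  GO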

Summary

Think through the answers to the questions above and consider whether you’re comfortable with your responses. Now think about whether you’d be comfortable if someone responsible for some of your data (e.g. your bank, 401(k)/retirement account holder, doctor’s office, favorite airline, or credit-card company) gave those same answers about their disaster-recovery preparedness. My feeling is that you should be able to answer ‘yes’ to all the Part 1 questions, and ‘within the last 3 months’ to all the Part 2 questions, to feel fully comfortable that you’re practicing enough.

Increased SOS_SCHEDULER_YIELD waits on virtual machines

A few months ago while I was teaching wait statistics, I was asked whether there are any expected differences in wait stats when SQL Server is running in a virtual machine.

My answer was yes – there’s a possibility of seeing longer wait times if something prevents the VM from running, as the wait times are based on the __rdtsc counter difference (essentially the processor clock tick count) between the wait starting and ending.

In VMware or Hyper-V, if a thread inside SQLOS is waiting for a resource, and the VM has to wait to be scheduled to execute by the hypervisor because the host is oversubscribed (more vCPUs allocated across VMs than the hardware has pCPUs), then the resource wait time recorded in SQL Server will include the time that the VM was unable to run. The wait will therefore appear longer than it would have been had the VM not been delayed.

It’s an interesting discussion whether this is problematic or not, but my view is that it could lead to someone chasing a SQL Server performance problem that’s actually a VM performance problem. Note that this isn’t a problem with the hypervisor; it’s caused by a misconfiguration of the virtual environment.

Anyway, after the class I got to thinking about thread scheduling in general on a VM that is periodically delayed from running and whether it could cause any other interesting effects around wait statistics.

Specifically, I was concerned about SOS_SCHEDULER_YIELD waits. This is a special wait type that occurs when a thread is able to run for 4ms of CPU time (called the thread quantum) without needing to get suspended waiting for an unavailable resource. In a nutshell, a thread must call into the SQLOS layer every so often to see whether it has exhausted its thread quantum, and if so it must voluntarily yield the processor. When that happens, a context switch occurs, and so a wait type must be registered: SOS_SCHEDULER_YIELD. A deeper explanation of this wait type is in my waits library here.
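
If you want to see how much of this wait type your server has accumulated, a simple query against sys.dm_os_wait_stats is enough (the numbers are cumulative since the last restart, or since the wait stats were last cleared):

  -- Cumulative SOS_SCHEDULER_YIELD waits for this instance.
  SELECT
      wait_type,
      waiting_tasks_count,
      wait_time_ms,
      signal_wait_time_ms
  FROM sys.dm_os_wait_stats
  WHERE wait_type = N'SOS_SCHEDULER_YIELD';

For this wait type the wait time is essentially all signal wait time, because the yielding thread goes straight to the bottom of the runnable queue rather than waiting for a resource.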

My theory was this: if a VM is prevented from running for a few milliseconds or more, that could mean that a thread that’s executing might exhaust its thread quantum without actually getting 4ms of CPU time, and so yield the processor causing an SOS_SCHEDULER_YIELD wait to be registered. If this happened a lot, it could produce a set of wait statistics for a virtualized workload that appears to have lots of SOS_SCHEDULER_YIELDs, when in fact it’s actually a VM performance problem and the SOS_SCHEDULER_YIELD waits are really ‘fake’.

I discussed this with my good friend Bob Ward from the SQL Product Group, and after some internal discussions they concurred that it’s a possibility: the thread quantum exhaustion time is calculated using the __rdtsc intrinsic when the thread starts executing, so any delay in the VM running could produce the effect I proposed.

Given that I’m a virtual machine neophyte, I asked Jonathan to run some tests inside of our VMware lab environment to see if he could show the issue happening. He ran a known workload that we use in our Immersion Events to demonstrate the performance impact of host oversubscription, causing a VM to be delayed, and lo and behold, he saw a substantially elevated level of SOS_SCHEDULER_YIELD waits (around 20x more) for the workload, compared to running the same workload on the same VM without any delays.
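
If you want to try something similar yourself, the simplest approach (a sketch, not Jonathan’s actual test harness) is to clear the wait statistics on a test instance before each run and capture the top waits afterwards:

  -- Clear cumulative wait statistics before starting the test workload.
  -- (Only do this on a test instance - it resets the counters for everyone.)
  DBCC SQLPERF (N'sys.dm_os_wait_stats', CLEAR);
  GO

  -- ... run the test workload ...

  -- Capture the top waits accumulated during the run.
  SELECT TOP (10)
      wait_type,
      waiting_tasks_count,
      wait_time_ms,
      signal_wait_time_ms
  FROM sys.dm_os_wait_stats
  ORDER BY wait_time_ms DESC;

Comparing the SOS_SCHEDULER_YIELD numbers between a run with the host oversubscribed and a run without is what shows the effect.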

These same tests were repeated in our Hyper-V lab environment, which is identical in hardware and VM configuration to the VMware environment, and similar levels of elevated SOS_SCHEDULER_YIELD waits were seen. So the issue is definitely not specific to any given hypervisor or virtual platform; it’s purely related to the host being oversubscribed for the workloads being run, with the SQL Server VM having to wait for CPU resources to continue execution.

I’m deliberately not presenting Jonathan’s test results here because I’m not qualified to explain VMware esxtop output or Hyper-V performance counter values and how they correlate to the SOS_SCHEDULER_YIELD numbers to show the problem occurring. Jonathan will do a follow-up post in the next week or two that explains the results from a virtualization perspective.

However, with a simple set of tests we were able to show that when a VM is delayed from running, a SQL Server workload can show a much higher level of SOS_SCHEDULER_YIELD waits, because of the use of the __rdtsc intrinsic to calculate thread quantum exhaustion times.

This is really interesting because this is a VM performance issue *causing* a wait type to appear, not just causing waits to be longer.

You should definitely consider this phenomenon if you’re investigating a high number of SOS_SCHEDULER_YIELD waits, a workload performance problem, and your workload is running in a VM. Jonathan explains how to correlate these waits with signs of a VM performance problem in this post.

Hope this helps!