Virtualization makes a sysadmin's life easier by abstracting away the hardware. The hypervisor (like VMware ESX, Microsoft Hyper-V, or Citrix Xen) sits as a layer between the physical server and the virtual server, offering a set of general hardware services like network cards, video cards, SCSI adapters, and so on. In Device Manager, these drivers look like the real thing, but like a child watching a magician, it's hard for us to tell the real thing from the fakes:
Note that I said this makes a sysadmin's life easier.
I didn't say this makes a database administrator's life easier.
In fact, it's quite the opposite. For the last decade, each new version of SQL Server has gotten closer to the metal, integrating closer with the underlying hardware in order to gain faster performance. No other Microsoft product offers settings like affinity masking, MAXDOP, and its own CPU scheduling, and DBAs have grown accustomed to tweaking these settings in order to wring every last bit of performance out of their servers.
To tune performance for virtual SQL Servers, we have to know more about this mysterious hypervisor layer that purports to abstract our hardware away. We need to learn about this gauzy fabric, understand what it's doing right, see what it struggles with, and adapt our tuning techniques to work with the hypervisor instead of against it. In the next few posts, I'll show you some of the most important concepts to help you get started, and I'll point you to resources to help learn more. Today's post starts with how to configure various SQL Server CPU options under virtual environments.
How to Configure Max Degrees of Parallelism (Maxdop)
When a query executes, the SQL Server engine's Query Optimizer decides whether any tasks in that plan would run faster if they were broken out into multiple tasks. It can break those tasks out into several parallel tasks that run simultaneously to take advantage of multiple idle cores. The sp_configure setting for "Max Degree of Parallelism" (BOL) controls the maximum number of cores that will get used by default. This isn't just a server-level setting either – ambitious developers have been known to code their T-SQL statements with a MAXDOP hint. Parallelism is discussed in chapter 5, CPU and Query Processing, of our book, Professional SQL Server 2008 Internals and Troubleshooting.
The Internet is chock full of advice on how to set MAXDOP. Got an OLTP server? Set MAXDOP to 1 to avoid problems with runaway queries. Got a NUMA-equipped server? Set it to the max number of cores in a particular NUMA node. Got hyperthreading? Well, you should turn it off, don'tcha know, but –
See, stop right there. Even though the Internet is just a jumbled mass of electrons, it still respects Newton's First Law of Motion: an object at rest tends to stay at rest. Once a piece of advice is written on the Internet, it just sits there. It doesn't take into account the blazing fast rate of change of hardware and software. Most of the advice out there was written long before virtualization even caught on in the datacenter, before Intel's Nehalem processors unveiled a new generation of hyperthreading, and before NUMA servers were common. This advice is just wrong, but since advice doesn't come with an expiration date, it's up to you to ask questions about what you're reading – including this very post – is still relevant right now.
The Problem with CPU Settings in Virtualization
First, before you change any configuration setting, you need to know with certainty that you're solving a problem. Blindly changing configuration options is a surefire recipe for unpredictable and unreliable performance. You also need to know that the best way of solving that problem lies in changing SQL Server – and in virtual environments, that's often not the case.
Before tweaking any SQL Server settings, start by asking whether the virtual server itself has the right number of virtual CPU cores. The sysadmin can build virtual servers with any number of virtual cores, and those numbers don't need to have any relation to the number of physical cores in the underlying hardware.
Even when you know the exact host hardware configuration, your knowledge is only a snapshot in time. With tools like VMware vMotion, Microsoft Live Migration, and Citrix XenMotion, your sysadmin can move your virtual server from one host to another in real time without you knowing. If you haven't used one of these technologies yourself, it can sound like smoke and mirrors, but it really does work when set up correctly. You don't have to stop the SQL Server services or get everyone out of the application. Some TCP/IP packets can get dropped in the process, and it's not wise to perform a vMotion during extremely heavy loads, but I'll address that concept in a later post.
The important thing to know is that the new host hardware may have:
- Faster or slower CPUs
- More or less CPUs or cores
- Hyperthreading turned on or off
- More or less virtual servers sharing those same cores
This move might even happen when no one is looking, because technologies like VMware Distributed Resource Scheduling will automatically move guests around to different hosts based on performance needs. Whether or not these tools should be used with SQL Server is a separate discussion, but the fact that they can be used transparently is enough for us to stop and ask if we're solving the right problem with the right tools when we set MAXDOP or affinity masking. We DBAs like to make our CPU configuration decisions by monitoring CPU % metrics with Perfmon or Task Manager, but there's our next problem.
Task Manager is a dirty, filthy liar – CPU % is absolutely meaningless on virtual servers. If the server is running at 100% CPU, that doesn't mean the server is actually doing any work. It just means the server is using 100% of the CPU cycles available to it, which is determined by other running guests, throttled limits set by the virtualization sysadmin, or hypervisor CPU scheduling issues.
The Real Question: How Many vCores Do We Need?
Rather than making fine-grained CPU settings at the SQL Server or query level, we need to take a step back and look at what the hypervisor is doing to our SQL Server CPU cycles. In my next post, I'll talk about CPU scheduling under the different hypervisors and why you need to start learning about NUMA.