Some Thoughts on Clustering SQL Server Virtual Machines

On my blog post Virtualizing SQL on VMware Reference List, Oscar Zamora (Blog | Twitter) asked the following question in a comment:

As a virtualized instance has the benefit of "failing over" to another physical box, would you consider clustering a virtualized instance?

The answer to this question is more than I want to write up in a comment, so I decided to blog my response instead.  Since at least ESX 3, VMware has provided a number of high availability features in their enterprise server virtualization product.  A detailed listing of the HA features available in vSphere 4 can be found in the vSphere Availability Guide.
In this post I’ll talk about the most popular ones, which are also the ones that VMware marketing uses to try to convince people that VMware HA solves all of their High Availability needs.

High Availability and DRS Clusters

ESX hosts can be clustered together to protect the guests running inside of the cluster against the hardware failure of a host.  If a host fails, the VMs that were running on that host fail as well, but the cluster detects this and automatically powers the VMs up on other hosts inside of the cluster, bringing them back online and restoring application serviceability.  Other features such as Server vMotion and Storage vMotion allow VMs and their storage to be moved dynamically, so that hardware maintenance and upgrades can happen with little to no downtime for the applications running on the VMs.  In addition, the VMware Tools installed inside of the guests can monitor the guest to detect operating system failures and lockups inside of the VM and automatically restart the VM.
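
To make the restart behavior concrete, here is a toy model in Python.  It is only a sketch: the host and VM names are made up, and the "fewest VMs wins" placement policy is my own simplifying assumption, not how VMware HA actually chooses a target host.

```python
from typing import Dict, List


def restart_after_host_failure(
    placement: Dict[str, List[str]], failed_host: str
) -> Dict[str, List[str]]:
    """Return a new host -> VMs mapping after one host has failed.

    Toy policy (an assumption, not VMware's algorithm): each VM from the
    failed host is restarted on the surviving host that currently runs the
    fewest VMs, with ties broken by host name.
    """
    survivors = {h: list(vms) for h, vms in placement.items() if h != failed_host}
    if not survivors:
        raise RuntimeError("no surviving hosts to restart VMs on")
    for vm in placement.get(failed_host, []):
        target = min(survivors, key=lambda h: (len(survivors[h]), h))
        survivors[target].append(vm)
    return survivors


# Hypothetical three-host cluster; esx01 suffers a hardware failure.
cluster = {
    "esx01": ["sql01", "web01"],
    "esx02": ["app01"],
    "esx03": [],
}
after = restart_after_host_failure(cluster, "esx01")
print(after)
```

The point to notice is that the VMs on the failed host do go down and are restarted from a cold boot elsewhere, so the application sees an outage, even if a short one.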

VMware High Availability: Easily Deliver High Availability for All of Your Virtual Machines
VMware High Availability: Concepts, Implementation, and Best Practices

Fault Tolerance

vSphere 4 introduced a new high availability feature for VM guests called Fault Tolerance.  Fault Tolerance creates a synchronized Secondary virtual machine on another host in the high availability cluster that is kept in lockstep with the Primary VM.  In the event of a host failure, guests that have Fault Tolerance enabled immediately fail over to their Secondary in a manner that is similar to vMotion, preventing application downtime from occurring.  When this occurs, a new Secondary is created on another host inside of the cluster and synchronized with the new Primary, maintaining the fault tolerance of the guest inside of the environment.

VMware vSphere™ 4 Fault Tolerance: Architecture and Performance
VMware® Fault Tolerance Recommendations and Considerations on VMware vSphere™ 4
Protecting Mission-Critical Workloads with VMware Fault Tolerance

What does this all mean to SQL Server?

These are really great features provided by virtualization, but that doesn’t make them the solution to all of your High Availability needs.  SQL Server is often considered and treated by server administrators as just another application server, especially when it comes to virtualization.  However, SQL Server is not just another application, and it provides its own High Availability options, like clustering, that may be more appropriate depending on the requirements of your environment.  There are specific reasons that the above features may not be acceptable HA features for SQL Server.

Fault Tolerance is currently limited to single vCPU guest VMs, so unless your SQL Server VMs are all single vCPU, that’s not going to help you out.  This leaves you with VMware HA, which means a host failure still takes the VM down until it is automatically restarted on another host.  However, what none of the VMware features provides is a way to minimize the downtime associated with planned Windows Updates of the guest VMs or the application of SQL Server Service Packs.  If you have the ability to take periodic planned downtimes of one to two hours for patching, the basic HA features of VMware will probably meet your needs.  This may be an acceptable configuration in your environment, and if it is, I won’t fault you for deciding that it meets your high availability needs.  I have VMs in production that are protected first by good database backups, and then by VMware HA.
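
To put a number on what a one to two hour patching window costs, here is a quick back-of-the-envelope calculation in Python.  The monthly patching cadence is an assumption I am making for illustration, and the calculation counts planned downtime only, ignoring any unplanned outages.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability_percent(windows_per_year: int, hours_per_window: float) -> float:
    """Percent of the year the service is up when planned downtime is the only outage."""
    downtime_hours = windows_per_year * hours_per_window
    return 100.0 * (1 - downtime_hours / HOURS_PER_YEAR)


# Assumed cadence: one patching window per month, lasting one or two hours.
for hours in (1, 2):
    pct = availability_percent(windows_per_year=12, hours_per_window=hours)
    print(f"{hours} h/month of patching -> {pct:.3f}% availability")
```

Under those assumptions, patching alone puts you at roughly 99.86% with one hour windows and 99.73% with two hour windows, before a single unplanned failure occurs.  Whether that is acceptable is a business decision about the database, not a feature of the hypervisor.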

However, I also have systems that have minimal downtime requirements, and because of this, relying on VMware HA as my primary HA solution doesn’t provide the level of availability required for those systems.  This leads to SQL Server clustering, which is supported under virtualization if it meets the specific requirements documented in the Support policy for Microsoft SQL Server products that are running in a hardware virtualization environment and in Bob Ward’s blog post SQL Server Support Policy for Failover Clustering and Virtualization gets an update.  This brings us to the question that started this discussion: “Would I consider clustering a virtualized instance?”  Maybe, if the host environment had the resources to support it, but using VMs for my cluster nodes wouldn’t be my first choice for clustering, for a couple of reasons.

First, if a database has a minimal downtime requirement, it probably also has a minimum performance requirement that is coupled with it.  While you can, and should, set up reservations for the resources allocated to a SQL Server VM, in NUMA enabled hosts like newer Nehalem systems, the maximum recommended size of a single VM is the resources available in a single NUMA node.  That means if you have a quad socket, quad core server with 128GB RAM and 4 NUMA nodes, one per socket, each node would have 4 cores and 32GB RAM, making the largest VM 4 vCPU and 32GB RAM, unless you trade off the NUMA optimizations in ESX and memory locality.  In addition to this, you have to be careful where your SQL Server VM cluster nodes exist inside of the VMware host cluster to actually maintain the high availability of the SQL Server cluster.  If both of the SQL Server VM cluster nodes exist on the same physical host, and that host fails, both of the SQL Server VM cluster nodes are going to fail as well.  On top of these reasons, configuration of the VM guests for MSCS is not a trivial process, as shown by the 36 page whitepaper Setup for Failover Clustering and Microsoft Cluster Service.
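
Both of these checks are simple enough to sketch in Python.  The function names, host names, and VM names below are hypothetical, and the sizing function assumes one NUMA node per socket with memory split evenly across the nodes, as in the example above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def numa_node_size(sockets: int, cores_per_socket: int, ram_gb: int) -> Tuple[int, int]:
    """Cores and RAM (GB) in one NUMA node.

    Assumes one NUMA node per socket and RAM divided evenly between nodes.
    """
    return cores_per_socket, ram_gb // sockets


def colocated_cluster_nodes(
    vm_host: Dict[str, str], sql_cluster: List[str]
) -> Dict[str, List[str]]:
    """Return the hosts that run more than one node of the same SQL Server cluster."""
    by_host: Dict[str, List[str]] = defaultdict(list)
    for vm in sql_cluster:
        by_host[vm_host[vm]].append(vm)
    return {host: vms for host, vms in by_host.items() if len(vms) > 1}


# Quad socket, quad core host with 128GB RAM.
vcpus, ram = numa_node_size(sockets=4, cores_per_socket=4, ram_gb=128)
print(f"Largest NUMA-local VM: {vcpus} vCPU, {ram} GB RAM")

# Hypothetical placement where both SQL cluster nodes landed on esx01.
placement = {"sqlnode1": "esx01", "sqlnode2": "esx01", "web01": "esx02"}
risky = colocated_cluster_nodes(placement, ["sqlnode1", "sqlnode2"])
print(risky)
```

Any host that shows up in the second result is a single point of failure for the entire SQL Server cluster, which defeats the purpose of clustering in the first place.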

In my current environment we have multiple clustered SQL Server instances, and none of them are virtualized.  We are currently planning to build additional clustered SQL Server instances to host the databases that have high availability and minimal downtime requirements.  Despite having a dedicated VMware host cluster for our SQL VM environment built on very powerful hardware, there isn’t any added benefit for us in building the clusters inside of virtual machines.  Your environment may be different, and clustering inside of virtual machines may make sense for your specific requirements.  In the end, though, it is the database downtime requirements that should determine whether or not the database should be clustered, not the fact that VMware provides built in High Availability, because that only covers one aspect of minimizing downtime.

Virtualizing SQL on VMware Reference List

I’ve been managing SQL Server virtualized in production environments for nearly five years now, and in that time I’ve had to do a significant amount of reading and learning about VMware as a hypervisor so that I could properly track down performance problems and, in a lot of cases, prove to vendors that the problem wasn’t virtualization.  It wasn’t very long ago that nobody supported virtualization, including Microsoft, despite the fact that it generally wasn’t the cause of problems.  Lately I’ve seen more people virtualizing SQL Servers and asking questions about running SQL Server in a virtualized environment, and in a lot of cases, when the platform selected is VMware, I can answer their questions by pointing them to one of the many whitepapers I’ve read over the last few years.  To make this easier, and to share what I consider to be important references for VMware, I am going to list them in this blog post by category.  I’ve made it a point to try to list the most recent papers, but in some cases the only paper I know of, or could find by searching, covers an ESX release a version or two back.

VMware vSphere™ 4: The CPU Scheduler in VMware® ESX™ 4
VMware® vSphere™: The CPU Scheduler in VMware ESX® 4.1

Understanding Memory Resource Management in VMware® ESX™ Server
Understanding Memory Resource Management in VMware ESX 4.1
Virtual Performance: Love Your Balloon Driver
Large Page Performance

PVSCSI Storage Performance
VMware vSphere 4 Performance with Extreme I/O Workloads
Comparison of Storage Protocol Performance in VMware vSphere™ 4
VMware® vStorage Virtual Machine File System Technical Overview and Best Practices
Recommendations for Aligning VMFS Partitions

SQL Server Specific
Performance and Scalability of Microsoft® SQL Server® on VMware vSphere™ 4
Microsoft SQL Server and VMware Virtual Infrastructure
Availability Guide for Deploying SQL Server on VMware® vSphere
Microsoft® SQL Server on VMware® Best Practices Guide

Performance Best Practices General
Performance Best Practices for VMware vSphere™ 4.1
Performance Troubleshooting for VMware vSphere 4