Traditionally questions about how much memory SQL Server needs were aimed at how to appropriately set the 'max server memory' sp_configure option in SQL Server, and in my book the recommendation that I make is to reserve 1 GB of RAM for the OS, 1 GB for each 4 GB of RAM installed from 4–16 GB, and then 1 GB for every 8 GB RAM installed above 16 GB RAM.  This has typically worked out well for servers that are dedicated to SQL Server.  You can also get much more technical with determining where to set 'max server memory' by working out the specific memory requirements for the OS, other applications, the SQL Server thread stack, and other multipage allocators.  Typically this would be  ((Total system memory) – (memory for thread stack) – (OS memory requirements ~ 2-4GB) – (memory for other applications) - (memory for multipage allocations; SQLCLR, linked servers, etc)), where the memory for thread stack = ((max worker threads) *(stack size)) and the stack size is 512KB for x86 systems, 2MB for x64 systems and 4MB for IA64 systems.  The value for 'max worker threads' can be found in the max_worker_count column of sys.dm_os_sys_info.  However, the assumption with either of these methods is that you want SQL Server to use everything that is available on the machine, unless you've made reservations in the calculations for other applications.

As more shops move towards virtualizing SQL Servers in their environment this question is more and more geared towards determining what is the minimum amount of memory that a SQL Server will need to run as a VM.  Unfortunately there is no way to calculate out what the ideal amount of RAM for a given instance of SQL Server might actually be since SQL Server is designed to cache data in the buffer pool, and it will typically use as much memory as you can give it.  One of the things to keep in mind when you are looking at reducing the RAM allocated to a SQL Server instance is that you will eventually get to a point where the lower memory gets traded off for higher disk I/O access in the environments.

If you need to figure out the ideal configuration for SQL Server memory in an environment that has been over provisioned the best way to try to go about doing this is start off with a baseline of the environment and the current performance metrics.  Counters to begin monitoring would include:

  • SQL Server:Buffer Manager\Page Life Expectancy
  • SQL Server:Buffer Manager\Page reads/sec
  • Physical Disk\Disk Reads/sec

Typically if the environment has excess memory for the buffer pool, the Page Life Expectancy value will continue to increase by a value of one every second and it won't typically drop off under the workload because all of the data pages end up being cached.  At the same time, the number of SQL Server:Buffer Manager\Page reads/sec will be low after the cache ramp up occurs which will also correspond to a low value for Physical Disk\Disk Reads/sec. 

Once you have your baseline for the environment, make a change to the sp_configure 'max server memory' option to reduce the size of the buffer pool by 1GB and then monitor the impact to the performance counters after things stabilize from the initial cache flushing that may typically occur when RECONFIGURE is run in the environment.  If the level of Page Life Expectancy remains acceptable for your environment (keeping in mind that a fixed target of >= 300 is ridiculous for servers with large amounts of RAM installed), and the number of SQL Server:Buffer Manager\Page reads/sec is within what the disk I/O subsystem can support without performance degradation, repeat the process of reducing the sp_configure value for 'max server memory' by 1GB and continuing to monitor the impact to the environment. 

This morning Paul received an email from a member of the community that was asking if it is safe to run SQL Server inside of a VM and whether we would suggest running databases in virtual machines or not.  The root of the question being asked was an article that the person had read back in January titled SQL Server and VMware: A Potentially Fatal Combination.  This article is an amazing example of how misinformation on the internet can lead to confusion and an incorrect decisions about what options are feasible for a given environment or not.  Most people that read my blog know that I have been running SQL Server virtualized on VMware in production environments since late 2004/early 2005.  In all that time, not once have I ever experienced a data corruption issue associated with SQL Server running inside of a virtual machine, even on my laptop where the machine has powered off due to loss of battery life and the machines experienced a hard crash during operation.  Why?

Analyzing the Source of the Misinformation

Lets start out at the top of the article and debunk the misinformation that is contained in it about running SQL Server on VMware.  The first thing is that the article makes some very broad claims about VMware based on the use of VMware Player, which is akin to Microsoft Virtual PC for Windows XP Mode in Windows 7.  This is not a server class hypervisor, nor is it in any way representative of VMware’s main products ESX/ESXi or vSphere.  This is like comparing SQL Server to Microsoft Access, they aren’t even in the same class of functionality or features.

The next claim in the article is that the database is at risk due to hypervisor caching of the writes being performed by SQL Server, which uses the FILE_FLAG_WRITE_THROUGH flag when calling the CreateFile function to open the database, instructing the system to write through any cache directly to disk.  The article claims that VMware Player caches I/O operations and can result in lost writes.  Unfortunately, the example provided is a machine running on Windows Vista which as a hosted platform always uses unbuffered I/O for virtual machines, so it is not caching the writes being performed.  You don’t have to take my word for this, it is documented in the Storage IO crash consistency with VMware products KB article on their site:

For hosted products, write handling depends on the host operating system.

On Linux hosts, VMware does not use unbuffered IO, because it is not safe or supported across all the Linux versions that VMware supports. So currently, VMware hosted products on Linux hosts always use buffered IO.

On Windows hosts, VMware hosted products use unbuffered IO by default.

Unfortunately the entire foundation of the article is based on misinformation that can be proven by reading VMware’s documentation.  The VMware ESX hypervisors DO NOT cache I/O under any circumstance, which is also covered in the same KB article:

VMware ESX acknowledges a write or read to a guest operating system only after that write or read is acknowledged by the hardware controller to ESX. Applications running inside virtual machines on ESX are afforded the same crash consistency guarantees as applications running on physical machines or physical disk controllers. 

If this is the case, you might be wondering how the database in the article became corrupt? Unfortunately it is impossible to know because the necessary parameters surrounding the test configuration are not provided in the article.  Instead the article provides a Disclaimer that tells you the specifics are “rather unimportant”

Disclaimer:
In many articles you will find the execution times of different queries. Please do not interpret these results scientifically or as official benchmarks. You will not even find the configuration of the computer used for these experiments because it is rather unimportant.

I disagree that this kind of information is unimportant, especially when making claims that SQL Server will suffer database corruption issues if running under VMware. 

Setting up a Repeatable Test Scenario

For scientific purposes, I decided to take one of my standard SQL Server 2008R2 VMs running inside of my Dell M6500 laptop and run a series of repeatable tests using VMware Workstation 7 on Windows 7.  The VM is configured with 2 vCPUs and 2GB RAM, and like the article a thumb drive will be used to store the database data and log files.  To setup the environment for the tests, the following script will be used:

-- Create a "Safe" database on the VMDK stored on my laptop SSD
CREATE DATABASE [VMSafe]
ON PRIMARY
( NAME = N'VMSafe',
  FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\VMSafe.mdf',
  SIZE = 6144KB,
  FILEGROWTH = 1024KB )
LOG ON
( NAME = N'VMSafe_log',
  FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\VMSafe_log.ldf',
  SIZE = 1024KB ,
  FILEGROWTH = 1024KB)
GO

-- Create a "Test" database on a removable thumb drive
CREATE DATABASE [VMTest]
ON PRIMARY
( NAME = N'VMTest',
  FILENAME = N'E:\SQLskills\VMTest.mdf',
  SIZE = 6144KB ,
  FILEGROWTH = 1024KB )
LOG ON
( NAME = N'VMTest_log',
  FILENAME = N'E:\SQLskills\VMTest_log.ldf' ,
  SIZE = 1024KB ,
  FILEGROWTH = 1024KB)
GO

-- Create the baseline table in the "Safe" database
USE [VMSafe];
GO
IF OBJECT_ID('dbo.TestTable') IS NOT NULL
BEGIN
    DROP TABLE dbo.TestTable;
END
GO

CREATE TABLE dbo.TestTable
(RowID INT IDENTITY PRIMARY KEY,
Col1 INT,
Col2 CHAR(400) DEFAULT('ABC123'),
Col3 NCHAR(400) DEFAULT('789XYZ'))
GO
DECLARE @StartTime DATETIME = CURRENT_TIMESTAMP
DECLARE @loop INT = 1
BEGIN TRANSACTION
WHILE @loop <=100000
BEGIN
    INSERT INTO TestTable (Col1) VALUES (@loop);
    SET @loop = @loop + 1;
END
COMMIT TRANSACTION
SELECT DATEDIFF(ms, @StartTime, CURRENT_TIMESTAMP) AS Table1_Time
GO
UPDATE TestTable
SET Col1 = RowID;
GO

-- Create a duplicate of the table in the "Test" database
USE [VMTest]
GO
CREATE TABLE dbo.TestTable
(RowID INT IDENTITY PRIMARY KEY,
Col1 INT,
Col2 CHAR(400) DEFAULT('ABC123'),
Col3 NCHAR(400) DEFAULT('789XYZ'))
GO
SET IDENTITY_INSERT dbo.TestTable ON
INSERT INTO TestTable (RowID, Col1, Col2, Col3)
SELECT RowID, Col1, Col2, Col3
FROM VMSafe.dbo.TestTable
SET IDENTITY_INSERT dbo.TestTable OFF
GO

-- Issue CHECKPOINT to flush dirty buffer pages to disk
CHECKPOINT
GO

Now that we have our databases created and identical data in them so that we can validate the consistency of the environment in the event of a forced crash by removing the USB thumbdrive from the computer, lets issue a UPDATE against all the rows in the VMTest databases to prefix them with a 1_ for the Col2 column.

image

With the UPDATE complete, pull out the USB thumbdrive and then issue another CHECKPOINT operation to force a flush of dirty buffers to the disk.  This will raise a 9001 error for the VMTest database since the log is not available to write the CHECKPOINT log records to.

image

At this point we have a crashed database, so we can plug in our USB thumbdrive and restart the SQL Server instance to allow crash recovery to run for the databases.  Immediately upon instance restart we will see:

image

OMG!  It has the same status as the article, that must mean we have a problem!  NO! The In Recovery status tells you that crash recovery is running for the database.  If the database is In Recovery you should not set it to Emergency Mode as the article states.  Instead open up the ErrorLog for the instance and look at what is happening with the recovery, the database engine logs informational messages about the progression of crash recovery for the databases while it is running.

image

image

In this case, within a few minutes of opening the database the estimated time for crash recovery to be performed is already at 86 seconds.  The only thing you have to do here is be patient and allow the Database Engine to perform the Undo/Redo operations the way it is designed to.

image

image

All it took was 103 seconds for crash recovery to be performed and the database is available online and in a consistent manner.  The log records describing the transaction were hardened in the transaction log before the commit of the update operation occurred ensuring that the database could be recovered in a consistent state.

I have run this series of tests a number of different ways and the database always starts up in a crash consistent state.  Feel free to test this yourself by pulling the USB thumbdrive out while the UPDATE is running, and it will force a rollback of the transaction during crash recovery.

Summary

In Summary, there is nothing inherently unsafe about running SQL Server on VMware.  Microsoft has supported SQL Server on VMware for nearly three years at this point as a validated Server Virtualization Validation Program (SVVP) configuration (http://support.microsoft.com/?id=956893).  This article is further evidence of the danger of misinformation on the internet, and the need to validate information online before making decisions based on the information.  A SQL Server running as a virtual machine in a SVVP validated configuration is going to be as safe as a SQL Server running on physical hardware.

On my blog post Virtualizing SQL on VMware Reference List, Oscar Zamora (Blog | Twitter) asked the following question in a comment:

As a virtualized instance has the benefit of "failing over" to another physical box, would you consider clustering a virtualized instance?

The answer to this question more than I want to write up in a comment, so I decided instead to blog my response.  Since at least ESX 3, VMware has provided a number of high availability features in their enterprise server virtualization product.  A detailed listing of the HA features available in VSphere 4 can be found in the vSphere Availability Guide.
In this post I’ll talk about the most popular ones and the ones that VMware marketing uses to try to convince people that VMware HA solves all of their High Availability needs.

High Availability and DRS Clusters

ESX hosts can be clustered together providing high availability from the hardware failure of a host for the guests running inside of the cluster.  If a host fails, the VM’s that were running on that host fail as well, but the cluster detects this and powers the VM’s up on other hosts inside of the cluster automatically, bringing the VM’s back online and restoring application serviceability.  Other features such as Server vMotion and Storage vMotion allow movement of the VM’s and storage dynamically to allow for hardware maintenance and upgrades with little to no downtime to the applications running on the VM’s. In addition to this the VMware Tools installed inside of the guests can provide monitoring of the guest to detect Operating System failures and lock ups inside of the VM and automatically restart the VM.

VMware High Availability: Easily Deliver High Availability for All of Your Virtual Machines
VMware High Availability: Concepts, Implementation, and Best Practices

Fault Tolerance

VSphere 4 introduced a new high availability feature for VM guests called Fault Tolerance.  Fault Tolerance creates a synchronized Secondary virtual machine on another host in the high availability cluster that is lock stepped with the Primary VM.  In the event of a host failure, guests that have Fault Tolerance enabled immediately failover to their Secondary in a manner that is similar to vMotion preventing application downtime from occurring.  When this occurs a new Secondary is created on another host inside of the cluster and synchronized with the new primary maintaining the fault tolerance of the guest inside of the environment.

VMware vSphere™ 4 Fault Tolerance: Architecture and Performance
VMware® Fault Tolerance Recommendations and Considerations on VMware vSphere™ 4
Protecting Mission-Critical Workloads with VMware Fault Tolerance

What does this all mean to SQL Server?

These features are really great features provided by virtualization, but that doesn’t make them the solution to all of your High Availability needs.  SQL Server is often considered and treated by server administrators as just another application server, especially when it comes to virtualization.  However, SQL Server is not just another application and SQL Server provides its own High Availability options, like clustering, that may be more appropriate based on your environmental requirements.  There are specific reasons that the above features may not be acceptable HA features for SQL Server. 

Fault Tolerance is currently limited to single vCPU guest VM’s only, so unless your SQL Server VM’s are all single vCPU, that’s not going to help you out.  This leaves you with VMware HA and the potential for failure with automatic restart on another host.  However, what none of the VMware features provides is minimization of downtimes associated with planned Windows Updates of the guest VM’s or the application of SQL Server Service Packs.  If you have the ability to take periodically planned downtimes of one to two hours for patching the basic HA features of VMware will probably meet your needs.  This may be an acceptable configuration in your environment, and if it is, I won’t fault you for deciding that it meets your high availability needs.  I have VM’s in production that are protected first, by good database backups, and then by VMware HA. 

However, I also have systems that have minimal downtime requirements, and because of this, relying on VMware HA as my primary HA solution doesn’t provide the level of availability required for those systems.  This leads to SQL Server clustering, which is supported in virtualization if it meets specific requirements as documented on the Support policy for Microsoft SQL Server products that are running in a hardware virtualization environment and in Bob Wards blog post SQL Server Support Policy for Failover Clustering and Virtualization gets an update.  This brings us to the question that started this discussion: “Would I consider clustering a virtualized instance?”  Maybe, if the host environment had the resources to support it, but using VM’s for my cluster nodes wouldn’t be my first choice for clustering for a couple of reasons.

First if a database has a minimal downtime requirement it probably also has a minimum performance requirement that is coupled with it.  While you can, and should, setup reservations for the resources allocated to a SQL Server VM, in NUMA enabled hosts like newer Nahalem systems, the maximum recommended size of a single VM is the resources available in a single NUMA node.  That means if you have a quad socket quad core server with 128GB RAM with 4 NUMA nodes, on per socket, each node would have 4 cores and 32GB RAM, making the largest VM 4 vCPU and 32GB RAM, unless you trade off the NUMA optimizations in ESX and memory locality.  In addition to this, you have to be careful where your SQL Server VM cluster nodes exist inside of the VMware host cluster to actually maintain the high availability of the SQL Server cluster.  If both of the SQL Server VM cluster nodes exist on the same physical host, and that host fails, both of the SQL Server VM cluster nodes are going to fail as well.  On top of these reasons, configuration of the VM guests for MSCS is not a trivial process as shown by the 36 page whitepaper Setup for Failover Clustering and Microsoft Cluster Service.

In my current environment we have multiple clustered SQL Server instances, and none of them are virtualized and at the current time we are planning to build additional clustered SQL Server instances to migrate databases that have high availability and minimal downtime requirements to.  Despite having a dedicated VMware host cluster for our SQL VM environment comprised of very powerful hardware, there isn’t any added benefit to building the clusters inside of virtual machines.  Your environment may be different and clustering inside of virtual machines may make sense for your specific requirements, but in the end it is the database downtime requirements that should determine whether or not the database should be clustered, not that VMware provides built in High Availability, because that only covers one aspect of minimizing downtime.

I’ve been managing SQL Server virtualized in production environments for nearly five years now, and in that time, I’ve had to do a significant amount of reading/learning about VMware as a hypervisor so that I could properly track down performance problems and in a lot of cases, prove to vendors that the problem wasn’t virtualization.  It wasn’t very long ago that nobody supported virtualization, including Microsoft, despite the fact that it generally wasn’t the cause of problems.  Lately I’ve seen more people virtualizing SQL Servers and asking questions about running SQL Server in a virtualized environment and in a lot of cases, when the platform selected is VMware, I can answer their questions by pointing them to one of the many whitepapers I’ve read over the last few years.  To make this easier, and to share what I consider to be important references for VMware I am going to list them in this blog post  by category.  I’ve made it a point to try and list the most recent papers but in some cases, the only paper I know of or could find searching is for a version or two back in the ESX version.

CPU/Scheduling
VMware vSphere™ 4: The CPU Scheduler in VMware® ESX™ 4
VMware® vSphere™: TheCPU Scheduler in VMware ESX® 4.1


Memory
Understanding Memory Resource Management in VMware® ESX™ Server
Understanding Memory Resource Management in VMware ESX 4.1
Virtual Performance: Love Your Balloon Driver
Large Page Performance


DiskIO
PVSCSI Storage Performance
VMware vSphere 4 Performance with Extreme I/O Workloads
Comparison of Storage Protocol Performance in VMware vSphere™ 4
VMware® vStorage Virtual Machine File System Technical Overview and Best Practices
Recommendations for Aligning VMFS Partitions

SQL Server Specific
Performance and Scalability of Microsoft® SQL Server® on VMware vSphere™ 4
Microsoft SQL Server and VMware Virtual Infrastructure
Availability Guide for Deploying SQL Server on VMware® vSphere
Microsoft® SQL Server on VMware® Best Practices Guide

Performance Best Practices General
Performance Best Practices for VMware vSphere™ 4.1
Performance Troubleshooting for VMware vSphere 4

Categories:
Virtualization | Whitepapers

Theme design by Nukeation based on Jelle Druyts