SQL Server and VMware: A Potentially Fatal Combination? That’s Nonsense!

This morning Paul received an email from a member of the community that was asking if it is safe to run SQL Server inside of a VM and whether we would suggest running databases in virtual machines or not.  The root of the question being asked was an article that the person had read back in January titled SQL Server and VMware: A Potentially Fatal Combination.  This article is an amazing example of how misinformation on the internet can lead to confusion and an incorrect decisions about what options are feasible for a given environment or not.  Most people that read my blog know that I have been running SQL Server virtualized on VMware in production environments since late 2004/early 2005.  In all that time, not once have I ever experienced a data corruption issue associated with SQL Server running inside of a virtual machine, even on my laptop where the machine has powered off due to loss of battery life and the machines experienced a hard crash during operation.  Why?

Analyzing the Source of the Misinformation

Lets start out at the top of the article and debunk the misinformation that is contained in it about running SQL Server on VMware.  The first thing is that the article makes some very broad claims about VMware based on the use of VMware Player, which is akin to Microsoft Virtual PC for Windows XP Mode in Windows 7.  This is not a server class hypervisor, nor is it in any way representative of VMware’s main products ESX/ESXi or vSphere.  This is like comparing SQL Server to Microsoft Access, they aren’t even in the same class of functionality or features.

The next claim in the article is that the database is at risk due to hypervisor caching of the writes being performed by SQL Server, which uses the FILE_FLAG_WRITE_THROUGH flag when calling the CreateFile function to open the database, instructing the system to write through any cache directly to disk.  The article claims that VMware Player caches I/O operations and can result in lost writes.  Unfortunately, the example provided is a machine running on Windows Vista which as a hosted platform always uses unbuffered I/O for virtual machines, so it is not caching the writes being performed.  You don’t have to take my word for this, it is documented in the Storage IO crash consistency with VMware products KB article on their site:

For hosted products, write handling depends on the host operating system.

On Linux hosts, VMware does not use unbuffered IO, because it is not safe or supported across all the Linux versions that VMware supports. So currently, VMware hosted products on Linux hosts always use buffered IO.

On Windows hosts, VMware hosted products use unbuffered IO by default.

Unfortunately the entire foundation of the article is based on misinformation that can be proven by reading VMware’s documentation.  The VMware ESX hypervisors DO NOT cache I/O under any circumstance, which is also covered in the same KB article:

VMware ESX acknowledges a write or read to a guest operating system only after that write or read is acknowledged by the hardware controller to ESX. Applications running inside virtual machines on ESX are afforded the same crash consistency guarantees as applications running on physical machines or physical disk controllers. 

If this is the case, you might be wondering how the database in the article became corrupt? Unfortunately it is impossible to know because the necessary parameters surrounding the test configuration are not provided in the article.  Instead the article provides a Disclaimer that tells you the specifics are “rather unimportant”

Disclaimer:
In many articles you will find the execution times of different queries. Please do not interpret these results scientifically or as official benchmarks. You will not even find the configuration of the computer used for these experiments because it is rather unimportant.

I disagree that this kind of information is unimportant, especially when making claims that SQL Server will suffer database corruption issues if running under VMware. 

Setting up a Repeatable Test Scenario

For scientific purposes, I decided to take one of my standard SQL Server 2008R2 VMs running inside of my Dell M6500 laptop and run a series of repeatable tests using VMware Workstation 7 on Windows 7.  The VM is configured with 2 vCPUs and 2GB RAM, and like the article a thumb drive will be used to store the database data and log files.  To setup the environment for the tests, the following script will be used:

— Create a "Safe" database on the VMDK stored on my laptop SSD
CREATE DATABASE [VMSafe]
ON PRIMARY
( NAME = N’VMSafe’,
  FILENAME = N’C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\VMSafe.mdf’,
  SIZE = 6144KB,
  FILEGROWTH = 1024KB )
LOG ON
( NAME = N’VMSafe_log’,
  FILENAME = N’C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\VMSafe_log.ldf’,
  SIZE = 1024KB ,
  FILEGROWTH = 1024KB)
GO

— Create a "Test" database on a removable thumb drive
CREATE DATABASE [VMTest]
ON PRIMARY
( NAME = N’VMTest’,
  FILENAME = N’E:\SQLskills\VMTest.mdf’,
  SIZE = 6144KB ,
  FILEGROWTH = 1024KB )
LOG ON
( NAME = N’VMTest_log’,
  FILENAME = N’E:\SQLskills\VMTest_log.ldf’ ,
  SIZE = 1024KB ,
  FILEGROWTH = 1024KB)
GO

— Create the baseline table in the "Safe" database
USE [VMSafe];
GO
IF OBJECT_ID(‘dbo.TestTable’) IS NOT NULL
BEGIN
    DROP TABLE dbo.TestTable;
END
GO

CREATE TABLE dbo.TestTable
(RowID INT IDENTITY PRIMARY KEY,
Col1 INT,
Col2 CHAR(400) DEFAULT(‘ABC123’),
Col3 NCHAR(400) DEFAULT(‘789XYZ’))
GO
DECLARE @StartTime DATETIME = CURRENT_TIMESTAMP
DECLARE @loop INT = 1
BEGIN TRANSACTION
WHILE @loop <=100000
BEGIN
    INSERT INTO TestTable (Col1) VALUES (@loop);
    SET @loop = @loop + 1;
END
COMMIT TRANSACTION
SELECT DATEDIFF(ms, @StartTime, CURRENT_TIMESTAMP) AS Table1_Time
GO
UPDATE TestTable
SET Col1 = RowID;
GO

— Create a duplicate of the table in the "Test" database
USE [VMTest]
GO
CREATE TABLE dbo.TestTable
(RowID INT IDENTITY PRIMARY KEY,
Col1 INT,
Col2 CHAR(400) DEFAULT(‘ABC123’),
Col3 NCHAR(400) DEFAULT(‘789XYZ’))
GO
SET IDENTITY_INSERT dbo.TestTable ON
INSERT INTO TestTable (RowID, Col1, Col2, Col3)
SELECT RowID, Col1, Col2, Col3
FROM VMSafe.dbo.TestTable
SET IDENTITY_INSERT dbo.TestTable OFF
GO

— Issue CHECKPOINT to flush dirty buffer pages to disk
CHECKPOINT
GO

Now that we have our databases created and identical data in them so that we can validate the consistency of the environment in the event of a forced crash by removing the USB thumbdrive from the computer, lets issue a UPDATE against all the rows in the VMTest databases to prefix them with a 1_ for the Col2 column.

image

With the UPDATE complete, pull out the USB thumbdrive and then issue another CHECKPOINT operation to force a flush of dirty buffers to the disk.  This will raise a 9001 error for the VMTest database since the log is not available to write the CHECKPOINT log records to.

image

At this point we have a crashed database, so we can plug in our USB thumbdrive and restart the SQL Server instance to allow crash recovery to run for the databases.  Immediately upon instance restart we will see:

image

OMG!  It has the same status as the article, that must mean we have a problem!  NO! The In Recovery status tells you that crash recovery is running for the database.  If the database is In Recovery you should not set it to Emergency Mode as the article states.  Instead open up the ErrorLog for the instance and look at what is happening with the recovery, the database engine logs informational messages about the progression of crash recovery for the databases while it is running.

image

image

In this case, within a few minutes of opening the database the estimated time for crash recovery to be performed is already at 86 seconds.  The only thing you have to do here is be patient and allow the Database Engine to perform the Undo/Redo operations the way it is designed to.

image

image

All it took was 103 seconds for crash recovery to be performed and the database is available online and in a consistent manner.  The log records describing the transaction were hardened in the transaction log before the commit of the update operation occurred ensuring that the database could be recovered in a consistent state.

I have run this series of tests a number of different ways and the database always starts up in a crash consistent state.  Feel free to test this yourself by pulling the USB thumbdrive out while the UPDATE is running, and it will force a rollback of the transaction during crash recovery.

Summary

In Summary, there is nothing inherently unsafe about running SQL Server on VMware.  Microsoft has supported SQL Server on VMware for nearly three years at this point as a validated Server Virtualization Validation Program (SVVP) configuration (http://support.microsoft.com/?id=956893).  This article is further evidence of the danger of misinformation on the internet, and the need to validate information online before making decisions based on the information.  A SQL Server running as a virtual machine in a SVVP validated configuration is going to be as safe as a SQL Server running on physical hardware.

5 thoughts on “SQL Server and VMware: A Potentially Fatal Combination? That’s Nonsense!

  1. We have a some limited experience with support SQL on a VM but I am sure not as much as Jonathan and this group. Also, appearances can be deceing but the server level staff we have supporting the SQL env often seem to over-source the host and then we cant see the problem from the VM. That is, it seems harder to find the problem because the actual hardware is hidden fron us as DBAs. We focus on being DBAs not vm/server experts. With server staff of varying levels of experience, would you say it is accurate when evaluating TCO and whether to go physical/virtual, to factor in the fact there is a greated disconnect from SQL and the hardware, increased complexity, and the server staff really need to know what they are doing? We seem to find virtual works great when it works BUT far more painful to troubleshoot than a physical system … and that should be factored in when putting all the proverbial eggs in one basket. That is, unless you make all DBAs VM and server experets too.

  2. Thanks Jonathan. Nice article. Just wanted to echo it with my real experience (not regarding VM but about the situation you simulated in your test). We had similar issue recently where we received sev 21 alerts “The log for database **** is not available” when the SAN controller was over loaded (100%) for a short period of time. The particular database went into Suspect status. Then we failed over the instance (we knew load on SAN has come back to normal. We didn’t lose the log drive) which fixed the issue and I saw the crash recovery related messages in error log.

    The other day (before the issue described above) one staging environment which is on the same SAN had the same issue for same reason. I did DBCC repair which fixed the issue. I am just wondering if I had failed over the instance, would that fix the problem? Probably yes I guess.

    Notes for people who may read my comments: I am not saying restarting instance is the way how you fix a Suspect database. Just saying in that particular situation, restart did fix the issue.

    Now lets sort of get back to the original topic. Our server guys investigated the issue with the SAN and told me one particular vm sql cluster was smashing SAN. Then I went back and got a list of IO intensive queries. Well, a lot of frequently executed queries (same pattern in terms of how they were written and what they do) were doing massive IO. The queries have a “Top” and a join with a remote view on a linked server which kind of screw cardinality hence cause houndreds of thousands scan count on a local index (non clustered index seek). By the way, I don’t know why this happened. So I dropped the index which changed the plan to use table scan. After that, 99% IO was reduced.

    Well, what I am trying to say by using above example is honestly I hardly see issues with OS no matter its phyical or virtual (in fact we are virtulising our whole environment) if sql server is built and configured properly. Most of time issues come from database design, application. So additional layer introduced by virtualisation? Not a problem to me (Notes again. I am not saying you can just mess around vm and storage. You still need to be careful about setting up vm making sure best practices are followed). As a DBA, I don’t really worry about if its a vm or pm because most of time I deal with issues coming from db and application design (if an environment is properly configured and managed)

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Bitten by SSD Bit Rot

Back in 2012, I wrote a post titled Looking at External Disk Performance using USB 3.0 and eSATA with SSD, where I tested a number

Explore

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.