Reducing Long R720 POST Boot Time

I was recently working with a client that uses Dell R720 servers for SQL Server, and during restarts the POST (Power On Self Test) would stall for as long as 15 minutes before the Windows Server OS would actually begin starting.  After some searching I found the Dell PowerEdge 12G Server BIOS PDF, which covers PCIe Slot Disablement.  This feature allows a PCIe slot to be set to Boot Driver Disabled so that the device's Option ROM does not run during POST, allowing for quicker boot times.  It disables the preboot services for a non-bootable device while leaving the device available to the OS.  On the systems where boot times were exceeding 15 minutes, none of the PCIe devices were used for booting the server; they were PCIe SSDs and NICs.  Setting those slots to Boot Driver Disabled reduced the POST time and allowed the server to reboot significantly faster.

To change these settings, press F2 during POST to enter the BIOS System Setup.  You will have to wait for all of the preboot device initialization to finish before the System Setup screen appears.  Then click on the Integrated Devices menu item:

[Screenshot: Integrated Devices menu in System Setup]

Then click on the Slot Disablement menu item:

[Screenshot: Slot Disablement menu]

Then set the slots that are not required for bootable devices to Boot Driver Disabled:

[Screenshot: Slots set to Boot Driver Disabled]

Save the configuration and reboot the server.  The POST should complete significantly faster now that it doesn't have to initialize each device's preboot environment and load its Option ROM to make it available as a boot device.  Since the OS driver is still loaded once the OS boots, the device will continue to function inside Windows, but as with anything your mileage may vary, and you should test this change before introducing it on production systems.

CPU Ready Time in VMware and How to Interpret its Real Meaning


In the last month I have had to explain to a number of people how to interpret CPU Ready Time information for SQL Server VMs running on VMware. The first time was on Twitter, and the topic is really too big for a 140-character discussion, but I went ahead and gave it a shot. In my experience it is rare for SQL professionals to know much about virtualization under VMware or Hyper-V, which is why we spend half a day, or more, on virtualization during IE3: High Availability & Disaster Recovery. Based on my experience, I was surprised to see the question being asked on #sqlhelp at all. It turns out that Idera Diagnostics Manager has been updated to include information about VM performance that is collected from Virtual Center if you have it installed in your data center.

Overall this is a good thing, because DBAs now have additional information that they previously couldn't see unless they had negotiated access to Virtual Center with their VM administrator. My consulting experience is that few DBAs have access to, or even know about, Virtual Center, even though it is one of the critical tools for troubleshooting performance problems. Another thing most DBAs don't know is that their VM administrator can configure roles and read-only security access so that anyone can see performance and configuration information inside Virtual Center without being able to make changes, so there really is no good reason for a DBA not to have access to the performance data available in Virtual Center.

Now that DBAs have access to the data, it is important to know what is being displayed and what it really means. Unfortunately, the way the information is presented makes it easy to misinterpret what is being shown, and the end result is confusion for DBAs who don't know a lot about VMware or what the information they are looking at actually means. An example of the VM CPU Ready information from Diagnostics Manager is shown in the chart in Figure 1.

Figure 1 – VM CPU Ready graph

Looking at this chart on the report, what would your first interpretation of it be? So far, 3 out of 3 people, including the one on Twitter, have thought that there was a problem with this VM. The reality is that there is absolutely nothing wrong with this VM; it is doing just fine. It just so happens that the scale of this graph makes it easy to jump to the interpretation that it is showing a percentage value, and we all know that anything performance related that is over 80% can't be good. The chart doesn't say that, but it doesn't provide any context for the information's meaning either. To make matters even more confusing, CPU Ready in VMware is available both as a summation value, which happens to be what is shown here in milliseconds, and as a percentage (RDY% in esxtop) of time the VM spent waiting, with work to do, to be scheduled by the hypervisor. We can confirm that the information presented is the summation value and not the percentage by looking at the real-time information available in Virtual Center for the same server, as shown in the chart in Figure 2.

Figure 2 – CPU Ready real-time summation from Virtual Center

So now that we know it's a summation and represented in milliseconds, what exactly does that tell us? Unfortunately, it doesn't actually tell us anything on its own.

What is CPU Ready Time and why do we even care?

CPU Ready Time is the time that a VM waits in a ready-to-run state (meaning it has work to do) to be scheduled on one or more of the physical CPUs by the hypervisor. It is normal for VMs to accumulate small amounts of CPU Ready Time even when the hypervisor is not oversubscribed or under heavy load; it's just the nature of shared scheduling in virtualization. For SMP VMs with multiple vCPUs, the amount of ready time will generally be higher than for VMs with fewer vCPUs, since it takes more resources to schedule/co-schedule the VM when necessary and each of the vCPUs accumulates the time separately.

At what point does CPU Ready Time start to affect performance?

To be honest, it always has some minimal effect, but how much depends on a lot of different factors, starting with which CPU Ready value you are looking at and where you are getting the information. If you are looking at raw RDY% values from esxtop, the value has a completely different meaning than the summation values that are available from Virtual Center. Inside Virtual Center, the summation level you are reading also affects what the value means, and you have to perform calculations to convert the summation into a percentage to know the effect, as documented in the VMware Knowledge Base. In this case, for a real-time chart, each data point is a summation of the ready time accumulated over a 20 second interval.

At one point VMware had a recommendation that anything over 5% ready time per vCPU was something to monitor. In my experience with SMP SQL Server VMs, anything over 5% per vCPU is typically a warning level and anything over 10% per vCPU is critical. The reason this specifically says per vCPU is that each vCPU contributes 100% to the VM's scheduling total, so a 4 vCPU VM has a scheduling total of 400%. A 10% CPU Ready value on a 4 vCPU VM therefore only equates to 2.5% per vCPU. If this isn't already complex enough, it gets worse.

This makes providing a general recommendation for this counter impossible, because it depends on a lot of different factors. For example, if a CPU Limit has been placed on the VM, then whenever the VM exceeds its allocated limit it will accumulate CPU Ready time while it waits to be allowed to execute again. If the CPU Limit is being enforced under business SLAs or a chargeback system, the VM could easily have high CPU Ready values that are exactly what the configuration requires. Using the formula from the KB article to convert a summation value to a percentage, and rounding the average of 81.767 ms down to 80 for simple math, this results in:

(80 ms / (20 s * 1000)) * 100 = 0.4% CPU Ready

Four tenths of a percent CPU Ready time is not going to negatively impact the performance of the VM. The example VM shown above also has 8 vCPUs allocated to it, so after taking that into account it really only has 0.05% per vCPU, well below the older recommended value.
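
If you want to sanity-check these numbers without a calculator, below is a minimal sketch of the same math in Python. The function name and sample values are just illustrative; the 20 second interval applies to the real-time chart discussed above, and other summation levels cover longer intervals, so it is left as a parameter.

```python
def cpu_ready_percent(summation_ms, interval_seconds=20, vcpu_count=1):
    """Convert a Virtual Center CPU Ready summation (in milliseconds) into a
    percentage using the formula from the VMware KB article:

        percent = (summation_ms / (interval_seconds * 1000)) * 100

    Dividing by the vCPU count gives the per-vCPU value that the warning (5%)
    and critical (10%) thresholds discussed above refer to."""
    total_percent = (summation_ms / (interval_seconds * 1000.0)) * 100.0
    return total_percent, total_percent / vcpu_count


# The example from the post: ~80 ms of ready time in a 20 second real-time
# sample, on a VM with 8 vCPUs allocated.
total, per_vcpu = cpu_ready_percent(80, interval_seconds=20, vcpu_count=8)
print(f"Total: {total:.2f}%  Per vCPU: {per_vcpu:.3f}%")
# Total: 0.40%  Per vCPU: 0.050%
```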

What scenarios cause high CPU Ready times?

While there are a number of scenarios where high CPU Ready times can occur, there are two common ones that I see when I am consulting. The most common is host oversubscription, where too many vCPUs have been allocated relative to the number of physical CPUs. While ESX 5 supports a maximum of 25 vCPUs per physical CPU, this is definitely a case where just because you can doesn't mean you should. As always, your mileage may vary based on your specific VM workloads, but I typically start to see problems when a host is in the range of 2-2.5X oversubscribed for server workloads.
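
As a quick way to see where a host falls against that range, the oversubscription ratio is just the total number of vCPUs allocated across the VMs on the host divided by the number of physical cores. Here is a minimal sketch, with a made-up VM mix used purely for illustration:

```python
def oversubscription_ratio(vcpus_per_vm, physical_cores):
    """vCPU-to-pCPU ratio for a host: total allocated vCPUs / physical cores."""
    return sum(vcpus_per_vm) / physical_cores


# Hypothetical host: one 8 vCPU SQL Server VM plus ten smaller 1-2 vCPU
# application server VMs, all running on 8 physical cores.
vms = [8, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1]
print(f"{oversubscription_ratio(vms, 8):.2f}x")  # 2.75x - above the 2-2.5X range
```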

The second common scenario where I see high CPU Ready times is when a larger SMP VM for SQL Server, for example one with 4-8 vCPUs, is running on a host alongside a lot of smaller VMs with 1-2 vCPUs for application servers. Depending on the number of physical processors and the total number of vCPUs allocated on the host, the larger resource allocation for the SQL Server VM means it has to wait longer for the hypervisor to preempt the necessary physical CPUs to schedule/co-schedule its workload. Often in these cases, after asking some questions I find that the number of vCPUs for the SQL Server VM was increased from four to eight because of performance problems. Unfortunately, if CPU Ready time was the original problem, increasing the vCPUs doesn't improve performance; it generally makes things worse.

What do I do if this is actually a problem?

If you have gone through the information and you can see that CPU Ready really is a problem for your VMs, there are a couple of different things that can be done, and the correct one depends on your virtual infrastructure. If the problem is purely host oversubscription of vCPUs to pCPUs, start by evaluating whether the VMs actually need the number of vCPUs they have been configured with, to determine if any of them can be reduced to lower the ratio. If that can't be done, the only real answer is to add hosts so that the load can be balanced better and the oversubscription rate reduced. If the problem is specific to larger SMP VMs for applications like SQL Server, evaluate whether you can consolidate the larger VMs onto one or more hosts and move the smaller VMs to the other hosts, separating the VMs by size. This has worked well for a number of clients I have worked with where they truly needed eight or more vCPUs for their workload.

Summary

Understanding the data that you are looking at and what it actually means is critical to making the right decisions about what is happening in a virtualized environment. CPU Ready time specifically requires a good understanding of what the value is actually showing and how it relates to the configuration of the VM, the other VMs on the host, and the physical host resources. If you are looking at summation data for CPU Ready time, converting it to a CPU Ready percentage is what gives the data the proper meaning for understanding whether or not it is actually a problem. However, keep in mind that other configuration options like CPU Limits can affect the accumulated CPU Ready time and must be checked as well.

Whenever I am performing a health check of a SQL Server VM on VMware, I make sure that I get screenshots of the CPU Ready information from Virtual Center for each of the summation levels available so that I can determine whether or not it is affecting the performance of the VM, but I am always careful to calculate, using the correct formula, what the percentage value actually works out to, and to review the rest of the VM configuration before drawing any conclusions.  In the worst case I've seen, one client's CPU Ready time was roughly 63% per vCPU, and you could visibly see the VM freeze while moving the mouse in an RDP session.  Reviewing the configuration showed that the VM had 8 vCPUs on a host with 8 physical CPUs that was also running 10 other VMs with a total of 14 additional vCPUs, a 2.75X oversubscription of the host.  Moving that VM back down to 2 vCPUs provided instant relief from their biggest bottleneck, and then we started talking about hardware changes to fit their increased use of virtualization.  If you'd like expert assistance with implementing, configuring, troubleshooting, or understanding SQL Server on VMware, we have a number of different services to fit your needs.

Clustering SQL Server on Virtual Machines (Round 2)

Recently there was a lengthy discussion on the #sqlhelp hash tag on Twitter about clustering SQL Server on VMs and whether or not that is a good idea. Two years ago I first blogged about this same topic in my post, Some Thoughts on Clustering SQL Server Virtual Machines. If you haven't read that post, I recommend reading it before continuing with this one, because it gives a lot of background that I won't be rehashing here. However, a lot has changed in VMware and Hyper-V since I wrote that original post, and those changes really affect the recommendations I would make today.

As I stated in the Twitter discussion, we have a number of clients at SQLskills that run WSFC clusters across VMs for SQL Server HA, and few have problems with the clusters related to them being VMs. There are some additional considerations when using VMs for the cluster nodes; a big one is that you should plan for those nodes to run on different hosts, so that a hardware failure of one host doesn't take out both of your cluster nodes, which would defeat the purpose of having a SQL Server cluster entirely. There is also the additional layer of the hypervisor and its resource management that plays into having a cluster on VMs, but with proper planning and management of the VM infrastructure this won't be a problem; it's just another layer you have to consider if you do happen to have a problem.

In response to the discussion, Chuck Boyce Jr (Blog|Twitter) wrote a blog post giving his opinion, which was not to do it, and that started a separate discussion later on Twitter. The biggest problem Chuck points out is rooted in poor inter-team communication within an IT shop. To be honest, Chuck's point is not an incorrect one, and I see this issue all the time, but it's not specific to VMs. If you work in a shop that has communication problems between DBAs, Windows administrators, VM administrators, the networking team, and any other IT resource in the business, the simple fact is those problems can be just as bad for a physical implementation of a SQL Server cluster as they might be for a VM implementation. The solution is to fix the communication problem and find ways to make the “team” cooperate better when problems arise, not to avoid combining technologies in the hope of preventing problems that will still occur in a physical implementation as well.

Am I saying that clustering VMs for SQL Server is for every shop? No, certainly not. There are plenty of places where clustering isn't the best solution overall. However, with virtualization, and depending on the infrastructure, the other SQL Server HA options might not be the better choice they would be in a physical world either. One of the biggest things to think about is where the VMs are ultimately going to be stored. If the answer is a shared SAN, then options like Database Mirroring and Log Shipping don't really provide the same advantages that they do in a physical implementation, the big one being that you generally have a second copy of the database on completely different storage. Yes, I know that you could have two physical SQL Servers connected to the same SAN that use Database Mirroring, and my response would be that a cluster probably makes more sense there, because the SAN is your single point of failure in either configuration.

If you are new to clustering SQL Server, I wouldn't recommend that you start out with VMs for your failover cluster. The odds are that you also don't have a lot of VM experience, and if there is a problem you aren't going to be able to troubleshoot it as effectively, because you have two new technologies to dig through at once. If you are comfortable with clustering SQL Server, adding virtualization to the mix is really not that big of a deal. You just need to read the configuration guides and whitepapers for how to set up the VMs; usually your VM administrator is going to have to do this, so it's a good area to break the ice, work together, and open the lines of communication to get a supported WSFC implementation, and then finally install SQL Server and manage it like you would any other SQL Server failover cluster.

Where else would I recommend not implementing a cluster on VMs? iSCSI SANs that only offer 1Gb/s connectivity over Ethernet, simply because you are likely to run into I/O limitations quickly, and to build the cluster you have to use the software initiator for iSCSI, so there is CPU overhead associated with the configuration. The host generally has a limited number of ports, so you end up sharing the networking between normal traffic and iSCSI traffic, which can be problematic as well. Does that mean it's not possible? No – I have a number of clients with these setups and they work fine for their workloads, but it's not a configuration I would recommend if we were planning a new setup from the ground up.

The big thing I work through with clients when they are considering whether to cluster VMs for SQL Server is the business requirements for availability and whether or not those can be met without having to leverage one of the SQL Server HA options. With the changes in VMware ESX 5 and Hyper-V 2012, you can scale VMs considerably, and both platforms allow for virtualized NUMA configurations inside the guest VM for scalability, so the performance and sizing considerations I had two years ago are no longer primary concerns for me. If we need 16 vCPUs and 64GB RAM for the nodes, with the correct host configuration we can easily do that, and we can do it without performance problems while using Standard Edition licensing if we plan the infrastructure correctly.

In my previous post on this topic I linked to a number of VMware papers, and in the post prior to that one I linked to even more papers that cover best practice considerations for configuring and sizing the VMs, how to configure the VMs for clustering, and many other topics. Newer versions of these documents exist for ESX 5, and a number also exist for Hyper-V. I recommend that anyone looking at running SQL Server in a VM, whether as a clustered instance or not, spend some time reading through the papers for the hypervisor they plan to run the VM on, so they understand how it works, the best practices for running SQL Server on that hypervisor, and what to look for when troubleshooting problems should they occur.

In the end, Microsoft supports SQL Server failover clustering on SVVP-certified hypervisors, so there isn't a hard reason not to objectively evaluate whether VMs might be an appropriate fit for your business requirements.  When I teach about virtualization in our IE3: High Availability & Disaster Recovery class, most of the perceptions at the start of the virtualization module are negative towards SQL Server on VMs, often from past experiences of failed implementations.  By the end of the demos for the module, most of those opinions have changed, and in a lot of cases attendees have identified a problem and been able to communicate it correctly to their VM administrator to get it fixed while I am still demonstrating specific problems and their causes.  In the last year I've set up a number of SQL Server clusters on VMs for clients where it was the best fit for their needs.  If you would like assistance with reviewing your infrastructure and business requirements and determining the best configuration for your needs, I'd be happy to work with you as well.