The Accidental DBA (Day 29 of 30): Troubleshooting Deadlocks

This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

Deadlocks occur in SQL Server whenever two processes attempt to acquire locks on the same resources in opposite orders, creating a persistently blocked condition in which neither session can continue to execute until the other releases its locks. In SQL Server, the Lock Monitor background task is responsible for detecting and resolving deadlocks whenever they occur (essentially terminating the session that has performed the least amount of work so far), and the resulting 1205 error can be a sign of problems that require further evaluation. In a third-party vendor application it may not be possible to make the changes necessary to eliminate deadlocks, but you can still collect information about the deadlocks to assist the vendor in analyzing the problem and possibly identifying a solution.
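To make the reverse locking order concrete, here is a minimal sketch of the classic deadlock pattern; dbo.TableA and dbo.TableB are hypothetical tables, and each block must be run in a separate session to reproduce the deadlock:

-- Session 1:
BEGIN TRANSACTION;
UPDATE dbo.TableA SET Col1 = 1 WHERE ID = 1;  -- exclusive lock on a TableA row
WAITFOR DELAY '00:00:05';                     -- give session 2 time to lock TableB
UPDATE dbo.TableB SET Col1 = 1 WHERE ID = 1;  -- blocks, waiting on session 2
COMMIT TRANSACTION;

-- Session 2:
BEGIN TRANSACTION;
UPDATE dbo.TableB SET Col1 = 1 WHERE ID = 1;  -- exclusive lock on a TableB row
WAITFOR DELAY '00:00:05';
UPDATE dbo.TableA SET Col1 = 1 WHERE ID = 1;  -- blocks, waiting on session 1: deadlock
COMMIT TRANSACTION;

Within a few seconds the Lock Monitor detects the cycle and terminates one of the two sessions with error 1205, allowing the other to complete.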

Collecting Information

Prior to SQL Server 2008, collecting deadlock information from SQL Server required enabling a trace flag, configuring a SQL Trace, using Event Notifications, or using a WMI alert. Trace flag 1222 or 1204 writes the deadlock information as text into the ERRORLOG, while SQL Trace, Event Notifications, and WMI alerts allow collection of the deadlock information as XML. Since the introduction of Extended Events and the system_health event session in SQL Server 2008, deadlock information has been captured by default, and analysis no longer requires enabling additional data collection.
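For example, a query along the following lines (a sketch against the documented sys.dm_xe_* DMVs) pulls the deadlock graphs retained in the system_health session's ring_buffer target; on earlier versions, trace flag 1222 (SQL Server 2005 onwards) or 1204 can be enabled instance-wide instead:

-- Pre-2008 alternative: write deadlock details to the ERRORLOG.
-- DBCC TRACEON (1222, -1);

-- SQL Server 2008+: extract xml_deadlock_report events from system_health.
SELECT XEventData.XEvent.query('(data/value/deadlock)[1]') AS DeadlockGraph
FROM
(   SELECT CAST(st.target_data AS XML) AS TargetData
    FROM sys.dm_xe_sessions AS s
    JOIN sys.dm_xe_session_targets AS st
        ON s.address = st.event_session_address
    WHERE s.name = 'system_health'
      AND st.target_name = 'ring_buffer'
) AS Data
CROSS APPLY TargetData.nodes
    ('RingBufferTarget/event[@name="xml_deadlock_report"]') AS XEventData(XEvent);

Note that the ring buffer only retains recent events, so on SQL Server 2012 and later the session's event_file target is a better source for older deadlocks.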

Analysis

The definitive source for understanding the output from trace flag 1222 is a series of blog posts written by Bart Duncan. His three-part series uses trace flag 1222 output to demonstrate how to read the deadlock-graph information, starting with Deadlock Troubleshooting, Part 1. The same method of analysis applies to deadlock graphs collected by SQL Trace, Event Notifications, WMI, and even Extended Events. The deadlock graph identifies the deadlock victim(s), each of the processes involved in the deadlock (within the process-list node), and the resources contributing to the deadlock (within the resource-list node). Each process has an assigned process id that uniquely identifies it in the graph and ties it to the resources it holds locked or has requested. Within each process's entry, the execution stack (tsql_stack) lists the statements from the most recently deadlocked statement backwards to the start of the execution call stack.
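Stripped to its skeleton (most attributes trimmed and object names hypothetical, but the element structure matching the xml_deadlock_report output), a deadlock graph has this shape:

<deadlock>
  <victim-list>
    <victimProcess id="process1" />
  </victim-list>
  <process-list>
    <process id="process1" ...>
      <executionStack>   <!-- tsql_stack: most recent statement first -->
        <frame ... />
      </executionStack>
    </process>
    <process id="process2" ... />
  </process-list>
  <resource-list>
    <keylock objectname="dbo.SomeTable" indexname="PK_SomeTable" ...>
      <owner-list>
        <owner id="process2" mode="X" />                      <!-- lock held -->
      </owner-list>
      <waiter-list>
        <waiter id="process1" mode="S" requestType="wait" />  <!-- lock requested -->
      </waiter-list>
    </keylock>
  </resource-list>
</deadlock>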

One of the key areas of focus for deadlock analysis is the resource-list portion of the graph, which contains all of the information about the resources involved and the lock types being held and requested by each of the processes. It also contains the object and index names, or the allocation unit associated with the object, from which the object and index names can be determined. Understanding the locking order between the processes is essential for deadlock troubleshooting.

In addition to viewing the raw XML or text information for the deadlock, it is also possible to view the information graphically, as explained in my blog post Graphically Viewing Extended Events Deadlock Graphs. The graphical view in Management Studio does not show all of the details available in the XML or text, but it can be a fast start for understanding the type of deadlock and the locking order. In some scenarios it may still be necessary to look at the text or XML for further information, or you can open the graph in SQL Sentry's excellent Plan Explorer Pro to get the full output parsed as a table as well.

Possible Solutions

There are many potential solutions for preventing deadlocks, and the correct one depends on the specific deadlock condition that is occurring. In some scenarios, an index change that covers one of the executing queries may be all that is necessary to make the deadlock condition impossible. In others, it may be necessary to change isolation levels, or to use locking hints to force a blocking lock that is incompatible with other locks, so that the deadlock condition is never encountered. Proper analysis of the deadlock graph will help determine the appropriate solution, but in most cases simple error-handling logic in Transact-SQL or .NET application code, which traps the 1205 error and resubmits the victim transaction, can prevent end users from being negatively affected by deadlocks (see the retry sketch below).
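As a sketch of that retry pattern, assuming dbo.SomeProc is a placeholder for the deadlock-prone work and the transaction is safe to resubmit (THROW requires SQL Server 2012; use RAISERROR on earlier versions):

DECLARE @retries INT = 3;
WHILE @retries > 0
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;
        EXEC dbo.SomeProc;        -- placeholder for the deadlock-prone work
        COMMIT TRANSACTION;
        SET @retries = 0;         -- success: exit the loop
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0
            ROLLBACK TRANSACTION;
        IF ERROR_NUMBER() = 1205 AND @retries > 1
        BEGIN
            SET @retries -= 1;                 -- deadlock victim: try again
            WAITFOR DELAY '00:00:00.100';      -- brief back-off before the retry
        END
        ELSE
            THROW;   -- not a deadlock, or out of retries: surface the error
    END CATCH
END;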

Summary

Troubleshooting deadlocks in SQL Server starts with collecting the deadlock graph information using one of the available methods. From SQL Server 2008 onwards, the system_health Extended Events session collects this information by default, eliminating the need to enable additional collection and then wait for the deadlocks to recur. Full details of how to configure deadlock-graph collection, and analysis of specific scenarios, can be found in my SimpleTalk article Handling Deadlocks in SQL Server and in my Pluralsight online training course SQL Server: Deadlock Analysis and Prevention.

The Accidental DBA (Day 24 of 30): Virtualization High Availability

This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

Virtualization has been popular for many years, and more and more businesses are moving low-latency line-of-business applications like SQL Server into virtual machines every day. One of the common reasons I’ve heard over the years for moving SQL Server to a virtual machine is that high availability is built in. Usually this translates into, “We don’t need to use SQL Server availability options because the VM already has HA.” That may be the case for some scenarios, but as the saying goes, “there’s no such thing as a free lunch.” In this post we’ll look at the high availability provided to virtual machines and the considerations to take into account when deciding whether to also implement SQL Server high availability in a virtualized environment.

Basic Virtual Machine HA

The high availability provided through virtualization depends on the configuration of the host environment on which the VMs are running. Typically, in a high-availability virtualization configuration, multiple host servers are clustered together using a shared-storage solution on a SAN, NFS, or NAS for the virtual machine hard disks. This provides resilience against the failure of one of the host servers by allowing the virtual machines to restart on one of the other hosts. Both Hyper-V and VMware provide automated detection of host failures and will restart the affected VMs automatically on another host, provided that sufficient resources exist to meet any reservations configured for the individual VMs.

VMs also gain availability advantages over physical servers through features like Live Migration/vMotion, and the ability to perform online storage migrations to move the virtual hard disks from one storage array to another that is available to the host(s). This can be very useful for planned maintenance windows, SAN upgrades, or for rebalancing load across the host servers in response to performance problems. The VM tools installed in the guest to improve performance and integration with the host can also monitor the guest’s availability through regular heartbeats, allowing the host to determine that a VM has crashed (for example, with a blue screen of death) and automatically restart the guest VM in response.

VM Specific HA Features

In addition to the basic high availability provided by virtualization, both VMware and Hyper-V offer VM-specific HA features for improving the availability of individual VMs. VMware introduced a feature for VM guests called Fault Tolerance in vSphere 4 that creates a synchronized secondary virtual machine on another host in the high-availability cluster, kept in lockstep with the primary. In the event of a host failure, guests that have Fault Tolerance enabled immediately fail over to their secondary, in a manner similar to a vMotion operation, preventing application downtime. At the same time, a new secondary VM is created on another host inside the cluster and synchronized with the new primary, maintaining the fault tolerance of the guest inside the environment. Unfortunately, Fault Tolerance is limited to a single virtual CPU, even in ESX 5.1, so it is not likely to be used with SQL Server VMs.

Hyper-V does not currently provide an equivalent feature to VMware Fault Tolerance, even in Server 2012. Hyper-V 2012 introduced Replicas, which provide disaster recovery through replication to a remote data center with manual failover, but which do not provide automated failover in the manner of Fault Tolerance.

SQL Server Considerations

The primary consideration I ask about when it comes to SQL Server high availability on virtualization is whether or not it is acceptable to incur the planned downtime associated with routine maintenance tasks such as patching the Windows Server OS, or patching SQL Server with Service Packs or Cumulative Updates. If planned downtime for patching is acceptable, then the high availability provided by virtualization may meet your business requirements. However, I would always recommend testing a host failure to determine the total time required: detecting the failure, restarting the VM on another host, booting Windows, and having SQL Server perform crash recovery to make the databases available again. This may take 3-5 minutes, or even longer depending on the environment, which may not fit within your downtime SLAs.
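One rough way to measure the crash-recovery portion of that test is to search the error log after the restart; sp_readerrorlog is undocumented but widely used, and the message text searched for here matches the standard recovery messages:

-- Instance-wide recovery completion message in the current error log:
EXEC sp_readerrorlog 0, 1, 'Recovery is complete';

-- Per-database recovery progress and duration messages:
EXEC sp_readerrorlog 0, 1, 'Recovery of database';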

If planned downtime for applying server patches is not possible, you will need to pick a SQL Server availability option using the same considerations as you would for a physical server implementation. Support for Failover Clustering of SQL Server on SVVP-certified platforms was introduced in 2008, and Database Mirroring and Availability Groups are also supported under server virtualization. However, none of the SQL Server high-availability options are supported in conjunction with Hyper-V Replicas, so there are additional limitations to consider whenever you combine features on top of server virtualization. One limitation that should always be factored into the decision to virtualize SQL Server and use its native high-availability options is the added complexity that the virtualization layer brings to the configuration.

The Accidental DBA (Day 22 of 30): Determining a High-Availability Strategy

This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

A lot of times when I talk to people about high availability for SQL Server, the first thing they think of is a failover cluster. While failover clustering is one of the high-availability features provided by SQL Server, it is not the only option available. Selecting the correct high-availability strategy should be a part of the initial planning of a SQL Server installation, but I commonly find that high availability only becomes a consideration after a problem has caused downtime for SQL Server. When high availability is considered as an afterthought, the cost of implementing the chosen strategy may be higher than if it had been implemented initially. In this post we’ll take a look at the considerations necessary for choosing a high-availability strategy.

SQL Server offers many high-availability features such as database mirroring, failover clustered instances, availability groups, and log shipping. Even transactional replication can be used as a high-availability option.

Gather Requirements

The first step in the process of determining a high-availability strategy for SQL Server is gathering business requirements to establish SLAs. Paul already covered RTO and RPO in his blog post (Day 6 of 30): Backups: Understanding RTO and RPO, and these are the key metrics that need to be understood as a part of requirements gathering. It is important to ensure that realistic expectations are set for the availability requirements of the solution. Otherwise, when we get to the next step and begin evaluating our limitations, we may need to come back to this first step to reevaluate what is actually going to be possible.

Evaluate Limitations

Once we know the expected SLAs for the solution we can then begin to evaluate the limitations that exist to determine if we will be able to meet those SLAs or not. Limitations can generally be categorized as technical or non-technical.  For example, the budget is a non-technical limitation that is going to be important in picking technologies and determining if you can meet the SLA requirements.  There is no point in designing an entire solution around multiple data centers and SAN-based replication if the budget for high availability is only $10K.  Other important non-technical limitations to consider are the skill-set requirements for maintaining an implementation, space availability in the data center, power and cooling requirements, and even the time requirements for the implementation.  Any or all of these can be overcome if the budget supports expanding data-center space, additional training for employees, or even hiring experienced consultants to perform/help with the initial implementation when time constraints exist.

Technical limitations are limitations around SQL Server itself that affect which technologies can be used. For example, if you can only use Standard Edition of SQL Server 2012, the instance will be limited to the lesser of 4 sockets or 16 cores and to 64GB of RAM, and you won’t be able to leverage newer features like availability groups. Other technical factors, such as the volume of transaction log generation, average transaction size, database recovery model, and whether or not you can modify the application, will also narrow the options available when you begin selecting technologies based on the limitations and requirements. A few quick checks, sketched below, can establish some of these constraints.
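Both queries below use documented views; note that the ‘Log Bytes Flushed/sec’ counter is cumulative, so sample it twice and difference the values to get a rate:

-- Version and edition in use:
SELECT SERVERPROPERTY('ProductVersion') AS ProductVersion,
       SERVERPROPERTY('Edition')        AS Edition;

-- Transaction log generation per database (cumulative counter):
SELECT instance_name AS DatabaseName,
       cntr_value    AS LogBytesFlushed
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Log Bytes Flushed/sec'
  AND object_name LIKE '%Databases%';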

Selecting Technologies

Once you have the requirements and limitations sorted out, it is time to review the available technologies and determine whether it is actually possible to meet the SLAs within the existing limitations. It is not uncommon to find that the existing limitations will not support the business requirements and SLAs, and when this occurs it is important to present the limitations and explain that the SLAs won’t be achievable within the current constraints. One of two things can happen at this point: the business requirements may be revised to work within the existing limitations, or the requirements can be prioritized to determine the order of importance for the design. The specifics of the availability options in SQL Server will be covered in the next few posts in this series, but it is critical to ensure that all of the features required by the application will be supported by the selected technology. For example, if the database uses FILESTREAM, database mirroring won’t be a viable high-availability option; likewise for databases that don’t use the FULL recovery model, which is required for database mirroring and availability groups. Two quick catalog checks for these are sketched below.
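Both checks can be run against the system catalog; FILESTREAM files are recorded with type 2 in sys.master_files:

-- Recovery model per database (FULL is required for database mirroring
-- and availability groups):
SELECT name, recovery_model_desc
FROM sys.databases;

-- Databases containing FILESTREAM data, which rules out database mirroring:
SELECT DISTINCT DB_NAME(database_id) AS DatabaseName
FROM sys.master_files
WHERE type = 2;   -- 2 = FILESTREAM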

Testing, Validation, and Documentation

Testing and validation are generally a part of the initial implementation of a new high-availability strategy, but they should also become routine tasks that continuously confirm the implementation still meets the business requirements and SLAs. Documentation of the solution configuration, as well as the failover and recovery plan, is an important part of implementing any high-availability solution. The documentation should ideally be written by a senior team member, but then tested by the most junior person, to ensure that all of the necessary steps for performing a failover or recovering from a problem have been appropriately addressed. During ongoing validation, the documentation should be updated for any new limitations that are discovered, or as the configuration changes, so that the solution continues to meet the business SLAs.

Summary

Care must be taken when formulating a high-availability strategy to ensure that it meets the business requirements while also working within the existing limitations: gather requirements, evaluate limitations, select the appropriate technologies, and then test, validate, and document the solution.
