This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!
SQL Server uses locking for a variety of things – from protecting and isolating resources to “indicators” showing that someone is using a database or accessing a table. Many locks are compatible with other locks; it’s not always about limiting a resource to only one session. But, when locks are incompatible blocking can occur. If the blocker is efficient the blocked user might not even realize that they were momentarily blocked. And, a normal system will always have some blocking; it’s a natural occurrence. However, if there are long-running transactions or transactions that affect large numbers of rows – blocking can really feel like a major problem. But, is it really the locks? Or, instead the blocker?
Root Cause Analysis
In many cases, root cause analysis will reveal some inefficiency in your environment:
- Imagine a poorly coded update that sets every column to the value in the dialog because they’re too lazy to determine WHICH column(s) have actually changed. This causes modifications in the data row and across numerous indexes that aren’t even changing.
- Imagine an update that only needs to modify a subset of rows but has no index to help search
- Imagine a transaction that begins, modifies some data, and then waits for user input. The locks are held at the time the data is modified and modification-related locks are not released until the transaction completes. In this case it’s an indefinite amount of wait time. This can cause HUGE problems – not just blocking but also in logging (and therefore recovery).
My point, locking is not always the main cause of the problem but, it often gets blamed. Instead, locking and blocking is really a symptom of some inefficiency and further analysis will help you to better understand where your real problem is. But, how do you analyze it?
When performance is poor there are many options to check. In general, we always recommend starting with wait statistics. In Erin’s post The Accidental DBA (Day 25 of 30): Wait Statistics Analysis she mentions using sys.dm_os_wait_stats. Regularly using this and collecting a baseline of your server’s general characteristics will help you when your system is slow or to see if something’s changed (or changing) over time. Be sure to read Erin’s post as well as the posts she references. The more you know about your server when it’s healthy, the more equipped you’ll be when there’s a problem.
And, if you have a blocking situation right now then the DMV to use is sys.dm_os_waiting_tasks. This can tell you if someone is blocked and which SPID (server process ID) is the blocker. However, this can quickly become complicated if there are multiple connections (SPIDs) involved. Sometimes, finding who is at the head of the chain is part of the problem. And, since you’ll need to know more about what’s going on, you’ll want to use sys.dm_tran_locks. And, instead of reinventing the wheel, check out Glenn Berry’s A DMV A Day – Day 27 (sys.dm_tran_locks), specifically, for the blocking query that orders by wait_duration_ms DESC. This will give you an idea of who’s at the head of the chain because the lock held the longest will be at the top – showing who they’re being blocked by. This will lead you to the blocker. But, what are they doing?
Once you know the SPID at the head of the chain, you can use a variety of commands to start piecing together what’s happening. But, I’d actually recommend a few other things instead:
- The completely OLD SCHOOL method is sp_blocker_pss08. You can get the code from this KB article [KB 271509]. The article says it’s only for SQL Server 2000 and SQL Server 2005 but it still works well – even on SQL Server 2012. And, if your company has an issue with running third party products, then this might work out well for you. It’s simple, it’s just TSQL and it gives you a variety of pieces of information if something is blocked right now.
- The up-to-date way to determine locking problems is to use SQLDiag. But, there’s a bit of a learning curve with this as it’s command-line based and requires a bit more work than just the execution of an sp. You should definitely get some time with it but if you’re trying to troubleshoot a blocking problem right now, now is not the time to learn SQLDiag.
- The easiest (third-party tool) is Adam Machanic’s sp_WhoIsActive and it really does a nice job of producing the information that the old sp_blocker_pss08 produces but in tabular form. And, Adam has blogged quite a bit of information about using this utility.
And, if you’re trying to see if patterns exist over time, consider Using the Blocked Process Report in SQL Server 2005/2008. Jonathan did a great write-up of how to set this and use this to generate a trace of the activity that’s running at the time a process hits 5 seconds of being blocked.
Ultimately, you need to find out who is doing the blocking first – why is their transaction taking so long? If it’s due to inefficiencies in the query – can you rewrite it? If it’s due to inefficiencies in the plan – can you add an index? If it’s modifying a large amount of data – can you break it down into smaller chunks so that each set is locked for a shorter period of time? These are ALWAYS the thing to try first.
Consider Row Versioning
If you truly have an optimized system and it’s highly active with both readers and writers who are just constantly getting in each other’s way (and causing blocking), then you might consider using a form of row versioning. This is much more complicated than a quick post can capture but I’ve see “snapshot isolation” (as it’s often called) explained incorrectly numerous places. Simply put, you can have your database in one of FOUR states:
- Read Committed using Locking: this is the default – with NONE of the row versioning options enabled.
- Statement-level Read Consistency (or, read committed using row versioning): this is what you run if you turn on ONLY the database option read_commmitted_snapshot. This causes readers (in read committed isolation) to use versioned-based reads guaranteeing them a definable point in time to which their QUERY (or, statement) reconciles. Each statement reconciles to the point in time when that statement began.
- Transaction-level Read Consistency (or, Snapshot Isolation): this is what you get if you turn on ONLY the database option allow_snapshot_isolation. This ALLOWs users to request a versioned-based read and in a transaction, will cause ALL reads in the transaction to reconcile to when the transaction began. However, it’s important to note that this option adds the overhead of versioning but readers will use locking unless they request a snapshot isolation session using: SET TRANSACTION ISOLATION LEVEL SNAPSHOT.
- The forth state is when BOTH database options have been set. If you turn on both read_committed_snapshot and allow_snapshot_isolation then all statement’s reconcile to the point in time when the statement started (in read-committed). Or, if you’ve changed your isolation to snapshot then each statement will reconcile to the point in time when the transaction began.
NOTE: There are numerous places [online] where it’s stated that both of these options are required for versioning; this is incorrect. You can have neither, only one, or both. All four states produce different behaviors.
The beauty of versioning is that readers don’t block writers and writers don’t block readers. Yes, I know it sounds fantastic but be careful, it’s not free. You can read the detailed whitepaper that I wrote about it when it was first released in SQL Server 2005 (updated by Neal Graves): SQL Server 2005 Row Versioning-Based Transaction Isolation. And, I also did a video on it here: http://technet.microsoft.com/en-US/sqlserver/gg545007.aspx. And, if you think that the overhead might be too much for your system, check out the case study from Nasdaq: Real-Time Reporting with Snapshot Isolation.
Ultimately, versioning might be what you’re looking for but don’t jump right to this without thoroughly tuning your environment.
- Find where you’re waiting most
- Do root cause analysis to get to the code or SPID that’s causing you grief
- Analyze the code to see if it can be changed to reduce the time that the locks are held
- Consider changing isolation
Thanks for reading!
PS – Stay tuned for day 29 when Jonathan talks about when blocking becomes deadly! :)