Continuing my series on advanced performance troubleshooting - see these two posts for the scripts I'll be using and an introduction to the series:

In this blog post I'd like to show you an example of SOS_SCHEDULER_YIELD waits occurring and how it can seem like a spinlock is the cause.

I originally published this blog post and then had a discussion with good friend Bob Ward from Product Support who questioned my conclusions given what he's seen (thanks Bob!). After digging in further, I found that my original post was incorrect, so this is the corrected version.

The SOS_SCHEDULER_YIELD wait means that an executing thread voluntarily gave up the CPU to allow other threads to execute. The SQL Server code is sprinkled with "voluntary yields" in places where high CPU usage may occur.

One such place where a thread will sleep but not explicitly yield is when backing off after a spinlock collision waiting to see if it can get access to the spinlock. A spinlock is a very lightweight synchronization mechanism deep inside SQL Server that protects access to a data structure (not database data itself). See the end of the second blog post above for more of an explanation on spinlocks.

When a thread voluntarily yields, it does not go onto the waiting tasks list - as it's not waiting for anything - but instead goes to the bottom of the RUNNABLE queue. SOS_SCHEDULER_YIELD waits by themselves are not cause for concern unless they are the majority of waits on the system, and performance is suffering.

To set up the test, I'll create a simple database and table:

USE master;
GO
DROP DATABASE YieldTest;
GO

CREATE DATABASE YieldTest;
GO

USE YieldTest;
GO

CREATE TABLE SampleTable (c1 INT IDENTITY);
GO
CREATE NONCLUSTERED INDEX SampleTable_NC ON SampleTable (c1);
GO

SET NOCOUNT ON;
GO
INSERT INTO SampleTable DEFAULT VALUES;
GO 100

Then I'll clear out wait stats and latch stats:

DBCC SQLPERF ('sys.dm_os_latch_stats', CLEAR);
GO

DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR);
GO

And then fire up 50 clients running the following code (I just have a CMD script that fires up 50 CMD windows, each running the T-SQL code):

USE YieldTest;
GO

SET NOCOUNT ON;
GO

DECLARE @a int
WHILE (1=1)
BEGIN 
    SELECT @a = COUNT (*) FROM YieldTest..SampleTable WHERE c1 = 1;
END;
GO

And the CPUs in my laptop are jammed solid (snapped from my desktop and rotated 90 degrees to save space):

 

Wow!

Looking at perfmon:

 

The CPU usage is not Windows (% User Time), and of all the counters I usually monitor (I've cut them all out here for clarity) I can see a sustained very high Lock Requests/sec for Object and Page locks (almost 950 thousand requests per second for both types! Gotta love my laptop :-)

So let's dig in. First off, looking at wait stats (using the script in the wait stats post referenced above):

WaitType            Wait_S  Resource_S Signal_S WaitCount Percentage AvgWait_S AvgRes_S AvgSig_S
------------------- ------- ---------- -------- --------- ---------- --------- -------- --------
SOS_SCHEDULER_YIELD 4574.77 0.20       4574.57  206473    99.43      0.0222    0.0000   0.0222

SOS_SCHEDULER_YIELD at almost 100% of the waits on the system means that I've got CPU pressure - as I saw from the CPU graphs above. The fact that nothing else is showing up makes me suspect this is a spinlock issue.

Checking in the sys.dm_os_waiting_tasks output (see script in the second blog post referenced above), I see nothing waiting, and if I refresh a few times I see the occasional ASYNC_NETWORK_IO and/or PREEMPTIVE_OS_WAITFORSINGLEOBJECT wait type - which I'd expect from the CMD windows running the T-SQL code.

Checking the latch stats (again, see script in the second blog post referenced above) shows no latch waits apart from a few BUFFER latch waits.

So now let's look at spinlocks. First off I'm going to take a snapshot of the spinlocks on the system:

-- Baseline
DROP TABLE ##TempSpinlockStats1;
GO
SELECT * INTO ##TempSpinlockStats1 FROM sys.dm_os_spinlock_stats
WHERE [collisions] > 0
ORDER BY [name];
GO

(On 2005 you'll need to use DBCC SQLPERF ('spinlockstats') and use INSERT/EXEC to get the results into a table - see the post above for example code.)

Then wait 10 seconds or so for the workload to continue... and then grab another snapshot of the spinlocks:

-- Capture updated stats
DROP TABLE ##TempSpinlockStats2;
GO
SELECT * INTO ##TempSpinlockStats2 FROM sys.dm_os_spinlock_stats
WHERE [collisions] > 0
ORDER BY [name];
GO

Now running the code I came up with to show the difference between the two snapshots:

-- Diff them
SELECT
    '***' AS [New],
    ts2.[name] AS [Spinlock],
    ts2.[collisions] AS [DiffCollisions],
    ts2.[spins] AS [DiffSpins],
    ts2.[spins_per_collision] AS [SpinsPerCollision],
    ts2.[sleep_time] AS [DiffSleepTime],
    ts2.[backoffs] AS [DiffBackoffs]
FROM ##TempSpinlockStats2 ts2
LEFT OUTER JOIN ##TempSpinlockStats1 ts1
    ON ts2.[name] = ts1.[name]
WHERE ts1.[name] IS NULL
UNION
SELECT
    '' AS [New],
    ts2.[name] AS [Spinlock],
    ts2.[collisions] - ts1.[collisions] AS [DiffCollisions],
    ts2.[spins] - ts1.[spins] AS [DiffSpins],
    CASE (ts2.[spins] - ts1.[spins]) WHEN 0 THEN 0
        ELSE (ts2.[spins] - ts1.[spins]) / -- > 0 spins = > 0 collisions
        (ts2.[collisions] - ts1.[collisions]) END AS [SpinsPerCollision],
    ts2.[sleep_time] - ts1.[sleep_time] AS [DiffSleepTime],
    ts2.[backoffs] - ts1.[backoffs] AS [DiffBackoffs]
    --, ts2.*
FROM ##TempSpinlockStats2 ts2
LEFT OUTER JOIN ##TempSpinlockStats1 ts1
    ON ts2.[name] = ts1.[name]
WHERE ts1.[name] IS NOT NULL
    AND ts2.[collisions] - ts1.[collisions] > 0
ORDER BY [New] DESC, [Spinlock] ASC;
GO

New  Spinlock                        DiffCollisions DiffSpins  SpinsPerCollision DiffSleepTime DiffBackoffs
---- ------------------------------- -------------- ---------- ----------------- ------------- ------------
     LOCK_HASH                       6191134        4005774890 647               686           1601383
     OPT_IDX_STATS                   1164849        126549245  108               57            7555
     SOS_OBJECT_STORE                73             305        4                 0             0
     SOS_WAITABLE_ADDRESS_HASHBUCKET 115            44495      386               0             3
     XDESMGR                         1              0          0                 0             0

This is telling me that there is massive contention for the LOCK_HASH spinlock, further backed up by the SOS_WAITABLE_ADDRESS_HASHBUCKET spinlock. The LOCK_HASH spinlock protects access to the hash buckets used by the lock manager to efficiently keep track of the lock resources for locks held inside SQL Server and to allow efficient searching for lock hash collisions (i.e. does someone hold the lock we want in an incompatible mode?). In this case the contention is so bad that instead of just spinning, the threads are actually backing off and letting other threads execute to allow progress to be made.

And that makes perfect sense because of what my workload is doing - 50 concurrent connections all trying to read the same row on the same page in the same nonclustered index.

But is that the cause of the SOS_SCHEDULER_YIELDs? To prove it one way or the other, I created an XEvent session that would capture callstacks when a wait occcured:

CREATE EVENT SESSION MonitorWaits ON SERVER
ADD EVENT sqlos.wait_info
 (ACTION (package0.callstack))
      ADD TARGET package0.asynchronous_bucketizer (
            SET filtering_event_name = 'sqlos.wait_info',
            source_type = 1, source = 'package0.callstack')        
      WITH (MAX_MEMORY=50MB, MEMORY_PARTITION_MODE = PER_NODE,
      max_dispatch_latency=5 seconds)
GO

I also made sure to have the correct symbol files in the \binn directory (see How to download a sqlservr.pdb symbol file). After running the workload and examining the callstacks, I found the majority of the waits were coming from voluntary yields deep in the Access Methods code. An example call stack is:

IndexPageManager::GetPageWithKey+ef [ @ 0+0x0
GetRowForKeyValue+146 [ @ 0+0x0
IndexRowScanner::EstablishInitialKeyOrderPosition+10a [ @ 0+0x0
IndexDataSetSession::GetNextRowValuesInternal+7d7 [ @ 0+0x0
RowsetNewSS::FetchNextRow+12a [ @ 0+0x0
CQScanRangeNew::GetRow+6a1 [ @ 0+0x0
CQScanCountStarNew::GetRowHelper+44 [ @ 0+0x0
CQScanStreamAggregateNew::Open+70 [ @ 0+0x0
CQueryScan::Uncache+32f [ @ 0+0x0
CXStmtQuery::SetupQueryScanAndExpression+2a2 [ @ 0+0x0
CXStmtQuery::ErsqExecuteQuery+2f8 [ @ 0+0x0
CMsqlExecContext::ExecuteStmts<1,1>+cca [ @ 0+0x0
CMsqlExecContext::FExecute+58b [ @ 0+0x0
CSQLSource::Execute+319

[Edit: check out how to do this in the spinlock whitepaper

This is clearly (to me) nothing to do with the LOCK_HASH spinlock, so that's a red herring. In this case, I'm just CPU bound. When a thread goes to sleep when backing off from a spinlock, it directly calls Windows Sleep() - so it does not show up as a SQL Server wait type at all, even though the sleep call is made from the SQLOS layer. Which is a bummer.

How to get around that and reduce CPU usage? This is really contrived workload, but this can occur for real. Even if I try using WITH (NOLOCK), the NOLOCK seek will take a table SCH_S (schema-stability) lock to make sure the table structure doesn't change while the seek is occurring, so that only gets rid of the page locks, not the object locks and doesn't help CPU usage. With this (arguably weird) workload, there are a few things I could do (just off the top of my head):

  • Scale-out the connections to a few copies of the database, updated using replication
  • Do some mid-tier or client-side caching, with some notification mechanism of data changes (e.g. a DDL trigger firing a Service Broker message)

So this was an example of where wait stats lead to having to look at spinlock stats, but that the spinlock, on even *deeper* investigation, wasn't the issue at all. Cool stuff.

Next time we'll look at another cause of SOS_SCHEDULER_YIELD waits.

Hope you enjoyed this - let me know your thoughts!

Cheers

PS In the latest SQLskills Insider Quick Tips newsletter from last week, I did a video demo of looking at transaction log IOs and waits - you can get it here.

(Look in the Misconceptions blog category for the rest of the month's posts and check out the 60-page PDF with all the myths and misconceptions blog posts collected together: CommonSQLServerMyths.pdf (732.96 kb))

Another really commonly-held belief...

Myth #23: lock escalation goes row-to-page and then page-to-table.

FALSE

Nope, never. Lock escalation in SQL Server 2005 and before goes directly to a table lock always.

In SQL Server 2005 (and 2008) you can change the behavior of lock escalation (if you really know what you're doing) using these trace flags:

  • 1211 - disables lock escalation totally and will allow lock memory to grow to 60% of dynamically allocated memory (non-AWE memory for 32-bit and regullar memory for 64-bit) and will then further locking will fail with an out-of-memory error
  • 1224 - disables lock escalation until 40% of memory is used and then re-enables escalation

1211 takes precedence over 1224 if they're both set - so be doubly careful. You can find more info on these trace flags in Books Online.

In SQL Server 2008, you can change the behavior of lock escalation per table using the ALTER TABLE blah SET (LOCK_ESCALATION = XXX) where XXX is one of:

  • TABLE: always escalate directly to a table lock.
  • AUTO: if the table is partitioned, escalate to a partition-level lock, but then don't escalate any further.
  • DISABLE: disable lock escalation. This doesn't disable table locks - as the Books Online entry says, a table lock may be required under some circumstances, like a table scan of a heap under the SERIALIZABLE isolation level.

Back in January 2008 I blogged an example of setting up a partitioned table and showing partition-level lock escalation in action - see SQL Server 2008: Partition-level lock escalation details and examples.

You may ask why the AUTO option isn't the default in SQL Server 2008? It's because some early-adopters found that their applications started to deadlock using that option. So, just as with the lock escalation trace flags, be careful about turning on the AUTO option in SQL Server 2008.

(Look in the Misconceptions blog category for the rest of the month's posts and check out the 60-page PDF with all the myths and misconceptions blog posts collected together: CommonSQLServerMyths.pdf (732.96 kb))

Another short one today as I'm still teaching a class. (Ok - I'm actually writing this at lunchtime on 4/7/10 - and it turned out to be a little longer than I thought...)

Myth #8: Online index operations do not acquire locks.

FALSE

Online index operations are not all unicorns and rainbows (for information about unicorns and rainbows - see http://whiteboardunicorns.com/ - safe for work).

Online index operations acquire short-term locks at the beginning and end of the operation, which can cause significant blocking problems.

At the start of the online index operation, a shared (S lock mode) table lock is required. This lock is held while the new, empty index is created; the versioned scan of the old index is started; and the minor version number of the table schema is bumped by 1.

The problem is, this S lock is queued along with all other locks on the table. No locks that are incompatible with an S table lock can be granted while the S lock is queued or granted. This means that updates are blocked until the lock has been granted and the operation started. Similarly, the S lock cannot be granted until all currently executing updates have completed, and their IX or X table locks dropped.

Once all the setup has been done (this is very quick), the lock is dropped, but you can see how blocking of updates can occur. Bumping the minor version number of the schema causes all query plans that update the table to recompile, so they pick up the new query plan operators to maintain the in-build index.

While the long-running portion of the index operation is running, no locks are held (where 'long' is the length of time your index rebuild usually takes).

When then index operation has completed, the new and old indexes are in lock-step as far as updates are concerned. A schema-modification lock (SCH_M lock mode) is required to complete the operation. You can think of this as a super-table-X lock - it's required to bump the major version number of the table - no operations can be running on the table, and no plans can be compiling while the lock is held.

There's a blocking problem here that's similar to when the S lock was acquired at the start of the operation - but this time, no read or write operations can start while the schema-mod lock is queued or granted, and it can't be granted until all currently running read and write activity on the table has finished.

Once the lock is held, the allocation structures for the old index are unhooked and put onto the deferred-drop queue, the allocation structure for the new index are hooked into the metadata for the old index (so the index ID doesn't change), the table's major version number is bumped, and hey presto! You've got a sparkly new index.

As you can see - plenty of potential for blocking at the start and end of the online index operation. So it should really be called 'almost online index operations', but that's not such a good marketing term...

You can read more about online index operations in the whitepaper Online Indexing Operations in SQL Server 2005.

One of the things I love teaching is how the transaction log and logging/recovery work. I presented a session on this at both PASS and SQL Connections in the last two weeks, and in both sessions I promised to write some blog posts about the deep internals of logging operations. This is the first one in the series. Previous blog posts that dive into logging operations are:

Ok, on with the show. 

SQL Server 2005 introduced a feature called 'fast recovery' in Enterprise Edition. This allows a database to become available for use after the first part of recovery (REDO) completes and before the (usually longer running) second part of recovery (UNDO) completes. See my TechNet Magazine article Understanding Logging and Recovery in SQL Server if you don't know what I'm talking about. But how does SQL Server do this?

The answer is lock logging. A log record describes a single change made to a database. For log records describing changes that can be used as part of UNDO (yes, some changes to the database are one-way only - for instance PFS page changes), from 2005 onwards the log record also includes a description of which locks were being held at the time the change was made. These locks were necessary to protect the change being made when the original transaction was running (before the crash) and so the same locks will be necessary to protect the anti-operation which reverses the change. I'll explain more about these anti-operations in one of the next in-depth logging blog posts.

The Storage Engine does two passes through the log as part of crash recovery. The first pass does REDO and also reads the log records that will be processed as part of the second pass (UNDO), looking at the lock description and actually acquiring those locks. For fast recovery, at that point the database is brought online. This is possible because the recovery system knows that it already has the correct locks to guarantee that it can safely generate and perform the anti-operations necessary to perfom UNDO. One side-effect of this is that although the database is available for use, a query may bump into one of the locks being held to allow fast recovery - in which case it will have to wait for that lock to be dropped as UNDO progresses.

Cool eh?

That was the introduction to allow me to do some gratuitous spelunking around the internals :-) I'm going to create a few simple examples to show you lock logging in the log. Now, don't get confused - it's not logging actual locks (the memory used to hold the lock itself), it's just logging a description of which locks were held and in which modes.

Here's the script to create a database with a simple table. I'm using a LOB column and specifically setting it to be stored off row (see Importance of choosing the right LOB storage technique) so we can see some text page locks too. I'm using the SIMPLE recovery model for simplicity (ha ha) - so I can clear the log when a checkpoint occurs rather than having to muck around with log backups. I'll insert the first row and then clear the log.

CREATE DATABASE LockLogging;
GO
USE LockLogging;
GO

CREATE TABLE LockLogTest (c1 INT, c2 INT, c3 VARCHAR (MAX));
GO
EXEC sp_tableoption 'LockLogtest', 'large value types out of row', 'on';
GO

INSERT INTO LockLogTest VALUES (1, 1, 'a');
GO

ALTER DATABASE LockLogging SET RECOVERY SIMPLE;
GO
CHECKPOINT;
GO

Now let's try the first operation - a simple insert - and look at the log records using fn_dblog (and I'm skipping the checkpoint log records):

INSERT INTO LockLogTest VALUES (2, 2, 'b');
GO
SELECT [Operation], [Context], [Page ID], [Slot ID], [Number of Locks] AS Locks, [Lock Information]
FROM fn_dblog (NULL, NULL);
GO

Operation        Context       Page ID        Slot ID  Locks Lock Information
---------------- ------------- -------------- -------- ----- -------------------------------------------------------------------
LOP_BEGIN_XACT   LCX_NULL      NULL           NULL     NULL  NULL
LOP_INSERT_ROWS  LCX_TEXT_MIX  0001:00000098  1        2     ACQUIRE_LOCK_IX PAGE: 18:1:152; ACQUIRE_LOCK_X RID: 18:1:152:1
LOP_INSERT_ROWS  LCX_HEAP      0001:0000009a  1        3     ACQUIRE_LOCK_IX OBJECT: 18:2073058421:0;
                                                                 ACQUIRE_LOCK_IX PAGE: 18:1:154; ACQUIRE_LOCK_X RID: 18:1:154:1
LOP_COMMIT_XACT  LCX_NULL      NULL           NULL     NULL  NULL

We can see page IX and row X locks for the LOB value being inserted into the text page, plus table IX, page IX, and row X locks for the data record being inserted into the heap. The lock resources break out as follows:

  • 18:1:152 is page 152 in file 1 of database ID 18
  • 18:1:152:1 is slot 1 on page 152 in file 1 of database ID 18
  • 18:2073058421:0 is object ID 2073058421 (the object ID of the table LockLogTest) in database ID 18

Notice also the LOP_BEGIN_XACT and LOP_COMMIT_XACT log records - even though I didn't do an explicit transaction, SQL Server has to start one internally for me (called an implicit transaction) so that there's a boundary for where to rollback if something goes wrong during the operation. 

And now an update operation (with a checkpoint first to clear out the log):

CHECKPOINT;
GO
UPDATE LockLogTest SET c1 = 3;
GO
SELECT [Operation], [Context], [Page ID], [Slot ID], [Number of Locks] AS Locks, [Lock Information]
FROM fn_dblog (NULL, NULL);
GO

Operation        Context       Page ID        Slot ID  Locks Lock Information
---------------- ------------- -------------- -------- ----- -------------------------------------------------------------------
LOP_BEGIN_XACT   LCX_NULL      NULL           NULL     NULL  NULL
LOP_MODIFY_ROW   LCX_HEAP      0001:0000009a  0        3     ACQUIRE_LOCK_IX OBJECT: 18:2073058421:0;
                                                                 
ACQUIRE_LOCK_IX PAGE: 18:1:154; ACQUIRE_LOCK_X RID: 18:1:154:0
LOP_MODIFY_ROW   LCX_HEAP      0001:0000009a  1        3     ACQUIRE_LOCK_IX OBJECT: 18:2073058421:0;
                                                                 ACQUIRE_LOCK_IX PAGE: 18:1:154; ACQUIRE_LOCK_X RID: 18:1:154:1
LOP_COMMIT_XACT  LCX_NULL      NULL           NULL     NULL  NULL

Just as we expected - a table IX lock, a page IX lock, and two row X locks on that page.

Now, what about something more complicated like a TRUNCATE TABLE? Have you heard the myth about it not being logged? Right - it's a myth:

CHECKPOINT;
GO
TRUNCATE TABLE LockLogTest;
GO
SELECT [Operation], [Context], [Page ID], [Slot ID], [Number of Locks] AS Locks, [Lock Information]
FROM fn_dblog (NULL, NULL);
GO

Operation        Context       Page ID        Slot ID  Locks Lock Information
---------------- ------------- -------------- -------- ----- -------------------------------------------------------------------
LOP_BEGIN_XACT   LCX_NULL      NULL           NULL     NULL  NULL
LOP_LOCK_XACT    LCX_NULL      NULL           NULL     1     ACQUIRE_LOCK_SCH_M OBJECT: 18:2073058421:0
LOP_MODIFY_ROW   LCX_IAM       0001:0000009b  0        1     ACQUIRE_LOCK_X RID: 18:1:155:0
LOP_MODIFY_ROW   LCX_PFS       0001:00000001  0        1     ACQUIRE_LOCK_X PAGE: 18:1:154
LOP_MODIFY_ROW   LCX_PFS       0001:00000001  0        1     ACQUIRE_LOCK_X PAGE: 18:1:155
LOP_MODIFY_ROW   LCX_IAM       0001:00000099  0        1     ACQUIRE_LOCK_X RID: 18:1:153:0
LOP_MODIFY_ROW   LCX_PFS       0001:00000001  0        1     ACQUIRE_LOCK_X PAGE: 18:1:152
LOP_MODIFY_ROW   LCX_PFS       0001:00000001  0        1     ACQUIRE_LOCK_X PAGE: 18:1:153
LOP_SET_BITS     LCX_SGAM      0001:00000003  1        NULL  NULL
LOP_SET_BITS     LCX_GAM       0001:00000002  1        NULL  NULL
LOP_COUNT_DELTA  LCX_CLUSTERED 0001:00000014  89       NULL  NULL
LOP_COUNT_DELTA  LCX_CLUSTERED 0001:00000011  78       NULL  NULL
LOP_COUNT_DELTA  LCX_CLUSTERED 0001:00000014  90       NULL  NULL
LOP_COUNT_DELTA  LCX_CLUSTERED 0001:00000041  164      NULL  NULL
LOP_COUNT_DELTA  LCX_CLUSTERED 0001:00000041  165      NULL  NULL
LOP_COUNT_DELTA  LCX_CLUSTERED 0001:00000041  166      NULL  NULL
LOP_HOBT_DDL     LCX_NULL      NULL           NULL     NULL  NULL
LOP_MODIFY_ROW   LCX_CLUSTERED 0001:00000014  89       2     ACQUIRE_LOCK_IX OBJECT: 18:7:0;
                                                                
ACQUIRE_LOCK_X KEY: 18:458752 (0000c2681664)
LOP_HOBT_DDL     LCX_NULL      NULL           NULL     NULL  NULL
LOP_MODIFY_ROW   LCX_CLUSTERED 0001:00000014  90       2     ACQUIRE_LOCK_IX OBJECT: 18:7:0;
                                                                 ACQUIRE_LOCK_X KEY: 18:458752 (00007a581379)
LOP_MODIFY_ROW   LCX_CLUSTERED 0001:00000011  78       2     ACQUIRE_LOCK_IX OBJECT: 18:5:0;
                                                                 ACQUIRE_LOCK_X KEY: 18:327680 (00001df3833b)
LOP_COMMIT_XACT  LCX_NULL      NULL           NULL     NULL  NULL

Lots of logging and lots of locks. If you look at the Context column, you'll see that the operation is modifying allocation bitmaps (LCX_IAM, LCX_PFS, LCX_SGAM, LCX_GAM) but taking locks on the table pages, not on the allocation bitmaps themselves - they're only ever latched (an internal, much lighter-weight, synchronization mechanism). This is done as the pages comprising the table are deallocated - this is all done because the table's small enough that the Storage Engine chooses to deallocate all the storage immediately, instead of pushing it all onto the task queue for the deferred drop background task. See my previous post Search Engine Q&A #10: When are pages from a truncated table reused? which discusses this too.

There are no actual row operations performed on the table itself. The only table row operations are down at the bottom on table with object IDs 7 and 5 (sysallocunits and sysrowsets, respectively) to update the page counts, first IAM, and first page entries for the table.

So - hopefully this has been useful to you. In the next post in the series, I'll discuss compensation log records and how rollback operations work.

PS Send me an email or put in a comment if there's something in particular about the log (or log records) you'd like to see explained.

I've just read a very good, very deep, and very interesting blog post by James Rowland-Jones. In the post, James investigates some locking issues using a variety of means and explains about the undocumented %%lockres%% function with you can use to figure out what the wait resource will be for individual table rows (basically the lock-hash value). He also shows something I've known about but never seen before - how the lock hash algorithm isn't perfect and can actually cause lock collisions where you wouldn't expect them - and how to mitigate the problem.

Excellent post and well worth reading. Check it out at The Curious Case of the Dubious Deadlock and the Not So Logical Lock.

In the TechNet Magazine article I wrote on Advanced Troubleshooting with Extended Events, I mentioned the always-on event session called system_health. Jonathan Kehayias, a fellow MVP and blogging mad-man, posted a great article with SQL Server Central today about how to get historical deadlock graph info from it. His article explains some of the pre-reqs for doing this (like installing 2008 RTM CU1 - see KB 956717 - to get it working and a bug with the XML output) and gives some code.

After wrestling with the download of CU1 (it took 5 tries to get it to download!), I got it installed in one of my VPCs and gave Jon's code a whirl. I forced a deadlock by creating two tables, locking them from separate transactions and then trying to select from the other table. See the code below:

-- Connection 1:
CREATE TABLE t1 (c1 INT);
INSERT INTO t1 VALUES (1);
GO

BEGIN TRAN
UPDATE t1 SET c1 = 2;
GO

-- Connection 2:
CREATE TABLE t2 (c1 INT);
INSERT INTO t2 VALUES (1);
GO

BEGIN TRAN
UPDATE t2 SET c1 = 2;
GO

-- Connection 1:
SELECT * FROM t2;
GO

-- Connection 2:
SELECT * FROM t1;
GO

And one of them will be killed by the deadlock monitor.

Msg 1205, Level 13, State 45, Line 1
Transaction (Process ID 57) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

Now running Jon's (slightly reformatted) code:

SELECT CAST (
    REPLACE (
        REPLACE (
            XEventData.XEvent.value ('(data/value)[1]', 'varchar(max)'),
            '<victim-list>', '<deadlock><victim-list>'),
        '<process-list>', '</victim-list><process-list>')
    AS XML) AS DeadlockGraph
FROM (SELECT CAST (target_data AS XML) AS TargetData
    FROM sys.dm_xe_session_targets st
    JOIN sys.dm_xe_sessions s ON s.address = st.event_session_address
    WHERE [name] = 'system_health') AS Data
CROSS APPLY TargetData.nodes ('//RingBufferTarget/event') AS XEventData (XEvent)
    WHERE XEventData.XEvent.value('@name', 'varchar(4000)') = 'xml_deadlock_report';

Gives the XML deadlock graph (which unfortunately can't be displayed graphically because the format differs from the Profiler/trace-flag output)

<deadlock-list>
  <deadlock>
    <victim-list>
      <victimProcess id="process53d9000" />
    </victim-list>
    <process-list>
      <process id="process53d9000" taskpriority="0" logused="256" waitresource="RID: 1:1:304:0" waittime="331" ownerId="1868" transactionname="user_transaction" lasttranstarted="2009-02-23T13:32:05.100" XDES="0x4eb8c10" lockMode="S" schedulerid="1" kpid="2216" status="suspended" spid="57" sbid="0" ecid="0" priority="0" trancount="1" lastbatchstarted="2009-02-23T13:32:33.093" lastbatchcompleted="2009-02-23T13:32:05.100" clientapp="Microsoft SQL Server Management Studio - Query" hostname="CHICAGO" hostpid="2552" loginname="CHICAGO\Administrator" isolationlevel="read committed (2)" xactid="1868" currentdb="1" lockTimeout="4294967295" clientoption1="671090784" clientoption2="390200">
        <executionStack>
          <frame procname="" line="1" sqlhandle="0x02000000c70b7b1daba27f2ddf3e772600949464d524b6f2" />
        </executionStack>
        <inputbuf>
SELECT * FROM t1;
    </inputbuf>
      </process>
      <process id="process4985558" taskpriority="0" logused="256" waitresource="RID: 1:1:294:0" waittime="10181" ownerId="1877" transactionname="user_transaction" lasttranstarted="2009-02-23T13:32:08.067" XDES="0x4eb9be0" lockMode="S" schedulerid="1" kpid="2020" status="suspended" spid="56" sbid="0" ecid="0" priority="0" trancount="1" lastbatchstarted="2009-02-23T13:32:23.240" lastbatchcompleted="2009-02-23T13:32:08.067" clientapp="Microsoft SQL Server Management Studio - Query" hostname="CHICAGO" hostpid="2552" loginname="CHICAGO\Administrator" isolationlevel="read committed (2)" xactid="1877" currentdb="1" lockTimeout="4294967295" clientoption1="671090784" clientoption2="390200">
        <executionStack>
          <frame procname="" line="1" sqlhandle="0x02000000f4d34532698b7ad324df813feb2ba5024730cb79" />
        </executionStack>
        <inputbuf>
SELECT * FROM t2;
    </inputbuf>
      </process>
    </process-list>
    <resource-list>
      <ridlock fileid="1" pageid="304" dbid="1" objectname="" id="lock47aea80" mode="X" associatedObjectId="72057594039042048">
        <owner-list>
          <owner id="process4985558" mode="X" />
        </owner-list>
        <waiter-list>
          <waiter id="process53d9000" mode="S" requestType="wait" />
        </waiter-list>
      </ridlock>
      <ridlock fileid="1" pageid="294" dbid="1" objectname="" id="lock47af200" mode="X" associatedObjectId="72057594038976512">
        <owner-list>
          <owner id="process53d9000" mode="X" />
        </owner-list>
        <waiter-list>
          <waiter id="process4985558" mode="S" requestType="wait" />
        </waiter-list>
      </ridlock>
    </resource-list>
  </deadlock>
</deadlock-list>

Cool!

OK - last content post today. I forgot that the February TechNet Magazine also has the latest edition of my regular SQL Q&A column. This month's column covers:

  • Should backup compression be enabled at the instance level?
  • Client redirection during database mirroring failovers
  • Partition-level lock escalation in SQL Server 2008
  • Is it ever safe to rebuild a transaction log?

Check out the column at http://technet.microsoft.com/en-us/magazine/2009.02.sqlqa.aspx

Over the last few weeks I've seen (and helped correct) quite a few myths and misconceptions about index rebuild operations. There's enough now to make it worthwhile doing a blog post (and it's too hot here in Orlando for us to go sit by the pool so we're both sitting here blogging)...

Myth 1:  index rebuild pre-allocates the necessary space

This myth has two variations:

  1. The space for the new copy of the index is pre-allocated
  2. The space for the sort portion of the rebuild is pre-allocated

Neither of these are true. Index rebuild (whether online or offline, and at least as far back as 7.0) will create a new copy of the index before dropping the old copy. The pages and extents required to do this will always be allocated as needed, as with any other operation in SQL Server. The sort phase of an index rebuild, if required (in certain cases it is skipped in 2005), will adhere to the same allocation behavior.

Myth 2: indexes are rebuilt within a single file in a multi-file filegroup

This is a new one that I just heard yesterday - (paraphrasing) "In a two-file filegroup, an index in file 1 will be rebuilt into file 2. The next time it is rebuilt, it will be built in file 1. And so on".

This is untrue. Any time any allocations are done in a multi-file filegroup, the allocations are spread amongst all the files using the allocation system's proportional fill algorithm. In a nutshell, this says that space will be allocated more frequently from larger files with more free space than from smaller files with less free space. There is no concept in SQL Server of limiting allocations to a particular file in a multi-file filegroup.

Myth 3: non-clustered indexes are always rebuilt when a clustered index is rebuilt

This is untrue. The rules are a little complex here but can be summed up as follows:

  • In 2005+, rebuilding a unique or non-unique clustered index (without changing its definition) will NOT rebuild the non-clustered indexes
  • In 2000:
    • Rebuilding a non-unique clustered index WILL rebuild the non-clustered indexes
    • Rebuilding a unique clustered index will NOT rebuild the non-clustered indexes

The first few service packs of 2000 had bugs that changed the behavior of rebuilding unique clustered indexes back and forth - this is the source of much of the confusion around this myth.

For a much more detailed discussion of this, see my blog post from last Fall - Indexes From Every Angle: What happens to non-clustered indexes when the table structure is changed?.

Myth 4: BULK_LOGGED recovery mode decreases the size of the transaction log and log backups for an index rebuild

This myth is partly true.

Switching to the BULK_LOGGED recovery mode while doing an index rebuild operation WILL reduce the amount of transaction log generated, which is very useful for limiting the size of the transaction log file (note I say 'file', not 'files' - you only need one log file).

Switching to the BULK_LOGGED recovery mode while doing an index rebuild will NOT reduce the size of the transaction log BACKUP. Although the operation will be minimally-logged, the next transaction log backup will read all the transaction log since the last backup plus all the extents that were changed by the minimally-logged index rebuild. This will result in a log backup that's almost exactly the same size as for a fully-logged index rebuild. The ONLY time a log backup will contain data extents is when a minimally-logged operation has taken place since the last log backup - see here on MSDN for more info.

If you're considering using the BULK_LOGGED recovery mode, beware that you lose the ability to do point-in-time recovery to ANY point covered by a transaction log backup that contains even a single minimally-logged operation. Make sure that there's nothing else happening in the database that you may need to effectively roll-back with P.I.T. recovery. The operations you should perform if you're going to do this are:

  • In FULL recovery mode, take log backup immediately before switching to BULK_LOGGED
  • Switch to BULK_LOGGED and do the index rebuild
  • Switch back to FULL and immediately take a log backup

This limits the time period in which you can't do P.I.T. recovery.

Myth 5: online index rebuild doesn't take any locks

This myth is untrue. The 'online' in 'online index operations' is a bit of a misnomer.  Online index operations need to take two very short-term table locks. An S (Shared) table lock at the start of the operation to force all write plans that could touch the index to recompile, and a SCH-M (Schema-Modification - think of it as an Exclusive) table lock at the end of operation to force all read and write plans that could touch the index to recompile.

The most recent time this came up on the forums was someone noticing insert queries timing out after an online index rebuild operation had just started. The problem is that the  table lock that online index rebuild needs has to be entered into the grant queue in the lock manager until it can be acquired - and it will stay there until existing transactions that are holding conflicting locks either commit or roll-back. Any transaction that requires a conflicting lock AFTER the index rebuild lock has been queued but not acquired (and then released) will wait behind it in the lock grant queue. If the query timeout is reached before the transaction can get it's lock, it will timeout.

This is still much better than the table lock being held for the entire duration of the index rebuild operation. For more info, checkout this whitepaper on Online Index Operations in SQL Server 2005.

Phew - last week Kimberly and I spent 3 days teaching the ins-and-outs of SQL Server 2008 for DBAs/IT-Pros to about 130 Microsoft SQL Server experts and MVPs (like Kalen Delaney, Adam Machanic and Ron Talmage). This was the (95% complete) Beta delivery of a course we've been developing for the last six months for Microsoft that they'll use to train their SQL experts around the world on the new release. It's been very interesting watching the features develop through the CTPs (especially since I left the fold last August) - and making demos work on pre-release builds of the CTPs.

Teaching the course was a *blast* - the thing I love about teaching a really geeky crowd is the plethora of great questions and opportunities for going deep with explanations. Our team actually wrote and delivered the concurrently presented Developer and BI tracks as well. As you can see from the list below (and this is just the features a DBA needs to use/know about), SQL Server 2008 isn't a dot release of Yukon at all, as some people have suggested. Over the three days we covered:

  • Database Mirroring (D)
  • Backup Compression
  • Peer-to-Peer Replication (D)
  • Transparent Data Encryption (D)
  • Extensible (Off-Box) Key Management
  • All Actions Audited (D)
  • Policy-Based Management
  • Resource Governor (D)
  • Extended Events (D)
  • Spatial Indexes
  • Integrated Full-Text Search
  • Sparse Columns (D)
  • Filtered Indexes
  • Change Tracking
  • Change Data Capture (D)
  • FILESTREAM (D)
  • Performance Data Collection
  • Query Optimizer Enhancements
  • Data Compression (D)
  • Service Broker
  • Partition-Level Lock Escalation (D)

The features marked with a (D) are ones I demo'd during the course (Kimberly demo'd a bunch of the others - especially the tools features). Some of the demos were challenging to make work in time as we only got a pre-CTP6 build mid-January just before we headed off to China.

So why am I posting this? Well, a bunch of these features are in CTP-6, which should be just around the corner, and I have some easy-to-understand demos of them that I'll be posting here over the next month or so. Also, if this course sounds interesting, Kimberly and I will be teaching it in various configurations over the next year - starting with SQL Connections in April, a soon-to-be-announced class in Iceland in March, and the ITPro portion of TechEd in June.

Watch this space starting next week (today's the last day of six straight weeks of teaching for us so this weekend's a break :-))

Back in October 2007 I blogged about partition-level lock escalation in SQL Server 2008 (see here) and I promised to do a follow-up once CTP-5 came out with syntax etc. So here it is.

A brief recap - lock escalation in SQL Server 2005 and before only allowed table-level lock escalation. If you have a partitioned table with queries going against different partitions, then table-level escalation is a pain because the whole table is suddenly locked and concurrent queries against distinct partitions can't run. SQL Server 2008 gives the ability to escalate to a parttition lock, which won't affect the queries on the other partitions.

The lock escalation policy can only be set with ALTER TABLE after a table has been created, and the policy can only be set at the table level. The syntax is

ALTER TABLE TableName SET (LOCK_ESCALATION = TABLE | AUTO | DISABLE);

The options mean:

  • TABLE - escalation will always be to the table level. This is the default.
  • AUTO - escalation will be to the partition level if the table is partitioned; otherwise it will be to the table level
  • DISABLE - escalation will be disabled. This does not guarantee that it will NEVER occur - there are some cases where it is necessary (Books Online gives the example of scanning a heap in the SERIALIZABLE isolation level)

The only way I could find to check what the escalation policy for a table is set to is to use the sys.tables catalog view:

SELECT lock_escalation_desc FROM sys.mytables WHERE name = 'TableName';

Let's try it out. Here's a script that creates a database with an example table with 3 partitions. The partition ranges are negative infinity to 7999, 8000 to 15999, 16000 to positive infinity.

CREATE DATABASE LockEscalationTest;
GO

USE LockEscalationTest;
GO

-- Create three partitions: -7999, 8000-15999, 16000+
CREATE PARTITION FUNCTION MyPartitionFunction (INT) AS RANGE RIGHT FOR VALUES (8000, 16000);
GO

CREATE PARTITION SCHEME MyPartitionScheme AS PARTITION MyPartitionFunction
ALL TO ([PRIMARY]);
GO

-- Create a partitioned table
CREATE TABLE MyPartitionedTable (c1 INT);
GO

CREATE CLUSTERED INDEX MPT_Clust ON MyPartitionedTable (c1)
ON MyPartitionScheme (c1);
GO

-- Fill the table
SET NOCOUNT ON;
GO

DECLARE @a INT = 1;
WHILE (@a < 17000)
BEGIN
INSERT INTO MyPartitionedTable VALUES (@a);
SELECT @a = @a + 1;
END;
GO

Now I'm going to explicitly set the escalation to TABLE and start a transaction that should cause lock escalation.

ALTER TABLE MyPartitionedTable SET (LOCK_ESCALATION = TABLE);
GO

BEGIN TRAN
UPDATE
MyPartitionedTable SET c1 = c1 WHERE c1 < 7500;
GO

We should be able to see the locks being held:

SELECT [resource_type], [resource_associated_entity_id], [request_mode],
[request_type], [request_status] FROM sys.dm_tran_locks WHERE [resource_type] <> 'DATABASE';
GO

resource_type   resource_associated_entity_id request_mode   request_type   request_status
--------------- ----------------------------- -------------- -------------- ----------------
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
OBJECT          2105058535                    X              LOCK           GRANT

Just as we expected - an X table lock. Trying any query against the table fails now. Now I'll rollback that transaction, set the escalation to partition-level and try again.

ROLLBACK TRAN;
GO

ALTER TABLE MyPartitionedTable SET (LOCK_ESCALATION = AUTO);
GO

BEGIN TRAN
UPDATE
MyPartitionedTable SET c1 = c1 WHERE c1 < 7500;
GO

SELECT [partition_id], [object_id], [index_id], [partition_number]
FROM sys.partitions WHERE object_id = OBJECT_ID ('MyPartitionedTable');
GO

SELECT [resource_type], [resource_associated_entity_id], [request_mode],
[request_type], [request_status] FROM sys.dm_tran_locks WHERE [resource_type] <> 'DATABASE';
GO

partition_id         object_id   index_id    partition_number
-------------------- ----------- ----------- ----------------
72057594039042048    2105058535  1           1
72057594039107584    2105058535  1           2
72057594039173120    2105058535  1           3

resource_type   resource_associated_entity_id request_mode   request_type   request_status
--------------- ----------------------------- -------------- -------------- ----------------
HOBT            72057594039042048             X              LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
OBJECT          2105058535                    IX             LOCK           GRANT

Excellent - the object lock is now IX rather than X, and the X lock is at the partition (HOBT) level for partition 1 (see the bold highlighting to match the partition ID with the lock resource). (For an explanation of HOBTs, see my post Inside The Storage Engine: IAM pages, IAM chains, and allocation units.) So now we should be able to do something with another partition - let's see if we can cause another partition level X lock in another connection:

USE LockEscalationTest;
GO

BEGIN TRAN
UPDATE
MyPartitionedTable set c1 = c1 WHERE c1 > 8100 AND c1 < 15900
;
GO

SELECT [partition_id], [object_id], [index_id], [partition_number]
FROM sys.partitions WHERE object_id = OBJECT_ID ('MyPartitionedTable'
);
GO

SELECT [resource_type], [resource_associated_entity_id], [request_mode],
[request_type], [request_status] FROM sys.dm_tran_locks WHERE [resource_type] <> 'DATABASE'
;
GO

partition_id         object_id   index_id    partition_number
-------------------- ----------- ----------- ----------------
72057594039042048    2105058535  1           1
72057594039107584    2105058535  1           2
72057594039173120    2105058535  1           3

resource_type   resource_associated_entity_id request_mode   request_type   request_status
--------------- ----------------------------- -------------- -------------- ----------------
HOBT            72057594039107584             X              LOCK           GRANT
HOBT            72057594039042048             X              LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
OBJECT          2105058535                    IX             LOCK           GRANT
OBJECT          2105058535                    IX             LOCK           GRANT

Now we have two partition X locks, for partitions 1 and 2 (as expected - use the color coding above to match up the IDs), plus two table-level IX locks (one for each  connection, as expected). Very cool!

Now I'm going to force a deadlock - by having each connection try to read a row from the other locked partition:

Connection 1:

SELECT * FROM MyPartitionedTable WHERE c1 = 8500;
GO

Conneciton 2:

SELECT * FROM MyPartitionedTable WHERE c1 = 100;
GO

Connection 2 succeeds but on connection 1 we get (as expected):

(local)\SQLDEV01(SQLHAVPC\Administrator): Msg 1205, Level 13, State 18, Line 1
Transaction (Process ID 51) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

This illustrates a potential problem with this new mechanism - applications that used to rely on the blocking nature of X table locks may now exhibit deadlocks if partition-level escalation is turned on in production without any testing. In fact, this mode was specifically chosen NOT to be the default setting for new tables because some trial workloads exhibited deadlocks during testing. Don't just turn it on in production without testing - as with any other option or feature.

This is a question I was sent a week or so ago - if a table is truncated inside a transaction, what protects the integrity of the table's pages in case the transaction rolls back? Let's find out.

First off I'll create a simple table to experiment with.

CREATE TABLE TruncateTest (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'A');
GO

SET NOCOUNT ON;
GO

DECLARE @a INT;
SELECT @a = 1
;
WHILE (@a < 20
)
BEGIN

INSERT INTO TruncateTest DEFAULT VALUES;
SELECT @a = @a + 1;

END;
GO

We can see what pages and extents are allocated to the table using the undocumented DBCC IND command:

DBCC IND (test, TruncateTest, 0);
GO

PageFID PagePID
------- ---------
1       193
1       192
1       194
1       195
1       196
1       197
1       198
1       199
1       200
1       224
1       225
1       226
1       227
1       228
1       229
1       230
1       231
1       232
1       233
1       234

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

I've curtailed the output to just the page IDs and we can see that there are 4 extents used by this table (starting on pages (1:192), (1:200), (1:224), and (1:232)). Now if we truncate the table in a transaction, what will DBCC IND show?

BEGIN TRAN;
GO

TRUNCATE TABLE TruncateTest;
GO

DBCC IND (test, TruncateTest, 0);
GO

DBCC execution completed. If DBCC printed error messages, contact your system administrator.

Looks like there are no pages allocated to the table. So where are they? Let's check what locks there are. Instead of using sp_lock, I'm going to use it's replacement DMV, sys.dm_tran_locks:

SELECT resource_type, resource_description, request_mode FROM sys.dm_tran_locks WHERE resource_type IN ('EXTENT', 'PAGE');
GO

resource_type   resource_description   request_mode
--------------- ---------------------- --------------
EXTENT          1:200                  X
PAGE            1:198                  X
PAGE            1:199                  X
PAGE            1:196                  X
PAGE            1:197                  X
PAGE            1:194                  X
PAGE            1:195                  X
PAGE            1:192                  X
PAGE            1:193                  X
EXTENT          1:192                  X
PAGE            1:200                  X
EXTENT          1:232                  X
EXTENT          1:224                  X

Ah - all the pages and extents are locked. The table doesn't show them as allocated any more but because they're exclusively locked, the allocation subsystem can't really deallocate them until the locks are dropped (when the transaction commits). That's the answer - they can't be reused until they're really deallocated. If a transaction rollback happens, the pages are just marked as allocated again.

SQL Server supports lock escalation - when the server decides to move from a large number of row or page locks on an object to a table-level lock. Sunil Agarwal posted a great description of lock escalation in SQL Server 2005 on the Storage Engine blog so I won't repeat it all here.

The problem with lock escalation is that it can be tricky to manage on systems that have conflicting requirements.

Disabling lock escalation

For example, if a table needs to support large batch updates with concurrent user queries, then having the batch update cause an escalation to a table-level exclusive lock prevents the user queries from running. There are a couple of documented trace flags that can be used to disable lock escalation:

  • 1211 - disables lock escalation totally and will allow lock memory to grow to 60% of dynamically allocated memory (non-AWE memory for 32-bit and regullar memory for 64-bit) and will then further locking will fail with an out-of-memory error
  • 1224 - disables lock escalation until 40% of memory is used and then re-enables escalation

The problem with these two trace flags are that they are instance-wide and turning them on can cause huge performance issues if a poorly-written application takes too many locks. It's not possible to disable lock escalation for a single table - until now!

SQL Server 2008 includes the ability to disable lock escalation per-table!! This is a fantastic step forward in concurrency management.

Changing the escalation mechanism

To extend the example above, what about if the table has multiple partitions? With the batch update only affecting a single partition and concurrent user queries going against other partitions, the escalation policy in SQL Server 2005 means that the batch update will escalate to a table-level exclusive lock and freeze out the user queries, even though they're going against different partitions. The only recourse is to disable lock escalation - until now!

SQL Server 2008 includes the ability to specify partition-level lock escalation instead of table-level lock escalation. And this is per-table! Very cool.

Summary

SQL Server 2008 will have ALTER TABLE syntax to specify per-table lock escalation management. The options will be:

  • Automatic determination of the level to escalate to. If the table is partitioned, locks will be escalated to the partition-level.
  • Table-level lock escalation (even if the table is partitioned).
  • Disable lock escalation

Once this feature is available in a CTP I'll blog about the syntax and supporting infrastructure, along with some examples.

Theme design by Nukeation based on Jelle Druyts