Earlier this morning I noticed a discussion on the SQL MCM distribution list (that all the original MCM instructors are part of) that was trying to make sense of a huge disparity between tempdb data file usage and log file usage. I explained the answer and thought I'd share it with you all too.

The situation was a tempdb where the data files grew from 30GB to 120GB, running out of space on the disk, but the tempdb log file did not grow at all from its initial size of 1GB! How could that be?

One of the things to consider about tempdb is that logging in tempdb is ultra-efficient. Log records for updates in tempdb, for instance, only log the before image of the data instead of logging both before and after images. There is no need to log the after image - as that is only used for the REDO portion of crash recovery. As tempdb never gets crash-recovered, REDO never occurs. The before image *is* necessary, however, because transactions can be rolled back in tempdb, just like other databases, and so the before image of an update must be available to be able to successfully roll back the update.

Getting to the question though, I can easily explain the observed behavior by considering how a sort spill happens with tempdb.

I can simulate this using a gnarly query on the SalesDB database you can download from our resources page (see the top of the page for the sample databases to download).

I'm going to do a join of my Sales and Products tables and then sort the multi-million row result set by product name:

SELECT S.*, P.* from Sales S
JOIN Products P ON P.ProductID = S.ProductID
ORDER BY P.Name;
GO 

The query plan for this is (using Plan Explorer):

I know that the sort is going to spill out of memory into tempdb in this case. First I checkpoint tempdb (to clear out the log) and then after running the query, I can analyze the transaction log for tempdb.

Looking at the operation in the log:

SELECT
     [Current LSN],
     [Operation],
     [Context],
     [Transaction ID],
     [Log Record Length],
     [Description]
FROM fn_dblog (null, null);
GO

Current LSN            Operation       Context  Transaction ID Len Description
---------------------- --------------- -------- -------------- --- ----------------------------------------------------------
000000c0:00000077:0001 LOP_BEGIN_XACT  LCX_NULL 0000:00005e4d  120 sort_init;<snip>
000000c0:00000077:0002 LOP_BEGIN_XACT  LCX_NULL 0000:00005e4e  132 FirstPage Alloc;<snip>
000000c0:00000077:0003 LOP_SET_BITS    LCX_GAM  0000:00005e4e  60  Allocated 1 extent(s) starting at page 0001:0000aa48
000000c0:00000077:0004 LOP_MODIFY_ROW  LCX_PFS  0000:00005e4e  88  Allocated 0001:0000aa48;Allocated 0001:0000aa49;
000000c0:00000077:0005 LOP_MODIFY_ROW  LCX_PFS  0000:00005e4d  80  Allocated 0001:00000123
000000c0:00000077:0006 LOP_FORMAT_PAGE LCX_IAM  0000:00005e4d  84               
000000c0:00000077:0007 LOP_SET_BITS    LCX_IAM  0000:00005e4e  60               
000000c0:00000077:0009 LOP_COMMIT_XACT LCX_NULL 0000:00005e4e  52               
000000c0:00000077:000a LOP_BEGIN_XACT  LCX_NULL 0000:00005e4f  128 soAllocExtents;<snip>
000000c0:00000077:000b LOP_SET_BITS    LCX_GAM  0000:00005e4f  60  Allocated 1 extent(s) starting at page 0001:0000aa50
000000c0:00000077:000c LOP_MODIFY_ROW  LCX_PFS  0000:00005e4f  88  Allocated 0001:0000aa50;Allocated 0001:0000aa51;<snip>
000000c0:00000077:000d LOP_SET_BITS    LCX_IAM  0000:00005e4f  60               
000000c0:00000077:000e LOP_SET_BITS    LCX_GAM  0000:00005e4f  60  Allocated 1 extent(s) starting at page 0001:0000aa58
000000c0:00000077:000f LOP_MODIFY_ROW  LCX_PFS  0000:00005e4f  88  Allocated 0001:0000aa58;Allocated 0001:0000aa59;<snip>
000000c0:00000077:0010 LOP_SET_BITS    LCX_IAM  0000:00005e4f  60               
000000c0:00000077:0011 LOP_SET_BITS    LCX_GAM  0000:00005e4f  60  Allocated 1 extent(s) starting at page 0001:0000aa60
000000c0:00000077:0012 LOP_MODIFY_ROW  LCX_PFS  0000:00005e4f  88  Allocated 0001:0000aa60;Allocated 0001:0000aa61;<snip>
000000c0:00000077:0013 LOP_SET_BITS    LCX_IAM  0000:00005e4f  60               
000000c0:00000077:0014 LOP_COMMIT_XACT LCX_NULL 0000:00005e4f  52               
000000c0:00000077:0015 LOP_BEGIN_XACT  LCX_NULL 0000:00005e50  128 soAllocExtents;<snip>
000000c0:00000077:0016 LOP_SET_BITS    LCX_GAM  0000:00005e50  60  Allocated 1 extent(s) starting at page 0001:0000aa68
000000c0:00000077:0017 LOP_MODIFY_ROW  LCX_PFS  0000:00005e50  88  Allocated 0001:0000aa68;Allocated 0001:0000aa69;<snip>
000000c0:00000077:0018 LOP_SET_BITS    LCX_IAM  0000:00005e50  60               
000000c0:00000077:0019 LOP_SET_BITS    LCX_GAM  0000:00005e50  60  Allocated 1 extent(s) starting at page 0001:0000aa70
000000c0:00000077:001a LOP_MODIFY_ROW  LCX_PFS  0000:00005e50  88  Allocated 0001:0000aa70;Allocated 0001:0000aa71;<snip>
000000c0:00000077:001b LOP_SET_BITS    LCX_IAM  0000:00005e50  60               
000000c0:00000077:001c LOP_SET_BITS    LCX_GAM  0000:00005e50  60  Allocated 1 extent(s) starting at page 0001:0000aa78
000000c0:00000077:001d LOP_MODIFY_ROW  LCX_PFS  0000:00005e50  88  Allocated 0001:0000aa78;Allocated 0001:0000aa79;<snip>
000000c0:00000077:001e LOP_SET_BITS    LCX_IAM  0000:00005e50  60               
000000c0:00000077:001f LOP_SET_BITS    LCX_GAM  0000:00005e50  60  Allocated 1 extent(s) starting at page 0001:0000aa80
000000c0:00000077:0020 LOP_MODIFY_ROW  LCX_PFS  0000:00005e50  88  Allocated 0001:0000aa80;Allocated 0001:0000aa81;<snip>
000000c0:00000077:0021 LOP_SET_BITS    LCX_IAM  0000:00005e50  60               
000000c0:00000077:0022 LOP_COMMIT_XACT LCX_NULL 0000:00005e50  52               
000000c0:00000077:0023 LOP_BEGIN_XACT  LCX_NULL 0000:00005e51  128 soAllocExtents;<snip>

<snip>

000000cd:00000088:01d3 LOP_SET_BITS    LCX_GAM  0000:000078fc  60  Deallocated 1 extent(s) starting at page 0001:00010e50
000000cd:00000088:01d4 LOP_COMMIT_XACT LCX_NULL 0000:000078fc  52               
000000cd:00000088:01d5 LOP_BEGIN_XACT  LCX_NULL 0000:000078fd  140 ExtentDeallocForSort;<snip>
000000cd:00000088:01d6 LOP_SET_BITS    LCX_IAM  0000:000078fd  60               
000000cd:00000088:01d7 LOP_MODIFY_ROW  LCX_PFS  0000:000078fd  88  Deallocated 0001:00010e68;Deallocated 0001:00010e69;<snip>
000000cd:00000088:01d8 LOP_SET_BITS    LCX_GAM  0000:000078fd  60  Deallocated 1 extent(s) starting at page 0001:00010e68
000000cd:00000088:01d9 LOP_COMMIT_XACT LCX_NULL 0000:000078fd  52               
000000cd:00000088:01da LOP_MODIFY_ROW  LCX_PFS  0000:00005fac  80  Deallocated 0001:00000109
000000cd:00000088:01db LOP_SET_BITS    LCX_SGAM 0000:00005fac  60  ClearBit 0001:00000108
000000cd:00000088:01dc LOP_SET_BITS    LCX_GAM  0000:00005fac  60  Deallocated 1 extent(s) starting at page 0001:00000108
000000cd:00000088:01dd LOP_COMMIT_XACT LCX_NULL 0000:00005fac  52               

(I snipped out a few extraneous log records plus the 6 extra 'Allocated' and 'Deallocated' for each of the PFS row modifications.) 

One of the things I notice is that the sort spill space is allocated in extents, and almost the entire sort - from initialization, through allocating all the extents, to deallocating them - is contained in a few very large transactions. But the transactions aren't actually that large.

Look at the soAllocExtents transaction with Transaction ID 00005e50. It's allocating 4 extents - i.e. 256KB - in a single system transction (4 x mark an extent as unavailable in the GAM, 4 x bulk set the 8 PFS bytes for the 8 pages in the extent, 4 x mark an extent allocated in the IAM). The total size of the log records for this transaction is 1012 bytes. (The first soAllocExtents system transaction only allocates 3 extents, all the others allocate 4 extents.)

When the sort ends, the extents are deallocated one-at-a-time in system transactions called ExtentDeallocForSort. An example is the transaction with Transaction ID 000078fd. It generates log records totalling 400 bytes. This means each 256KB takes 4 x 400 = 1600 bytes to deallocate.

Combining the allocation and deallocation operations, each 256KB of the sort that spills into tempdb generates 2612 bytes of log records.

Now let's consider the original behavior that I explained. If the 90GB was all sort space:

  • 90GB is 90 x 1024 x 1024 = 94371840KB, which is 94371840 / 256 = 368640 x 256KB chunks.
  • Each 256KB chunk takes 2612 bytes to allocate and deallocate, so our 90GB would take 368640 x 2612 = 962887680 bytes of log, which is 962887680 / 1024 / 1024 = ~918MB of log.

And this would explain the observed behavior. 90GB of tempdb space can be allocated and used for a sort spill with roughly 918MB of transaction log, give or take a bit from my rough calculations.

Tempdb logs things very efficiently - especially things that spill out of memory. The next stop in debugging such a problem would be regularly capturing the output of sys.dm_db_task_space_usage to figure out who is using all the space and then digging in from there.

Hope this helps explain things!

I first started blogging about latches and some of the deeper parts of SQL Server internals last year (see Advanced performance troubleshooting: waits, latches, spinlocks) and now I'd like to pick up that thread (no scheduling pun intended :-)) and blog some more about some of the common latches that could be a performance bottleneck.

To that end, I've got some code below (plus example output) that will show the most common latch waits that have occurred on your system.

WITH Latches AS
    (SELECT
        latch_class,
        wait_time_ms / 1000.0 AS WaitS,
        waiting_requests_count AS WaitCount,
        100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
    FROM sys.dm_os_latch_stats
    WHERE latch_class NOT IN (
        'BUFFER')
    AND wait_time_ms > 0

    )
SELECT
    W1.latch_class AS LatchClass,
    CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
    W1.WaitCount AS WaitCount,
    CAST (W1.Percentage AS DECIMAL(14, 2)) AS Percentage,
    CAST ((W1.WaitS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgWait_S
FROM Latches AS W1
INNER JOIN Latches AS W2
    ON W2.RowNum <= W1.RowNum
WHERE W1.WaitCount > 0
GROUP BY W1.RowNum, W1.latch_class, W1.WaitS, W1.WaitCount, W1.Percentage
HAVING SUM (W2.Percentage) - W1.Percentage < 95; -- percentage threshold
GO

LatchClass                        Wait_S  WaitCount  Percentage  AvgWait_S
--------------------------------- ------- ---------- ----------- ----------
LOG_MANAGER                       221.43  4659       45.81       0.0475
ACCESS_METHODS_HOBT_VIRTUAL_ROOT  199.56  7017       41.28       0.0284
FGCB_ADD_REMOVE                   35.17   1047       7.27        0.0336
DBCC_OBJECT_METADATA              26.85   256490     5.55        0.0001

I'd like you to run the code and send me the output (either as a comment or in email). I'll collate all your output and do some blogging for your enjoyment.

Thanks!

During every one of our Immersion Events, we designate Thursday evening as 'open mic' night where anyone can do a 15-minute presentation on anything they want (to do with SQL Server) to the class. We usually have 4 or 5 people who entertain us with interesting talks, and our recent classes in Chicago were no different.

One of the talks really impressed me. David Klee (b|t) demonstrated an automated analysis tool he's written for SQLIO result file parsing to save him time. He mentioned he was going to put it online and I encouraged him to do so as I could see the benefit to many people out there of not having to write their own analysis tools/spreadsheets.

You can get to David's free analysis site at http://tools.davidklee.net/sqlio.aspx. Clicking on the link at bottom right allows you to upload a SQLIO results text file. Once you've clicked ANALYZE, select the option to output the results to a spreadsheet and one will be automatically generated for you. If you look in the Analysis pane of the spreadsheet, you'll see something like below (using David's supplied example SQLIO output).

Very cool stuff - thanks David!

A long time ago, in a galaxy far, far away I kicked off a survey about memory configuration. Actually it was back at the start of January and I've been terribly remiss about posting the survey results!

I was interested in how the setting of Max Server Memory (which controls the maximum size of the buffer pool) related to the physical memory available on the server.

Thanks to the people who sent me data from 525 servers worldwide.

Here is the data, presented in two charts.

Firstly, when Max Server Memory is actually set:

 

I had a few data points at the 512GB and 768GB sizes, and their Max Server Memory settings were all valid.

What's interesting in this graph is the wide variety of Max Server Memory settings for any specific amount of physical server memory.

Rather than me explaining how you figure out how much physical memory to reserve for the operating system and other SQL Server memory uses, Jonathan just posted an explanation and loose formula over on his blog - so I'll point you there.

There were a disturbingly large number of SQL Servers that did *not* have Max Server Memory set at all:

 

These systems may suffer performance problems when the operating system has to pressure the SQL Server buffer pool to give back some memory - it's always better to set a Max Server Memory value - again, see Jonathan's post.

The large spike at 8GB in the graph above is because one person sent me a few hundred sets of results for 8GB servers without Max Server Memory set.

Here's the same set of results without the 8GB spike:

 

Quick summary: make sure you have an appropriate Max Server Memory setting for your servers to avoid performance problems.

Back in January I posted the results of the cluster key size survey I ran in 2011 and explained how the larger the cluster key is on your table, the more space is being wasted in all the nonclustered index rows. Check it out if you haven't already.

I've finally put together the code that will run through all your databases and give you a per-table indication of how much space is being taken up by cluster keys in nonclustered indexes, and the potential space savings if you converted the cluster key to a single 8-byte bigint. You can get it below or from here: KeySpaceSavingsSingleResultSet.zip (1.57 kb)

You can mess around with the code to do what you want. And I'm continuing to use sp_msforeachdb because it's the fastest way for me to knock out code for you, and it continues to irritate my good friend Aaron Bertrand :-)

Enjoy!

/*============================================================================
  File:     KeySpaceSavingsSingleResultSet.sql

  Summary:  Potential cluster key space savings

  SQL Server Versions: 2005 onwards
------------------------------------------------------------------------------
  Written by Paul S. Randal, SQLskills.com

  (c) 2012, SQLskills.com. All rights reserved.

  For more scripts and sample code, check out
   
http://www.SQLskills.com

  You may alter this code for your own *non-commercial* purposes. You may
  republish altered code as long as you include this copyright and give due
  credit, but you must obtain prior permission before blogging this code.
 
  THIS CODE AND INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF
  ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED
  TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A
  PARTICULAR PURPOSE.
============================================================================*/

IF EXISTS (SELECT * FROM msdb.sys.objects WHERE [name] = 'SQLskillsIKSpace')
    DROP TABLE msdb.dbo.SQLskillsIKSpace;
GO
CREATE TABLE msdb.dbo.SQLskillsIKSpace (
    DatabaseID SMALLINT,
 SchemaName SYSNAME,
 ObjectName SYSNAME,
    ObjectID INT,
    IndexCount SMALLINT DEFAULT (0),
    TableRows BIGINT DEFAULT (0),
    KeyCount SMALLINT DEFAULT (0),
    KeyWidth SMALLINT DEFAULT (0));
GO

EXEC sp_MSforeachdb
    N'IF EXISTS (SELECT 1 FROM (SELECT DISTINCT [name]
    FROM sys.databases WHERE [state_desc] = ''ONLINE''
        AND [database_id] > 4
        AND [name] != ''pubs''
        AND [name] != ''Northwind''
        AND [name] != ''distribution''
        AND [name] NOT LIKE ''ReportServer%''
        AND [name] NOT LIKE ''Adventure%'') AS names WHERE [name] = ''?'')
BEGIN
USE [?]

INSERT INTO msdb.dbo.SQLskillsIKSpace
(DatabaseID, SchemaName, ObjectName, ObjectID)
SELECT DB_ID (''?''), SCHEMA_NAME (o.[schema_id]), OBJECT_NAME (o.[object_id]), o.[object_id]
FROM sys.objects o
WHERE o.[type_desc] IN (''USER_TABLE'', ''VIEW'')
    AND o.[is_ms_shipped] = 0
    AND EXISTS (
        SELECT *
        FROM sys.indexes
        WHERE [index_id] = 1
            AND [object_id] = o.[object_id]);

UPDATE msdb.dbo.SQLskillsIKSpace
SET [TableRows] = (
    SELECT SUM ([rows])
    FROM sys.partitions p
    WHERE p.[object_id] = [ObjectID]
    AND p.[index_id] = 1)
WHERE [DatabaseID] = DB_ID (''?'');
 
UPDATE msdb.dbo.SQLskillsIKSpace
SET [IndexCount] = (
    SELECT COUNT (*)
    FROM sys.indexes i
    WHERE i.[object_id] = [ObjectID]
    AND i.[is_hypothetical] = 0
    AND i.[is_disabled] = 0
    AND i.[index_id] != 1)
WHERE [DatabaseID] = DB_ID (''?'');

UPDATE msdb.dbo.SQLskillsIKSpace
SET [KeyCount] = (
    SELECT COUNT (*)
    FROM sys.index_columns ic
    WHERE ic.[object_id] = [ObjectID]
    AND ic.[index_id] = 1)
WHERE [DatabaseID] = DB_ID (''?'');

UPDATE msdb.dbo.SQLskillsIKSpace
SET [KeyWidth] = (
    SELECT SUM (c.[max_length])
    FROM sys.columns c
    JOIN sys.index_columns ic
    ON c.[object_id] = ic.[object_id]
    AND c.[object_id] = [ObjectID]
    AND ic.[column_id] = c.[column_id]
    AND ic.[index_id] = 1)
WHERE [DatabaseID] = DB_ID (''?'');

DELETE msdb.dbo.SQLskillsIKSpace
WHERE
    ([KeyCount] = 1 AND [KeyWidth] < 9)
    OR [IndexCount] = 0 OR [TableRows] = 0;

END';
GO

SELECT
 DB_NAME ([DatabaseID]) AS [Database],
 [SchemaName] AS [Schema],
 [ObjectName] AS [Table],
    [IndexCount] AS [NCIndexes],
    [KeyCount] AS [ClusterKeys],
    [KeyWidth],
    [TableRows],
    [IndexCount] * [TableRows] * [KeyWidth] AS [KeySpaceInBytes],
    ([IndexCount] * [TableRows] * ([KeyWidth] - 8)) AS [PotentialSavings]
FROM msdb.dbo.SQLskillsIKSpace
ORDER BY [PotentialSavings] DESC;

DROP TABLE msdb.dbo.SQLskillsIKSpace;
GO

 

Back in November I kicked off a survey that had you run some code to get some details about your cluster keys, nonclustered indexes, and table size. I got results from more than 500 systems across the world, resulting in 97565 lines of data - thanks!

The purpose of the survey is to highlight one of the side-effects of not adhering to the general guidelines (i.e. there are exceptions to these) for choosing a clustered index key. It should be, if possible:

  1. Narrow
  2. Static
  3. Ever-increasing
  4. Unique

The survey and this post are intended to show how not adhering to Rule #1 can lead to performance problems.

Both Kimberly and I have explained in the past the architecture of nonclustered indexes - where every nonclustered index row has to have a link back to the matching heap or clustered index record. The link must be a unique value as it must definitively match a single record in the heap or clustered index. For nonclustered indexes on a table with a clustered index, this link is the cluster key (or keys) as these are guaranteed to be unique. Ah, you say, but what about when the clustered index is NOT defined as unique? That's where Rule #4 comes in. For a non-unique clustered index, there will be a hidden 4-byte column (called the uniquifier) added when necessary as a tie-breaker when multiple clustered index records have the same key values. This increases the clustered index key size by 4 bytes (the uniquifier is an integer) when needed.

But I digress. The crux of the matter is that every nonclustered index record will include the cluster keys. The wider the cluster key size is (e.g. a few natural keys), the more overhead there is in each nonclustered index record, compared to using, for instance, an integer (4-byte) or bigint (8-byte) surrogate cluster key. This can mean you've got the potential for saving huge amounts of space by moving to smaller clustered index keys - as we'll see from the data I collected.

The survey code I got you to run returned:

  1. Number of nonclustered indexes
  2. Number of cluster keys
  3. Total cluster key size
  4. Number of table rows
  5. Calculation of bytes used in all the nonclustered indexes to store the cluster keys in each row

I did not take into account filtered indexes in 2008, or variable length cluster key columns, as to be honest although these will make a difference, for the purposes of my discussion here (making you aware of the problem), they're irrelevant. It also would have made the survey code much more complex for me to figure out :-)

Now let's look at some of the results I received. To make things a little simpler, I discarded results from tables with less then ten thousand rows, and with clustered index key sizes less than 9. This dropped the number of data points from 97565 down to 22425.

The graphs below show the estimated amount off savings that could be had in GB from moving to an 8-byte bigint, plotted against the first four factors in the list above.

 

 

 

 

And here are the top 20 in terms of potential savings so you can see how the rough table schema:

NCIndexes  ClusterKeys  KeyWidth  TableRows      KeySpaceGB  SavingsGB
---------  -----------  --------  -------------  ----------  ---------
6          4            72        891,751,171    358.8       352.1
6          3            16        3,189,075,035  285.1       261.4
1          5            45        4,479,327,954  187.7       154.4
6          4            72        453,251,463    182.4       179.0
4          3            16        2,766,814,206  164.9       144.3
4          2            89        371,745,035    123.3       120.5
2          4            774       76,337,053     110.1       109.5
2          4            774       76,331,676     110.0       109.5
2          4            774       75,924,837     109.5       108.9
2          4            774       75,533,539     108.9       108.3
5          4            72        318,217,628    106.7       104.3
7          1            60        269,590,810    105.5       103.4
22         3            13        389,203,725    103.7       100.8
22         3            13        329,772,049    87.8        85.4
2          2            509       90,311,271     85.6        85.0
17         1            510       9,334,362      75.4        75.3
22         3            13        267,380,864    71.2        69.2
2          7            172       219,929,560    70.5        68.8
22         3            13        261,967,851    69.8        67.8
6          5            31        395,800,250    68.6        65.6

Wow - that's some pretty amazing stuff - and that doesn't even account for the space taken up by page headers etc.

You might be thinking - why do I care? There are plenty of reasons:

  • If you can save tens or hundreds of GBs by changing the cluster key to something much smaller, that translates directly into a reduction in size of your backups and data file disk space requirements.
  • Smaller databases mean faster backups and restores.
  • Making the nonclustered indexes smaller means that index maintenance (from inserts/updates/deletes) and index fragmentation removal will be much faster and generate less transaction log.
  • Making the nonclustered indexes smaller means that consistency checking will be much faster - nonclustered index checking takes 30% of the CPU usage of DBCC CHECKDB.
  • Reducing the width of nonclustered index records means the density of records (number of records per nonclustered index page) increases dramatically, leading to faster index processing, more efficient buffer pool (i.e. memory) usage, and fewer I/Os as more of the indexes can fit in memory.
  • Anything you can do to reduce the amount of transaction log directly affects the performance of log backups, replication, database mirroring, and log shipping.

As you can see, there are many reasons to keep the cluster key as small as possible - all directly translating into performance improvements. For those of you that think that moving to a bigint may cause you to run out of possible keys, see this blog post where I debunk that - unless you've got 3 million years and 150 thousand petabytes to spare...

One thing I'm not doing in this post is advocating any particular key over any other (although bigint identity does fit all the criteria from the top of the post) - except to try to keep it as small as possible. Choosing a good cluster key entails understanding the data and workload as well as the performance considerations of key size that I've presented here. And in some very narrow cases, not having a cluster key at all is acceptable - which means there's 8 bytes in each nonclustered index record (just to forestall those who may want to post a comment arguing against clustered indexes in general :-)

Changing the cluster key can be tricky - Kimberly blogged a set of steps to follow plus some code to help you on our SQL Server Magazine blog back in April 2010.

Later this week I'll blog some code that will run through your databases and spit out table names that could have significant space savings from changing the cluster key.

I won't be blogging or tweeting much in January as we'll be in Indonesia diving, but I will be posting photos later in the month.

Enjoy!

Continuing with my "index health" series, I've got another piece of code for you to run.

This time I'm interested in the number of columns in your clustered indexes and the consequent amount of nonclustered index space used by the clustered index keys.

Again, you're going to be really interested to see the results on your servers. When I editorialize the results I'll provide another query for you to run which will make the data actionable on your server.

Here are some results from a random customer server (yes, we already knew about these - long story :-):

NCIndexes ClusterKeys KeyWidth TableRows            KeySpaceInBytes
--------- ----------- -------- -------------------- --------------------
7         3           16       129902437            14549072944
1         3           12       29199817             350397804
10        2           12       1612919              193550280
5         2           5        4266671              106666775
2         2           8        5887697              94203152
5         4           20       827975               82797500
3         3           16       1215800              58358400
7         2           5        1497746              52421110
1         3           12       2667765              32013180
1         4           25       1033063              25826575
1         3           12       989320               11871840
2         2           8        278989               4463824
1         3           12       293736               3524832
4         2           5        160696               3213920

Feel free to send the results in any format you want - Excel spreadsheet works best though. Try not to add any columns to the result set - complicates the aggregation process.

The more results the better - thanks!

Here's the code:

IF EXISTS (SELECT * FROM tempdb.sys.objects WHERE [name] = 'SQLskillsIKSpace')
    DROP TABLE tempdb.dbo.SQLskillsIKSpace;
GO
CREATE TABLE tempdb.dbo.SQLskillsIKSpace (
    DatabaseID SMALLINT,
    ObjectID INT,
    IndexCount SMALLINT,
    TableRows  BIGINT,
    KeyCount   SMALLINT,
    KeyWidth   SMALLINT);
GO

EXEC sp_MSforeachdb 
    N'IF EXISTS (SELECT 1 FROM (SELECT DISTINCT [name]
    FROM sys.databases WHERE [state_desc] = ''ONLINE''
        AND [database_id] > 4
        AND [name] != ''pubs''
        AND [name] != ''Northwind''
        AND [name] != ''distribution''
        AND [name] NOT LIKE ''ReportServer%''
        AND [name] NOT LIKE ''Adventure%'') AS names WHERE [name] = ''?'')
BEGIN
USE [?]

INSERT INTO tempdb.dbo.SQLskillsIKSpace
SELECT DB_ID (''?''), o.[object_id], 0, 0, 0, 0
FROM sys.objects o
WHERE o.[type_desc] IN (''USER_TABLE'', ''VIEW'')
    AND o.[is_ms_shipped] = 0
    AND EXISTS (
        SELECT *
        FROM sys.indexes
        WHERE [index_id] = 1
            AND [object_id] = o.[object_id]);

UPDATE tempdb.dbo.SQLskillsIKSpace
SET [TableRows] = (
    SELECT SUM ([rows])
    FROM sys.partitions p
    WHERE p.[object_id] = [ObjectID]
    AND p.[index_id] = 1)
WHERE [DatabaseID] = DB_ID (''?'');
 
UPDATE tempdb.dbo.SQLskillsIKSpace
SET [IndexCount] = (
    SELECT COUNT (*)
    FROM sys.indexes i
    WHERE i.[object_id] = [ObjectID]
    AND i.[is_hypothetical] = 0
    AND i.[is_disabled] = 0
    AND i.[index_id] != 1)
WHERE [DatabaseID] = DB_ID (''?'');

UPDATE tempdb.dbo.SQLskillsIKSpace
SET [KeyCount] = (
    SELECT COUNT (*)
    FROM sys.index_columns ic
    WHERE ic.[object_id] = [ObjectID]
    AND ic.[index_id] = 1)
WHERE [DatabaseID] = DB_ID (''?'');

UPDATE tempdb.dbo.SQLskillsIKSpace
SET [KeyWidth] = (
    SELECT SUM (c.[max_length])
    FROM sys.columns c
    JOIN sys.index_columns ic
    ON c.[object_id] = ic.[object_id]
    AND c.[object_id] = [ObjectID]
    AND ic.[column_id] = c.[column_id]
    AND ic.[index_id] = 1)
WHERE [DatabaseID] = DB_ID (''?'');

DELETE tempdb.dbo.SQLskillsIKSpace
WHERE
    ([KeyCount] = 1 AND [KeyWidth] < 9)
    OR [IndexCount] = 0 OR [TableRows] = 0;

END';
GO

SELECT
    [IndexCount] AS [NCIndexes],
    [KeyCount] AS [ClusterKeys],
    [KeyWidth],
    [TableRows],
    [IndexCount] * [TableRows] * [KeyWidth] AS [KeySpaceInBytes]
FROM tempdb.dbo.SQLskillsIKSpace
ORDER BY [KeySpaceInBytes] DESC;

DROP TABLE tempdb.dbo.SQLskillsIKSpace;
GO

Yesterday I blogged about how having too few or too many nonclustered indexes can be a big problem for performance (see here). Today I'm posting some code you can run which will print out the number of indexes for each table in each database on an instance.

I made it print a result set per database so the table in msdb doesn't get too big having to store schema and object names. You can get the code below and here: NCIndexCounts.zip (870.00 bytes).

I also created a version that *does* store names and prints a single result set. You can get that one here: NCIndexCountsSingleResultSet.zip (1.41 kb).

There's probably a niftier way to do this but I'm not a T-SQL expert :-)

Enjoy!

/*============================================================================
  File:     NCIndexCounts.sql

  Summary:  Nonclustered index counts (multiple result sets)

  SQL Server Versions: 2005 onwards
------------------------------------------------------------------------------
  Written by Paul S. Randal, SQLskills.com

  (c) 2011, SQLskills.com. All rights reserved.

  For more scripts and sample code, check out
   
http://www.SQLskills.com

  You may alter this code for your own *non-commercial* purposes. You may
  republish altered code as long as you include this copyright and give due
  credit, but you must obtain prior permission before blogging this code.
 
  THIS CODE AND INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF
  ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED
  TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A
  PARTICULAR PURPOSE.
============================================================================*/

IF EXISTS (SELECT * FROM msdb.sys.objects WHERE [name] = 'SQLskillsPaulsIndexCounts')
    DROP TABLE msdb.dbo.SQLskillsPaulsIndexCounts;
GO
CREATE TABLE msdb.dbo.SQLskillsPaulsIndexCounts (
    SchemaID INT,
    ObjectID INT,
    BaseType CHAR (10),
    IndexCount SMALLINT);
GO

EXEC sp_MSforeachdb
    N'IF EXISTS (SELECT 1 FROM (SELECT DISTINCT [name]
    FROM sys.databases WHERE [state_desc] = ''ONLINE''
        AND [database_id] > 4
        AND [name] != ''pubs''
        AND [name] != ''Northwind''
        AND [name] != ''distribution''
        AND [name] NOT LIKE ''ReportServer%''
        AND [name] NOT LIKE ''Adventure%'') AS names WHERE [name] = ''?'')
BEGIN
USE [?]
INSERT INTO msdb.dbo.SQLskillsPaulsIndexCounts
SELECT o.[schema_id], o.[object_id], ''Heap'', 0
FROM sys.objects o
WHERE o.[type_desc] IN (''USER_TABLE'', ''VIEW'')
    AND o.[is_ms_shipped] = 0
    AND EXISTS (
        SELECT *
        FROM sys.indexes
        WHERE [index_id] = 0
            AND [object_id] = o.[object_id]);

INSERT INTO msdb.dbo.SQLskillsPaulsIndexCounts
SELECT o.[schema_id], o.[object_id], ''Clustered'', 0
FROM sys.objects o
WHERE o.[type_desc] IN (''USER_TABLE'', ''VIEW'')
    AND o.[is_ms_shipped] = 0
    AND EXISTS (
        SELECT *
        FROM sys.indexes
        WHERE [index_id] = 1
            AND [object_id] = o.[object_id]);
           
UPDATE msdb.dbo.SQLskillsPaulsIndexCounts
SET [IndexCount] = (
    SELECT COUNT (*)
    FROM sys.indexes i
    WHERE i.object_id = [ObjectID]
    AND i.[is_hypothetical] = 0)

IF EXISTS (SELECT * FROM msdb.dbo.SQLskillsPaulsIndexCounts)
SELECT
    ''?'' AS [Database],
    SCHEMA_NAME ([SchemaID]) AS [Schema],
    OBJECT_NAME ([ObjectID]) AS [Table],
    [BaseType],
    (CASE
       WHEN [IndexCount] = 0 THEN 0
       ELSE [IndexCount]-1 END)
    AS [NCIndexes]
FROM msdb.dbo.SQLskillsPaulsIndexCounts
ORDER BY [BaseType] DESC, IndexCount;

TRUNCATE TABLE msdb.dbo.SQLskillsPaulsIndexCounts;
END';
GO

DROP TABLE msdb.dbo.SQLskillsPaulsIndexCounts;
GO

Back at the start of August I kicked off a survey (see here) that gave you some code to run to produce an aggregate list of the number of tables on your server with different numbers of nonclustered indexes. I got back results from more than 1000 servers across the world - a big thank you to everyone who sent me data!

It's taken me a while to get to this post because a) I needed a few hours to set aside to aggregate the comments, txt files, and spreadsheets that people sent and load them into a database; and b) I've been really busy with teaching etc. Finally I've had time this week while at SQL Connections in Las Vegas to put together the results and this post.

The winners:

  • Highest number of nonclustered indexes on a single clustered index: 1032
  • Highest number of nonclustered indexes on a single heap: 148
  • Highest number of clustered indexes with zero nonclustered indexes on a single server: 185237
  • Highest number of heaps with zero nonclustered indexes on a single server: 88042

Wow!

Now to some of the details...

Tables with Zero Nonclustered Indexes

The two graphs below show the number of servers that have a certain number of tables with zero nonclustered indexes.

 

For the clustered indexes, there is one case I can think of where having zero nonclustered indexes is acceptable: if all queries return all columns of the table and the query search predicate for all queries is the cluster key (or a left-based subset of the cluster key).

All queries that have a search predicate that does not match the cluster key (or a left-based subset thereof) will be table scans, which can put pressure on the buffer pool (see my post on Page Life Expectancy) and lead to contention on the ACCESS_METHODS_DATASET_PARENT latch (all manifesting as a high percentage of LATCH_EX or PAGEIOLATCH_SH and maybe CXPACKET waits).

For the heaps, all queries are inefficient table scans. Well, efficient if you're returning all the rows in the table every time, I suppose :-)

Bottom line: tables usually need nonclustered indexes to provide efficient access paths to the data requested by the various queries that your workload performs.

There are two things you can do to help find queries that need nonclustered indexes:

  1. Use the missing index DMVs (cautiously!) to determine which nonclustered indexes to create. I use the script posted by Microsoftie Bart Duncan in his blog post. However don't just go create all the indexes there. I generally look for Bart's "improvement_measure" column to be above 100k before I'll consider recommending the index to a client (and on systems that already have nonclustered indexes on the table, I'll look for index consolidation possibilities). Note also that the missing index DMVs will sometimes add the cluster key as an INCLUDEd column. This is unnecessary but harmless.
  2. Look directly in the plan cache to find query plans that perform scans. I use a variant of a query published by fellow MVP Glenn Berry in this blog post. Using the graphical query plan I can see what columns are being searched for and returned from the scans and then create the correct nonclustered indexes for these.

You can also use the Database Tuning Advisor, but I don't use that, personally.

You'll be amazed at the performance difference by having a good set of nonclustered indexes.

But don't go overboard otherwise you could detrimentally affect performance by having too many nonclustered indexes...

Tables with Nonclustered Indexes

The two graphs below show the number of tables that have a certain number of nonclustered indexes.

 

The data shows that it is most common to have 10 nonclustered indexes or less, but even that may be too many.

Every nonclustered index incurs overheard when a table row is inserted or deleted, or when any of the nonclustered index key columns (or INCLUDEd columns) are updated. Filtered indexes in SQL 2008+ are a special case, obviously. The overheard takes a few forms:

  • Buffer pool (i.e. memory and I/O) overhead of having to search the nonclustered index for the record to update.
  • I/O overhead of having to flush the updated index page to disk during the next checkpoint
  • Log space for the log records generated by the operation on the nonclustered index
  • Resource overheard for those log records in terms of:
    • Time to be read by the replication/CDC log reader Agent job
    • Time to be read by log backups (and data backups, if applicable)
    • Time and bandwidth to send the log records to a database mirroring mirror
    • Disk space to store the log records in a log backup
    • Time to restore the log records on a log shipping secondary or during a disaster recovery
  • Locking overhead
  • Page split overhead
  • Time to consistency check
  • Time to examine for fragmentation
  • Time to update statistics
  • Disk and backup space overhead

As you can see, nonclustered indexes can be a big burden on a system - you have to be careful when creating them so that you don't have too many.

There are three things you can do to reduce the number of nonclustered indexes on your system:

  1. Use the sys.dm_db_index_usage_stats DMV to find indexes that are only being updated. I've blogged about this here. Again, be careful though. Just because an index hasn't been used doesn't mean it should be dropped. It may be used only infrequently, but it's critical when it is used. Ideally you need to look at the output from the DMV after an entire business cycle has passed. Even then, be careful about dropping indexes that are enforcing uniqueness constraints as these can be used by the query optimizer without reflecting any user seeks or scans.
  2. Remove duplicate nonclustered indexes. Kimberly blogged code to find duplicate indexes here. There is no downside to doing this.
  3. Look for consolidation possibilities. Kimberly has code to show you all the key and INCLUDEd columns here. This is harder and is more of an art than a science. You're looking for indexes where you can combine two or more indexes into one without affecting the ability of the optimizer to use them for the various queries that the non-consolidated indexes used to help.
    • For example, an index on c1, c2, c3 INCLUDE c4, c5 can be combined with an index on c1, c2, c3 INCLUDE c4, c6. But only as long as c6 isn't a really wide column that would affect the performance of the queries using the first index.
    • A harder example: would you consolidate an index c1, c2 INCLUDE c3 with an index on c1, c3 INCLUDE c2? Possibly. It would depend on what the indexes are being used for in queries. 

Summary

Nonclustered indexes are essential for the performance of most workloads, but how many should you have? I often get someone in a class that Kimberly's teaching on indexes to ask her "what's the optimum number of indexes a table should have?" because I know it's a nonsensical question. (And she reciprocates by getting someone to ask me "how long will CHECKDB take?" :-)

The answer is a big, fat "it depends" - and hopefully I've given you some pointers to figure it out for yourself.

I'll continue this series of posts with more surveys and code that you can use on your systems to gauge the health of your indexes.

Hope this helps!

There's a lot of controversy about the Buffer Manager performance object counter Page Life Expectancy - mostly around people continuing to quote 300 as a threshold for starting to worry about there being a problem (which is just utter nonsense these days). That's far too *low* to be the point at which to start worrying if your PLE dips and stays there. Jonathan came up with a better number to use - based on the size of your buffer pool - see the bottom of his post here.

But that's not why I'm writing today: I want to explain why most of the time now Page Life Expectancy is really not giving you useful information.

Most new systems today use NUMA, and so the buffer pool is split up and managed per NUMA node, with each NUMA node getting it's own lazy writer thread, managing it's own buffer free list, and dealing with node-local memory allocations. Think of each of these as a mini buffer pool.

The Buffer Manager:Page Life Expectancy counter is calculated by adding the PLE of each mini buffer pool and then getting the average.

What does this mean? It means that the overall PLE is not giving you a true sense of what is happening on your machine as one NUMA node could be under memory pressure but the *overall* PLE would only dip slightly. One of my friends who's a Premier Field Engineer and MCM just had this situation today, which prompted this blog post. The conundrum was how can there be 100+ lazy writes/sec occuring when overall PLE is relatively static - and this was the issue.

For instance, for a machine with 8 NUMA nodes, with PLE of each being 4000, the overall PLE is 4000. If one of them drops to 1000, the overall PLE only drops to 3625, which likely wouldn't trigger your alerting as it hasn't even dropped 10%.

On NUMA machines, you need to be looking at the Buffer Node:Page Life Expectancy counters for all NUMA nodes otherwise you're not getting an accurate view of buffer pool memory pressure and so could be missing performance issues. And adjust Jonathan's threshold value according to the number of NUMA nodes you have.

Unfortunately there's no easy way to monitor per-NUMA node lazy writes - maybe in future.

[Edit: thanks to an email thread with Trayce Jordan from Microsoft and Jonathan, you can see the lazywriter activity for each NUMA node by looking for the lazywriter threads in sys.dm_exec_requests.]

Hope this helps!

This is a performance tuning post that's been on my to-do list for quite a while. Wait stats analysis is a great way of looking at the symptoms of performance problems (see my Wait Stats category for more posts on this) but using the sys.dm_os_wait_stats DMV shows everything that's happening on a server. If you want to see what waits are occuring because of a single query or operation on a production system (e.g. the effect of MAXDOP hints on number and length of CXPACKET waits for a query) then using the DMV isn't usually feasible - you'd need to clear the wait stats and then have nothing else running on the system except the query/operation you want to investigate. The way to do it is with extended events.

Whenever a thread state changes to SUSPENDED because of a wait, the wait_info event from the sqlos event package fires to mark the beginning of the wait. When the thread gets back on the CPU again, the wait_info event fires again, noting the total wait time and the signal wait time (spent on the RUNNABLE queue). Just like with the DMV, we need to calculate the resource wait time (spent being SUSPENDED) ourselves.

Here's a simple example for you - imagine we want to investigate the waits that occur during an index rebuild. First thing is to create the event session:

-- Drop the session if it exists.
IF EXISTS (SELECT * FROM sys.server_event_sessions WHERE name = 'MonitorWaits')
    DROP EVENT SESSION MonitorWaits ON SERVER
GO

CREATE EVENT SESSION MonitorWaits ON SERVER
ADD EVENT sqlos.wait_info
    (WHERE sqlserver.session_id = 53 /* session_id of connection to monitor */)
ADD TARGET package0.asynchronous_file_target
    (SET FILENAME = N'C:\SQLskills\EE_WaitStats.xel',
    METADATAFILE = N'C:\SQLskills\EE_WaitStats.xem')
WITH (max_dispatch_latency = 1 seconds);
GO

You'll need to plug in the session ID of the SQL Server connection where you'll run the operation under investigation. I created a dummy production database with a table and some data so I can monitor the waits from rebuilding its clustered index.

You can also filter out any wait types you're not interested in by adding another predicate to the session that filters on the sqlos.wait_type. You'll need to specify numbers, and you can list the mapping between numbers and human-understandable wait types using this code:

SELECT xmv.map_key, xmv.map_value
FROM sys.dm_xe_map_values xmv
JOIN sys.dm_xe_packages xp
    ON xmv.object_package_guid = xp.guid
WHERE xmv.name = 'wait_types'
    AND xp.name = 'sqlos';
GO

Now I turn on the session and run my rebuild.

-- Start the session
ALTER EVENT SESSION MonitorWaits ON SERVER STATE = START;
GO

-- Go do the query

-- Stop the event session
ALTER EVENT SESSION MonitorWaits ON SERVER STATE = STOP;
GO

And then I can see how many events fired:

SELECT COUNT (*)
FROM sys.fn_xe_file_target_read_file
    ('C:\SQLskills\EE_WaitStats*.xel',
    'C:\SQLskills\EE_WaitStats*.xem', null, null);
GO

13324

And then pull them into a temporary table and aggregate them (various ways of doing this, I prefer this one):

-- Create intermediate temp table for raw event data
CREATE TABLE #RawEventData (
    Rowid  INT IDENTITY PRIMARY KEY,
    event_data XML);
 
GO

-- Read the file data into intermediate temp table
INSERT INTO #RawEventData (event_data)
SELECT
    CAST (event_data AS XML) AS event_data
FROM sys.fn_xe_file_target_read_file (
    'C:\SQLskills\EE_WaitStats*.xel',
    'C:\SQLskills\EE_WaitStats*.xem', null, null);
GO

SELECT
    waits.[Wait Type],
    COUNT (*) AS [Wait Count],
    SUM (waits.[Duration]) AS [Total Wait Time (ms)],
    SUM (waits.[Duration]) - SUM (waits.[Signal Duration]) AS [Total Resource Wait Time (ms)],
    SUM (waits.[Signal Duration]) AS [Total Signal Wait Time (ms)]
FROM
    (SELECT
        event_data.value ('(
/event/@timestamp)[1]', 'DATETIME') AS [Time],
        event_data.value ('(/event/data[@name=''wait_type'']/text)[1]', 'VARCHAR(100)') AS [Wait Type],
        event_data.value ('(/event/data[@name=''opcode'']/text)[1]', 'VARCHAR(100)') AS [Op],
        event_data.value ('(/event/data[@name=''duration'']/value)[1]', 'BIGINT') AS [Duration],
        event_data.value ('(/event/data[@name=''signal_duration'']/value)[1]', 'BIGINT') AS [Signal Duration]
     FROM #RawEventData
    ) AS waits
WHERE waits.[op] = 'End'
GROUP BY waits.[Wait Type]
ORDER BY [Total Wait Time (ms)] DESC;
GO

-- Cleanup
DROP TABLE #RawEventData;
GO

Wait Type             Wait Count  Total Wait Time (ms) Total Resource Wait Time (ms) Total Signal Wait Time (ms)
--------------------- ----------- -------------------- ----------------------------- ---------------------------
LATCH_EX              4919        15783                15693                         90
LATCH_SH              18          4399                 4398                          1
CXPACKET              16          3039                 3038                          1
PAGEIOLATCH_SH        813         468                  457                           11
PAGELATCH_SH          246         113                  109                           4
WRITELOG              99          71                   66                            5
PAGELATCH_UP          203         65                   63                            2
LOGBUFFER             123         58                   57                            1
WRITE_COMPLETION      82          55                   55                            0
PAGEIOLATCH_UP        35          27                   27                            0
IO_COMPLETION         80          25                   25                            0
SQLTRACE_LOCK         1           6                    6                             0
PAGEIOLATCH_EX        4           3                    3                             0
CMEMTHREAD            17          1                    0                             1
SOS_SCHEDULER_YIELD   6           0                    0                             0

You can easily change this to give average times and add filters to remove uninteresting waits.

If you're interested in getting more involved with extended events, check out Jonathan's excellent series from last December.

Happy waiting!

Over the last few months I've been blogging occasionally about some pretty deep performance tuning topics, namely latches and spinlocks (see my blog categories Wait Stats, Latches, and Spinlocks).

Ewan Fairweather and Mike Ruthruff of the SQLCAT team have written some really excellent whitepapers on interpreting and dealing with latch and spinlock issues, which I helped tech review, and they've now been published.

I strongly recommend reading these whitepapers if you want to improve your perf tuning skills beyond looking at wait statistics.

You can get them at:

Enjoy!

A month ago I kicked off a survey about MAXDOP setting - see here for the survey. I received results for 700 servers around the world! Here they are:

 

The X-axis format is X-Y-Z, where X = number of cores, Y = number of NUMA nodes, Z = MAXDOP setting. I didn't differentiate between hyper-threading or not, or soft vs. hard NUMA.

The results are striking - 75% of all systems out there use a server MAXDOP of zero. Now, this doesn't show whether individual queries are using MAXDOP to override that, or resource governor in 2008+ to override that either and I don't have info on the workload for all these servers - so it's not a result we can draw any concrete conclusions from. However, I do find it interesting that such a high proportion are running fine with MAXDOP 0 - my expectation was that there would be a higher proportion of servers with a non-zero MAXDOP setting.

There are quite a few 'black and white' configuration rules out there - for instance:

  1. Set MAXDOP to 1 if you're seeing CXPACKET waits as the prevalent wait type.
  2. Set MAXDOP to 1 for OLTP systems, and don't do anything else.
  3. Old Microsoft guidance to set MAXDOP to half the number of physical processors.
  4. Set MAXDOP to the number of cores in the NUMA node.

These are all incorrect as *rules*. There is no one-size-fits-all rule for MAXDOP - there are only general guidelines. For instance:

  • For OLTP systems, it can often be beneficial to set MAXDOP to 1 and then use the MAXDOP query hint to override the server-wide setting for queries that can benefit from parallelism.
  • For mixed-workload systems, you need to be careful how you set MAXDOP so you don't inadvertently penalize one of the workloads. Judicious use of the MAXDOP query hint can help here. A more powerful solution for mixed workloads is to use resource governor and have a workload group for each portion of the workload, with a different MAXDOP for each workload group.
  • For systems with high CXPACKET waits, investigate why this is the case before reducing MAXDOP. It's easy to come up with a demo where there are lots of CXPACKET waits, and while reducing MAXDOP (for the server or the query) reduces the CXPACKET waits, it also makes the query take a lot longer. CXPACKET waits can be because the statistics are incorrect and the query execution system divides up the work by the out-of-date statistics
  • Consider using the cost threshold for parallelism setting - see Jonathan's recent post here.

Using the resource governor as I described above can be a very easy way to mess around with the MAXDOP setting - especially for applications with legacy code that you cannot change, and you don't want to set a server-wide MAXDOP setting. This even works if the legacy code uses MAXDOP query hints, because the resource governor workload group MAXDOP setting *cannot* be overridden.

The key point when making any change for performance tuning is to test the change before putting it into production and work out which setting works best for your workload on your system - rather than blindly following 'this is the best way' rules for settings that people publish.

In other words, what should your MAXDOP be set to? It depends! :-)

One thing to be aware of: if you change the MAXDOP setting for the server, it will flush the plan cache when you do. It shouldn't, but it does. Be careful when doing this on a production server...

Thanks to all those who replied to the original survey!

Back in April I kicked off a survey where I asked you all to send me some information about your buffer pools - how much memory is being used for data file pages and how much of that memory is storing empty space. I got back data from 1394 servers around the world - thanks!

The reason I'm interested in this, and you should be too, is that memory is one of the most important resources that SQL Server uses. If you don't have anough memory, your workload will suffer because:

  • You'll be driving more read I/Os because more of the workload can't fit in the buffer pool.
  • You'll be driving more write I/Os because the lazywriter will have to be tossing dirty pages from the buffer pool.
  • You may encounter RESOURCE_SEMAPHORE waits because queries can't get the query execution memory grants that they need.
  • You may cause excessive plan recompilations if the plan cache is too constrained.

And a bunch of other things.

One of the memory problems that Kimberly discussed in depth last year (and teaches in depth in our Performance Tuning classes) is single-use plan cache bloat - where a large proportion of the plan cache is filled with single-use plans that don't ever get used again. You can read about it in the three blog posts in her Plan Cache category, along with how to identify plan cache bloat and what you can do about it.

This post is about the memory the buffer pool is using to store data file pages, and whether good use is being made from it.

The sys.dm_os_buffer_descriptors DMV gives the information stored by the buffer pool for each data file page in memory (called a BUF structure in the code). One of the things that this structure keeps track of is the free_space_in_bytes for each page. This metric is updated in real-time as changes are made to the page in memory (you can easily prove this for yourself) and so is a reliable view of the data density of the used portion of the buffer pool.

Data density? Think of this as how packed full or data, index, or LOB rows a data file page is. The more free space on the page, the lower the data density.

Low data density pages are caused by:

  • Very wide data rows (e.g. a table with a 5000-byte fixed-size row will only ever fit one row per page, wasting roughly 3000 bytes per page).
  • Page splits, from random inserts into full pages or updates to rows on full pages. These kind of page splits result in logical fragmentation that affects range scan performance, low data density in data/index pages, and increased transaction log overhead (see How expensive are page splits in terms of transaction log?).
  • Row deletions where the space freed up by the deleted row will not be reused because of the insert pattern into the table/index.

Low data density pages can be detrimental to SQL Server performance, because the lower the density of records on the pages in a table:

  • The higher the amount of disk space necessary to store the data (and back it up).
  • The more I/Os are needed to read the data into memory.
  • The higher the amount of buffer pool memory needed to store the extra pages in the buffer pool.

From the survey results I took all the SQL Servers that were using at least one GB of buffer pool memory for data file page storage (900 servers) and plotted that amount of memory against the percentage of that memory that was storing free space in the data file pages.

Wow! That's a lot of servers with a lot of buffer pool memory storing nothing useful.

So what can you do about it? There are a number of solutions to low page density including:

  • Change the table schema (e.g. vertical partitioning, using smaller data types).
  • Change the index key columns (usually only applicable to clustered indexes - e.g. changing the leading cluster key from a random value like a non-sequential GUID to a sequential GUID or identity column).
  • Use index FILLFACTOR to reduce page splits, and...
  • Periodically rebuild problem indexes.
  • Consider enabling data compression on some tables and indexes.

From the graph above, bear in mind that some of the 'wasted' space on these servers could be from proper index management where data and index pages have a low FILLFACTOR set to alleviate page splits. But I suspect that only accounts for a small portion of what we're seeing in this data.

The purpose of my survey and this post is not to explain how to make all the changes to reduce the amount of free space being stored in memory, but to educate you that this is a problem. Very often PAGEIOLATCH waits are prevalent on systems because more I/O than necessary is being driven to the I/O subsystem because of things like bad plans causing table scans or low data density. If you can figure out that it's not an I/O subsystem problem, then you as the DBA can do something about it.

Below is a script to analyze the buffer pool and break down by database the amount of space being taken up in the buffer pool and how much of that space is empty space. For systems with a 100s of GB of memory in use, this query may take a while to run:

SELECT
   (CASE WHEN ([database_id] = 32767)
       THEN 'Resource Database'
       ELSE DB_NAME ([database_id]) END) AS [DatabaseName],
   COUNT (*) * 8 / 1024 AS [MBUsed],
   SUM (CAST ([free_space_in_bytes] AS BIGINT)) / (1024 * 1024) AS [MBEmpty]
FROM sys.dm_os_buffer_descriptors
GROUP BY [database_id];
GO

And here's some sample output from a client system (made anonymous, of course):

DatabaseName        MBUsed   MBEmpty
------------------- -------- ---------
Resource Database   51       11
ProdDB              71287    9779
master              2        1
msdb                481      72
ProdDB2             106      17
model               0        0
tempdb              2226     140

Below is a script that will break things down by table and index across all databases that are using space in the buffer pool. I'm filtering out system objects plus indexes where the amount of space used in the buffer pool is less than 100MB. You can use this to identify tables and indexes that need some work on them to allow your buffer pool memory to be used more effectively by SQL Server and increase your workload performance.

[Edit: There was a bug in the script - fixed now 6/10/2011] 

EXEC sp_MSforeachdb
    N'IF EXISTS (SELECT 1 FROM (SELECT DISTINCT DB_NAME ([database_id]) AS [name]
    FROM sys.dm_os_buffer_descriptors) AS names WHERE [name] = ''?'')
BEGIN
USE [?]
SELECT
    ''?'' AS [Database],
    OBJECT_NAME (p.[object_id]) AS [Object],
    p.[index_id],
    i.[name] AS [Index],
    i.[type_desc] AS [Type],
    --au.[type_desc] AS [AUType],
    --DPCount AS [DirtyPageCount],
    --CPCount AS [CleanPageCount],
    --DPCount * 8 / 1024 AS [DirtyPageMB],
    --CPCount * 8 / 1024 AS [CleanPageMB],
    (DPCount + CPCount) * 8 / 1024 AS [TotalMB],
    --DPFreeSpace / 1024 / 1024 AS [DirtyPageFreeSpace],
    --CPFreeSpace / 1024 / 1024 AS [CleanPageFreeSpace],
    ([DPFreeSpace] + [CPFreeSpace]) / 1024 / 1024 AS [FreeSpaceMB],
    CAST (ROUND (100.0 * (([DPFreeSpace] + [CPFreeSpace]) / 1024) / (([DPCount] + [CPCount]) * 8), 1) AS DECIMAL (4, 1)) AS [FreeSpacePC]
FROM
    (SELECT
        allocation_unit_id,
        SUM (CASE WHEN ([is_modified] = 1)
            THEN 1 ELSE 0 END) AS [DPCount],
        SUM (CASE WHEN ([is_modified] = 1)
            THEN 0 ELSE 1 END) AS [CPCount],
        SUM (CASE WHEN ([is_modified] = 1)
            THEN CAST ([free_space_in_bytes] AS BIGINT) ELSE 0 END) AS [DPFreeSpace],
        SUM (CASE WHEN ([is_modified] = 1)
            THEN 0 ELSE CAST ([free_space_in_bytes] AS BIGINT) END) AS [CPFreeSpace]
    FROM sys.dm_os_buffer_descriptors
    WHERE [database_id] = DB_ID (''?'')
    GROUP BY [allocation_unit_id]) AS buffers
INNER JOIN sys.allocation_units AS au
    ON au.[allocation_unit_id] = buffers.[allocation_unit_id]
INNER JOIN sys.partitions AS p
    ON au.[container_id] = p.[partition_id]
INNER JOIN sys.indexes AS i
    ON i.[index_id] = p.[index_id] AND p.[object_id] = i.[object_id]
WHERE p.[object_id] > 100 AND ([DPCount] + [CPCount]) > 12800 -- Taking up more than 100MB
ORDER BY [FreeSpacePC] DESC;
END';

And here's some sample output from the same client system with the more comprehensive script:

Database Object index_id Index        Type         TotalMB FreeSpaceMB FreeSpacePC
-------- ------ -------- ------------ ------------ ------- ----------- -----------
ProdDB   TableG 1        TableG_IX_1  CLUSTERED    531     130         24.5
ProdDB   TableI 1        TableI_IX_1  CLUSTERED    217     48          22.2
ProdDB   TableG 2        TableG_IX_2  NONCLUSTERED 127     27          21.8
ProdDB   TableC 1        TableC_IX_1  CLUSTERED    224     47          21.4
ProdDB   TableD 3        TableD_IX_3  NONCLUSTERED 1932    393         20.4
ProdDB   TableH 1        TableH_IX_1  CLUSTERED    162     33          20.4
ProdDB   TableF 5        TableF_IX_5  NONCLUSTERED 3128    616         19.7
ProdDB   TableG 9        TableG_IX_9  NONCLUSTERED 149     28          19.1
ProdDB   TableO 10       TableO_IX_10 NONCLUSTERED 1003    190         19
ProdDB   TableF 6        TableF_IX_6  NONCLUSTERED 3677    692         18.8
.
.

This is cool because it's a lot less intrusive way to figure out which tables and indexes have data density issues than running sys.dm_db_index_physical_stats (see this post for an in depth view of this DMV). You can mess around with the script to make it create a table to hold all the data for all databases and slice+dice however you want. I have a much more comprehensive script that I use on client systems but this one will provide you with lots of relevant data.

Have fun! Let me know if this is useful!

I have a few surveys to editorialize but I'd like to kick off another one where you have to run a bit of code and send the results.

For this one I'm interested in how you have your system configured for parallelism, given the number of processor cores and NUMA nodes configured.

For systems using SQL Server 2005 onwards, the following code will work:

SELECT
    [value_in_use] AS [MAXDOP],
    os.*
FROM
    sys.configurations,
    (SELECT
        COUNT (DISTINCT [parent_node_id]) AS [Nodes],
        COUNT (*) AS [Cores]
     FROM
        sys.dm_os_schedulers 
     WHERE
        [status] = 'VISIBLE ONLINE') AS os
WHERE
    [name] = 'max degree of parallelism';
GO

Shoot me the results in email, spreadsheet, or as a comment on the blog. Comment #7 below has a Powershell script from Aaron Nelson (b|t) for all your servers - thanks!

If you have any info on the general workload/workload mix on the server, that would be helpful too.

Thanks!

Categories:
Performance Tuning | Surveys

It's been a long time since the last blog post on SSD benchmarking - I've been busy! I'm starting up my benchmarking activities again and hope to post more frequently. You can see the whole progression of benchmarking posts here.

You can see my benchmarking hardware setup here, with the addition of the Fusion-io ioDrive Duo 640GB drives that Fusion-io were nice enough to lend me. My test systems now have 16GB each and all tests were performed with the buffer pool ramped up, so memory allocation didn't figure into the performance numbers.

In this recent set of tests I wanted to explore three questions:

  1. What kind of performance gain do I get upgrading from Fusion-io's v1.2 driver to the v2.2 driver?
  2. What is the sweet spot for the number of files on an SSD?
  3. Does a 4Kb block size give any gains in performance for my test?

To keep it simple I'm using one half of the 640GB drive (it's two 320GB drives under the covers). To do this my test harness does the following:

  • Formats the SSDs in one of three ways:
    • Fusion-io basic format (the 320GB drive has 300GB capacity)
    • Fusion-io improved write performance format (the 320GB drive has only 210GB capacity, 70% of normal)
    • Fusion-io maximum write performance format (the 320GB drive has only 151GB capacity, 50% of normal)
  • The SSD format is performed using Fusion-io's ioManager tool
  • Creates 1, 2, 4, 8, 16, 32, 64 or 128 data files, with the file sizes calculated to fill the SSDs
  • My table structure is:

CREATE TABLE MyBigTable (
    c1 UNIQUEIDENTIFIER ROWGUIDCOL DEFAULT NEWID (),
    c2 DATETIME DEFAULT GETDATE (),
    c3 CHAR (111) DEFAULT 'a',
    c4 INT DEFAULT 1,
    c5 INT DEFAULT 2,
    c6 BIGINT DEFAULT 42); 
GO

CREATE CLUSTERED INDEX MyBigTable_cl ON MyBigTable (c1);
GO

  • I have 64 connections each inserting 2.5 million records into the table (with the loop code running server-side) for a total of 160 million records inserted, in batches of 1000 records per transaction. This works out to be about 37 GB of allocated space from the database.

The clustered index on a random GUID is the easiest way to generate random reads and writes, and is a very common design pattern out in the field (even though it performs poorly) - for my purposes it's perfect.

I tested each of the eight data file layouts on the following configurations (all using 1MB partition offsets, 64k NTFS allocation unit size, RAID was not involved):

  • Log and data on a single 320GB SSD with the old v1.2.7 Fusion-io driver (each of the 3 ways of formatting)
  • Log and data on a single 320GB SSD with the new v2.2.3 Fusion-io driver (each of the 3 ways of formatting)
  • Log and data on a single 320GB SSD with the new v2.2.3 Fusion-io driver and a 4Kb block size (each of the 3 ways of formatting)
That's a total of 9 configurations, with 8 data file layouts in each configuration - making 72 separate configurations. I ran each test 5 times and then took an average of the results - so altogether I ran 360 tests, for a cumulative test time of just over 1.43 million seconds (16.5 days) during April.

The test harness takes care of all of this except reformatting the drives, and also captures the wait stats for each test, making note of the most prevalent waits that make up the top 95% of all waits during the test. This uses the script from this blog post.

On to the results... the wait stats are *really* interesting!

Note: the y-axes in the graphs below do not start at zero. All the graphs have the same axes so there is nothing to misunderstand. They are not misleading - and I make no apologies for my choice of axes - I want to show the difference between the various formats more clearly.

v1.2.7 Fusion-io driver and 512-byte sector size

 

The best performance I could get with the old driver was 3580 seconds for test completion, with 4 data files and the Improved Write format - a tiny amount less than the time for 8 data files.

v2.2.3 Fusion-io driver and 512-byte sector size

 

The best performance I could get with the new driver was 2993 seconds for test completion, with 8 data files and the Max Write format - a tiny amount less than the time for other formats for 8 data files, and very close to the times for 4 data files.

On average across all the tests the new v2.2 Fusion-io driver gives a 20.5% performance boost over the old v1.2 driver, and for the regular format, the new v2.2 Fusion-io driver gives a 24% performance boost over the old v1.2 driver. It also (but I didn't measure) reduces the amount of system memory required to use the SSDs. Good stuff!

v2.2.3 Fusion-io driver and 4-Kb sector size

 

The performance using a 4-Kb sector size is roughly the same for my test compared to traditional 512-byte sector size. The most performance gain I got was 3% over using a 512-byte sector size, but on average across all tests the performance was very slightly (0.5%) lower.

Wait Stats

The wait stats were very interesting.

The wait stats are presented in the following format:

WaitType        Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
--------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX  154611.39  128056.51   26554.88  45295507   71.83       0.0034     0.0028    0.0006
PAGELATCH_UP    37948.31   36988.52    959.79    2314370    17.63       0.0164     0.016     0.0004
PAGELATCH_SH    16976      13823.71    3152.3    3751811    7.89        0.0045     0.0037    0.0008

The columns are:

  • WaitType - kind of obvious
  • Wait_S - cumulative wait time in seconds, from a thread being RUNNING, going through SUSPENDED, back to RUNNABLE and then RUNNING again
  • Resource_S - cumulative wait time in seconds while a thread was SUSPENDED (called the resource wait time)
  • Signal_S - cumulative wait time in seconds while a thread was RUNNABLE (i.e. after being signalled that the resource wait has ended and waiting on the runnable queue to get the CPU again - called the signal wait time)
  • WaitCount - number of waits of this type during the test
  • Percentage - percentage of all waits during the test that had this type
  • AvgWait_S - average cumulative wait time in seconds
  • AvgRes_S - average resource wait time in seconds
  • AvgSig_S - average signal wait time in seconds

For a single file, the wait stats for all the various formatting options look like:

WaitType        Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
--------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX  226362.54  193378.56   32983.98  45742117   78.66       0.0049     0.0042    0.0007
PAGELATCH_UP    36701.66   35760.67    940.99    2144533    12.75       0.0171     0.0167    0.0004
PAGELATCH_SH    16775.05   13644.99    3130.06   3542549    5.83        0.0047     0.0039    0.0009

With more files, the percentage of PAGEIOLATCH_EX waits increases, and by the time we get to 8 files, SOS_SCHEDULER_YIELD has started to appear. At 8 files, the wait stats for all the various formatting options look like:

WaitType             Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
-------------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX       244703.77  210859.63   33844.14  45169863   89.47       0.0054     0.0047    0.0007
SOS_SCHEDULER_YIELD  12500.15   0.99        12499.16  823658     4.57        0.0152     0         0.0152
PAGELATCH_SH         5777.54    3749.18     2028.36   476618     2.11        0.0121     0.0079    0.0043

By 16 files, the PAGELATCH waits have disappeared from the top 95%. As the number of files increases to 128, the PAGEIOLATCH_EX waits increase to just over 91% and the wait stats look like this for regular format:

WaitType             Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
-------------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX       304106.32  273671      30435.32  43478489   91.25       0.007      0.0063    0.0007
SOS_SCHEDULER_YIELD  16733      1.35        16731.65  889147     5.02        0.0188     0         0.0188

What does this mean? It's obvious from the wait stats that as I increase the number of data files on the drive, the average resource wait time for each PAGEIOLATCH_EX wait increases from 4.2ms for 1 file up to 6.3ms for 128 files - 50% worse, with the signal wait time static at 0.7ms.

But look at the wait stats for 128 files using the Maximum Write format:

WaitType      Wait_S Resource_S  Signal_s  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
-------------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX      196322.97 166602.88   29720.09  45435521  88.66      0.0043 0.0037   0.0007
SOS_SCHEDULER_YIELD  16182.48 1.1     16181.38  828595  7.31      0.0195 0   0.0195

The average resource wait time for the PAGEIOLATCH_EX waits has dropped from 6.3ms to 3.7ms! But isn't PAGEIOLATCH_EX a wait type that's for a page *read*? Well, yes, but what I think is happening is that the buffer pool is having to force pages out to disk to make space for the pages being read in to be inserted into (which I believe is included in the PAGEIOLATCH_EX wait time) - and when the SSD is formatted with the improved write algorithm, this is faster and so the PAGEIOLATCH_EX resource wait time decreases.

But why the gradual decrease in PAGELATCH waits and increase in SOS_SCHEDULER_YIELD waits as the number of files increases?

I went back and ran a single file test and used the sys.dm_os_waiting_tasks DMV (see this blog post) to see what the various threads are waiting for. Here's some example output:

session_id  wait_duration_ms  wait_type     resource_description
----------  ----------------  ------------  --------------------
79          11                PAGELATCH_UP  5:1:1520544
80          14                PAGELATCH_UP  5:1:1520544
93          16                PAGELATCH_UP  5:1:1520544
94          0                 PAGELATCH_UP  5:1:1520544
101         15                PAGELATCH_UP  5:1:1520544
111         25                PAGELATCH_UP  5:1:1520544
110         17                PAGELATCH_UP  5:1:1520544
75          6                 PAGELATCH_UP  5:1:1520544
78          17                PAGELATCH_UP  5:1:1520544
88          8                 PAGELATCH_UP  5:1:1520544
107         12                PAGELATCH_UP  5:1:1520544
109         25                PAGELATCH_UP  5:1:1520544
113         20                PAGELATCH_UP  5:1:1520544
.
.

Dividing 1520544 by 8088 gives exactly 188. Running the same query a few seconds later gives most of the threads waiting on resource 5:1:1544808, another exact multiple of 8088. These resources are PFS pages! What we're seeing is PFS page contention, just like you can get in tempdb with lots of concurrent threads creating and dropping temp tables. In this case, I have 64 concurrent threads doing inserts that are causing page splits, which requires page allocations. As the number of files increases, the amount of PFS page contention decreases. It disappears after 8 files because I've only got 8 cores, so there can only be 8 threads running at once (one per SQLOS scheduler, with the others SUSPENDED on the waiter list or waiting in the RUNNABLE queue).

From 16 to 128 files, the wait stats hardly change and the performance (in the Improved Write and Max Write formats) only slightly degrades (5%) with each doubling of the number of files. Without deeper investigation, I'm putting this down to increased amounts of metadata to deal with - maybe with more to do when searching the allocation cache for the allocation unit of the clustered index. If I have time I'll dig in and investigate exactly why.

The SOS_SCHEDULER_YIELD waits are just because the threads are able to do more before having to wait, and so they're hitting voluntary yield points in the code - the workload is becoming more CPU bound.

Summary

I've clearly shown that the new Fusion-io driver gives a nice boost compared to the older one - very cool.

I've also shown that the number of files on the SSD does have an affect on performance too - with the sweet spot appearing to be the number of processor cores. I'd love to see someone do similar tests on a 16-way, 32-way or higher (or lend me one to play with :-)

[Edit: I discussed the results with my friend Thomas Kejser on the SQLCAT team and he sees the same behavior on a 64-way running the same benchmark (in fact we screen-shared on a 64-way system with 2TB and 4 640GB Fusion-io cards this weekend). He posted some more investigations on his blog - see here.]

And finally, I showed that for my workload, using a 4-Kb sector size did not improve performance.

I'd call that a successful test - *really* interesting!

Continuing my series on advanced performance troubleshooting - see these two posts for the scripts I'll be using and an introduction to the series:

In this blog post I'd like to show you an example of SOS_SCHEDULER_YIELD waits occurring and how it can seem like a spinlock is the cause.

I originally published this blog post and then had a discussion with good friend Bob Ward from Product Support who questioned my conclusions given what he's seen (thanks Bob!). After digging in further, I found that my original post was incorrect, so this is the corrected version.

The SOS_SCHEDULER_YIELD wait means that an executing thread voluntarily gave up the CPU to allow other threads to execute. The SQL Server code is sprinkled with "voluntary yields" in places where high CPU usage may occur.

One such place where a thread will sleep but not explicitly yield is when backing off after a spinlock collision waiting to see if it can get access to the spinlock. A spinlock is a very lightweight synchronization mechanism deep inside SQL Server that protects access to a data structure (not database data itself). See the end of the second blog post above for more of an explanation on spinlocks.

When a thread voluntarily yields, it does not go onto the waiting tasks list - as it's not waiting for anything - but instead goes to the bottom of the RUNNABLE queue. SOS_SCHEDULER_YIELD waits by themselves are not cause for concern unless they are the majority of waits on the system, and performance is suffering.

To set up the test, I'll create a simple database and table:

USE master;
GO
DROP DATABASE YieldTest;
GO

CREATE DATABASE YieldTest;
GO

USE YieldTest;
GO

CREATE TABLE SampleTable (c1 INT IDENTITY);
GO
CREATE NONCLUSTERED INDEX SampleTable_NC ON SampleTable (c1);
GO

SET NOCOUNT ON;
GO
INSERT INTO SampleTable DEFAULT VALUES;
GO 100

Then I'll clear out wait stats and latch stats:

DBCC SQLPERF ('sys.dm_os_latch_stats', CLEAR);
GO

DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR);
GO

And then fire up 50 clients running the following code (I just have a CMD script that fires up 50 CMD windows, each running the T-SQL code):

USE YieldTest;
GO

SET NOCOUNT ON;
GO

DECLARE @a int
WHILE (1=1)
BEGIN 
    SELECT @a = COUNT (*) FROM YieldTest..SampleTable WHERE c1 = 1;
END;
GO

And the CPUs in my laptop are jammed solid (snapped from my desktop and rotated 90 degrees to save space):

 

Wow!

Looking at perfmon:

 

The CPU usage is not Windows (% User Time), and of all the counters I usually monitor (I've cut them all out here for clarity) I can see a sustained very high Lock Requests/sec for Object and Page locks (almost 950 thousand requests per second for both types! Gotta love my laptop :-)

So let's dig in. First off, looking at wait stats (using the script in the wait stats post referenced above):

WaitType            Wait_S  Resource_S Signal_S WaitCount Percentage AvgWait_S AvgRes_S AvgSig_S
------------------- ------- ---------- -------- --------- ---------- --------- -------- --------
SOS_SCHEDULER_YIELD 4574.77 0.20       4574.57  206473    99.43      0.0222    0.0000   0.0222

SOS_SCHEDULER_YIELD at almost 100% of the waits on the system means that I've got CPU pressure - as I saw from the CPU graphs above. The fact that nothing else is showing up makes me suspect this is a spinlock issue.

Checking in the sys.dm_os_waiting_tasks output (see script in the second blog post referenced above), I see nothing waiting, and if I refresh a few times I see the occasional ASYNC_NETWORK_IO and/or PREEMPTIVE_OS_WAITFORSINGLEOBJECT wait type - which I'd expect from the CMD windows running the T-SQL code.

Checking the latch stats (again, see script in the second blog post referenced above) shows no latch waits apart from a few BUFFER latch waits.

So now let's look at spinlocks. First off I'm going to take a snapshot of the spinlocks on the system:

-- Baseline
DROP TABLE ##TempSpinlockStats1;
GO
SELECT * INTO ##TempSpinlockStats1 FROM sys.dm_os_spinlock_stats
WHERE [collisions] > 0
ORDER BY [name];
GO

(On 2005 you'll need to use DBCC SQLPERF ('spinlockstats') and use INSERT/EXEC to get the results into a table - see the post above for example code.)

Then wait 10 seconds or so for the workload to continue... and then grab another snapshot of the spinlocks:

-- Capture updated stats
DROP TABLE ##TempSpinlockStats2;
GO
SELECT * INTO ##TempSpinlockStats2 FROM sys.dm_os_spinlock_stats
WHERE [collisions] > 0
ORDER BY [name];
GO

Now running the code I came up with to show the difference between the two snapshots:

-- Diff them
SELECT
    '***' AS [New],
    ts2.[name] AS [Spinlock],
    ts2.[collisions] AS [DiffCollisions],
    ts2.[spins] AS [DiffSpins],
    ts2.[spins_per_collision] AS [SpinsPerCollision],
    ts2.[sleep_time] AS [DiffSleepTime],
    ts2.[backoffs] AS [DiffBackoffs]
FROM ##TempSpinlockStats2 ts2
LEFT OUTER JOIN ##TempSpinlockStats1 ts1
    ON ts2.[name] = ts1.[name]
WHERE ts1.[name] IS NULL
UNION
SELECT
    '' AS [New],
    ts2.[name] AS [Spinlock],
    ts2.[collisions] - ts1.[collisions] AS [DiffCollisions],
    ts2.[spins] - ts1.[spins] AS [DiffSpins],
    CASE (ts2.[spins] - ts1.[spins]) WHEN 0 THEN 0
        ELSE (ts2.[spins] - ts1.[spins]) / -- > 0 spins = > 0 collisions
        (ts2.[collisions] - ts1.[collisions]) END AS [SpinsPerCollision],
    ts2.[sleep_time] - ts1.[sleep_time] AS [DiffSleepTime],
    ts2.[backoffs] - ts1.[backoffs] AS [DiffBackoffs]
    --, ts2.*
FROM ##TempSpinlockStats2 ts2
LEFT OUTER JOIN ##TempSpinlockStats1 ts1
    ON ts2.[name] = ts1.[name]
WHERE ts1.[name] IS NOT NULL
    AND ts2.[collisions] - ts1.[collisions] > 0
ORDER BY [New] DESC, [Spinlock] ASC;
GO

New  Spinlock                        DiffCollisions DiffSpins  SpinsPerCollision DiffSleepTime DiffBackoffs
---- ------------------------------- -------------- ---------- ----------------- ------------- ------------
     LOCK_HASH                       6191134        4005774890 647               686           1601383
     OPT_IDX_STATS                   1164849        126549245  108               57            7555
     SOS_OBJECT_STORE                73             305        4                 0             0
     SOS_WAITABLE_ADDRESS_HASHBUCKET 115            44495      386               0             3
     XDESMGR                         1              0          0                 0             0

This is telling me that there is massive contention for the LOCK_HASH spinlock, further backed up by the SOS_WAITABLE_ADDRESS_HASHBUCKET spinlock. The LOCK_HASH spinlock protects access to the hash buckets used by the lock manager to efficiently keep track of the lock resources for locks held inside SQL Server and to allow efficient searching for lock hash collisions (i.e. does someone hold the lock we want in an incompatible mode?). In this case the contention is so bad that instead of just spinning, the threads are actually backing off and letting other threads execute to allow progress to be made.

And that makes perfect sense because of what my workload is doing - 50 concurrent connections all trying to read the same row on the same page in the same nonclustered index.

But is that the cause of the SOS_SCHEDULER_YIELDs? To prove it one way or the other, I created an XEvent session that would capture callstacks when a wait occcured:

CREATE EVENT SESSION MonitorWaits ON SERVER
ADD EVENT sqlos.wait_info
 (ACTION (package0.callstack))
      ADD TARGET package0.asynchronous_bucketizer (
            SET filtering_event_name = 'sqlos.wait_info',
            source_type = 1, source = 'package0.callstack')        
      WITH (MAX_MEMORY=50MB, MEMORY_PARTITION_MODE = PER_NODE,
      max_dispatch_latency=5 seconds)
GO

I also made sure to have the correct symbol files in the \binn directory (see How to download a sqlservr.pdb symbol file). After running the workload and examining the callstacks, I found the majority of the waits were coming from voluntary yields deep in the Access Methods code. An example call stack is:

IndexPageManager::GetPageWithKey+ef [ @ 0+0x0
GetRowForKeyValue+146 [ @ 0+0x0
IndexRowScanner::EstablishInitialKeyOrderPosition+10a [ @ 0+0x0
IndexDataSetSession::GetNextRowValuesInternal+7d7 [ @ 0+0x0
RowsetNewSS::FetchNextRow+12a [ @ 0+0x0
CQScanRangeNew::GetRow+6a1 [ @ 0+0x0
CQScanCountStarNew::GetRowHelper+44 [ @ 0+0x0
CQScanStreamAggregateNew::Open+70 [ @ 0+0x0
CQueryScan::Uncache+32f [ @ 0+0x0
CXStmtQuery::SetupQueryScanAndExpression+2a2 [ @ 0+0x0
CXStmtQuery::ErsqExecuteQuery+2f8 [ @ 0+0x0
CMsqlExecContext::ExecuteStmts&lt;1,1&gt;+cca [ @ 0+0x0
CMsqlExecContext::FExecute+58b [ @ 0+0x0
CSQLSource::Execute+319

[Edit: check out how to do this in the spinlock whitepaper

This is clearly (to me) nothing to do with the LOCK_HASH spinlock, so that's a red herring. In this case, I'm just CPU bound. When a thread goes to sleep when backing off from a spinlock, it directly calls Windows Sleep() - so it does not show up as a SQL Server wait type at all, even though the sleep call is made from the SQLOS layer. Which is a bummer.

How to get around that and reduce CPU usage? This is really contrived workload, but this can occur for real. Even if I try using WITH (NOLOCK), the NOLOCK seek will take a table SCH_S (schema-stability) lock to make sure the table structure doesn't change while the seek is occurring, so that only gets rid of the page locks, not the object locks and doesn't help CPU usage. With this (arguably weird) workload, there are a few things I could do (just off the top of my head):

  • Scale-out the connections to a few copies of the database, updated using replication
  • Do some mid-tier or client-side caching, with some notification mechanism of data changes (e.g. a DDL trigger firing a Service Broker message)

So this was an example of where wait stats lead to having to look at spinlock stats, but that the spinlock, on even *deeper* investigation, wasn't the issue at all. Cool stuff.

Next time we'll look at another cause of SOS_SCHEDULER_YIELD waits.

Hope you enjoyed this - let me know your thoughts!

Cheers

PS In the latest SQLskills Insider Quick Tips newsletter from last week, I did a video demo of looking at transaction log IOs and waits - you can get it here.

In this survey I'd like you to run some code and then send me the results (and I'm sure someone will put together a PowerShell script to make it easy to run on multiple instances).

I want to know how much of your precious server memory is being wasted storing empty space on data file pages. I'm sure you'll be interested in it too!

Here's the code (works on 2005+):

SELECT
    COUNT (*) * 8 / 1024 AS MBUsed,
    SUM (CONVERT (BIGINT, free_space_in_bytes)) / (1024 * 1024) AS MBEmpty
FROM sys.dm_os_buffer_descriptors;
GO

[Edit: forgot that many systems have way more memory than my 16GB laptop, so updated code above uses BIGINT now.

And here's a PowerShell script courtesy of Nic Cain, assuming server list in c:\temp\serverlist.txt:

$SQLQuery = 'SELECT
COUNT (*) * 8 / 1024 AS MBUsed,
SUM (CONVERT (BIGINT, free_space_in_bytes)) / (1024 * 1024) AS MBEmpty
FROM sys.dm_os_buffer_descriptors;
GO'

$servers = get-content c:\temp\serverlist.txt
foreach ($srv in $servers)
{
invoke-sqlcmd -Server $srv -Database master -Query $SQLQuery
}

/Edit]

Send me the results for as many systems as you can, preferably production systems - either as a comment on this post, in email, or in a spreadsheet in email.

I'll collect all the results and explain what this means, how to drill in further, and what you can do about it in a week or two.

Thanks!

It's all very well having whizz-bang 3rd-party performance monitoring and troubleshooting tools, but sometimes you have to get deeper into what's going on with SQL Server than any of these tools can go. Or you have to call Customer Support or Premier Support so *they* can dive in deeper.

Typically you or they are going to make use of four DMVs that give increasingly advanced information about what's going on for use in performance troubleshooting:

A few weeks ago I kicked off a survey to find out whether you've heard of or used these DMVs - see here for the survey.

In this post I'm going to present the survey results and explain a bit about these DMVs, focusing the most attention on latches and spinlocks. This started out as a small post but grew into a 10-page, 2500 word article :-)

Here are the results (in each of the Other values, a few people asked what DMVs are - see Dynamic Management Views and Functions in BOL):

 

Other values:

  • 12 x Yes, more than once or twice but not routinely.
  • 2 x Only because of your wait statistics post.

The survey results are not surprising, especially amongst readers of my blog.

Wait statistics are the bread-and-butter of performance tuning. SQL Server is keeping track of what resources threads need to wait for, and how long they need to wait. By analyzing which resources (and combinations of resource) are being waited for the most, you can get an idea of where to start digging in further. An example might be that if most of the waits are PAGEIOLATCH_SH waits, and this wasn't the case in your wait stats baseline, you might look at the I/O subsystem performance using the sys.dm_io_virtual_file_stats DMV (which I blogged about here).

Last December I wrote a long blog post introducing wait statistics, showing how to use the sys.dm_os_wait_stats DMV, giving links to resources, and explaining the most common ones that people see in the field based on data from more than 1800 SQL Servers - see Wait statistics, or please tell me where it hurts.

 

Other values:

  • 6 x Yes, More than once or twice but not routinely.

I'm surprised that these results don't tie in more closely with the results for sys.dm_os_wait_stats, but they're reasonably close.

The sys.dm_os_waiting_tasks DMV shows you what is currently being waited on by everything running on the system.

I created a scenario with 200 clients creating and dropping small temp tables to creat tempdb latch contention. Using the DMV, I can see what's being waited on (I've removed the columns describing blocking from the output in this case to make it fit on screen)

SELECT * FROM sys.dm_os_waiting_tasks;
GO

waiting_task_address session_id exec_context_id wait_duration_ms     wait_type          resource_address   resource_description
-------------------- ---------- --------------- -------------------- ------------------ ------------------ --------------------
0x000000000044E508   2          0               4091305              XE_DISPATCHER_WAIT NULL               NULL
0x000000000044E988   9          0               4121252              FSAGENT            NULL               NULL
0x000000000044EBC8   20         0               4121251              BROKER_TRANSMITTER NULL               NULL
0x000000000044F4C8   63         0               53                   PAGELATCH_EX       0x0000000088FFEED8 2:1:1139
0x000000000044EE08   64         0               8                    PAGELATCH_UP       0x0000000080FE8AD8 2:1:1
0x000000000044F288   87         0               0                    PAGELATCH_UP       0x0000000080FE8AD8 2:1:1
0x000000000044F048   91         0               53                   PAGELATCH_EX       0x0000000088FFEED8 2:1:1139
0x000000000044F948   92         0               61                   PAGELATCH_EX       0x0000000088FFEED8 2:1:1139
0x000000000044F708   101        0               10                   PAGELATCH_EX       0x0000000080FEEC58 2:1:120
0x000000000044FDC8   103        0               37                   PAGELATCH_UP       0x0000000080FE8AD8 2:1:1
0x000000008744E088   118        0               11                   PAGELATCH_EX       0x0000000080FEEC58 2:1:120
0x000000008744E2C8   121        0               66                   PAGELATCH_UP       0x0000000080FE8AD8 2:1:1
0x000000008744E508   122        0               33                   PAGELATCH_EX       0x0000000080FEEC58 2:1:120
0x000000008744E748   155        0               32                   PAGELATCH_EX       0x0000000080FEEC58 2:1:120
0x000000008744E988   158        0               27                   PAGELATCH_EX       0x0000000080FEEC58 2:1:120
0x000000008744EBC8   163        0               34                   PAGELATCH_EX       0x0000000080FEEC58 2:1:120
0x000000008744EE08   168        0               66                   PAGELATCH_UP       0x0000000080FE8AD8 2:1:1
0x000000008744F048   179        0               26                   PAGELATCH_UP       0x0000000080FE8AD8 2:1:1
.
.

As you can see, the classic tempdb latch contention is showing - page ID (2:1:1) - the first PFS page in tempdb. (See here for more on tempdb contention, and here for more on PFS pages.)

Joe Sack (blog|twitter) created a script that pulls in data from a bunch of other DMVs to make the sys.dm_os_waiting_tasks output more useful, which I've modified into the following:

SELECT
    owt.session_id,
    owt.wait_duration_ms,
    owt.wait_type,
    owt.blocking_session_id,
    owt.resource_description,
    es.program_name,
    est.text,
    est.dbid,
    eqp.query_plan,
    es.cpu_time,
    es.memory_usage
FROM sys.dm_os_waiting_tasks owt
INNER JOIN sys.dm_exec_sessions es
    ON owt.session_id = es.session_id
INNER JOIN sys.dm_exec_requests er
    ON es.session_id = er.session_id
OUTER APPLY sys.dm_exec_sql_text (er.sql_handle) est
OUTER APPLY sys.dm_exec_query_plan (er.plan_handle) eqp
WHERE es.is_user_process = 1;
GO

There's too much information in the output to usefully show in this post, but I can see the actual T-SQL statements being run (in this case a lot of DROP TABLE and SELECT * INTO of global temp tables) and the XML query plans. Clicking on one of them in 2008 SSMS gives me the actual plan - very cool:

 

This means I can see from the sys.dm_os_wait_stats DMV what the prevalent resource waits are, then use the sys.dm_os_waiting_tasks DMV to see which queries are waiting for those resources - and then dive in deeper to see why.

 

As I suspected most readers have heard of latches, but 75% of respondents haven't used the DMV or have only used it once or twice.

A latch is a lightweight syncronization mechanism that protects access to read and change in-memory structures - for instance, 8KB page buffers in the buffer pool (latch type = BUFFER), or the data structure that represents a database's data and log files (latch type = FGCB_ADD_REMOVE). A latch is only held for the duration of the operation, unlike a lock which may be held until a transaction commits. One example of locks and latches - imagine a table where an update query has caused lock escalation so that a table X lock is held on the table. As the query continues updating more records in the table, it won't acquire any more locks, but any data and index pages that are updated in memory must be EX (exclusive) latched before the update can occur. The latch acts as the synchronization mechanism to prevent two threads updating the page at the same time, or a thread reading the page while another is in the middle of updating it. Another example is if you run a select query using NOLOCK - although the query will not acquire SH (share) locks at any level in the table, the threads must acquire SH latches on pages before they can be read - to synchronize with possible concurrent updaters.

If a thread requires a latch it will be moved from RUNNING to SUSPENDED and put on the waiter list to await notification that the latch has been acquired in the requested mode.

Latch waits correspond to LATCH_XX waits in the output from the sys.dm_os_wait_stats DMV, so digging into to which latches are accounting for most waits can show where a bottleneck is on the system.

You can reset latch wait statistics just like regular wait statistics using:

DBCC SQLPERF('sys.dm_os_latch_stats', CLEAR);
GO

An example set of output from the DMV is:

SELECT * FROM sys.dm_os_latch_stats
WHERE [waiting_requests_count] > 0
ORDER BY [wait_time_ms] DESC;
GO

latch_class                       waiting_requests_count wait_time_ms         max_wait_time_ms
--------------------------------- ---------------------- -------------------- --------------------
BUFFER                            113181121              466697044            1233
ACCESS_METHODS_HOBT_COUNT         66676                  331193               577
ACCESS_METHODS_HOBT_VIRTUAL_ROOT  15018                  68865                125
LOG_MANAGER                       130                    5610                 234
FGCB_ADD_REMOVE                   299                    5073                 32
TRACE_CONTROLLER                  1                      0                    0
VERSIONING_TRANSACTION_LIST       1                      0                    0
ACCESS_METHODS_HOBT_FACTORY       64                     0                    0

You can also aggregate them in the same way as I described in my big wait stats blog post, using code below:

WITH Latches AS
    (SELECT
        latch_class,
        wait_time_ms / 1000.0 AS WaitS,
        waiting_requests_count AS WaitCount,
        100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
    FROM sys.dm_os_latch_stats
    WHERE latch_class NOT IN ('BUFFER')
)
SELECT
    W1.latch_class AS LatchClass,
    CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
    W1.WaitCount AS WaitCount,
    CAST (W1.Percentage AS DECIMAL(14, 2)) AS Percentage,
    CAST ((W1.WaitS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgWait_S
FROM Latches AS W1
INNER JOIN Latches AS W2
    ON W2.RowNum <= W1.RowNum
GROUP BY W1.RowNum, W1.latch_class, W1.WaitS, W1.WaitCount, W1.Percentage
HAVING SUM (W2.Percentage) - W1.Percentage < 95; -- percentage threshold
GO

Here's an example after clearing the latch stats and running the tempdb contention test (I described above) for 30 seconds:

LatchClass                        Wait_S  WaitCount  Percentage  AvgWait_S
--------------------------------- ------- ---------- ----------- ----------
ACCESS_METHODS_HOBT_COUNT         7.92    1471       75.41       0.0054
ACCESS_METHODS_HOBT_VIRTUAL_ROOT  1.38    393        13.15       0.0035
LOG_MANAGER                       1.20    12         11.44       0.1001

Most of the latch classes are undocumented, but I'll be shedding light on them as I blog more about latch stats.

 

Other values:

  • 1 x Just learned of this at SQLskills training!
  • 1 x Learned about it, after you tweeted about it on 3/23 to a co-worker.

This is really cool that more than 40% of respondents have never heard of this DMV or spinlocks - education time!

A spinlock is another lightweight synchronization mechanism used to control access to certain data structures in the engine - used when the time that the spinlock will be held is very short. They are different from latches because a thread waiting for a latch will yield the scheduler and go onto the waiter list whereas a thread waiting to acquire a spinlock will burn some CPU "spinning" to see if it can get the CPU before giving up and backing off (yielding the scheduler) before trying again. This may allow another thread to execute that is holding the spinlock and eventually release it, allowing the system to proceed (yes, a thread can yield the scheduler and move to the waiter list while holding a spinlock!) because another thread can then acquire the spinlock.

It is perfectly normal for spinlock collisions and spins to occur on a busy system, but sometimes a bottleneck can occur on systems with larger numbers of CPUs where collisions are more likely - this can drain CPU resources while many threads are spinning trying to acquire the spinlock.

Running the DMV shows you the list of all spinlocks on the system (all of which are undocumented - but I'll be working on that going forward) - here is some partial output:

SELECT * FROM sys.dm_os_spinlock_stats
ORDER BY [spins] DESC;
GO

name               collisions           spins                spins_per_collision sleep_time           backoffs
------------------ -------------------- -------------------- ------------------- -------------------- -----------
LOCK_HASH          3629624              4402099957           1212.825            561                  817819
SOS_CACHESTORE     11992297             3352117666           279.5226            6093                 71948
OPT_IDX_MISS_KEY   63329610             2036811058           32.16207            15830                180845
SOS_TLIST          9769744              574740437            58.82861            320                  3619
SOS_SCHEDULER      2137875              107392996            50.23352            557                  7753
MUTEX              676406               83493780             123.4374            340                  3300
LOGCACHE_ACCESS    2210697              83204315             37.63714            0                    252366
SOS_RW             264489               70122059             265.1228            14                   799
XDESMGR            346005               61031449             176.3889            216                  3788
SOS_SUSPEND_QUEUE  3397384              53752545             15.82174            120                  2384
OPT_IDX_STATS      129814               19800952             152.5332            27                   356
BACKUP_CTX         29730                16784471             564.5635            192                  1645
LOCK_RESOURCE_ID   17558                4363116              248.4973            20                   375
SOS_TASK           206597               1898063              9.187273            16                   171
XVB_LIST           266112               882691               3.316991            1                    63
.
.

[

On 2005 you'll need to use DBCC SQLPERF ('spinlockstats') and use INSERT/EXEC to get the results into a table. Eric Humphrey (blog|twitter) put the code together:

-- Baseline
IF OBJECT_ID('tempdb..#TempSpinlockStats1') IS NOT NULL
DROP TABLE #TempSpinlockStats1;
GO
CREATE TABLE #TempSpinlockStats1(
[name] NVARCHAR(30) NOT NULL,
[collisions] BIGINT NOT NULL,
[spins] BIGINT NOT NULL,
[spins_per_collision] FLOAT NOT NULL,
[sleep_time] BIGINT NOT NULL,
[backoffs] BIGINT NOT NULL
)
INSERT INTO #TempSpinlockStats1
EXEC ('DBCC SQLPERF(''spinlockstats'')')
GO

]

The LOCK_HASH spinlock, for instance, is used by the lock manager to look at one of the hash buckets holding lock resource hashes to tell whether lock can be granted or not.

The sleep_time is an aggregate of how much time is spent sleeping between spin cycles when a backoff occurs.

I've put together some code that will allow you to see what spinlock activity occurs between two times. The code captures the output from the DMV into two temp tables, with whatever time period you want in between (to allow you to run a command), and then shows the difference between the two sets of data. I'll show an example of running DBCC CHECKDB.

-- Baseline
DROP TABLE ##TempSpinlockStats1;
GO
SELECT * INTO ##TempSpinlockStats1 FROM sys.dm_os_spinlock_stats
WHERE [collisions] > 0
ORDER BY [name];
GO

-- Do some stuff
DBCC CHECKDB (SalesDB) WITH NO_INFOMSGS;
GO

-- Capture updated stats
DROP TABLE ##TempSpinlockStats2;
GO
SELECT * INTO ##TempSpinlockStats2 FROM sys.dm_os_spinlock_stats
WHERE [collisions] > 0
ORDER BY [name];
GO

-- Diff them
SELECT
    '***' AS [New],
    ts2.[name] AS [Spinlock],
    ts2.[collisions] AS [DiffCollisions],
    ts2.[spins] AS [DiffSpins],
    ts2.[spins_per_collision] AS [SpinsPerCollision],
    ts2.[sleep_time] AS [DiffSleepTime],
    ts2.[backoffs] AS [DiffBackoffs]
FROM ##TempSpinlockStats2 ts2
LEFT OUTER JOIN ##TempSpinlockStats1 ts1
    ON ts2.[name] = ts1.[name]
WHERE ts1.[name] IS NULL
UNION
SELECT
    '' AS [New],
    ts2.[name] AS [Spinlock],
    ts2.[collisions] - ts1.[collisions] AS [DiffCollisions],
    ts2.[spins] - ts1.[spins] AS [DiffSpins],
    CASE (ts2.[spins] - ts1.[spins]) WHEN 0 THEN 0
        ELSE (ts2.[spins] - ts1.[spins]) / -- > 0 spins = > 0 collisions
        (ts2.[collisions] - ts1.[collisions]) END AS [SpinsPerCollision],
    ts2.[sleep_time] - ts1.[sleep_time] AS [DiffSleepTime],
    ts2.[backoffs] - ts1.[backoffs] AS [DiffBackoffs]
    --, ts2.*
FROM ##TempSpinlockStats2 ts2
LEFT OUTER JOIN ##TempSpinlockStats1 ts1
    ON ts2.[name] = ts1.[name]
WHERE ts1.[name] IS NOT NULL
    AND ts2.[collisions] - ts1.[collisions] > 0
ORDER BY [New] DESC, [Spinlock] ASC;
GO

And the output is as follows:

New  Spinlock           DiffCollisions       DiffSpins            SpinsPerCollision DiffSleepTime        DiffBackoffs
---- ------------------ -------------------- -------------------- ----------------- -------------------- ------------
***  DBCC_CHECK         4                    24                   6                 0                    0
***  DIAG_MANAGER       1                    0                    0                 0                    0
***  FCB_REPLICA_SYNC   10                   16147                1614.7            0                    0
***  LSID               9                    0                    0                 0                    0
***  QUERYEXEC          5                    0                    0                 0                    0
***  X_PACKET_LIST      2                    0                    0                 0                    0
***  X_PORT             8                    227                  28.375            0                    0
***  XACT_WORKSPACE     11                   0                    0                 0                    0
***  XID_ARRAY          7                    0                    0                 0                    0
     BUF_FREE_LIST      2                    0                    0                 0                    0
     HOBT_HASH          2                    1                    0                 0                    0
     LOCK_HASH          3                    1818                 606               0                    0
     SOS_RW             2                    500                  250               0                    0
     SOS_SCHEDULER      5                    6                    1                 0                    0
     SOS_SUSPEND_QUEUE  11                   39                   3                 0                    0
     SOS_TASK           1                    0                    0                 0                    0
     SOS_TLIST          1                    0                    0                 0                    0

You can see here which spinlocks were acquired to run the DBCC CHECKDB commands - those marked with *** did not appear in the 'before' set of spinlock stats. More on all of these in future posts.

You can also investigate spinlocks using extended events - again, more on that in future.

Summary

It's possible to dive really deeply into what's happening inside SQL Server using these four DMVs. Spinlocks in particular - what each means, what each controls and what contention on each them means (plus what you can do about it) - involve a lot of knowledge of what's going on inside the engine, and I'm planning to spread some of that knowledge going forward - there's an enormous amount of information to be published about latches and spinlocks.

Hope you'll join me to learn about these - let me know if you find this stuff interesting and useful.

Thanks!

I just had to figure out how to do this so I figured a quick blog post is in order to save other people time in future.

If you ever need to use windbg to debug a SQL Server crash dump, or you want to capture call stacks using extended events (e.g. when debugging excessive spinlock contention), you'll need the correct symbol file (sqlservr.pdb) to go with sqlservr.exe of the instance you're interested in.

KB article 311503 has details on how to do this in general, but they're a little cryptic.

Assuming you've already installed the Windows debugging tools (32-bit or 64-bit), you'll have a tool called symchk in the folder where windbg resides (for my laptop, "C:\Program Files\Debugging Tools for Windows (x64)") - this is what will pull down sqlservr.pdb.

You need to point symchk at the executable you're interested in, tell it where to put the sqlservr.pdb, and tell it the location of the Microsoft symbol server.

For me, the following worked:

C:\>cd C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Binn

C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Binn>"C:\Program
 Files\Debugging Tools for Windows (x64)\symchk" sqlservr.exe /s SRV*c:\symbols*
http://msdl.microsoft.com/download/symbols

SYMCHK: FAILED files = 0
SYMCHK: PASSED + IGNORED files = 1

And then I copied the sqlservr.pdb into the \Binn directory.

Enjoy!

Over the last few months I've been lecturing at classes and conferences about getting SQL Server's view of the I/O subsystem and what latencies it is experiencing, so time for a blog post to help everyone else.

Most SQL Server's today are I/O bound - that's a generally agreed-on statement by DBAs and consultants in the field. This means that the major factor in server performance is its ability to perform I/O operations quickly. If the I/O subsystem cannot keep up with the demand being placed on it, then the SQL Server workload will suffer performance problems.

Now, saying that, one trap that many people fall into is equating increased I/O subsystem latencies with poor I/O subsystem performance. This is often not the case at all. It's usually the case that the I/O subsystem performs fine when the designed-for I/O workload is happening, but becomes the performance bottleneck when the I/O workload increases past the I/O subsystem design point. The I/O workload increase is what's causing the problem, not the I/O subsystem - if you design an I/O subsystem to support 1000 IOPS (I/O operations per second) and SQL Server is trying to push 2000 IOPS, performance is going to suffer.

If you find that I/O latency has increased, look to a change in SQL Server behavior before blaming the I/O subsystem. For instance:

  • Query plan changes from out-of-date statistics, code changes, implicit conversions, poor indexing that cause table scans rather than index seeks
  • Additional indexes being created that cause increased index maintenance workload and logging
  • Access pattern/key changes that cause page splits and hence extra page reads/writes and logging
  • Adding change data capture or change tracking that causes extra writes and logging
  • Enabling snapshot isolation that causes tempdb I/Os, plus potentially page splits
  • Decreased server memory leading to a smaller buffer pool and increased lazy writer and read activity

and a whole host of other reasons can lead to an increased I/O workload where it's not the I/O subsystem's fault.

On the other hand, however, it may very well be the I/O subsystem that has an issue if the SQL Server workload is the same. A SAN administrator may have decided to give some of the space on one of the SQL Server LUNs to another server, which can lead to an overload.

Using performance counters, you can see at the physical disk level what the latencies are (looking at the Avg. Disk sec/Read and Avg. Disk sec/Write counters) but if you have your databases spread across a few LUNs, that doesn't help you pinpointing which database files are experiencing latency issues and driving the most I/Os to the I/O subsystem.

This is where sys.dm_io_virtual_file_stats comes in. It was introduced in SQL Server 2005 as a beefed-up replacement for fn_virtualfilestats and shows you how many I/Os have occured, with latencies for all files.

You can give it a database ID and a file ID, but I found it most useful to look at all the files on the server and order by one of the statistics.

Here's the default output (with some column name changed slightly to make things fit nicely):

SELECT * FROM sys.dm_io_virtual_file_stats (NULL, NULL);
GO

database_id file_id sample_ms reads bytes_read io_stall_read_ms writes bytes_written io_stall_write_ms io_stall size_on_disk_bytes file_handle
----------- ------- --------- ----- ---------- ---------------- ------ ------------- ----------------- -------- ------------------ ------------------
1           1       14433009  52    3227648    58               10     90112         12                70       515964928          0x000000000000013C
1           2       14433009  55    1241088    36               28     122880        115               151      12648448           0x0000000000000474
2           1       14433009  22    1261568    22               4      147456        1                 23       8388608            0x000000000000065C
2           2       14433009  6     385024     2                7      154112        3                 5        524288             0x0000000000000660
3           1       14433009  31    1859584    18               1      8192          0                 18       1310720            0x0000000000000524
3           2       14433009  6     385024     5                3      16384         1                 6        524288             0x000000000000056C
4           1       14433009  139   8175616    369              13     106496        45                414      260767744          0x0000000000000958
4           2       14433009  136   712704     485              7      53248         25                510      104595456          0x00000000000009F0
5           1       14433009  20    1138688    204              0      0             0                 204      419430400          0x0000000000000980
5           2       14433009  13    411136     466              1      2048          13                479      104857600          0x0000000000000A44
6           1       14433009  15    811008     81               1      8192          8                 89       1310720            0x0000000000000954
6           2       14433009  7     386560     89               4      11264         80                169      516096             0x00000000000009EC
7           1       14433009  15    811008     80               1      8192          9                 89       1310720            0x0000000000000970
7           2       14433009  7     386560     130              3      9216          17                147      516096             0x00000000000009B4
8           1       14433009  16    876544     37               1      8192          103               140      136577024          0x0000000000000988
8           2       14433009  9     229888     11               4      10752         59                70       5242880            0x00000000000009E4
.
.

This isn't that useful because a) I don't have database IDs and file paths memorized, and b) it gives aggregate latencies (io_stall_read_ms and io_stall_write_ms).

What I usually do is use the script below - part of my standard set of scripts I use when doign a server health audit for a client. It's based in part on code from by good friend Jimmy May (blog|twitter), with a bunch of tweaks. It allows me to filter on read or write latencies and it joins with sys.master_files to get database names and file paths.

SELECT
    --virtual file latency
    ReadLatency = CASE WHEN num_of_reads = 0
        THEN 0 ELSE (io_stall_read_ms / num_of_reads) END,
    WriteLatency = CASE WHEN num_of_writes = 0
        THEN 0 ELSE (io_stall_write_ms / num_of_writes) END,
    Latency = CASE WHEN (num_of_reads = 0 AND num_of_writes = 0)
        THEN 0 ELSE (io_stall / (num_of_reads + num_of_writes)) END,
    --avg bytes per IOP
    AvgBPerRead = CASE WHEN num_of_reads = 0
        THEN 0 ELSE (num_of_bytes_read / num_of_reads) END,
    AvgBPerWrite = CASE WHEN io_stall_write_ms = 0
        THEN 0 ELSE (num_of_bytes_written / num_of_writes) END,
    AvgBPerTransfer = CASE WHEN (num_of_reads = 0 AND num_of_writes = 0)
        THEN 0 ELSE ((num_of_bytes_read + num_of_bytes_written) /
            (num_of_reads + num_of_writes)) END,    
    LEFT (mf.physical_name, 2) AS Drive,
    DB_NAME (vfs.database_id) AS DB,
    --vfs.*,
    mf.physical_name
FROM sys.dm_io_virtual_file_stats (NULL,NULL) AS vfs
JOIN sys.master_files AS mf
    ON vfs.database_id = mf.database_id
    AND vfs.file_id = mf.file_id
--WHERE vfs.file_id = 2 -- log files
-- ORDER BY Latency DESC
-- ORDER BY ReadLatency DESC
ORDER BY WriteLatency DESC;
GO

The (partial) output is below:

ReadLatency WriteLatency Latency AvgBPerRead AvgBPerWrite AvgBPerTransfer Drive DB                      physical_name
----------- ------------ ------- ----------- ------------ --------------- ----- ----------------------- ------------------------------------------
2           103          8       54784       8192         52043           D:    HotSpot                 D:\SQLskills\HotSpot_data.mdf
7           48           13      63061       5632         54857           C:    DemoCorruptSystemTables C:\SQLskills\corrupt_log.LDF
32          41           33      43861       2048         39680           D:    DemoSystemIndex3        D:\SQLskills\DemoSystemIndex3_LOG.ldf
27          40           28      43861       1024         39577           D:    DemoSystemIndex5        D:\SQLskills\DemoSystemIndex5_LOG.ldf
23          39           24      43861       7168         40192           D:    DemoSystemIndex4        D:\SQLskills\DemoSystemIndex4_LOG.ldf
7           38           10      43861       4608         39936           D:    DemoSystemIndex1        D:\SQLskills\DemoSystemIndex1_LOG.ldf
11          25           12      43861       7168         40192           C:    FSWaits                 C:\SQLskills\FSWaits_log.ldf
12          20           13      55222       6656         49152           C:    newdatabase             C:\SQLskills\newdatabase_log.LDF
12          20           15      55222       2816         36165           C:    LockResDemo             C:\SQLskills\LockResDemo_log.LDF
14          17           15      45933       2816         30254           C:    TempdbTest              C:\SQLskills\TempdbTest_log.LDF
1           14           5       25543       2688         18510           C:    HotSpot                 C:\SQLskills\HotSpot_log.ldf
40          14           39      31625       7680         29915           D:    RecompileTest           D:\SQLskills\RecompileTest_log.ldf
9           13           10      51858       7680         46336           D:    DemoFatalCorruption1    D:\SQLskills\DemoFatalCorruption1_log.ldf
35          13           34      31625       2048         29513           D:    GUIDTest                D:\SQLskills\GUIDTest_log.ldf
12          13           12      16861       1024         15872           C:    FNDBLogTest             C:\SQLskills\FNDBLogTest_log.LDF
.

This is much more useful as it allows me to quickly see where the read and write hot spots are and then drill into a database to see what's going on, and if nothing out of the ordinary, ask the SAN admin to move those hot spot files to dedicated and/or faster storage.

One question I'm sure you're going to ask is "what is an acceptable read or write latency?" and the answer is a big "it depends!". It depends on the I/O subsystem and how it's been configured. The key is producing a performance baseline for when things are running acceptably well and then seeing where the results from the DMV deviate from your baseline. On well-configured storage which isn't being overloaded I'd expect to see single-digit ms for either read or write latency, but this will vary based on the rotation speed and technology of the drives (SCSI vs SATA vs SSD etc).

There are a bunch of other ways this DMV can be used in conjunction with other DMVs and I'll be blogging about these over the next few months.

Enjoy!

A couple of week ago I kicked off a survey about common causes of performance problems - see here for the survey.

Firstly I asked what was the root cause for the most recent performance problems you looked at - here are the results:

 

Secondly I asked what you think the overall most common cause of performance problems is - here are the results:

 

The results are not surprising, but it's good to have empirical confirmation from the community - T-SQL code and then poor indexing strategy are the top two in both surveys, accounting for roughly 50% of all performance issues.

No matter how powerful the hardware and how much memory or IOPS capacity you have, if you write crappy code that means SQL Server has to use really non-optimal query plans then performance is going to be poor. It's really quite a low proportion of the time that hardware (including the I/O subsystem) is the root-cause of a performance problem - usually it's just a symptom of a deeper problem.

One of the most common causes of poorly-performing T-SQL code and indexing strategy is that developers don't write and test the code in an environment that simulates a true production environment and workload. What works for 2 connections with 100 rows in a table isn't necessarily going to work for 1000 concurrent connections and millions of rows in a table. Another cause is making code changes directly in production without testing their impact.

Nothing much else to say here as the theme has been hit on by many people in the past, but now we have some numbers.

Thanks to all those who responded!

In this week's survey I've got four mini-surveys for you, all to do with in-depth performance analysis.

I'd like to know whether you've ever used each of four DMVs that look progressively more deeply into the workings of the database engine.

I'll report on the results in a week or two and start blogging about using these - let me know if that would be interesting to you.

Thanks!

This survey follows on from the survey results I just blogged about, and is particularly apt given we're here in Dallas this week teaching our Immersion Event on Performance Tuning.

I'm interested to know the root causes of your last few performance problems. I know that we all debug multiple performance problems every week, so have a think about the issues you debugged over the last week or month and vote a few times (or as many times as you want) to reflect the eventual root cause that was determined.

In the second survey, I'd like your opinion on what the most common cause of SQL Server performance problems is. Slightly different way of looking at things, and it'll be interesting to see how the results differ from the first survey.

Please vote as many times as you want on the first survey - free survey website only allows ten choices and no multi-select - but only vote once on the second survey.

(Damn - forgot to tick the box for 'Allow other' as a response - feel free to leave a comment with another response!) 

I'll report on the results in a couple of weeks.

Thanks!

Categories:
Performance Tuning | Surveys

About a month ago I kicked off a survey asking what you look for when first analyzing a plan for a poorly performing query. You can see the original survey here.

Here are the survey results:

 

The "Other" values are as follows:

  • 13 x "Most expensive as percentage of total cost of batch"
  • 7 x "It depends on what I am trying to fix/improve"
  • 4 x "Fat arrows"
  • 3 x "Operator cost"

First of all - what would my answer be? I tend to look for scans and the most expensive operators, plus big fat arrows that stand out.

There's a clear winner here in opinion amongst my readers - by far the most common first consideration when analyzing a query plan is to look for scans - I'm not surprised by that.

In this editorial I'm not going to write about how to go about analyzing query plans - there's so much to cover that I'd end up writing a book-length blog post. Instead I'd like to point you at a bunch of resources that will help you with query tuning, including a few books.

I use SQL Sentry's fabulous free Plan Explorer tool to do this as I can switch back/forth between showing costs by CPU, by IO or combined. I can also see cumulative costs for a branch of the query plan, rather than having to look through a bunch of operators for expensive ones - this is invaluable to me day-to-day when working on client systems. It has a host of other features that SSMS does not have and I know lots of people who use it - why wouldn't you?

If you've never look at a query plan before, I strongly recommend Grant Fritchey's free e-book Dissecting SQL Server Execution Plans. I'd also recommend Grant's regular book SQL Server 2008 Performance Tuning Distilled.

As far as the choices I gave in the survey, each of them can be a major problem but aren't necessarily a problem at all. It's like the misconception that if you have wait stats, then you must have a performance problem. See my long blog post on wait stats for more info.

Here is some specific info that will help you understand the ramifications of each problem in a plan that I listed in the survey:

  • Different row counts or executions between the estimated and actual plans usually indicates that either statistics are out of date leading to a bad plan, or maybe a plan was cached for a stored proc based on atypical parameters. Checkout the following links:
  • Sorts can sometimes be caused by unneeded ORDER BY statement or by missing nonclustered indexes. A good, quick overview of sorts is in Showplan Operator of the Week - SORT (Fabiano Amorim)
  • Joins are very often misunderstood and there are a whole host of reasons why one join may be chosen over another. Best thing to do here is point you at Craig Freedman's excellent series that I link to here.
  • There are all kinds of reasons why scans appear in query plans, not all bad at all. One of the bad reasons is that there are insufficient nonclustered indexes which mean the table has to be scanned to retrieve the data - see the books and search on Google/Bing for an enormous amount written about this. Another reason is T-SQL code written so that an index *cannot* be used because an expression does not isolate the table column correctly, forcing a scan. I blogged about this here.
  • Yet another reason why scans occurs is code written/schema designed so that an operation called an implicit conversion occurs, where the table column must be converted to a different data type before a comparison can take place - forcing a scan, as each value has to be converted to the comparison type. Jonathan blogged about this here.
  • Key/RID lookups are where the query plan uses a nonclustered index to find a value, but the nonclustered index is not covering, so the other result set columns must be retrieved from the table row. When a Key Lookup occurs, the retrieval is from a clustered index and when a RID Lookup occurs the retrieval is from a heap. Both of these are undesirable because of the extra processing required. The fix for this is simply to ensure the correct indexes exist to support the queries, and that the queries are pulling in the correct columns - many times I've seen client code that pulls in columns that aren't necessary.
  • The most expensive operators in a plan are usually a good place to look to see where gains can be made by changing code, statistics, or indexing. I blogged a short post on using SET STATISTICS to watch IO and CPU costs here.
  • Parallelism is again often misunderstood and not necessarily a bad thing. For a large report query in a data warehouse, parallelism is good. For frequent queries in a busy OLTP system, parallelism can be a problem. The best presentation I've seen by far on parallelism is by Craig Freedman - check it out here.

And that's what I have for you on query plans - lots of information for you to go exploring and learning. Analyzing a query plan is a skill that all DBAs and Developers should have IMHO and all the information is out there for you to try it out on your systems.

Enjoy!

I just about beat my fists against my laptop this afternoon while updating a couple of demo scripts for our Performance class in Dallas this week. Well, not quite because I really like my new laptop. But you get the idea.

I'm working over a 3G USB wireless connection and every time I want to save a file, the performance was dire. Opening the Save File As dialog box was fine, but clicking on any directory would take about 10 seconds, with the little disk icon flashing. I thought I remembered a setting that stopped it, but was thinking of the old problem with 2005 SSMS where it tried to connect up to Microsoft whenever it was opened. After searching around on Google for 5 minutes fruitlessly and then posting a question on #sqlhelp on twitter, Kimberly guessed the answer.

SSMS tries to connect to each disconnected network drive in turn, for every click in the Save File As wizard.

How utterly crappy is that? Why on earth would it try to connect to disconnected network drives? They're *DISCONNECTED*!!!!

For the record, Windows Explorer and every other tool I tried does the right thing. Except SSMS. I dug around more extensively on Google and found that other people have experienced the same thing - and only in SSMS.

The only fix is to remove the network drives altogether. Sigh. Sanity restored.

I submitted Connect Item 651176 - please vote for it.

Categories:
Performance Tuning | Tools

In this week's survey, I want to know how you've got tempdb configured compared to the number of processor cores SQL Server thinks it has. I'll correlate, analyze, and present the results like the log file survey I did last year where I got results for 17000 databases.

The code I'd like you to run is as follows:

SELECT os.Cores, df.Files
FROM
   (SELECT COUNT(*) AS Cores FROM sys.dm_os_schedulers WHERE status = 'VISIBLE ONLINE') AS os,
   (SELECT COUNT(*) AS Files FROM tempdb.sys.database_files WHERE type_desc = 'ROWS') AS df;
GO

It works on 2005 onwards (thanks to Robert Davis (blog|twitter) for doing a quick 2005 test for me).

I'll take as many results as you can be bothered gathering for me - anywhere from 1 to 100s!

To help you out, Eric Humphrey (blog|twitter), who's in our class this week, put together a PowerShell script to help you run the code against multiple servers. See here - very cool!

Either add your results as a comment or drop me a mail with the results. Feel free to add in any explanation you want as to the reasoning for your setup. Everything will remain anonymous as always.

I'll report on the results in a week or two.

[Edit: the survey is now closed.]

Thanks!

Last week I kicked off a survey about network latencies and database mirroring. See here for the original post.

Here are the results of the survey:

 

I was really interested to see whether the proportion of people doing asynchronous mirroring became higher as the network latency increased. Although this isn't a statistically valid sampe by any means, it does show that the answer is no. However, we're missing some data that would help explain what we see here: how long are the average transactions and is there a response time SLA?

The latency between the principal and the mirror is a big deal for synchronous mirroring, because a transaction on the principal cannot be acknowledged to the user/app as having committed until all of it's log records have been written to the mirror database's log drive.

NOTE: the transaction does NOT have to be replayed/committed on the mirror, simply the log records have to be durable to guarantee the transaction is durable if the principal has a disaster. This is a very common misconception.

If the average transaction length is quite long, say 20 seconds, then the addition of another 500ms of latency when the commit is issued is not a big deal. But if the average transaction length is 100ms then an extra 500ms is a *very* big deal. This is when using asynchronous mirroring starts being considered - as transactions on the principal do NOT have to wait, but at the expense of potential data loss if the principal experiences a disaster. However, if there is no response time SLA, then the company may be fine with the extra delay with synchronous mirroring to guaranteezero data loss (as long as the mirror session stays SYNCHRONIZED).

As always, the choice of HA and DR technology comes down to analyzing requirements and limitations before choosing a technology. I go into this in more detail in the whitepaper I wrote in 2009 for Microsoft: High Availability with SQL Server 2008. There is also an excellent whitepaper on database mirroring: Database Mirroring Best Practices and Performance Considerations.

If you're one of the people who responded that you don't know your network latency even though you're using mirroring, check out the post I wrote last week: Importance of monitoring a database mirroring session.

Thanks!

In my survey for this week I'm interested in what you look for first when analyzing a query plan.

I'll report on the results around mid-February.

Thanks!

PS Post comments are disabled to avoid skewing the results.

The January edition of TechNet Magazine is available on the web now and has the latest installment of my regular SQL Q&A column.

This month's topics are:

  • Diagnosing I/O subsystem bottlenecks
  • Capacity planning for transaction logs
  • Why there are no non-logged operations in user databases

Check it out at http://technet.microsoft.com/en-us/magazine/gg552991.aspx.

Yesterday I presented my lecture Index Fragmentation: The Hidden Menace for the PASS Performance Virtual Chapter. The recording of the session, plus a PDF of my slides, and the demo scripts are now available from the PASS website.

Check it out here (look for January 18, 2011).

Enjoy!

I've been doing a lot of performance tuning work over the last couple of months and this weekend found something that's very pervasive out in the wild. Kimberly was helping me optimize a gnarly query plan and spotted something in the code I hadn't noticed that was causing an index scan instead of an index seek. Even though the query plan was using a covering nonclustered index, the way the code was written was causing the index to be scanned.

I'll set up a test to show you what I mean.

First, create an example sales tracking table and populate it (took 3m45s on my laptop):

-- Create example table
CREATE TABLE BigTableLotsOfColumns (
    SalesID  BIGINT IDENTITY,
    SalesDate DATETIME,
    Descr  VARCHAR (100),
    CustomerID INT,
    ProductID INT,
    ModifyDate DATETIME DEFAULT NULL,
    PayDate  DATETIME DEFAULT NULL,
    Quantity INT,
    Price  DECIMAL (6,2),
    Discount INT DEFAULT NULL,
    WarehouseID INT,
    PickerID INT,
    ShipperID INT);
GO 

-- Populate table
SET NOCOUNT ON;

DECLARE @Date DATETIME;
DECLARE @Loop INT;

SET @Loop = 1;
SET @Date = '01/01/2010';

-- Insert 1500 sales per day
WHILE @Loop < 547600
BEGIN
    INSERT INTO BigTableLotsOfColumns (
        SalesDate, Descr, CustomerID, ProductID, Quantity, Price, WarehouseID, PickerID, ShipperID)
    VALUES (
        @Date,
        'A nice order from someone',
        CONVERT (INT, (RAND () * 1000)),
        CONVERT (INT, (RAND () * 1000)),
        CONVERT (INT, (RAND () * 10)),
        ROUND (RAND () * 1000, 2),
        1,
        CONVERT (INT, (RAND () * 4)),
        CONVERT (INT, (RAND () * 4)));
  
    IF @Loop % 1500 = 0
        SET @Date = @Date + 1;
  
    SET @Loop = @Loop + 1;
END
GO

-- And a clustered index
CREATE CLUSTERED INDEX IX_BigTableLotsOfColumns_Clustered ON BigTableLotsOfColumns (SalesID);
GO

Now I've got a simple stored procedure to find the total sales amount for any particular date.

CREATE PROCEDURE TotalSalesForDate (
 @TargetDate DATETIME)
AS 
    SELECT SUM (Quantity * Price)
    FROM BigTableLotsOfColumns
    WHERE DATEDIFF (DAY, SalesDate, @TargetDate) = 0
    OPTION (MAXDOP 1);
GO

The idea is that the stored procedure will pick up all sales for a particular date, no matter what time of that day the sale occured. I've got OPTION (MAXDOP 1) to simulate a system that's been tuned for OLTP queries (and to make the query plans fit better in my blog post :-)

If I run the stored procedure for, say, December 26th 2010, the query plan it uses is as follows:

EXEC TotalSalesForDate '12/26/2010';
GO

 

And it uses the following CPU and IO (you can get these by using SET STATISTICS IO ON and SET STATISTICS TIME ON, but be careful - the can generate lots of output for big procs):

Table 'BigTableLotsOfColumns'. Scan count 1, logical reads 7229, physical reads 0, ...

SQL Server Execution Times:
   CPU time = 125 ms,  elapsed time = 186 ms.

I ran the proc twice - once to get it to compile, and the second time to factor out the compile time.

Pretty obvious that I can speed this up by creating a covering nonclustered index, and in fact you can see the query processor telling me that with the missing index suggestion. I create the covering nonclustered index:

CREATE NONCLUSTERED INDEX IX_BigTableLotsOfColumns_Date ON BigTableLotsOfColumns (SalesDate)
INCLUDE (Quantity, Price);

Now when I run the proc, I get a better query plan:

 

And the following CPU and IO:

Table 'BigTableLotsOfColumns'. Scan count 1, logical reads 2113, physical reads 0, ...

SQL Server Execution Times:
   CPU time = 78 ms,  elapsed time = 115 ms.

Much better, as I'd expect. But look at how the nonclustered index is being used: it's being scanned when instead there should be a seek. It's still doing more than 2000 logical reads because it's scanning the whole index.

The reason for this is that the DATEDIFF in the proc is forcing every SalesDate column value to be evaluated, hence the scan. If I know that the proc only ever is called with a date without a time portion, I can do some rearranging of the logic.

The updated proc is below. Notice that there's no reason to scan any more as the WHERE clause doesn't require every column value to passed through the DATEDIFF:

CREATE PROCEDURE TotalSalesForDate (
    @TargetDate DATETIME)
AS 
    SELECT SUM (Quantity * Price)
    FROM BigTableLotsOfColumns
    WHERE SalesDate >= @TargetDate AND SalesDate < DATEADD (DAY, 1, @TargetDate)
    OPTION (MAXDOP 1);
GO

And when I run it I get the best plan:

 

And the following CPU and IO:

Table 'BigTableLotsOfColumns'. Scan count 1, logical reads 10, physical reads 0, read-ahead reads 0, ...

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 64 ms.

Wow! 200x less logical IOs and not even enough CPU time to register as 1ms.

Summary: although you may have nonclustered indexes that are being used, make sure that they're being used for seeks as you expect. For small queries that are executed hundreds of times a second, the difference between seeking and scanning can be huge in overall performance.

 

At the end of December I showed you how to discover if power saving is enabled on your server, which can lead to variable and often degraded performance. I also included a survey to let me know what you found after running the free CPU-Z tool on your servers. See here for the original post.

I want to do a quick post to show you the results of the survey.

 

The five 'other' results were:

  • "As part of my server build scripts I disable power management"
  • "I thought I had power savings on and it was. However, I had never confirmed. (Personal Computer)"
  • "I thought we had power saving OFF and it was OFF. Good deal."
  • "Not sure if power saving is on -- "Core Speed" is 75% of rated speed"
  • "Told IT that they need to change the bios settings and reboot. have they ? have they hell!"

As you can see, almost 40% of the people who tried the tool AND took the time to fill in the survey reported that they discovered power saving was erroneously enabled.

Have you tested your system yet?

Thanks!

(Yes, I know I haven't editorialized the last survey on What's in a Job Title - I will in the New Year.)

Over the last couple of weeks I've signed up a bunch of new customers for maintenance/ops audits and general perf work and I've had all of them check whether power saving mode is enabled for their CPUs. And I've been astounded by the results - as have some of the customers who thought they were running at full speed all the time.

In a nutshell, to save power the CPUs can essentially down-shift to a lower clock speed and then automatically speed up again when the load increases. But at what point do they speed up? That varies - you might have to really push things to make them speed up.

And do you really want your workload running slower until the CPU usage hits the magic speed-up threshold? Most likely the answer is no.

So the point of this blog post - I'd like you to go check the CPU speeds on your systems and see whether they're in power-saving mode without you knowing. I think you'll be surprised. And then I want you to tell me what you saw.

You can check the CPU speed using the free CPU-Z tool - it's awesome. It's just asks the CPU it's speed - no load is put on the system.

The picture below shows a CPU-Z snapshot from a client system. The CPUs are spec'd at 2.4GHz but as you can see, they're only running at 1.2GHz - half speed.

 

Go download the tool and run it on your system - what did you see?

If you found power savings on, you can read about how to turn it off in the BIOS and Windows Server in a blog post by Glenn Berry ("Windows Power Plans and CPU Performance").

I'll report on the results in a few weeks. [Edit: The survey results can be found HERE.]

You can another read example of this problem in a blog post by Brent Ozar ("SQL Server on Power-Saving CPUs? Not So Fast.") about helping Stack Overflow upgrade.

Have fun!

PS There's also a KB article that discusses how Windows Server 2008 R2 sets the Balanced power plan by default! See here.

Categories:
Performance Tuning | Surveys | Tools

(Be sure to join our community to get our monthly newsletter with exclusive content, advance notice of classes with discount codes, and other SQL Server goodies!)  

How many times have you walked up to a SQL Server that has a performance problem and wondered where to start looking? 

One of the most under-utilized performance troubleshooting methodologies in the SQL Server world is one called "waits and queues" (also known simply as "wait stats"). The basic premise is that SQL Server is permanently tracking why execution threads have to wait. You can ask SQL Server for this information and then use the results to narrow down where to start digging to unearth the cause of performance issues. The "waits" are what SQL Server tracks. The "queues" are the resources that the threads are waiting for. There are a myriad of waits in the system and they all indicate different resources being waited for. For example, a PAGEIOLATCH_EX wait means a thread is waiting for a data page to be read into the buffer pool from disk. A LCK_M_X wait means a thread is waiting to be granted an exclusive lock on something.

The great thing about all of this is the SQL Server *knows* where the performance issues are, and you just need to ask it.... and then interpret what it tells you, which can be a little tricky.

Now - where people sometimes get hung up is trying to track down every last wait and figure out what's causing it. Waits *always* occur. It's the way SQL Server's scheduling system works.

A thread is using the CPU (called RUNNING) until it needs to wait for a resource. It then moves to an unordered list of threads that are SUSPENDED. In the meantime, the next thread on the FIFO (first-in-first-out) queue of threads waiting for the CPU (called being RUNNABLE) is given the CPU and becomes RUNNING. If a thread on the SUSPENDED list is notified that it's resource is available, it becomes RUNNABLE and is put on the bottom of the RUNNABLE queue. Threads continue this clockwise movement from RUNNING to SUSPENDED to RUNNABLE to RUNNING again until the task is completed. You can see processes in these states using the sys.dm_exec_requests DMV. 

SQL Server keeps track of the time that elapses between leaving the RUNNING state and becoming RUNNING again (called the "wait time") and the time spent on the RUNNABLE queue (called the "signal wait time" - i.e. how long does the thread need to wait for the CPU after being signaled that its resource is available). We need to work out the time spent waiting on the SUSPENDED list (called the "resource wait time") by subtracting the signal wait time from the overall wait time.

Tom Davidson of Microsoft wrote a fantastic and very accessible whitepaper on "waits and queues" which I encourage you to read: Performance Tuning Using Waits and Queues. My good friend Joe Sack (blog|twitter) who runs the MCM program also published an excellent slide deck on the subject that you can download here. And of course Books Online has a section on the sys.dm_os_wait_stats DMV that gives info on some of the newer wait types. PSS is putting together a repository of info on all the wait types but not much progress has been made. And there's a free video+demo recording as part of the MCM online training we recorded for Microsoft - see here.

You can ask SQL Server for the cumulative wait statistics using the sys.dm_os_wait_stats DMV, and many people prefer to wrap the DMV call in some aggregation code. I use code based on a query that I got from fellow-MVP Glenn Berry (blog|twitter) and then modified quite a bit. See below for the version updated to take account of the results discussed below:

WITH Waits AS
    (SELECT
        wait_type,
        wait_time_ms / 1000.0 AS WaitS,
        (wait_time_ms - signal_wait_time_ms) / 1000.0 AS ResourceS,
        signal_wait_time_ms / 1000.0 AS SignalS,
        waiting_tasks_count AS WaitCount,
        100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (
        'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',
        'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',
        'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',
        'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE',
        'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',
        'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
        'BROKER_RECEIVE_WAITFOR', 'ONDEMAND_TASK_QUEUE', 'DBMIRROR_EVENTS_QUEUE',
        'DBMIRRORING_CMD', 'BROKER_TRANSMITTER', 'SQLTRACE_WAIT_ENTRIES',
        'SLEEP_BPOOL_FLUSH', 'SQLTRACE_LOCK')
    )
SELECT
    W1.wait_type AS WaitType,
    CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
    CAST (W1.ResourceS AS DECIMAL(14, 2)) AS Resource_S,
    CAST (W1.SignalS AS DECIMAL(14, 2)) AS Signal_S,
    W1.WaitCount AS WaitCount,
    CAST (W1.Percentage AS DECIMAL(4, 2)) AS Percentage,
    CAST ((W1.WaitS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgWait_S,
    CAST ((W1.ResourceS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgRes_S,
    CAST ((W1.SignalS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgSig_S
FROM Waits AS W1
    INNER JOIN Waits AS W2 ON W2.RowNum <= W1.RowNum
GROUP BY W1.RowNum, W1.wait_type, W1.WaitS, W1.ResourceS, W1.SignalS, W1.WaitCount, W1.Percentage
HAVING SUM (W2.Percentage) - W1.Percentage < 95; -- percentage threshold
GO

This will show the waits grouped together as a percentage of all waits on the system, in decreasing order. The waits to be concerned about (potentially) are those at the top of the list as this represents the majority of where SQL Server is spending it's time waiting. You can see that a bunch of waits are being filtered out of consideration - as I said above, waits happen all the time and these are the benign ones we can usually ignore.

You can also reset the aggregated statistics using this code:

DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR);

And of course you can very easily come up with a way to persist the results every few hours or every day and do some time-series analysis to figure out trends or automatically spot problems as they start to happen. You can also use Performance Dashboard to see these graphically in 2005 and Data Collector in 2008. On SQL Server 2000 you can use DBCC SQLPERF ('waitstats').

Once you get the results, you then start figuring out how to interpret them and where to go looking. The whitepaper I referenced above has a ton of good info on most of the wait types (except those added in 2008). There are various ways you can dig in deeper to this information that I'll go into in later posts.

I'm going to start blogging about wait stats analysis, either as standalone posts or as part of other things - and I've already done so in the last post (at time of writing this) in my benchmarking series.

For now, I want to report on the results of the wait stats survey I posted a couple of months back. I asked people to run the code above and let me know the results. I received results for a whopping 1823 SQL Servers out there - thank you!

Here's a graphical view of the results:

 

I'm not surprised at all by the top four results as I see these over and over on client systems.

For the remainder of this post, I'm going to list all the top wait types reported by survey respondents, in descending order, and give a few words about what they might mean if they are the most prevalent wait on your system. The list format shows the number of systems with that wait type as the most prevalent, and then the wait type.

  • 505: CXPACKET
    • This is commonly where a query is parallelized and the parallel threads are not given equal amounts of work to do, or one thread blocks. One thread may have a lot more to do than the others, and so the whole query is blocked while the long-running thread completes. If this is combined with a high number of PAGEIOLATCH_XX waits, it could be large parallel table scans going on because of incorrect non-clustered indexes, or out-of-date statistics causing a bad query plan. If neither of these are the issue, you might want to try setting MAXDOP to 4, 2, or 1 for the offending queries (or possibly the whole instance). Make sure that if you have a NUMA system that you try setting MAXDOP to the number of cores in a single NUMA node first to see if that helps the problem. You also need to consider the MAXDOP effect on a mixed-load system.
  • 304: PAGEIOLATCH_XX
    • This is where SQL Server is waiting for a data page to be read from disk into memory. It commonly indicates a bottleneck at the IO subsystem level, but could also indicate buffer pool pressure (i.e. not enough memory for the workload).
  • 275: ASYNC_NETWORK_IO
    • This is commonly where SQL Server is waiting for a client to finish consuming data. It could be that the client has asked for a very large amount of data or just that it's consuming it reeeeeally slowly because of poor programming.
  • 112: WRITELOG
    • This is the log management system waiting for a log flush to disk. It commonly indicates a problem with the IO subsystem where the log is, but on very high-volume systems it could also be caused by waiting for the LOGCACHE_ACCESS spinlock (which you can't do anything about). To be sure it's the IO subsystem, use the DMV sys.dm_io_virtual_file_stats to examine the IO latency for the log file.
  • 109: BROKER_RECEIVE_WAITFOR
    • This is just Service Broker waiting around for new messages to receive. I would add this to the list of waits to filter out and re-run the wait stats query. 
  • 086: MSQL_XP
    • This is SQL Server waiting for an extended stored-proc to finish. This could indicate a problem in your XP code. 
  • 074: OLEDB
    • As its name suggests, this is a wait for something communicating using OLEDB - e.g. a linked server. It could be that the linked server has a performance issue.
  • 054: BACKUPIO
    • This shows up when you're backing up directly to tape, which is slooooow. I'd be tempted to filter this out.
  • 041: LCK_M_XX
    • This is simply the thread waiting for a lock to be granted and indicates blocking problems. These could be caused by unwanted lock escalation or bad programming, but could also be from IOs taking a long time causing locks to be held for longer than usual. Look at the resource associated with the lock using the DMV sys.dm_os_waiting_tasks.
  • 032: ONDEMAND_TASK_QUEUE
    • This is normal and is part of the background task system (e.g. deferred drop, ghost cleanup).  I would add this to the list of waits to filter out and re-run the wait stats query.
  • 031: BACKUPBUFFER
    • This shows up when you're backing up directly to tape, which is slooooow. I'd be tempted to filter this out.
  • 027: IO_COMPLETION
    • This is SQL Server waiting for IOs to complete and is a sure indication of IO subsystem problems.
  • 024: SOS_SCHEDULER_YIELD
    • If this is a very high percentage of all waits (had to say, but Joe suggests 80%) then this is likely indicative of CPU pressure.
  • 022: DBMIRROR_EVENTS_QUEUE
  • 022: DBMIRRORING_CMD
    •  These two are database mirroring just sitting around waiting for something to do. I would add these to the list of waits to filter out and re-run the wait stats query.
  • 018: PAGELATCH_XX
    • This is contention for access to in-memory copies of pages. The most well-known cases of these are the PFS, SGAM, and GAM contention that can occur in tempdb under certain workloads. To find out what page the contention is on, you'll need to use the DMV sys.dm_os_waiting_tasks to figure out what page the latch is for. For tempdb issues, my friend Robert Davis (blog|twitter) has a good post showing how to do this. Another common cause I've seen is an index hot-spot with concurrent inserts into an index with an identity value key.
  • 016: LATCH_XX
    • This is contention for some non-page structure inside SQL Server - so not related to IO or data at all. These can be hard to figure out and you're going to be using the DMV sys.dm_os_latch_stats. More on this in future posts.
  • 013: PREEMPTIVE_OS_PIPEOPS
    • This is SQL Server switching to pre-emptive scheduling mode to call out to Windows for something. These were added for 2008 and haven't been documented yet (anywhere) so I don't know exactly what it means.
  • 013: THREADPOOL
    • This says that there aren't enough worker threads on the system to satisfy demand. You might consider raising the max worker threads setting.
  • 009: BROKER_TRANSMITTER
    • This is just Service Broker waiting around for new messages to send. I would add this to the list of waits to filter out and re-run the wait stats query. 
  • 006: SQLTRACE_WAIT_ENTRIES
    • Part of SQL Trace. I would add this to the list of waits to filter out and re-run the wait stats query.
  • 005: DBMIRROR_DBM_MUTEX
    •  This one is undocumented and is contention for the send buffer that database mirroring shares between all the mirroring sessions on a server. It could indicate that you've got too many mirroring sessions.
  • 005: RESOURCE_SEMAPHORE
    • This is queries waiting for execution memory (the memory used to process the query operators - like a sort). This could be memory pressure or a very high concurrent workload. 
  • 003: PREEMPTIVE_OS_AUTHENTICATIONOPS
  • 003: PREEMPTIVE_OS_GENERICOPS
    • These are SQL Server switching to pre-emptive scheduling mode to call out to Windows for something. These were added for 2008 and haven't been documented yet (anywhere) so I don't know exactly what it means.
  • 003: SLEEP_BPOOL_FLUSH
    • This is normal to see and indicates that checkpoint is throttling itself to avoid overloading the IO subsystem. I would add this to the list of waits to filter out and re-run the wait stats query.
  • 002: MSQL_DQ
    • This is SQL Server waiting for a distributed query to finish. This could indicate a problem with the distributed query, or it could just be normal.
  • 002: RESOURCE_SEMAPHORE_QUERY_COMPILE
    • When there are too many concurrent query compilations going on, SQL Server will throttle them. I don't remember the threshold, but this can indicate excessive recompilation, or maybe single-use plans.
  • 001: DAC_INIT
    • I've never seen this one before and BOL says it's because the dedicated admin connection is initializing. I can't see how this is the most common wait on someone's system... 
  • 001: MSSEARCH
    • This is normal to see for full-text operations.  If this is the highest wait, it could mean your system is spending most of its time doing full-text queries. You might want to consider adding this to the filter list. 
  • 001: PREEMPTIVE_OS_FILEOPS
  • 001: PREEMPTIVE_OS_LIBRARYOPS
  • 001: PREEMPTIVE_OS_LOOKUPACCOUNTSID
  • 001: PREEMPTIVE_OS_QUERYREGISTRY
    • These are SQL Server switching to pre-emptive scheduling mode to call out to Windows for something. These were added for 2008 and haven't been documented yet (anywhere) so I don't know exactly what it means. 
  • 001: SQLTRACE_LOCK
    • Part of SQL Trace. I would add this to the list of waits to filter out and re-run the wait stats query.

I hope you found this interesting! Let me know if there's anything in particular you're interested in seeing or just that you're following along and enjoying the ride!

I've recently been creating some content about wait stats analysis and I think it would be really interesting to see what kind of waits people are seeing out there in the wild. Hopefully it'll also introduce a bunch of people to the waits-and-queues performance troubleshooting methodology and how it can be really useful to them.

[Edit: comments are closed on this post as I have all the info I need - thanks!]

Here's what I'd like you to do:

  • Run the T-SQL query from the end of this post. It's completely benign and only reports on statistics that SQL Server is already gathering. It will not cause any perf issues on your production systems.
  • Look at the output - it'll be something like (from random anonymous client system):

 

  • Look at the top WaitType and fill in the simple survey below. If your top wait isn't listed, click on Other and cut-and-paste the top WaitType into the box provided.

The free survey system only allows a single vote per IP address - if you have any other results, send them in email (mailto:paul@SQLskills.com?Subject=Wait stats) or attach a comment below. 

In a week or two I'll report on the results. It would be great to get a few hundred responses. 

Thanks!

T-SQL code to run (works on 2005 onwards) - based on code by Glenn Berry:

WITH Waits AS
    (SELECT
        wait_type,
        wait_time_ms / 1000.0 AS WaitS,
        (wait_time_ms - signal_wait_time_ms) / 1000.0 AS ResourceS,
        signal_wait_time_ms / 1000.0 AS SignalS,
        waiting_tasks_count AS WaitCount,
        100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (
        'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',
        'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',
        'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',
        'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE',
        'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',
        'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP')
     )
SELECT
     W1.wait_type AS WaitType,
     CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
     CAST (W1.ResourceS AS DECIMAL(14, 2)) AS Resource_S,
     CAST (W1.SignalS AS DECIMAL(14, 2)) AS Signal_S,
     W1.WaitCount AS WaitCount,
     CAST (W1.Percentage AS DECIMAL(4, 2)) AS Percentage
FROM Waits AS W1
INNER JOIN Waits AS W2
     ON W2.RowNum <= W1.RowNum
GROUP BY W1.RowNum, W1.wait_type, W1.WaitS, W1.ResourceS, W1.SignalS, W1.WaitCount, W1.Percentage
HAVING SUM (W2.Percentage) - W1.Percentage < 95; -- percentage threshold
GO

Last time I posted about SSDs I presented the findings from sequential inserts with a variety of configurations and basically concluded that SSDs do not provide a substantial gain over SCSI storage (that is not overloaded) - see this blog post for more details.

You can see my benchmarking hardware setup here, with the addition of the Fusion-io ioDrive Duo 640GB drives that Fusion-io were nice enough to lend me. (For the next set of benchmarks I've just upgraded to 16GB of memory and added the second 640GB Fusion-io Duo, for a total of 1.2TB... watch this space!).

In this set of tests I wanted to see how the SSDs behaved for random reads and writes. To do this my test harness does the following:

  • Formats the SSDs in one of three ways:
    • Fusion-io basic format (each 320GB drive has 300GB capacity)
    • Fusion-io improved write performance format (each 320GB drive has only 210GB capacity, 70% of normal)
    • Fusion-io maximum write performance format (each 320GB drive has only 151GB capacity, 50% of normal)
  • The SSD format is performed using Fusion-io's ioManager tool, with their latest publicly-released driver (1.2.7.1).
  • Creates 1, 2, 4, 8, or 16 data files, with the file sizes calculated to fill the SSDs
  • My table structure is:

CREATE TABLE MyBigTable (
    c1 UNIQUEIDENTIFIER ROWGUIDCOL DEFAULT NEWID (),
    c2 DATETIME DEFAULT GETDATE (),
    c3 CHAR (111) DEFAULT 'a',
    c4 INT DEFAULT 1,
    c5 INT DEFAULT 2,
    c6 BIGINT DEFAULT 42); 
GO

CREATE CLUSTERED INDEX MyBigTable_cl ON MyBigTable (c1);
GO

  • I have 16 connections each inserting 2 million records into the table (with the loop code running server-side)

Now before anyone complains, yes, this is a clustered index on a random GUID. It's the easiest way to generate random reads and writes, and is a very common design pattern out in the field (even though it performs poorly) - for my purposes it's perfect.

I tested each of the five data file layouts on the following configurations (all using 1MB partition offsets, 64k NTFS allocation unit size, 128k RAID stripe size - where applicable):

  • Data round-robin between two RAID-10 SCSI (each with 4 x 300GB 15k and one server NIC), log on RAID-10 SATA (8 x 1TB 7.2k)
  • Data on two 320GB SSDs in RAID-0 (each of the 3 ways of formatting), log on RAID-10 SATA (8 x 1TB 7.2k)
  • Log and data on two 320GB SSDs in RAID-0 (each of the 3 ways of formatting)
  • Log and data on two 320GB SSDs in RAID-1 (each of the 3 ways of formatting)  
  • Log and data on single 320GB SSD (each of the 3 ways of formatting)
  • Log and data on separate 320GB SSDs (each of the 3 ways of formatting)
  • Log and data round-robin between two 320GB SSDs (each of the 3 ways of formatting)

That's a total of 19 configurations, with 5 data file layouts in each configuration - making 95 separate configurations. I ran each test 5 times and then took an average of the results - so altogether I ran 475 tests, for a cumulative test time of just less than 250 thousand seconds (2.9 days) at the end of July.

The test harness takes care of all of this except reformatting the drives, and also captures the wait stats for each test, making note of the most prevalent waits that make up the top 95% of all waits during the test. The wait stats will be presented in the following format:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
PAGEIOLATCH_EX                       26833.45       26822.85          10.60      867558      75.88
WRITELOG                              7097.77        6647.26         450.51     3221475       20.0

The columns are:

  • WaitType - kind of obvious
  • Wait_S - cumulative wait time in seconds, from a thread being RUNNING, going through SUSPENDED, back to RUNNABLE and then RUNNING again
  • Resource_S - cumulative wait time in seconds while a thread was SUSPENDED (called the resource wait time)
  • Signal_S - cumulative wait time in seconds while a thread was RUNNABLE (i.e. after being signalled that the resource wait has ended and waiting on the runnable queue to get the CPU again - called the signal wait time)
  • WaitCount - number of waits of this type during the test
  • Percentage - percentage of all waits during the test that had this type

On to the results...

Data on SCSI RAID-10, log on SATA RAID-10

 

Once again this shows what I've shown a few times before - on SCSI having multiple data files on the two RAID arrays gives a performance boost. The two-file case is going from a single RAID array to two RAID arrays - bound to get a performance gain - and it gets a 35% performance boost - 6 times the boost I got from messing around with multiple files for the sequential inserts case last time (see here and here for details).

The best performance I could get from having data on the SCSI arrays was 1595 seconds.

Representative wait stats for a run of this test - one file:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
PAGEIOLATCH_EX                       28993.08       28984.66           8.42      647973      75.53
WRITELOG                              7333.36        6883.82         449.54     3223809      19.10
SLEEP_BPOOL_FLUSH                     1786.18        1781.94           4.24     1147596       4.65

Representative wait stats for a run of this test - two files:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
PAGEIOLATCH_EX                       15306.22       15296.67           9.55      679281      63.87
WRITELOG                              7762.25        7270.79         491.47     3215377      32.39

Representative wait stats for a run of this test - four files:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
PAGEIOLATCH_EX                       26833.45       26822.85          10.60      867558      75.88
WRITELOG                              7097.77        6647.26         450.51     3221475      20.07

Representative wait stats for a run of this test - eight files:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
PAGEIOLATCH_EX                       27556.79       27547.83           8.96      674319      75.09
WRITELOG                              7545.40        7118.93         426.47     3221841      20.56

Representative wait stats for a run of this test - sixteen files:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
PAGEIOLATCH_EX                       37716.72       37705.87          10.85      792189      80.13
WRITELOG                              7150.01        6699.36         450.64     3228609      15.19

These numbers are showing the majority of waits are for data pages to be read into the buffer pool - random reads, and the next most prevalent wait is for log block flushes to complete. The more PAGEIOLATCH_EX waits there are, the worse the performance is.

Data on 640GB RAID-0 SSDs, log on SATA RAID-10

 

Don't let this graph fool you - the top and bottom of the scale are only 30 seconds apart. Basically moving the data files from the SCSI arrays to the RAID-0 SSD got around a 3-5x performance gain, no matter how the SSDs are formatted.

Representative wait stats for a run of this test - one file:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
WRITELOG                              8459.65        7789.91         669.73     3207448      94.48
PAGEIOLATCH_EX                         440.27         392.51          47.77      828420       4.92

Representative wait stats for a run of this test - two, four, eight, or sixteen files:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
WRITELOG                              7957.35        7356.01         601.34     3206855      95.75

The log is the obvious bottleneck in this configuration. 

Data and log on 640GB RAID-0 SSDs

 

And again - high and low values are only 25 seconds apart. Moving log off to the same SSD gave a further 45%-ish improvement across the board, with little difference according to how the SSDs were formatted.

Representative wait stats for a run of this test - any number of files:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
WRITELOG                              2955.69        2184.99         770.69     3203957      89.24
PAGEIOLATCH_EX                         330.11         288.89          41.23      653147       9.97

The percentages fluctuate up and down a few percent depending on write format and number of files, with the maximum write performance format tending to have a slightly higher percentage of WRITELOG waits than the other two formats.

Note that moving the log to the SSD as well as the data files drastically cuts down the number of WRITELOG waits - what we'd expect.

Data and log on single 320GB SSD

 

The performance numbers for having everything on a single 320GB SSD are only a tiny amount slower than those for two 320GB SSDs - which is what I'd expect.

Representative wait stats for a run of this test - one file with basic format or improved write performance format:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
WRITELOG                              2911.22        2121.05         790.17     3204459      81.44
PAGEIOLATCH_EX                         602.11         546.56          55.55      758271      16.84

And for one file with maximum write performance format:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
WRITELOG                              3363.11        2523.63         839.48     3204110      87.54
PAGEIOLATCH_EX                         428.68         406.77          21.92      412081      11.16


You can see that the higher amount of PAGEIOLATCH_EX waits leads to lower overall performance. This makes sense to me.

Data and log on two 320GB RAID-1 SSDs

 

Now, I have an issue with people using SSDs in RAID-0 because it's a single point of failure. In an environment that's going all out on high-availability, if I was using SSDs for performance, depending on the criticality of the data I'd want to at least double-up to RAID-1. For all the various configurations, moving from a single 320GB SSD to two of them in RAID-1 resulted in no more than a 10-15% drop in performance and it's still 3-5x faster than the SCSI setup.

Here's a representative set of wait stats for the entire set of tests:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
WRITELOG                              3949.44        3031.14         918.30     3204694      85.68
PAGEIOLATCH_EX                         608.62         555.98          52.65      692934      13.20

In general the RAID-1 configuration had more waits of both types than the single drive configuration.

Data and log on separate 320GB SSDs

 

Splitting the data and log make for a 5-20% improvement over having everything on a single 320GB SSD.

The wait stats for these configurations show the same trends that we've seen so far - slightly slower performance = slightly more PAGEIOLATCH_EX waits.

Data and log round-robin between separate 320GB SSDs

 

This confused me - the single file case is exactly the same configuration as the test case above, but the results (for each test being run 5 time and then averaged) were almost 10% faster for the first two formats. No significant differences for the other configurations.

The wait stats for these configurations show the same trends that we've seen so far - slightly slower performance = slightly more PAGEIOLATCH_EX waits.

Best-case performance for each number of data files

 

 

 

Well big surprise - the SSDs outperform the SCSI storage for all these tests. The improvement factor varied by the number of data files:

  • 1: SSD was 7.25x faster than SCSI
  • 2: SSD was 4.74x faster than SCSI
  • 4: SSD was 6.81x faster than SCSI
  • 8: SSD was 7.64x faster than SCSI
  • 16: SSD was 9.03x faster than SCSI

The configuration of 4 data files on one SSD and the log on the other SSD, with basic format for both, was the best overall performer, beating the best SCSI configuration (2 data files) by a factor of 4.96.

Summary

Reminder: this test was 32 million inserts with no reads or updates (i.e. no random IO). It is very important to consider the limited scenario being tested and to draw appropriate conclusions.

My conclusions are as follows:

  1. For a random read+write workload, the method of formatting the Fusion-io drives doesn't make much difference. I'd go for the basic format to get the higher capacity, but I'd always to a representative load test to make sure.
  2. For a random read+write workload, the SSDs give at least a 5x performance gain over iSCSI storage
  3. Once again, having multiple data files outperforms having a single data file in most configurations
  4. I can easily correlate IO-subsystem related wait stats to the varying performance of the various configurations

Compared to the sequential insert workload that I benchmarked in the previous set of tests, the random read+write workload makes it worth investigating the investment of moving to SSDs.

Just like last time, these results confirm what I'd heard anecdotally - random operations are the sweet-spot for SSDs.

Brent's playing with the server over the next 4 weeks so I won't be doing any more benchmarking until mid-September at least.

Hope these results are interesting to you!

Back at the start of July I kicked off a survey around your plans for SSDs (see here) and now I present the results to you. There's not much to editorialize here, but the numbers are interesting to see.

 

The "other" answers were (verbatim):

  • 3 x 'have bought and am trying them out'
  • 3 x 'not sure if we need them or not'
  • 2 x 'all production servers are hosted'
  • 1 x 'bought them, tried them..not good enough yet for tempdb'
  • 1 x 'Have some, want more, could you really every have enough?'
  • 1 x 'We get every penny from or spinning media, and have no need for SSD'

The results reflect what I've been hearing when teaching classes and talking to customers/conference attendees over the last six months. People are becoming more interested in SSDs but there's still a lot of wariness about them and of course the whole money issue of being able to buy them. I'm also not surprised (given the general readership demographics of this blog) by the number of people who've analyzed their IOPS requirements and concluded that they don't need SSDs to accomplish that.

 

The "other" answers were (verbatim):

  • 3 x 'not in the budget'
  • 1 x 'I plan to buy expensive drives and throw them at you, paul! love, conor'
  • 1 x 'I'm going to do the same thing Conor will do. Denny'
  • 1 x 'OLAP Scale Out'
  • 1 x 'Use them as cache'
  • 1 x 'Using in an EMC V-MAX SAN to dynamically move high workloads to SSD temporarily'

Ahem - thanks Conor and Denny :-)

Another unsurprising set of results that reflects what I've been hearing. One number I'd be interested in drilling deeper into is answer #3 - are people putting/planning to put tempdb on SSDs because that's what they've heard is the best thing to do, or because tempdb truly is the largest I/O bottleneck that can benefit the most from SSDs? That's a set of experiments I'd like to try out with my Fusion-io drives.

The final "other" answer is also interesting - I was talking to a couple of folks from EMC in Ireland about the V-MAX when we were there earlier this month. Very cool idea to migrate data up and down a set of devices with varying latencies (at the block level, not the file level) - I'd like to see more on how the technology copes with one-off operations like consistency checks or backups - do those IOs affect which layer a block resides in?

Anyway, hope you find these results interesting.

Thanks to all those who responded!

Over the last month we've been teaching in Europe and I haven't had much time to focus on benchmarking, but I've finally finished the first set of tests and analyzed the results.

You can see my benchmarking hardware setup here, with the addition of the Fusion-io ioDrive Duo 640GB drives that Fusion-io were nice enough to lend me.

In this set of tests I wanted to check three things for a sequential write-only workload (i.e. no reads or updates)

  • Best file layout with the hardware I have available
  • Best way to format the SSDs
  • Whether SSDs give a significant performance gain over SCSI storage

The Fusion-io SSDs can be formatted four ways:

  • Regular Windows format
  • Fusion-io's format
  • Fusion-io's improved write performance format
  • Fusion-io's maximum write performance format

I'm using one of the 640GB SSDs in my server, which presents itself as two 320GB drives that I can use individually or tie together in a RAID array. The actual capacity varies depending on how the drives are formatted:

  • With the Windows and normal Fusion-io format, each of the 320GB drives has 300GB capacity
  • With the improved write performance format, each of the 320GB drives has only 210GB capacity, 70% of normal
  • With the maximum write performance format, each of the 320GB drives has only 151GB capacity, 50% of normal

In my tests, I want to determine whether the loss in capacity is worth it in terms of a performance gain. The SSD format is performed using Fusion-io's ioManager tool, with their latest publicly-released driver (1.2.7.1).

My tests involve 16 connections to the server, running server-side code to insert 6.25GB each into a table with a clustered index, one row per page. The database is 160GB with a variety of file layouts:

  • 1 x 160GB file
  • 2 x 80GB files
  • 4 x 40GB files
  • 8 x 20GB files
  • 16 x 10GB files

These drop down to 128/64/32/etc when using a single 320GB drive with the maximum write capacity format. The log file is pre-created at 8GB and does not need to grow during the test.

I tested each of the five data file layouts on the following configurations (all using 1MB partition offsets, 64k NTFS allocation unit size, 128k RAID stripe size - where applicable):

  • Data on RAID-10 SCSI (8 x 300GB 15k), log on RAID-10 SATA (8 x 1TB 7.2k)
  • Data round-robin between two RAID-10 SCSI (each with 4 x 300GB 15k and one server NIC), log on RAID-10 SATA (8 x 1TB 7.2k)
  • Data on two 320GB SSDs in RAID-0 (each of the 4 ways of formatting), log on RAID-10 SATA (8 x 1TB 7.2k)
  • Log and data on two 320GB SSDs in RAID-0 (each of the 4 ways of formatting)
  • Log and data on single 320GB SSD (each of the 4 ways of formatting)
  • Log and data on separate 320GB SSDs (each of the 4 ways of formatting)
  • Log and data round-robin between two 320GB SSDs (each of the 4 ways of formatting)

That's a total of 22 configurations, with 5 data file layouts in each configuration - making 110 separate configurations. I ran each test 5 times and then took an average of the results - so altogether I ran 550 tests, for a cumulative test time of just less than 110 million seconds (12.7 days) over the last 4 weeks.

And yes, I do have a test harness that automates a lot of this so I only had to reconfigure things 22 times manually. And no, for these tests I didn't have wait stats being captured. I've upgraded the test harness and now it captures wait stats for each test - that'll come in my next post.

On to the results... bear in mind that these results are testing a 100GB sequential insert-only workload and are not using the full size of the disks involved!!!

Data on SCSI RAID-10, log on SATA RAID-10

 

I already blogged about these tests here last week. They prove that for this particular workload, multiple data files on the same RAID array does give a performance boost - albeit only 6%.

The best performance I could get from the SCSI/SATA configurations was completing the test in 1755 seconds.

Data and log on 640GB RAID-0 SSDs (Data on 640GB RAID-0 SSDs, log on SATA RAID-10)

 

The performance whether the log file was on SATA or on the SSD was almost identical, so I'm only including one graph, in the interests in making this post a little shorter.

These results clearly show that the SSDs have to be formatted correctly to get any performance out of them. The SSDs performed the same for all data file configurations until performance almost doubles when the number of data files hits 16. I tested 32 and 64 files and didn't get any further increase. My guess here is that I had enough files that when checkpoints or lazywrites occured, the behavior was as if I was doing a random-write workload rather than sequential-write workload.

The best performance I could get here was with 16 files and the maximum-write format when the test completed in 934 seconds, 1.88x faster than the best SCSI time. This is only 13 seconds slower than the normal format which gives 100% more capacity.

Data and log on single 320GB SSD

 

Here the performance truly sucked when the SSD wasn't formatted correctly. Once it was, the performance was roughly the same for 1, 2, or 4 files but degraded by almost 50% with normal formatting for 8 or 16 files. With improved-wait and maximum-write formatting, the performance was the same as for the 640GB RAID-0 SSD array, but the sharp performance increase with 16 files only happened with the maximum-write formatting.

Data and log on separate 320GB SSDs

 

No major difference here - same characteristics as before when formatted correctly, and the best performance coming from maximum-write formatting and 16 data files.

This configuration gave the best overall performance - 909 seconds - 1.93x the bext performance from the SCSI storage.

Data and log round-robin between separate 320GB SSDs

 

No major differences from the previous configuration.

Best-case performance for each number of data files

  

  

 

Clearly the SSDs outperform the SCSI storage for these tests, but not by very much. The improvement factor varied by the number of data files:

  • 1: SSD was 1.11x faster than SCSI
  • 2: SSD was 1.09x faster than SCSI
  • 4: SSD was 1.06x faster than SCSI
  • 8: SSD was 1.04x faster than SCSI
  • 16: SSD was 2.03x faster than SCSI

The configuration of 16 data files on one SSD and the log on the other SSD, with maximum-write format for both, was the best overall performer, beating the best SCSI configuration (8 data files) by a factor of 1.93.

Summary

Reminder: this test was 100GB of sequential inserts with no reads or updates (i.e. no random IO). It is very important to consider the limited scenario being tested and to draw appropriate conclusions

Several things are clear from these tests:

  1. The Fusion-io SSDs do not perform well unless they are formatted with Fusion-io's tool, which takes seconds and is very easy. I don't see this as a downside at all, and it makes sense to me.
  2. For sequential write-only IO workloads, the improved-write and maximum-write SSD formats do not produce a performance gain and so the loss in storage capacity (30% and 50% respectively) is not worth it.
  3. For sequential write-only IO workloads, the SSDs do not provide a substantial gain over SCSI storage (which is not overloaded).

All three of these results were things I'd heard anecdotally and experienced in ad-hoc tests, but now I have the empirical evidence to be able to state them publicly (and now so do you!).

These tests back-up the assertion I've heard over and over that sequential write-only IO workloads are not the best use-case for SSDs.

One very interesting other result came from these tests - moving to 16 data files changed the characteristics of the test to a more random write-only IO workload, and so the maximum-write format produced a massive performance boost - almost twice the performance of the SCSI storage!

The next set of tests is running right now - 64GB of inserts into a clustered index with a GUID key - random reads and writes in a big way. Early results show the SSDs are *hammering* the performance of the SCSI storage - more in a week or so!

Hope you find these results useful and thanks for reading!

Many times I'm asked whether having multiple data files can lead to an improvement in performance. The answer, as with all things SQL (except concerning auto-shrink) is a big, fat "it depends." It depends on what you're using the database for, and the layout of the files on the IO subsystem, and the IO subsystem capabilities. I've heard examples of "yes" and I've heard examples of "no."

Just for kicks, I put some of my test hardware to use to do some experimentation. (You can get to all my other Benchmarking posts using this link.)

My setup for this series of tests is:

  • Log file pre-sized to 8GB (to avoid log growth) on 8 x 1TB 7.2k SATA RAID-10 array, one iSCSI NIC, 128 KB stripe size
  • 160GB database, variously setup as (all in the PRIMARY filegroup): 
    • 1 x 160GB file
    • 2 x 80GB files
    • 4 x 40GB files
    • 8 x 20GB files
    • 16 x 10GB files
  •  16 connections inserting 100/16GB each, no other activity, all code executing on the server, no data transfer from clients

Each test was run 5 times and then the time-for-test calculated as the average of the 5 test runs, so the two tests together represent 50 test runs. Luckily I wrote a test harness that will tear down and setup the database automatically each time in the different configurations, so just double click a cmd file and then a day or so later I get an email saying the test has finished. Great when we're traveling!

Here are the test results:

As you can see, it's pretty clear that with both test setups, having more data files definitely does produce a performance improvement, but only up to a point.

Test 1: Data files on 8 x 300GB 15k SCSI RAID-10 array, two iSCSI NICs, 128KB stripe size

Test 2: Data files round-robin between two 4 x 300GB 15k SCSI RAID-10 array, one iSCSI NIC each, 128KB stripe size

In both cases, the performance increases up to eight data files, and then begins to decrease again with sixteen data files. The single data file case was bound to be slower on the SCSI array with fewer drives, and we see that in the results (left-most result in red).

In the best case, the eight-file case on two arrays was just over 6% faster than the single-file case on the single array. Hardly earth-shattering, but still a non-trivial gain.

Where's the gain coming from? I ran wait stats analysis for a few test variations - for example, between the eight data files test and the single data file test using two arrays, the cumulative wait stats were almost identical - 38/39% PAGELATCH_EX, 19/21% PAGELATCH_SH, 12/13% WRITELOG. The gain is mostly coming from the IO subsystem, but the SCSI arrays are still overloaded, as I showed in plenty of the previous benchmarking tests.

Now, this is a very contrived test, with a single operation in my workload - it's definitely NOT representative of a mixed-operation OLTP workload. However, I did see a gain from having multiple data files - and I believe I would have seen more gain had the SCSI array(s) not been pretty much maxed out already.

I've heard plenty of anecdotal evidence that adding a few more data files for user databases can lead to performance improvements, but your mileage is definitely going to vary. I'd be very interested to hear your observations in production as comments to this post (but please keep the comments constructive - don't give me a laundry-list of tests/settings you want me to try, or rant about real-life vs. lab tests).

Enjoy!

PS The next post on SSDs is forthcoming - just finishing up the (extensive) tests - and also the post on how you all have your log files configured, from the survey I did a while ago. Thanks for being patient!

In the previous post in the series I introduced SSDs to the mix and examined the relative performance of storing a transaction log on an 8-drive 7.2k SATA array versus a 640-GB SSD configured as a 320-GB RAID-1 array. The transaction log was NOT the I/O bottleneck in the system and so my results showed only a limited 2.5%-3.5% performance gain from substituting the RAID-1 SSD.

Several people pointed out that many people use such an SSD in a RAID-0 configuration instead of RAID-1, so I promised to try the same set of tests using the SSD as a 640-GB RAID-0 array.

Now, the merits of using a single SSD in a RAID-0 configuration are mixed. If used just on its own, there's no redundancy but the performance should be better, so I'd make sure my client understood the risk involved and made sure that for critical data some form of redundancy is added (e.g. synchronous database mirroring).

I re-ran the 100-GB table population tests with the transaction log on the RAID-0 SSD to see what extra performance gain could be had. To be honest, I didn't expect much, as I know the transaction log isn't the limiting factor.

Here are the graphs for the tests. On the top is the SSD configured for RAID-1, on the bottom is the SSD configured for RAID-0. Each test was performed 5 times and the average time used for the graph.

 

 

You can see that in the RAID-0 configuration, there's a higher performance gain - peaking at just over 6%.

The average perf gain for the RAID-1 SSD over the SATA array across all tests was 2.04%.

The average perf gain for the RAID-0 SSD over the SATA array across all tests was 4.25%.

Clearly the RAID-0 SSD outperformed the RAID-1 SSD (and they both outperformed the SATA array in all tests), as we'd expect, but not by much, as again, the transaction log *isn't the bottleneck*. We should see much better gains by altering the data file configuration - so that's what I'll do next.

Thanks for following the series and for all the comments. Please bear in mind that I'm taking things slowly so I won't jump straight to the best configuration - I want to analyze the changes as we go along.

PS I tried increasing the number of SQL Server connections in the test from 16 to 32 as someone suggested in a comment - the performance got 10% worse! Decreasing the number of connections didn't help either.

Well it's been almost 6 weeks since my last benchmarking blog post as I got side-tracked with the Myth-a-Day series and doing real work for clients :-)

In the benchmarking series I've been doing, I've been tuning the hardware we have here at home so I can start running some backup and other tests. In the last post I'd finally figured out how to tune my networking gear so the performance to my iSCSI arrays didn't suck - see Benchmarking: 1-TB table population (part 5: network optimization again).

Now I've decided that instead of using a 1-TB data set, I'm going to use a 100-GB data set so I don't have to wait 6 hours for each test to finish, it also means that I can start to make use of the SSDs that the nice folks at Fusion-io sent me to test out (sorry guys - I did say it would be a while until I got around to it!). See New hardware to play with: Fusion-io SSDs. I've got two 640-GB cards but I can only install one right now as I need the other PCI-E slot for the NIC card so I can continue to make comparisons with the iSCSI gear without compromising network performance. Then I'll move on to some pure iSCSI vs pure SSD tests.

I've been starting to discuss SSDs a lot more in the classes that I teach and there's some debate over where to put an SSD if you can afford one. I've heard many people say that using them for transaction logs and/or tempdb is the best use of them. We all know what the real answer is though, don't we - that's right - IT DEPENDS!

When to consider SSDs?

An SSD effectively removes the variability in latency/seek-time that you experience with traditional disks, and does best for random read/write workloads compared to sequential ones (I'm not going to regurgitate all the specs on the various drives - Google is your friend). Now, for *any* overwhelmed I/O subsystem, putting an SSD in there that removes some of the wait time for each IO is probably going to speed things up and lead to a workload performance boost - but is that the *best* use of an SSD? Or, in fact, is that an appropriate use of hardware budget?

If you have an overwhelmed I/O subsystem, I'd say there are two cases you fall into:

  1. SQL Server is doing more I/Os than it has to because the query workload has not been tuned
    1. You can tune the queries
    2. You can't do anything about the queries but you can add more memory
    3. You can't to anything about the queries and you can't add more memory
  2. The query workload has been tuned, and the I/O subsystem is still overwhelmed.

I'd say that case #2 is when you want to start looking at upgrading the I/O subsystem, whether with SSDs or not.

If you're in case #1, there are three sub-cases:

  1. You can tune the queries
  2. You can't do anything about the queries but you can add more memory
  3. You can't to anything about the queries and you can't add more memory

If you're in #1.1, go tune the queries first before dropping a bunch of money on hardware. Hardware will only get you so far - sooner or later your workload will scale past the capabilities of the hardware and you'll be in the same situation.

If you're in #1.2, consider giving SQL Server more memory to use for its buffer pool so it can hold more of your workload in memory and cut down on I/Os that way. This is going to be cheaper than buying SSDs.

If you're in case #1.3, you're in the same boat as for case #2.

Now you might want to start considering SSDs, as well as more traditional I/O subsystems. But what to put on it? Again, any overwhelmed I/O subsystem will benefit from an SSD, but only if you put it in the right place.

I have some clients who are using SSDs because they can't connect any more storage to their systems and this gives them a lot of flexibility without the complexity of higher-end storage systems. Others are thinking of SSDs as a greener alternative to traditional spinning disks.

The other thing to consider is redundancy. A 640-GB SSD card presents itself as 2 x 320-GB drives when installed in the OS. As you know, I'm big into availability so to start with I'm going to use these as a 320-GB RAID-1 array (not truly redundant as they're both in the same PCI-E slot - but I can't use both PCI-E slots for SSDs yet). I need to make sure that if a drive fails, I don't lose the ability to keep my workload going. This is what many people forget - just because you have an SSD, doesn't mean you should compromise redundancy. Now don't get me wrong, I'm not saying anything here that implies SSDs are any more likely to fail than any other technology - but it's a drive, and prudence dictates that I get some redundancy in there.

The first test: not overwhelmed transaction log

But which of your data and log files to put onto an SSD? It depends on your workload and which portion of your I/O subsystem is overloaded. This is what I'm going to investigate in the next few posts.

For my first set of tests I wanted to see what effect and SSD would have on my workload if the transaction log isn't the bottleneck but I put in on an SSD, as much advice seems to suggest. In other words, is the log *always* the best thing to put on an SSD? I'll do data files in the next test.

This is a narrow test scenario - inserting 100-GB of server-generated data into a single clustered index. The post Benchmarking: 1-TB table population (part 1: the baseline) explains the scripts I'm using. The hardware setup I'm using is explained in Benchmarking hardware setup, with the addition of the PCI-E mounted 640-GB ioDrive Duo.

For the data file, I had it pre-grown to 100GB, on a RAID-10 array with 8 x 15k 300-GB SCSI drives.

For the log file, I had a variety of sizes and auto-growths, both on a RAID-10 array with 8 x 7.2k 1-TB SCSI drives, and on the 320-GB RAID-1 SSD. The log configurations I tested are:

  • Starting size 256MB with 50MB, 256MB, 512MB auto-growth
  • Fixed auto-growth of 256MB, but starting size of 256MB, 512MB , 1GB, 2GB, 4GB, 8GB

In each case I ran five tests and then took the average to use as the result.

Fixed initial size and varying auto-growth

This test case simulates a system where the log starts small and grows to around 6GB, then remains in steady state. Transaction size is optimized for the maximum 60-KB log block size, when the log *must* be flushed to disk, no matter whether a transaction just committed or not. 

The results from the first test are below:

 

At the 50MB growth rate, the SSDs gave me a 5% gain in overall run time (the green line), but this decreased to around 2% when I started to increase the auto-growth size. Remember each data point represents the average of 5 tests.

Varying initialize size with fixed auto-growth

This test is the same as the previous one, but tries to remove the overhead of transaction log growth by increasing the initial size of the log so it reaches steady-state faster.

The result from the second test are below:

 

For initial log sizes up to and including 2-GB, the SSD gives me a 2.5-3.5% gain over the SATA array. This makes sense as the zero initialization that is required by the log whenever it grows will be a little faster on the SSD. Once I hit 4-GB for the initial size they performed almost the same, and at 8-GB, when no autogrowth occured, the SSD was only 0.8% faster than the RAID-10 array.

Summary

What am I trying to show with these tests? For this post I'm not trying to show you wonderful numbers that'll make you want to go out and buy an SSD right away. I wanted to show you first of all that just because I have an SSD in my system, if I put it in the wrong place, I don't get any appreciable performance gain. Clearly the RAID-10 array where my log is was doing just fine when I set the correct log size and growth. You have to think about your I/O workload before making knee-jerk changes to your I/O subsystem.

But one thing I haven't made clear - a single SSD configured in RAID-1 outperformed a RAID-10 array of 8 7.2k SATA drives in every test I did!

For this test, my bottleneck is the RAID-10 array where the data file is. Before I try various configurations of that, I want to try the SSD in a RAID-0 configuration, which most people use, and also see if I can get the RAID-10 array to go any faster by formatting it with a 64K stripe size.

I'm starting a new blog category to talk about some of weird and confusing stuff I see while query tuning.

First up is the case of the unexpected Key Lookup (Clustered) in a query that looks like it should be covered. This is a follow on from the post Missing index DMVs bug that could cost your sanity...

Today I was tracking down some performance issues on a client site and came across something in a query plan that confused me for a bit. All code and screenshots below are from my simplified repro on my test machine at home.

The code is using a cursor to drive a process, and the SELECT statement driving the cursor is covered by a non-clustered index.

However, the query plan for the cursor-driving statement is as below:

 

You can see the SELECT statement above, and you can see that the query plan is using the nonclustered index on c2 and c3 to satisfy the SELECT. So where is the Key Lookup coming from? If I hover over the Key Lookup operator, I see the details of the operator in a tooltip:

 

According to the details, the output list of the Key Lookup is Chk1002. What's Chk1002? It's not a column in my table, and it's not a check constraint on the table either - I don't have any. And why is it looking up the cluster key *anyway*?

This is where I turned to Kimberly to help figure out what's going on, who said it's to do with the cursor.

I then proceeded to work thigns out and got it a little twisted up. Check out the comments discussion with Brad who explained things, and it's in Books Online too.

Because the cursor is a dynamic optimistic cursor (by default as I didn't specify anything else) SQL Server persists a checksum of all rows picked up by the cursor-driving query in a worktable in tempdb (which, if this is a heavily executed query, can cause perf issues with tempdb). When I say FETCH NEXT, it goes back to the table to get the next value, but if I want to update a column in the row I'm working on, it recalculates the checksum to make sure that nothing has changed in that row outside the confines of my cursor. If so, my update fails - if not, it allows it.

There are plenty of ways to improve the situation (e.g. removing the cursor altogether and using a nice set-based operation... or changing the cursro type) but that's not the point of this post. I just wanted you to be aware of this, and how the query operator properties ins't explicit in working out what's going on.

Enjoy!

Here's yet another reason to be very careful when using the missing index DMVs...

There's a bug in the missing index DMVs that could end up causing you to knock your head against a brick wall and question your sanity. I know I did.

The bug is this: the missing index code may recommend a nonclustered index that already exists. Over and over again. It might also recommend an index that won't actually help a query.

Yes, I'm surprised by this too - as the missing index code is in the query optimizer too. However, it will continue to recommend you create the already-existing index - which is terribly annoying.

This is a little-known bug (Connect item #416197) which is fixed in SQL11 but won't be fixed in earlier versions.

I experienced this on SQL Server 2008 SP1 this weekend and I wanted to blog about it so you don't spend ages trying to work out what's going on.

Here's a repro for you:

CREATE TABLE t1 (
    c1 INT IDENTITY,
    c2 AS c1 * 2,
    c3 AS c1 + c1,
    c4 CHAR (3000) DEFAULT 'a');
GO
CREATE UNIQUE CLUSTERED INDEX t1_clus ON t1 (c1);
GO

SET NOCOUNT ON;
GO
INSERT INTO t1 DEFAULT VALUES;
GO 100000

This creates a table with a a bunch of rows, with each row pretty large so that the cost of scanning the table is expensive.

Now say I want to run a query:

SELECT COUNT (*) FROM t1
    WHERE c2 BETWEEN 10 AND 1000
    AND c3 > 1000;
GO
 

If I display the estimated execution plan...

 

...it will tell me there's a missing index I should create:

 

So I go ahead and create the index and everything's cool:

CREATE NONCLUSTERED INDEX [_missing_c2_c3] ON [dbo].[t1] ([c2],[c3]);
GO

Now what if I want to do something more complicted? How about a cursor over the table? (Don't start on about not using cursors - they're everywhere in application code we see - this is just an easy example to engineer.)

DECLARE testcursor CURSOR FOR
    SELECT c1 FROM t1
    WHERE
        c2 BETWEEN 10 AND 1000
        AND c3 > 1000;

DECLARE @var BIGINT;

OPEN testcursor;

FETCH NEXT FROM testcursor INTO @var;

WHILE (@@fetch_status <> -1)
BEGIN
    -- empty body
    FETCH NEXT FROM testcursor INTO @var;
END

CLOSE testcursor;
DEALLOCATE testcursor;

If I display the estimated execution plan again, it shows:

 

Hmm. That index is actually exactly the same as the one we created earlier (even though it's asking for c1 to be INCLUDEd, it already is in the existing nonclustered index as c1 is the cluster key and is included automatically). However, just to prove I'm not doing anything dodgy, I'll create the index it wants:

CREATE NONCLUSTERED INDEX [_missing_c2_c3_inc_c1] ON [dbo].[t1] ([c2],[c3]) INCLUDE ([c1]);
GO

And nothing changes. You cannot get the missing index code to stop recommending the index. The index isn't being used for the *Key Lookup* in the query plan above  - but the missing index code thinks the index would be useful and suggests it. Not only would that index not actually help the Key Lookup, it already exists!

If you use a query that aggregates the missing index DMV output (such as Bart Duncan's excellent script) and you have some very common queries on your system that are hitting this bug, you will find that the missing index DMV aggregation will be broken too.

Be careful out there!

(Look in the Misconceptions blog category for the rest of the month's posts and check out the 60-page PDF with all the myths and misconceptions blog posts collected together: CommonSQLServerMyths.pdf (732.96 kb))

The BULK_LOGGED recovery model continues to confuse people...

Myth #28: various myths around the BULK_LOGGED recovery model.

28a) regular DML operations can be minimally-logged

No.

Only a small set of bulk operations can be minimally-logged when in the BULK_LOGGED (or SIMPLE) recovery model. The list is in the Books Online topic Operations That Can Be Minimally Logged has the list. This is the 2008 link - make sure to check the link for the version you're running on.

28b) using the BULK_LOGGED recovery model does not affect disaster recovery

No.

Firstly, if a minimally-logged operation has been performed since the last log backup, and one or more data files were damaged and offline because of the disaster, a tail-of-the-log backup cannot be performed and so all user transactions performed since the last log backup will be lost.

Secondly, if a log backup contains a minimally-logged operation, a point-in-time restore cannot be performed to any time covered by the log backup. The log backup can either not be restored, or be restored in its entirety (plus additional log backups if required) - i.e. you can restore to a point:

  • Before the beginning of that log backup
  • At the end of that log backup
  • After the end of that log backup

But you can't restore to a point during that log backup.

28c) using the BULK_LOGGED recovery model also reduces the size of log backups

No.

A log backup that includes a minimally-logged operation must backup the minimal amount of transaction log *and* all the data file extents changed by that operattion - otherwise the restore of the log backup would not fully reconstitute the minimally-logged operation. This means that log backups are roughly the same size whether in the FULL or BULK_LOGGED recovery model.

(Look in the Misconceptions blog category for the rest of the month's posts and check out the 60-page PDF with all the myths and misconceptions blog posts collected together: CommonSQLServerMyths.pdf (732.96 kb))

Nested transactions are an evil invention designed to allow developers to make DBAs' lives miserable. In SQL Server, they are even more evil...

Myth #26: nested transactions are real in SQL Server.

FALSE!!!

Nested transactions do not actually behave the way the syntax would have you believe. I have no idea why they were coded this way in SQL Server - all I can think of is someone from the dim and distant past is continually thumbing their nose at the SQL Server community and going "ha - fooled you!!".

Let me explain. SQL Server allows you to start transactions inside other transactions - called nested transactions. It allows you to commit them and to roll them back.

The commit of a nested transaction has absolutely no effect - as the only transaction that really exists as far as SQL Server is concerned is the outer one. Can you say 'uncontrolled transaction log growth'? Nested transactions are a common cause of transaction log growth problems because the developer thinks that all the work is being done in the inner transactions so there's no problem.

The rollback of a nested transaction rolls back the entire set of transactions - as there is no such thing as a nested transaction.

Your developers should not use nested transactions. They are evil.

If you don't believe me, here's some code to show you what I mean. First off - create a database with a table that each insert will cause 8k in the log.

CREATE DATABASE NestedXactsAreNotReal;
GO
USE NestedXactsAreNotReal;
GO
ALTER DATABASE NestedXactsAreNotReal SET RECOVERY SIMPLE;
GO
CREATE TABLE t1 (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'a');
CREATE CLUSTERED INDEX t1c1 ON t1 (c1);
GO
SET NOCOUNT ON;
GO

Test #1: Does rolling back a nested transaction only roll back that nested transaction?

BEGIN TRAN OuterTran;
GO

INSERT INTO t1 DEFAULT Values;
GO 1000

BEGIN TRAN InnerTran;
GO

INSERT INTO t1 DEFAULT Values;
GO 1000

SELECT @@TRANCOUNT, COUNT (*) FROM t1;
GO

I get back the results 2 and 2000. Now I'll roll back the nested transaction and it should only roll back the 1000 rows inserted by the inner transaction...

ROLLBACK TRAN InnerTran;
GO

Msg 6401, Level 16, State 1, Line 1
Cannot roll back InnerTran. No transaction or savepoint of that name was found.

Hmm... from Books Online, I can only use the name of the outer transaction or no name. I'll try no name:

ROLLBACK TRAN;
GO

SELECT @@TRANCOUNT, COUNT (*) FROM t1;
GO

And I get the results 0 and 0. As Books Online explains, ROLLBACK TRAN rolls back to the start of the outer transaction and sets @@TRANCOUNT to 0. All changes are rolled back. The only way to do what I want is to use SAVE TRAN and ROLLBACK TRAN to the savepoint name.

Test #2: Does committing a nested transaction really commit the changes made?

BEGIN TRAN OuterTran;
GO

BEGIN TRAN InnerTran;
GO

INSERT INTO t1 DEFAULT Values;
GO 1000

COMMIT TRAN InnerTran;
GO

SELECT COUNT (*) FROM t1;
GO

I get the result 1000, as expected. Now I'll roll back the outer transaction and all the work done by the inner transaction should be preserved...

ROLLBACK TRAN OuterTran;
GO

SELECT COUNT (*) FROM t1;
GO

And I get back the result 0. Oops - committing the nested transaction did not make its changes durable.

Test #3: Does committing a nested transaction at least let me clear the log?

I recreated the database again before running this so the log was minimally sized to begin with, and the output from DBCC SQLPERF below has been edited to only include the NestedXactsAreNotReal database.

BEGIN TRAN OuterTran;
GO

BEGIN TRAN InnerTran;
GO

INSERT INTO t1 DEFAULT Values;
GO 1000

DBCC SQLPERF ('LOGSPACE');
GO

Database Name         Log Size (MB) Log Space Used (%) Status
--------------------- ------------- ------------------ -----------
NestedXactsAreNotReal 12.05469      95.81983           0

Now I'll commit the nested transaction, run a checkpoint (which will clear all possible transaction log in the SIMPLE recovery model), and check the log space again:

COMMIT TRAN InnerTran;
GO

CHECKPOINT;
GO

DBCC SQLPERF ('LOGSPACE');
GO

Database Name         Log Size (MB) Log Space Used (%) Status
--------------------- ------------- ------------------ -----------
NestedXactsAreNotReal 12.05469      96.25324           0

Hmm - no change - in fact the Log Space Used (%) has increased slightly from writing out the checkpoint log records (see How do checkpoints work and what gets logged). Committing the nested transaction did not allow the log to clear. And of course not, because a rollback can be issued at any time which will roll back all the way to the start of the outer transaction - so all log records are required until the outer transaction commits or rolls back.

And to prove it, I'll commit the outer transaction and run a checkpoint:

COMMIT TRAN OuterTran;
GO

CHECKPOINT;
GO

DBCC SQLPERF ('LOGSPACE');
GO

Database Name         Log Size (MB) Log Space Used (%) Status
--------------------- ------------- ------------------ -----------
NestedXactsAreNotReal 12.05469      26.4339            0

And it drops right down.

Nested transactions - just say no! (a public service announcement from the nice folks at SQLskills.com :-)

Over the last few months there's been some noise (mostly of my making) on Twitter about the number of VLFs in transsction logs. Given the large numbers of people who read the blog and follow me on Twitter, I thought it would be very interesting to collect some statistics from a few hundred of you about how your transction logs are configured - in terms of the size, number of log files, and number of VLFs.

To that end, I've created a little script that you can download and run which will generate output of the form:

DB ID  Recovery Model  Log Size (MB)  Log Used (%)   Log File Count  VLF Count 
------ --------------- -------------- -------------- --------------- ----------
1      SIMPLE          6.12           34.42          1               22
2      SIMPLE          0.49           75.60          1               2 
3      FULL            14.68          97.90          1               51
4      SIMPLE          2.49           45.92          1               10
11     SIMPLE          1.99           33.55          1               4
12     FULL            359.99         77.84          1               96
13     FULL            0.48           45.67          1               2

I'd like as many of you as possible to send me the results for some of your systems - there are no database names in the output and the results will be completely anonymous. I'll collect all the results together and blog some scatter graphs showing how people have things set up.

You can get the code from SQLSkillsLogInfo.zip (2.05 kb) (it works on 2000, 2005, 2008) and send me as many results as you like (paul@sqlskills.com) either in plain text or spreadsheet format - or paste them totally anonymously in a comment. The script only takes a few seconds to run, won't affect performance in any way, and won't affect your transaction logs. Now - if your transaction log has many thousands of VLFs, it might take a few minutes to run...

You might think your system is boring - but as far as this is concerned, all results are useful results.

I'll publish the results in a couple of weeks.

Thanks!

(Look in the Misconceptions blog category for the rest of the month's posts and check out the 60-page PDF with all the myths and misconceptions blog posts collected together: CommonSQLServerMyths.pdf (732.96 kb))

This blog post is part of two series - my Myth-A-Day series and the monthly T-SQL Tuesday series that fellow-MVP Adam Machanic (twitter|blog) organizes. This month's T-SQL Tuesday is being run by Aaron Nelson (twitter|blog) and is on the subject of reporting - see this blog post for details.

Myth #13: you cannot run DMVs when in the 80 compat mode.

FALSE

To start with, there's a lot of confusion about what compat mode means. Does it mean that the database can be restored/attached to a SQL Server 2000 server? No. It means that some T-SQL parsing, query plan behavior, hints and a few other things behave as they did in SQL Server 2000 (or 2005, if you're setting it to 90 on a 2008 instance).

In SQL Server 2008 you can use ALTER DATABASE SET COMPATIBILITY_LEVEL to change the compatibility level; in prior versions you use sp_dbcmptlevel. To see what the compability level controls, see:

Compatibility level has no effect on the database physical version - which is what gets bumped up when you upgrade, and prevents a database being restored/attached to a previous version - as they have a maximum physical version number they can understand. See my blog post Search Engine Q&A #13: Difference between database version and database compatibility level for more details, and Msg 602, Level 21, State 50, Line 1 for details on the error messages you get when trying to attach/restore a database to a previous version.

But I digress, as usual :-)

One of the things that looks like it doesn't work is using DMVs when in the 80 compat mode. Here's a simple script to show you, using SQL Server 2005:

CREATE DATABASE DMVTest;
GO
USE DMVTest;
GO
CREATE TABLE t1 (c1 INT);
CREATE CLUSTERED INDEX t1c1 on t1 (c1);
INSERT INTO t1 VALUES (1);
GO

EXEC sp_dbcmptlevel DMVTest, 80;
GO

SELECT * FROM sys.dm_db_index_physical_stats (
     DB_ID ('DMVTest'), -- database ID
     OBJECT_ID ('t1'),  -- object ID       <<<<<< Note I'm using 1-part naming
     NULL,              -- index ID
     NULL,              -- partition ID
     'DETAILED');       -- scan mode
GO

And the really useful error I get back is:

Msg 102, Level 15, State 1, Line 2
Incorrect syntax near '('.

How incredibly useful is that? It pinpoints the problem exactly - not.

Edit: After writing this I realized I'd fallen victim to my own myth too! DMVs *are* supported in the 80 compat-mode completely. What's *not* supported is calling a function (e.g. OBJECT_ID) as one of the DMV parameters. Thanks to Aaron Bertrand for point this out! (Apparently he pointed that out in the recent Boston class we taught, but I missed it.)

Here's the trick to using the functions as parameters. You change context to a database in the 90 or higher compatibility level - and then you can point the DMV at the database in the 80 compatibility level.

Very cool. Check it out:

USE master;
GO

SELECT * FROM sys.dm_db_index_physical_stats (
 DB_ID ('DMVTest'),         -- database ID
 OBJECT_ID ('DMVTest..t1'), -- object ID   <<<<<< Note I'm using 3-part naming here now

 NULL,                      -- index ID
 NULL,                      -- partition ID
 'DETAILED');               -- scan mode
GO

And it works, even though the database DMVTest is in the 80 compatibility level.

One thing to be *very* careful of - you need to make sure you're using the correct object ID. If I'd just left the second parameter as OBJECT_ID ('t1'), it would have tried to find the object ID of the t1 table in the master database. If it didn't find it, it will use the value NULL, which will cause the DMV to run against all tables in the DMVTest database. If by chance there's a t1 table in master, it's likely got a different object ID from the t1 table in DMVTest, and so the DMV will puke (that's a technical term folks :-).

And sys.dm_db_index_physical_stats isn't a true DMV - Dynamic Management View - it's a Dynamic Management Function which does a *ton* of work potentially to return results - so you want to make sure you limit it to only the tables you're interested in. See my recent blog post Inside sys.dm_db_index_physical_stats for details of how it works and how expensive it can be.

So, you'll need to use the new 3-part naming option of OBJECT_ID in SQL Server 2005 onwards to make sure you're grabbing the correct object ID when going across database contexts.

Another way to do it is to use variables and pre-assign the values to them, which you can do from within the 80 compat-mode database:

DECLARE @databaseID INT;
DECLARE
@objectID   INT;

SELECT @databaseID = DB_ID ('DMVTest');
SELECT @objectID   = OBJECT_ID ('t1');

SELECT * FROM sys.dm_db_index_physical_stats (
     @dbid,        -- database ID
     @objid,       -- object ID
     NULL,         -- index ID
     NULL,         -- partition ID
     'DETAILED');  -- scan mode
GO

Bottom line: another myth bites the dust!

(Be sure to join our community to get our monthly newsletter with exclusive content, advance notice of classes with discount codes, and other SQL Server goodies!)  

This is a myth I hear over and over and over...

Myth #12: tempdb should always have one data file per processor core.

FALSE

Sigh. This is one of the most frustrating myths because there's so much 'official' information from Microsoft, and other blog posts that persists this myth.

One of the biggest confusion points is that the SQL CAT team recommends 1-to-1, but they're coming from a purely scaling perspective, not from an overall perf perspective, and they're dealing with big customers with top-notch servers and IO subsystems. Most people are not.

There's only one tempdb per instance, and lots of things use it, so it's often a performance bottleneck. You guys know that already. But when does a performance problem merit creating extra tempdb data files?

When you see PAGELATCH waits on tempdb, you've got contention for in-memory allocation bitmaps. When you see PAGEIOLATCH waits on tempdb, you've got contention at the I/O subsystem level. You can think of a latch as kind of like a traditional lock, but much lighter wait, much more transitory, and used by the Storage Engine internally to control access to internal structures (like in-memory copies of database pages).

Fellow MVP Glenn Berry (twitter|blog) has a blog post with some neat scripts using the sys.dm_os_wait_stats DMV - the first one will show you what kind of wait is most prevalent on your server. If you see that it's PAGELATCH waits, you can use this script from newly-minted MCM and Microsoft DBA Robert Davis (twitter|blog). It uses the sys.dm_os_waiting_tasks DMV to break apart the wait resource and let you know what's being waited on in tempdb.

If you're seeing PAGELATCH waits on tempdb, then you can mitigate it using trace flag 1118 (fully documented in KB 328551) and creating extra tempdb data files. I wrote a long blog post debunking some myths around this trace flag and why it's still potentially required in SQL 2005 and 2008 - see Misconceptions around TF 1118.

On SQL Server 2000, the recommendation was one tempdb data file for each processor core. On 2005 and 2008, that recommendation persists, but because of some optimizations (see my blog post) you may not need one-to-one - you may be ok with the number of tempdb data files equal to 1/4 to 1/2 the number of processor cores.

[Edit: At PASS in 2011, my good friend Bob Ward, who's the top guy in SQL CSS, espoused a new formula, which I agree with: if you have less than 8 cores, use #files = #cores. If you have > 8 cores, use 8 files and if you're seeing in-memory contention, add 4 more files at a time.]

Now this is all one big-ass generalization. I heard just last week of a customer who's tempdb workload was so high that they had to use 64 tempdb data files on a system with 32 processor cores - and that was the only way for them to alleviate contention. Does this mean it's a best practice? Absolutely not!

So, why is one-to-one not always a good idea? Too many tempdb data files can cause performance problems for another reason. If you have a workload that uses query plan operators that require lots of memory (e.g. sorts), the odds are that there won't be enough memory on the server to accomodate the operation, and it will spill out to tempdb. If there are too many tempdb data files, then the writing out of the temporarily-spilled data can be really slowed down while the allocation system does round-robin allocation. The same thing can happen with very large temp tables in tempdb too.

Why would round-robin allocation cause things to slow down for memory-spills to tempdb with a large number of files? A couple of possibilities:

  • Round-robin allocation is per filegroup, and you can only have one filegroup in tempdb. With 16, 32, or more files in tempdb, and very large allocations happening from just a few threads, the extra synchronization and work necessary to do the round-robin allocation (looking at the allocation weightings for each file and deciding whether to allocate or decrement the weighting, plus quite frequently recalculating the weightings for all files - every 8192 allocations) starts to add up and become noticeable. It's very different from lots of threads doing lots of small allocations. It's also very different from allocating from a single-file filegroup - which is optimized (obviously) to not do round-robin.
  • Your tempdb data files are not the same size and so the auto-grow is only growing a single file (the algorithm is unfortunately broken), leading to skewed usage and an IO hotspot.
  • Having too many files can lead to essentially random IO patterns when the buffer pool needs to free up space through the lazywriter (tempdb checkpoints dont' flush data pages) for systems with not very large buffer pools but *lots* of tempdb data. If the IO subsystem can't handle the load across multiple files, it will start to slow down.

I really need to do a benchmarking blog post to show what I mean - but in the mean time, I've heard this from multiple customers who've created large numbers of tempdb files, and I know this from how the code works (my dev team owned the allocation code).

So you're damned if you do and damned if you don't, right? Potentially - yes, this creates a bit of a conundrum for you - how many tempdb data files should you have? Well, I can't answer that for you - except to give you these guidelines based on talking to many clients and conference/class attendees. Be careful to only create multiple tempdb data files to alleviate contention that you're experiencing - and not to push it to too many data files unless you really need to - and to be aware of the potential downfalls if you have to. You may have to make a careful balance of scalability vs performance to avoid helping one workload and hindering another.

Hope this helps.

PS To address a comment that came in - no, the extra files don't *have* to be on separate storage. If all you're seeing it PAGELATCH contention, separate storage makes no difference as the contention is on in-memory pages. For PAGEIOLATCH waits, you most likely will need to use separate storage, but not necessarily - it may be that you need to move tempdb itself to different storage from other databases rather than just adding more tempdb data files. Analysis of what's stored where will be necessary to pick the correct path to take.

 

Blog posts in this series:

  • For the hardware setup I'm using, see this post.
  • For an explanation of log growth and its effect on perf, see this post.
  • For the baseline performance measurements for this benchmark, see this post.
  • For the increasing performance through log file IO optimization, see this post.
  • For the increasing performance through separation of data and log files, see this post.
  • For my first (mostly unsuccessful) crack at network optimization, see this post.

The saga continues. Like a bad Star Wars prequel. But with better special effects and more believable dialog.

In the previous post in the series I messed around the network configurations - well, really I flailed a bit as it's not my specialty - and you can't run DBCC CHECKNETWORK with a REPAIR_MAKE_IT_FASTER switch.

The net effect of my flailing was no real increase in my benchmark performance. I managed to shave off a few hundred seconds from a 5.5 hour run-time. The best I could get for generating my 1-TB clustered index was 20317 seconds with the following being a picture of typical network utilization:

 

There's clearly a bottleneck there when the iSCSI NIC to the data array hits 100% utilization - when a checkpoint is occurring. I resolved myself to figuring this out.

Well, it turns out that in the Dell Modular Storage Manager, the iSCSI host port configuration window has a SAVE button that's off the bottom of my screen. So when I thought I'd enabled jumbo frames, I actually hadn't.

So, in these tests I made sure to save the configuration so that it actually applied. And, funnily enough, I saw some better performance. Who'd have thought, eh?

All these tests were with an unchanging log file size/auto-growth and data file size/auto-growth, as documented at the top of the previous post, so I won't repeat it here.

1) Enabling 4088-byte jumbo frames on the Intel server NIC to the data array

The Intel PRO/1000 PT Dual Port GigE NICs can handle 4088 or 9014 byte jumbo frames, and the iSCSI host ports on the data file MD3000i can only go up to 9000 bytes, so 4088 it is. The test ran in 18623 seconds, an 8.4% improvement over the baseline.

2) Enabling 9000-byte jumbo frames on the BroadCom server NIC to the log array

The BroadCom BCM5708C NetXtreme II GigE NICs can use a manually specified jumbo frame size so I set it to 9000 bytes, as well as the iSCSI host ports on the log file MD3000i. The test ran in 18284 seconds, a 1.8% improvement over the configuration in #1 above, and an overall 10% improvement over the baseline.

3) Using two Intel server NICs to the data array with least-queue-depth

Even with the jumbo frames enabled, I was still saturating the network link to the data array so I decided to use a pair of NICs cooperating on the same iSCSI channel, as several people had suggested. I configured two Intel NICs with 4088-byte jumbo frames, set them up as a paired set of connections to the same iSCSI controller (using two host ports on the same controller, so the NICs are essentially one-to-one), and let rip. The test ran in 18178 seconds, a 0.6% improvement over the configuration in #2 above, and an overall 10.5% improvement over the baseline.

4) Using two Intel server NICs to the data array with round-robin

I then changed the algorithm for the two Intel NICs to use round-robin instead of least-queue-depth. The test ran in 17719 seconds, a 2.5% improvement over the configuration in #3 above, and an overall 12.8% improvement over the baseline. You can see that in the perfmon capture lower down this post.

Clearly using paired-NICs with jumbo frames is the way to go. I also measured the throughput on the iSCSI arrays, and the 8-drive 15k RAID-10 SCSI array was pushing just under 125000 bytes per second - a 14% improvement over the best previous measurement.

My network utilization pretty-picture is as shown below:

 

Connections 1 and 2 are the Intel NICs going to the data file array and connection 3 is the BroadCom NIC going to the log file array. This screen shot is capturing the start of a performance run as the log gradually increases in size. The network spikes on connections 1 and 2 are when data pages are being flushed out during checkpoints. You can see that the checkpoint get longer and further apart as the log file size increases. I explain this in depth in the post Interesting case of watching log file growth during a perf test. But as you can clearly see - no more network bottleneck!

And here's a perfmon capture from the middle of one of the tests with paired-NICs:

 

Points to note:

  • The Pages Allocated/sec (black highlighted line) is averaging a much higher number than before, with spikes up to almost 9000.
  • The Disk Write Bytes/sec for the K: (the green line at the top) is off the scale, hitting near to the maximum capacity of the array as I described above.
  • The Avg. Disk Write Queue Length for K: (the browny-red line) doesn't get much over 20, whereas previously it had been spiking in the mid 30s.

Now that I've removed the network bottleneck I'm going to experiment with the log file initial size and autogrowth rate, and then the data portion of the database.

One thing to bear in mind is that I'm tuning this for the optimal performance for the one task I'm trying to achieve right now. It'll be interesting to see whether this is the optimal configuration for large-scale reads, mixed workloads, index rebuilds, etc etc, and I'm also going to start throwing the Fusion-IO drives into the mix.

Hope you're enjoying learning along with me, stay tuned!

Way back in the mists of time, at the end of the last century, I wrote DBCC SHOWCONTIG for SQL Server 2000, to complement my new invention DBCC INDEXDEFRAG.

I also used to wear shorts all the time, with luminous orange, yellow, or green socks.

Many things change - I now have (some) dress sense, for one. One other thing that changed was that DMVs came onto the scene with SQL Server 2005. DBCC SHOWCONTIG was replaced by sys.dm_db_index_physical_stats. Under the covers though, they both use the same code - and the I/O characteristics haven't changed.

This is a blog post I've been meaning to do for a while now, and I finally had the impetus to do it when I heard about today's T-SQL Tuesday on I/O in general being run by Mike Walsh (Twitter|blog). It's a neat idea so I decided to join in this time. In retrospect, reading this over before hitting 'publish', I got a bit carried away (spending two hours on this) - but it's one of my babies, so I'm entitled to! :-)

This isn't a post about how to use DMVs in general, how to use this DMV in particular, or anything about index fragmentation. This is a blog post about how the DMV works.

DMV is a catch-all phrase that most people (myself included) use to describe all the various utility views in SQL Server 2005 and 2008. DMV = Dynamic Management View. There's a catch with the catch-all though - some of the DMVs aren't views at all, they're functions. A pure DMV gets info from SQL Server's memory (or system tables) and displays it in some form. A DMF, on the other hand, has to go and so some work before it can give you some results. The sys.dm_db_index_physical_stats DMV (which I'm going to call 'the DMV' from now on) is by far the most expensive of these - but only in terms of I/O.

The idea of the DMV is to display physical attributes of indexes (and the special case of a heap) - to do this it has to scan the pages comprising the index, calculating statistics as it goes. Many DMVs support what's called predicate pushdown, which means if you specify a WHERE clause, the DMV takes that into account as it prepares the information. This DMV doesn't. If you ask it for only the indexes in the database that have logical fragmentation > 30%, it will scan all the indexes, and then just tell you about those meeting your criteria. It has to do this because it has no way of knowing which ones meet your criteria until it analyzes them - so can't support predicate pushdown.

This is where understanding what it's doing under the covers comes in - the meat of this post.

LIMITED

The default operating mode of the DMV is called LIMITED. Kimberly always makes fun of the equivalent option for DBCC SHOWCONTIG, which I named as a young and foolish developer - calling it WITH FAST. Hey - it's descriptive!

The LIMITED mode can only return the logical fragmentation of the leaf level plus the page count. It doesn't actually read the leaf level. It makes use of the fact that the next level up in the index contains a key-ordered list of page IDs of the pages at the leaf level - so it's trivial to examine the key-ordered list and see if the page IDs are also in allocation order or not, thus calculating logical fragmentation.

The idea behind this option is to allow you to find the fragmentation of an index by reading the minimum number of pages, i.e. in the smallest amount of time. This option can be magnitudes faster than using the DETAILED mode scan, and it depends on how big the index's fanout is. Without getting too much into the guts of indexes, the fanout is based on the index key size, and determines the number of child-page pointers an index page can hold (e.g. the number of leaf-level pages that a page in the next level up has information about).

Consider an index with a char(800) key. Each entry in a page in the level above the leaf has to include a key value (the lowest key that can possibly appear on the page being referred to), plus a page ID, plus record overhead, plus slot array entry - so 812 bytes. So a page can only hold 8096/812 = 9 such entries. The fanout is at most 9.

Consider an index with a bigint key. Each entry is 13 bytes, so a page can hold 8096/13 = 622 entries. The fanout is at most 622, but will likely be smaller, depending on operations on the index causing fragmentation at the non-leaf levels.

For a table with 1 million pages at the leaf level, the first index will have 1 million/9 = 111112 pages at least at the level above the leaf. The second index will have at least 1608 pages. The savings in I/O from using the LIMITED mode scan will clearly differ based on the fanout.

I've created a 100GB clustered index (on the same hardware as I'm using for the benchmarking series) with 13421760 leaf-level pages and a maximum fanout of 540. In reality, I populated the index using 16 concurrent threads, so there's some fragmentation. The level above the leaf has 63012 pages, an effective fanout of 213. Still, the LIMITED mode scan will read 213x less than a DETAILED scan, but will it be 213x faster?

Here's a perfmon capture of the LIMITED mode scan on my index:

 

There's nothing special going on under the covers in a LIMITED mode scan - the chain of pages at the level above the leaf is read in page-linkage order, with no readahead. The perfmon capture shows:

  • Avg. Disk Read Queue Length (light blue) is a steady 1.
  • Avg. disk sec/Read (pink) is a steady 4ms.
  • Disk Read Bytes/sec (green) is roughly 14.5million.
  • Page reads/sec (dark blue) is roughly 1800.

DETAILED 

The DETAILED mode does two things:

  • Calculate fragmentation by doing a LIMITED mode scan
  • Calculate all other statistics by reading all pages at every level of the index

And so it's obviously the slowest. It has to do the LIMITED mode scan first to be able to calculate the logical fragmentation, because it reads the leaf level pages in the fastest possible way - in allocation order. DBCC has a customized read-ahead mechanism for allocation order scans that it uses for this DMV and for DBCC CHECK* commands. It's *incredibly* aggressive and will hit the disks as hard as it possibly can, especially with DBCC CHECK* running in parallel.

Here's a perfmon capture of the DETAILED mode scan on my index:

 

Not quite as pretty as the LIMITED mode scan, but I like it :-) Here's what it's showing:

  • Avg. Disk Read Queue Length (black) is in the multiple hundreds. Clearly it's appetite for data is outstripping what my RAID array can do. It basically tries to saturate the I/O subsystem to get as much data as possible flowing into SQL Server.
  • Avg. disk sec/Read (pink line at the bottom) is actually measuring in whole seconds, rather than ms. Given the disk queue length, I'd expect that.
  • DBCC Logical Scan Bytes/sec (red) varies substantially as the readahead mechanism throttles up and down, but it's driving anywhere up to 80MB/sec. You can see around 9:49:20 AM when it drops to zero for a few seconds.
  • Readahead pages/sec (green) is tracking the DBCC scan. This is a buffer pool counter, the DBCC one is an Access Methods counter (the dev team I used to run during 2005 development). If I had Disk Read Bytes/sec and Pages reads/sec showing, they'd track the other two perfectly - I turned them off for clarity.

So the DETAILED mode not only reads more data, but it does it a heck of a lot more aggressively so has a much more detrimental effect on the overall I/O capabilities of the system while it's running.

SAMPLED

There is a third mode that was introduced just for the DMV. The idea is that if you have a very large table and you want an idea of some of the leaf level statistics, but you don't want to take the perf hit of running a DETAILED scan, you can use this mode. It does:

  • LIMITED mode scan
  • If the number of leaf level pages is < 10000, read all the pages, otherwise read every 100th pages (i.e. a 1% sample)

Summary

There's no progress reporting from the DMV (or DBCC SHOWCONTIG) but if you look at the reads column in sys.dm_exec_sessions you can see how far through the operation it is. This method works best for DETAILED scans, where can compare that number against the in_row_data_page_count for the index in sys.dm_db_partition_stats (yes, you'll need to mess around a bit if the index is actually partitioned).

In terms of timing, I ran all three scan modes to completion. The results:

  • LIMITED mode: 282 seconds
  • SAMPLED mode: 414 seconds
  • DETAILED mode: 3700 seconds

Although the LIMITED mode scan read roughly 200x less than the DETAILED scan, it was only 13 times faster, because the readahead mechanism for the DETAILED scan is way more efficient than the (necessary) follow-the-page-linkages scan of the LIMITED mode.

Just for kicks, I ran a SELECT COUNT(*) on the index to see how the regular Access Methods readahead mechanism would fare - it completed in 3870 seconds - 5% slower, and it had less processing to do than the DMV. Clearly DBCC rules! :-)

Although the DETAILED mode gives the most comprehensive output, it has to do the most work. For very large indexes, this could mean that your buffer pool is thrashed by the lazy writer making space available for the DMV to read and process the pages (it won't flush out the buffer pool though, as the pages read in for the DMV are the first ones the lazywriter will kick out again). One of the reasons I advise people to only run the DMV on indexes they know they're interested in - and better yet, run it on a restored backup of the database.

Hope this is helpful!

PS Oh, also beware of using the SSMS fragmentation wizard. It uses a SAMPLED mode scan, but I found it impossible to cancel!

Blog posts in this series:

  • For the hardware setup I'm using, see this post.
  • For an explanation of log growth and its effect on perf, see this post.
  • For the baseline performance measurements for this benchmark, see this post.
  • For the increasing performance through log file IO optimization, see this post.
  • For the increasing performance through separation of data and log files, see this post.

In the previous post in the series, I examined the effects of separating the data and log files (one file each) to different RAID arrays. It was very obvious that separation gave a performance boost, and that having the portion of the database with the highest I/O write throughput requirements on the faster array (the 8-drive 15k SCSI RAID10) produced the biggest gain.

Now - a confession.  In the last post, when I posted it I found that moving the data an 8-drive 7.2k SATA RAID10 array was the best approach. *But* during the testing for this post, I found that one of my tests had screwed up and only half the client threads had run. You'll notice in that post I went back in and edited it to explain that and update the graph and results. I've now augmented my test harness with a way to check that all client threads are running - to make sure the tests are apples to apples, rather than apples to pomegranates :-)

So - the best I've been able to do so far with the tests is creating 1TB of data using 128 connections (each creating 1/128th TB using inserts with default values) with the single data file on an 8-drive 15k SCSI RAID array (pre-created to 1TB) and the log file on an 8-drive 7.2k SATA RAID10 array (pre-created to 256MB with 50MB autogrowth) in 20842 seconds.

Lots of people have been asking how my network is setup in these tests. Here's what I've been running with (all 1GB ethernet):

  • 1 NIC from a Broadcom BCM5708C NetXtreme II GigE card on the 10.x.x.x network
  • 1 NIC from a Broadcom BCM5708C NetXtreme II GigE card on the 192.168.x.x network
  • 2 x PowerConnect 5424 24-port iSCSI optimized switches , with no separation of traffic
  • The 10.x.x.x server NIC connected to all iSCSI arrays

Over the last couple of weeks I've been playing around with the network setup to make sure things are optimized, and this post will describe what I did and what effect it had. In all the tests below, I kept the dat aon the faster SCSI array and the log on the slower SATA array.

I'm very grateful to the help I received from Wes Brown (twitter|blog) and Denny Cherry (twitter|blog) to the technical questions and WTF?s I sent (and to anyone else on twitter I may have forgotten!).

1) Separation of network traffic

I decided to make one of the 5424 switches dedicated to iSCSI traffic on the 10.x.x.x network and the other for general network activity, including connecting to the management ports on the MD3000s. Turns out that I didn't really need to, as each 5424 can handle 48GB of throughput, way more than I'm generating. But hey ho, at least the wiring in the back of the 42U rack is a little tidier now :-)

Running the 128-way test with the new configuration gave a test time of 21252 seconds, slightly slower than the best time without separation! This was the first of the WTF?s. Until I realized that I hadn't actually removed any network bottleneck at all. I can't explain why things are slightly slower here, so I decided to take the switches out of the equation. My suspicion is that if I ran the test ten times, I'd get ten different results, but within a standard deviation of the median. So - no cause for concern. (In fact, I'm going to try this as part of the next set of tests.)

2) Direct connections to the iSCSI arrays

I configured another NIC (one from an Intel PRO/1000 PT Dual Port GigE card) and then had one NIC directly connected to one of the RAID controllers on the SCSI MD3000 (only one configured volume, so no concerns about having multiple volumes suddenly switching over to a non-optimal RAID controller) and the other NIC directly connected to the SATA MD3000.

Running the 128-way test with the new configuration gave a test time of 21859 seconds, slower than test #1. Very surprising - I expected to get some *gain* so I looked at the peak throughput of the arrays:

  • For test 1, peak SATA was 50500 bytes/sec and peak SCSI was 106012 bytes/sec.
  • For test 2, peak SATA was 46923 bytes/sec and peak SCSI was 107708 bytes/sec.

Things are slower with the network bottleneck removed.

3) Upgrading 5424 switch firmware and reconfiguring

Although the 5424 switches are supposed to come iSCSI optimized, I thought I'd flatten them and reconfigure the. I got the latest version of the switch firmware and re-flashed both switches. I think configured the 10.x.x.x one specifically for iSCSI using this excellent Dell whitepaper.

Running the 128-way test with the new configuration gave a test time of 20745 seconds. Finally an improvement, but nothing major, and still possibly just a statistical variation.

4) Upgrading the NIC drivers

Next I figured I'd bring the NICs up to the latest driver versions so upgraded all the NICs on all the servers.

Running the 128-way test with the new configuration gave a test time of 21743 seconds. Hmmm.

5) Homogenizing the network paths

At this point I started wondering if the Broadcom and Intel NICs had different characteristics so I decided to use the two Intel NICs for the iSCSI traffic. I also enabled jumbo frames. The Intel NICs have three setting for jumbo frames - off, 4088 bytes or 9014 bytes. The MD3000s can only go up to 9000 bytes, so I chose 4088 bytes and configured the MD3000 iSCSI ports to use the same.

Running the 128-way test with the new configuration gave a test time of 21526 seconds - nothing to write home about.

None of the network configuration changes I made had much effect on performance, apart from removing the network bottleneck, which made performance slightly worse overall. I checked other stuff like TCP offloading, but that wasn't enabled. My suspicion was that by removing the network bottleneck, I unmasked a SQL Server contention issue with my 128-connection test. I decided to try fewer client connections.

Here are the results:

There's clearly a SQL Server bottleneck that's being alleviated by reducing the number of connections and allowing the throughput to each array to increase slightly. With 8 connections, SQL Server isn't being driven hard enough and the elapsed time increases again, and this is reflected in the array throughput measurements too (a 10-15% drop compared to the 16-way test). One thing I forgot to do was examine the distribution of wait types while these tests were running, but my guess would be the bottleneck was in the transaction log manager.

Summary

By separating the network traffic and moving to two iSCSI NICs, I removed the network bottleneck I had (see the image at the bottom of the last post) and replaced it with a SQL Server bottleneck. Here's a snapshot of network utilization with the new setup:

 

In the next set of tests, I'm going to look at the effect of altering the transaction log auto-growth size, and pre-allocation size. In all the tests so far, the log has grown from the pre-allocated 256MB to somewhere between 6.5-8GB.

Should be interesting - stay tuned.

PS Yes, I'll be doing a bunch of stuff with the Fusion-io drives too - be patient! :-)

(Be sure to join our community to get our monthly newsletter with exclusive content, advance notice of classes with discount codes, and other SQL Server goodies!) 

A while ago I blogged about disk partition alignment, and how the default alignment of 31.5Kb on Windows Server 2003 can lead to enormous I/O performance problems (see Are your disk partition offsets, RAID stripe sizes, and NTFS allocation units set correctly?). We've been on-site with clients this week and that topic came up again. I thought it would be useful to do a quick blog post showing how to use the diskpart and wmic tools. Google them for lots of info from the Microsoft site - but be careful not to play around with any of the destructive options on productions systems. The options I'm using below will not alter the disks in any way.

Note: This stuff applies to MBR disks, not GPT or dynamic disks. Although these require correct alignment too, I don't have any information on how to do it for those disks. The SQLCAT team will be publishing some guidelines but has not yet done so, AFAIK. Check out the SQLCAT team whitepaper Disk Partition Alignment Best Practices for SQL Server for full details on this topic.

Bring up a command prompt and type diskpart. You'll see something like:

C:\Users\Administrator>diskpart

Microsoft DiskPart version 6.0.6001
Copyright (C) 1999-2007 Microsoft Corporation.
On computer: MONKEY

DISKPART>

Next you need to list the logical disks that Windows knows about. Type list disk. You'll see something like:

DISKPART> list disk

  Disk ###  Status      Size     Free     Dyn  Gpt
  --------  ----------  -------  -------  ---  ---
  Disk 0    Online       136 GB  1434 MB
  Disk 1    Online      1116 GB      0 B
  Disk 2    Online      2036 GB      0 B

DISKPART> 

Disks 1 and 2 are two RAID arrays I'm using right now for the performance benchmarking series I'm doing. Notice that the numbers in the Free column aren't correct - not sure why not. 

To see the partitions on a disk, you need to set the diskpart focus to be that disk. Type select disk X, where X is the disk you want to focus on. You'll see something like:

DISKPART> select disk 1

Disk 1 is now the selected disk.

DISKPART>

And now you can list the partitions on the disk using list partition. You'll see something like:

DISKPART> list partition

  Partition ###  Type              Size     Offset
  -------------  ----------------  -------  -------
  Partition 1    Primary           1116 GB  1024 KB

DISKPART>

This is the output from one of my Windows Servr 2008 servers, where the default partition offset is 1MB - which doesn't lead to perf issues.

On another Windows XP system, I get the following output:

DISKPART> select disk 0

Disk 0 is now the selected disk.

DISKPART> list partition

  Partition ###  Type              Size     Offset
  -------------  ----------------  -------  -------
  Partition 1    Primary            119 GB    32 KB

DISKPART>

This disk isn't aligned correctly. If this was a RAID array, I'd pay a perf penalty every time a read or write straddled a RAID stripe offset. See the blog post link at the top of this post for more details.

Unfortunately, diskpart isn't always the best tool to use to get partition offsets, as it rounds up the values, and when there are multiple partitions, it can be hard to tell exactly what's what, especially whtih lots of disks where you need to select each one and then list the partitions.

In this case, use wmic to get the exact numbers. The command is as follows:

wmic partition get BlockSize, StartingOffset, Name, Index

For my server, I get the following output:

BlockSize  Index  Name                   StartingOffset
512        0      Disk #1, Partition #0  1048576
512        0      Disk #2, Partition #0  1048576
512        0      Disk #0, Partition #0  1505755136
 

For dynamic disks, use:

dmddiag.exe -v 

Now - go out to your servers and check the partition alignment - fixing this can give you up to 30-40% I/O performance boost!!

How do you fix it? Well, that's the downside - fixing it means reformatting the disk to have the correct partition offset or moving the data to a disk that already has the correct partition offset. Remember - although Windows Server 2008 creates disks with the correct offset, taking a disk that was created on Windows Server 2003 and attaching it to Windows Server 2008 will have no effect on the existing partition offset.

Blog posts in this series:

  • For the hardware setup I'm using, see this post.
  • For the baseline performance measurements for this benchmark, see this post.
  • For the increasing performance through log file IO optimization, see this post.

In the previous post in the series, I optimized the log block size to get better throughput on the transaction log, but it was very obvious that having the log file and the data file on the same RAID array is a bottleneck.

Over the last couple of weeks I've been running some tests with the log and data files on separate RAID arrays, and this post will explain what I've found.

The first thing I did was setup the systems for remote access so I can log in to them from anywhere in the world. This involved:

  • Changing the TCP/IP port that Remote Desktop listens to (KB 306759)
  • Allowing that port through Windows Firewall (having to perform this step wasn't obvious)
  • Turning on port forwarding on our Internet-facing router for that port to the right server

It's very cool being able to play with these servers from anywhere - and to show live servers-under-load during a class, as I could last week.

The previous two tests were performed using a single 2TB RAID-10 array comprised of 8 300GB 15k SCSI drives. The new array I added to the mix for this test is comprised of various numbers of 1TB 7.2k SATA drives. I tested:

  • 4 drives in a RAID-10 configuration, giving 2TB
  • 6 drives in a RAID-10 configuration, giving 2TB usable
  • 8 drives in a RAID-10 configuration, giving 2TB usable

The volume size limit that the Dell MD3000i arrays allow is 2TB, so the 2nd and 3rd array configurations described above made use of extra spindles but with a lot of wasted space. In this set of tests I limited myself to a single data file, but coming up I'll try multiple data files which will enable more efficient usage of the available raw disk capacity.

I used the same scripts as in the previous tests, with 128 concurrent connections each inserting 4.1KB rows in batches of 10 inserts per transaction into the same table, for a total of 1/128 TB in each connection. The data file is pre-sized to 1TB. The log file is pre-sized to 250MB, with a 50MB auto-growth increment. These are invariants from the previous test (and are also something obvious that can be changed for better performance, but that's not what this set of tests was about).

I tried the following configurations:

  • Log on SCSI RAID-10, data on 4-, 6-, 8-drive SATA RAID-10
  • Data on SCSI RAID-10, log on 4-, 6-, 8-drive SATA RAID-10

And very interesting results they are too! In the previous set of tests, the best performance I could get was 21167 seconds for test completion, with the data and log files sharing the SCSI RAID-10 array.

Moving the log to a different array

The left-hand graph above shows that moving the log file off managed to get the time down to 20842 seconds in the best case, but that's using 8 drives. The 6- and 8-drive times were basically the same - which shows that 4-drives didn't provide enough IO parallelism for the load SQL Server was pushing through the array, but moving to 6 drives provided basically enough so that 8 drives didn't lead to a big performance gain.

Overall, I'd say there was no real performance gain from moving the log file to a different array.

As far as log disk queue lengths go, when the log was on the 4-drive array, the average log write disk queue length was in the 20s. For the 6- and 8-drive cases, the queue length dropped to low single-digits. It didn't drop right down to around zero because the perfmon counter is measuring the queue length as far as Windows sees it - and the iSCSI traffic was being bottle-necked through a single NIC, as we'll see later.

For data disk queue lengths, when the log was on the 4-drive array, the average data write queue length was around 5, with spikes to 20+. For the 6- and 8-drive cases, the queue length increased to an average of 10-15, with spikes to 30+. Clearly the log drive bottle-neck in the 4-drive case was lowering the overall transaction throughput, which reduced the load on the data drives.

In terms of iSCSI performance, I had the MD3000is take a 2-hour snapshot of each array's performance with the varying size of the log array. The results are in the right-hand graph above. You can see that the throughput of the data array remains basically static, but the log array throughput increases with more drives thrown into the mix. Nothing stunning here.

In the 4-drive case, the transaction log grew to 5.5GB but in the 6- and 8-drive cases the log grew to over 8GB - which is what I'd expect given the higher transaction throughput.

Moving the data to a different array

Edit 02/28/10: When performing the tests for the next post in the series, I found the results from this test didn't make sense. I went back to re-run the single-NIC tests and found what I feared - the original tests didn't complete properly. The corrected results are below. 

The left-hand graph above shows that moving the data file off managed to get the time down to 25400 seconds in the best case, but that's using 8 drives. The data file's throughput requirements are clearly higher than the log file's (for this specific benchmark test, not as a general statement by *any means*). It's clear that having the data file on the slower SATA array wasn't going to lead to any better performance than having it on the SCSI array.

As far as log disk queue lengths go, they were between 1-3 in all cases, as the 8-drive SCSI array could clearly provide enough IO throughput to satisfy the log's need.

For data disk queue lengths, when the data was on the 4-drive array, the average data write queue length was around 30, with wild spikes. For the 6- and 8-drive cases, the queue length decreased to an average of 20, and then down to 10 with spikes to 30+. When there was no checkpoint or lazy writer activity, the write queue lengths dropped to zero, as I'd expect. 

In terms of iSCSI performance, I had the MD3000is take a 2-hour snapshot of each array's performance with the varying size of the data array. The results are in the right-hand graph above. You can see that the throughput of the log array remains pretty static, but the data array throughput increases *dramatically* with more drives thrown into the mix. Again, nothing stunning here.

In the 4-drive case, the transaction log grew to 19GB but in the 6- and 8-drive cases the log grew to around 8GB - again I'd expect this. It seems that for this workload, with these default checkpoint settings, and with 8GB of server memory, the log needs to be around 8GB when there's adequate IO throughput. This is something I'll be trying to address in future tests.

Perfmon captures

I took a few perfmon snapshots during the various tests to provide some details of the system performance. I'm not going to go into as much detail explaining the various counters and what they mean, I did that in the previous post, but I will point out interesting details. These aren't necessarily meant to be representative of what you'll see running similar tests - they're of interesting times during the traces that I thought would be fun to look at and understand.

1) Log file on the 6-drive SATA array

 

This graph is highlighting the Avg. Disk Write Queue Length for the K: data array (the black line). You can see that when the Checkpoint pages/sec (pink line) and Lazy writes/sec (light green line) both drop to zero, the write queue length drops to zero - as there's no other activity on that RAID array. When either or both of these start up again, the write queue length spikes wildly. Other things to note:

  • The Avg. Disk Write Queue Length for the I: log array (the green line at the bottom) is pretty static in the low single-digits.
  • The Log Bytes Flushed/sec (the red line) tracks the Disk Write Bytes/sec for the I: log array, except during the heavy checkpoint activity at the start of the trace - this is a period of log file auto-growth.
  • The Pages Allocated/sec (the top light blue line) is fairly static, but varies wildly during the heavy checkpoint activity. This is because the log auto-growth is effectively stalling transaction throughput.

2) Log file on the 8-drive SATA array

 

This is highlighting the Pages Allocated/sec (the black line), which remains basically static, and is a measure of the overall transaction throughput. At this point in the test, the log has auto-grown as far as it will and performance is stable. Other things to note:

  • The Log Bytes Flushed/sec also remains static, as there's a constant transaction workload with no interruptions for log growth, and directly correlates with the spikes and troughs in the Pages Allocated/sec counter.
  • The Avg. Disk Write Queue Length for the I: log array (the bottom green line) is static.

3) Data on the 4-drive SATA array

 

This represents the worst combination of array sizes and file placements, and is a pretty chaotic trace. It's highlighting the Avg. Disk Write Queue Length of the I: data array (the black line) and you can see that it varies wildy between 0 and 80(!!!), clearly showing that the data file performance is hampered by under-powered RAID configuration. All other perf counters around the data array vary wildly - clearly not a recipe for high performance.

Summary

There's clearly a performance gain to be had from separating the data and log portions of the database in this case. However, doing so has highlighted the fact that the simple networking configuration I have is now a bottleneck.

Here's a Task Manager trace showing the network utilization during one of the tests:

 

The troughs are the constant traffic going to the log array, and the sustained peaks are when there's checkpoint and/or lazy writer activity going to the data array as well.

In the next post, I'll tune the network configuration, and then I'll move on to trying multiple data files in multiple RAID arrays.

Hope you're still enjoying the series!

(For the hardware setup I'm using, see this post. For the baseline performance measurements for this benchmark, see this post.)

In my previous post in the series, I described the benchmark I'm optimizing - populating a 1-TB clustered index as fast as possible using default values. I proved to you that I had an IO bottleneck because the IOs to the data file (from checkpoints) and the transaction log file (from transactions committing) were causing contention.

Several people commented that I might have mis-configured the iSCSI IO subsystem - so first off I want to look at that. Fellow MVP Denny Cherry (twitter|blog), who knows a lot more than me about IO subsystems, volunteered to discuss my iSCSI setup with me to make sure I hadn't goofed anywhere (many thanks Denny!). It seems like I haven't. I'm using a single iSCSI array right now, with a single NIC on the server dedicated to the iSCSI traffic - once I move to multiple volumes, I'll add in more NICs.

Looking at Task Manager in the middle of a 6-hour test run to see the network utilization through the NIC shows that it's not saturated, as shown below.

 

I ran the DELL smcli utility for two hours during the most recent test to see what peak throughput I'm getting, using the following command:

smcli -n Middle_MD3000 -c "set session performanceMonitorInterval=5 performanceMonitorIterations=1440;save storageArray performanceStats file=\"c:\\MiddlePerfStats.csv\";"

I saw around 101MBytes/sec. and earlier when testing the smcli settings I'd seen 106MBytes/sec. I'm sure once I remove some of the contention that this will get better.

On to the test!

The first thing I wanted to try was optimizing my use of the transaction log - i.e. doing as few and as large IOs as possible to the log. My hypothesis is that by changing the batch size from one to, say, 10 or 100 SQL Server can make more efficient use of the log.

I changed my T-SQL script to take a batch size parameter and use explicit transactions inserting <batch-size> records. Everything else remained the same as the baseline. I picked 128 concurrent threads as my test point. In the baseline, the 128-thread insert test took 8 hours 8 minutes 27 seconds to complete (29307 seconds). My simple T-SQL code changed to (with $(rows) and $(batch) being SQLCMD parameters to the script):

DECLARE @counter BIGINT;
DECLARE @inner   SMALLINT;
DECLARE @start   DATETIME;
DECLARE @end     DATETIME;

SELECT @counter = 0;
SELECT @start = GETDATE ();

WHILE (@counter < $(rows))
BEGIN
   SELECT @inner = 0;
   BEGIN TRAN;
   WHILE (@inner < $(batch))
   BEGIN
      INSERT INTO MyBigTable DEFAULT VALUES;
      SELECT @inner = @inner + 1;
   END
   COMMIT TRAN;
   SELECT @counter = @counter + $(batch);
END;

SELECT @end = GETDATE ();

INSERT INTO msdb.dbo.Results VALUES (CONVERT (INTEGER, DATEDIFF (second, @start, @end)));
GO

Below are the results for 128 threads with batch sizes varying from 10 to 10000:

Increasing the batch size from 1 record to 10 records caused the elapsed time to drop from 29307 seconds to 21167 seconds - a 28% improvement! However, increasing the batch size further didn't improve things any more. Hmmm.

Next I tried different numbers of concurrent connections with a batch size of ten to see if the improvement was universal:

 

So moving from a single-record batch to a ten record batch has the same dramatic effect on performance for varying numbers of connections.

What's going on?

The answer involves some deep internals of the transaction log structure and IO behavior. The transaction log is split up internally into sections called virtual log files (VLFs for short). These VLFs are used to manage which parts of the log can be overwritten with new log records. If this is all unfamiliar, go read my TechNet Magazine article on Understanding Logging and Recovery in SQL Server which explains about the log in more detail and about transactions, otherwise what's coming next may not make much sense.

Whenever a transaction commits, all the transaction log records up to and including the LOP_COMMIT_XACT log record for that transaction (including log records from other transactions that are inter-mingled with those from the one that just committed) must be written out to disk, so the transaction is durable (called write-ahead logging). But it's not individual log records that are written to disk, the unit of IO for the transaction log is a log block. VLFs are split internally into dynamically sized log blocks, ranging from 512-bytes to 60KB in 512-byte increments, as needed. There are algorithms to figure out how big log blocks should be based on what log records are being flushed out.

With the single record batch, the amount of log generated in the transaction totals 4952 bytes. You can find this by doing something like:

CHECKPOINT;  -- to force the log in the SIMPLE recovery model to clear
GO

INSERT INTO MyBigTable DEFAULT VALUES;
GO

SELECT * FROM fn_dblog (NULL, NULL); -- show me all active log
GO

Then add up the values in the Log Record Length column for all records from the first LOP_BEGIN_XACT log record to the LOP_COMMIT_XACT with the matching Transaction ID.

Anyway, a single-record transaction generates 4952 bytes of transaction log, which will be flushed out in our case in a log block that's 5120 bytes (the nearest multiple of 512 above 4952), with a bunch of zero-padding at the end. For ten single-record transactions, that's 10 small IOs.

The ten-record transaction generates 48688 bytes of transaction log and will be flushed out in a log block that's 49152 bytes (the nearest multiple of 512 above 48688). This is clearly more efficient than 10 smaller IOs and is why changing to batches makes things go faster.

A 100-record transaction generates 489628 bytes of transaction log, which is clearly more than the 60KB log-block maximum, so it actually covers multiple log blocks (varying from 14-16 in my tests - I don't remember the exact algorithms). You can see the log block changing when the middle number in the Current LSN column of the fn_dblog output increases. It looks like it's jumping up, and it is - this is the offset of the log block within the current VLF divided by 512.

Because the 100-record transaction is split into multiple log blocks, there's no real IO gain during log flushes over the 10-record transaction - which is illustrated by the results I saw.

Now, even with this speed increase from the increased batch size, the average-disk-write-queue-length is still anywhere from 20-40 when running the 128-connection test, so there's still a significant bottleneck there. In fact, the transaction log grew significantly still during these tests (up to 23GB in one case) - for an explanation of this phenomenon, see Interesting case of watching log file growth during a perf test. I picked up some wait stats queries from fellow-MVP Glenn Berry (twitter|blog) to run while the tests were running. I'm using his as they're published and anyone can download and use them (see his blog post for the queries, and Bob Ward's CSS blog post that's starting to document the wait types).

His queries tell me that with the 10-record per batch and 128-connections:

  • Average write-stall to the data file is 37ms
  • Average write-stall to the log file is 18ms
  • Top wait types are PAGELATCH_EX (55% of all waits), PAGELATCH_SH (28% of all waits), and WRITELOG (14% of all waits)

The first two waits are the Storage Engine waiting for the buffer pool to kick out dirty pages so that buffers can be used to hold newly created pages, and the third is the log manager waiting for log block flushes to disk to complete.

Clearly IO IO IO.  Let's look at a a perfmon capture during the 10-records per batch test with 128 connections:

 

Let's go through each counter (top to bottom in the list in the image) and explain what's going on. I deliberately picked this time-slice, as it really simply shows the effect of contention:

  • Pages Allocated/sec: this is the light blue line and is the Access Methods part of the Storage Engine (the dev team I used to run) creating new data and index pages for the clustered index we're populating.
  • Checkpoint pages/sec: this is the pink line at the bottom left and bottom right of the capture. This is the buffer pool writing out dirty pages during a periodic checkpoint.
  • Avg. Disk sec/Write: this is the dark blue line that's tracking just above the thick black line. It's the amount of time in ms for a write to complete. You can see that it has a minimum around 12:51:00 and then varies wildly, hitting as high as 50+ms for a single write.
  • Avg. Disk Write Queue Length: this is the highlighted line in thick black. It also has a minimum around 12:51:00 and varies wildly the rest of the time.
  • Disk Write Bytes/sec: this is the dark green line at the top that shows the number of bytes being written to disk from all IO sources. Same story around 12:51:00.
  • Log Growths: A simple counter since the database was created/server started. It's over 100 and off the chart.
  • Log Bytes Flushed/sec: this is the red, fairly constant line around 1/3 the way up and is log blocks being flushed to disk because of transaction commits or checkpoints.
  • Lazy writes/sec: this is the light green line at the bottom and is the buffer pool having to force data pages to be written to disk (along with all transaction log flushed up to the point of the last log record to change the page being written) to make space for images of newly created pages.

This time-slice is really cool in that it shows what happens when contention goes away. Just before 12:51:00, a checkpoint ends and the lazywriter has nothing to do - so the only IOs hitting the disks are those coming from the transaction log flushing out log blocks as transactions commit. You can see the Avg. Disk Write Queue Length drop down to 2-3, the Avg. Disk sec/Write drop to about 5ms, and most beautiful of all (look, I'm a big geek ok? :-), the Disk Write Bytes/sec (the green line) drops down to be exactly equal to the Log Bytes Flushed/sec - proving that it's just log flushes hitting the disk. This is the no-contention case. It happens again for brief spell about 10 seconds later - the lazywriter most likely created a temporary surfeit of empty buffers. All the rest of the time, the lazywriter and checkpoints play havoc with the write throughput on the drives by causing contention.

It's clearly time to try some separation of files to relieve the contention - and that's what I'll cover in the next post in the series.

Hope you're enjoying the series - these take a long time to write up!

(For the hardware setup I'm using, see this post.)

As part of my new benchmarking series I first wanted to play around with different configurations of data files and backup files for a 1-TB database to see what kind of performance gains I can get invoking the parallelism possible when backing up and restoring the database. To do that I need a way to quickly populate a 1-TB database so that I can mess around with different numbers of files and so on. It doesn't matter what the data in the database is, as backup doesn't care - as long as there's a terabyte of it. Why a terabyte? It's a nice round number, it's a common enough database size right now, and I have the storage to play around with it.

But then my plans got derailed. In figuring out how to most quickly populate a 1-TB database, I realized that in itself would be an interesting benchmark to explore, so that's what I'm doing first.

My aim is to give you improvements you can use in real life. If you think this is going to be boring, skip down to the end of the post where I show a detailed perfmon and explain what's going on in my overloaded IO subsystem, then come back up here :-)

The baseline for this benchmark is contrived - I'm going to populate a 1-TB clustered index (so I can do other experiments with the index) as quickly as I can. The interesting part is that I'm starting with a brain-dead database layout, configuration, and method of populating the table, and then I'm going to alter different things to see the effect on the system. The effects and gains will be the interesting part as it will expose parts of how SQL Server works which *WILL* be applicable to real-life situations and workloads - the whole point of me doing all of this is to show you improvements, why they work, and how they could be useful to you.

When doing any kind of performance testing it's *essential* to have a baseline with which to compare - otherwise how do you know what effect a variation is having? This post describes my baseline setup, measurements, and limitations I start to notice.

First let me describe the baseline setup:

  • Single data file and log file stored on the same volume, on an 8-drive RAID-10 array (each drive is a 300GB 15k SCIS drive), connected through iSCSI to the server
  • Data file is created to be 1-TB, with instant file initialization enabled
  • Log file is created to be 256MB
  • Database is using the SIMPLE recovery model

Yes, I'm deliberately putting the data and log on the same RAID array. I *want* to see some contention so I can prove to you how separation of data and log can reduce contention and improve performance sometimes.

Database and table creation script:

CREATE DATABASE BackupPerfTest ON
(NAME = 'BackupPerfTest_Data',
   FILENAME = 'K:\BackupPerfTest.mdf',
   SIZE = 1TB,
   FILEGROWTH = 1GB)
LOG ON
(NAME = 'BackupPerfTest_Log',
   FILENAME = 'K:\BackupPerfTest.ldf',
   SIZE = 256MB,
   FILEGROWTH = 50MB);
GO

ALTER DATABASE BackupPerfTest SET RECOVERY SIMPLE;
GO

USE BackupPerfTest;
GO

CREATE TABLE MyBigTable (c1 BIGINT IDENTITY, c2 CHAR (4100) DEFAULT 'a');
GO
CREATE CLUSTERED INDEX MyBigTable_cl ON MyBigTable (c1);
GO

I figured the fastest way to fill the database is to have a single table with one row per page, and that having SQL Server create the large CHAR column inside itself, rather than me doing a REPLICATE, would be quickest.

For the actual population of the table, I worked out that 134217728 table rows gives me a terabyte, with each row taking up a single 8KB page.

The baseline setup of the table population is:

  • Varying numbers of concurrent connections (16, 32, 64, 128, 256) to the server (8-way DELL PowerEdge 1950)
  • Each connection runs a simple script that inserts 134217728 / number-of-connections rows into the table, and then logs the elapsed time into a results table
  • Each insert is done as a single-statement implicit transaction (if I don't do an explicit BEGIN TRAN/COMMIT TRAN, SQL Server does it for me)
  • A monitor connection pings the results table every 5 minutes checking to see if number-of-connections results are there, and sending me email if so

Yes, I'm deliberately using this method to insert each row. Again, I want to be able to make improvements and see the effect of the changes.

Each connection will be running this script:

SET NOCOUNT ON;
GO

DECLARE @counter BIGINT;
DECLARE @start   DATETIME;
DECLARE @end     DATETIME;

SELECT @counter = 0;
SELECT @start = GETDATE ();

WHILE (@counter < $(rows))
BEGIN
   INSERT INTO MyBigTable DEFAULT VALUES;
   SELECT @counter = @counter + 1;
END;

SELECT @end = GETDATE ();

INSERT INTO msdb.dbo.Results VALUES (CONVERT (INTEGER, DATEDIFF (second, @start, @end)));
GO

This is run through SQLCMD, and the number of rows to insert is passed into the T-SQL script using:

sqlcmd -S(local)\SQLDev01 -dBackupPerfTest -i"C:\SQLskills\Populate1TBTest\PopulateTable.sql" -v rows=%1

%1 in the line above is passed from a master CMD that kicks off number-of-connections CMD windows, each of which just runs the SQLCMD line above.

So - a simple setup.

Here's a graph of the results:

 

The elapsed times for all connections to complete their work (as there could be up to an hour between the first and last to complete) were:

  • 16-way: 15 hours 25 minutes 5 seconds
  • 32-way: 13 hours 50 minutes 18 seconds (11% faster)
  • 64-way: 10 hours 12 minutes 48 seconds (27% faster)
  • 128-way: 8 hours 8 minutes 27 seconds (20% faster)
  • 256-way: 7 hours 24 minutes 21 seconds (9% faster)

More connections clearly leads to a faster runtime, but the improvements from doubling the number of threads clearly aren't directly proportional to the number of threads. The biggest improvement was from 32 to 64 threads, and then the percentage gain started to tail off. Let's look at the page allocations per second for each experiment too:

 

As I'd expect, the pace at which pages are being allocated in the database increases with more threads and the percentage improvements line pretty much matches that of the elapsed time graph above. There's a slight difference in the 128 and 256 gains here because the graph is show what the perf counter number was after the experiment reached a steady state. I noticed that the log grew enormously for the last few tests, which caused the steady-state number to not be reached for a while. I've already blogged about that phenomenon in Interesting case of watching log file growth during a perf test.

I also watched what was happening in perfmon to see if there were any obvious performance problems going on. Here's a perfmon graph for the 64-way test once it reached steady-state and the log wasn't growing:

 

Analysis: 

  • The black line is above 0 when a checkpoint is occuring.
  • The green line represents how much data is being written to the K: volume, where the data and log file are. You can see there's a constant amount of data being written all the time (transaction log records when the implicit transactions commit) with large spikes whenever a checkpoint occurs and causes the actual data pages to be written out.
  • The light blue line is the pages allocated per second. You can see that it takes a major nose dive whenever a checkpoint occurs. I'm speculating that this is because of disk contention preventing the transaction log being written to disk (thus slowing down the transaction throughput) while the checkpoint is occuring and writing out data pages
  • The dark blue line at the bottom is the average disk seconds per write. You can see that it hovers around 4-5 milliseconds and spikes to 16-17 when a checkpoint occurs.
  • The brown line in the middle is the average disk write queue length. It hovers around 18-19 and spikes to around 25 when a checkpoint occurs.

Observations:

  • This system is clearly bottlenecked in the I/O subsystem
  • There is a direct correlation between checkpoints occuring and: increased disk queue length, reduced transaction throughput

This is what I would expect to see on a system with lots of updates where the log and data are on the same volume. Remember that I've got them on a fast RAID-10 array. This debunks the theory I've often heard that contention doesn't matter on RAID arrays. Yes, it does. You can still max out the throughput capabalities of any IO subsystem - it just depends what you're trying to do with it. Imagine if I had nonclustered indexes on this table too - more logging and more pages being allocated - much worse performance...

So now I've got my baseline and there are some obvious things I can try to improve things:

  • Data and log file separation
  • Varying number of data files
  • Varying placement of data files (e.g. different RAID arrays)
  • Manual checkpoints
  • Pre-sizing the log
  • Using explicit transactions with varying batch insert sizes
  • Using -E to get larger allocations

I'm not saying that all of these things are going to help, but over the next few weeks I'm going to try them all and report on the results. You'll be able to clearly see the effect of changing these on my benchmark, running on production-quality hardware, rather than just taking people's words for it.

I hope you're going to find these benchmarks and experiments useful - I'll be learning (hopefully) as I go along too.

Let me know if there's anything else you'd like to see me try, and if you're following along (I don't want to spend all this time if no-one's reading the series!)

Thanks!

I'm running some performance tests on the hardware we have (more details on the first of these tomorrow) and I was surprised to see some explosive transaction log growth while running in the SIMPLE recovery model with single row insert transactions!

Without spoiling tomorrow's thunder too much, I've got a setup with varying numbers of connections populating a 1TB table with default values, with the single data and log files on the same 8-drive RAID-10 array (again more details on why tomorrow). I was looking at perfmon while the test was running with 128 concurrent connections, and noticed some weird-looking behavior - basically what looked like a never-ending checkpoint occuring. So I dug in deeper and discovered the transaction log had grown over 7GB since the start of the test. Huh?

I restarted the test from scratch, with a 1TB data file and a 256MB log and watched what happened in perfmon; here's the screenshot of the first 90 seconds:

 

This is *so* interesting. The black line is the number of log growths, so you can see the log grows every time the line level goes up. The bright blue line is the number of pages being allocated per second to hold all the table rows my 128 connections are inserting. You can clearly see that every time there's a log growth, the allocations take a nose-dive - because no transactions can commit while the new portion of the log is being zeroed (remember that instant file initialization does not apply to the log - see this blog post). The green line (disk write bytes per second) spikes when the log grows because of the zeroes being written out to the log file by SQL Server (remember that SQL Server does the zeroing).

But why is the log growing in the first place? I'm in the SIMPLE recovery model and doing single-row implicit transactions (yes, I'm deliberately doing this) so there shouldn't be anything stopping the log from clearing during a checkpoint, right?

Wrong. The log starts out small (256MB) so one of the thresholds for triggering a checkpoint gets hit pretty fast (70% of the log file is used). So a checkpoint occurs (you can see the checkpoints occuring when the pink line at the bottom of the perfmon screen is above zero), and starts writing out to the data file, which is on the same disk as the log (see How do checkpoints work and what gets logged for an explanation of checkpoint mechanics), but it can't write fast enough (because of disk contention) to get to the log-clearing part before the log fills up completely and has to grow (because transactions are continuing at break-neck speed). So the log grows, and the insert transactions stop while the log is zeroed. And then the log starts to fill up again very quickly and another checkpoint is triggered, and so on and so on.

Eventually a steady state is reached where there's enough free log space during a checkpoint that no new log is required for the concurrent transactions to commit. You might ask why the checkpoint is so slow in the first place? Because I deliberately put the data file on the same RAID array as the log file, and both are being steadily hammered with writes - classic disk contention. Even though the RAID array is RAID-10 with 8x300GB 15k SCSI drives, the average disk queue write length is over 20 most of the time during the 128-way test because I'm simply trying to do too much.

The point of my blog post? Just because you don't have any of the classic causes of transaction log growth going on, doesn't mean you're going to be immune. In this case my (deliberate) poor physical layout of the database files and workload growth up to 128 concurrent connections caused the log to grow. What started out working when I was running 16 connections didn't work any more at 128 (actually I went back and re-ran some of the earlier tests and even with only 64 connections, the log grew to over 1GB before reaching steady-state).

Interesting eh?

Tomorrow I'll be continuing the perf/benchmarking series by creating my first benchmark and then tweaking the setup to see how I can improve performance (for instance with multiple data files, separation of log and data files - all the things I preach but have never *demonstrated*), but this behavior merited a post all on its own.

I've blogged about ghost records and the ghost cleanup task a couple of time before (the only place it is really explained AFAIK), but one of my fellow MVPs was asking me some questions about it today for a customer of theirs and couldn't find the trace flag to turn it off.

My previous blog posts about this are:

These explain what ghost records are and how the ghost cleanup process works.

On big systems it's possible for the ghost cleanup process to fall behind the rest of the system, with no hope of it catching up. It's a single-threaded task - so imagine a 16-way box with lots of deletes occuring, and a single CPU spending a few seconds every 5 seconds trying to remove all the ghosted records resulting from the deletes of all the other CPUs. It's pretty obvious that the ghost cleanup process is going to lag behind.

The problem with this scenario is that the ghost cleanup process will still pop up every 5 seconds (every 10 on 2008) and start removing ghost records, potentially causing performance issues by keeping pages in the buffer pool, generating log records, and causing physical IOs. The ghost cleanup task is also one of the background processes that can cause IOs to occur on what looks like a totally quiescent system.

There is a way to turn off the ghost cleanup task, using trace flag 661, as documented in KB 920093. Be careful though!!! If you disable the ghost cleanup task, the space taken up by deleted records will *NOT* be released for reuse by SQL Server until you do something else to remove it, like rebuilding an index.

One method people sometimes consider is to force ghost cleanup to clean everything by performing a table or index scan (thus queuing all the deleted records up for the ghost cleanup task). Although this is an alternative, it still uses the ghost cleanup task to do the work, and on a very busy system with a very large number of deletes (warning: generalization! :-) it can be much more efficient to remove the deleted-but-not-yet-reclaimed records using index reorganize or index rebuild.

Turning on this trace flag can be useful as a performance gain on systems with very heavy delete workloads, but only as long as you're careful about it. It's not something that's generally recommended but it may be useful to you.

Enjoy!

This is an interesting performance bug concerning a broken query optimizer rule in 2008. Thanks to Dan Shargel (Twitter) for sending me info on this and letting me use some of the stuff he sent.

The scenario involves using MIN or MAX in a query against a partitioned view. In 2005 the query plan includes a TOP (1) operator which uses the right index, but in 2008 the optimizer rule was broken and the plan turned into a stream aggregate, much more expensive in this case.

Here's the 2005 query plan:

 

and here's the 2008 query plan (before the bug fix):

 

You can get the fix in CU4 for 2008 SP1 (or later) and read a bit more about it in KB 973255.

Note that you have to turn on trace flag 4199 to enable the fix - that requirement will be removed in SQL11.

It's been a few weeks since my last posts but I've got a bunch in the pipeline coming up.

Firstly, I've got it together to start using the hardware we got a while back. I'm going to be doing some benchmarking, perf testing and playing with various HA technologies, and of course blogging a bunch about what I discover. The hardware was a lot of fun to put together!

This is kind of a reference post, detailing the hardware setup, so I can link to it in all the posts where I use our hardware.

Photos: front and back (no Facebook login required).

Here's what I'll be using, all mounted in a 42U rack:

  • 2 x Dell PowerEdge 1950 servers, each with
    • 2 x quad-core Xeon E5405 Processor2x6MB Cache, 2.0GHz, 1333MHz FSB
    • 8GB 667MHz RAM
    • 2 x 146GB 15K RPM SCSI drive (for OS etc)
    • 4 3GBps NICs
    • Windows Server 2008 Enterprise SP1 64-bit
    • Multiple SQL Server 2008 Enterprise instances
  • (Upper array) Dell PowerVault MD3000i iSCSI array with 2 dual-port controllers
    • 15 x 1TB 7.2K RPM SATA drives
  • (Middle array) Dell PowerVault MD3000i iSCSI array with 2 dual-port controllers
    • 8 x 300GB 15K RPM SCSI drives
  • (Bottom array) Dell PowerVault MD3000i iSCSI array with 2 dual-port controllers
    • 8 x 1TB 7.2K RPM SATA drives
  • Connected with 1GbE through two PowerConnect 5424 24-port iSCSI optimized switches 

The drives will be reconfigured and I'll specify what I'm using in each post.

The SQL Server instances are vanilla installs, with no special parameters, sp_configure options, or trace flags. If I use any I'll detail them in each post.

Time to get using it!

As part of the 2008 DBA class we're teaching down here in Melbourne, I did a demo of using predicates and file targets with extended events, so I want to blog the script for people to play with.

For background info on extended events see:

In this scenario, I'd like to track queries that make heavy usage of the CPU on my system.

First off I'll create a test database to play with.

USE MASTER;
GO
IF DATABASEPROPERTYEX ('production', 'Version') > 0 DROP DATABASE production;
GO

CREATE DATABASE production;
GO
USE production;
GO

CREATE TABLE t1 (c1 INT IDENTITY, c2 UNIQUEIDENTIFIER ROWGUIDCOL DEFAULT NEWID(), c3 CHAR (5000) DEFAULT 'a');
CREATE CLUSTERED INDEX t1_CL ON t1 (c1);
CREATE NONCLUSTERED INDEX t1_NCL ON t1 (c2);
GO

SET NOCOUNT ON;
INSERT INTO t1 DEFAULT VALUES;
GO 1000

-- Get the database ID to plug into the event session
SELECT DB_ID ('production');
GO

Notice that I grabbed the database ID of the new database I created, as I'll need that when defining the extended event predicate.

IF EXISTS (SELECT * FROM sys.server_event_sessions WHERE name = 'EE_ExpensiveQueries')
   DROP EVENT SESSION EE_ExpensiveQueries ON SERVER;
GO

CREATE EVENT SESSION EE_ExpensiveQueries ON SERVER
ADD EVENT sqlserver.sql_statement_completed
   (ACTION (sqlserver.sql_text, sqlserver.plan_handle)
      WHERE sqlserver.database_id = 18 /*DBID*/  AND cpu > 10 /*total ms of CPU time*/)
ADD TARGET package0.asynchronous_file_target
   (SET FILENAME = N'C:\SQLskills\EE_ExpensiveQueries.xel', METADATAFILE = N'C:\SQLskills\EE_ExpensiveQueries.xem')
WITH (max_dispatch_latency = 1 seconds);
GO

I'm monitoring the event that fires whenever a statement completes, and I'm filtering by database ID (make sure you plug in the correct database ID when you try this yourself) and by the number of milliseconds of CPU time the statement used. Note: when using predicates you should always make sure that you order the predicate tests such that the most restrictive predicates (those most likely to evaluate to false) are first in the list, so the predicate evaluation 'short-circuits' as quickly as possible (a standard programming practice with boolean logic). I'm also using a file target, just to show how it can be done, instead of the ring buffer which is commonly used for extended events demos.

Now I'm going to turn on the event session and do some queries in the production database.

ALTER EVENT SESSION EE_ExpensiveQueries ON SERVER STATE = START;
GO

USE production;
GO

SELECT COUNT (*) FROM t1 WHERE c1 > 500;
GO

SELECT SUM (c1) FROM t1 WHERE c3 LIKE 'a';
GO

ALTER INDEX t1_CL ON t1 REORGANIZE;
GO

ALTER INDEX t1_CL ON t1 REBUILD;
GO

That should have generated some long-running queries. Now I'll switch in to the context of master (so the querying of the event session itself doesn't get captured by the event session) and see what I've captured.

USE master;
GO

SELECT COUNT (*) FROM sys.fn_xe_file_target_read_file
   ('C:\SQLskills\EE_ExpensiveQueries*.xel', 'C:\SQLskills\EE_ExpensiveQueries*.xem', NULL, NULL);
GO

In this case I've got 3 entries. I can pull these out using the following code:

SELECT data FROM
   (SELECT CONVERT (XML, event_data) AS data FROM sys.fn_xe_file_target_read_file
      ('C:\SQLskills\EE_ExpensiveQueries*.xel', 'C:\SQLskills\EE_ExpensiveQueries*.xem', NULL, NULL)
 ) entries;
GO

But I get three XML blobs back, like so:

 

What's more useful is to pull everything out of the XML blob programmatically using the code below:

SELECT
   data.value (
      '(/event[@name=''sql_statement_completed'']/@timestamp)[1]', 'DATETIME') AS [Time],
   data.value (
      '(/event/data[@name=''cpu'']/value)[1]', 'INT') AS [CPU (ms)],
      CONVERT (FLOAT, data.value ('(/event/data[@name=''duration'']/value)[1]', 'BIGINT')) / 1000000
      AS [Duration (s)],
   data.value (
      '(/event/action[@name=''sql_text'']/value)[1]', 'VARCHAR(MAX)') AS [SQL Statement],
      SUBSTRING (data.value ('(/event/action[@name=''plan_handle'']/value)[1]', 'VARCHAR(100)'), 15, 50)
      AS [Plan Handle]
FROM
   (SELECT CONVERT (XML, event_data) AS data FROM sys.fn_xe_file_target_read_file
      ('C:\SQLskills\EE_ExpensiveQueries*.xel', 'C:\SQLskills\EE_ExpensiveQueries*.xem', null, null)
) entries
ORDER BY [Time] DESC;
GO

Time                    CPU (ms) Duration (s) SQL Statement                              Plan Handle
----------------------- -------- ------------ ------------------------------------------ --------------------------------------------------
2009-10-16 17:59:29.623 30       1.214875     ALTER INDEX t1_CL ON t1 REBUILD;           0x06001000EB672A07B8C0C807000000000000000000000000
2009-10-16 17:59:28.407 20       0.024076     ALTER INDEX t1_CL ON t1 REORGANIZE;        0x0600100003594903B8C0C807000000000000000000000000
2009-10-16 17:59:28.343 51       0.045144     SELECT SUM (c1) FROM t1 WHERE c3 LIKE 'a'; 0x06001000FAF5B11EB820C307000000000000000000000000

Now I can plug in one of the plan handles to a query of sys.dm_exec_query_plan to get the graphical query plan:

SELECT [query_plan] FROM sys.dm_exec_query_plan (0x06001000FAF5B11EB820C307000000000000000000000000);
GO

And clicking on the resulting XML 'link' gives the query plan:

 

 

And now I can tweak the production workload to potentially behave better.

Just for kicks I went back into the context of the production database and ran the XML parsing again to capture the query plan - try it yourself - pretty gnarly! :-)

Now, if I was going to make this more useful, I'd use a ring buffer target, with a polling mechanism every few seconds to make sure that I can capture the graphical query plan for expensive queries before the plan is pushed out of cache - maybe I'll get around to doing that and publish the complete solution.

In the meantime, here's a zip file containing the entire script that you can download and play with: EE_ExpensiveQueries.zip (1.72 kb)

Enjoy!

This is a question that came up yesterday on Twitter: will SQL Server always do a table scan to find the result of SELECT COUNT (*) FROM mytable?

The answer is no. The query processor will use the index with the smallest number of pages - i.e. with the least I/O cost.

Let me quickly show you. First I'll create a simple table with no indexes.

CREATE TABLE CTest (c1 INT IDENTITY, c2 BIGINT DEFAULT 1, c3 CHAR (1000) DEFAULT 'a');
GO
SET NOCOUNT ON;
GO
INSERT INTO CTest DEFAULT VALUES;
GO 10000

Don't forget to unclick the Include Actual Query Plan button

 

before running the 10000 inserts otherwise it will take forever (generating 10000 graphical plans) and SSMS will barf with:

The query has exceeded the maximum number of result sets that can be displayed in the Execution Plan pane. Only the first 250 result sets are displayed in the Execution Plan pane.

Now if we do the SELECT COUNT (*), the plan is:

 

The query processor has no choice but to do a table scan.

Now I'll add a nonclustered index that will have less pages that the table itself:

CREATE NONCLUSTERED INDEX CTest_1 ON CTest (c2);
GO

And the select plan is now:

 

Notice that the Table Scan operator has changed to an Index Scan operator on the new CTest_1 index. This is because the nonclustered index has fewer pages than the table and so the I/O cost is lower.

Now I'll create an index that's even smaller, over the integer column:

CREATE NONCLUSTERED INDEX CTest_2 ON CTest (c1);
GO

And the plan should change to use the smallest index again:

 

And it does, as I expected.

Let's look at the relative page counts for each index and the table:

SELECT [index_id], [page_count]
FROM sys.dm_db_index_physical_stats (DB_ID (), OBJECT_ID ('CTest'), NULL, NULL, 'LIMITED');
GO

index_id    page_count
----------- --------------------
0           1436
2           28
3           19

Any time the query processor is working out what plan to use, one of the key factors in plan choice is the I/O cost.

Hope this helps!

This survey is based on a suggestion from Jonathan Kehayias (blog/Twitter).

In this survey I'm asking two questions:

  1. When you're designing/implementing a database, do you manually create statistics or let SQL Server create them automatically?
  2. How do you maintain statistics? Manually or letting SQL Server do it for you? Or both? 

I'll probably collaborate with Kimberly on the editorial for this, as she's forgotten more about statistics than I'll ever know!

*Please* no comments on this post - wait for the survey results post to avoid skewing the answers (comments are unfortunately moderated anyway, now that my blog gets a lot of spam). I'm very interested in your reasoning, but not until everyone else responds.

   

As always, a big Thanks! for contributing to the blog/community by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

In last week's survey I asked what you think is the most important thing when it comes to performance tuning, if you *had* to choose one - see here for the survey. Here are the results as of 6/7/09.

Now of course you're all calling 'foul' because I didn't put an 'it depends' option on there which you all would have chosen, but that would have been boring, and I wanted you to pick what you really think is the #1 thing in the majority of cases. Of course, in general the answer is always 'it depends', but some things turn out to be the #1 problem more often than others. For the record, mine and Kimberly's answers were overall indexing strategy, but I deliberately broke them out into clustered and nonclustered to see what people would pick. I'm not surprised that application issues and nonclustered indexing strategy came out top, but good to see that all answers were represented and there's a variety of opinion out there. Variety is the spice of life as they say, or is it garlic? Depends who's cooking I suppose - but I digress.

Let's walk through each option and see how badly they can affect performance if something's wrong. This is by no means an exhaustive list of what could cause problems, just me rambling on at the end of a Sunday, as usual. Blog posts are so much easier than articles because you can have run-on sentences, short little fragments and all sorts of other verboten grammatical beasts.

IO subsystem design/tuning, including RAID: Quite a few things here could be wrong. Is there a write-intensive workload (either data files or log files) sitting on a RAID-5 array (with RAID-5 there's a performance penalty on writes)? Is the disk partitioning alignment, RAID stripe size, or NTFS allocation unit size incorrectly set? (See Are your disk partition offsets, RAID stripe sizes, and NTFS allocation units set correctly? for details.) Is the I/O subsystem just not up to the load being placed on it - high disk queue lengths, page IO latch wait errors, long-running IO errors? Is tempdb the bottle-neck because there a multiple procs with multiple users all trying to create/drop lots of temp tables? (See Misconceptions around TF 1118 for details.) Are there multiple databases using the same I/O subsystem such that disk-head contention is causing performance issues? Is there a network bottleneck with iSCSI storage? Are there filter drivers getting in the way and slowing things down?

Server hardware (CPUs/memory/other): Ok - I'm going to sound like a stuck-record saying 'quite a few things here could be wrong'. Let's just take it as read that each category has lots that could go wrong. A heavy load with a single CPU isn't going to perform well. However, sometimes excessive parallelization can lead to the dreaded CX Packet waits - multiple threads in the query processor waiting for information from each other because the server is overloaded - or blocking issues as locks start getting in the way - but that's more an application problem - oops, wrong section. Having more CPUs is usually a good thing as parallel plans usually run faster than single-threaded plans. Having too little memory can severely limit the buffer pool size and force buffer pool thrashing (low page life expectancies), and large operations to spool data to tempdb, placing further load on resources. Not sure what I meant now by 'other' - maybe having a wireless keyboard could slow down commands getting to SQL Server? Ok - I remember - CPU architecture. 64-bit vs. 32-bit. With 64-bit there's no real limit on the amount of memory that can be addressed, and no mucking around with AWE. There are some tricky issues with NUMA to make sure you don't get cross-node memory accesses (just like CPU cache invalidation could hit you on SMP architectures) but that's getting a little advanced and geeky. Btw - watch out for this when you're using Extended Events in SQL 2008 on multi-proc boxes - there's a setting where you can tell it what kind of CPU architecture you have.

Virtualized vs. 'real' server: Now, I'm the first to admit that I'm not an expert in virtualization, but now I have 3 8-ways and lots of disks to play with at home, I want to be. That's why I picked up a Hyper-V book today after reading Tom's virtual bookshelf post here. What I do know is what I've heard from talking to people at conferences and during classes - the main thing here is I/O virtualization. If there's software virtualization of I/Os, then performance is going to suck under load and you better not do it in production. Microsoft's Hyper-V gets around this, as do things like VMware ESX, by allowing you to assign actual real physical resources to virtual machines. And that's as much as I know about it until the book arrives and I get back from Houston to read it. And no doubt I'll be posting more about it.

Database physical layout (files/filegroups): Tempdb is the obvious case, and I've covered that above. User databases are a bit trickier, and it really depends on the underlying I/O subsystem. A single monolithic data file on a single physical disk isn't going to perform well under heavy concurrent load. But it may do fine on a RAID-10 array sliced-and-diced in a clever way by an expensive SAN. In today's world, more and more databases are on SANs so user database performance is taken care of in that respect, and all you need to worry about is tempdb. I touched on this a little bit in the editorial of a previous survey Physical database layout vs. database size.

Table design: Both ends of the spectrum work well here - from the plastic Costco tables with the fold-down metal legs to antique French-oak farmhouse tables. Both will give many years of service with no degradation of performance. Just like regular tables, SQL Server tables come in all shapes and sizes and you can easily pick the wrong one for what you need, leading to poor performance. This is a massive can of worms - in fact Kimberly spends 2 brain-busting days on this in one class she teaches and she could easily spend a week of 10-hour days on it (oh yes, it's called the SQL Masters program :-) From picking the right column types (see Michelle's post at Performance Considerations of Data Types) to deciding how to store you LOB values (see my post at Importance of choosing the right LOB storage technique) to deciding on a good primary key. Just don't pick a random GUID, or if you have one, replace it. Clustered index keys should be unique, narrow, static and ever increasing. Table design encompasses so many things I can't do it justice in a late evening paragraph. But you get the idea.

Heaps vs. clustered indexes: Kimberly likes to wrap this up with nonclustered indexes too, but I wanted to break them out so I could reference a whitepaper that strongly suggests you should use clustered indexes: Comparing Tables Organized with Clustered Indexes versus Heaps. Read it yourself and draw your own conclusions. Basically, with the possibility of forwarding records in heaps, and the majority of a clustered index's upper levels being in memory, the extra random physical I/Os to access a record in a heap outweigh the aggregated cost of the in-memory binary-searches at each level of clustered index tree navigation. I'm sure some people will argue about this - bring it on. There are very special cases where heaps are better, but not in general.

Nonclustered index strategy: Biggie #1. Where to start here and what to say? For the majority of scenarios, if you don't have a good nonclustered indexing strategy you're not going to have good performance. There's a simple reason for this - without the right nonclustered indexes to allow the query optimizer to choose them, you're going to get table scans. Of your 124 billion row table. How fast is that going to be? Again, Kimberly has this class that she teaches on this where she goes into details on how to index for ORs, for ANDs, for JOINs, and all sorts of other kinky stuff. I need to sit and listen to her one day so I can appear knowledgeable about this. I just know that you need to have indexes with the right selectivity. Ok - I just called Kimberly downstairs in the office to make sure I'm right before I blog something and look like an idiot and she starts with 'erm, well it depends'. I give up. I do *corruption* and *HA* and *maintenance*, not performance tuning. It's bloody hard. Luckily I'm married to one of the best people on the planet for SQL index tuning - wow, what hope do you have? (Of index tuning, not of marrying Kimberly :-) Seriously, index tuning isn't that bad - I just had a moment of weakness. You need to make sure you have the right indexes and no completely unused indexes otherwise you're wasting resources maintaining them. You could do worse than listen to her on RunAs Radio Interview Posted - "Kim Tripp Indexes Everything".

Statistics: If the statistics are out-of-date, the optimizer will not be able to pick a good plan and your performance will suck. It's that simple. Turn on AUTO_UPDATE_STATISTICS and make statistics updating is part of your regular maintenance. Don't forget to update non-index statistics too.

Application (design and code): Biggie #2. Sometimes no amount of cleverness can wring good performance from the twisted logic of a deranged application programmer. An application that is written with absolutely no concern for how SQL Server works is likely to not perform well. An application that is written with too much knowledge of how SQL Server works is likely to fall foul of relying on 'accidental' behaviors, or behaviors that are limited to certain data volumes or workloads.  An application that is going to make heavy use of SQL server has to take into account how SQL Server is going to behave under a variety of workloads, on a variety of servers, and on a variety of I/O subsystems. The common application test framework? A ten-row customer table with a single connection. 'Excellent - my 16-table join with CLR-based aggregations runs in less than 2 seconds'. For 10 rows. Six months later: 'Now we've got 3 million customers, why does performance suck so bad?' Go figure.

Database maintenance: Ok - starting to get tired now and I still need to do this week's survey. Maybe I could just say 'do maintenance' and be done with it? No? Ok - how about go checkout the article I wrote last August for TechNet Magazine: Effective Database Maintenance article. If you don't maintain your indexes, they'll likely get fragmented and affect performance in one way or another and you'll need to play with FILLFACTOR (see Kimberly's Database Maintenance Best Practices Part II – Setting FILLFACTOR). If you don't take care of transaction log VLF fragmentation, it will affect log-based operations (Kimberly again at Transaction Log VLFs - too many or too few?). Seriously - she says I blog so much - how many links are in this post to her stuff? She blogs a lot too!. If you have corruption, it could manifest itself as long-running I/Os. If I don't put my food on to cook, then I'll be eating way too late again. That was a 'Paul maintenance' one that slipped in - and can have disastrous effects on blogging performance. Seriously, you can't just put a database into production and walk away. It will slowly degrade over time. Like red jello melting on a hot summer's day, but probably not as fast, unless you use random GUID primary keys, or it's really hot. What?!?

Performance tuning is an art and a science. But there's a huge amount of science behind it before you have to get into the art side of things. I've just scratched the surface here in a blog post that took me more than an hour to write, banging away non-stop as I do. And I don't do the art side of things. I leave that to Kimberly - she just got one of our client's batch jobs from a 72-hour run-time down to a 6.5 hour run-time. They were pretty happy. We can't all be the Goddess of Performance Tuning Smile

Next post - this week's survey!

Categories:
Performance Tuning | Surveys

This week's survey is going to provide some fun debating topics. I'd like to know what the number 1 thing is that YOU go after when performance tuning - what do you think is most important? (Assuming you're not just randomly walking up to a server to do some perf tuning - there's an actual problem).

[Edit: And, as I say below - please no comments about what you picked until the results post next week - it skews the results. I've already had to delete 6 comments and ask for them to be held until next week.]

There is deliberately no 'it depends' answer, no 'combination' answer, and no way to suggest something else. If what you think is most important isn't on the list, don't vote, but I'd like to hear your opinion after the editorial next week.

As always, a big Thanks! for contributing to the blog by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

PS And again, as always, no comments on this post please (I'll delete them straight away) - please wait for the survey results post to avoid skewing the answers. I'm very interested in your reasoning, but not until everyone else responds.

Categories:
Performance Tuning | Surveys

I've just read a very good, very deep, and very interesting blog post by James Rowland-Jones. In the post, James investigates some locking issues using a variety of means and explains about the undocumented %%lockres%% function with you can use to figure out what the wait resource will be for individual table rows (basically the lock-hash value). He also shows something I've known about but never seen before - how the lock hash algorithm isn't perfect and can actually cause lock collisions where you wouldn't expect them - and how to mitigate the problem.

Excellent post and well worth reading. Check it out at The Curious Case of the Dubious Deadlock and the Not So Logical Lock.

I'm teaching a class this week on database maintenance, for DBAs inside Microsoft. One of the things we're discussing today is index fragmentation and how poor cluster key choice can lead to page splits, poor performance, index fragmentation, and so on - not just in the clustered index, but also in nonclustered indexes.

One of the students looked in a database underpinning an application and found a unique cluster key, which is the worst I've ever seen (although not the worst that Kimberly's ever seen apparently - the mind boggles!).

The cluster key is defined as a combination of the following column types:

  • 16-byte GUID
  • varbinary (16)
  • nvarchar (512)
  • nvarchar (256)
  • tinyint

Now, the wide cluster key isn't a big deal UNLESS there are nonclustered indexes, but there are in this case - so the cluster key is included in all nonclustered index rows. And the random GUID high-order key is always a bad idea, as it means the clustered index will be heavily fragmented as records are inserted. This is all simplified and generalizations (and I open this can of worms happily) - but you get the idea.

Good design up-front, with an understanding of how key choice affects the behavior of SQL Server and how indexes are stored and indexed, can lead to vastly reduced performance problems and maintenance issues.

Quickie this morning to start the day off. I saw a question on a forum: if I *have* to use a GUID and *must* have a primary key, should I make the primary key clustered or nonclustered?

Now, I'm not getting into the whole GUID vs. bigint identifier, or random GUID vs. GUID generated by NEWSEQUENTIALID(), so please don't comment on those issues, they're not relevant here. I just want to address the question - what kind of index should it be?

From a Storage Engine perspective, my answer is nonclustered. Here are three reasons why:

  • If the index is clustered, then the cluster key is immediately at least 16 bytes (the size of a GUID). This doesn't change the size of the clustered index records (as the GUID column has to be stored in the table anyway, and a clustered index IS the table), but it does change the size of the nonclustered indexes. All nonclustered indexes on the table must include the cluster keys, even of they are not explicitly part of the nonclustered index keys (I'll do a post on this later). This means the GUID is present in every nonclustered index record too. From this perspective, it would be better to use a smaller clustered index key and have the GUID primary ley be nonclustered so it's only present in that one nonclustered index.
  • Random GUIDs used as the high-order key cause index fragmentation. Their random nature means the insertion point into the index is also random. This causes page splits, which cause fragmentation and are *expensive*. (I touched on this a bit a few days ago in my post How expensive are page splits in terms of transaction log?.). With a random key value, it's hard to avoid page splits and fragmentation, although you can delay them somewhat using FILLFACTOR, but at the expense of using extra space. By making the GUID index nonclustered, you can delay page splits even further. The clustered index is the table, so the records are (usually always) larger than nonclustered index records. This means you can get fewer clustered index records on an 8KB page than nonclustered index records. With fewer records per page, you can do fewer random insertions on the page before a page split occurs. So using a nonclustered index for the GUID key means you can do fewer expensive page splits.
  • Given that whatever kind of index you create for the GUID key is going to experience index fragmentation, you're probably going to want to periodically remove the fragmentation as part of your database maintenance plan. It makes sense to try to limit the amount of resources used by the fragmentation removal operation (e.g. cpu, IO, disk space, transaction log space), and so the smaller the fragmented index, the better. A nonclustered index for the GUID key will be smaller than a clustered index, so if you choose a non-fragmentation-causing clustered index key, and confine the fragmentation to the nonclustered index, you can use fewer resources during database maintenance.

And there you have it. I'm sure some of you have seen pathological cases that disprove one of the above points, but my arguments are generalizations. Maybe this is a can of worms I've opened, in which case I look forward to the comments!

PS Brent did a great post about humor when blogging, the cartoon links he includes are great. Check it out here.

I've just been setting up some of our new hardware, and wanted to do some background reading to ensure I use the correct disk partition offset, RAID stripe size, and NTFS allocation unit size to enable the best possible performance for the volumes I'm creating.

You may not of heard about this (or your disk admins may not have heard about this) but on Windows Server 2003 and before, the default partition offset typically causes worse-than-optimal performance - and correcting it can get gains of maybe as high as 30% in terms of IO latency and duration. The SQLCAT team have just published a *fantastic* whitepaper (written by Jimmy May and Denny Lee) which explains the issue simply and clearly and shows you how to correct it. You should checkout the whitepaper at Disk Partition Alignment Best Practices for SQL Server.

The summary is that on Windows Server 2003 and before, the default partition offset is 31.5KB (63 x 512byte disk sectors), which does not align nicely with the common RAID stripe sizes of 64K or 128K, or the optimal NTFS allocation unit size of 64KB. This can lead to having to read/write multiple stripes every so often and a big perf drop. It can be fixed, as detailed in the whitepaper. For volumes *created* on Windows Server 2008, the problem does not exist as it creates a default partition offset of 1024KB - although see the edit below...

In fact Jimmy just published a blog post to help you make the case to your disk admins/customers a few days ago: Disk Partition Alignment (Sector Alignment): Make the Case: Save Hundreds of Thousands of Dollars.

Luckily I'm using Windows Server 2008, which correctly sets the disk partition for the vast majority of cases.

[Edit 06/30/2011: Beware! Even with Windows Server 2008, it's been reported that some I/O subsystem vendors intercept what Windows is trying to do and *still* create partitions with the incorrect offset. Create them and then check - just to be sure.]

Next thing I considered was RAID stripe size and NTFS allocation unit size (previously known as 'cluster size'). Kendal Van Dyke just published an *excellent* blog post series that provides a lot of empirical evidence as to what the best numbers are for the RAID level you're using. This saved me a lot of time. Check out his series at Disk Performance Hands On Series.

The Dell MD3000i units I'm using don't go any lower than 128KB for a RAID stripe size, so the default is fine. Unfortunately, I forgot to set the NTFS allocation unit size to 64KB when creating the partitions in Windows, so I need to recreate the partitions.

A massive thank-you to these guys for saving me a lot of time and hassle. You should go read this stuff too.

Page splits are always thought of as expensive, but just how bad are they? In this post I want to create an example to show how much more transaction log is created when a page in an index has to split. I'm going to use the sys.dm_tran_database_transactions DMV to show how much more transaction log is generated when a page has to split. You can find the list of columns and a small amount of explanation of each column in Books Online here - I was reminded of its existence by someone on Twitter (sorry, don't remember who it was and I couldn't find it in search).

In the example, I'm going to create a table with approximately 1000-byte long rows:

CREATE DATABASE PageSplitTest;
GO
USE pagesplittest;
GO

CREATE TABLE BigRows (c1 INT, c2 CHAR (1000));
CREATE CLUSTERED INDEX BigRows_CL ON BigRows (c1);
GO

INSERT INTO BigRows VALUES (1, 'a');
INSERT INTO BigRows VALUES (2, 'a');
INSERT INTO BigRows VALUES (3, 'a');
INSERT INTO BigRows VALUES (4, 'a');
INSERT INTO BigRows VALUES (6, 'a');
INSERT INTO BigRows VALUES (7, 'a');
GO

I've engineered the case where the clustered index data page has space for one more row, and I've left a 'gap' at c1=5. Let's add it as part of an explicit transaction and see how much transaction log is generated:

BEGIN TRAN
INSERT INTO BigRows VALUES (8, 'a');
GO

SELECT [database_transaction_log_bytes_used] FROM sys.dm_tran_database_transactions
WHERE [database_id] = DB_ID ('PageSplitTest');
GO

database_transaction_log_bytes_used
-----------------------------------
1228

That's about what I'd expect for that row. Now what about when I cause a page split by inserting the 'missing' c1=5 row into the full page?

-- commit previous transaction
COMMIT TRAN
GO

BEGIN TRAN
INSERT INTO BigRows VALUES (5, 'a');
GO

SELECT [database_transaction_log_bytes_used] FROM sys.dm_tran_database_transactions
WHERE [database_id] = DB_ID ('PageSplitTest');
GO

database_transaction_log_bytes_used
-----------------------------------
6724

Wow. 5.5x more bytes are written to the transaction log as part of the system transaction that does the split.

The ratio gets worse as the row size gets smaller. For a row with an approximately 100-byte long row (use the same code as above, but change to a CHAR (100), insert 67 rows with a 'gap' somewhere then insert the 68th to cause the split), the two numbers are 328 and 5924 - the split cause 18 times more log to be generated! For a row with an approximately 10-byte long row, I got numbers of 240 and 10436, because I created skewed data (about 256 rows with the key value 8) and then inserted key value 5 which forced a (rare) non-middle page split. That's a ratio of more than 43 times more log generated! You can try this yourself if you want: I changed the code to have a CHAR (10), inserted values 1, 2, 3, 4, 6, 7, then inserted 256 key values of 8 and then 2 of 5. The resulting page had only 6 rows - it split after the key value 5 - the Storage Engine doesn't always do a 50/50 page split. And that's not even causing nasty cascading page-splits, or splits that have to split a page multiple times to fit a new (variable-sized) row in.

Bottom line: page splits don't just cause extra IOs and index fragmentation, they generate a *lot* more transaction log. And all that log has to be (potentially) backed up, log shipped, mirrored....

Another quickie today. If you're using Activity Monitor in SQL Server 2008, or Performance Data Collection with the Server Activity system data collection set, then you may see a constant, and large number of 'Other' SQL Server Waits. When you drill into these, you see that the highest number of waits are for FSAgent. The FSAgent is part of the FILESTREAM subsystem, and it's what fools the rest of the Storage Engine into accepting the FILESTREAM data as if it was real varbinary(max) data. If you look in the Books Online entry for sys.dm_os_wait_stats, which defines all the different wait types, it explains that the wait should only show up when FILESTREAM I/Os are occuring. However, this wait type shows up *all* the time - clearly a bug.

Turns out that it's a known bug that's been fixed - see KB 958942.

This is a quick follow-on from my Misconceptions around null bitmap size post.

The null bitmap is *always* present in a data record (i.e. records in a heap or the leaf-level of a clustered index) except when all columns are defined as SPARSE in SQL Server 2008 onwards, but is optional in index records if all the columns in the index records are not nullable. The misconception is around what happens when a new column is added to the table. The common misconception is that if you have 8 columns in the table (and hence 8 bits in the null bitmap), if you add a ninth column then SQL Server has to go update every record so the null bitmaps all contain 9 bits. (Same misconception applies to adding the 17th, 25th, 33rd, etc column).

This is usually not true. Let's consider the cases:

  • New column is nullable, with a NULL default. The table's metadata records the fact that the new column exists but may not be in the record. This is why the null bitmap also has a count of the number of columns in that particular record. SQL Server can work out whether a column is present in the record or not. So - this is NOT a size-of-data operation - the existing table records are not updated when the new column is added. The records will be updated only when they are updated for some other operation.
  • New column is nullable, with a non-NULL default. This IS a size-of-data operation. The non-NULL default forces all existing records to be updated when the column is added, and so the null bitmap will be updated too.
  • New column is not-nullable (obviously with a non-NULL default). This IS a size-of-data operation, for the same reasons as above.

Hope this helps.

5/11/09: Little bit of a rewrite today as it seems some people are taking what I'm saying the wrong way.

Over on Ed Bott's blog, last week he showed some inside info about the notorious "SQL Server problem" that caused the Windows 7 RC downloads to be so slow. The reason given was that SQL Server fragmentation caused the server to spike CPU and an index rebuild fixed the problem. Then the SQL CAT team posted that the problem was (paraphrasing) that the download team had only planned for a 100% jump in traffic, and got a 500% jump. The server (me: for a so-far undisclosed reason) couldn't handle the load, so a new server was dropped in and (me: this is interesting) operational practices were changed. Today (5/11/09) I find out that there's an ongoing investigation into just why the server couldn't handle the load. Notice I'm not saying the SQL Server was deficient in any way, but that the server as a whole couldn't handle the load.

Based on the initial information, and the CAT team not stating what the actual problem was, plus they had to change an operational practice (e.g. more frequent index rebuilds to remove fragmentation), we (Kimberly and I) came up with a hypothesis as to what might have happened. Remember that saying that the server wasn't setup for a 500% jump in load could mean a very large number of things. It could mean that the raw number of inserts overloaded the hardware on the server. It could mean that the application logic didn't scale to the extra workload. It could mean that the database schema didn't scale to the extra workload. The latter is the one that we're hypothesizing.

Whatever the actual technical issue was, there are two real problems: 1) the lack of planning, whether it was in terms of hardware capacity, application logic, schema design - whatever 2) the way the problem was handled by the MS PR folks, which not only made out that it was a SQL problem, but made people like me think that of-course-it-wasn't-a-SQL-problem, but it must have been a design/schema problem.

I discussed this with Kimberly and here's what we think happened (from her initial idea). There's been some discussion on how simply downloading a package could cause fragmentation - it's just a SELECT right? Wrong. If you look at what happens when you download, there's a GUID (Globally Unique IDentifier) that gets generated for your download, and there must be a table that tracks downloads, so the GUID is entered into a table. The table most likely has a clustered index, with the random GUID value as the primary key (or the high-order key in a composite key).

Every download thus produces an insert into the clustered index at a random location in the index. As the index pages get full, they eventually need to split, so more inserts can occur in the index at points defined by the random key. These 'page splits' are very expensive, causing lots of IO, log record generation, and fragmentation (both logical fragmentation that interrupts range scan performance and reducing page density from pages becoming less full). All of this takes CPU, which explains some of the spiked CPU from all the downloads. If there are queries that are scanning this data too, the fragmentation would cause terrible performance issues, resulting in more, smaller IOs - and also adding to the CPU load. Rebuilding the index (as was explained in the original blog post) would remove the existing fragmentation, increasing performance again, but wouldn't remove the page-split issue - which is why (apparently) they're going to rebuild that index every night now.

(This is the 100-level explanation for non-SQL geeks, a more in-depth explanation can be found on Kimberly's blog at GUIDs as PRIMARY KEYs and/or the clustering key.)

One of the comments below from Andrew Kelly suggests it could also be an ad-hoc plan problem - again, not a SQL Server issue, but a problem with the application. That could equally be explained by unplanned workload blowing capacity.

So was this a SQL Server issue? No. SQL Server did exactly as it was told, and has no choice where in the index to insert the new records - the GUIDs provide the insertion point. The problem was in the developer who created that schema without testing it with a very high workload - and the lack of a database maintenance plan to pro-actively find and resolve fragmentation issues, etc, etc. 

Now, this is pure conjecture, based on the facts that have come out so far. However, nothing that's been said since the issue (by the SQL team or anyone else) has made me question the hypothesis that some design/app issue led to the problem. As I said, capacity planning can mean a huge number of things - but saying that it's a capacity issue certainly doesn't mean it's NOT a schema design/app logic issue.

Once we hear what the root cause of the capacity overload problem was, I'll post more details. In the meantime, of course I'm not making out that it's a SQL Server problem - if you tell SQL Server to (for instance) use random GUIDs a the clustered, primary key - under high load it's not going to perform well. Inherent SQL Server problem? No. Human problem? Yes.

I hope this rewritten version is clearer now.

Categories:
General | Performance Tuning

Here's an interesting bug that was just recently fixed, and someone actually hit it today, so it's time to blog it.

Bottom line is that if you're doing NOLOCK scans of tables involving LOB data, the perfmance might suck. 2005 SP3 CU3 has the bug fix, but unfortunately the KB article was written by someone who didn't seem to know what they were talking about, so I suggested it was rewritten, which it has been (but not republished yet). The original KB article is at http://support.microsoft.com/kb/961049/. It's also fixed in 2008, but I don't know which build (I believe CU4 at least).

Here's my explanation of the problem.

LOB data can be stored in-row or off-row (my previous LOB post Importance of choosing the right LOB storage technique has more details). When it's stored off-row, it must be accessed by first reading the pointer to the LOB data from the data/index record, and then following the pointer (remember that 'record' is synonymous with 'row'). When a LOB data value is updated, the off-row value is updated first, and then if the off-row link changed, the data/index record is updated with the new link. There's obvioulsy a window here, where someone reading the data/index record might see the wrong/non-existent off-row pointer. This is exactly what can happen when a NOLOCK scan occurs.

To mitigate the possibility of a NOLOCK scan trying to follow a bad off-row link, the old behavior was to scan all the IAM pages (see Inside the Storage Engine: IAM pages, IAM chains, and allocation units) for the table/index to make sure that the off-row link actually pointed to a page allocated to the table/index. If there are lots of IAM pages, this means lots of logical IOs, and poorly performing SELECT queries. And it does the scan once for *every row*. The person that hit it today had a 500 row select of ~20KB per row taking 20 seconds - 10MB of physical IOs and 30MB of logical IOs!

The fix is to make further use of an already existing in-memory cache of IAM pages to do a quick lookup of the right IAM page covering the GAM interval of the LOB page being read, without having to scan the whole IAM chain.

Hope this helps explain things, and track down perf problems for some of you.

A couple of weeks ago I kicked off the latest survey, on what the physical layout of your databases are and why you have them that way (see here for the survey). I let the survey run for a while to get a good sampling, and I wasn't disappointed, with over 1000 responses! Here are the results as of 4/27/2009.

Just like any other 'best practice' kind of topic, the question of how to design the physical layout of a database provokes a lot of (sometimes heated) discussion. There are lots of options and there are even more factors to consider - so the best answer is my perennial favorite "it depends"! In this post, I don't want to tell you how I think you should layout your database - instead I want to discuss some of the options and let you make up your own mind, with the added benefit of data on what your peers are doing with their databases. The main point of this survey was to see what people are doing, rather than as a driver for an editorial blog post.

As you can clearly see from the results above, and predictably, the distribution of layout types shifts as the database size increases - but I was very surprised by the number of single file databases over 10GB. Rather than go through each option in the survey, I'm going to talk a bit about some of the things to consider when planning a layout.

Underlying I/O subsystem

This could be the most important factor to consider. If you only have a single physical drive, for instance, there's arguably not much point creating multiple data files, as that will force the disk heads to bounce back and forth to the different file locations on the disk. On the other hand, if you have a SAN with several thousand drives grouped together into multiple LUNs, your possibilities are a lot wider (and maybe much harder to come up with the optimal layout). Several people asked if I'd go into depth around having multiple controllers, and different drive layouts in a SAN - and my answer is no. I'm not an expert at storage design, which, like indexing, is both an art and a science. There's a good whitepaper that discusses some of this: Physical Database Storage Design, which I helped review back in my MS days.

Performance, recoverability, manageability

Having multiple files with separate storage for each allows reads and writes to be parallelized for increased performance, lowering the amount of disk head contention. When a checkpoint occurs, and pages are written to disk, spreading the I/O load over multiples files can speed up the checkpoint and reduce the IOPS spikes that you may see. It can also lead to reduced contention for the various allocation bitmaps - in the same way as I've described for tempdb. In user databases with a very high rate of allocations, contention can arise on the GAM pages - but it's not common. Some people also advocate having separate filegroups for tables and indexes, and although this can sometimes be more trouble than it's worth, and often turns into a religious debate, I have heard of people getting a perf boost from this.

One of the most convincing reasons (I find) for having multiple filegroups is the ability to do much more targeted recovery. With a single file database, if it gets corrupted or lost, you have to restore the whole database, no matter how large the file is - and this can seriously affect your ability to recover within the RTO (Recovery Time Objective) agreement. By splitting the database into multiple filegroups, you can make use of partial database availability and online piecemeal restores (in Enterprise Edition) to allow the database to be online as soon as the primary filegroup is online, and then restore the remaining filegroups in priority order - bringing the application online as soon as the relevant filegroups are online. You can even use this layout to spread your backup workload - moving to filegroup-based backups instead of database backups, although this isn't very common.

As far as manageability is concerned, there are a few reasons to have multiple filegroups. Firstly, you can isolate a table that requires a lot of I/O (e.g. in terms of index maintenance) on separate storage from other tables, so that maintenance operations (and the I/O overhead of doing them) doesn't interfere with the I/O of the other tables. Also, you can provision different kinds of storage for different tables - in terms of disk speed and RAID level (redundancy), for instance. If you want to be able to move data around, you can do it much more easily if the database is split up, than if it's a single file.

Summary 

Ok - so I lied. I *am* going to offer advice - against one of the options: single filegroup, single file. For smaller databases, this is fine - but as the database size gets larger, say, over tens of GB, then having a single file can become a serious liability. With a single file database (or even a single filegroup database), you lose most of the benefits mentioned above.

Bottom line - as your databases get larger, you're going to need to think more carefully about their layout, otherwise you could run into big problems as your workload increases or when disaster strikes. As the survey results show, this is what your peers are doing.

Next post - this week's survey!

Finally! All those customers out there who are frustrated by not being able to turn on the 'lock pages in memory' option to protect the SQL working set on Standard Edition - the wait is nearly over. Bob Ward (Principal Escalation Engineer in CSS, and a good friend of mine) announced at PASS Europe that the option is soon to be supported - in the May CU for SQL 2008 and the June CU for SQL 2005. You can read a little more in his blog entry here.

If you don't know what this is, checkout this entry in Books Online - it basically prevents Windows paging out SQL memory to disk.

[Edit: the bits containing this fix have been released for 2008. See here for details.]

There's been a recent flurry of confusion and misconceptions about trace flag 1118 in SQL Server 2008. This trace flag switches allocations in tempdb from single-page at a time for the first 8 pages, to immediately allocate an extent (8 pages). It's used to help alleviate allocation bitmap contention in tempdb under a heavy load of small temp table creation and deletion.

There are multiple points of confusion, which I'll address in turn. Then I'll prove that the trace flag still works in SQL Server 2008.

1) Why is the trace flag usually required in 2000? In SQL 2000, whenever a temp table is created in tempdb and a row inserted, an IAM page must be allocated and a single data page must be allocated. These two pages are both 'single-page' allocations, from a mixed extent (see Inside The Storage Engine: Anatomy of an extent for more info). This means that an SGAM allocation bitmap page must be accessed, and a PFS page must be accessed (see Inside The Storage Engine: GAM, SGAM, PFS and other allocation maps for more info).

With lots of very small temp tables being created, this means the very first SGAM page and the very first PFS page in the data file are accessed/changed by all the threads, leading to latch contention problems on these two pages. When the temp tables are deleted again, the various pages are deallocated, which again needs to access and change the PFS page, and potentially the SGAM page.

There are two ways to alleviate this problem. Firstly, create multiple data files in tempdb - which splits the latch contention over multiple allocation bitmaps (from having allocations come from multiple files) and thus reduces the contention. The general rule of thumb was one tempdb data file for each processor core. Secondly, turn on TF1118, which makes the first 8 data pages in the temp table come from a dedicated extent. This means one extent is allocated from the GAM page, rather than 8 single pages (and potentially 8 accesses to the SGAM page). The pages within the extent are reserved and allocated singly from this extent, as needed. This also cuts down on contention and is documented in KB 328551.

2) What does reserved vs. allocated mean? When an extent is allocated to a table, the 8 pages in the extent are not immediately allocated as well. Allocating an extent means those 8 pages are reserved exclusively for subsequent allocation to that table. The pages are allocated individually as needed, but no other table can allocate them. This is why such extents are called 'dedicated' extents (see my blog post link above for more details). You can see the counters of reserved pages vs. allocated pages in the output from sp_spaceused.

3) Why is the trace flag not required so much in 2005 and 2008? In SQL Server 2005, my team changed the allocation system for tempdb to reduce the possibility of contention. There is now a cache of temp tables. When a new temp table is created on a cold system (just after startup) it uses the same mechanism as for SQL 2000. When it is dropped though, instead of all the pages being deallocated completely, one IAM page and one data page are left allocated, and the temp table is put into a special cache. Subsequent temp table creations will look in the cache to see if they can just grab a pre-created temp table 'off the shelf'. If so, this avoids accessing the allocation bitmaps completely. The temp table cache isn't huge (I think it's 32 tables), but this can still lead to a *big* drop in latch contention in tempdb.

4) Does the trace flag still exist in 2005 and 2008? Yes it does - KB 328551 clearly states:

Note Trace flag -T1118 is also available and supported in Microsoft SQL Server 2005 and SQL Server 2008. However, if you are running SQL Server 2005 or SQL Server 2008, you do not have to apply any hotfix.

Just to make extra-sure (as I'm always paranoid about saying absolutes), I checked with my good friend Ryan Stonecipher, who's the dev lead for the team that owns allocation (and a bunch of other stuff, including DBCC). He confirmed the code is exactly the same in 2008 as it was in 2005. And I prove it to you below too.

5) And why is it still there in 2005 and 2008? It does the same thing in 2005/2008 as it did in 2000. If the temp table creation/deletion workload is high enough, you can still see latch contention, as the temp table cache won't be enough to completely alleviate the need for creating actual new temp tables, rather than just being able to grab one 'off the shelf'. In that case, using the trace flag to change to extent-based allocation (in *exactly* the same way as for 2000) can help, as can creating more tempdb data files.

As far as data files go though, the number has changed. Instead of a 1-1 mapping between processor cores and tempdb data files (*IF* there's latch contention), now you don't need so many - so the recommendation from the SQL team is the number of data files should be 1/4 to 1/2 the number of processor cores (again, only *IF* you have latch contention). The SQL CAT team has also found that in 2005 and 2008, there's usually no gain from having more than 8 tempdb data files, even for systems with larger numbers of processor cores. Warning: generalization - your mileage may vary - don't post a comment saying this is wrong because your system benefits from 12 data files. It's a generalization, to which there are always exceptions.

6) Why does DBCC IND still show two pages, even with the trace flag on? I've heard of some people being confused by the output of DBCC IND in SQL 2008 when the trace flag is turned on. Creating a single row temp table will only show two pages allocated in the DBCC output - one IAM page and one data page. Yes, that's completely correct - as only two pages are allocated, but the data page comes from a dedicated extent, not a mixed extent. (IAM pages are *always* single-page allocations from mixed-extents).

And now the proof, on SQL 2008.

SELECT @@VERSION;
GO

Microsoft SQL Server 2008 (RTM) - 10.0.1600.22 (Intel X86)

First off, I'll create a temp table without the trace flag enabled, and see what pages the table has allocated, by looking at the first IAM. I'll use a temp table with an 8000+ byte row size, and insert two rows - so we have two data pages for clarity.

DBCC TRACEOFF (1118, -1);
GO

USE tempdb;
GO

CREATE TABLE #temp (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'a');
GO
INSERT INTO #temp DEFAULT VALUES;
GO 2

Now I'll figure out what is the first IAM page, using my sp_AllocationMetadata script (see here for the script and details), and dump it with DBCC PAGE to see the single-page allocations it's tracking, and which dedicated extents are allocated to the table:

EXEC sp_AllocationMetadata '#temp';
GO

Object Name    Index ID  Alloc Unit ID        Alloc Unit Type  First Page  Root Page  First IAM Page
-------------- --------- -------------------- ---------------- ----------- ---------- ---------------
#temp__<snip>  0         1152921505223016448  IN_ROW_DATA      (1:158)     (0:0)      (1:199)

DBCC TRACEON (3604);
GO
DBCC PAGE ('tempdb', 1, 199, 3);
GO

<snip>

IAM: Single Page Allocations @0x4A35C08E

Slot 0 = (1:158)                     Slot 1 = (1:200)                     Slot 2 = (0:0)
Slot 3 = (0:0)                       Slot 4 = (0:0)                       Slot 5 = (0:0)
Slot 6 = (0:0)                       Slot 7 = (0:0)                      


IAM: Extent Alloc Status Slot 1 @0x4A35C0C2

(1:0)        - (1:1016)     = NOT ALLOCATED   

As you can clearly see from the partial output of the dump of the IAM page, there are two single-page allocations and no extents allocated to the temp table. This is what should happen when the trace flag is not enabled.

Now I'll do the same thing with the trace flag 1118 enabled.

USE tempdb;
GO

DROP TABLE #temp;
GO

DBCC TRACEON (1118, -1);
GO

CREATE TABLE #temp (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'a');
GO
INSERT INTO #temp DEFAULT VALUES;
GO 2

EXEC sp_AllocationMetadata '#temp';
GO

Object Name    Index ID  Alloc Unit ID        Alloc Unit Type  First Page  Root Page  First IAM Page
-------------- --------- -------------------- ---------------- ----------- ---------- ---------------
#temp__<snip>  0         1224979099301904384  IN_ROW_DATA      (1:208)     (0:0)      (1:158)

DBCC TRACEON (3604);
GO
DBCC PAGE ('tempdb', 1, 158, 3);
GO

<snip>

IAM: Single Page Allocations @0x4A8FC08E

Slot 0 = (0:0)                       Slot 1 = (0:0)                       Slot 2 = (0:0)
Slot 3 = (0:0)                       Slot 4 = (0:0)                       Slot 5 = (0:0)
Slot 6 = (0:0)                       Slot 7 = (0:0)                      


IAM: Extent Alloc Status Slot 1 @0x4A8FC0C2

(1:0)        - (1:200)      = NOT ALLOCATED                              
(1:208)      -              =     ALLOCATED                              
(1:216)      - (1:1016)     = NOT ALLOCATED
                              

Now as you can clearly see, there are no single-page allocations, and there's a single extent allocated to the table. Proof that trace flag 1118 still does exactly what it should in SQL Server 2008.

Now for a DBCC IND on the table: 

DBCC IND ('tempdb', '#temp', -1);
GO

PageFID PagePID     IAMFID IAMPID      ObjectID    IndexID    
------- ----------- ------ ----------- ----------- -----------
1       158         NULL   NULL        293576084   0
1       208         1      158         293576084   0
1       209         1      158         293576084   0

(I've removed some of the trailing columns for clarity.) We see that it still only lists the two data pages (1:208, 1:209) and the IAM page (1:158) - although an entire extent was allocated to the temp table, only two pages from the extent were actually allocated and used - the rest are reserved for use by that table, but remain unallocated. 

Hopefully this post has cleared up a lot of the confusion around this trace flag and what it does.

A short post to start the day (in India) and then at lunchtime I'll do how it works: FILESTREAM garbage collection.

This is a question that came up recently on an MVP discussion list: why does a 200MB log backup take 40 minutes to restore on my log-shipping secondary?

First answer I thought of was that the I/O subsystem on the log-shipping secondary was very slow, or broken in some way, so I/Os were taking ages to complete. Nope - I/O subsystem was performing well.

Next thing I suggested was the the log-shipping secondary was restoring the log backups using WITH STANDBY and there was something like an index rebuild that was being rolled back. As an aside, when you restore log backups on the log-shipping secondary, you have a choice how they recovery: either WITH NORECOVERY or WITH STANDBY. The first option doesn't allow any access to the database, as the database is still "in recovery". The second option runs the REDO part of recovery, then runs the UNDO part of recovery, but saves the details of what it had to do for UNDO into a file (who's name and location you specify). It then allows read-only access to the database, for queries/reporting/whatever. Obviously if there's a lot of transaction log that has to be undone (i.e. rolling back transactions that weren't committed at the time the log backup completed), then this could take some time. But 40 minutes? No. That wasn't it.

I was partly right on my second guess. The *previous* log backup that was restored WITH STANDBY contained a long-running index operation, and so the undo file that the restore created was *huge*. The next log backup that's restored after a RESTORE ... WITH STANDBY, must first undo everything in the undo file (i.e. put the database back to the exact state it was in as if the WITH STANDBY part of the restore never happened) before it can restore more transaction log. In this case, no-one had noticed that the previous log restore *also* took a lot of time and created the huge undo file. It just looked like the 200MB log backup was causing the problem.

My advice: in any kind of there's-suddenly-a-corruption-or-performance-problem situation, don't just look at what's happening now. Always look at what happened leading up to the problem, as the answer is often there.

Microsoft's popular performance troubleshooting whitepaper has been updated for SQL Server 2008. You can download Troubleshooting Performance Problems in SQL Server 2008 at http://download.microsoft.com/download/D/B/D/DBDE7972-1EB9-470A-BA18-58849DB3EB3B/TShootPerfProbs2008.docx.

I've also added it to our whitepaper-links collection page at http://www.sqlskills.com/whitepapers.asp.

At the last few conferences I've presented at, there have been questions about using SSDs (Solid-State Drives) for enterprise storage and whether that will change some of the database maintenance practices. My answer to that is "I don't know" (ha - bet you thought I was going to say "It depends!") because adoption of SSDs is very low. I haven't been able to find much info about using them, but the Microsoft Research group in Cambridge just published a research paper Migrating Sever Storage to SSDs: Analysis of Tradeoffs, which does a nice job of walking through the issues involved and concludes that for the majority of workloads, it makes more economic sense to host them on HDDs. The exception is for top-end OLTP databases. I'll warn you that this isn't a whitepaper - it's a research paper, and gets a bit deep into algorithms and mathematical analyses, but if you're up to the challenge it's a great read.

You can download the paper from http://research.microsoft.com/en-us/um/people/antr/ms/ssd.pdf. Enjoy!

PS I found it on James Hamilton's blog.

This week's survey is a little more complicated. I'm interested in the physical layout of your databases. I've got four surveys, for a variety of database sizes. Please vote multiple times in each survey, as you see fit - and by all means forward this link to your friends/clients/etc or re-blog/tweet it. I'm going to report on this survey in two weeks, to give a bit more time for people to respond (and because we're travelling next week). I think we're going to see some interesting statistics come out of this - the more people that respond the better. I'll report on it 4/24/09.

One thing to note - this is just for user databases, not for tempdb. In the surveys, "multiple filegroups" implies multiple files too, and if you don't have them spread exactly one per drive/etc, just choose that option - I only have 10 options to choose from in the free surveys. 

(If you're in the over 1TB range and have multiple files/filesgroups spread over multiple drives/arrays/LUNs, vote using the last option on the >1TB survey and I'll lump them together.) 

Phew - thanks!

Last week's survey was on how you should store large-value character data in SQL 2005+ (see here for the survey). Here are the result as of 4/3/2009 - and I think my favorite answer is starting to catch-on:

My favorite answer is, of course, it depends! For all those who didn't answer 'it depends', your answer is valid, but only for particular circumstances, as each method has its pros and cons and won't be applicable in all cases. It's extremely important when designing a schema to consider how to store LOB data, as making the wrong choice can lead to nasty performance issues (where 'performance' is a catch-all to include things like slow queries, fragmentation, and wasted space). Now I'd like to run through each of the options and detail what I think of as the pros and cons. A couple of definitions first: 'in-row' means the column value is stored in the data or index record with the other columns; 'out-of-row' or 'off-row' means the column value is stored in a text page somewhere in the data file(s), with a physical pointer stored in the data/index record (taking either 16 or 24 bytes itself).

  • As a N/CHAR column. This is a great choice when the data that's stored in the column is a fixed size all the time, and always uses the full width of the column. Any time that the data may be smaller than the defined wdith of the column, space is being wasted in the row. Wasted space leads to fewer rows per page, more disk space being used to store the data, more I/Os to read the data, and more memory used in the buffer pool. However, if the character values are very volatile, and can change size, then having a fixed-width column can avoid the problem of a row having to expand and there not being enough space on the page to allow that - leading to a fragmentation-causing page split in an index (or forwarding record in a heap). There's a tipping point that can be hard to identify for your particular application...
  • As a N/VARCHAR (1-8000) column. For data values less than 8000 bytes, this is the common choice as it avoids wasted space. However, if the application can change the size of the data after the initial creation of the row, then there is the possibility of fragmentation occuring through page-splits. In SQL Server 2005+, a row can also be created that is more than 8060 bytes - one or more variable-length columns is pushed into off-row storage and replaced by a physical pointer. This means any access of the column has to do an extra I/O to reach the data - and this is commonly a physical I/O as the text page is not already in memory. This can lead to hard-to-diagnose performance issues if a query selects the column and some rows have the data in-row, and some out-of-row. Also, if the data values tend towards the larger end of the 1-8000 byte spectrum, individual rows can become vary large, leading to very few rows per page - and the problems described in the first option. If the data isn't used very much, then storing it in-row like this isn't very efficient.
  • As a N/VARCHAR (MAX) column in-row. This has the same pros and cons as the option above, with the added benefit that the value can grow larger than 8000 bytes. In that case it will be pushed off-row automatically, and start incurring the extra I/O for each access. These data types also work with the intrinsic functions in the same way as the character data types discussed above. I guess one drawback of this type compared to FILESTREAM is that it's limited to 2GB. Also, if there's a LOB data column in the table definition, the table's clustered index cannot have online operations peformed on it - even if all the LOB values are NULL or stored in-row!
  • As a N/VARCHAR (MAX) column out-of-row. The drawback of storing this data out-of-row is that accessing it requires an extra I/O to retrieve it, but if the data isn't used very much then this is an efficient way to go, but still uses space in-row to store the off-row pointer. An additional benefit of storing the data off-row is that it can be placed in a separate filegroup, possibly on less expensive storage (e.g. RAID 5) - but then there's the drawback that it can't be moved after being created except with an export/import operation. This option has the same online operations drawback as storing the data in-row.
  • As a N/TEXT column in-row. This has the same pros and cons as the N/VARCHAR (MAX) column in-row option, but these data types are deprecated and don't work with the majority of the intrinsic functions.
  • As a N/TEXT column out-of-row. Same as above.
  • In a seperate table and JOIN to it when required. This option is great when the data isn't used very much, as it doesn't require any storage at all in the main table (except for a value to use for the JOIN), but it does require some extra up-front design and slightly more complicated queries. There's another HUGE benefit to doing this - by moving the LOB data to another table, online operations become available on the main table's clustered index. (This concept is 'vertical partitioning' a huge topic in itself...)
  • As a FILESTREAM column. (Yes, I didn't have this in the survey, but it's a possibility). If your data values are going to be more than 1MB, then you may want to consider using the FILESTREAM data type in SQL 2008 to allow much faster access to the data than having to read it through the buffer pool before giving it to the client. There are lots of pros and cons to using FILESTREAM - see my whitepaper for more info here.

So, as you can see, the best answer for a general question like this is definitely It Depends!. Although I haven't covered every facet of each storage option, the aim of this post is to show that it is very important to consider the implications of the method you choose, as it could lead to performance problems down the line.

Next post - this week's survey!

(Continuing my habit of blogging while Kimberly's presenting - at least I'm not on stage this time...)

In early versions of SQL Server, it was sometimes necessary to 'pin' the pages from a heavily used table in the buffer pool to help performance. In later versions, the buffer pool became very sophisticated and does an excellent job of keeping the right data in memory (basically using an LRU algorithm). I saw a blog post today which mentioned that you can get into trouble using DBCC PINTABLE if you pin a table in memory and then the table grows and grows, and ends up taking up such a large proportion of the database that it causes massive performance problems - absolutely true. The blog post also mentioned that DBCC PINTABLE was deprecated in SQL Server 2005.

That's also true - it's deprecated, but in SQL Server 2005 it actually does NOTHING at all. I personally removed all the code - so now it just returns a success message.

Categories:
DBCC | Performance Tuning

The April edition of TechNet Magazine is available on the web now and has the latest installment of my regular SQL Q&A column.

This month's topics are:

  • Disappearing errors with DBCC CHECKDB
  • Provisioning tempdb when moving from 2000 to 2008
  • Does fillfactor prevent fragmentation and should it be set instance-wide
  • Avoiding FILESTREAM performance problems

Check it out at http://technet.microsoft.com/en-us/magazine/2009.04.sqlqa.aspx

Back in 2005 Kimberly produced two very popular webcast series - an 11-part webcast series for TechNet called SQL Server 2005 for the IT Professional and a 10-part webcast series for MSDN called A Primer for Proper SQL Server Development. The webcast links and blog posts were broken for quite a while but now they're all fixed up and working again. I've created some web pages that link to all the webcasts and blog posts, along with abstracts. I've also included some more recent ones too and will be adding to the list over the next few weeks.

There's over 30 hours of good stuff to watch - check them out at http://www.sqlskills.com/webcasts.asp

Microsoft's Steffen Krause has written an excellent whitepaper on Tuning the Performance of Change Data Capture in SQL Server 2008, that I technically reviewed, and it was published late last year. To get an overview of Change Data Capture (CDC) before reading the whitepaper, see the TechNet Magazine article I wrote for the November issue, titled SQL Server 2008: Tracking Changes in Your Enterprise Database. The new whitepaper covers:

  • Configuration of sys.sp_cdc_enable_table parameters
  • Configuration of the CDC capture job
  • Using Extended Events to determine the performance and characteristics of CDC
  • Influence of CDC scan job parameters on CDC performance
  • Influence of workload characteristics on CDC performance
  • Influence of sys.sp_cdc_enable_table parameters on CDC performance
  • CDC cleanup job considerations
  • Transaction log file considerations

Check it out at http://msdn.microsoft.com/en-us/library/dd266396.aspx.

One of the things I mentioned in my recent TechNet Magazine article on Understanding Logging and Recovery was the need to manage the transaction log so the number of VLFs (virtual log files) does not get too large (with too large being more than, say, 100 VLFs). Linchi Shea (a fellow MVP) has just posted some performance numbers comparing insert/update/delete performance between a database with 16 VLFs and one with 20000 VLFs, representing a poorly managed transaction log. You can see his blog post here.

For those wishing to take charge of unruly transaction logs, checkout Kimberly's blog posts - starting with this one. And if you want to quickly know how many VLFs your transaction log has, use the undocumented DBCC LOGINFO command - the number of lines of output is the number of VLFs you have.

Next posts coming up will be photos - we're in Bangkok right now and I've got 3 Where In The World Are Paul and Kimberly posts queued up - St. Lucia, Hyderabad, and Bangkok.

Over the last few weeks Sunil Agarwal (from the SQL Storage Engine team) has posted a great series of blog articles about tempdb and the version store, over on my old stomping ground - the Storage Engine blog. The articles are well worth reading - the links are:

Enjoy!

In the previous blog post I mentioned our partner company in Australia, so I'd better explain...

We have a new partner company - SQLskills.com.au - run by our good friend and fellow MVP Greg Linwood, along-side his other company MyDBA, which provides DBA support, consulting and staffing services to customers world-wide. By extending SQLskills.com into Australia, we can provide world-class training to the burgeoning SQL Server base in and around Australia, without customers having to travel to the US. As well as providing their own custom courses, the SQLskills.com.au team will be teaching courses developed by me, Kimberly, and Bob Beauchemin - using only the best instructors that we've personally taught and approved.

Update 12/10/08: the Australia classes have been pushed out to June 2009 for various reasons - watch the blog for more details.

The courses have already been running the last two months to great success and I'm very excited to announce that Kimberly and I are coming to Australia in February 2009 to teach four classes in the SQLskills.com.au training facility in Melbourne. The classes we have planned are:

Click the links for more details and registration info. The internals course (or equivalent knowledge) is really a pre-requisite for the perf tuning and maintenance courses that immediately follow it, as these courses will be really in-depth.

Checkout the SQLskills.com.au site for other classes they offer for developers and BI specialists.

We hope to see you in Melbourne next year!

It's really scary how quickly time flies - seems like it was just last week when I last blogged about TechNet Magazine (actually it was a month ago when I blogged about my 2008 Change Tracking article - see here). Anyway, a new issue of TechNet Magazine has just come out and this one has the latest installment of my bi-monthly SQL Q&A column.

This month's topics are:

  • How row-overflow columns can lead to poor range scan performance, even on completely defragmented indexes.
  • Combining database mirroring and failover clustering without undesired failover behavior.
  • Adding differential backups to a full+log backup strategy to lower recovery time.
  • Memory settings for multi-instance failover clustering

You can get to the column online at http://technet.microsoft.com/en-us/magazine/dd228989.aspx.

Enjoy!

PS If you have any ideas for what would make a good SQL Q&A topic, please drop me a line - paul@sqlskills.com

During early SQL Server 2008 development, the Storage Engine team introduced additional minimally-logged functionality. And then it was removed again for RTM. Now it seems that they've bowed under customer pressure and have made the functionality available again (see Sunil's announcement here).

Using trace flag 610 in the RTM build, you enable the potential for minimal-logging when:

  • Bulk loading into an empty clustered index, with no nonclustered indexes
  • Bulk loading into a non-empty heap, with no nonclustered indexes

Sunil's previous blog post here gives more info on the required syntax - very useful!

Over the last few weeks I've had lots of questions about FILESTREAM performance and how to coax NTFS into scaling nicely. I just finished writing a 30-page whitepaper about FILESTREAM for the SQL Server team, which should be published before PASS in November (I'll blog the link when I have it). Although my whitepaper isn't strictly about performance, there is a long section about setting up your system to get high-performance from FILESTREAM. What I want to do in this blog post is give a bullet list of things to do that will help you get good performance. All of these are explained in more detail in the whitepaper.

Here you go, in no particular order:

  • Make sure you're storing the right-sized data in the right way. Jim Gray (et al) published a research paper a couple of years ago based called To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem?. To summarize the findings, BLOBs smaller than 256-KB should be stored in a database, and 1-MB or larger should be stored in the file-system. For those in-between, "it depends" - my favorite answer. The upshot of this is that you won't get good performance if you store lots of small BLOBs in FILESTREAM.
  • Use an appropriate RAID level for the NTFS volume that will host the FILESTREAM data container (the NTFS directory structure corresponding to the FILESTREAM filegroup in the database). Don't use RAID-5, for instance, for a write-intensive workload.
  • Use an appropriate disk technology. SCSI will be usually be faster than SATA/IDE, but more expensive. This is because SCSI drives usually have higher rotational speeds, so lower latency and seek times.
  • Whichever disk technology you choose, if SATA, ensure it supports NCQ, and if SCSI, ensure it supports CTQ. Both of these allow the drives to process multiple, interleaved IOs concurrently.
  • Separate the data containers. Separate them from each other, and separate them from other database data and log files. This avoids contention for the disk heads.
  • Defrag the NTFS volume if needed before setting up FILESTREAM, and periodically to maintain good scan performance
  • Turn off 8.3 name generation on the NTFS volume. This is an order-N algorithm that has to check that the new name generated doesn't collide with any existing names in the directory. This slows insert and update performance down *a lot*. Do this using the command line fsutil utility.
  • Turn off tracking of last access time using fsutil.
  • Set the NTFS cluster size appropriately. For BLOBs 1-MB or large, use a cluster size of 64-KB. This will help to reduce fragmentation.
  • A partial update of FILESTREAM data creates a new file. Batch lots of small updates into one large update to reduce churn.
  • When streaming the data back to the client, use an ~60-KB SMB buffer size (or multiples thereof). This is so that the buffers don't get overly fragmented as TCP/IP buffer are 64-KB.

Hope this helps!

At the user group meeting on Monday I spent some time explaining how GUIDs can cause fragmentation in clustered indexes AND in non-clustered indexes, even if the GUID isn't specifically included in the non-clustered index key. GUIDs are essentially random values (pseudo-random in ranges if generated using NEWSEQUENTIALID) that are also unique. Their uniqueness is what makes them attractive to many developers as a key value, without understanding the havoc they can cause in production in terms of fragmentation and poor query performance.

A GUID key causes fragmentation because of its randomness. The insertion point of a new record in an index is dictated by the value of the index key, so if the key value is random, so is the insertion point. This means that if an index page is full, a random insert that happens to have to go onto that page will cause a page split to make room for the new record. A page-split is where a new page is allocated and (as near as possible to) half the rows from the splitting page are moved to the new page. The new row is then inserted into one of the two pages, determined by the key value. Usually the newly allocated page is not physically contiguous to the splitting page, and so fragmentation has been caused. In this case *two* kinds of fragmentation have been caused - logical fragmentation (where the next logical page as determined by the index order is not the next physical page in the data file) and physical (or internal) fragmentation (where space is being wasted on index pages). These can both affect query performance (topic for a later post), as well as the expense of having to do the page split in the first place.

It's fairly well known that GUIDs can cause fragmentation in the index where the GUID is the key (e.g. a clustered index), but not about the knock-on effects in non-clustered indexes. Here's an example - I'll create two clustered indexes with GUID keys (one generated from NEWID and one from NEWSEQUENTIALID), plus a non-clustered index on each. Let's see what happens when we insert 100000 rows:

-- Create a table with a GUID key
CREATE TABLE BadKeyTable (
   
c1 UNIQUEIDENTIFIER DEFAULT NEWID () ROWGUIDCOL,
   
c2 DATETIME DEFAULT GETDATE (),
   
c3 CHAR (400) DEFAULT 'a');
CREATE CLUSTERED INDEX BadKeyTable_CL ON BadKeyTable (c1);
CREATE NONCLUSTERED INDEX BadKeyTable_NCL ON BadKeyTable (c2);
GO

-- Create another one, but using NEWSEQUENTIALID instead
CREATE TABLE BadKeyTable2 (
   
c1 UNIQUEIDENTIFIER DEFAULT NEWSEQUENTIALID () ROWGUIDCOL,
   
c2 DATETIME DEFAULT GETDATE (),
   
c3 CHAR (400) DEFAULT 'a');
CREATE CLUSTERED INDEX BadKeyTable2_CL ON BadKeyTable2 (c1);
CREATE NONCLUSTERED INDEX BadKeyTable2_NCL ON BadKeyTable2 (c2);
GO

DECLARE @a INT;
SELECT @a = 1;
WHILE (@a < 10000)
BEGIN
   
INSERT INTO BadKeyTable DEFAULT VALUES;
   
INSERT INTO BadKeyTable2 DEFAULT VALUES;
   
SELECT @a = @a + 1;
END;
GO

-- And now check for fragmentation
SELECT
   
OBJECT_NAME (ips.[object_id]) AS 'Object Name',
   
si.name AS 'Index Name',
   ROUND (ips.avg_fragmentation_in_percent, 2) AS 'Fragmentation',
   
ips.page_count AS 'Pages',
   
ROUND (ips.avg_page_space_used_in_percent, 2) AS 'Page Density'
FROM sys.dm_db_index_physical_stats (DB_ID ('DBMaint2008'), NULL, NULL, NULL, 'DETAILED') ips
CROSS APPLY sys.indexes si
WHERE
   
si.object_id = ips.object_id
   
AND si.index_id = ips.index_id
   AND ips.index_level = 0;
GO

Object Name    Index Name        Fragmentation  Pages  Page Density
-------------  ----------------  -------------  -----  ------------
BadKeyTable    BadKeyTable_CL    99.13          8092   66.08
BadKeyTable    BadKeyTable_NCL   30.97          78     64.1
BadKeyTable2   BadKeyTable2_CL   0.83           5556   96.26
BadKeyTable2   BadKeyTable2_NCL  1.88           372    99.61

The BadKeyTable_CL clustered index with the GUID key generated from NEWID is almost perfectly fragmented, with 34% space being wasted on each page. Conversely, the BadKeyTable2_CL clustered index with the GUID key generated from NEWSEQUENTIALID is hardly fragmented and only 4% of free space is wasted (and this is just because of the row size chosen). These numbers are entirely expected given the nature of the cluster keys.

Now look at the non-clustered indexes. BadKeyTable_NCL is 31% fragmented with 36% space wasted on each page! BadKeyTable2_NCL is harldy fragmented with no free space wasted on each page. So what's going on? The non-clustered index key in both cases is a datetime column, which has a minimum granularity of 3 milliseconds. The code above runs in a tight loop inserting records and so can insert more than one record per 3ms time interval - and I allow this because I didn't create make the non-clustered indexes unique. For all the records inserted in one 3ms time interval, there *has* to be something that makes the non-clustered index key unique internally (as even though an index can be defined as non-unique, the Storage Engine requires that each record really is unique in an index and will add whatever it needs to so that happens).

In this case, the clustered index key (which must be present in the non-clustered index anyway) is used to differentiate between all the non-clustered index records with the same datetime value. For BadKeyTable_NCL, the cluster key is a random GUID, so the non-clustered index record insertion points ALSO become random within each 3ms time interval - leading to the fragmentation above. The BadKeyTable2_NCL non-clustered index has the same time interval issue, but it's cluster key is a sequential GUID, so the non-clustered index doesn't get fragmented. If the non-clustered index key was a time datatype with a larger minimum granularity (like smalldatetime, or the new date), the fragmentation of BadKeyTable_NCL would be even worse - try it for yourself and you'll see.

So the answer to the question in the blog post title is really - it depends! Under the right conditions, a GUID cluster key can also seriously fragment a non-clustered index as well.

Fresh off a week of teaching classes on the Microsoft campus, we've finalized some user group dates. Here's what we have coming up:

  • Monday August 18th: user group meeting in Redmond
  • Monday September 1st to 3rd: public class in England
    • In conjunction with our UK partners, SQL Know How at Hatfield, England
    • Topic: Best Practices in Performance and Availability in SQL Server 2005/2008 
  • Thursday September 4th: user group meeting in Ireland
  • Friday September 5th: SQL Server launch event in Ireland
    • In conjunction with Microsoft Ireland at the Dublin Microsoft office
    • Topic: SQL Server 2008 Overview for DBAs
  • Monday September 8th to 9th: public class in Scotland
    • In conjunction with our UK partners, SQL Know How at Edinburgh, Scotland
    • Topic: Indexing for Performance in SQL Server 2000/2005/2008

It's going to be a busy few weeks - hope to see you at one of these events!

Kimberly and I were presenting at our local (Redmond) .Net Developers Association on Monday and the following question came up while Kimberly was talking about missing and extra indexes (paraphrasing):

What's the best non-clustered index to use for the query with a predicate WHERE lastname = 'Randal' AND firstname = 'Paul' AND middleinitial = 'S'?

Kimberly said that the order of the keys (e.g. lastname, firstname, middleinitial; or middleinitial, lastname, firstname; etc) doesn't matter for this case. I thought about it for a second and then argued, saying that the most selective column should come first. We agreed to discuss with the group at the end, but I thought about it some more and realized (and admitted to the group) that she's right - I should know better than to question Kimberly's knowledge of indexing... :-)

She's right because for a pure equality query using AND for multiple predicates, the Storage Engine will seek straight to the first exactly matching record in the index (and then scan for more matches if it's a non-unique index). It doesn't matter what order the index keys are defined because the Storage Engine is looking for an exact match.

When I started arguing, I was thinking about a phone book, which is ordered by lastname, firstname, middleinitial. You may think that a phone book is ordered that way because lastname is the most selective. Wrong. It's because the lastname is what most people know - it just happens to be the most selective of the three choices. Most SQL geeks should be able to find Kimberly in a phone book by looking for Tripp, Kimberly. But what if it was ordered by middleinital? I'd have no problem finding Kimberly, but how many of you would remember that her middleinitial is L? Probably a few as we both use our middle initials in our public names. What about if it was ordered by middleNAME? Again, no problem for me but who how many other people know her middle name is Lynn?

Then I started thinking about other queries and how they would play into the index choice to answer to the question above. If I also wanted to support a query with the predicate WHERE lastname = 'Randal', then having the left-most index key be anything other than lastname won't work so well. If the key order was firstname, middleinitial, lastname then all the distinct lastname values would be spread through the index rather than being together. The index might still be used to satisfy the query if it's the lowest cost index to use. However, having lastname be the leading key probably wouldn't work very well for a query with a predicate of WHERE firstname = 'Paul' - that argues for having firstname be the left-most index key.

Which should I choose? I probably I can't have both in the same index, so maybe I'd have TWO non-clustered indexes, to support both queries. The answer depends on how often the various queries are used and the trade-off between how much of a performance gain the non-clustered index would provide against the performance drop of having to maintain it during DML operations.

I hear time and again about people adding a non-clustered index for every column in the table, thinking that this will help - and my thinking is that this is wrong because these indexes can only satisfy a query where the only predicate is the column being indexed. I ran this argument past Kimberly and she added that these indexes could also be used if the column is chosen as the most selective in a multi-predicate query, and no other index has a lower cost than that one (a slim chance usually). Even what I though of as a simple case has caveats!

So what's the point of this post? Well, I wanted to show how indexing for one very simple query is pretty straightforward, but as soon as the number of different queries grows, and the query predicates get more complicated, indexing becomes more complex. You really have to know your workload and your data to know which columns are used, in what combinations, and how often - and then it helps to know how indexes are costed and used so that you can make intelligent choices about which indexes to define.

This thought-exercise has really shown me that I didn't know how much I don't know about indexes - I know precisely how they work at the Storage Engine level but not too much about how they're used by the Query Processor. I have new-found respect for Kimberly's indexing expertise. Luckily she's teaching a class at Microsoft called Indexing For Performance next week - I think I'll attend :-)

During the various courses I've been teaching, people are interested in how FILESTREAM performance compares with storing BLOBs in the database itself. I have some performance graphs based on measurements the dev team made - these have been presented publicly by myself and the dev team so I can share them with you here.

There are three graphs below, showing the relative performance for read, insert, and update of:

  • BLOB data stored in FILESTREAM format and accessed through the WIN32 streaming APIs. The times include getting a transaction context from SQL Server, getting the file path, doing the operation, closing the file, and committing the transaction in SQL Server.
  • BLOB data stored in FILESTREAM format and manipulated through T-SQL
  • BLOB data stored in varbinary(max) format (and obviously manipulated through T-SQL)

The data is the same in each test. The tests were performed on a 4-way box with a cold buffer pool. (Note that if this was for a warm-buffer pool, as the graph in the FILESTREAM whitepaper is for, the varbinary(max) and FILESTREAM T-SQL numbers would essentially swap.) One interesting point to note is that for smaller data sizes, it's faster to manipulate them through T-SQL than through the file system - this is expected based on research Jim Gray did when putting together the original TerraServer.

Almost a year ago to the day I asked a question on my old blog - how long does *your* CHECKDB take to run? Well, I had a bunch of replies but I never got around to posting the results. I was reminded of this recently so threw together some graphs of the results. I've plotted database size against run-time of DBCC CHECKDB, for a number of different numbers of CPUs, and mostly on SQL Server 2000. I've also added a linear trend line too.

Now - bear in mind this doesn't take into account the multitude of other factors that can affect how long DBCC CHECKDB takes to run (see my previous post here for the list).

Hopefully these are interesting to some of you!

A bit more traffic on the thread (see previous post here) prompted me to give my thoughts on the many sweeping generalizations that plague the computer industry and make it difficult sometimes to give advice in forums and blogs. I'd like to repost here (with a few tweaks for clarity).

Some examples of questions that breed sweeping generalizations:

  • Should you have clustered indexes on all tables? The well-known clustered-index debate as Kimberly likes to call it.
  • Should you rebuild or reorganize indexes to remove fragmentation?
  • Which high-availabilty solution should you use?

The problem - as with most advice - is that it's extremely hard to make generalizations. This is both because:

  1. without lots of evidence many people (quite rightly) don't believe sweeping generalizations as they may have been bitten by one in the past
  2. nearly every situation is different so many generalizations are useless

What I'd love to see, (and I tried to do this when at MS, and like to think I do it here or when teaching classes or conferences) is for people to provide the justification for generalizations, plus some idea of the exceptions and the circumstances under which they arise.

As for this case (whether to create multiple files because there are multiple cores/CPUs), I think we've about done this one to death. The sweeping generalizations here are:

  1. for non-tempdb you usually don't need multiple files, unless you have a very high-end workload of the specific nature I described in my first post (rare)
  2. for tempdb you usually do, as long as your workload merits it on a multi-core/cpu box
  3. IO vendors may recommend it for increased IO throughput *on their specific hardware*
  4. there exist sweeping generalizations from various sources that dispute all of the above

Unfortunately, you're not going to get a definitive, authoritative answer to a design/strategy question such as this and you'll continue to find contradictions to anything anyone says on the forums, and even MS contradicting itself (sigh).

What I would suggest is the following:
1) go with the majority opinion of responses to questions asked, based on the respondents collective experience with many customers, databases, and workloads
2) do your own testing, on your own hardware, with your own workload and see what works for you (but beware that testing in a vacuum can prove or disprove anything you want - which is why you see so many contradictory statements)

One last thing on MS - it's a very big company, with lots of groups. Anyone can sponsor a whitepaper, write a blog post/MSDN article/technet article and publish it, or reply on a forum as a visible MS person and it has the 'official stamp' of coming from MS. When I was in the product group I was continually dismayed by the misinformation, bad advice, contradictions, and baseless assertions that I saw coming from MS employees who were just trying to be helpful.

Once something's published on the internet, it's *incredibly* hard to undo the damage done. There's a fundamental element of mistrust sometimes on forums and newsgroups which can be wearying when you're trying to help people out. It can be very hard to convince people that someone else's advice isn't the best to follow - I remember several times arguing with people about how CHECKDB works or what a corruption error message means and finally having to resort to 'I wrote that code - I'm afraid you *are* wrong' - which I really hate doing.

Anyway - rant over :-)

There's been a very interesting discussion going on over at SQLServerCentral.com about whether to create multiple files for a user database because the server has multiple CPUs - see the thread here. I wrote a couple of long responses during the thread that I wanted to duplicate here as I think it's of broad interest.

My first response was:

Doesn't make sense to split any database into multiple files for per-cpu performance, except tempdb which can suffer from multiple-CPUs all trying to modify the same allocation bitmaps under high load with small temp tables being created/deleted (see http://www.microsoft.com/technet/prodtechnol/sql/2005/workingwithtempdb.mspx for details).

Now, saying that, there's an exception - and that's when a non-tempdb database has the same allocation bitmap contention issues, but that only happens on the very high-end with thousands of single-row inserts per second happening on each CPU. It's pretty rare. I've never seen it but Kimberly has.

What you should be considering is splitting a database into filegroups to be able to isolate critical tables within separate filegroups, or splitting a partitioned table across multiple filegroups. For SS2005, these aren't for performance but instead for manageability (e.g. being able to restore a single filegroup offline while the rest of the database remains online, or setting several partitions of a table to be read-only). For SS2008, there are improvements to query plan generation over multiple partitions that could allow for better parallelism, but its still not optimal compared to non-partitioned data.

So - complex topic and these are simple guidelines. Hope they help.

This was followed by a discussion pointing out that various hardware vendors imply the opposite, as do several MS sources. So here was my latest response:

Hi folks,

What's really not made clear in the MS literature is that the multiple-files in a database really only applies to tempdb. Even on tempdb, on 2005 you don't need one file per core - more like 1/4 -1/2 the number of files as there are cores.

The tempdb problem is this - common workloads create and drop many worktables per second. The allocations that happen for a table are initially single-pages (as opposed to whole extents). This means that a search of the SGAM page for the 4GB interval needs to be done to find a mixed-extent with a free page to allocate. Multiple CPUs all hammering this page cause contention on it and performance problems. Then, a page needs to be allocated for the first IAM page - same thing happens. Then these pages need to be marked allocated in a PFS page - same thing happens. And then these pages need to inserted into the sysindexes row for the tabel - more contention. On 2000 this was particularly bad - so T1118 plus multiple files was the solution, where SQL Server would round-robin the single page allocations in the files in tempdb, alleviating the contention somewhat.

In SQL Server 2005, we changed the temp table mechanism so that whenever a temp table is dropped, one data page, one IAM page, and the system table entries (no longer sysindexes, but instead is a 'hidden' table called sys.allocation_units) are cached. When a new temp table is allocated, if there's a cached 'template temp table' it is picked up and used without so much contention on the various allocation bitmaps. On a heavily loaded system there can still be contention and so you still need multiple files for an SMP box, but just not so many. And you don't need T1118 any more.

So - this is more prevalent on tempdb, but CAN happen on a user database under extreme load on monster hardware. Testing should show whether this is happening to you - if not, don't create multiple files for performance.

Now, in terms of what works for your particular vendors IO scalability - that's beyond me and you may need to think about that if its something they recommend. However - I'd still take it with a pinch of salt and do your own testing. See the whitepaper at http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/pdpliobp.mspx for some testing/banchmarking info.

Hope this helps clarify a bit more - great discussion!

I'm interested to hear any other opinions on this - especially cases where you've found it necessary to create multiple files for performance.

Thanks!

There's a new whitepaper on TechNet that I've just come across (even though it was published 6 months ago!) called Predeployment I/O Best Practices. It's really good - discussing the following:

  • Guidelines for determining I/O capacity
  • Disk configuration best practices and common pitfalls
  • Using SQLIO to determine capacity and interpreting its results
  • Using System Monitor to monitor an IO subsystem

You can read it here and I've added a link to our whitepapers page.

While I'm on the subject of I/O, Bob Dorr (A Senior Escalation Engineer in PSS) published a blog post last year that debunks a couple of urban legends around SQL Server's IO, specifically:

  • The myth that SQL Server used one thread per data file
  • The myth that a disk queue length greater than 2 indicates an I/O bottleneck

Check out his blog post and the subsequent discussion here.

Quickie today as I'm preparing to speak at the Pacific Northwest SQL Server User Group meeting tonight on the MS Campus here in Redmond (my trademark Detection and Recovery from Database Corruptions talk).

SQL Server 2005 introduced the concept of hot-add memory, to allow for dynamic workload handling. SQL Server 2008 increases these capabilities by adding hot-add CPU as well. 'Hot-add' means being able to plug in a CPU while the machine is running and then reconfigure SQL Server to make use of the CPU ONLINE! (i.e. no application downtime required at all)

There are a few restrictions:

  • You need a 64-bit system that support hot-add CPU (obviously :-))
  • You need Enterprise Edition of SQL Server 2008
  • You need Windows Server Datacenter or Enterprise Edition

When you plug in the new CPU, SQL Server won't automatically start using it. If you think about it, it can't - you may not want that CPU to be used by SQL Server - so it has to be told that it can use it. You do that by setting the appropriate affinity masks and then running a RECONFIGURE.

Ok - now we come to the bit that needs the "(and affinity masks)" in the title. What's an affinity mask? In a nutshell, it's a bitmap of all the CPUs on the machine that specifies which ones are available for general SQL Server Engine use, and which ones are available only for SQL Server I/O. There are two kinds of masks, a regular affinity mask, and an affinity I/O mask. The regular affinity mask, if all zeroes, says that Windows decides who get's what CPU when. If the affinity mask is non-zero, then there's a bit per CPU. If it's set to 1 then SQL Server can use the CPU. If a bit is set to 1 in the affinity I/O mask, then the CPU can only be used for I/O. A bit cannot be set in both masks. Now - an affinity mask is 32-bits wide, so if you have more than 32 CPUs, you need to use two more affinity masks, called affinity64 and affinity64 I/O. These do the same thing but for CPUs 32-63 on the machine.

So, very cool, especially for those of you that can afford such hardware - I can't so I don't have a box to test it on (the 64-bit server we have here at SQL skills doesn't support it).

Back in September last year I blogged about the native Backup Compression that is in SQL Server 2008 (see here) and promised to blog more when CTP-5 came out and I ran some tests. Well, it's here and I have so here's some data for you.

I expanded the AdventureWorks database to be 322Mb (random size, but big enough to get a decent sized run-time on my server). I used System Monitor to capture %user-mode CPU time, plus backup/restore throughput for a compressed and uncompressed backup operation, and then restores.

1) For the uncompressed backup the average CPU was 5% (the green line at the bottom), the run-time was 39.5s, and, of course, it took 322Mb to store the backup.

2) For the compressed backup the average CPU was way higher at 25%, BUT the run-time was 21.6s (a 45% improvement), and the backup was stored in 76.7MB (a 4.2x compression ratio). Very cool.

3) For the restore of the uncompressed backup the average CPU was 8%, and the run-time was 71.0s.

4) For the restore of the compressed backup the average CPU was 14.5%, and the run-time was 36s (a 50% improvement).

So - to summarize, turning on compression means more CPU and smaller run-times - just what was expected. Note that if you try this on your database you will see different results - the compression ratio and CPU usage is entirely dependent on the data being compressed.

While we were in Barcelona we sat down with Richard Campbell and Greg Hughes from RunAs Radio to record a 1/2 hour interview on SQL Server 2008. We touch on a ton of different features (look at the number of Categories I've tagged this with!) and have a bunch of laughs along the way - check it out here.

PS There's been a ton of interest in the slide deck idea I had so we'll be going ahead with that. Look for an announcement sometime in the first few months of next year about how to get them. Thanks to everyone that replied!

This is a subject I posted about last year on my old blog but it came up at SQL Connections last week several times so I want to repost it for those who’ve just started following my blog.

There's only one time when you should be trying to work out how long a CHECKDB is going to take - when you're planning your regular database maintenance. If you're faced with a corrupt (or suspected corrupt) database and you're only just starting to think about how long a CHECKDB is going to take - you've made a mistake while planning your disaster recovery strategy. You always need to know how long CHECKDB takes (on average) to run for your database so:

  • You can tell whether a particular run of CHECKDB is taking longer than usual - a sign that it's found some corruption
  • You know how long it will take to get results in a disaster recovery situation

At every conference I go to, someone asks me how long CHECKDB will take to run on their database. There are several ways I could answer this:

  • The unhelpful answer - I've got no idea.
  • The almost-helpful answer - how long did it take to run last time and are the conditions exactly the same?
  • The answer I usually give - it depends.

Now, many people would see the third answer as being somewhat equivalent to the first answer - unhelpful. The problem is that there are many factors which influence how long CHECKDB will take to run. Let me explain the ten most important factors so you get an idea why this is actually a helpful answer. These aren't in any particular order of importance.

1) The size of the database

Pretty obvious... CHECKDB has to read every allocated page in the database so the bigger it is, the longer it will take to read all the pages.

2) Concurrent IO load on the server

At the simplest level, what is CHECKDB going to do? It reads every allocated page in the database. That's a lot of IO. CHECKDB takes great pains to do the most efficient IO it can and read the database pages in their physical order with plenty of readahead so that the disk heads move smoothly across the disks (rather than jumping around randomly and incurring disk head seek delays). If there's no concurrent IO load on the server, then the IOs will be as efficient as CHECKDB can make them. However, introducing any additional IO from SQL Server means that the disk heads will be jumping around - slowing down the CHECKDB IOs. If the IO subsystem is at capacity already from CHECKDB's IO demands, any additional IO is going to reduce the IO bandwidth available to CHECKDB - slowing it down.

3) Concurrent CPU activity on the server

At the next level of simplicity, CHECKDB is going to process every page it reads in some way. Depending on the various options you've specified and the database schema (details below), that's going to use a lot of CPU - it's possible that the server may be pegged at 100% CPU when CHECKDB is running. If there's any additional workload on the server, that's going to take CPU cycles away from CHECKDB and it going to slow it down.

Basically what points #2 and #3 are saying is that CHECKDB is very resource intensive! It’s probably one of the most resource intensive things you can ask SQL Server to do and so it's usually a good idea to not run it during peak workload times, as you'll not only cause CHECKDB to take longer to run, you will slowdown the concurrent workload, possibly unacceptably.

4) Concurrent update activity on the database

This is relevant for both SQL 2000 and SQL 2005, but for different reasons.

In SQL 2000, CHECKDB gets its consistent view of the database from transaction log analysis of concurrent DML transactions (see here for details). The more concurrent DML there is while CHECKDB is running, the more transaction log will be generated - and so the longer it will take for CHECKDB to analyze that transaction log. It's possible that on a large multi-CPU box with a ton of concurrent DML and CHECKDB limited to a single CPU that this phase of CHECKDB could take several times longer than the reading and processing of the database pages! (I've seen this in real-life several times.)

In SQL 2005, CHECKDB gets its consistent view of the database from a database snapshot, which is stored on the same disk volumes as the database itself. If there are a lot of changes in the database while CHECKDB is running, the changed pages are pushed to the snapshot so that it remains consistent. As the snapshot files are stored in the same location as the database files, every time a page is pushed to the snapshot, the disk head has to move, which interrupts the efficient IO described in #2. Also, whenever CHECKDB goes to read a page and it needs to read the page from the snapshot files instead of the database files, that's another disk head move, and another efficient IO interruption. The more concurrent changes to the database, the more interruptions to efficient IO and the slower that CHECKDB runs.

5) Throughput capabilities of the IO subsystem

This one's simple. CHECKDB is going to do a boat-load of IOs and it could even end up being IO-bound (meaning that the CPUs are idle periodically waiting for IOs to complete) depending on the options specified and the database schema. This means that the throughput of the IO subsystem is going to have a direct effect on the run-time of CHECKDB. so, if you have a 1TB database and the IO subsystem can only manage 100MB/sec, it's going to take almost 3 hours just to read the database (1TB / 100MB / 3600 secs) and there's nothing you can do to speed that up except upgrade the IO subsystem.

I've lost count of the number of times I've heard customers complain that CHECKDB (or index rebuilds or other IO-heavy operations) are running sloooowly only to find that the disk queue lengths are enormous and the IO subsystem it entirely unmatched to the server and workload.

6) The number of CPUs (processing cores) on the box

This also really encompasses the Edition of SQL Server that's being run. In Enterprise Edition, CHECKDB can run in parallel across all the CPUs in the box (or as many as the query processor decides to parallelize over when the CHECKDB internal queries are compiled). Running in parallel can give a significant performance boost to CHECKDB and lower run times, as long as the database is also spread over multiple files too (so the IOs can be parallelized). There's a nifty algorithm that’s used that allows CHECKDB to run in parallel which I'll explain in detail in a future post.

On the other hand, the fact that CHECKDB can run in parallel in Enterprise Edition can be bad for some scenarios, and so some DBAs chose to force CHECKDB to be single-threaded. SAP typically recommends this to help with user query predictability. The way to do this is to turn on the documented trace flag 2528.

7) The speed of the disks where tempdb is placed

Running CHECKDB against a VLDB uses lots of memory for internal state and for VLDBs the memory requirement usually exceeds the amount of memory available to SQL Server. In this case, the state is spooled out to tempdb and so the performance of tempdb can be a critical factor in CHECKDB performance. See this post for more details of this and how CHECKDB can run out of disk space if tempdb is too small.

8) The complexity of the database schema

This can have a really high impact on the run-time of CHECKDB because it impacts the amount of CPU that CHECKDB requires. For example, the most expensive checks that CHECKDB does are for non-clustered indexes. It needs to check that each row in a non-clustered index maps to exactly one row in the heap or clustered index for the table, and that every heap/clustered index row has exactly one matching row in each non-clustered index. Although there's a highly efficient algorithm for doing this, it still takes around 30% of the total CPU that CHECKDB uses!

There are a bunch of other checks that are only done if the features have been used in the database - e.g. computed column evaluation, links between off-row LOB values, Service Broker, XML indexes, indexed views - so you can see that empirical factors along aren't enough to determine the run-time.

9) Which options are specified

This is almost the same as #7 in that by specifying various options you're limiting what checks CHECKDB actually performs. For instance, using the WITH NOINDEX option will turn off the non-clustered index checks that I described in #7 and using the WITH PHYSICAL_ONLY option will turn off all logical checks, vastly decreasing the run-time of CHECKDB and making it nearly always IO-bound rather than CPU-bound (in fact this is the most common option that DBAs of VLDBs use to make the run-time of CHECKDB manageable).

One thing to be aware of - if you specify any repair options, CHECKDB always runs single-threaded, even on a multi-proc box on Enterprise Edition.

10) The number and type of corruptions that exist in the database

Again, this is similar to #7 and #8. If there are any corruptions present, there may be extra checks triggered to try to figure out more details of the corruptions. For instance, for the non-clustered index checks, the algorithm is tuned very heavily for the case when there are no corruptions present (the overwhelming majority of cases considering the millions of times CHECKDB is run every day around the world). When a non-clustered index corruption is detected, a more in-depth algorithm has to be used to figure out exactly where the corruption is, which involves re-scanning a bunch of data and so taking a bunch more time. There are a few other algorithms like this too.

Summary

So you can see that there's no simple answer.

I’m in the middle of a flight from Washington D.C. to Zurich on the way to Barcelona for TechEd IT Forum and I can’t sleep – Kimberly’s out like a light so what else is there to do except write another blog post? :-) OK - actually posting this from Barcelona on Tuesday before our first of 12 sessions here!

In the Database Maintenance workshop we did at SQL Connections last week I promised to blog a script I used to show how data file shrink operations cause massive fragmentation of indexes. The reason is that data file shrink starts at the end of the data file and moves a single page at a time to a free space below the shrink threshold. In the process of doing so, it perfectly reverses the physical order of the pages comprising the leaf level of an index – thus perfectly fragmenting it!!

Let’s try out my simple script that demonstrates this. First thing I’m going to do is create a new database and create a 10MB ‘filler’ table, which I’m going to then drop later to create a space that shrink can use.

USE MASTER;

GO

 

IF DATABASEPROPERTYEX ('shrinktest', 'Version') > 0

      DROP DATABASE shrinktest;

 

CREATE DATABASE shrinktest;

GO

USE shrinktest;

GO

 

SET NOCOUNT ON;

GO

 

-- Create and fill the filler table

CREATE TABLE filler (c1 INT IDENTITY, c2 VARCHAR(8000))

GO

DECLARE @a INT;

SELECT @a = 1;

WHILE (@a < 1280) -- insert 10MB

BEGIN

      INSERT INTO filler VALUES (REPLICATE ('a', 5000));

      SELECT @a = @a + 1;

END;

GO

Next I’ll create the ‘production’ table that we’d really like to keep in optimal shape for performance.

-- Create and fill the production table

CREATE TABLE production (c1 INT IDENTITY, c2 VARCHAR (8000));

CREATE CLUSTERED INDEX prod_cl ON production (c1);

GO

DECLARE @a INT;

SELECT @a = 1;

WHILE (@a < 1280) -- insert 10MB

BEGIN

      INSERT INTO production VALUES (REPLICATE ('a', 5000));

      SELECT @a = @a + 1;

END;

GO 

Now I’ll use the sys.dm_db_index_physical_stats DMV to check the fragmentation of the production table’s clustered index – it should be almost zero:

-- check the fragmentation of the production table

SELECT avg_fragmentation_in_percent, fragment_count FROM sys.dm_db_index_physical_stats (

      DB_ID ('shrinktest'), OBJECT_ID ('production'), 1, NULL, 'LIMITED');

GO

avg_fragmentation_in_percent fragment_count

---------------------------- --------------------

0.390930414386239            6

This is what I expected. Now I’m going to drop the filler table, run a shrink operation and then check the fragmentation again:

-- drop the filler table and shrink the database

DROP TABLE filler;

GO

 

-- shrink the database

DBCC SHRINKDATABASE (shrinktest);

GO

 

-- check the index fragmentation again

SELECT avg_fragmentation_in_percent, fragment_count FROM sys.dm_db_index_physical_stats (

      DB_ID ('shrinktest'), OBJECT_ID ('production'), 1, NULL, 'LIMITED');

GO

avg_fragmentation_in_percent fragment_count

---------------------------- --------------------

99.7654417513683             1277

Wow! The index went from almost 0% fragmented to almost 100% fragmented – the shrink operation totally reversed the physical ordering of the leaf level of the clustered index – nasty.

One of the common maintenance operations I see at customer sites is to run a database shrink at some interval, and I always advise against it – now you can see why. Running a regular shrink operation can cause horrible fragmentation problems. The worst problems I see are those customers with maintenance plans that rebuild all indexes and then run a shrink to remove the extra space necessary for the index rebuilds – completely undoing the effects of the index rebuild!

One other common thing I see is to have auto-shrink set on for one or databases. This is bad for several reasons:

  • Shrink causes index fragmentation, as I’ve just demonstrated above.
  • You can't control when it kicks in. Although it doesn't have any effect like long-term blocking, it does take up a lot of resources, both IO and CPU. It also moves a lot of data through the buffer pool and so can cause hot pages to be pushed out to disk, slowing things down further. If the server is already pushing the limits of the IO subsystem, running shrink may push it over, causing long disk queue lengths and possibly IO timeouts.
  • You're likely to get into a death-spiral of auto-grow then auto-shrink then auto-grow then auto-shrink... (in my experience, if someone is using auto-shrink, they're most likely using and relying on auto-grow too). An active database usually requires free space for normal operations - so if you take that free space away then the database just has to grow again. This is bad for several reasons:
    • Repeatedly shrinking and growing the data files will cause file-system level fragmentation, which can slow down performance
    • It wastes a huge amount of resources, basically running the shrink algorithm for no reason
    • Auto-grow itself can be bad, especially if you're using SQL Server 2000 (or don't have Instant File Initialization turned on - see this post from Kimberly's blog) where all allocations to the file being grown are blocked while the new portion of the file is being zero-initialized.

Bottom-line: auto-shrink should *NEVER* be turned on…

This is a question I was asked multiple times over the last week: the Microsoft guidelines for database mirroring say not to mirror more than 10 databases per instance - why is that and is it true?

The answer is my favorite 'it depends!'. The number 10 is a rough guess at the sweet-spot for the majority of customers based on hitting a thread limit on 32-bit machines. The factors that need to be considered are:

  • How much memory do the principal and mirror instances have? (hopefully the same)
  • How much processing-power do the principal and mirror instances have? (hopefully the same)
  • How much bandwidth does the IO subsystem have on the mirror instance? (hopefully the same as on the principal)
  • How much transaction log does the workload on each database generate?
  • How much network bandwidth is available between the principal and the mirror instances?

The last two factors are the most critical. If the network bandwidth available between the two instances is not enough to handle the combined transaction log generation rate per second from all databases being mirrored then performance will drop on the principal databases. SQL Server 2008 does alleviate some of this with log stream compression - see here for details. The next most critical thing to consider is the memory and thread requirements for mirroring - each mirrored database takes one thread plus some memory, so on low-powered servers, lots of mirrored databases may be too much load on the server when combined with the regular workload.

Here are some examples that I've seen:

  • A customer with 25 databases, all of which have very small amounts of activity, and not all at the same time, has them all mirrored with no problem.
  • A customer with only 3 heavily-loaded databases, but without a great network connection, that can barely mirror one of the databases without the lack of network bandwidth causing workload degradation.

The key to success here is to do the log generation calculation and then if it seems that the available network bandwidth will support the number of databases you want to mirror, test it first before relying on it in production.

I guess the bottom-line here is that any broad guidance is only that - your mileage may (and probably will) vary. Always do your own calculations and testing.

[Edit 10/15/2009: Checkout the new KB article I helped write that discusses this in detail: http://support.microsoft.com/kb/2001270]

In one of the sessions Kimberly and I taught this week at SQL Connections, we discussed how to choose efficient data-types - I'd like to share the discussion here with an example.

The crux of the matter is in the amount of space required to store the data-types in the schema, and ensuring that the data-type choice matches the data being stored without incurring any unnecessary overhead. Here are a couple of examples:

  • A field that stores a person's age will usually be able to use a TINYINT type that can store a maximum value of 256 - unless its an archeological or historical database, for instance. Without putting a lot of thought into it, however, someone may choose to use an INTEGER type. What's the difference? A TINYINT take a single byte to store, whereas an INTEGER takes *4* bytes to store - wasting 3 bytes per record.
  • A field that stores a person's city of residence in the USA needs to be able to hold 24 characters (see my previous post on how long fields have to be) - so what data-type should you use? A CHAR (24) will be able to store all the possible values, but will *always* take up 24 bytes in the record as it's a fixed-length column. A VARCHAR (24), on the other hand will only store the number of bytes equal to the number of characters in the city name, so using the fixed-length type will waste a varying number of bytes per record.
  • In the USA, the names of the various States are commonly abbeviated to two characters - for instance, Washington = WA and California = CA. So is the best type to use CHAR (2) or VARCHAR (2)? Given that the abbreviations are always 2 characters, they'll both store 2 bytes all the time. BUT, a variable length column has a two-byte overhead (the two-byte pointer for the column that's stored in the variable-length column offset array in the record) - so in this case the best data-type to use is CHAR (2).

You can see how choosing data-types without considering whether it's the best type can lead to wasted space. Let's look at an example schema to support a population census of the USA.

CREATE TABLE CensusSchema1 (

SSN CHAR (256),

StateName CHAR (256),

Sex INTEGER,

Employed INTEGER,

SelfEmployed INTEGER,

EthnicOrigin INTEGER,

MaritalStatus INTEGER,

NumberOfDependents INTEGER,

Age INTEGER,

CountryOfBirth INTEGER,

HouseholdIncomeGroup INTEGER,

ZipCode5 INTEGER);

GO

At first glance this may look reasonable, but digging in you'll see that many of the columns are over-sized. Here's a cleaned-up schema to compare against, with notes on the per-column savings:

CREATE TABLE CensusSchema2 (

SSN CHAR (9), -- saving 244 bytes

StateName VARCHAR (256), -- saving at least 240 bytes (longest state name is 14 + 2-byte varchar overhead)

Sex BIT,

Employed BIT,

SelfEmployed BIT, -- saving 11 bytes altogether over these three fields

EthnicOrigin TINYINT, -- saving 3 bytes

MaritalStatus TINYINT, -- saving 3 bytes

NumberOfDependents TINYINT, -- saving 3 bytes

Age TINYINT, -- saving 3 bytes

CountryOfBirth TINYINT, -- saving 3 bytes

HouseholdIncomeGroup TINYINT, -- saving 3 bytes

ZipCode5 INTEGER); -- no saving

GO

The bad schema gives a per-row size of 574 bytes and the cleaned-up schema is no more than 48 bytes per-row. I designed these two schemas to support a US census. The population of the USA is approx. 300 million. This means the bad schema would take around 190GB to store all the info, and the cleaned-up schema only takes around 15GB - more than 12 times more efficient! Now we're starting to see how poor data-type choice can lead to poor performance.

Wider rows means:

  • Fewer rows can fit on an 8k page.
  • More CPU is necessary to crack open a record due to the number of CPU data cache line invalidations necessary to read the record into the CPU's various caches (every time a cache line is invalidated it takes a bunch of CPU cycles - see here for an explanation of CPUs and cache lines).

Less rows per page means:

  • More pages are needed to store the data
  • Indexes could have a smaller fan-out (if the index keys are wider than then need to be), leading to more levels in the index and less efficient index navigation from the index root page to the index leaf-level.

More pages means:

  • More IOs are necessary to read the same amount of actual data
  • More buffer pool memory is necessary to hold the same amount of actual data
  • More disk space is necessary to hold the same amount of actual data

Clearly there's a link between various aspects of workload performance and badly chosen data-types.

There's been some discussion over on the SQL Server Central forums about database mirroring performance, and one question in particular on how to do index rebuilds on mirrored VLDBs.

Remember that in database mirroring, the database has to be in the full recovery mode, so all index rebuilds are fully logged. It all comes down to the amount of transaction log generated and whether this causes a problem.

  • In synchronous mirroring, the additional log being generated could overload the network link between the principal and mirror. This will slow down the hardening of regular transactions in the mirror's transaction log, and thus led to a decrease in transaction throughput on the principal.
  • In asynchronous mirroring, the additional log being generated could again overload the network link - but this time there's no requirement for transactions to harden in the mirror before they can commit on the principal, so instead the SEND queue on the principal gets really large. Now, in asynchronous mirroring the SEND queue represents the amount of work that will be lost if a failover occurs, as its all the transaction log that hasn't yet been sent to the mirror. So, in asynchronous mode, a large index rebuild operation could lead to increase exposure to data loss in the event of a failover.

In SS2008, the log stream compression I blogged about here should go a long way to alleviating this problem. There are three ways I can think of to reduce the amount of transaction log generated by and index maintenance plan in SS2005:

  1. Use a potentially less expensive (in terms of logging) solution for removing index fragmentation. Doing an index rebuild is guaranteed to generate an equivalent amount of transaction log to the size of the index being rebuilt, no matter how much fragmentation there is - because an index rebuild always rebuilds the entire index. The alternative is to do an index reorganize, either using my old DBCC INDEXDEFRAG or the new ALTER INDEX ... REORGANIZE. These will only generate transaction log when index pages are compacted and reorganized - so for less heavily fragmented indexes. There's no hard and fast rule here but I generally say where Logical Scan Fragmentation/Average Fragmentation in Percent from DBCC SHOWCONTIG/sys.dm_db_index_physical_stats, respectively, is less than 30%. You also need to consider page density too - but really this is a topic for a whole other post.
  2. Be very selective on which indexes you choose to rebuild/reorganize. A lot of people have a maintenance plan that does this for every index every week, without checking whether the index is even fragmented or whether removing fragmentation for an index improves workload performance. Again, this a whole other topic but there is an old whitepaper for SS2000 I helped with that describes some of this - Microsoft SQL Server 2000 Index Defragmentation Best Practices.
  3. Partition the tables/indexes so that the changing portion of the data is the only portion that's affected by index maintenance. If most of your data is read-only, there's no point in having it included in reindex/reorganize operations, right? Kimberly recently wrote a blog post about such an architecture here.

For more info on database mirroring performance considerations, checkout the whitepaper Database Mirroring: Best Practices and Performance Considerations. There's also a slide deck presentation based on this whitepaper that's been presented at various TechEds (I did it in China and Hong Kong last year and Kimberly did it in South Africa last year) - you can download it from the Hong Kong website here.

The bottom line is when mirroring is in the mix, you need to be more intentional with your database maintenance.

As I mentioned in a previous post, one of the new features for database mirroring in SQL Server 2008 is log stream compression. There's a good explanation of what this is (along with some example workloads and performance graphs) on the SQL Customer Advisory Team blog so I'm not going to duplicate all that here.

Basically, the way mirroring works is by shipping all the transaction log from the principal to the mirror, and then replaying the log records on the mirror database. Network bandwidth can be a bottleneck for mirrored databases where the rate of transaction log generation is high enough to saturate the network. On synchronously mirrored databases (where a transaction cannot commit on the principal until all the transaction log for it has been hardened in the log of the mirror database), this can even lead to workload throughput degradation.

Log compression is a way to alleviate this problem. SS2008 Books Online states that compression rates of at least 12.5% are achieved - obviously the compression ratio could be much higher than that and is dependant on what's being compressed - i.e. the data that is being processed by the application. One of the tests they did for the post above shows a 5x compression ratio - that's pretty good.

The downside of this is that compression is not free - extra CPU must be used. The obvious CPU load increases comes from having to compress the transaction log on the principal and then decompress it again on the mirror before hardening it in the log. The non-obvious increase will come because the principal and mirror should end up being able to process more transactions/second, which means more CPU is needed.

This extra CPU could be a problem for some systems that are already pegged in terms of CPU load, and so they may actually see a drop in performance when log stream compression is enabled. Now, it's on by default when you upgrade to SS2008 but there is a way to turn it off. The blog post above divulges that there is a trace flag, 1462, that turns off log stream compression and effectively reverts the behavior back to SS2005.

Apart from the obvious benefits of this feature for systems that can afford the extra CPU, I'm excited because it may allow more mirrored databases to perform effective database maintenance - in the form of fragmentation control and index maintenance. When database mirroring is enabled, the database has to be running in the full recovery mode. This means that operations such as index rebuilds are fully logged - that's a lot of extra log being generated if you're used to switching to bulk-logged recovery mode to perform index operations. Of course, this isn't an issue for reorganizing an index (with ALTER INDEX ... REORGANIZE) as this has been fully logged (and has to be that way because of how it works) since I wrote the old DBCC INDEXDEFRAG for SS2000.

Anyway, for some customers the fact that these two operations are fully logged means that running them produces so much additional log that, in conjunction with the log from the workload itself, the network becomes a bottleneck for mirroring and it slows down both the maintenance and reduces the workload throughput. The addition of log stream compression means that for these systems, the log may be compressed enough so that the network is no longer a bottleneck and more regular fragmentation control can take place. This in turn will increase workload throughput as well - a double benefit!

Here's an interesting question I was sent by my friend Steve Jones over at SQL Server Central - will a single CPU with dual-cores perform better than two single-core CPUs? Both have two processing cores but the hardware architecture is different - which one will make SQL Server perform better? Well, there's no hard and fast answer - it depends! I had a discussion on this topic this morning with Jerome Halmans, part of my old team in the SQL Server Storage Engine and I'm basing this post on our discussion with his permission.

My hypothesis (which Jerome confirmed) was that the performance of the two architectures depends on the amount of cache line invalidations and how that is managed (see here for a description of CPU caches and cache lines).

  • On the single-core machine, cache line invalidations needs to go across the main bus between the two CPUs, involving bus arbitration delays.
  • On the dual-core machine, cache line invalidations don't go across the bus because the two cores are contained within the same CPU package. In fact, if the architecture is smart enough, it may be able to just remap the cache line from one processing core to the other, thus avoiding any data copying. I'm not sure if such an architecture exists though.

Here is a very interesting and accessible Intel article that discusses cache-sharing in multi-core Intel systems. In this paper at least, the L2 cache is shared but modifications made by different cores in their private L1 caches still need to bounce through the shared L2 cache before being loaded by the other core. This will still be WAY faster than having to go through main bus between single-core CPUs.

And here is a similar paper from AMD on their Barcelona multi-core architecture that describes each core having separate L1 and L2 caches, with an additional shared L3 cache. The seperate L2 caches are kind-of linked though, in that modifications to a cache line in one L2 cache are immediately mirrored in the other L2 caches (if needed).

But the amount of cache invalidations (of whatever kind) depends on the workload. The two types of workload to consider are:

  1. Where the workload has very independent characteristics, so the data being processed by a thread on one processing core is unrelated to that being processed by a thread on the other core. There should be very few cache line invalidations. In this case, the single-core CPUs will have all their local caches full of data relevant to just the thread running. The two cores on the dual-core CPU will need to share some level of on-chip cache and so their may well be more churn in the cache. In this case I'd expect the single-core CPUs to perform better.
  2. Where the workload is such that data is shared, and threads touch data being processed by others threads on other cores. In this case, the single-core CPUs will fall victim to massive amounts of cache line invalidations, whereas the dual-core CPUs will do on-chip cache line invalidation (of whatever type is supported by the architecture). In this case I'd expect the multi-core CPU to outperform the two single-core CPUs.

Saying that, the majority of workloads on SQL Server are of the second type above. Jerome mentioned that even synthetic workloads (such as the TPCC benchmark) are still going to result in multiple-threads accessing and changing the same data/index pages.

So - what's the conclusion? I expect that a multi-core CPU will outperform an equivalent number of single-core CPUs in most workloads. And as Jerome pointed out, even if that's not the case for your workload, you'll find it pretty hard to find a system that ships with single-core CPUs these days.

I'd love to hear any comments on this, especially any measurements you've done on workloads as I don't have any single-core machines available to run tests on - even the laptop I'm typing this on is a dual-core Centrino.

(This is an updated repost from earlier this year on my old blog)

Whenever I’m discussing index maintenance, and specifically fragmentation, I always make a point of saying ‘Make sure the index is being used before doing anything about fragmentation’. If an index isn’t being used very much, but has very low page density (lots of free space in the index pages), then it will be occupying a lot more disk space than it could do and it may be worth compacting (with a rebuild or a defrag) to get that disk space back. However, usually there’s not much point spending resources to remove any kind of fragmentation when an index isn’t being used. This is especially true of those people who rebuild all indexes every night or every week.

 

You could even go so far as to say if a non-clustered index isn’t being used, why is it there at all? Extra non-clustered indexes drag down performance in a number of ways. Consider a non-clustered index called IX_MyNCIndex on the table MyTable:

  • Any time a record is inserted into MyTable, a matching record is inserted into IX_MyNCIndex. This is a bunch of extra IOs, extra log records, plus maybe even a page-split.
  • Any time a record is deleted from MyTable, the matching record in IX_MyNCIndex must be deleted. Extra IOs again and log records again.
  • Any time a record in MyTable is updated:
    • If MyTable has a clustered index, and the clustered index key value changes, then the matching record in IX_MyNCIndex must be updated. Extra IOs and log records again.
    • If any of the non-clustered index key values changes, or any of the INCLUDEd column values changes, then the matching record in IX_MyNCIndex must be updated. Extra IOs and log records again.
  • If a clustered index is created on MyTable, then IX_MyNCIndex has to be rebuilt to include the logical RIDs rather than the physical heap RIDs (see this post for an explanation). Lot of extra IOs and log records again.

That’s a significant amount of extra IOs and log records to maintain each extraneous non-clustered index.

 

So, how can you tell if an index is being used?

  • In SQL Server 2000 there is no way to do it
  • In SQL Server 2005 there are a few different ways in SQL Server 2005 – the one I want to discuss in this post is the sys.dm_db_index_usage_stats DMV.

This DMV exposes the information that is tracked about index usage (as the name suggests). It does not generate any information itself; it just returns info from a cache inside SQL Server. This cache is empty when the server instance starts, and is not persisted across instance restarts. All cache entries for indexes in a database are removed when that database is closed. So, the cache tracks usage information about which indexes have been used since the database they are part of was last opened (either manually or as part of instance start-up).

 

This continues to confuse people so I'll call it out: if the output from the DMV does not have an entry for the index you're interested in, it has not been used since the last database startup.

 

The cache tracks the following info for each index (for user queries and system queries):

  • The number of times it was used in a seek operation (either looking up a single row, or doing a range scan) along with the time of the last seek.
  • The number of times it was used in a scan operation (e.g. a select * operation) along with the time of the last scan
  • The number of times it was used in a lookup operation (this means a bookmark lookup – where a non-clustered index does not fully cover a query and additional columns must be retrieved from the base table row) along with the time of the last lookup.
  • The number of times it was used in an update operation (this counts inserts, updates, and deletes) along with the time of the last update.

Let’s have a look at its use.

SELECT * FROM sys.dm_db_index_usage_stats;

GO

The output is too wide for a single image so I've split it in two (I won't post any more output from the DMV - I'll just talk about it):

indexusage11.jpg

indexusage21.jpg

Unless you've just re-started your instance, you'll see a bunch of output from this, representing all index activity since the instance/databases started. If you're interested in whether an index is being used, you can filter the output. Let's focus in on a particular table  - AdventureWorks.Person.Address.

SELECT * FROM sys.dm_db_index_usage_stats

WHERE database_id = DB_ID('AdventureWorks')

AND object_id = OBJECT_ID('AdventureWorks.Person.Address');

GO

You'll probably see nothing in the output, unless you've been playing around with that table. Let's force the clustered index on that table to be used, and look at the DMV output again.

SELECT * FROM AdventureWorks.Person.Address;

GO

SELECT * FROM sys.dm_db_index_usage_stats

WHERE database_id = DB_ID('AdventureWorks')

AND object_id = OBJECT_ID('AdventureWorks.Person.Address');

GO

Now there's a single row, showing a scan on the clustered index. Let's do something else.

SELECT StateProvinceID FROM AdventureWorks.Person.Address

WHERE StateProvinceID > 4 AND StateProvinceId < 15;

GO

SELECT * FROM sys.dm_db_index_usage_stats

WHERE database_id = DB_ID('AdventureWorks')

AND object_id = OBJECT_ID('AdventureWorks.Person.Address');

GO

And there's another row, showing a seek in one of the table's non-clustered indexes.

So, its easy to look at the index usage for particular tables and indexes. But how can you monitor this over time? This is easy too - let's see how. First we need to create our own table to store snapshots of the DMV output.

IF OBJECTPROPERTY (object_id (N'master.dbo.MyIndexUsageStats'), 'IsUserTable') = 1 DROP TABLE dbo.MyIndexUsageStats;

GO

SELECT GETDATE () AS ExecutionTime, * INTO master.dbo.MyIndexUsageStats

FROM sys.dm_db_index_usage_stats WHERE database_id=0;

GO

Next we need to take a baseline snapshot of the DMV output.

INSERT master.dbo.MyIndexUsageStats

SELECT getdate (), * FROM sys.dm_db_index_usage_stats;

GO

And now simulate a few operations and take another snapshot of the DMV:

SELECT * FROM AdventureWorks.Person.Address;

GO

SELECT * FROM AdventureWorks.Person.Address;

GO

SELECT StateProvinceID FROM AdventureWorks.Person.Address

WHERE StateProvinceID > 4 AND StateProvinceId < 15;

GO

INSERT master.dbo.MyIndexUsageStats

SELECT getdate (), * FROM sys.dm_db_index_usage_stats;

GO

And look at the filtered contents of our snapshot table:

SELECT * FROM master.dbo.MyIndexUsageStats

WHERE database_id = DB_ID('AdventureWorks')

AND object_id = OBJECT_ID('AdventureWorks.Person.Address');

GO

You should see four rows - two from the baseline snapshot and two from the final snapshot. If you ran just the statements above, you'll see that the user_scans count for the clustered index has increased by two, and the user_seeks count for the non-clustered index has increased by one.

So this is a pretty simple example of how you can track index usage. By putting something like this into a regularly run script you can tell which indexes aren't being used and could be candidates for less-regular index maintenance or removal altogether.

Let me know how you get on.

In SQL Server 2005, queries over partitioned tables use a single-thread per partition. This can cause performance problems under certain circumstances:

  1. On systems with many CPUs, if the table schema has less partitions than there are CPUs, then not all the CPUs will be used to process the query. Some examples:
    1. On a 32-way box, a query over a 12-partition table (e.g. a sales table partitioned by month) will only use 12 threads (one on each of 12 CPUs). This means 20 CPUs are potentially idle.
    2. On a 4-way box, a query over a 12-partition table only accesses a single partition so will only use one thread (on one CPU). This means 3 CPUs are potentially idle.
  2. On tables that have skewed data such that one partition is much larger than another, the length of time the query takes to complete will be bounded by the single thread processing the largest partition.

As part of the set of improvements in SQL Server 2008 for data warehousing there will be an option to change the threading behavior for queries over partitioned tables. The new, alternative model is that all available threads process part of each partition and then move into the next partition. This allows all available CPUs to take part in processing the query, which should lead to a drop in the query completion time.

The only time this model won't work is if the data is not in the buffer pool and is not spread out evenly across the available drives. For example, if an entire partition is stored on a single drive, then multiple threads will be scanning different portions of the drive, causing the disk head to thrash and IO throughput to drop sharply compared with a single thread driving the IO. For this reason, the option to use the new model will be off by default, to avoid surprising people with sudden bad performance after upgrading.

This should be available in the next CTP and then I'll post again with some example datasets and queries to see what the potential benefits and drawbacks are.

Part of the improvements to database mirroring in SQL Server 2008 are the addition of 10 new performance counters under the Database Mirroring Performance Object. These add some cool troubleshooting capabilities.

SQL Server 2005 provided the following 11 counters (from Books Online):

Name Description

Bytes Received/Sec

Number of bytes received per second.

Bytes Sent/sec

Number of bytes sent per second.

Log Bytes Received/sec

Number of bytes of log received per second.

Log Bytes Sent/sec

Number of bytes of log sent per second.

Log Send Queue

Total number of bytes of log that have not yet been sent to the mirror server.

Pages Sent/sec

Number of pages sent per second.

Receives/sec

Number of mirroring messages received per second.

Redo Bytes/sec

Number of bytes of log rolled forward on the mirror database per second.

Redo Queue

Total number of bytes of hardened log that currently remain to be applied to the mirror database to roll it forward. This is sent to the Principal from the Mirror.

Sends/sec

Number of mirroring messages sent per second.

Transaction Delay

Delay in waiting for unterminated commit acknowledgement.

SQL Server 2008 now provides 21 counters, with the new ones highlighted in red. This info is taken from the 2008 July CTP 08Books Online that is downloadable here.

Name Description

Bytes Received/sec

Number of bytes received per second.

Bytes Sent/sec

Number of bytes sent per second.

Log Bytes Received/sec

Number of bytes of log received per second.

Log Bytes Redone from Cache/sec

Number of redone log bytes that were obtained from the mirroring log cache, in the last second.

This counter is used on only the mirror server. On the principal server the value is always 0.

Log Bytes Sent from Cache/sec

Number of sent log bytes that were obtained from the mirroring log cache, in the last second.

This counter is used on only the principal server. On the mirror server the value is always 0.

Log Bytes Sent/sec

Number of bytes of log sent per second.

Log Compressed Bytes Rcvd/sec

Number of compressed bytes of log received, in the last second.

Log Compressed Bytes Sent/sec

Number of compressed bytes of log sent, in the last second.

Log Harden Time (ms)

Milliseconds that log blocks waited to be hardened to disk, in the last second.

Log Remaining for Undo KB

Total kilobytes of log that remain to be scanned by the new mirror server after failover.

This counter is used on only the mirror server during the undo phase. After the undo phase completes, the counter is reset to 0. On the principal server the value is always 0.

Log Scanned for Undo KB

Total kilobytes of log that have been scanned by the new mirror server since failover.

This counter is used on only the mirror server during the undo phase. After the undo phase completes, the counter is reset to 0. On the principal server the value is always 0.

Log Send Flow Control Time (ms)

Milliseconds that log stream messages waited for send flow control, in the last second.

Sending log data and metadata to the mirroring partner is the most data-intensive operation in database mirroring and might monopolize the database mirroring and Service Broker send buffers. Use this counter to monitor the use of this buffer by the database mirroring session.

Log Send Queue KB

Total number of kilobytes of log that have not yet been sent to the mirror server.

Mirrored Write Transactions/sec

Number of transactions that wrote to the mirrored database and waited for the log to be sent to the mirror in order to commit, in the last second.

This counter is incremented only when the principal server is actively sending log records to the mirror server.

Pages Sent/sec

Number of pages sent per second.

Receives/sec

Number of mirroring messages received per second.

Redo Bytes/sec

Number of bytes of log rolled forward on the mirror database per second.

Redo Queue KB

Total number of kilobytes of hardened log that currently remain to be applied to the mirror database to roll it forward. This is sent to the Principal from the Mirror.

Send/Receive Ack Time

Milliseconds that messages waited for acknowledgement from the partner, in the last second.

This counter is helpful in troubleshooting a problem that might be caused by a network bottleneck, such as unexplained failovers, a large send queue, or high transaction latency. In such cases, you can analyze the value of this counter to determine whether the network is causing the problem.

Sends/sec

Number of mirroring messages sent per second.

Transaction Delay

Delay in waiting for unterminated commit acknowledgement.

Here's a little more explanation and what you can use these new performance counters to troubleshoot:

  • Log Bytes Redone from Cache/sec
    • This measures how much of the transaction log in the redo queue is being read by the log redo task from the mirror's in-memory transaction log cache. Reading from the cache is a lot faster than having to read from the mirror's actual transaction log. Even though the log gets hardened on the mirror database's log disk, it does not need to be removed from the cache until the cache fills up with new transaction log from the principal.
    • You could think of this as a cache hit ratio measure for the redo queue.
    • If this number is lower than usual, it means that transaction log is arriving from the principal faster than the log redo task can roll forward the transaction log in the redo queue.
  • Log Bytes Sent from Cache/sec
    • This is similar to the counter above. It measures how much of the transaction log being sent from the principal to the mirror is being read from the principal's in-memory transaction log cache. Sending from the cache is a lot faster than having to go to the transaction log itself and read from disk.
    • You could think of this as a cache hit ratio for the send queue.
    • If this number is lower than usual it means that the transaction log is being generated on the principal faster than it can be sent to the mirror.
  • Log Compressed Bytes Rcvd/sec
  • Log Compressed Bytes Sent/sec
    • These are simple counters and can be used with the Log Bytes Sent/sec and Rcvd/sec counters to determine the compression ratio. Log stream compression is another enhancement in SQL Server 2008 that I'll cover in a future post.
  • Log Harden Time (ms)
    • This measures the delay between the mirror server receiving a chunk of transaction log and it being hardened on the mirror database's log disk (i.e. the delay before the chunk of transaction log becomes part of the redo queue on the mirror server).
    • If this number is higher than normal it means the mirror database's log disk is more heavily loaded and may be  becoming saturated.
  • Log Remaining for Undo KB
  • Log Scanned for Undo KB
    • The Books Online entries for these counters are self-explanatory.
    • These counters give a way to monitor the undo phase after a failover occurs.
  • Log Send Flow Control Time (ms)
    • This measures how long a mirroring connection had to wait before it could us the mirroring flow control buffer.
    • If this number is higher than normal it means there is contention for the buffer, most likely because there are too many Database Mirroring partnerships running from a single instance.
  • Mirrored Write Transactions/sec
    • As Books Online mentions, this counts the number of transactions in the principal database that had to wait for a commit record to harden in the mirror database's transaction log.
    • If this value is lower than normal (for the same application workload) it means there is a bottleneck somewhere in the system.
  • Send/Receive Ack Time
    • As Books Online mentions, this can be used to measure network latency between the principal and mirror servers.
    • If this value is larger than normal it means that there is a network bottleneck between the principal and mirror servers.

Hopefully Microsoft will publish a whitepaper or some troubleshooting scenarios showing these counters being used.

Theme design by Nukeation based on Jelle Druyts