Way back in the mists of time, at the end of the last century, I wrote DBCC SHOWCONTIG for SQL Server 2000, to complement my new invention DBCC INDEXDEFRAG.

I also used to wear shorts all the time, with luminous orange, yellow, or green socks.

Many things change - I now have (some) dress sense, for one. One other thing that changed was that DMVs came onto the scene with SQL Server 2005. DBCC SHOWCONTIG was replaced by sys.dm_db_index_physical_stats. Under the covers though, they both use the same code - and the I/O characteristics haven't changed.

This is a blog post I've been meaning to do for a while now, and I finally had the impetus to do it when I heard about today's T-SQL Tuesday on I/O in general being run by Mike Walsh (Twitter|blog). It's a neat idea so I decided to join in this time. In retrospect, reading this over before hitting 'publish', I got a bit carried away (spending two hours on this) - but it's one of my babies, so I'm entitled to! :-)

This isn't a post about how to use DMVs in general, how to use this DMV in particular, or anything about index fragmentation. This is a blog post about how the DMV works.

DMV is a catch-all phrase that most people (myself included) use to describe all the various utility views in SQL Server 2005 and 2008. DMV = Dynamic Management View. There's a catch with the catch-all though - some of the DMVs aren't views at all, they're functions. A pure DMV gets info from SQL Server's memory (or system tables) and displays it in some form. A DMF, on the other hand, has to go and so some work before it can give you some results. The sys.dm_db_index_physical_stats DMV (which I'm going to call 'the DMV' from now on) is by far the most expensive of these - but only in terms of I/O.

The idea of the DMV is to display physical attributes of indexes (and the special case of a heap) - to do this it has to scan the pages comprising the index, calculating statistics as it goes. Many DMVs support what's called predicate pushdown, which means if you specify a WHERE clause, the DMV takes that into account as it prepares the information. This DMV doesn't. If you ask it for only the indexes in the database that have logical fragmentation > 30%, it will scan all the indexes, and then just tell you about those meeting your criteria. It has to do this because it has no way of knowing which ones meet your criteria until it analyzes them - so can't support predicate pushdown.

This is where understanding what it's doing under the covers comes in - the meat of this post.

LIMITED

The default operating mode of the DMV is called LIMITED. Kimberly always makes fun of the equivalent option for DBCC SHOWCONTIG, which I named as a young and foolish developer - calling it WITH FAST. Hey - it's descriptive!

The LIMITED mode can only return the logical fragmentation of the leaf level plus the page count. It doesn't actually read the leaf level. It makes use of the fact that the next level up in the index contains a key-ordered list of page IDs of the pages at the leaf level - so it's trivial to examine the key-ordered list and see if the page IDs are also in allocation order or not, thus calculating logical fragmentation.

The idea behind this option is to allow you to find the fragmentation of an index by reading the minimum number of pages, i.e. in the smallest amount of time. This option can be magnitudes faster than using the DETAILED mode scan, and it depends on how big the index's fanout is. Without getting too much into the guts of indexes, the fanout is based on the index key size, and determines the number of child-page pointers an index page can hold (e.g. the number of leaf-level pages that a page in the next level up has information about).

Consider an index with a char(800) key. Each entry in a page in the level above the leaf has to include a key value (the lowest key that can possibly appear on the page being referred to), plus a page ID, plus record overhead, plus slot array entry - so 812 bytes. So a page can only hold 8096/812 = 9 such entries. The fanout is at most 9.

Consider an index with a bigint key. Each entry is 13 bytes, so a page can hold 8096/13 = 622 entries. The fanout is at most 622, but will likely be smaller, depending on operations on the index causing fragmentation at the non-leaf levels.

For a table with 1 million pages at the leaf level, the first index will have 1 million/9 = 111112 pages at least at the level above the leaf. The second index will have at least 1608 pages. The savings in I/O from using the LIMITED mode scan will clearly differ based on the fanout.

I've created a 100GB clustered index (on the same hardware as I'm using for the benchmarking series) with 13421760 leaf-level pages and a maximum fanout of 540. In reality, I populated the index using 16 concurrent threads, so there's some fragmentation. The level above the leaf has 63012 pages, an effective fanout of 213. Still, the LIMITED mode scan will read 213x less than a DETAILED scan, but will it be 213x faster?

Here's a perfmon capture of the LIMITED mode scan on my index:

 

There's nothing special going on under the covers in a LIMITED mode scan - the chain of pages at the level above the leaf is read in page-linkage order, with no readahead. The perfmon capture shows:

  • Avg. Disk Read Queue Length (light blue) is a steady 1.
  • Avg. disk sec/Read (pink) is a steady 4ms.
  • Disk Read Bytes/sec (green) is roughly 14.5million.
  • Page reads/sec (dark blue) is roughly 1800.

DETAILED 

The DETAILED mode does two things:

  • Calculate fragmentation by doing a LIMITED mode scan
  • Calculate all other statistics by reading all pages at every level of the index

And so it's obviously the slowest. It has to do the LIMITED mode scan first to be able to calculate the logical fragmentation, because it reads the leaf level pages in the fastest possible way - in allocation order. DBCC has a customized read-ahead mechanism for allocation order scans that it uses for this DMV and for DBCC CHECK* commands. It's *incredibly* aggressive and will hit the disks as hard as it possibly can, especially with DBCC CHECK* running in parallel.

Here's a perfmon capture of the DETAILED mode scan on my index:

 

Not quite as pretty as the LIMITED mode scan, but I like it :-) Here's what it's showing:

  • Avg. Disk Read Queue Length (black) is in the multiple hundreds. Clearly it's appetite for data is outstripping what my RAID array can do. It basically tries to saturate the I/O subsystem to get as much data as possible flowing into SQL Server.
  • Avg. disk sec/Read (pink line at the bottom) is actually measuring in whole seconds, rather than ms. Given the disk queue length, I'd expect that.
  • DBCC Logical Scan Bytes/sec (red) varies substantially as the readahead mechanism throttles up and down, but it's driving anywhere up to 80MB/sec. You can see around 9:49:20 AM when it drops to zero for a few seconds.
  • Readahead pages/sec (green) is tracking the DBCC scan. This is a buffer pool counter, the DBCC one is an Access Methods counter (the dev team I used to run during 2005 development). If I had Disk Read Bytes/sec and Pages reads/sec showing, they'd track the other two perfectly - I turned them off for clarity.

So the DETAILED mode not only reads more data, but it does it a heck of a lot more aggressively so has a much more detrimental effect on the overall I/O capabilities of the system while it's running.

SAMPLED

There is a third mode that was introduced just for the DMV. The idea is that if you have a very large table and you want an idea of some of the leaf level statistics, but you don't want to take the perf hit of running a DETAILED scan, you can use this mode. It does:

  • LIMITED mode scan
  • If the number of leaf level pages is < 10000, read all the pages, otherwise read every 100th pages (i.e. a 1% sample)

Summary

There's no progress reporting from the DMV (or DBCC SHOWCONTIG) but if you look at the reads column in sys.dm_exec_sessions you can see how far through the operation it is. This method works best for DETAILED scans, where can compare that number against the in_row_data_page_count for the index in sys.dm_db_partition_stats (yes, you'll need to mess around a bit if the index is actually partitioned).

In terms of timing, I ran all three scan modes to completion. The results:

  • LIMITED mode: 282 seconds
  • SAMPLED mode: 414 seconds
  • DETAILED mode: 3700 seconds

Although the LIMITED mode scan read roughly 200x less than the DETAILED scan, it was only 13 times faster, because the readahead mechanism for the DETAILED scan is way more efficient than the (necessary) follow-the-page-linkages scan of the LIMITED mode.

Just for kicks, I ran a SELECT COUNT(*) on the index to see how the regular Access Methods readahead mechanism would fare - it completed in 3870 seconds - 5% slower, and it had less processing to do than the DMV. Clearly DBCC rules! :-)

Although the DETAILED mode gives the most comprehensive output, it has to do the most work. For very large indexes, this could mean that your buffer pool effectively gets flushed out by the lazy writer making space available for the DMV to read and process the pages. One of the reasons I advise people to only run the DMV on indexes they know they're interested in - and better yet, run it on a restored backup of the database.

Hope this is helpful!

PS Oh, also beware of using the SSMS fragmentation wizard. It uses a SAMPLED mode scan, but I found it impossible to cancel!

Blog posts in this series:

  • For the hardware setup I'm using, see this post.
  • For an explanation of log growth and its effect on perf, see this post.
  • For the baseline performance measurements for this benchmark, see this post.
  • For the increasing performance through log file IO optimization, see this post.
  • For the increasing performance through separation of data and log files, see this post.

In the previous post in the series, I examined the effects of separating the data and log files (one file each) to different RAID arrays. It was very obvious that separation gave a performance boost, and that having the portion of the database with the highest I/O write throughput requirements on the faster array (the 8-drive 15k SCSI RAID10) produced the biggest gain.

Now - a confession.  In the last post, when I posted it I found that moving the data an 8-drive 7.2k SATA RAID10 array was the best approach. *But* during the testing for this post, I found that one of my tests had screwed up and only half the client threads had run. You'll notice in that post I went back in and edited it to explain that and update the graph and results. I've now augmented my test harness with a way to check that all client threads are running - to make sure the tests are apples to apples, rather than apples to pomegranates :-)

So - the best I've been able to do so far with the tests is creating 1TB of data using 128 connections (each creating 1/128th TB using inserts with default values) with the single data file on an 8-drive 15k SCSI RAID array (pre-created to 1TB) and the log file on an 8-drive 7.2k SATA RAID10 array (pre-created to 256MB with 50MB autogrowth) in 20842 seconds.

Lots of people have been asking how my network is setup in these tests. Here's what I've been running with (all 1GB ethernet):

  • 1 NIC from a Broadcom BCM5708C NetXtreme II GigE card on the 10.x.x.x network
  • 1 NIC from a Broadcom BCM5708C NetXtreme II GigE card on the 192.168.x.x network
  • 2 x PowerConnect 5424 24-port iSCSI optimized switches , with no separation of traffic
  • The 10.x.x.x server NIC connected to all iSCSI arrays

Over the last couple of weeks I've been playing around with the network setup to make sure things are optimized, and this post will describe what I did and what effect it had. In all the tests below, I kept the dat aon the faster SCSI array and the log on the slower SATA array.

I'm very grateful to the help I received from Wes Brown (twitter|blog) and Denny Cherry (twitter|blog) to the technical questions and WTF?s I sent (and to anyone else on twitter I may have forgotten!).

1) Separation of network traffic

I decided to make one of the 5424 switches dedicated to iSCSI traffic on the 10.x.x.x network and the other for general network activity, including connecting to the management ports on the MD3000s. Turns out that I didn't really need to, as each 5424 can handle 48GB of throughput, way more than I'm generating. But hey ho, at least the wiring in the back of the 42U rack is a little tidier now :-)

Running the 128-way test with the new configuration gave a test time of 21252 seconds, slightly slower than the best time without separation! This was the first of the WTF?s. Until I realized that I hadn't actually removed any network bottleneck at all. I can't explain why things are slightly slower here, so I decided to take the switches out of the equation. My suspicion is that if I ran the test ten times, I'd get ten different results, but within a standard deviation of the median. So - no cause for concern. (In fact, I'm going to try this as part of the next set of tests.)

2) Direct connections to the iSCSI arrays

I configured another NIC (one from an Intel PRO/1000 PT Dual Port GigE card) and then had one NIC directly connected to one of the RAID controllers on the SCSI MD3000 (only one configured volume, so no concerns about having multiple volumes suddenly switching over to a non-optimal RAID controller) and the other NIC directly connected to the SATA MD3000.

Running the 128-way test with the new configuration gave a test time of 21859 seconds, slower than test #1. Very surprising - I expected to get some *gain* so I looked at the peak throughput of the arrays:

  • For test 1, peak SATA was 50500 bytes/sec and peak SCSI was 106012 bytes/sec.
  • For test 2, peak SATA was 46923 bytes/sec and peak SCSI was 107708 bytes/sec.

Things are slower with the network bottleneck removed.

3) Upgrading 5424 switch firmware and reconfiguring

Although the 5424 switches are supposed to come iSCSI optimized, I thought I'd flatten them and reconfigure the. I got the latest version of the switch firmware and re-flashed both switches. I think configured the 10.x.x.x one specifically for iSCSI using this excellent Dell whitepaper.

Running the 128-way test with the new configuration gave a test time of 20745 seconds. Finally an improvement, but nothing major, and still possibly just a statistical variation.

4) Upgrading the NIC drivers

Next I figured I'd bring the NICs up to the latest driver versions so upgraded all the NICs on all the servers.

Running the 128-way test with the new configuration gave a test time of 21743 seconds. Hmmm.

5) Homogenizing the network paths

At this point I started wondering if the Broadcom and Intel NICs had different characteristics so I decided to use the two Intel NICs for the iSCSI traffic. I also enabled jumbo frames. The Intel NICs have three setting for jumbo frames - off, 4088 bytes or 9014 bytes. The MD3000s can only go up to 9000 bytes, so I chose 4088 bytes and configured the MD3000 iSCSI ports to use the same.

Running the 128-way test with the new configuration gave a test time of 21526 seconds - nothing to write home about.

None of the network configuration changes I made had much effect on performance, apart from removing the network bottleneck, which made performance slightly worse overall. I checked other stuff like TCP offloading, but that wasn't enabled. My suspicion was that by removing the network bottleneck, I unmasked a SQL Server contention issue with my 128-connection test. I decided to try fewer client connections.

Here are the results:

There's clearly a SQL Server bottleneck that's being alleviated by reducing the number of connections and allowing the throughput to each array to increase slightly. With 8 connections, SQL Server isn't being driven hard enough and the elapsed time increases again, and this is reflected in the array throughput measurements too (a 10-15% drop compared to the 16-way test). One thing I forgot to do was examine the distribution of wait types while these tests were running, but my guess would be the bottleneck was in the transaction log manager.

Summary

By separating the network traffic and moving to two iSCSI NICs, I removed the network bottleneck I had (see the image at the bottom of the last post) and replaced it with a SQL Server bottleneck. Here's a snapshot of network utilization with the new setup:

 

In the next set of tests, I'm going to look at the effect of altering the transaction log auto-growth size, and pre-allocation size. In all the tests so far, the log has grown from the pre-allocated 256MB to somewhere between 6.5-8GB.

Should be interesting - stay tuned.

PS Yes, I'll be doing a bunch of stuff with the Fusion-io drives too - be patient! :-)

A while ago I blogged about disk partition alignment, and how the default alignment of 31.5Kb on Windows Server 2003 can lead to enormous I/O performance problems (see Are your disk partition offsets, RAID stripe sizes, and NTFS allocation units set correctly?). We've been on-site with clients this week and that topic came up again. I thought it would be useful to do a quick blog post showing how to use the diskpart and wmic tools. Google them for lots of info from the Microsoft site - but be careful not to play around with any of the destructive options on productions systems. The options I'm using below will not alter the disks in any way.

Note: This stuff applies to MBR disks, not GPT or dynamic disks. Although these require correct alignment too, I don't have any information on how to do it for those disks. The SQLCAT team will be publishing some guidelines but has not yet done so, AFAIK. Check out the SQLCAT team whitepaper Disk Partition Alignment Best Practices for SQL Server for full details on this topic.

Bring up a command prompt and type diskpart. You'll see something like:

C:\Users\Administrator>diskpart

Microsoft DiskPart version 6.0.6001
Copyright (C) 1999-2007 Microsoft Corporation.
On computer: MONKEY

DISKPART>

Next you need to list the logical disks that Windows knows about. Type list disk. You'll see something like:

DISKPART> list disk

  Disk ###  Status      Size     Free     Dyn  Gpt
  --------  ----------  -------  -------  ---  ---
  Disk 0    Online       136 GB  1434 MB
  Disk 1    Online      1116 GB      0 B
  Disk 2    Online      2036 GB      0 B

DISKPART> 

Disks 1 and 2 are two RAID arrays I'm using right now for the performance benchmarking series I'm doing. Notice that the numbers in the Free column aren't correct - not sure why not. 

To see the partitions on a disk, you need to set the diskpart focus to be that disk. Type select disk X, where X is the disk you want to focus on. You'll see something like:

DISKPART> select disk 1

Disk 1 is now the selected disk.

DISKPART>

And now you can list the partitions on the disk using list partition. You'll see something like:

DISKPART> list partition

  Partition ###  Type              Size     Offset
  -------------  ----------------  -------  -------
  Partition 1    Primary           1116 GB  1024 KB

DISKPART>

This is the output from one of my Windows Servr 2008 servers, where the default partition offset is 1MB - which doesn't lead to perf issues.

On another Windows XP system, I get the following output:

DISKPART> select disk 0

Disk 0 is now the selected disk.

DISKPART> list partition

  Partition ###  Type              Size     Offset
  -------------  ----------------  -------  -------
  Partition 1    Primary            119 GB    32 KB

DISKPART>

This disk isn't aligned correctly. If this was a RAID array, I'd pay a perf penalty every time a read or write straddled a RAID stripe offset. See the blog post link at the top of this post for more details.

Unfortunately, diskpart isn't always the best tool to use to get partition offsets, as it rounds up the values, and when there are multiple partitions, it can be hard to tell exactly what's what, especially whtih lots of disks where you need to select each one and then list the partitions.

In this case, use wmic to get the exact numbers. The command is as follows:

wmic partition get BlockSize, StartingOffset, Name, Index

For my server, I get the following output:

BlockSize  Index  Name                   StartingOffset
512        0      Disk #1, Partition #0  1048576
512        0      Disk #2, Partition #0  1048576
512        0      Disk #0, Partition #0  1505755136
 

For dynamic disks, use:

dmddiag.exe -v 

Now - go out to your servers and check the partition alignment - fixing this can give you up to 30-40% I/O performance boost!!

How do you fix it? Well, that's the downside - fixing it means reformatting the disk to have the correct partition offset or moving the data to a disk that already has the correct partition offset. Remember - although Windows Server 2008 creates disks with the correct offset, taking a disk that was created on Windows Server 2003 and attaching it to Windows Server 2008 will have no effect on the existing partition offset.

Categories:
IO Subsystems | Performance | Tools

Blog posts in this series:

  • For the hardware setup I'm using, see this post.
  • For the baseline performance measurements for this benchmark, see this post.
  • For the increasing performance through log file IO optimization, see this post.

In the previous post in the series, I optimized the log block size to get better throughput on the transaction log, but it was very obvious that having the log file and the data file on the same RAID array is a bottleneck.

Over the last couple of weeks I've been running some tests with the log and data files on separate RAID arrays, and this post will explain what I've found.

The first thing I did was setup the systems for remote access so I can log in to them from anywhere in the world. This involved:

  • Changing the TCP/IP port that Remote Desktop listens to (KB 306759)
  • Allowing that port through Windows Firewall (having to perform this step wasn't obvious)
  • Turning on port forwarding on our Internet-facing router for that port to the right server

It's very cool being able to play with these servers from anywhere - and to show live servers-under-load during a class, as I could last week.

The previous two tests were performed using a single 2TB RAID-10 array comprised of 8 300GB 15k SCSI drives. The new array I added to the mix for this test is comprised of various numbers of 1TB 7.2k SATA drives. I tested:

  • 4 drives in a RAID-10 configuration, giving 2TB
  • 6 drives in a RAID-10 configuration, giving 2TB usable
  • 8 drives in a RAID-10 configuration, giving 2TB usable

The volume size limit that the Dell MD3000i arrays allow is 2TB, so the 2nd and 3rd array configurations described above made use of extra spindles but with a lot of wasted space. In this set of tests I limited myself to a single data file, but coming up I'll try multiple data files which will enable more efficient usage of the available raw disk capacity.

I used the same scripts as in the previous tests, with 128 concurrent connections each inserting 4.1KB rows in batches of 10 inserts per transaction into the same table, for a total of 1/128 TB in each connection. The data file is pre-sized to 1TB. The log file is pre-sized to 250MB, with a 50MB auto-growth increment. These are invariants from the previous test (and are also something obvious that can be changed for better performance, but that's not what this set of tests was about).

I tried the following configurations:

  • Log on SCSI RAID-10, data on 4-, 6-, 8-drive SATA RAID-10
  • Data on SCSI RAID-10, log on 4-, 6-, 8-drive SATA RAID-10

And very interesting results they are too! In the previous set of tests, the best performance I could get was 21167 seconds for test completion, with the data and log files sharing the SCSI RAID-10 array.

Moving the log to a different array

The left-hand graph above shows that moving the log file off managed to get the time down to 20842 seconds in the best case, but that's using 8 drives. The 6- and 8-drive times were basically the same - which shows that 4-drives didn't provide enough IO parallelism for the load SQL Server was pushing through the array, but moving to 6 drives provided basically enough so that 8 drives didn't lead to a big performance gain.

Overall, I'd say there was no real performance gain from moving the log file to a different array.

As far as log disk queue lengths go, when the log was on the 4-drive array, the average log write disk queue length was in the 20s. For the 6- and 8-drive cases, the queue length dropped to low single-digits. It didn't drop right down to around zero because the perfmon counter is measuring the queue length as far as Windows sees it - and the iSCSI traffic was being bottle-necked through a single NIC, as we'll see later.

For data disk queue lengths, when the log was on the 4-drive array, the average data write queue length was around 5, with spikes to 20+. For the 6- and 8-drive cases, the queue length increased to an average of 10-15, with spikes to 30+. Clearly the log drive bottle-neck in the 4-drive case was lowering the overall transaction throughput, which reduced the load on the data drives.

In terms of iSCSI performance, I had the MD3000is take a 2-hour snapshot of each array's performance with the varying size of the log array. The results are in the right-hand graph above. You can see that the throughput of the data array remains basically static, but the log array throughput increases with more drives thrown into the mix. Nothing stunning here.

In the 4-drive case, the transaction log grew to 5.5GB but in the 6- and 8-drive cases the log grew to over 8GB - which is what I'd expect given the higher transaction throughput.

Moving the data to a different array

Edit 02/28/10: When performing the tests for the next post in the series, I found the results from this test didn't make sense. I went back to re-run the single-NIC tests and found what I feared - the original tests didn't complete properly. The corrected results are below. 

The left-hand graph above shows that moving the data file off managed to get the time down to 25400 seconds in the best case, but that's using 8 drives. The data file's throughput requirements are clearly higher than the log file's (for this specific benchmark test, not as a general statement by *any means*). It's clear that having the data file on the slower SATA array wasn't going to lead to any better performance than having it on the SCSI array.

As far as log disk queue lengths go, they were between 1-3 in all cases, as the 8-drive SCSI array could clearly provide enough IO throughput to satisfy the log's need.

For data disk queue lengths, when the data was on the 4-drive array, the average data write queue length was around 30, with wild spikes. For the 6- and 8-drive cases, the queue length decreased to an average of 20, and then down to 10 with spikes to 30+. When there was no checkpoint or lazy writer activity, the write queue lengths dropped to zero, as I'd expect. 

In terms of iSCSI performance, I had the MD3000is take a 2-hour snapshot of each array's performance with the varying size of the data array. The results are in the right-hand graph above. You can see that the throughput of the log array remains pretty static, but the data array throughput increases *dramatically* with more drives thrown into the mix. Again, nothing stunning here.

In the 4-drive case, the transaction log grew to 19GB but in the 6- and 8-drive cases the log grew to around 8GB - again I'd expect this. It seems that for this workload, with these default checkpoint settings, and with 8GB of server memory, the log needs to be around 8GB when there's adequate IO throughput. This is something I'll be trying to address in future tests.

Perfmon captures

I took a few perfmon snapshots during the various tests to provide some details of the system performance. I'm not going to go into as much detail explaining the various counters and what they mean, I did that in the previous post, but I will point out interesting details. These aren't necessarily meant to be representative of what you'll see running similar tests - they're of interesting times during the traces that I thought would be fun to look at and understand.

1) Log file on the 6-drive SATA array

 

This graph is highlighting the Avg. Disk Write Queue Length for the K: data array (the black line). You can see that when the Checkpoint pages/sec (pink line) and Lazy writes/sec (light green line) both drop to zero, the write queue length drops to zero - as there's no other activity on that RAID array. When either or both of these start up again, the write queue length spikes wildly. Other things to note:

  • The Avg. Disk Write Queue Length for the I: log array (the green line at the bottom) is pretty static in the low single-digits.
  • The Log Bytes Flushed/sec (the red line) tracks the Disk Write Bytes/sec for the I: log array, except during the heavy checkpoint activity at the start of the trace - this is a period of log file auto-growth.
  • The Pages Allocated/sec (the top light blue line) is fairly static, but varies wildly during the heavy checkpoint activity. This is because the log auto-growth is effectively stalling transaction throughput.

2) Log file on the 8-drive SATA array

 

This is highlighting the Pages Allocated/sec (the black line), which remains basically static, and is a measure of the overall transaction throughput. At this point in the test, the log has auto-grown as far as it will and performance is stable. Other things to note:

  • The Log Bytes Flushed/sec also remains static, as there's a constant transaction workload with no interruptions for log growth, and directly correlates with the spikes and troughs in the Pages Allocated/sec counter.
  • The Avg. Disk Write Queue Length for the I: log array (the bottom green line) is static.

3) Data on the 4-drive SATA array

 

This represents the worst combination of array sizes and file placements, and is a pretty chaotic trace. It's highlighting the Avg. Disk Write Queue Length of the I: data array (the black line) and you can see that it varies wildy between 0 and 80(!!!), clearly showing that the data file performance is hampered by under-powered RAID configuration. All other perf counters around the data array vary wildly - clearly not a recipe for high performance.

Summary

There's clearly a performance gain to be had from separating the data and log portions of the database in this case. However, doing so has highlighted the fact that the simple networking configuration I have is now a bottleneck.

Here's a Task Manager trace showing the network utilization during one of the tests:

 

The troughs are the constant traffic going to the log array, and the sustained peaks are when there's checkpoint and/or lazy writer activity going to the data array as well.

In the next post, I'll tune the network configuration, and then I'll move on to trying multiple data files in multiple RAID arrays.

Hope you're still enjoying the series!

Categories:
Benchmarking | Performance

(For the hardware setup I'm using, see this post. For the baseline performance measurements for this benchmark, see this post.)

In my previous post in the series, I described the benchmark I'm optimizing - populating a 1-TB clustered index as fast as possible using default values. I proved to you that I had an IO bottleneck because the IOs to the data file (from checkpoints) and the transaction log file (from transactions committing) were causing contention.

Several people commented that I might have mis-configured the iSCSI IO subsystem - so first off I want to look at that. Fellow MVP Denny Cherry (twitter|blog), who knows a lot more than me about IO subsystems, volunteered to discuss my iSCSI setup with me to make sure I hadn't goofed anywhere (many thanks Denny!). It seems like I haven't. I'm using a single iSCSI array right now, with a single NIC on the server dedicated to the iSCSI traffic - once I move to multiple volumes, I'll add in more NICs.

Looking at Task Manager in the middle of a 6-hour test run to see the network utilization through the NIC shows that it's not saturated, as shown below.

 

I ran the DELL smcli utility for two hours during the most recent test to see what peak throughput I'm getting, using the following command:

smcli -n Middle_MD3000 -c "set session performanceMonitorInterval=5 performanceMonitorIterations=1440;save storageArray performanceStats file=\"c:\\MiddlePerfStats.csv\";"

I saw around 101MBytes/sec. and earlier when testing the smcli settings I'd seen 106MBytes/sec. I'm sure once I remove some of the contention that this will get better.

On to the test!

The first thing I wanted to try was optimizing my use of the transaction log - i.e. doing as few and as large IOs as possible to the log. My hypothesis is that by changing the batch size from one to, say, 10 or 100 SQL Server can make more efficient use of the log.

I changed my T-SQL script to take a batch size parameter and use explicit transactions inserting <batch-size> records. Everything else remained the same as the baseline. I picked 128 concurrent threads as my test point. In the baseline, the 128-thread insert test took 8 hours 8 minutes 27 seconds to complete (29307 seconds). My simple T-SQL code changed to (with $(rows) and $(batch) being SQLCMD parameters to the script):

DECLARE @counter BIGINT;
DECLARE @inner   SMALLINT;
DECLARE @start   DATETIME;
DECLARE @end     DATETIME;

SELECT @counter = 0;
SELECT @start = GETDATE ();

WHILE (@counter < $(rows))
BEGIN
   SELECT @inner = 0;
   BEGIN TRAN;
   WHILE (@inner < $(batch))
   BEGIN
      INSERT INTO MyBigTable DEFAULT VALUES;
      SELECT @inner = @inner + 1;
   END
   COMMIT TRAN;
   SELECT @counter = @counter + $(batch);
END;

SELECT @end = GETDATE ();

INSERT INTO msdb.dbo.Results VALUES (CONVERT (INTEGER, DATEDIFF (second, @start, @end)));
GO

Below are the results for 128 threads with batch sizes varying from 10 to 10000:

Increasing the batch size from 1 record to 10 records caused the elapsed time to drop from 29307 seconds to 21167 seconds - a 28% improvement! However, increasing the batch size further didn't improve things any more. Hmmm.

Next I tried different numbers of concurrent connections with a batch size of ten to see if the improvement was universal:

 

So moving from a single-record batch to a ten record batch has the same dramatic effect on performance for varying numbers of connections.

What's going on?

The answer involves some deep internals of the transaction log structure and IO behavior. The transaction log is split up internally into sections called virtual log files (VLFs for short). These VLFs are used to manage which parts of the log can be overwritten with new log records. If this is all unfamiliar, go read my TechNet Magazine article on Understanding Logging and Recovery in SQL Server which explains about the log in more detail and about transactions, otherwise what's coming next may not make much sense.

Whenever a transaction commits, all the transaction log records up to and including the LOP_COMMIT_XACT log record for that transaction (including log records from other transactions that are inter-mingled with those from the one that just committed) must be written out to disk, so the transaction is durable (called write-ahead logging). But it's not individual log records that are written to disk, the unit of IO for the transaction log is a log block. VLFs are split internally into dynamically sized log blocks, ranging from 512-bytes to 60KB in 512-byte increments, as needed. There are algorithms to figure out how big log blocks should be based on what log records are being flushed out.

With the single record batch, the amount of log generated in the transaction totals 4952 bytes. You can find this by doing something like:

CHECKPOINT;  -- to force the log in the SIMPLE recovery model to clear
GO

INSERT INTO MyBigTable DEFAULT VALUES;
GO

SELECT * FROM fn_dblog (NULL, NULL); -- show me all active log
GO

Then add up the values in the Log Record Length column for all records from the first LOP_BEGIN_XACT log record to the LOP_COMMIT_XACT with the matching Transaction ID.

Anyway, a single-record transaction generates 4952 bytes of transaction log, which will be flushed out in our case in a log block that's 5120 bytes (the nearest multiple of 512 above 4952), with a bunch of zero-padding at the end. For ten single-record transactions, that's 10 small IOs.

The ten-record transaction generates 48688 bytes of transaction log and will be flushed out in a log block that's 49152 bytes (the nearest multiple of 512 above 48688). This is clearly more efficient than 10 smaller IOs and is why changing to batches makes things go faster.

A 100-record transaction generates 489628 bytes of transaction log, which is clearly more than the 60KB log-block maximum, so it actually covers multiple log blocks (varying from 14-16 in my tests - I don't remember the exact algorithms). You can see the log block changing when the middle number in the Current LSN column of the fn_dblog output increases. It looks like it's jumping up, and it is - this is the offset of the log block within the current VLF divided by 512.

Because the 100-record transaction is split into multiple log blocks, there's no real IO gain during log flushes over the 10-record transaction - which is illustrated by the results I saw.

Now, even with this speed increase from the increased batch size, the average-disk-write-queue-length is still anywhere from 20-40 when running the 128-connection test, so there's still a significant bottleneck there. In fact, the transaction log grew significantly still during these tests (up to 23GB in one case) - for an explanation of this phenomenon, see Interesting case of watching log file growth during a perf test. I picked up some wait stats queries from fellow-MVP Glenn Berry (twitter|blog) to run while the tests were running. I'm using his as they're published and anyone can download and use them (see his blog post for the queries, and Bob Ward's CSS blog post that's starting to document the wait types).

His queries tell me that with the 10-record per batch and 128-connections:

  • Average write-stall to the data file is 37ms
  • Average write-stall to the log file is 18ms
  • Top wait types are PAGELATCH_EX (55% of all waits), PAGELATCH_SH (28% of all waits), and WRITELOG (14% of all waits)

The first two waits are the Storage Engine waiting for the buffer pool to kick out dirty pages so that buffers can be used to hold newly created pages, and the third is the log manager waiting for log block flushes to disk to complete.

Clearly IO IO IO.  Let's look at a a perfmon capture during the 10-records per batch test with 128 connections:

 

Let's go through each counter (top to bottom in the list in the image) and explain what's going on. I deliberately picked this time-slice, as it really simply shows the effect of contention:

  • Pages Allocated/sec: this is the light blue line and is the Access Methods part of the Storage Engine (the dev team I used to run) creating new data and index pages for the clustered index we're populating.
  • Checkpoint pages/sec: this is the pink line at the bottom left and bottom right of the capture. This is the buffer pool writing out dirty pages during a periodic checkpoint.
  • Avg. Disk sec/Write: this is the dark blue line that's tracking just above the thick black line. It's the amount of time in ms for a write to complete. You can see that it has a minimum around 12:51:00 and then varies wildly, hitting as high as 50+ms for a single write.
  • Avg. Disk Write Queue Length: this is the highlighted line in thick black. It also has a minimum around 12:51:00 and varies wildly the rest of the time.
  • Disk Write Bytes/sec: this is the dark green line at the top that shows the number of bytes being written to disk from all IO sources. Same story around 12:51:00.
  • Log Growths: A simple counter since the database was created/server started. It's over 100 and off the chart.
  • Log Bytes Flushed/sec: this is the red, fairly constant line around 1/3 the way up and is log blocks being flushed to disk because of transaction commits or checkpoints.
  • Lazy writes/sec: this is the light green line at the bottom and is the buffer pool having to force data pages to be written to disk (along with all transaction log flushed up to the point of the last log record to change the page being written) to make space for images of newly created pages.

This time-slice is really cool in that it shows what happens when contention goes away. Just before 12:51:00, a checkpoint ends and the lazywriter has nothing to do - so the only IOs hitting the disks are those coming from the transaction log flushing out log blocks as transactions commit. You can see the Avg. Disk Write Queue Length drop down to 2-3, the Avg. Disk sec/Write drop to about 5ms, and most beautiful of all (look, I'm a big geek ok? :-), the Disk Write Bytes/sec (the green line) drops down to be exactly equal to the Log Bytes Flushed/sec - proving that it's just log flushes hitting the disk. This is the no-contention case. It happens again for brief spell about 10 seconds later - the lazywriter most likely created a temporary surfeit of empty buffers. All the rest of the time, the lazywriter and checkpoints play havoc with the write throughput on the drives by causing contention.

It's clearly time to try some separation of files to relieve the contention - and that's what I'll cover in the next post in the series.

Hope you're enjoying the series - these take a long time to write up!

(For the hardware setup I'm using, see this post.)

As part of my new benchmarking series I first wanted to play around with different configurations of data files and backup files for a 1-TB database to see what kind of performance gains I can get invoking the parallelism possible when backing up and restoring the database. To do that I need a way to quickly populate a 1-TB database so that I can mess around with different numbers of files and so on. It doesn't matter what the data in the database is, as backup doesn't care - as long as there's a terabyte of it. Why a terabyte? It's a nice round number, it's a common enough database size right now, and I have the storage to play around with it.

But then my plans got derailed. In figuring out how to most quickly populate a 1-TB database, I realized that in itself would be an interesting benchmark to explore, so that's what I'm doing first.

My aim is to give you improvements you can use in real life. If you think this is going to be boring, skip down to the end of the post where I show a detailed perfmon and explain what's going on in my overloaded IO subsystem, then come back up here :-)

The baseline for this benchmark is contrived - I'm going to populate a 1-TB clustered index (so I can do other experiments with the index) as quickly as I can. The interesting part is that I'm starting with a brain-dead database layout, configuration, and method of populating the table, and then I'm going to alter different things to see the effect on the system. The effects and gains will be the interesting part as it will expose parts of how SQL Server works which *WILL* be applicable to real-life situations and workloads - the whole point of me doing all of this is to show you improvements, why they work, and how they could be useful to you.

When doing any kind of performance testing it's *essential* to have a baseline with which to compare - otherwise how do you know what effect a variation is having? This post describes my baseline setup, measurements, and limitations I start to notice.

First let me describe the baseline setup:

  • Single data file and log file stored on the same volume, on an 8-drive RAID-10 array (each drive is a 300GB 15k SCIS drive), connected through iSCSI to the server
  • Data file is created to be 1-TB, with instant file initialization enabled
  • Log file is created to be 256MB
  • Database is using the SIMPLE recovery model

Yes, I'm deliberately putting the data and log on the same RAID array. I *want* to see some contention so I can prove to you how separation of data and log can reduce contention and improve performance sometimes.

Database and table creation script:

CREATE DATABASE BackupPerfTest ON
(NAME = 'BackupPerfTest_Data',
   FILENAME = 'K:\BackupPerfTest.mdf',
   SIZE = 1TB,
   FILEGROWTH = 1GB)
LOG ON
(NAME = 'BackupPerfTest_Log',
   FILENAME = 'K:\BackupPerfTest.ldf',
   SIZE = 256MB,
   FILEGROWTH = 50MB);
GO

ALTER DATABASE BackupPerfTest SET RECOVERY SIMPLE;
GO

USE BackupPerfTest;
GO

CREATE TABLE MyBigTable (c1 BIGINT IDENTITY, c2 CHAR (4100) DEFAULT 'a');
GO
CREATE CLUSTERED INDEX MyBigTable_cl ON MyBigTable (c1);
GO

I figured the fastest way to fill the database is to have a single table with one row per page, and that having SQL Server create the large CHAR column inside itself, rather than me doing a REPLICATE, would be quickest.

For the actual population of the table, I worked out that 134217728 table rows gives me a terabyte, with each row taking up a single 8KB page.

The baseline setup of the table population is:

  • Varying numbers of concurrent connections (16, 32, 64, 128, 256) to the server (8-way DELL PowerEdge 1950)
  • Each connection runs a simple script that inserts 134217728 / number-of-connections rows into the table, and then logs the elapsed time into a results table
  • Each insert is done as a single-statement implicit transaction (if I don't do an explicit BEGIN TRAN/COMMIT TRAN, SQL Server does it for me)
  • A monitor connection pings the results table every 5 minutes checking to see if number-of-connections results are there, and sending me email if so

Yes, I'm deliberately using this method to insert each row. Again, I want to be able to make improvements and see the effect of the changes.

Each connection will be running this script:

SET NOCOUNT ON;
GO

DECLARE @counter BIGINT;
DECLARE @start   DATETIME;
DECLARE @end     DATETIME;

SELECT @counter = 0;
SELECT @start = GETDATE ();

WHILE (@counter < $(rows))
BEGIN
   INSERT INTO MyBigTable DEFAULT VALUES;
   SELECT @counter = @counter + 1;
END;

SELECT @end = GETDATE ();

INSERT INTO msdb.dbo.Results VALUES (CONVERT (INTEGER, DATEDIFF (second, @start, @end)));
GO

This is run through SQLCMD, and the number of rows to insert is passed into the T-SQL script using:

sqlcmd -S(local)\SQLDev01 -dBackupPerfTest -i"C:\SQLskills\Populate1TBTest\PopulateTable.sql" -v rows=%1

%1 in the line above is passed from a master CMD that kicks off number-of-connections CMD windows, each of which just runs the SQLCMD line above.

So - a simple setup.

Here's a graph of the results:

 

The elapsed times for all connections to complete their work (as there could be up to an hour between the first and last to complete) were:

  • 16-way: 15 hours 25 minutes 5 seconds
  • 32-way: 13 hours 50 minutes 18 seconds (11% faster)
  • 64-way: 10 hours 12 minutes 48 seconds (27% faster)
  • 128-way: 8 hours 8 minutes 27 seconds (20% faster)
  • 256-way: 7 hours 24 minutes 21 seconds (9% faster)

More connections clearly leads to a faster runtime, but the improvements from doubling the number of threads clearly aren't directly proportional to the number of threads. The biggest improvement was from 32 to 64 threads, and then the percentage gain started to tail off. Let's look at the page allocations per second for each experiment too:

 

As I'd expect, the pace at which pages are being allocated in the database increases with more threads and the percentage improvements line pretty much matches that of the elapsed time graph above. There's a slight difference in the 128 and 256 gains here because the graph is show what the perf counter number was after the experiment reached a steady state. I noticed that the log grew enormously for the last few tests, which caused the steady-state number to not be reached for a while. I've already blogged about that phenomenon in Interesting case of watching log file growth during a perf test.

I also watched what was happening in perfmon to see if there were any obvious performance problems going on. Here's a perfmon graph for the 64-way test once it reached steady-state and the log wasn't growing:

 

Analysis: 

  • The black line is above 0 when a checkpoint is occuring.
  • The green line represents how much data is being written to the K: volume, where the data and log file are. You can see there's a constant amount of data being written all the time (transaction log records when the implicit transactions commit) with large spikes whenever a checkpoint occurs and causes the actual data pages to be written out.
  • The light blue line is the pages allocated per second. You can see that it takes a major nose dive whenever a checkpoint occurs. I'm speculating that this is because of disk contention preventing the transaction log being written to disk (thus slowing down the transaction throughput) while the checkpoint is occuring and writing out data pages
  • The dark blue line at the bottom is the average disk seconds per write. You can see that it hovers around 4-5 milliseconds and spikes to 16-17 when a checkpoint occurs.
  • The brown line in the middle is the average disk write queue length. It hovers around 18-19 and spikes to around 25 when a checkpoint occurs.

Observations:

  • This system is clearly bottlenecked in the I/O subsystem
  • There is a direct correlation between checkpoints occuring and: increased disk queue length, reduced transaction throughput

This is what I would expect to see on a system with lots of updates where the log and data are on the same volume. Remember that I've got them on a fast RAID-10 array. This debunks the theory I've often heard that contention doesn't matter on RAID arrays. Yes, it does. You can still max out the throughput capabalities of any IO subsystem - it just depends what you're trying to do with it. Imagine if I had nonclustered indexes on this table too - more logging and more pages being allocated - much worse performance...

So now I've got my baseline and there are some obvious things I can try to improve things:

  • Data and log file separation
  • Varying number of data files
  • Varying placement of data files (e.g. different RAID arrays)
  • Manual checkpoints
  • Pre-sizing the log
  • Using explicit transactions with varying batch insert sizes
  • Using -E to get larger allocations

I'm not saying that all of these things are going to help, but over the next few weeks I'm going to try them all and report on the results. You'll be able to clearly see the effect of changing these on my benchmark, running on production-quality hardware, rather than just taking people's words for it.

I hope you're going to find these benchmarks and experiments useful - I'll be learning (hopefully) as I go along too.

Let me know if there's anything else you'd like to see me try, and if you're following along (I don't want to spend all this time if no-one's reading the series!)

Thanks!

I'm running some performance tests on the hardware we have (more details on the first of these tomorrow) and I was surprised to see some explosive transaction log growth while running in the SIMPLE recovery model with single row insert transactions!

Without spoiling tomorrow's thunder too much, I've got a setup with varying numbers of connections populating a 1TB table with default values, with the single data and log files on the same 8-drive RAID-10 array (again more details on why tomorrow). I was looking at perfmon while the test was running with 128 concurrent connections, and noticed some weird-looking behavior - basically what looked like a never-ending checkpoint occuring. So I dug in deeper and discovered the transaction log had grown over 7GB since the start of the test. Huh?

I restarted the test from scratch, with a 1TB data file and a 256MB log and watched what happened in perfmon; here's the screenshot of the first 90 seconds:

 

This is *so* interesting. The black line is the number of log growths, so you can see the log grows every time the line level goes up. The bright blue line is the number of pages being allocated per second to hold all the table rows my 128 connections are inserting. You can clearly see that every time there's a log growth, the allocations take a nose-dive - because no transactions can commit while the new portion of the log is being zeroed (remember that instant file initialization does not apply to the log - see this blog post). The green line (disk write bytes per second) spikes when the log grows because of the zeroes being written out to the log file by SQL Server (remember that SQL Server does the zeroing).

But why is the log growing in the first place? I'm in the SIMPLE recovery model and doing single-row implicit transactions (yes, I'm deliberately doing this) so there shouldn't be anything stopping the log from clearing during a checkpoint, right?

Wrong. The log starts out small (256MB) so one of the thresholds for triggering a checkpoint gets hit pretty fast (70% of the log file is used). So a checkpoint occurs (you can see the checkpoints occuring when the pink line at the bottom of the perfmon screen is above zero), and starts writing out to the data file, which is on the same disk as the log (see How do checkpoints work and what gets logged for an explanation of checkpoint mechanics), but it can't write fast enough (because of disk contention) to get to the log-clearing part before the log fills up completely and has to grow (because transactions are continuing at break-neck speed). So the log grows, and the insert transactions stop while the log is zeroed. And then the log starts to fill up again very quickly and another checkpoint is triggered, and so on and so on.

Eventually a steady state is reached where there's enough free log space during a checkpoint that no new log is required for the concurrent transactions to commit. You might ask why the checkpoint is so slow in the first place? Because I deliberately put the data file on the same RAID array as the log file, and both are being steadily hammered with writes - classic disk contention. Even though the RAID array is RAID-10 with 8x300GB 15k SCSI drives, the average disk queue write length is over 20 most of the time during the 128-way test because I'm simply trying to do too much.

The point of my blog post? Just because you don't have any of the classic causes of transaction log growth going on, doesn't mean you're going to be immune. In this case my (deliberate) poor physical layout of the database files and workload growth up to 128 concurrent connections caused the log to grow. What started out working when I was running 16 connections didn't work any more at 128 (actually I went back and re-ran some of the earlier tests and even with only 64 connections, the log grew to over 1GB before reaching steady-state).

Interesting eh?

Tomorrow I'll be continuing the perf/benchmarking series by creating my first benchmark and then tweaking the setup to see how I can improve performance (for instance with multiple data files, separation of log and data files - all the things I preach but have never *demonstrated*), but this behavior merited a post all on its own.

I've blogged about ghost records and the ghost cleanup task a couple of time before (the only place it is really explained AFAIK), but one of my fellow MVPs was asking me some questions about it today for a customer of theirs and couldn't find the trace flag to turn it off.

My previous blog posts about this are:

These explain what ghost records are and how the ghost cleanup process works.

On big systems it's possible for the ghost cleanup process to fall behind the rest of the system, with no hope of it catching up. It's a single-threaded task - so imagine a 16-way box with lots of deletes occuring, and a single CPU spending a few seconds every 5 seconds trying to remove all the ghosted records resulting from the deletes of all the other CPUs. It's pretty obvious that the ghost cleanup process is going to lag behind.

The problem with this scenario is that the ghost cleanup process will still pop up every 5 seconds (every 10 on 2008) and start removing ghost records, potentially causing performance issues by keeping pages in the buffer pool, generating log records, and causing physical IOs. The ghost cleanup task is also one of the background processes that can cause IOs to occur on what looks like a totally quiescent system.

There is a way to turn off the ghost cleanup task, using trace flag 661, as documented in KB 920093. Be careful though!!! If you disable the ghost cleanup task, the space taken up by deleted records will *NOT* be released for reuse by SQL Server until you do something else to remove it, like rebuilding an index.

One method people sometimes consider is to force ghost cleanup to clean everything by performing a table or index scan (thus queuing all the deleted records up for the ghost cleanup task). Although this is an alternative, it still uses the ghost cleanup task to do the work, and on a very busy system with a very large number of deletes (warning: generalization! :-) it can be much more efficient to remove the deleted-but-not-yet-reclaimed records using index reorganize or index rebuild.

Turning on this trace flag can be useful as a performance gain on systems with very heavy delete workloads, but only as long as you're careful about it. It's not something that's generally recommended but it may be useful to you.

Enjoy!

This is an interesting performance bug concerning a broken query optimizer rule in 2008. Thanks to Dan Shargel (Twitter) for sending me info on this and letting me use some of the stuff he sent.

The scenario involves using MIN or MAX in a query against a partitioned view. In 2005 the query plan includes a TOP (1) operator which uses the right index, but in 2008 the optimizer rule was broken and the plan turned into a stream aggregate, much more expensive in this case.

Here's the 2005 query plan:

 

and here's the 2008 query plan (before the bug fix):

 

You can get the fix in CU4 for 2008 SP1 (or later) and read a bit more about it in KB 973255.

Note that you have to turn on trace flag 4199 to enable the fix - that requirement will be removed in SQL11.

It's been a few weeks since my last posts but I've got a bunch in the pipeline coming up.

Firstly, I've got it together to start using the hardware we got a while back. I'm going to be doing some benchmarking, perf testing and playing with various HA technologies, and of course blogging a bunch about what I discover. The hardware was a lot of fun to put together!

This is kind of a reference post, detailing the hardware setup, so I can link to it in all the posts where I use our hardware.

Photos: front and back (no Facebook login required).

Here's what I'll be using, all mounted in a 42U rack:

  • 2 x Dell PowerEdge 1950 servers, each with
    • 2 x quad-core Xeon E5405 Processor2x6MB Cache, 2.0GHz, 1333MHz FSB
    • 8GB 667MHz RAM
    • 2 x 146GB 15K RPM SCSI drive (for OS etc)
    • 4 3GBps NICs
    • Windows Server 2008 Enterprise SP1 64-bit
    • Multiple SQL Server 2008 Enterprise instances
  • (Upper array) Dell PowerVault MD3000i iSCSI array with 2 dual-port controllers
    • 15 x 1TB 7.2K RPM SATA drives
  • (Middle array) Dell PowerVault MD3000i iSCSI array with 2 dual-port controllers
    • 8 x 300GB 15K RPM SCSI drives
  • (Bottom array) Dell PowerVault MD3000i iSCSI array with 2 dual-port controllers
    • 8 x 1TB 7.2K RPM SATA drives
  • Connected with 1GbE through two PowerConnect 5424 24-port iSCSI optimized switches 

The drives will be reconfigured and I'll specify what I'm using in each post.

The SQL Server instances are vanilla installs, with no special parameters, sp_configure options, or trace flags. If I use any I'll detail them in each post.

Time to get using it!

Categories:
Benchmarking | Performance

As part of the 2008 DBA class we're teaching down here in Melbourne, I did a demo of using predicates and file targets with extended events, so I want to blog the script for people to play with.

For background info on extended events see:

In this scenario, I'd like to track queries that make heavy usage of the CPU on my system.

First off I'll create a test database to play with.

USE MASTER;
GO
IF DATABASEPROPERTYEX ('production', 'Version') > 0 DROP DATABASE production;
GO

CREATE DATABASE production;
GO
USE production;
GO

CREATE TABLE t1 (c1 INT IDENTITY, c2 UNIQUEIDENTIFIER ROWGUIDCOL DEFAULT NEWID(), c3 CHAR (5000) DEFAULT 'a');
CREATE CLUSTERED INDEX t1_CL ON t1 (c1);
CREATE NONCLUSTERED INDEX t1_NCL ON t1 (c2);
GO

SET NOCOUNT ON;
INSERT INTO t1 DEFAULT VALUES;
GO 1000

-- Get the database ID to plug into the event session
SELECT DB_ID ('production');
GO

Notice that I grabbed the database ID of the new database I created, as I'll need that when defining the extended event predicate.

IF EXISTS (SELECT * FROM sys.server_event_sessions WHERE name = 'EE_ExpensiveQueries')
   DROP EVENT SESSION EE_ExpensiveQueries ON SERVER;
GO

CREATE EVENT SESSION EE_ExpensiveQueries ON SERVER
ADD EVENT sqlserver.sql_statement_completed
   (ACTION (sqlserver.sql_text, sqlserver.plan_handle)
      WHERE sqlserver.database_id = 18 /*DBID*/  AND cpu > 10 /*total ms of CPU time*/)
ADD TARGET package0.asynchronous_file_target
   (SET FILENAME = N'C:\SQLskills\EE_ExpensiveQueries.xel', METADATAFILE = N'C:\SQLskills\EE_ExpensiveQueries.xem')
WITH (max_dispatch_latency = 1 seconds);
GO

I'm monitoring the event that fires whenever a statement completes, and I'm filtering by database ID (make sure you plug in the correct database ID when you try this yourself) and by the number of milliseconds of CPU time the statement used. Note: when using predicates you should always make sure that you order the predicate tests such that the most restrictive predicates (those most likely to evaluate to false) are first in the list, so the predicate evaluation 'short-circuits' as quickly as possible (a standard programming practice with boolean logic). I'm also using a file target, just to show how it can be done, instead of the ring buffer which is commonly used for extended events demos.

Now I'm going to turn on the event session and do some queries in the production database.

ALTER EVENT SESSION EE_ExpensiveQueries ON SERVER STATE = START;
GO

USE production;
GO

SELECT COUNT (*) FROM t1 WHERE c1 > 500;
GO

SELECT SUM (c1) FROM t1 WHERE c3 LIKE 'a';
GO

ALTER INDEX t1_CL ON t1 REORGANIZE;
GO

ALTER INDEX t1_CL ON t1 REBUILD;
GO

That should have generated some long-running queries. Now I'll switch in to the context of master (so the querying of the event session itself doesn't get captured by the event session) and see what I've captured.

USE master;
GO

SELECT COUNT (*) FROM sys.fn_xe_file_target_read_file
   ('C:\SQLskills\EE_ExpensiveQueries*.xel', 'C:\SQLskills\EE_ExpensiveQueries*.xem', NULL, NULL);
GO

In this case I've got 3 entries. I can pull these out using the following code:

SELECT data FROM
   (SELECT CONVERT (XML, event_data) AS data FROM sys.fn_xe_file_target_read_file
      ('C:\SQLskills\EE_ExpensiveQueries*.xel', 'C:\SQLskills\EE_ExpensiveQueries*.xem', NULL, NULL)
 ) entries;
GO

But I get three XML blobs back, like so:

 

What's more useful is to pull everything out of the XML blob programmatically using the code below:

SELECT
   data.value (
      '(/event[@name=''sql_statement_completed'']/@timestamp)[1]', 'DATETIME') AS [Time],
   data.value (
      '(/event/data[@name=''cpu'']/value)[1]', 'INT') AS [CPU (ms)],
      CONVERT (FLOAT, data.value ('(/event/data[@name=''duration'']/value)[1]', 'BIGINT')) / 1000000
      AS [Duration (s)],
   data.value (
      '(/event/action[@name=''sql_text'']/value)[1]', 'VARCHAR(MAX)') AS [SQL Statement],
      SUBSTRING (data.value ('(/event/action[@name=''plan_handle'']/value)[1]', 'VARCHAR(100)'), 15, 50)
      AS [Plan Handle]
FROM
   (SELECT CONVERT (XML, event_data) AS data FROM sys.fn_xe_file_target_read_file
      ('C:\SQLskills\EE_ExpensiveQueries*.xel', 'C:\SQLskills\EE_ExpensiveQueries*.xem', null, null)
) entries
ORDER BY [Time] DESC;
GO

Time                    CPU (ms) Duration (s) SQL Statement                              Plan Handle
----------------------- -------- ------------ ------------------------------------------ --------------------------------------------------
2009-10-16 17:59:29.623 30       1.214875     ALTER INDEX t1_CL ON t1 REBUILD;           0x06001000EB672A07B8C0C807000000000000000000000000
2009-10-16 17:59:28.407 20       0.024076     ALTER INDEX t1_CL ON t1 REORGANIZE;        0x0600100003594903B8C0C807000000000000000000000000
2009-10-16 17:59:28.343 51       0.045144     SELECT SUM (c1) FROM t1 WHERE c3 LIKE 'a'; 0x06001000FAF5B11EB820C307000000000000000000000000

Now I can plug in one of the plan handles to a query of sys.dm_exec_query_plan to get the graphical query plan:

SELECT [query_plan] FROM sys.dm_exec_query_plan (0x06001000FAF5B11EB820C307000000000000000000000000);
GO

And clicking on the resulting XML 'link' gives the query plan:

 

 

And now I can tweak the production workload to potentially behave better.

Just for kicks I went back into the context of the production database and ran the XML parsing again to capture the query plan - try it yourself - pretty gnarly! :-)

Now, if I was going to make this more useful, I'd use a ring buffer target, with a polling mechanism every few seconds to make sure that I can capture the graphical query plan for expensive queries before the plan is pushed out of cache - maybe I'll get around to doing that and publish the complete solution.

In the meantime, here's a zip file containing the entire script that you can download and play with: EE_ExpensiveQueries.zip (1.72 kb)

Enjoy!

This is a question that came up yesterday on Twitter: will SQL Server always do a table scan to find the result of SELECT COUNT (*) FROM mytable?

The answer is no. The query processor will use the index with the smallest number of pages - i.e. with the least I/O cost.

Let me quickly show you. First I'll create a simple table with no indexes.

CREATE TABLE CTest (c1 INT IDENTITY, c2 BIGINT DEFAULT 1, c3 CHAR (1000) DEFAULT 'a');
GO
SET NOCOUNT ON;
GO
INSERT INTO CTest DEFAULT VALUES;
GO 10000

Don't forget to unclick the Include Actual Query Plan button

 

before running the 10000 inserts otherwise it will take forever (generating 10000 graphical plans) and SSMS will barf with:

The query has exceeded the maximum number of result sets that can be displayed in the Execution Plan pane. Only the first 250 result sets are displayed in the Execution Plan pane.

Now if we do the SELECT COUNT (*), the plan is:

 

The query processor has no choice but to do a table scan.

Now I'll add a nonclustered index that will have less pages that the table itself:

CREATE NONCLUSTERED INDEX CTest_1 ON CTest (c2);
GO

And the select plan is now:

 

Notice that the Table Scan operator has changed to an Index Scan operator on the new CTest_1 index. This is because the nonclustered index has fewer pages than the table and so the I/O cost is lower.

Now I'll create an index that's even smaller, over the integer column:

CREATE NONCLUSTERED INDEX CTest_2 ON CTest (c1);
GO

And the plan should change to use the smallest index again:

 

And it does, as I expected.

Let's look at the relative page counts for each index and the table:

SELECT [index_id], [page_count]
FROM sys.dm_db_index_physical_stats (DB_ID (), OBJECT_ID ('CTest'), NULL, NULL, 'LIMITED');
GO

index_id    page_count
----------- --------------------
0           1436
2           28
3           19

Any time the query processor is working out what plan to use, one of the key factors in plan choice is the I/O cost.

Hope this helps!

This survey is based on a suggestion from Jonathan Kehayias (blog/Twitter).

In this survey I'm asking two questions:

  1. When you're designing/implementing a database, do you manually create statistics or let SQL Server create them automatically?
  2. How do you maintain statistics? Manually or letting SQL Server do it for you? Or both? 

I'll probably collaborate with Kimberly on the editorial for this, as she's forgotten more about statistics than I'll ever know!

*Please* no comments on this post - wait for the survey results post to avoid skewing the answers (comments are unfortunately moderated anyway, now that my blog gets a lot of spam). I'm very interested in your reasoning, but not until everyone else responds.

   

As always, a big Thanks! for contributing to the blog/community by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

In last week's survey I asked what you think is the most important thing when it comes to performance tuning, if you *had* to choose one - see here for the survey. Here are the results as of 6/7/09.

Now of course you're all calling 'foul' because I didn't put an 'it depends' option on there which you all would have chosen, but that would have been boring, and I wanted you to pick what you really think is the #1 thing in the majority of cases. Of course, in general the answer is always 'it depends', but some things turn out to be the #1 problem more often than others. For the record, mine and Kimberly's answers were overall indexing strategy, but I deliberately broke them out into clustered and nonclustered to see what people would pick. I'm not surprised that application issues and nonclustered indexing strategy came out top, but good to see that all answers were represented and there's a variety of opinion out there. Variety is the spice of life as they say, or is it garlic? Depends who's cooking I suppose - but I digress.

Let's walk through each option and see how badly they can affect performance if something's wrong. This is by no means an exhaustive list of what could cause problems, just me rambling on at the end of a Sunday, as usual. Blog posts are so much easier than articles because you can have run-on sentences, short little fragments and all sorts of other verboten grammatical beasts.

IO subsystem design/tuning, including RAID: Quite a few things here could be wrong. Is there a write-intensive workload (either data files or log files) sitting on a RAID-5 array (with RAID-5 there's a performance penalty on writes)? Is the disk partitioning alignment, RAID stripe size, or NTFS allocation unit size incorrectly set? (See Are your disk partition offsets, RAID stripe sizes, and NTFS allocation units set correctly? for details.) Is the I/O subsystem just not up to the load being placed on it - high disk queue lengths, page IO latch wait errors, long-running IO errors? Is tempdb the bottle-neck because there a multiple procs with multiple users all trying to create/drop lots of temp tables? (See Misconceptions around TF 1118 for details.) Are there multiple databases using the same I/O subsystem such that disk-head contention is causing performance issues? Is there a network bottleneck with iSCSI storage? Are there filter drivers getting in the way and slowing things down?

Server hardware (CPUs/memory/other): Ok - I'm going to sound like a stuck-record saying 'quite a few things here could be wrong'. Let's just take it as read that each category has lots that could go wrong. A heavy load with a single CPU isn't going to perform well. However, sometimes excessive parallelization can lead to the dreaded CX Packet waits - multiple threads in the query processor waiting for information from each other because the server is overloaded - or blocking issues as locks start getting in the way - but that's more an application problem - oops, wrong section. Having more CPUs is usually a good thing as parallel plans usually run faster than single-threaded plans. Having too little memory can severely limit the buffer pool size and force buffer pool thrashing (low page life expectancies), and large operations to spool data to tempdb, placing further load on resources. Not sure what I meant now by 'other' - maybe having a wireless keyboard could slow down commands getting to SQL Server? Ok - I remember - CPU architecture. 64-bit vs. 32-bit. With 64-bit there's no real limit on the amount of memory that can be addressed, and no mucking around with AWE. There are some tricky issues with NUMA to make sure you don't get cross-node memory accesses (just like CPU cache invalidation could hit you on SMP architectures) but that's getting a little advanced and geeky. Btw - watch out for this when you're using Extended Events in SQL 2008 on multi-proc boxes - there's a setting where you can tell it what kind of CPU architecture you have.

Virtualized vs. 'real' server: Now, I'm the first to admit that I'm not an expert in virtualization, but now I have 3 8-ways and lots of disks to play with at home, I want to be. That's why I picked up a Hyper-V book today after reading Tom's virtual bookshelf post here. What I do know is what I've heard from talking to people at conferences and during classes - the main thing here is I/O virtualization. If there's software virtualization of I/Os, then performance is going to suck under load and you better not do it in production. Microsoft's Hyper-V gets around this, as do things like VMware ESX, by allowing you to assign actual real physical resources to virtual machines. And that's as much as I know about it until the book arrives and I get back from Houston to read it. And no doubt I'll be posting more about it.

Database physical layout (files/filegroups): Tempdb is the obvious case, and I've covered that above. User databases are a bit trickier, and it really depends on the underlying I/O subsystem. A single monolithic data file on a single physical disk isn't going to perform well under heavy concurrent load. But it may do fine on a RAID-10 array sliced-and-diced in a clever way by an expensive SAN. In today's world, more and more databases are on SANs so user database performance is taken care of in that respect, and all you need to worry about is tempdb. I touched on this a little bit in the editorial of a previous survey Physical database layout vs. database size.

Table design: Both ends of the spectrum work well here - from the plastic Costco tables with the fold-down metal legs to antique French-oak farmhouse tables. Both will give many years of service with no degradation of performance. Just like regular tables, SQL Server tables come in all shapes and sizes and you can easily pick the wrong one for what you need, leading to poor performance. This is a massive can of worms - in fact Kimberly spends 2 brain-busting days on this in one class she teaches and she could easily spend a week of 10-hour days on it (oh yes, it's called the SQL Masters program :-) From picking the right column types (see Michelle's post at Performance Considerations of Data Types) to deciding how to store you LOB values (see my post at Importance of choosing the right LOB storage technique) to deciding on a good primary key. Just don't pick a random GUID, or if you have one, replace it. Clustered index keys should be unique, narrow, static and ever increasing. Table design encompasses so many things I can't do it justice in a late evening paragraph. But you get the idea.

Heaps vs. clustered indexes: Kimberly likes to wrap this up with nonclustered indexes too, but I wanted to break them out so I could reference a whitepaper that strongly suggests you should use clustered indexes: Comparing Tables Organized with Clustered Indexes versus Heaps. Read it yourself and draw your own conclusions. Basically, with the possibility of forwarding records in heaps, and the majority of a clustered index's upper levels being in memory, the extra random physical I/Os to access a record in a heap outweigh the aggregated cost of the in-memory binary-searches at each level of clustered index tree navigation. I'm sure some people will argue about this - bring it on. There are very special cases where heaps are better, but not in general.

Nonclustered index strategy: Biggie #1. Where to start here and what to say? For the majority of scenarios, if you don't have a good nonclustered indexing strategy you're not going to have good performance. There's a simple reason for this - without the right nonclustered indexes to allow the query optimizer to choose them, you're going to get table scans. Of your 124 billion row table. How fast is that going to be? Again, Kimberly has this class that she teaches on this where she goes into details on how to index for ORs, for ANDs, for JOINs, and all sorts of other kinky stuff. I need to sit and listen to her one day so I can appear knowledgeable about this. I just know that you need to have indexes with the right selectivity. Ok - I just called Kimberly downstairs in the office to make sure I'm right before I blog something and look like an idiot and she starts with 'erm, well it depends'. I give up. I do *corruption* and *HA* and *maintenance*, not performance tuning. It's bloody hard. Luckily I'm married to one of the best people on the planet for SQL index tuning - wow, what hope do you have? (Of index tuning, not of marrying Kimberly :-) Seriously, index tuning isn't that bad - I just had a moment of weakness. You need to make sure you have the right indexes and no completely unused indexes otherwise you're wasting resources maintaining them. You could do worse than listen to her on RunAs Radio Interview Posted - "Kim Tripp Indexes Everything".

Statistics: If the statistics are out-of-date, the optimizer will not be able to pick a good plan and your performance will suck. It's that simple. Turn on AUTO_UPDATE_STATISTICS and make statistics updating is part of your regular maintenance. Don't forget to update non-index statistics too.

Application (design and code): Biggie #2. Sometimes no amount of cleverness can wring good performance from the twisted logic of a deranged application programmer. An application that is written with absolutely no concern for how SQL Server works is likely to not perform well. An application that is written with too much knowledge of how SQL Server works is likely to fall foul of relying on 'accidental' behaviors, or behaviors that are limited to certain data volumes or workloads.  An application that is going to make heavy use of SQL server has to take into account how SQL Server is going to behave under a variety of workloads, on a variety of servers, and on a variety of I/O subsystems. The common application test framework? A ten-row customer table with a single connection. 'Excellent - my 16-table join with CLR-based aggregations runs in less than 2 seconds'. For 10 rows. Six months later: 'Now we've got 3 million customers, why does performance suck so bad?' Go figure.

Database maintenance: Ok - starting to get tired now and I still need to do this week's survey. Maybe I could just say 'do maintenance' and be done with it? No? Ok - how about go checkout the article I wrote last August for TechNet Magazine: Effective Database Maintenance article. If you don't maintain your indexes, they'll likely get fragmented and affect performance in one way or another and you'll need to play with FILLFACTOR (see Kimberly's Database Maintenance Best Practices Part II – Setting FILLFACTOR). If you don't take care of transaction log VLF fragmentation, it will affect log-based operations (Kimberly again at Transaction Log VLFs - too many or too few?). Seriously - she says I blog so much - how many links are in this post to her stuff? She blogs a lot too!. If you have corruption, it could manifest itself as long-running I/Os. If I don't put my food on to cook, then I'll be eating way too late again. That was a 'Paul maintenance' one that slipped in - and can have disastrous effects on blogging performance. Seriously, you can't just put a database into production and walk away. It will slowly degrade over time. Like red jello melting on a hot summer's day, but probably not as fast, unless you use random GUID primary keys, or it's really hot. What?!?

Performance tuning is an art and a science. But there's a huge amount of science behind it before you have to get into the art side of things. I've just scratched the surface here in a blog post that took me more than an hour to write, banging away non-stop as I do. And I don't do the art side of things. I leave that to Kimberly - she just got one of our client's batch jobs from a 72-hour run-time down to a 6.5 hour run-time. They were pretty happy. We can't all be the Goddess of Performance Tuning Smile

Next post - this week's survey!

Categories:
Performance | Surveys

This week's survey is going to provide some fun debating topics. I'd like to know what the number 1 thing is that YOU go after when performance tuning - what do you think is most important? (Assuming you're not just randomly walking up to a server to do some perf tuning - there's an actual problem).

[Edit: And, as I say below - please no comments about what you picked until the results post next week - it skews the results. I've already had to delete 6 comments and ask for them to be held until next week.]

There is deliberately no 'it depends' answer, no 'combination' answer, and no way to suggest something else. If what you think is most important isn't on the list, don't vote, but I'd like to hear your opinion after the editorial next week.

As always, a big Thanks! for contributing to the blog by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

PS And again, as always, no comments on this post please (I'll delete them straight away) - please wait for the survey results post to avoid skewing the answers. I'm very interested in your reasoning, but not until everyone else responds.

Categories:
Performance | Surveys

I've just read a very good, very deep, and very interesting blog post by James Rowland-Jones. In the post, James investigates some locking issues using a variety of means and explains about the undocumented %%lockres%% function with you can use to figure out what the wait resource will be for individual table rows (basically the lock-hash value). He also shows something I've known about but never seen before - how the lock hash algorithm isn't perfect and can actually cause lock collisions where you wouldn't expect them - and how to mitigate the problem.

Excellent post and well worth reading. Check it out at The Curious Case of the Dubious Deadlock and the Not So Logical Lock.

I'm teaching a class this week on database maintenance, for DBAs inside Microsoft. One of the things we're discussing today is index fragmentation and how poor cluster key choice can lead to page splits, poor performance, index fragmentation, and so on - not just in the clustered index, but also in nonclustered indexes.

One of the students looked in a database underpinning an application and found a unique cluster key, which is the worst I've ever seen (although not the worst that Kimberly's ever seen apparently - the mind boggles!).

The cluster key is defined as a combination of the following column types:

  • 16-byte GUID
  • varbinary (16)
  • nvarchar (512)
  • nvarchar (256)
  • tinyint

Now, the wide cluster key isn't a big deal UNLESS there are nonclustered indexes, but there are in this case - so the cluster key is included in all nonclustered index rows. And the random GUID high-order key is always a bad idea, as it means the clustered index will be heavily fragmented as records are inserted. This is all simplified and generalizations (and I open this can of worms happily) - but you get the idea.

Good design up-front, with an understanding of how key choice affects the behavior of SQL Server and how indexes are stored and indexed, can lead to vastly reduced performance problems and maintenance issues.

Quickie this morning to start the day off. I saw a question on a forum: if I *have* to use a GUID and *must* have a primary key, should I make the primary key clustered or nonclustered?

Now, I'm not getting into the whole GUID vs. bigint identifier, or random GUID vs. GUID generated by NEWSEQUENTIALID(), so please don't comment on those issues, they're not relevant here. I just want to address the question - what kind of index should it be?

From a Storage Engine perspective, my answer is nonclustered. Here are three reasons why:

  • If the index is clustered, then the cluster key is immediately at least 16 bytes (the size of a GUID). This doesn't change the size of the clustered index records (as the GUID column has to be stored in the table anyway, and a clustered index IS the table), but it does change the size of the nonclustered indexes. All nonclustered indexes on the table must include the cluster keys, even of they are not explicitly part of the nonclustered index keys (I'll do a post on this later). This means the GUID is present in every nonclustered index record too. From this perspective, it would be better to use a smaller clustered index key and have the GUID primary ley be nonclustered so it's only present in that one nonclustered index.
  • Random GUIDs used as the high-order key cause index fragmentation. Their random nature means the insertion point into the index is also random. This causes page splits, which cause fragmentation and are *expensive*. (I touched on this a bit a few days ago in my post How expensive are page splits in terms of transaction log?.). With a random key value, it's hard to avoid page splits and fragmentation, although you can delay them somewhat using FILLFACTOR, but at the expense of using extra space. By making the GUID index nonclustered, you can delay page splits even further. The clustered index is the table, so the records are (usually always) larger than nonclustered index records. This means you can get fewer clustered index records on an 8KB page than nonclustered index records. With fewer records per page, you can do fewer random insertions on the page before a page split occurs. So using a nonclustered index for the GUID key means you can do fewer expensive page splits.
  • Given that whatever kind of index you create for the GUID key is going to experience index fragmentation, you're probably going to want to periodically remove the fragmentation as part of your database maintenance plan. It makes sense to try to limit the amount of resources used by the fragmentation removal operation (e.g. cpu, IO, disk space, transaction log space), and so the smaller the fragmented index, the better. A nonclustered index for the GUID key will be smaller than a clustered index, so if you choose a non-fragmentation-causing clustered index key, and confine the fragmentation to the nonclustered index, you can use fewer resources during database maintenance.

And there you have it. I'm sure some of you have seen pathological cases that disprove one of the above points, but my arguments are generalizations. Maybe this is a can of worms I've opened, in which case I look forward to the comments!

PS Brent did a great post about humor when blogging, the cartoon links he includes are great. Check it out here.

I've just been setting up some of our new hardware, and wanted to do some background reading to ensure I use the correct disk partition offset, RAID stripe size, and NTFS allocation unit size to enable the best possible performance for the volumes I'm creating.

You may not of heard about this (or your disk admins may not have heard about this) but on Windows Server 2003 and before, the default partition offset typically causes worse-than-optimal performance - and correcting it can get gains of maybe as high as 30% in terms of IO latency and duration. The SQLCAT team have just published a *fantastic* whitepaper (written by Jimmy May and Denny Lee) which explains the issue simply and clearly and shows you how to correct it. You should checkout the whitepaper at Disk Partition Alignment Best Practices for SQL Server.

The summary is that on Windows Server 2003 and before, the default partition offset is 31.5KB (63 x 512byte disk sectors), which does not align nicely with the common RAID stripe sizes of 64K or 128K, or the optimal NTFS allocation unit size of 64KB. This can lead to having to read/write multiple stripes every so often and a big perf drop. It can be fixed, as detailed in the whitepaper. For volumes *created* on Windows Server 2008, the problem does not exist as it creates a default partition offset of 1024KB.

In fact Jimmy just published a blog post to help you make the case to your disk admins/customers a few days ago: Disk Partition Alignment (Sector Alignment): Make the Case: Save Hundreds of Thousands of Dollars.

Luckily I'm using Windows Server 2008, which correctly sets the disk partition for the vast majority of cases.

Next thing I considered was RAID stripe size and NTFS allocation unit size (previously known as 'cluster size'). Kendal Van Dyke just published an *excellent* blog post series that provides a lot of empirical evidence as to what the best numbers are for the RAID level you're using. This saved me a lot of time. Check out his series at Disk Performance Hands On Series.

The Dell MD3000i units I'm using don't go any lower than 128KB for a RAID stripe size, so the default is fine. Unfortunately, I forgot to set the NTFS allocation unit size to 64KB when creating the partitions in Windows, so I need to recreate the partitions.

A massive thank-you to these guys for saving me a lot of time and hassle. You should go read this stuff too.

Page splits are always thought of as expensive, but just how bad are they? In this post I want to create an example to show how much more transaction log is created when a page in an index has to split. I'm going to use the sys.dm_tran_database_transactions DMV to show how much more transaction log is generated when a page has to split. You can find the list of columns and a small amount of explanation of each column in Books Online here - I was reminded of its existence by someone on Twitter (sorry, don't remember who it was and I couldn't find it in search).

In the example, I'm going to create a table with approximately 1000-byte long rows:

CREATE DATABASE PageSplitTest;
GO
USE pagesplittest;
GO

CREATE TABLE BigRows (c1 INT, c2 CHAR (1000));
CREATE CLUSTERED INDEX BigRows_CL ON BigRows (c1);
GO

INSERT INTO BigRows VALUES (1, 'a');
INSERT INTO BigRows VALUES (2, 'a');
INSERT INTO BigRows VALUES (3, 'a');
INSERT INTO BigRows VALUES (4, 'a');
INSERT INTO BigRows VALUES (6, 'a');
INSERT INTO BigRows VALUES (7, 'a');
GO

I've engineered the case where the clustered index data page has space for one more row, and I've left a 'gap' at c1=5. Let's add it as part of an explicit transaction and see how much transaction log is generated:

BEGIN TRAN
INSERT INTO BigRows VALUES (8, 'a');
GO

SELECT [database_transaction_log_bytes_used] FROM sys.dm_tran_database_transactions
WHERE [database_id] = DB_ID ('PageSplitTest');
GO

database_transaction_log_bytes_used
-----------------------------------
1228

That's about what I'd expect for that row. Now what about when I cause a page split by inserting the 'missing' c1=5 row into the full page?

-- commit previous transaction
COMMIT TRAN
GO

BEGIN TRAN
INSERT INTO BigRows VALUES (5, 'a');
GO

SELECT [database_transaction_log_bytes_used] FROM sys.dm_tran_database_transactions
WHERE [database_id] = DB_ID ('PageSplitTest');
GO

database_transaction_log_bytes_used
-----------------------------------
6724

Wow. 5.5x more bytes are written to the transaction log as part of the system transaction that does the split.

The ratio gets worse as the row size gets smaller. For a row with an approximately 100-byte long row (use the same code as above, but change to a CHAR (100), insert 67 rows with a 'gap' somewhere then insert the 68th to cause the split), the two numbers are 328 and 5924 - the split cause 18 times more log to be generated! For a row with an approximately 10-byte long row, I got numbers of 240 and 10436, because I created skewed data (about 256 rows with the key value 8) and then inserted key value 5 which forced a (rare) non-middle page split. That's a ratio of more than 43 times more log generated! You can try this yourself if you want: I changed the code to have a CHAR (10), inserted values 1, 2, 3, 4, 6, 7, then inserted 256 key values of 8 and then 2 of 5. The resulting page had only 6 rows - it split after the key value 5 - the Storage Engine doesn't always do a 50/50 page split. And that's not even causing nasty cascading page-splits, or splits that have to split a page multiple times to fit a new (variable-sized) row in.

Bottom line: page splits don't just cause extra IOs and index fragmentation, they generate a *lot* more transaction log. And all that log has to be (potentially) backed up, log shipped, mirrored....

Another quickie today. If you're using Activity Monitor in SQL Server 2008, or Performance Data Collection with the Server Activity system data collection set, then you may see a constant, and large number of 'Other' SQL Server Waits. When you drill into these, you see that the highest number of waits are for FSAgent. The FSAgent is part of the FILESTREAM subsystem, and it's what fools the rest of the Storage Engine into accepting the FILESTREAM data as if it was real varbinary(max) data. If you look in the Books Online entry for sys.dm_os_wait_stats, which defines all the different wait types, it explains that the wait should only show up when FILESTREAM I/Os are occuring. However, this wait type shows up *all* the time - clearly a bug.

Turns out that it's a known bug that's been fixed - see KB 958942.

This is a quick follow-on from my Misconceptions around null bitmap size post.

The null bitmap is *always* present in a data record (i.e. records in a heap or the leaf-level of a clustered index), but is optional in index records if all the columns in the index records are not nullable. The misconception is around what happens when a new column is added to the table. The common misconception is that if you have 8 columns in the table (and hence 8 bits in the null bitmap), if you add a ninth column then SQL Server has to go update every record so the null bitmaps all contain 9 bits. (Same misconception applies to adding the 17th, 25th, 33rd, etc column).

This is usually not true. Let's consider the cases:

  • New column is nullable, with a NULL default. The table's metadata records the fact that the new column exists but may not be in the record. This is why the null bitmap also has a count of the number of columns in that particular record. SQL Server can work out whether a column is present in the record or not. So - this is NOT a size-of-data operation - the existing table records are not updated when the new column is added. The records will be updated only when they are updated for some other operation.
  • New column is nullable, with a non-NULL default. This IS a size-of-data operation. The non-NULL default forces all existing records to be updated when the column is added, and so the null bitmap will be updated too.
  • New column is not-nullable (obviously with a non-NULL default). This IS a size-of-data operation, for the same reasons as above.

Hope this helps.

5/11/09: Little bit of a rewrite today as it seems some people are taking what I'm saying the wrong way.

Over on Ed Bott's blog, last week he showed some inside info about the notorious "SQL Server problem" that caused the Windows 7 RC downloads to be so slow. The reason given was that SQL Server fragmentation caused the server to spike CPU and an index rebuild fixed the problem. Then the SQL CAT team posted that the problem was (paraphrasing) that the download team had only planned for a 100% jump in traffic, and got a 500% jump. The server (me: for a so-far undisclosed reason) couldn't handle the load, so a new server was dropped in and (me: this is interesting) operational practices were changed. Today (5/11/09) I find out that there's an ongoing investigation into just why the server couldn't handle the load. Notice I'm not saying the SQL Server was deficient in any way, but that the server as a whole couldn't handle the load.

Based on the initial information, and the CAT team not stating what the actual problem was, plus they had to change an operational practice (e.g. more frequent index rebuilds to remove fragmentation), we (Kimberly and I) came up with a hypothesis as to what might have happened. Remember that saying that the server wasn't setup for a 500% jump in load could mean a very large number of things. It could mean that the raw number of inserts overloaded the hardware on the server. It could mean that the application logic didn't scale to the extra workload. It could mean that the database schema didn't scale to the extra workload. The latter is the one that we're hypothesizing.

Whatever the actual technical issue was, there are two real problems: 1) the lack of planning, whether it was in terms of hardware capacity, application logic, schema design - whatever 2) the way the problem was handled by the MS PR folks, which not only made out that it was a SQL problem, but made people like me think that of-course-it-wasn't-a-SQL-problem, but it must have been a design/schema problem.

I discussed this with Kimberly and here's what we think happened (from her initial idea). There's been some discussion on how simply downloading a package could cause fragmentation - it's just a SELECT right? Wrong. If you look at what happens when you download, there's a GUID (Globally Unique IDentifier) that gets generated for your download, and there must be a table that tracks downloads, so the GUID is entered into a table. The table most likely has a clustered index, with the random GUID value as the primary key (or the high-order key in a composite key).

Every download thus produces an insert into the clustered index at a random location in the index. As the index pages get full, they eventually need to split, so more inserts can occur in the index at points defined by the random key. These 'page splits' are very expensive, causing lots of IO, log record generation, and fragmentation (both logical fragmentation that interrupts range scan performance and reducing page density from pages becoming less full). All of this takes CPU, which explains some of the spiked CPU from all the downloads. If there are queries that are scanning this data too, the fragmentation would cause terrible performance issues, resulting in more, smaller IOs - and also adding to the CPU load. Rebuilding the index (as was explained in the original blog post) would remove the existing fragmentation, increasing performance again, but wouldn't remove the page-split issue - which is why (apparently) they're going to rebuild that index every night now.

(This is the 100-level explanation for non-SQL geeks, a more in-depth explanation can be found on Kimberly's blog at GUIDs as PRIMARY KEYs and/or the clustering key.)

One of the comments below from Andrew Kelly suggests it could also be an ad-hoc plan problem - again, not a SQL Server issue, but a problem with the application. That could equally be explained by unplanned workload blowing capacity.

So was this a SQL Server issue? No. SQL Server did exactly as it was told, and has no choice where in the index to insert the new records - the GUIDs provide the insertion point. The problem was in the developer who created that schema without testing it with a very high workload - and the lack of a database maintenance plan to pro-actively find and resolve fragmentation issues, etc, etc. 

Now, this is pure conjecture, based on the facts that have come out so far. However, nothing that's been said since the issue (by the SQL team or anyone else) has made me question the hypothesis that some design/app issue led to the problem. As I said, capacity planning can mean a huge number of things - but saying that it's a capacity issue certainly doesn't mean it's NOT a schema design/app logic issue.

Once we hear what the root cause of the capacity overload problem was, I'll post more details. In the meantime, of course I'm not making out that it's a SQL Server problem - if you tell SQL Server to (for instance) use random GUIDs a the clustered, primary key - under high load it's not going to perform well. Inherent SQL Server problem? No. Human problem? Yes.

I hope this rewritten version is clearer now.

Categories:
General | Performance

Here's an interesting bug that was just recently fixed, and someone actually hit it today, so it's time to blog it.

Bottom line is that if you're doing NOLOCK scans of tables involving LOB data, the perfmance might suck. 2005 SP3 CU3 has the bug fix, but unfortunately the KB article was written by someone who didn't seem to know what they were talking about, so I suggested it was rewritten, which it has been (but not republished yet). The original KB article is at http://support.microsoft.com/kb/961049/. It's also fixed in 2008, but I don't know which build (I believe CU4 at least).

Here's my explanation of the problem.

LOB data can be stored in-row or off-row (my previous LOB post Importance of choosing the right LOB storage technique has more details). When it's stored off-row, it must be accessed by first reading the pointer to the LOB data from the data/index record, and then following the pointer (remember that 'record' is synonymous with 'row'). When a LOB data value is updated, the off-row value is updated first, and then if the off-row link changed, the data/index record is updated with the new link. There's obvioulsy a window here, where someone reading the data/index record might see the wrong/non-existent off-row pointer. This is exactly what can happen when a NOLOCK scan occurs.

To mitigate the possibility of a NOLOCK scan trying to follow a bad off-row link, the old behavior was to scan all the IAM pages (see Inside the Storage Engine: IAM pages, IAM chains, and allocation units) for the table/index to make sure that the off-row link actually pointed to a page allocated to the table/index. If there are lots of IAM pages, this means lots of logical IOs, and poorly performing SELECT queries. And it does the scan once for *every row*. The person that hit it today had a 500 row select of ~20KB per row taking 20 seconds - 10MB of physical IOs and 30MB of logical IOs!

The fix is to make further use of an already existing in-memory cache of IAM pages to do a quick lookup of the right IAM page covering the GAM interval of the LOB page being read, without having to scan the whole IAM chain.

Hope this helps explain things, and track down perf problems for some of you.

A couple of weeks ago I kicked off the latest survey, on what the physical layout of your databases are and why you have them that way (see here for the survey). I let the survey run for a while to get a good sampling, and I wasn't disappointed, with over 1000 responses! Here are the results as of 4/27/2009.

Just like any other 'best practice' kind of topic, the question of how to design the physical layout of a database provokes a lot of (sometimes heated) discussion. There are lots of options and there are even more factors to consider - so the best answer is my perennial favorite "it depends"! In this post, I don't want to tell you how I think you should layout your database - instead I want to discuss some of the options and let you make up your own mind, with the added benefit of data on what your peers are doing with their databases. The main point of this survey was to see what people are doing, rather than as a driver for an editorial blog post.

As you can clearly see from the results above, and predictably, the distribution of layout types shifts as the database size increases - but I was very surprised by the number of single file databases over 10GB. Rather than go through each option in the survey, I'm going to talk a bit about some of the things to consider when planning a layout.

Underlying I/O subsystem

This could be the most important factor to consider. If you only have a single physical drive, for instance, there's arguably not much point creating multiple data files, as that will force the disk heads to bounce back and forth to the different file locations on the disk. On the other hand, if you have a SAN with several thousand drives grouped together into multiple LUNs, your possibilities are a lot wider (and maybe much harder to come up with the optimal layout). Several people asked if I'd go into depth around having multiple controllers, and different drive layouts in a SAN - and my answer is no. I'm not an expert at storage design, which, like indexing, is both an art and a science. There's a good whitepaper that discusses some of this: Physical Database Storage Design, which I helped review back in my MS days.

Performance, recoverability, manageability

Having multiple files with separate storage for each allows reads and writes to be parallelized for increased performance, lowering the amount of disk head contention. When a checkpoint occurs, and pages are written to disk, spreading the I/O load over multiples files can speed up the checkpoint and reduce the IOPS spikes that you may see. It can also lead to reduced contention for the various allocation bitmaps - in the same way as I've described for tempdb. In user databases with a very high rate of allocations, contention can arise on the GAM pages - but it's not common. Some people also advocate having separate filegroups for tables and indexes, and although this can sometimes be more trouble than it's worth, and often turns into a religious debate, I have heard of people getting a perf boost from this.

One of the most convincing reasons (I find) for having multiple filegroups is the ability to do much more targeted recovery. With a single file database, if it gets corrupted or lost, you have to restore the whole database, no matter how large the file is - and this can seriously affect your ability to recover within the RTO (Recovery Time Objective) agreement. By splitting the database into multiple filegroups, you can make use of partial database availability and online piecemeal restores (in Enterprise Edition) to allow the database to be online as soon as the primary filegroup is online, and then restore the remaining filegroups in priority order - bringing the application online as soon as the relevant filegroups are online. You can even use this layout to spread your backup workload - moving to filegroup-based backups instead of database backups, although this isn't very common.

As far as manageability is concerned, there are a few reasons to have multiple filegroups. Firstly, you can isolate a table that requires a lot of I/O (e.g. in terms of index maintenance) on separate storage from other tables, so that maintenance operations (and the I/O overhead of doing them) doesn't interfere with the I/O of the other tables. Also, you can provision different kinds of storage for different tables - in terms of disk speed and RAID level (redundancy), for instance. If you want to be able to move data around, you can do it much more easily if the database is split up, than if it's a single file.

Summary 

Ok - so I lied. I *am* going to offer advice - against one of the options: single filegroup, single file. For smaller databases, this is fine - but as the database size gets larger, say, over tens of GB, then having a single file can become a serious liability. With a single file database (or even a single filegroup database), you lose most of the benefits mentioned above.

Bottom line - as your databases get larger, you're going to need to think more carefully about their layout, otherwise you could run into big problems as your workload increases or when disaster strikes. As the survey results show, this is what your peers are doing.

Next post - this week's survey!

Finally! All those customers out there who are frustrated by not being able to turn on the 'lock pages in memory' option to protect the SQL working set on Standard Edition - the wait is nearly over. Bob Ward (Principal Escalation Engineer in CSS, and a good friend of mine) announced at PASS Europe that the option is soon to be supported - in the May CU for SQL 2008 and the June CU for SQL 2005. You can read a little more in his blog entry here.

If you don't know what this is, checkout this entry in Books Online - it basically prevents Windows paging out SQL memory to disk.

[Edit: the bits containing this fix have been released for 2008. See here for details.]

Categories:
Bugfixes | Memory | Performance

There's been a recent flurry of confusion and misconceptions about trace flag 1118 in SQL Server 2008. This trace flag switches allocations in tempdb from single-page at a time for the first 8 pages, to immediately allocate an extent (8 pages). It's used to help alleviate allocation bitmap contention in tempdb under a heavy load of small temp table creation and deletion.

There are multiple points of confusion, which I'll address in turn. Then I'll prove that the trace flag still works in SQL Server 2008.

1) Why is the trace flag usually required in 2000? In SQL 2000, whenever a temp table is created in tempdb and a row inserted, an IAM page must be allocated and a single data page must be allocated. These two pages are both 'single-page' allocations, from a mixed extent (see Inside The Storage Engine: Anatomy of an extent for more info). This means that an SGAM allocation bitmap page must be accessed, and a PFS page must be accessed (see Inside The Storage Engine: GAM, SGAM, PFS and other allocation maps for more info).

With lots of very small temp tables being created, this means the very first SGAM page and the very first PFS page in the data file are accessed/changed by all the threads, leading to latch contention problems on these two pages. When the temp tables are deleted again, the various pages are deallocated, which again needs to access and change the PFS page, and potentially the SGAM page.

There are two ways to alleviate this problem. Firstly, create multiple data files in tempdb - which splits the latch contention over multiple allocation bitmaps (from having allocations come from multiple files) and thus reduces the contention. The general rule of thumb was one tempdb data file for each processor core. Secondly, turn on TF1118, which makes the first 8 data pages in the temp table come from a dedicated extent. This means one extent is allocated from the GAM page, rather than 8 single pages (and potentially 8 accesses to the SGAM page). The pages within the extent are reserved and allocated singly from this extent, as needed. This also cuts down on contention and is documented in KB 328551.

2) What does reserved vs. allocated mean? When an extent is allocated to a table, the 8 pages in the extent are not immediately allocated as well. Allocating an extent means those 8 pages are reserved exclusively for subsequent allocation to that table. The pages are allocated individually as needed, but no other table can allocate them. This is why such extents are called 'dedicated' extents (see my blog post link above for more details). You can see the counters of reserved pages vs. allocated pages in the output from sp_spaceused.

3) Why is the trace flag not required so much in 2005 and 2008? In SQL Server 2005, my team changed the allocation system for tempdb to reduce the possibility of contention. There is now a cache of temp tables. When a new temp table is created on a cold system (just after startup) it uses the same mechanism as for SQL 2000. When it is dropped though, instead of all the pages being deallocated completely, one IAM page and one data page are left allocated, and the temp table is put into a special cache. Subsequent temp table creations will look in the cache to see if they can just grab a pre-created temp table 'off the shelf'. If so, this avoids accessing the allocation bitmaps completely. The temp table cache isn't huge (I think it's 32 tables), but this can still lead to a *big* drop in latch contention in tempdb.

4) Does the trace flag still exist in 2005 and 2008? Yes it does - KB 328551 clearly states:

Note Trace flag -T1118 is also available and supported in Microsoft SQL Server 2005 and SQL Server 2008. However, if you are running SQL Server 2005 or SQL Server 2008, you do not have to apply any hotfix.

Just to make extra-sure (as I'm always paranoid about saying absolutes), I checked with my good friend Ryan Stonecipher, who's the dev lead for the team that owns allocation (and a bunch of other stuff, including DBCC). He confirmed the code is exactly the same in 2008 as it was in 2005. And I prove it to you below too.

5) And why is it sill there in 2005 and 2008? It does the same thing in 2005/2008 as it did in 2000. If the temp table creation/deletion workload is high enough, you can still see latch contention, as the temp table cache won't be enough to completely alleviate the need for creating actual new temp tables, rather than just being able to grab one 'off the shelf'. In that case, using the trace flag to change to extent-based allocation (in *exactly* the same way as for 2000) can help, as can creating more tempdb data files.

As far as data files go though, the number has changed. Instead of a 1-1 mapping between processor cores and tempdb data files (*IF* there's latch contention), now you don't need so many - so the recommendation from the SQL team is the number of data files should be 1/4 to 1/2 the number of processor cores (again, only *IF* you have latch contention). The SQL CAT team has also found that in 2005 and 2008, there's usually no gain from having more than 8 tempdb data files, even for systems with larger numbers of processor cores. Warning: generalization - your mileage may vary - don't post a comment saying this is wrong because your system benefits from 12 data files. It's a generalization, to which there are always exceptions.

6) Why does DBCC IND still show two pages, even with the trace flag on? I've heard of some people being confused by the output of DBCC IND in SQL 2008 when the trace flag is turned on. Creating a single row temp table will only show two pages allocated in the DBCC output - one IAM page and one data page. Yes, that's completely correct - as only two pages are allocated, but the data page comes from a dedicated extent, not a mixed extent. (IAM pages are *always* single-page allocations from mixed-extents).

And now the proof, on SQL 2008.

SELECT @@VERSION;
GO

Microsoft SQL Server 2008 (RTM) - 10.0.1600.22 (Intel X86)

First off, I'll create a temp table without the trace flag enabled, and see what pages the table has allocated, by looking at the first IAM. I'll use a temp table with an 8000+ byte row size, and insert two rows - so we have two data pages for clarity.

DBCC TRACEOFF (1118, -1);
GO

USE tempdb;
GO

CREATE TABLE #temp (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'a');
GO
INSERT INTO #temp DEFAULT VALUES;
GO 2

Now I'll figure out what is the first IAM page, using my sp_AllocationMetadata script (see here for the script and details), and dump it with DBCC PAGE to see the single-page allocations it's tracking, and which dedicated extents are allocated to the table:

EXEC sp_AllocationMetadata '#temp';
GO

Object Name    Index ID  Alloc Unit ID        Alloc Unit Type  First Page  Root Page  First IAM Page
-------------- --------- -------------------- ---------------- ----------- ---------- ---------------
#temp__<snip>  0         1152921505223016448  IN_ROW_DATA      (1:158)     (0:0)      (1:199)

DBCC TRACEON (3604);
GO
DBCC PAGE ('tempdb', 1, 199, 3);
GO

<snip>

IAM: Single Page Allocations @0x4A35C08E

Slot 0 = (1:158)                     Slot 1 = (1:200)                     Slot 2 = (0:0)
Slot 3 = (0:0)                       Slot 4 = (0:0)                       Slot 5 = (0:0)
Slot 6 = (0:0)                       Slot 7 = (0:0)                      


IAM: Extent Alloc Status Slot 1 @0x4A35C0C2

(1:0)        - (1:1016)     = NOT ALLOCATED   

As you can clearly see from the partial output of the dump of the IAM page, there are two single-page allocations and no extents allocated to the temp table. This is what should happen when the trace flag is not enabled.

Now I'll do the same thing with the trace flag 1118 enabled.

USE tempdb;
GO

DROP TABLE #temp;
GO

DBCC TRACEON (1118, -1);
GO

CREATE TABLE #temp (c1 INT IDENTITY, c2 CHAR (8000) DEFAULT 'a');
GO
INSERT INTO #temp DEFAULT VALUES;
GO 2

EXEC sp_AllocationMetadata '#temp';
GO

Object Name    Index ID  Alloc Unit ID        Alloc Unit Type  First Page  Root Page  First IAM Page
-------------- --------- -------------------- ---------------- ----------- ---------- ---------------
#temp__<snip>  0         1224979099301904384  IN_ROW_DATA      (1:208)     (0:0)      (1:158)

DBCC TRACEON (3604);
GO
DBCC PAGE ('tempdb', 1, 158, 3);
GO

<snip>

IAM: Single Page Allocations @0x4A8FC08E

Slot 0 = (0:0)                       Slot 1 = (0:0)                       Slot 2 = (0:0)
Slot 3 = (0:0)                       Slot 4 = (0:0)                       Slot 5 = (0:0)
Slot 6 = (0:0)                       Slot 7 = (0:0)                      


IAM: Extent Alloc Status Slot 1 @0x4A8FC0C2

(1:0)        - (1:200)      = NOT ALLOCATED                              
(1:208)      -              =     ALLOCATED                              
(1:216)      - (1:1016)     = NOT ALLOCATED
                              

Now as you can clearly see, there are no single-page allocations, and there's a single extent allocated to the table. Proof that trace flag 1118 still does exactly what it should in SQL Server 2008.

Now for a DBCC IND on the table: 

DBCC IND ('tempdb', '#temp', -1);
GO

PageFID PagePID     IAMFID IAMPID      ObjectID    IndexID    
------- ----------- ------ ----------- ----------- -----------
1       158         NULL   NULL        293576084   0
1       208         1      158         293576084   0
1       209         1      158         293576084   0

(I've removed some of the trailing columns for clarity.) We see that it still only lists the two data pages (1:208, 1:209) and the IAM page (1:158) - although an entire extent was allocated to the temp table, only two pages from the extent were actually allocated and used - the rest are reserved for use by that table, but remain unallocated. 

Hopefully this post has cleared up a lot of the confusion around this trace flag and what it does.

A short post to start the day (in India) and then at lunchtime I'll do how it works: FILESTREAM garbage collection.

This is a question that came up recently on an MVP discussion list: why does a 200MB log backup take 40 minutes to restore on my log-shipping secondary?

First answer I thought of was that the I/O subsystem on the log-shipping secondary was very slow, or broken in some way, so I/Os were taking ages to complete. Nope - I/O subsystem was performing well.

Next thing I suggested was the the log-shipping secondary was restoring the log backups using WITH STANDBY and there was something like an index rebuild that was being rolled back. As an aside, when you restore log backups on the log-shipping secondary, you have a choice how they recovery: either WITH NORECOVERY or WITH STANDBY. The first option doesn't allow any access to the database, as the database is still "in recovery". The second option runs the REDO part of recovery, then runs the UNDO part of recovery, but saves the details of what it had to do for UNDO into a file (who's name and location you specify). It then allows read-only access to the database, for queries/reporting/whatever. Obviously if there's a lot of transaction log that has to be undone (i.e. rolling back transactions that weren't committed at the time the log backup completed), then this could take some time. But 40 minutes? No. That wasn't it.

I was partly right on my second guess. The *previous* log backup that was restored WITH STANDBY contained a long-running index operation, and so the undo file that the restore created was *huge*. The next log backup that's restored after a RESTORE ... WITH STANDBY, must first undo everything in the undo file (i.e. put the database back to the exact state it was in as if the WITH STANDBY part of the restore never happened) before it can restore more transaction log. In this case, no-one had noticed that the previous log restore *also* took a lot of time and created the huge undo file. It just looked like the 200MB log backup was causing the problem.

My advice: in any kind of there's-suddenly-a-corruption-or-performance-problem situation, don't just look at what's happening now. Always look at what happened leading up to the problem, as the answer is often there.

Microsoft's popular performance troubleshooting whitepaper has been updated for SQL Server 2008. You can download Troubleshooting Performance Problems in SQL Server 2008 at http://download.microsoft.com/download/D/B/D/DBDE7972-1EB9-470A-BA18-58849DB3EB3B/TShootPerfProbs2008.docx.

I've also added it to our whitepaper-links collection page at http://www.sqlskills.com/whitepapers.asp.

Categories:
Performance | Whitepapers

At the last few conferences I've presented at, there have been questions about using SSDs (Solid-State Drives) for enterprise storage and whether that will change some of the database maintenance practices. My answer to that is "I don't know" (ha - bet you thought I was going to say "It depends!") because adoption of SSDs is very low. I haven't been able to find much info about using them, but the Microsoft Research group in Cambridge just published a research paper Migrating Sever Storage to SSDs: Analysis of Tradeoffs, which does a nice job of walking through the issues involved and concludes that for the majority of workloads, it makes more economic sense to host them on HDDs. The exception is for top-end OLTP databases. I'll warn you that this isn't a whitepaper - it's a research paper, and gets a bit deep into algorithms and mathematical analyses, but if you're up to the challenge it's a great read.

You can download the paper from http://research.microsoft.com/en-us/um/people/antr/ms/ssd.pdf. Enjoy!

PS I found it on James Hamilton's blog.

This week's survey is a little more complicated. I'm interested in the physical layout of your databases. I've got four surveys, for a variety of database sizes. Please vote multiple times in each survey, as you see fit - and by all means forward this link to your friends/clients/etc or re-blog/tweet it. I'm going to report on this survey in two weeks, to give a bit more time for people to respond (and because we're travelling next week). I think we're going to see some interesting statistics come out of this - the more people that respond the better. I'll report on it 4/24/09.

One thing to note - this is just for user databases, not for tempdb. In the surveys, "multiple filegroups" implies multiple files too, and if you don't have them spread exactly one per drive/etc, just choose that option - I only have 10 options to choose from in the free surveys. 

(If you're in the over 1TB range and have multiple files/filesgroups spread over multiple drives/arrays/LUNs, vote using the last option on the >1TB survey and I'll lump them together.) 

Phew - thanks!

Last week's survey was on how you should store large-value character data in SQL 2005+ (see here for the survey). Here are the result as of 4/3/2009 - and I think my favorite answer is starting to catch-on:

My favorite answer is, of course, it depends! For all those who didn't answer 'it depends', your answer is valid, but only for particular circumstances, as each method has its pros and cons and won't be applicable in all cases. It's extremely important when designing a schema to consider how to store LOB data, as making the wrong choice can lead to nasty performance issues (where 'performance' is a catch-all to include things like slow queries, fragmentation, and wasted space). Now I'd like to run through each of the options and detail what I think of as the pros and cons. A couple of definitions first: 'in-row' means the column value is stored in the data or index record with the other columns; 'out-of-row' or 'off-row' means the column value is stored in a text page somewhere in the data file(s), with a physical pointer stored in the data/index record (taking either 16 or 24 bytes itself).

  • As a N/CHAR column. This is a great choice when the data that's stored in the column is a fixed size all the time, and always uses the full width of the column. Any time that the data may be smaller than the defined wdith of the column, space is being wasted in the row. Wasted space leads to fewer rows per page, more disk space being used to store the data, more I/Os to read the data, and more memory used in the buffer pool. However, if the character values are very volatile, and can change size, then having a fixed-width column can avoid the problem of a row having to expand and there not being enough space on the page to allow that - leading to a fragmentation-causing page split in an index (or forwarding record in a heap). There's a tipping point that can be hard to identify for your particular application...
  • As a N/VARCHAR (1-8000) column. For data values less than 8000 bytes, this is the common choice as it avoids wasted space. However, if the application can change the size of the data after the initial creation of the row, then there is the possibility of fragmentation occuring through page-splits. In SQL Server 2005+, a row can also be created that is more than 8060 bytes - one or more variable-length columns is pushed into off-row storage and replaced by a physical pointer. This means any access of the column has to do an extra I/O to reach the data - and this is commonly a physical I/O as the text page is not already in memory. This can lead to hard-to-diagnose performance issues if a query selects the column and some rows have the data in-row, and some out-of-row. Also, if the data values tend towards the larger end of the 1-8000 byte spectrum, individual rows can become vary large, leading to very few rows per page - and the problems described in the first option. If the data isn't used very much, then storing it in-row like this isn't very efficient.
  • As a N/VARCHAR (MAX) column in-row. This has the same pros and cons as the option above, with the added benefit that the value can grow larger than 8000 bytes. In that case it will be pushed off-row automatically, and start incurring the extra I/O for each access. These data types also work with the intrinsic functions in the same way as the character data types discussed above. I guess one drawback of this type compared to FILESTREAM is that it's limited to 2GB. Also, if there's a LOB data column in the table definition, the table's clustered index cannot have online operations peformed on it - even if all the LOB values are NULL or stored in-row!
  • As a N/VARCHAR (MAX) column out-of-row. The drawback of storing this data out-of-row is that accessing it requires an extra I/O to retrieve it, but if the data isn't used very much then this is an efficient way to go, but still uses space in-row to store the off-row pointer. An additional benefit of storing the data off-row is that it can be placed in a separate filegroup, possibly on less expensive storage (e.g. RAID 5) - but then there's the drawback that it can't be moved after being created except with an export/import operation. This option has the same online operations drawback as storing the data in-row.
  • As a N/TEXT column in-row. This has the same pros and cons as the N/VARCHAR (MAX) column in-row option, but these data types are deprecated and don't work with the majority of the intrinsic functions.
  • As a N/TEXT column out-of-row. Same as above.
  • In a seperate table and JOIN to it when required. This option is great when the data isn't used very much, as it doesn't require any storage at all in the main table (except for a value to use for the JOIN), but it does require some extra up-front design and slightly more complicated queries. There's another HUGE benefit to doing this - by moving the LOB data to another table, online operations become available on the main table's clustered index. (This concept is 'vertical partitioning' a huge topic in itself...)
  • As a FILESTREAM column. (Yes, I didn't have this in the survey, but it's a possibility). If your data values are going to be more than 1MB, then you may want to consider using the FILESTREAM data type in SQL 2008 to allow much faster access to the data than having to read it through the buffer pool before giving it to the client. There are lots of pros and cons to using FILESTREAM - see my whitepaper for more info here.

So, as you can see, the best answer for a general question like this is definitely It Depends!. Although I haven't covered every facet of each storage option, the aim of this post is to show that it is very important to consider the implications of the method you choose, as it could lead to performance problems down the line.

Next post - this week's survey!

(Continuing my habit of blogging while Kimberly's presenting - at least I'm not on stage this time...)

In early versions of SQL Server, it was sometimes necessary to 'pin' the pages from a heavily used table in the buffer pool to help performance. In later versions, the buffer pool became very sophisticated and does an excellent job of keeping the right data in memory (basically using an LRU algorithm). I saw a blog post today which mentioned that you can get into trouble using DBCC PINTABLE if you pin a table in memory and then the table grows and grows, and ends up taking up such a large proportion of the database that it causes massive performance problems - absolutely true. The blog post also mentioned that DBCC PINTABLE was deprecated in SQL Server 2005.

That's also true - it's deprecated, but in SQL Server 2005 it actually does NOTHING at all. I personally removed all the code - so now it just returns a success message.

Categories:
DBCC | Performance

The April edition of TechNet Magazine is available on the web now and has the latest installment of my regular SQL Q&A column.

This month's topics are:

  • Disappearing errors with DBCC CHECKDB
  • Provisioning tempdb when moving from 2000 to 2008
  • Does fillfactor prevent fragmentation and should it be set instance-wide
  • Avoiding FILESTREAM performance problems

Check it out at http://technet.microsoft.com/en-us/magazine/2009.04.sqlqa.aspx

Back in 2005 Kimberly produced two very popular webcast series - an 11-part webcast series for TechNet called SQL Server 2005 for the IT Professional and a 10-part webcast series for MSDN called A Primer for Proper SQL Server Development. The webcast links and blog posts were broken for quite a while but now they're all fixed up and working again. I've created some web pages that link to all the webcasts and blog posts, along with abstracts. I've also included some more recent ones too and will be adding to the list over the next few weeks.

There's over 30 hours of good stuff to watch - check them out at http://www.sqlskills.com/webcasts.asp

Microsoft's Steffen Krause has written an excellent whitepaper on Tuning the Performance of Change Data Capture in SQL Server 2008, that I technically reviewed, and it was published late last year. To get an overview of Change Data Capture (CDC) before reading the whitepaper, see the TechNet Magazine article I wrote for the November issue, titled SQL Server 2008: Tracking Changes in Your Enterprise Database. The new whitepaper covers:

  • Configuration of sys.sp_cdc_enable_table parameters
  • Configuration of the CDC capture job
  • Using Extended Events to determine the performance and characteristics of CDC
  • Influence of CDC scan job parameters on CDC performance
  • Influence of workload characteristics on CDC performance
  • Influence of sys.sp_cdc_enable_table parameters on CDC performance
  • CDC cleanup job considerations
  • Transaction log file considerations

Check it out at http://msdn.microsoft.com/en-us/library/dd266396.aspx.

One of the things I mentioned in my recent TechNet Magazine article on Understanding Logging and Recovery was the need to manage the transaction log so the number of VLFs (virtual log files) does not get too large (with too large being more than, say, 100 VLFs). Linchi Shea (a fellow MVP) has just posted some performance numbers comparing insert/update/delete performance between a database with 16 VLFs and one with 20000 VLFs, representing a poorly managed transaction log. You can see his blog post here.

For those wishing to take charge of unruly transaction logs, checkout Kimberly's blog posts - starting with this one. And if you want to quickly know how many VLFs your transaction log has, use the undocumented DBCC LOGINFO command - the number of lines of output is the number of VLFs you have.

Next posts coming up will be photos - we're in Bangkok right now and I've got 3 Where In The World Are Paul and Kimberly posts queued up - St. Lucia, Hyderabad, and Bangkok.

Over the last few weeks Sunil Agarwal (from the SQL Storage Engine team) has posted a great series of blog articles about tempdb and the version store, over on my old stomping ground - the Storage Engine blog. The articles are well worth reading - the links are:

Enjoy!

In the previous blog post I mentioned our partner company in Australia, so I'd better explain...

We have a new partner company - SQLskills.com.au - run by our good friend and fellow MVP Greg Linwood, along-side his other company MyDBA, which provides DBA support, consulting and staffing services to customers world-wide. By extending SQLskills.com into Australia, we can provide world-class training to the burgeoning SQL Server base in and around Australia, without customers having to travel to the US. As well as providing their own custom courses, the SQLskills.com.au team will be teaching courses developed by me, Kimberly, and Bob Beauchemin - using only the best instructors that we've personally taught and approved.

Update 12/10/08: the Australia classes have been pushed out to June 2009 for various reasons - watch the blog for more details.

The courses have already been running the last two months to great success and I'm very excited to announce that Kimberly and I are coming to Australia in February 2009 to teach four classes in the SQLskills.com.au training facility in Melbourne. The classes we have planned are:

Click the links for more details and registration info. The internals course (or equivalent knowledge) is really a pre-requisite for the perf tuning and maintenance courses that immediately follow it, as these courses will be really in-depth.

Checkout the SQLskills.com.au site for other classes they offer for developers and BI specialists.

We hope to see you in Melbourne next year!

It's really scary how quickly time flies - seems like it was just last week when I last blogged about TechNet Magazine (actually it was a month ago when I blogged about my 2008 Change Tracking article - see here). Anyway, a new issue of TechNet Magazine has just come out and this one has the latest installment of my bi-monthly SQL Q&A column.

This month's topics are:

  • How row-overflow columns can lead to poor range scan performance, even on completely defragmented indexes.
  • Combining database mirroring and failover clustering without undesired failover behavior.
  • Adding differential backups to a full+log backup strategy to lower recovery time.
  • Memory settings for multi-instance failover clustering

You can get to the column online at http://technet.microsoft.com/en-us/magazine/dd228989.aspx.

Enjoy!

PS If you have any ideas for what would make a good SQL Q&A topic, please drop me a line - paul@sqlskills.com

During early SQL Server 2008 development, the Storage Engine team introduced additional minimally-logged functionality. And then it was removed again for RTM. Now it seems that they've bowed under customer pressure and have made the functionality available again (see Sunil's announcement here).

Using trace flag 610 in the RTM build, you enable the potential for minimal-logging when:

  • Bulk loading into an empty clustered index, with no nonclustered indexes
  • Bulk loading into a non-empty heap, with no nonclustered indexes

Sunil's previous blog post here gives more info on the required syntax - very useful!

Over the last few weeks I've had lots of questions about FILESTREAM performance and how to coax NTFS into scaling nicely. I just finished writing a 30-page whitepaper about FILESTREAM for the SQL Server team, which should be published before PASS in November (I'll blog the link when I have it). Although my whitepaper isn't strictly about performance, there is a long section about setting up your system to get high-performance from FILESTREAM. What I want to do in this blog post is give a bullet list of things to do that will help you get good performance. All of these are explained in more detail in the whitepaper.

Here you go, in no particular order:

  • Make sure you're storing the right-sized data in the right way. Jim Gray (et al) published a research paper a couple of years ago based called To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem?. To summarize the findings, BLOBs smaller than 256-KB should be stored in a database, and 1-MB or larger should be stored in the file-system. For those in-between, "it depends" - my favorite answer. The upshot of this is that you won't get good performance if you store lots of small BLOBs in FILESTREAM.
  • Use an appropriate RAID level for the NTFS volume that will host the FILESTREAM data container (the NTFS directory structure corresponding to the FILESTREAM filegroup in the database). Don't use RAID-5, for instance, for a write-intensive workload.
  • Use an appropriate disk technology. SCSI will be usually be faster than SATA/IDE, but more expensive. This is because SCSI drives usually have higher rotational speeds, so lower latency and seek times.
  • Whichever disk technology you choose, if SATA, ensure it supports NCQ, and if SCSI, ensure it supports CTQ. Both of these allow the drives to process multiple, interleaved IOs concurrently.
  • Separate the data containers. Separate them from each other, and separate them from other database data and log files. This avoids contention for the disk heads.
  • Defrag the NTFS volume if needed before setting up FILESTREAM, and periodically to maintain good scan performance
  • Turn off 8.3 name generation on the NTFS volume. This is an order-N algorithm that has to check that the new name generated doesn't collide with any existing names in the directory. This slows insert and update performance down *a lot*. Do this using the command line fsutil utility.
  • Turn off tracking of last access time using fsutil.
  • Set the NTFS cluster size appropriately. For BLOBs 1-MB or large, use a cluster size of 64-KB. This will help to reduce fragmentation.
  • A partial update of FILESTREAM data creates a new file. Batch lots of small updates into one large update to reduce churn.
  • When streaming the data back to the client, use an ~60-KB SMB buffer size (or multiples thereof). This is so that the buffers don't get overly fragmented as TCP/IP buffer are 64-KB.

Hope this helps!

At the user group meeting on Monday I spent some time explaining how GUIDs can cause fragmentation in clustered indexes AND in non-clustered indexes, even if the GUID isn't specifically included in the non-clustered index key. GUIDs are essentially random values (pseudo-random in ranges if generated using NEWSEQUENTIALID) that are also unique. Their uniqueness is what makes them attractive to many developers as a key value, without understanding the havoc they can cause in production in terms of fragmentation and poor query performance.

A GUID key causes fragmentation because of its randomness. The insertion point of a new record in an index is dictated by the value of the index key, so if the key value is random, so is the insertion point. This means that if an index page is full, a random insert that happens to have to go onto that page will cause a page split to make room for the new record. A page-split is where a new page is allocated and (as near as possible to) half the rows from the splitting page are moved to the new page. The new row is then inserted into one of the two pages, determined by the key value. Usually the newly allocated page is not physically contiguous to the splitting page, and so fragmentation has been caused. In this case *two* kinds of fragmentation have been caused - logical fragmentation (where the next logical page as determined by the index order is not the next physical page in the data file) and physical (or internal) fragmentation (where space is being wasted on index pages). These can both affect query performance (topic for a later post), as well as the expense of having to do the page split in the first place.

It's fairly well known that GUIDs can cause fragmentation in the index where the GUID is the key (e.g. a clustered index), but not about the knock-on effects in non-clustered indexes. Here's an example - I'll create two clustered indexes with GUID keys (one generated from NEWID and one from NEWSEQUENTIALID), plus a non-clustered index on each. Let's see what happens when we insert 100000 rows:

-- Create a table with a GUID key
CREATE TABLE BadKeyTable (
   
c1 UNIQUEIDENTIFIER DEFAULT NEWID () ROWGUIDCOL,
   
c2 DATETIME DEFAULT GETDATE (),
   
c3 CHAR (400) DEFAULT 'a');
CREATE CLUSTERED INDEX BadKeyTable_CL ON BadKeyTable (c1);
CREATE NONCLUSTERED INDEX BadKeyTable_NCL ON BadKeyTable (c2);
GO

-- Create another one, but using NEWSEQUENTIALID instead
CREATE TABLE BadKeyTable2 (
   
c1 UNIQUEIDENTIFIER DEFAULT NEWSEQUENTIALID () ROWGUIDCOL,
   
c2 DATETIME DEFAULT GETDATE (),
   
c3 CHAR (400) DEFAULT 'a');
CREATE CLUSTERED INDEX BadKeyTable2_CL ON BadKeyTable2 (c1);
CREATE NONCLUSTERED INDEX BadKeyTable2_NCL ON BadKeyTable2 (c2);
GO

DECLARE @a INT;
SELECT @a = 1;
WHILE (@a < 10000)
BEGIN
   
INSERT INTO BadKeyTable DEFAULT VALUES;
   
INSERT INTO BadKeyTable2 DEFAULT VALUES;
   
SELECT @a = @a + 1;
END;
GO

-- And now check for fragmentation
SELECT
   
OBJECT_NAME (ips.[object_id]) AS 'Object Name',
   
si.name AS 'Index Name',
   ROUND (ips.avg_fragmentation_in_percent, 2) AS 'Fragmentation',
   
ips.page_count AS 'Pages',
   
ROUND (ips.avg_page_space_used_in_percent, 2) AS 'Page Density'
FROM sys.dm_db_index_physical_stats (DB_ID ('DBMaint2008'), NULL, NULL, NULL, 'DETAILED') ips
CROSS APPLY sys.indexes si
WHERE
   
si.object_id = ips.object_id
   
AND si.index_id = ips.index_id
   AND ips.index_level = 0;
GO

Object Name    Index Name        Fragmentation  Pages  Page Density
-------------  ----------------  -------------  -----  ------------
BadKeyTable    BadKeyTable_CL    99.13          8092   66.08
BadKeyTable    BadKeyTable_NCL   30.97          78     64.1
BadKeyTable2   BadKeyTable2_CL   0.83           5556   96.26
BadKeyTable2   BadKeyTable2_NCL  1.88           372    99.61

The BadKeyTable_CL clustered index with the GUID key generated from NEWID is almost perfectly fragmented, with 34% space being wasted on each page. Conversely, the BadKeyTable2_CL clustered index with the GUID key generated from NEWSEQUENTIALID is hardly fragmented and only 4% of free space is wasted (and this is just because of the row size chosen). These numbers are entirely expected given the nature of the cluster keys.

Now look at the non-clustered indexes. BadKeyTable_NCL is 31% fragmented with 36% space wasted on each page! BadKeyTable2_NCL is harldy fragmented with no free space wasted on each page. So what's going on? The non-clustered index key in both cases is a datetime column, which has a minimum granularity of 3 milliseconds. The code above runs in a tight loop inserting records and so can insert more than one record per 3ms time interval - and I allow this because I didn't create make the non-clustered indexes unique. For all the records inserted in one 3ms time interval, there *has* to be something that makes the non-clustered index key unique internally (as even though an index can be defined as non-unique, the Storage Engine requires that each record really is unique in an index and will add whatever it needs to so that happens).

In this case, the clustered index key (which must be present in the non-clustered index anyway) is used to differentiate between all the non-clustered index records with the same datetime value. For BadKeyTable_NCL, the cluster key is a random GUID, so the non-clustered index record insertion points ALSO become random within each 3ms time interval - leading to the fragmentation above. The BadKeyTable2_NCL non-clustered index has the same time interval issue, but it's cluster key is a sequential GUID, so the non-clustered index doesn't get fragmented. If the non-clustered index key was a time datatype with a larger minimum granularity (like smalldatetime, or the new date), the fragmentation of BadKeyTable_NCL would be even worse - try it for yourself and you'll see.

So the answer to the question in the blog post title is really - it depends! Under the right conditions, a GUID cluster key can also seriously fragment a non-clustered index as well.

Fresh off a week of teaching classes on the Microsoft campus, we've finalized some user group dates. Here's what we have coming up:

  • Monday August 18th: user group meeting in Redmond
  • Monday September 1st to 3rd: public class in England
    • In conjunction with our UK partners, SQL Know How at Hatfield, England
    • Topic: Best Practices in Performance and Availability in SQL Server 2005/2008 
  • Thursday September 4th: user group meeting in Ireland
  • Friday September 5th: SQL Server launch event in Ireland
    • In conjunction with Microsoft Ireland at the Dublin Microsoft office
    • Topic: SQL Server 2008 Overview for DBAs
  • Monday September 8th to 9th: public class in Scotland
    • In conjunction with our UK partners, SQL Know How at Edinburgh, Scotland
    • Topic: Indexing for Performance in SQL Server 2000/2005/2008

It's going to be a busy few weeks - hope to see you at one of these events!

Kimberly and I were presenting at our local (Redmond) .Net Developers Association on Monday and the following question came up while Kimberly was talking about missing and extra indexes (paraphrasing):

What's the best non-clustered index to use for the query with a predicate WHERE lastname = 'Randal' AND firstname = 'Paul' AND middleinitial = 'S'?

Kimberly said that the order of the keys (e.g. lastname, firstname, middleinitial; or middleinitial, lastname, firstname; etc) doesn't matter for this case. I thought about it for a second and then argued, saying that the most selective column should come first. We agreed to discuss with the group at the end, but I thought about it some more and realized (and admitted to the group) that she's right - I should know better than to question Kimberly's knowledge of indexing... :-)

She's right because for a pure equality query using AND for multiple predicates, the Storage Engine will seek straight to the first exactly matching record in the index (and then scan for more matches if it's a non-unique index). It doesn't matter what order the index keys are defined because the Storage Engine is looking for an exact match.

When I started arguing, I was thinking about a phone book, which is ordered by lastname, firstname, middleinitial. You may think that a phone book is ordered that way because lastname is the most selective. Wrong. It's because the lastname is what most people know - it just happens to be the most selective of the three choices. Most SQL geeks should be able to find Kimberly in a phone book by looking for Tripp, Kimberly. But what if it was ordered by middleinital? I'd have no problem finding Kimberly, but how many of you would remember that her middleinitial is L? Probably a few as we both use our middle initials in our public names. What about if it was ordered by middleNAME? Again, no problem for me but who how many other people know her middle name is Lynn?

Then I started thinking about other queries and how they would play into the index choice to answer to the question above. If I also wanted to support a query with the predicate WHERE lastname = 'Randal', then having the left-most index key be anything other than lastname won't work so well. If the key order was firstname, middleinitial, lastname then all the distinct lastname values would be spread through the index rather than being together. The index might still be used to satisfy the query if it's the lowest cost index to use. However, having lastname be the leading key probably wouldn't work very well for a query with a predicate of WHERE firstname = 'Paul' - that argues for having firstname be the left-most index key.

Which should I choose? I probably I can't have both in the same index, so maybe I'd have TWO non-clustered indexes, to support both queries. The answer depends on how often the various queries are used and the trade-off between how much of a performance gain the non-clustered index would provide against the performance drop of having to maintain it during DML operations.

I hear time and again about people adding a non-clustered index for every column in the table, thinking that this will help - and my thinking is that this is wrong because these indexes can only satisfy a query where the only predicate is the column being indexed. I ran this argument past Kimberly and she added that these indexes could also be used if the column is chosen as the most selective in a multi-predicate query, and no other index has a lower cost than that one (a slim chance usually). Even what I though of as a simple case has caveats!

So what's the point of this post? Well, I wanted to show how indexing for one very simple query is pretty straightforward, but as soon as the number of different queries grows, and the query predicates get more complicated, indexing becomes more complex. You really have to know your workload and your data to know which columns are used, in what combinations, and how often - and then it helps to know how indexes are costed and used so that you can make intelligent choices about which indexes to define.

This thought-exercise has really shown me that I didn't know how much I don't know about indexes - I know precisely how they work at the Storage Engine level but not too much about how they're used by the Query Processor. I have new-found respect for Kimberly's indexing expertise. Luckily she's teaching a class at Microsoft called Indexing For Performance next week - I think I'll attend :-)

During the various courses I've been teaching, people are interested in how FILESTREAM performance compares with storing BLOBs in the database itself. I have some performance graphs based on measurements the dev team made - these have been presented publicly by myself and the dev team so I can share them with you here.

There are three graphs below, showing the relative performance for read, insert, and update of:

  • BLOB data stored in FILESTREAM format and accessed through the WIN32 streaming APIs. The times include getting a transaction context from SQL Server, getting the file path, doing the operation, closing the file, and committing the transaction in SQL Server.
  • BLOB data stored in FILESTREAM format and manipulated through T-SQL
  • BLOB data stored in varbinary(max) format (and obviously manipulated through T-SQL)

The data is the same in each test. The tests were performed on a 4-way box with a cold buffer pool. (Note that if this was for a warm-buffer pool, as the graph in the FILESTREAM whitepaper is for, the varbinary(max) and FILESTREAM T-SQL numbers would essentially swap.) One interesting point to note is that for smaller data sizes, it's faster to manipulate them through T-SQL than through the file system - this is expected based on research Jim Gray did when putting together the original TerraServer.

Almost a year ago to the day I asked a question on my old blog - how long does *your* CHECKDB take to run? Well, I had a bunch of replies but I never got around to posting the results. I was reminded of this recently so threw together some graphs of the results. I've plotted database size against run-time of DBCC CHECKDB, for a number of different numbers of CPUs, and mostly on SQL Server 2000. I've also added a linear trend line too.

Now - bear in mind this doesn't take into account the multitude of other factors that can affect how long DBCC CHECKDB takes to run (see my previous post here for the list).

Hopefully these are interesting to some of you!

A bit more traffic on the thread (see previous post here) prompted me to give my thoughts on the many sweeping generalizations that plague the computer industry and make it difficult sometimes to give advice in forums and blogs. I'd like to repost here (with a few tweaks for clarity).

Some examples of questions that breed sweeping generalizations:

  • Should you have clustered indexes on all tables? The well-known clustered-index debate as Kimberly likes to call it.
  • Should you rebuild or reorganize indexes to remove fragmentation?
  • Which high-availabilty solution should you use?

The problem - as with most advice - is that it's extremely hard to make generalizations. This is both because:

  1. without lots of evidence many people (quite rightly) don't believe sweeping generalizations as they may have been bitten by one in the past
  2. nearly every situation is different so many generalizations are useless

What I'd love to see, (and I tried to do this when at MS, and like to think I do it here or when teaching classes or conferences) is for people to provide the justification for generalizations, plus some idea of the exceptions and the circumstances under which they arise.

As for this case (whether to create multiple files because there are multiple cores/CPUs), I think we've about done this one to death. The sweeping generalizations here are:

  1. for non-tempdb you usually don't need multiple files, unless you have a very high-end workload of the specific nature I described in my first post (rare)
  2. for tempdb you usually do, as long as your workload merits it on a multi-core/cpu box
  3. IO vendors may recommend it for increased IO throughput *on their specific hardware*
  4. there exist sweeping generalizations from various sources that dispute all of the above

Unfortunately, you're not going to get a definitive, authoritative answer to a design/strategy question such as this and you'll continue to find contradictions to anything anyone says on the forums, and even MS contradicting itself (sigh).

What I would suggest is the following:
1) go with the majority opinion of responses to questions asked, based on the respondents collective experience with many customers, databases, and workloads
2) do your own testing, on your own hardware, with your own workload and see what works for you (but beware that testing in a vacuum can prove or disprove anything you want - which is why you see so many contradictory statements)

One last thing on MS - it's a very big company, with lots of groups. Anyone can sponsor a whitepaper, write a blog post/MSDN article/technet article and publish it, or reply on a forum as a visible MS person and it has the 'official stamp' of coming from MS. When I was in the product group I was continually dismayed by the misinformation, bad advice, contradictions, and baseless assertions that I saw coming from MS employees who were just trying to be helpful.

Once something's published on the internet, it's *incredibly* hard to undo the damage done. There's a fundamental element of mistrust sometimes on forums and newsgroups which can be wearying when you're trying to help people out. It can be very hard to convince people that someone else's advice isn't the best to follow - I remember several times arguing with people about how CHECKDB works or what a corruption error message means and finally having to resort to 'I wrote that code - I'm afraid you *are* wrong' - which I really hate doing.

Anyway - rant over :-)

There's been a very interesting discussion going on over at SQLServerCentral.com about whether to create multiple files for a user database because the server has multiple CPUs - see the thread here. I wrote a couple of long responses during the thread that I wanted to duplicate here as I think it's of broad interest.

My first response was:

Doesn't make sense to split any database into multiple files for per-cpu performance, except tempdb which can suffer from multiple-CPUs all trying to modify the same allocation bitmaps under high load with small temp tables being created/deleted (see http://www.microsoft.com/technet/prodtechnol/sql/2005/workingwithtempdb.mspx for details).

Now, saying that, there's an exception - and that's when a non-tempdb database has the same allocation bitmap contention issues, but that only happens on the very high-end with thousands of single-row inserts per second happening on each CPU. It's pretty rare. I've never seen it but Kimberly has.

What you should be considering is splitting a database into filegroups to be able to isolate critical tables within separate filegroups, or splitting a partitioned table across multiple filegroups. For SS2005, these aren't for performance but instead for manageability (e.g. being able to restore a single filegroup offline while the rest of the database remains online, or setting several partitions of a table to be read-only). For SS2008, there are improvements to query plan generation over multiple partitions that could allow for better parallelism, but its still not optimal compared to non-partitioned data.

So - complex topic and these are simple guidelines. Hope they help.

This was followed by a discussion pointing out that various hardware vendors imply the opposite, as do several MS sources. So here was my latest response:

Hi folks,

What's really not made clear in the MS literature is that the multiple-files in a database really only applies to tempdb. Even on tempdb, on 2005 you don't need one file per core - more like 1/4 -1/2 the number of files as there are cores.

The tempdb problem is this - common workloads create and drop many worktables per second. The allocations that happen for a table are initially single-pages (as opposed to whole extents). This means that a search of the SGAM page for the 4GB interval needs to be done to find a mixed-extent with a free page to allocate. Multiple CPUs all hammering this page cause contention on it and performance problems. Then, a page needs to be allocated for the first IAM page - same thing happens. Then these pages need to be marked allocated in a PFS page - same thing happens. And then these pages need to inserted into the sysindexes row for the tabel - more contention. On 2000 this was particularly bad - so T1118 plus multiple files was the solution, where SQL Server would round-robin the single page allocations in the files in tempdb, alleviating the contention somewhat.

In SQL Server 2005, we changed the temp table mechanism so that whenever a temp table is dropped, one data page, one IAM page, and the system table entries (no longer sysindexes, but instead is a 'hidden' table called sys.allocation_units) are cached. When a new temp table is allocated, if there's a cached 'template temp table' it is picked up and used without so much contention on the various allocation bitmaps. On a heavily loaded system there can still be contention and so you still need multiple files for an SMP box, but just not so many. And you don't need T1118 any more.

So - this is more prevalent on tempdb, but CAN happen on a user database under extreme load on monster hardware. Testing should show whether this is happening to you - if not, don't create multiple files for performance.

Now, in terms of what works for your particular vendors IO scalability - that's beyond me and you may need to think about that if its something they recommend. However - I'd still take it with a pinch of salt and do your own testing. See the whitepaper at http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/pdpliobp.mspx for some testing/banchmarking info.

Hope this helps clarify a bit more - great discussion!

I'm interested to hear any other opinions on this - especially cases where you've found it necessary to create multiple files for performance.

Thanks!

There's a new whitepaper on TechNet that I've just come across (even though it was published 6 months ago!) called Predeployment I/O Best Practices. It's really good - discussing the following:

  • Guidelines for determining I/O capacity
  • Disk configuration best practices and common pitfalls
  • Using SQLIO to determine capacity and interpreting its results
  • Using System Monitor to monitor an IO subsystem

You can read it here and I've added a link to our whitepapers page.

While I'm on the subject of I/O, Bob Dorr (A Senior Escalation Engineer in PSS) published a blog post last year that debunks a couple of urban legends around SQL Server's IO, specifically:

  • The myth that SQL Server used one thread per data file
  • The myth that a disk queue length greater than 2 indicates an I/O bottleneck

Check out his blog post and the subsequent discussion here.

Categories:
IO Subsystems | Performance | Tools

Quickie today as I'm preparing to speak at the Pacific Northwest SQL Server User Group meeting tonight on the MS Campus here in Redmond (my trademark Detection and Recovery from Database Corruptions talk).

SQL Server 2005 introduced the concept of hot-add memory, to allow for dynamic workload handling. SQL Server 2008 increases these capabilities by adding hot-add CPU as well. 'Hot-add' means being able to plug in a CPU while the machine is running and then reconfigure SQL Server to make use of the CPU ONLINE! (i.e. no application downtime required at all)

There are a few restrictions:

  • You need a 64-bit system that support hot-add CPU (obviously :-))
  • You need Enterprise Edition of SQL Server 2008
  • You need Windows Server Datacenter or Enterprise Edition

When you plug in the new CPU, SQL Server won't automatically start using it. If you think about it, it can't - you may not want that CPU to be used by SQL Server - so it has to be told that it can use it. You do that by setting the appropriate affinity masks and then running a RECONFIGURE.

Ok - now we come to the bit that needs the "(and affinity masks)" in the title. What's an affinity mask? In a nutshell, it's a bitmap of all the CPUs on the machine that specifies which ones are available for general SQL Server Engine use, and which ones are available only for SQL Server I/O. There are two kinds of masks, a regular affinity mask, and an affinity I/O mask. The regular affinity mask, if all zeroes, says that Windows decides who get's what CPU when. If the affinity mask is non-zero, then there's a bit per CPU. If it's set to 1 then SQL Server can use the CPU. If a bit is set to 1 in the affinity I/O mask, then the CPU can only be used for I/O. A bit cannot be set in both masks. Now - an affinity mask is 32-bits wide, so if you have more than 32 CPUs, you need to use two more affinity masks, called affinity64 and affinity64 I/O. These do the same thing but for CPUs 32-63 on the machine.

So, very cool, especially for those of you that can afford such hardware - I can't so I don't have a box to test it on (the 64-bit server we have here at SQL skills doesn't support it).

Back in September last year I blogged about the native Backup Compression that is in SQL Server 2008 (see here) and promised to blog more when CTP-5 came out and I ran some tests. Well, it's here and I have so here's some data for you.

I expanded the AdventureWorks database to be 322Mb (random size, but big enough to get a decent sized run-time on my server). I used System Monitor to capture %user-mode CPU time, plus backup/restore throughput for a compressed and uncompressed backup operation, and then restores.

1) For the uncompressed backup the average CPU was 5% (the green line at the bottom), the run-time was 39.5s, and, of course, it took 322Mb to store the backup.

2) For the compressed backup the average CPU was way higher at 25%, BUT the run-time was 21.6s (a 45% improvement), and the backup was stored in 76.7MB (a 4.2x compression ratio). Very cool.

3) For the restore of the uncompressed backup the average CPU was 8%, and the run-time was 71.0s.

4) For the restore of the compressed backup the average CPU was 14.5%, and the run-time was 36s (a 50% improvement).

So - to summarize, turning on compression means more CPU and smaller run-times - just what was expected. Note that if you try this on your database you will see different results - the compression ratio and CPU usage is entirely dependent on the data being compressed.

While we were in Barcelona we sat down with Richard Campbell and Greg Hughes from RunAs Radio to record a 1/2 hour interview on SQL Server 2008. We touch on a ton of different features (look at the number of Categories I've tagged this with!) and have a bunch of laughs along the way - check it out here.

PS There's been a ton of interest in the slide deck idea I had so we'll be going ahead with that. Look for an announcement sometime in the first few months of next year about how to get them. Thanks to everyone that replied!

This is a subject I posted about last year on my old blog but it came up at SQL Connections last week several times so I want to repost it for those who’ve just started following my blog.

There's only one time when you should be trying to work out how long a CHECKDB is going to take - when you're planning your regular database maintenance. If you're faced with a corrupt (or suspected corrupt) database and you're only just starting to think about how long a CHECKDB is going to take - you've made a mistake while planning your disaster recovery strategy. You always need to know how long CHECKDB takes (on average) to run for your database so:

  • You can tell whether a particular run of CHECKDB is taking longer than usual - a sign that it's found some corruption
  • You know how long it will take to get results in a disaster recovery situation

At every conference I go to, someone asks me how long CHECKDB will take to run on their database. There are several ways I could answer this:

  • The unhelpful answer - I've got no idea.
  • The almost-helpful answer - how long did it take to run last time and are the conditions exactly the same?
  • The answer I usually give - it depends.

Now, many people would see the third answer as being somewhat equivalent to the first answer - unhelpful. The problem is that there are many factors which influence how long CHECKDB will take to run. Let me explain the ten most important factors so you get an idea why this is actually a helpful answer. These aren't in any particular order of importance.

1) The size of the database

Pretty obvious... CHECKDB has to read every allocated page in the database so the bigger it is, the longer it will take to read all the pages.

2) Concurrent IO load on the server

At the simplest level, what is CHECKDB going to do? It reads every allocated page in the database. That's a lot of IO. CHECKDB takes great pains to do the most efficient IO it can and read the database pages in their physical order with plenty of readahead so that the disk heads move smoothly across the disks (rather than jumping around randomly and incurring disk head seek delays). If there's no concurrent IO load on the server, then the IOs will be as efficient as CHECKDB can make them. However, introducing any additional IO from SQL Server means that the disk heads will be jumping around - slowing down the CHECKDB IOs. If the IO subsystem is at capacity already from CHECKDB's IO demands, any additional IO is going to reduce the IO bandwidth available to CHECKDB - slowing it down.

3) Concurrent CPU activity on the server

At the next level of simplicity, CHECKDB is going to process every page it reads in some way. Depending on the various options you've specified and the database schema (details below), that's going to use a lot of CPU - it's possible that the server may be pegged at 100% CPU when CHECKDB is running. If there's any additional workload on the server, that's going to take CPU cycles away from CHECKDB and it going to slow it down.

Basically what points #2 and #3 are saying is that CHECKDB is very resource intensive! It’s probably one of the most resource intensive things you can ask SQL Server to do and so it's usually a good idea to not run it during peak workload times, as you'll not only cause CHECKDB to take longer to run, you will slowdown the concurrent workload, possibly unacceptably.

4) Concurrent update activity on the database

This is relevant for both SQL 2000 and SQL 2005, but for different reasons.

In SQL 2000, CHECKDB gets its consistent view of the database from transaction log analysis of concurrent DML transactions (see here for details). The more concurrent DML there is while CHECKDB is running, the more transaction log will be generated - and so the longer it will take for CHECKDB to analyze that transaction log. It's possible that on a large multi-CPU box with a ton of concurrent DML and CHECKDB limited to a single CPU that this phase of CHECKDB could take several times longer than the reading and processing of the database pages! (I've seen this in real-life several times.)

In SQL 2005, CHECKDB gets its consistent view of the database from a database snapshot, which is stored on the same disk volumes as the database itself. If there are a lot of changes in the database while CHECKDB is running, the changed pages are pushed to the snapshot so that it remains consistent. As the snapshot files are stored in the same location as the database files, every time a page is pushed to the snapshot, the disk head has to move, which interrupts the efficient IO described in #2. Also, whenever CHECKDB goes to read a page and it needs to read the page from the snapshot files instead of the database files, that's another disk head move, and another efficient IO interruption. The more concurrent changes to the database, the more interruptions to efficient IO and the slower that CHECKDB runs.

5) Throughput capabilities of the IO subsystem

This one's simple. CHECKDB is going to do a boat-load of IOs and it could even end up being IO-bound (meaning that the CPUs are idle periodically waiting for IOs to complete) depending on the options specified and the database schema. This means that the throughput of the IO subsystem is going to have a direct effect on the run-time of CHECKDB. so, if you have a 1TB database and the IO subsystem can only manage 100MB/sec, it's going to take almost 3 hours just to read the database (1TB / 100MB / 3600 secs) and there's nothing you can do to speed that up except upgrade the IO subsystem.

I've lost count of the number of times I've heard customers complain that CHECKDB (or index rebuilds or other IO-heavy operations) are running sloooowly only to find that the disk queue lengths are enormous and the IO subsystem it entirely unmatched to the server and workload.

6) The number of CPUs (processing cores) on the box

This also really encompasses the Edition of SQL Server that's being run. In Enterprise Edition, CHECKDB can run in parallel across all the CPUs in the box (or as many as the query processor decides to parallelize over when the CHECKDB internal queries are compiled). Running in parallel can give a significant performance boost to CHECKDB and lower run times, as long as the database is also spread over multiple files too (so the IOs can be parallelized). There's a nifty algorithm that’s used that allows CHECKDB to run in parallel which I'll explain in detail in a future post.

On the other hand, the fact that CHECKDB can run in parallel in Enterprise Edition can be bad for some scenarios, and so some DBAs chose to force CHECKDB to be single-threaded. SAP typically recommends this to help with user query predictability. The way to do this is to turn on the documented trace flag 2528.

7) The speed of the disks where tempdb is placed

Running CHECKDB against a VLDB uses lots of memory for internal state and for VLDBs the memory requirement usually exceeds the amount of memory available to SQL Server. In this case, the state is spooled out to tempdb and so the performance of tempdb can be a critical factor in CHECKDB performance. See this post for more details of this and how CHECKDB can run out of disk space if tempdb is too small.

8) The complexity of the database schema

This can have a really high impact on the run-time of CHECKDB because it impacts the amount of CPU that CHECKDB requires. For example, the most expensive checks that CHECKDB does are for non-clustered indexes. It needs to check that each row in a non-clustered index maps to exactly one row in the heap or clustered index for the table, and that every heap/clustered index row has exactly one matching row in each non-clustered index. Although there's a highly efficient algorithm for doing this, it still takes around 30% of the total CPU that CHECKDB uses!

There are a bunch of other checks that are only done if the features have been used in the database - e.g. computed column evaluation, links between off-row LOB values, Service Broker, XML indexes, indexed views - so you can see that empirical factors along aren't enough to determine the run-time.

9) Which options are specified

This is almost the same as #7 in that by specifying various options you're limiting what checks CHECKDB actually performs. For instance, using the WITH NOINDEX option will turn off the non-clustered index checks that I described in #7 and using the WITH PHYSICAL_ONLY option will turn off all logical checks, vastly decreasing the run-time of CHECKDB and making it nearly always IO-bound rather than CPU-bound (in fact this is the most common option that DBAs of VLDBs use to make the run-time of CHECKDB manageable).

One thing to be aware of - if you specify any repair options, CHECKDB always runs single-threaded, even on a multi-proc box on Enterprise Edition.

10) The number and type of corruptions that exist in the database

Again, this is similar to #7 and #8. If there are any corruptions present, there may be extra checks triggered to try to figure out more details of the corruptions. For instance, for the non-clustered index checks, the algorithm is tuned very heavily for the case when there are no corruptions present (the overwhelming majority of cases considering the millions of times CHECKDB is run every day around the world). When a non-clustered index corruption is detected, a more in-depth algorithm has to be used to figure out exactly where the corruption is, which involves re-scanning a bunch of data and so taking a bunch more time. There are a few other algorithms like this too.

Summary

So you can see that there's no simple answer.

I’m in the middle of a flight from Washington D.C. to Zurich on the way to Barcelona for TechEd IT Forum and I can’t sleep – Kimberly’s out like a light so what else is there to do except write another blog post? :-) OK - actually posting this from Barcelona on Tuesday before our first of 12 sessions here!

In the Database Maintenance workshop we did at SQL Connections last week I promised to blog a script I used to show how data file shrink operations cause massive fragmentation of indexes. The reason is that data file shrink starts at the end of the data file and moves a single page at a time to a free space below the shrink threshold. In the process of doing so, it perfectly reverses the physical order of the pages comprising the leaf level of an index – thus perfectly fragmenting it!!

Let’s try out my simple script that demonstrates this. First thing I’m going to do is create a new database and create a 10MB ‘filler’ table, which I’m going to then drop later to create a space that shrink can use.

USE MASTER;

GO

 

IF DATABASEPROPERTYEX ('shrinktest', 'Version') > 0

      DROP DATABASE shrinktest;

 

CREATE DATABASE shrinktest;

GO

USE shrinktest;

GO

 

SET NOCOUNT ON;

GO

 

-- Create and fill the filler table

CREATE TABLE filler (c1 INT IDENTITY, c2 VARCHAR(8000))

GO

DECLARE @a INT;

SELECT @a = 1;

WHILE (@a < 1280) -- insert 10MB

BEGIN

      INSERT INTO filler VALUES (REPLICATE ('a', 5000));

      SELECT @a = @a + 1;

END;

GO

Next I’ll create the ‘production’ table that we’d really like to keep in optimal shape for performance.

-- Create and fill the production table

CREATE TABLE production (c1 INT IDENTITY, c2 VARCHAR (8000));

CREATE CLUSTERED INDEX prod_cl ON production (c1);

GO

DECLARE @a INT;

SELECT @a = 1;

WHILE (@a < 1280) -- insert 10MB

BEGIN

      INSERT INTO production VALUES (REPLICATE ('a', 5000));

      SELECT @a = @a + 1;

END;

GO 

Now I’ll use the sys.dm_db_index_physical_stats DMV to check the fragmentation of the production table’s clustered index – it should be almost zero:

-- check the fragmentation of the production table

SELECT avg_fragmentation_in_percent, fragment_count FROM sys.dm_db_index_physical_stats (

      DB_ID ('shrinktest'), OBJECT_ID ('production'), 1, NULL, 'LIMITED');

GO

avg_fragmentation_in_percent fragment_count

---------------------------- --------------------

0.390930414386239            6

This is what I expected. Now I’m going to drop the filler table, run a shrink operation and then check the fragmentation again:

-- drop the filler table and shrink the database

DROP TABLE filler;

GO

 

-- shrink the database

DBCC SHRINKDATABASE (shrinktest);

GO

 

-- check the index fragmentation again

SELECT avg_fragmentation_in_percent, fragment_count FROM sys.dm_db_index_physical_stats (

      DB_ID ('shrinktest'), OBJECT_ID ('production'), 1, NULL, 'LIMITED');

GO

avg_fragmentation_in_percent fragment_count

---------------------------- --------------------

99.7654417513683             1277

Wow! The index went from almost 0% fragmented to almost 100% fragmented – the shrink operation totally reversed the physical ordering of the leaf level of the clustered index – nasty.

One of the common maintenance operations I see at customer sites is to run a database shrink at some interval, and I always advise against it – now you can see why. Running a regular shrink operation can cause horrible fragmentation problems. The worst problems I see are those customers with maintenance plans that rebuild all indexes and then run a shrink to remove the extra space necessary for the index rebuilds – completely undoing the effects of the index rebuild!

One other common thing I see is to have auto-shrink set on for one or databases. This is bad for several reasons:

  • Shrink causes index fragmentation, as I’ve just demonstrated above.
  • You can't control when it kicks in. Although it doesn't have any effect like long-term blocking, it does take up a lot of resources, both IO and CPU. It also moves a lot of data through the buffer pool and so can cause hot pages to be pushed out to disk, slowing things down further. If the server is already pushing the limits of the IO subsystem, running shrink may push it over, causing long disk queue lengths and possibly IO timeouts.
  • You're likely to get into a death-spiral of auto-grow then auto-shrink then auto-grow then auto-shrink... (in my experience, if someone is using auto-shrink, they're most likely using and relying on auto-grow too). An active database usually requires free space for normal operations - so if you take that free space away then the database just has to grow again. This is bad for several reasons:
    • Repeatedly shrinking and growing the data files will cause file-system level fragmentation, which can slow down performance
    • It wastes a huge amount of resources, basically running the shrink algorithm for no reason
    • Auto-grow itself can be bad, especially if you're using SQL Server 2000 (or don't have Instant File Initialization turned on - see this post from Kimberly's blog) where all allocations to the file being grown are blocked while the new portion of the file is being zero-initialized.

Bottom-line: auto-shrink should *NEVER* be turned on…

This is a question I was asked multiple times over the last week: the Microsoft guidelines for database mirroring say not to mirror more than 10 databases per instance - why is that and is it true?

The answer is my favorite 'it depends!'. The number 10 is a rough guess at the sweet-spot for the majority of customers based on hitting a thread limit on 32-bit machines. The factors that need to be considered are:

  • How much memory do the principal and mirror instances have? (hopefully the same)
  • How much processing-power do the principal and mirror instances have? (hopefully the same)
  • How much bandwidth does the IO subsystem have on the mirror instance? (hopefully the same as on the principal)
  • How much transaction log does the workload on each database generate?
  • How much network bandwidth is available between the principal and the mirror instances?

The last two factors are the most critical. If the network bandwidth available between the two instances is not enough to handle the combined transaction log generation rate per second from all databases being mirrored then performance will drop on the principal databases. SQL Server 2008 does alleviate some of this with log stream compression - see here for details. The next most critical thing to consider is the memory and thread requirements for mirroring - each mirrored database takes one thread plus some memory, so on low-powered servers, lots of mirrored databases may be too much load on the server when combined with the regular workload.

Here are some examples that I've seen:

  • A customer with 25 databases, all of which have very small amounts of activity, and not all at the same time, has them all mirrored with no problem.
  • A customer with only 3 heavily-loaded databases, but without a great network connection, that can barely mirror one of the databases without the lack of network bandwidth causing workload degradation.

The key to success here is to do the log generation calculation and then if it seems that the available network bandwidth will support the number of databases you want to mirror, test it first before relying on it in production.

I guess the bottom-line here is that any broad guidance is only that - your mileage may (and probably will) vary. Always do your own calculations and testing.

[Edit 10/15/2009: Checkout the new KB article I helped write that discusses this in detail: http://support.microsoft.com/kb/2001270]

In one of the sessions Kimberly and I taught this week at SQL Connections, we discussed how to choose efficient data-types - I'd like to share the discussion here with an example.

The crux of the matter is in the amount of space required to store the data-types in the schema, and ensuring that the data-type choice matches the data being stored without incurring any unnecessary overhead. Here are a couple of examples:

  • A field that stores a person's age will usually be able to use a TINYINT type that can store a maximum value of 256 - unless its an archeological or historical database, for instance. Without putting a lot of thought into it, however, someone may choose to use an INTEGER type. What's the difference? A TINYINT take a single byte to store, whereas an INTEGER takes *4* bytes to store - wasting 3 bytes per record.
  • A field that stores a person's city of residence in the USA needs to be able to hold 24 characters (see my previous post on how long fields have to be) - so what data-type should you use? A CHAR (24) will be able to store all the possible values, but will *always* take up 24 bytes in the record as it's a fixed-length column. A VARCHAR (24), on the other hand will only store the number of bytes equal to the number of characters in the city name, so using the fixed-length type will waste a varying number of bytes per record.
  • In the USA, the names of the various States are commonly abbeviated to two characters - for instance, Washington = WA and California = CA. So is the best type to use CHAR (2) or VARCHAR (2)? Given that the abbreviations are always 2 characters, they'll both store 2 bytes all the time. BUT, a variable length column has a two-byte overhead (the two-byte pointer for the column that's stored in the variable-length column offset array in the record) - so in this case the best data-type to use is CHAR (2).

You can see how choosing data-types without considering whether it's the best type can lead to wasted space. Let's look at an example schema to support a population census of the USA.

CREATE TABLE CensusSchema1 (

SSN CHAR (256),

StateName CHAR (256),

Sex INTEGER,

Employed INTEGER,

SelfEmployed INTEGER,

EthnicOrigin INTEGER,

MaritalStatus INTEGER,

NumberOfDependents INTEGER,

Age INTEGER,

CountryOfBirth INTEGER,

HouseholdIncomeGroup INTEGER,

ZipCode5 INTEGER);

GO

At first glance this may look reasonable, but digging in you'll see that many of the columns are over-sized. Here's a cleaned-up schema to compare against, with notes on the per-column savings:

CREATE TABLE CensusSchema2 (

SSN CHAR (9), -- saving 244 bytes

StateName VARCHAR (256), -- saving at least 240 bytes (longest state name is 14 + 2-byte varchar overhead)

Sex BIT,

Employed BIT,

SelfEmployed BIT, -- saving 11 bytes altogether over these three fields

EthnicOrigin TINYINT, -- saving 3 bytes

MaritalStatus TINYINT, -- saving 3 bytes

NumberOfDependents TINYINT, -- saving 3 bytes

Age TINYINT, -- saving 3 bytes

CountryOfBirth TINYINT, -- saving 3 bytes

HouseholdIncomeGroup TINYINT, -- saving 3 bytes

ZipCode5 INTEGER); -- no saving

GO

The bad schema gives a per-row size of 574 bytes and the cleaned-up schema is no more than 48 bytes per-row. I designed these two schemas to support a US census. The population of the USA is approx. 300 million. This means the bad schema would take around 190GB to store all the info, and the cleaned-up schema only takes around 15GB - more than 12 times more efficient! Now we're starting to see how poor data-type choice can lead to poor performance.

Wider rows means:

  • Fewer rows can fit on an 8k page.
  • More CPU is necessary to crack open a record due to the number of CPU data cache line invalidations necessary to read the record into the CPU's various caches (every time a cache line is invalidated it takes a bunch of CPU cycles - see here for an explanation of CPUs and cache lines).

Less rows per page means:

  • More pages are needed to store the data
  • Indexes could have a smaller fan-out (if the index keys are wider than then need to be), leading to more levels in the index and less efficient index navigation from the index root page to the index leaf-level.

More pages means:

  • More IOs are necessary to read the same amount of actual data
  • More buffer pool memory is necessary to hold the same amount of actual data
  • More disk space is necessary to hold the same amount of actual data

Clearly there's a link between various aspects of workload performance and badly chosen data-types.

There's been some discussion over on the SQL Server Central forums about database mirroring performance, and one question in particular on how to do index rebuilds on mirrored VLDBs.

Remember that in database mirroring, the database has to be in the full recovery mode, so all index rebuilds are fully logged. It all comes down to the amount of transaction log generated and whether this causes a problem.

  • In synchronous mirroring, the additional log being generated could overload the network link between the principal and mirror. This will slow down the hardening of regular transactions in the mirror's transaction log, and thus led to a decrease in transaction throughput on the principal.
  • In asynchronous mirroring, the additional log being generated could again overload the network link - but this time there's no requirement for transactions to harden in the mirror before they can commit on the principal, so instead the SEND queue on the principal gets really large. Now, in asynchronous mirroring the SEND queue represents the amount of work that will be lost if a failover occurs, as its all the transaction log that hasn't yet been sent to the mirror. So, in asynchronous mode, a large index rebuild operation could lead to increase exposure to data loss in the event of a failover.

In SS2008, the log stream compression I blogged about here should go a long way to alleviating this problem. There are three ways I can think of to reduce the amount of transaction log generated by and index maintenance plan in SS2005:

  1. Use a potentially less expensive (in terms of logging) solution for removing index fragmentation. Doing an index rebuild is guaranteed to generate an equivalent amount of transaction log to the size of the index being rebuilt, no matter how much fragmentation there is - because an index rebuild always rebuilds the entire index. The alternative is to do an index reorganize, either using my old DBCC INDEXDEFRAG or the new ALTER INDEX ... REORGANIZE. These will only generate transaction log when index pages are compacted and reorganized - so for less heavily fragmented indexes. There's no hard and fast rule here but I generally say where Logical Scan Fragmentation/Average Fragmentation in Percent from DBCC SHOWCONTIG/sys.dm_db_index_physical_stats, respectively, is less than 30%. You also need to consider page density too - but really this is a topic for a whole other post.
  2. Be very selective on which indexes you choose to rebuild/reorganize. A lot of people have a maintenance plan that does this for every index every week, without checking whether the index is even fragmented or whether removing fragmentation for an index improves workload performance. Again, this a whole other topic but there is an old whitepaper for SS2000 I helped with that describes some of this - Microsoft SQL Server 2000 Index Defragmentation Best Practices.
  3. Partition the tables/indexes so that the changing portion of the data is the only portion that's affected by index maintenance. If most of your data is read-only, there's no point in having it included in reindex/reorganize operations, right? Kimberly recently wrote a blog post about such an architecture here.

For more info on database mirroring performance considerations, checkout the whitepaper Database Mirroring: Best Practices and Performance Considerations. There's also a slide deck presentation based on this whitepaper that's been presented at various TechEds (I did it in China and Hong Kong last year and Kimberly did it in South Africa last year) - you can download it from the Hong Kong website here.

The bottom line is when mirroring is in the mix, you need to be more intentional with your database maintenance.

As I mentioned in a previous post, one of the new features for database mirroring in SQL Server 2008 is log stream compression. There's a good explanation of what this is (along with some example workloads and performance graphs) on the SQL Customer Advisory Team blog so I'm not going to duplicate all that here.

Basically, the way mirroring works is by shipping all the transaction log from the principal to the mirror, and then replaying the log records on the mirror database. Network bandwidth can be a bottleneck for mirrored databases where the rate of transaction log generation is high enough to saturate the network. On synchronously mirrored databases (where a transaction cannot commit on the principal until all the transaction log for it has been hardened in the log of the mirror database), this can even lead to workload throughput degradation.

Log compression is a way to alleviate this problem. SS2008 Books Online states that compression rates of at least 12.5% are achieved - obviously the compression ratio could be much higher than that and is dependant on what's being compressed - i.e. the data that is being processed by the application. One of the tests they did for the post above shows a 5x compression ratio - that's pretty good.

The downside of this is that compression is not free - extra CPU must be used. The obvious CPU load increases comes from having to compress the transaction log on the principal and then decompress it again on the mirror before hardening it in the log. The non-obvious increase will come because the principal and mirror should end up being able to process more transactions/second, which means more CPU is needed.

This extra CPU could be a problem for some systems that are already pegged in terms of CPU load, and so they may actually see a drop in performance when log stream compression is enabled. Now, it's on by default when you upgrade to SS2008 but there is a way to turn it off. The blog post above divulges that there is a trace flag, 1462, that turns off log stream compression and effectively reverts the behavior back to SS2005.

Apart from the obvious benefits of this feature for systems that can afford the extra CPU, I'm excited because it may allow more mirrored databases to perform effective database maintenance - in the form of fragmentation control and index maintenance. When database mirroring is enabled, the database has to be running in the full recovery mode. This means that operations such as index rebuilds are fully logged - that's a lot of extra log being generated if you're used to switching to bulk-logged recovery mode to perform index operations. Of course, this isn't an issue for reorganizing an index (with ALTER INDEX ... REORGANIZE) as this has been fully logged (and has to be that way because of how it works) since I wrote the old DBCC INDEXDEFRAG for SS2000.

Anyway, for some customers the fact that these two operations are fully logged means that running them produces so much additional log that, in conjunction with the log from the workload itself, the network becomes a bottleneck for mirroring and it slows down both the maintenance and reduces the workload throughput. The addition of log stream compression means that for these systems, the log may be compressed enough so that the network is no longer a bottleneck and more regular fragmentation control can take place. This in turn will increase workload throughput as well - a double benefit!

Here's an interesting question I was sent by my friend Steve Jones over at SQL Server Central - will a single CPU with dual-cores perform better than two single-core CPUs? Both have two processing cores but the hardware architecture is different - which one will make SQL Server perform better? Well, there's no hard and fast answer - it depends! I had a discussion on this topic this morning with Jerome Halmans, part of my old team in the SQL Server Storage Engine and I'm basing this post on our discussion with his permission.

My hypothesis (which Jerome confirmed) was that the performance of the two architectures depends on the amount of cache line invalidations and how that is managed (see here for a description of CPU caches and cache lines).

  • On the single-core machine, cache line invalidations needs to go across the main bus between the two CPUs, involving bus arbitration delays.
  • On the dual-core machine, cache line invalidations don't go across the bus because the two cores are contained within the same CPU package. In fact, if the architecture is smart enough, it may be able to just remap the cache line from one processing core to the other, thus avoiding any data copying. I'm not sure if such an architecture exists though.

Here is a very interesting and accessible Intel article that discusses cache-sharing in multi-core Intel systems. In this paper at least, the L2 cache is shared but modifications made by different cores in their private L1 caches still need to bounce through the shared L2 cache before being loaded by the other core. This will still be WAY faster than having to go through main bus between single-core CPUs.

And here is a similar paper from AMD on their Barcelona multi-core architecture that describes each core having separate L1 and L2 caches, with an additional shared L3 cache. The seperate L2 caches are kind-of linked though, in that modifications to a cache line in one L2 cache are immediately mirrored in the other L2 caches (if needed).

But the amount of cache invalidations (of whatever kind) depends on the workload. The two types of workload to consider are:

  1. Where the workload has very independent characteristics, so the data being processed by a thread on one processing core is unrelated to that being processed by a thread on the other core. There should be very few cache line invalidations. In this case, the single-core CPUs will have all their local caches full of data relevant to just the thread running. The two cores on the dual-core CPU will need to share some level of on-chip cache and so their may well be more churn in the cache. In this case I'd expect the single-core CPUs to perform better.
  2. Where the workload is such that data is shared, and threads touch data being processed by others threads on other cores. In this case, the single-core CPUs will fall victim to massive amounts of cache line invalidations, whereas the dual-core CPUs will do on-chip cache line invalidation (of whatever type is supported by the architecture). In this case I'd expect the multi-core CPU to outperform the two single-core CPUs.

Saying that, the majority of workloads on SQL Server are of the second type above. Jerome mentioned that even synthetic workloads (such as the TPCC benchmark) are still going to result in multiple-threads accessing and changing the same data/index pages.

So - what's the conclusion? I expect that a multi-core CPU will outperform an equivalent number of single-core CPUs in most workloads. And as Jerome pointed out, even if that's not the case for your workload, you'll find it pretty hard to find a system that ships with single-core CPUs these days.

I'd love to hear any comments on this, especially any measurements you've done on workloads as I don't have any single-core machines available to run tests on - even the laptop I'm typing this on is a dual-core Centrino.

(This is an updated repost from earlier this year on my old blog)

Whenever I’m discussing index maintenance, and specifically fragmentation, I always make a point of saying ‘Make sure the index is being used before doing anything about fragmentation’. If an index isn’t being used very much, but has very low page density (lots of free space in the index pages), then it will be occupying a lot more disk space than it could do and it may be worth compacting (with a rebuild or a defrag) to get that disk space back. However, usually there’s not much point spending resources to remove any kind of fragmentation when an index isn’t being used. This is especially true of those people who rebuild all indexes every night or every week.

 

You could even go so far as to say if a non-clustered index isn’t being used, why is it there at all? Extra non-clustered indexes drag down performance in a number of ways. Consider a non-clustered index called IX_MyNCIndex on the table MyTable:

  • Any time a record is inserted into MyTable, a matching record is inserted into IX_MyNCIndex. This is a bunch of extra IOs, extra log records, plus maybe even a page-split.
  • Any time a record is deleted from MyTable, the matching record in IX_MyNCIndex must be deleted. Extra IOs again and log records again.
  • Any time a record in MyTable is updated:
    • If MyTable has a clustered index, and the clustered index key value changes, then the matching record in IX_MyNCIndex must be updated. Extra IOs and log records again.
    • If any of the non-clustered index key values changes, or any of the INCLUDEd column values changes, then the matching record in IX_MyNCIndex must be updated. Extra IOs and log records again.
  • If a clustered index is created on MyTable, then IX_MyNCIndex has to be rebuilt to include the logical RIDs rather than the physical heap RIDs (see this post for an explanation). Lot of extra IOs and log records again.

That’s a significant amount of extra IOs and log records to maintain each extraneous non-clustered index.

 

So, how can you tell if an index is being used?

  • In SQL Server 2000 there is no way to do it
  • In SQL Server 2005 there are a few different ways in SQL Server 2005 – the one I want to discuss in this post is the sys.dm_db_index_usage_stats DMV.

This DMV exposes the information that is tracked about index usage (as the name suggests). It does not generate any information itself; it just returns info from a cache inside SQL Server. This cache is empty when the server instance starts, and is not persisted across instance restarts. All cache entries for indexes in a database are removed when that database is closed. So, the cache tracks usage information about which indexes have been used since the database they are part of was last opened (either manually or as part of instance start-up).

 

This continues to confuse people so I'll call it out: if the output from the DMV does not have an entry for the index you're interested in, it has not been used since the last database startup.

 

The cache tracks the following info for each index (for user queries and system queries):

  • The number of times it was used in a seek operation (either looking up a single row, or doing a range scan) along with the time of the last seek.
  • The number of times it was used in a scan operation (e.g. a select * operation) along with the time of the last scan
  • The number of times it was used in a lookup operation (this means a bookmark lookup – where a non-clustered index does not fully cover a query and additional columns must be retrieved from the base table row) along with the time of the last lookup.
  • The number of times it was used in an update operation (this counts inserts, updates, and deletes) along with the time of the last update.

Let’s have a look at its use.

SELECT * FROM sys.dm_db_index_usage_stats;

GO

The output is too wide for a single image so I've split it in two (I won't post any more output from the DMV - I'll just talk about it):

indexusage11.jpg

indexusage21.jpg

Unless you've just re-started your instance, you'll see a bunch of output from this, representing all index activity since the instance/databases started. If you're interested in whether an index is being used, you can filter the output. Let's focus in on a particular table  - AdventureWorks.Person.Address.

SELECT * FROM sys.dm_db_index_usage_stats

WHERE database_id = DB_ID('AdventureWorks')

AND object_id = OBJECT_ID('AdventureWorks.Person.Address');

GO

You'll probably see nothing in the output, unless you've been playing around with that table. Let's force the clustered index on that table to be used, and look at the DMV output again.

SELECT * FROM AdventureWorks.Person.Address;

GO

SELECT * FROM sys.dm_db_index_usage_stats

WHERE database_id = DB_ID('AdventureWorks')

AND object_id = OBJECT_ID('AdventureWorks.Person.Address');

GO

Now there's a single row, showing a scan on the clustered index. Let's do something else.

SELECT StateProvinceID FROM AdventureWorks.Person.Address

WHERE StateProvinceID > 4 AND StateProvinceId < 15;

GO

SELECT * FROM sys.dm_db_index_usage_stats

WHERE database_id = DB_ID('AdventureWorks')

AND object_id = OBJECT_ID('AdventureWorks.Person.Address');

GO

And there's another row, showing a seek in one of the table's non-clustered indexes.

So, its easy to look at the index usage for particular tables and indexes. But how can you monitor this over time? This is easy too - let's see how. First we need to create our own table to store snapshots of the DMV output.

IF OBJECTPROPERTY (object_id (N'master.dbo.MyIndexUsageStats'), 'IsUserTable') = 1 DROP TABLE dbo.MyIndexUsageStats;

GO

SELECT GETDATE () AS ExecutionTime, * INTO master.dbo.MyIndexUsageStats

FROM sys.dm_db_index_usage_stats WHERE database_id=0;

GO

Next we need to take a baseline snapshot of the DMV output.

INSERT master.dbo.MyIndexUsageStats

SELECT getdate (), * FROM sys.dm_db_index_usage_stats;

GO

And now simulate a few operations and take another snapshot of the DMV:

SELECT * FROM AdventureWorks.Person.Address;

GO

SELECT * FROM AdventureWorks.Person.Address;

GO

SELECT StateProvinceID FROM AdventureWorks.Person.Address

WHERE StateProvinceID > 4 AND StateProvinceId < 15;

GO

INSERT master.dbo.MyIndexUsageStats

SELECT getdate (), * FROM sys.dm_db_index_usage_stats;

GO

And look at the filtered contents of our snapshot table:

SELECT * FROM master.dbo.MyIndexUsageStats

WHERE database_id = DB_ID('AdventureWorks')

AND object_id = OBJECT_ID('AdventureWorks.Person.Address');

GO

You should see four rows - two from the baseline snapshot and two from the final snapshot. If you ran just the statements above, you'll see that the user_scans count for the clustered index has increased by two, and the user_seeks count for the non-clustered index has increased by one.

So this is a pretty simple example of how you can track index usage. By putting something like this into a regularly run script you can tell which indexes aren't being used and could be candidates for less-regular index maintenance or removal altogether.

Let me know how you get on.

In SQL Server 2005, queries over partitioned tables use a single-thread per partition. This can cause performance problems under certain circumstances:

  1. On systems with many CPUs, if the table schema has less partitions than there are CPUs, then not all the CPUs will be used to process the query. Some examples:
    1. On a 32-way box, a query over a 12-partition table (e.g. a sales table partitioned by month) will only use 12 threads (one on each of 12 CPUs). This means 20 CPUs are potentially idle.
    2. On a 4-way box, a query over a 12-partition table only accesses a single partition so will only use one thread (on one CPU). This means 3 CPUs are potentially idle.
  2. On tables that have skewed data such that one partition is much larger than another, the length of time the query takes to complete will be bounded by the single thread processing the largest partition.

As part of the set of improvements in SQL Server 2008 for data warehousing there will be an option to change the threading behavior for queries over partitioned tables. The new, alternative model is that all available threads process part of each partition and then move into the next partition. This allows all available CPUs to take part in processing the query, which should lead to a drop in the query completion time.

The only time this model won't work is if the data is not in the buffer pool and is not spread out evenly across the available drives. For example, if an entire partition is stored on a single drive, then multiple threads will be scanning different portions of the drive, causing the disk head to thrash and IO throughput to drop sharply compared with a single thread driving the IO. For this reason, the option to use the new model will be off by default, to avoid surprising people with sudden bad performance after upgrading.

This should be available in the next CTP and then I'll post again with some example datasets and queries to see what the potential benefits and drawbacks are.

Part of the improvements to database mirroring in SQL Server 2008 are the addition of 10 new performance counters under the Database Mirroring Performance Object. These add some cool troubleshooting capabilities.

SQL Server 2005 provided the following 11 counters (from Books Online):

Name Description

Bytes Received/Sec

Number of bytes received per second.

Bytes Sent/sec

Number of bytes sent per second.

Log Bytes Received/sec

Number of bytes of log received per second.

Log Bytes Sent/sec

Number of bytes of log sent per second.

Log Send Queue

Total number of bytes of log that have not yet been sent to the mirror server.

Pages Sent/sec

Number of pages sent per second.

Receives/sec

Number of mirroring messages received per second.

Redo Bytes/sec

Number of bytes of log rolled forward on the mirror database per second.

Redo Queue

Total number of bytes of hardened log that currently remain to be applied to the mirror database to roll it forward. This is sent to the Principal from the Mirror.

Sends/sec

Number of mirroring messages sent per second.

Transaction Delay

Delay in waiting for unterminated commit acknowledgement.

SQL Server 2008 now provides 21 counters, with the new ones highlighted in red. This info is taken from the 2008 July CTP 08Books Online that is downloadable here.

Name Description

Bytes Received/sec

Number of bytes received per second.

Bytes Sent/sec

Number of bytes sent per second.

Log Bytes Received/sec

Number of bytes of log received per second.

Log Bytes Redone from Cache/sec

Number of redone log bytes that were obtained from the mirroring log cache, in the last second.

This counter is used on only the mirror server. On the principal server the value is always 0.

Log Bytes Sent from Cache/sec

Number of sent log bytes that were obtained from the mirroring log cache, in the last second.

This counter is used on only the principal server. On the mirror server the value is always 0.

Log Bytes Sent/sec

Number of bytes of log sent per second.

Log Compressed Bytes Rcvd/sec

Number of compressed bytes of log received, in the last second.

Log Compressed Bytes Sent/sec

Number of compressed bytes of log sent, in the last second.

Log Harden Time (ms)

Milliseconds that log blocks waited to be hardened to disk, in the last second.

Log Remaining for Undo KB

Total kilobytes of log that remain to be scanned by the new mirror server after failover.

This counter is used on only the mirror server during the undo phase. After the undo phase completes, the counter is reset to 0. On the principal server the value is always 0.

Log Scanned for Undo KB

Total kilobytes of log that have been scanned by the new mirror server since failover.

This counter is used on only the mirror server during the undo phase. After the undo phase completes, the counter is reset to 0. On the principal server the value is always 0.

Log Send Flow Control Time (ms)

Milliseconds that log stream messages waited for send flow control, in the last second.

Sending log data and metadata to the mirroring partner is the most data-intensive operation in database mirroring and might monopolize the database mirroring and Service Broker send buffers. Use this counter to monitor the use of this buffer by the database mirroring session.

Log Send Queue KB

Total number of kilobytes of log that have not yet been sent to the mirror server.

Mirrored Write Transactions/sec

Number of transactions that wrote to the mirrored database and waited for the log to be sent to the mirror in order to commit, in the last second.

This counter is incremented only when the principal server is actively sending log records to the mirror server.

Pages Sent/sec

Number of pages sent per second.

Receives/sec

Number of mirroring messages received per second.

Redo Bytes/sec

Number of bytes of log rolled forward on the mirror database per second.

Redo Queue KB

Total number of kilobytes of hardened log that currently remain to be applied to the mirror database to roll it forward. This is sent to the Principal from the Mirror.

Send/Receive Ack Time

Milliseconds that messages waited for acknowledgement from the partner, in the last second.

This counter is helpful in troubleshooting a problem that might be caused by a network bottleneck, such as unexplained failovers, a large send queue, or high transaction latency. In such cases, you can analyze the value of this counter to determine whether the network is causing the problem.

Sends/sec

Number of mirroring messages sent per second.

Transaction Delay

Delay in waiting for unterminated commit acknowledgement.

Here's a little more explanation and what you can use these new performance counters to troubleshoot:

  • Log Bytes Redone from Cache/sec
    • This measures how much of the transaction log in the redo queue is being read by the log redo task from the mirror's in-memory transaction log cache. Reading from the cache is a lot faster than having to read from the mirror's actual transaction log. Even though the log gets hardened on the mirror database's log disk, it does not need to be removed from the cache until the cache fills up with new transaction log from the principal.
    • You could think of this as a cache hit ratio measure for the redo queue.
    • If this number is lower than usual, it means that transaction log is arriving from the principal faster than the log redo task can roll forward the transaction log in the redo queue.
  • Log Bytes Sent from Cache/sec
    • This is similar to the counter above. It measures how much of the transaction log being sent from the principal to the mirror is being read from the principal's in-memory transaction log cache. Sending from the cache is a lot faster than having to go to the transaction log itself and read from disk.
    • You could think of this as a cache hit ratio for the send queue.
    • If this number is lower than usual it means that the transaction log is being generated on the principal faster than it can be sent to the mirror.
  • Log Compressed Bytes Rcvd/sec
  • Log Compressed Bytes Sent/sec
    • These are simple counters and can be used with the Log Bytes Sent/sec and Rcvd/sec counters to determine the compression ratio. Log stream compression is another enhancement in SQL Server 2008 that I'll cover in a future post.
  • Log Harden Time (ms)
    • This measures the delay between the mirror server receiving a chunk of transaction log and it being hardened on the mirror database's log disk (i.e. the delay before the chunk of transaction log becomes part of the redo queue on the mirror server).
    • If this number is higher than normal it means the mirror database's log disk is more heavily loaded and may be  becoming saturated.
  • Log Remaining for Undo KB
  • Log Scanned for Undo KB
    • The Books Online entries for these counters are self-explanatory.
    • These counters give a way to monitor the undo phase after a failover occurs.
  • Log Send Flow Control Time (ms)
    • This measures how long a mirroring connection had to wait before it could us the mirroring flow control buffer.
    • If this number is higher than normal it means there is contention for the buffer, most likely because there are too many Database Mirroring partnerships running from a single instance.
  • Mirrored Write Transactions/sec
    • As Books Online mentions, this counts the number of transactions in the principal database that had to wait for a commit record to harden in the mirror database's transaction log.
    • If this value is lower than normal (for the same application workload) it means there is a bottleneck somewhere in the system.
  • Send/Receive Ack Time
    • As Books Online mentions, this can be used to measure network latency between the principal and mirror servers.
    • If this value is larger than normal it means that there is a network bottleneck between the principal and mirror servers.

Hopefully Microsoft will publish a whitepaper or some troubleshooting scenarios showing these counters being used.

Theme design by Nukeation based on Jelle Druyts