Benchmarking: 1-TB table population (part 2: optimizing log block IO size and how log IO works)

(For the hardware setup I’m using, see this post. For the baseline performance measurements for this benchmark, see this post.)

In my previous post in the series, I described the benchmark I’m optimizing – populating a 1-TB clustered index as fast as possible using default values. I proved to you that I had an IO bottleneck because the IOs to the data file (from checkpoints) and the transaction log file (from transactions committing) were causing contention.

Several people commented that I might have mis-configured the iSCSI IO subsystem – so first off I want to look at that. Fellow MVP Denny Cherry (twitter|blog), who knows a lot more than me about IO subsystems, volunteered to discuss my iSCSI setup with me to make sure I hadn’t goofed anywhere (many thanks Denny!). It seems like I haven’t. I’m using a single iSCSI array right now, with a single NIC on the server dedicated to the iSCSI traffic – once I move to multiple volumes, I’ll add in more NICs.

Looking at Task Manager in the middle of a 6-hour test run to see the network utilization through the NIC shows that it’s not saturated, as shown below.

 

I ran the DELL smcli utility for two hours during the most recent test to see what peak throughput I’m getting, using the following command:

smcli -n Middle_MD3000 -c "set session performanceMonitorInterval=5 performanceMonitorIterations=1440;save storageArray performanceStats file=\"c:\\MiddlePerfStats.csv\";"

I saw around 101 MBytes/sec, and earlier, when testing the smcli settings, I’d seen 106 MBytes/sec. I’m sure that once I remove some of the contention this will improve.

On to the test!

The first thing I wanted to try was optimizing my use of the transaction log – i.e. issuing as few, and as large, I/Os to the log as possible. My hypothesis is that if I change the batch size from one record to, say, 10 or 100, SQL Server can make more efficient use of the log.

I changed my T-SQL script to take a batch size parameter and use explicit transactions inserting <batch-size> records. Everything else remained the same as the baseline. I picked 128 concurrent threads as my test point. In the baseline, the 128-thread insert test took 8 hours 8 minutes 27 seconds to complete (29,307 seconds). My simple T-SQL code changed to (with $(rows) and $(batch) being SQLCMD parameters to the script):

DECLARE @counter BIGINT;
DECLARE @inner   SMALLINT;
DECLARE @start   DATETIME;
DECLARE @end     DATETIME;

SELECT @counter = 0;
SELECT @start = GETDATE ();

WHILE (@counter < $(rows))
BEGIN
    SELECT @inner = 0;
    BEGIN TRAN;
    WHILE (@inner < $(batch))
    BEGIN
        INSERT INTO MyBigTable DEFAULT VALUES;
        SELECT @inner = @inner + 1;
    END
    COMMIT TRAN;
    SELECT @counter = @counter + $(batch);
END;

SELECT @end = GETDATE ();

INSERT INTO msdb.dbo.Results VALUES (CONVERT (INTEGER, DATEDIFF (second, @start, @end)));
GO
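
This is run through SQLCMD in the same way as the baseline test in part 1, just with the extra batch-size parameter passed in. The command line below is a sketch based on the baseline one; batch=%2 is my assumption of how the second value gets passed from the calling CMD script:

sqlcmd -S(local)\SQLDev01 -dBackupPerfTest -i"C:\SQLskills\Populate1TBTest\PopulateTable.sql" -v rows=%1 batch=%2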

Below are the results for 128 threads with batch sizes varying from 10 to 10,000:

Increasing the batch size from 1 record to 10 records caused the elapsed time to drop from 29,307 seconds to 21,167 seconds – a 28% improvement! However, increasing the batch size further didn’t improve things any more. Hmmm.

Next I tried different numbers of concurrent connections with a batch size of ten to see if the improvement was universal:

So moving from a single-record batch to a ten-record batch has the same dramatic effect on performance across varying numbers of connections.

What’s going on?

The answer involves some deep internals of the transaction log structure and I/O behavior. The transaction log is split up internally into sections called virtual log files (VLFs for short). These VLFs are used to manage which parts of the log can be overwritten with new log records. If this is all unfamiliar, go read my TechNet Magazine article on Understanding Logging and Recovery in SQL Server which explains about the log in more detail and about transactions, otherwise what’s coming next may not make much sense.
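
As a quick aside, if you want to see the VLFs in a log for yourself, the undocumented (but widely used) DBCC LOGINFO command returns one row per VLF. For example, against my benchmark database:

DBCC LOGINFO (N'BackupPerfTest'); -- one row per VLF in the BackupPerfTest log
GO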

Whenever a transaction commits, all the transaction log records up to and including the LOP_COMMIT_XACT log record for that transaction (including log records from other transactions that are inter-mingled with those from the one that just committed) must be written out to disk so that the transaction is durable (this is write-ahead logging). But it’s not individual log records that are written to disk; the unit of I/O for the transaction log is a log block. VLFs are split internally into dynamically sized log blocks, ranging from 512 bytes to 60KB in 512-byte increments, as needed. There are algorithms to figure out how big log blocks should be, based on what log records are being flushed out.

With the single record batch, the amount of log generated in the transaction totals 4,952 bytes. You can find this by doing something like:

CHECKPOINT;  -- to force the log in the SIMPLE recovery model to clear
GO

INSERT INTO MyBigTable DEFAULT VALUES;
GO

SELECT * FROM fn_dblog (NULL, NULL); -- show me all active log
GO

Then add up the values in the Log Record Length column for all records from the first LOP_BEGIN_XACT log record to the LOP_COMMIT_XACT with the matching Transaction ID.
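
If you’d rather not add the values up by hand, a quick aggregate over the same fn_dblog output does it for you (a sketch using the Log Record Length and Transaction ID columns):

-- Total log bytes generated per transaction in the active log
SELECT [Transaction ID], SUM ([Log Record Length]) AS [TotalLogBytes]
FROM fn_dblog (NULL, NULL)
GROUP BY [Transaction ID];
GO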

Anyway, a single-record transaction generates 4,952 bytes of transaction log, which will be flushed out in our case in a log block that’s 5,120 bytes (the nearest multiple of 512 above 4,952), with a bunch of zero-padding at the end. For ten single-record transactions, that’s 10 small I/Os.

The ten-record transaction generates 48,688 bytes of transaction log and will be flushed out in a log block that’s 49,152 bytes (the nearest multiple of 512 above 48,688). This is clearly more efficient than 10 smaller I/Os and is why changing to batches makes things go faster.
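
If you want to check the rounding yourself, the log block size is just the amount of log rounded up to the next 512-byte boundary; the arithmetic below is purely illustrative:

-- Round the log generated up to the next 512-byte boundary (illustration only)
DECLARE @SingleRecordTran INT = 4952;
DECLARE @TenRecordTran    INT = 48688;

SELECT
    ((@SingleRecordTran + 511) / 512) * 512 AS [SingleRecordLogBlock], -- 5,120 bytes
    ((@TenRecordTran    + 511) / 512) * 512 AS [TenRecordLogBlock];    -- 49,152 bytes
GO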

A 100-record transaction generates 489,628 bytes of transaction log, which is clearly more than the 60KB log-block maximum, so it actually covers multiple log blocks (varying from 14-16 in my tests – I don’t remember the exact algorithms). You can see the log block changing when the middle number in the Current LSN column of the fn_dblog output increases. It looks like it’s jumping up, and it is – this is the offset of the log block within the current VLF divided by 512.
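
If you want to spot the log block boundaries yourself, something like this makes the jumps easy to see (a sketch that just pulls the middle, hexadecimal part out of the Current LSN string):

-- Extract the log block offset (the middle part of the Current LSN, in hex)
-- using the PARSENAME trick of swapping the colons for periods
SELECT [Current LSN],
    PARSENAME (REPLACE ([Current LSN], ':', '.'), 2) AS [LogBlockOffsetHex],
    [Operation], [Log Record Length]
FROM fn_dblog (NULL, NULL);
GO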

Because the 100-record transaction is split into multiple log blocks, there’s no real I/O gain during log flushes over the 10-record transaction – which is illustrated by the results I saw.

Now, even with this speed increase from the larger batch size, the Avg. Disk Write Queue Length is still anywhere from 20-40 when running the 128-connection test, so there’s still a significant bottleneck there. In fact, the transaction log still grew significantly during these tests (up to 23GB in one case) – for an explanation of this phenomenon, see Interesting case of watching log file growth during a perf test. I picked up some wait stats queries from fellow MVP Glenn Berry (twitter|blog) to run while the tests were running. I’m using his as they’re published and anyone can download and use them (see his blog post for the queries, and Bob Ward’s CSS blog post that’s starting to document the wait types).

His queries tell me that with 10-record batches and 128 connections:

  • Average write-stall to the data file is 37ms
  • Average write-stall to the log file is 18ms
  • Top wait types are PAGELATCH_EX (55% of all waits), PAGELATCH_SH (28% of all waits), and WRITELOG (14% of all waits)

The first two wait types indicate an insert hotspot in the table, and the third is the log manager waiting for log block flushes to disk to complete.
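
Glenn’s actual scripts are in his blog post linked above; just to show the kind of thing they do, here’s a minimal sketch of the two DMV queries involved (not his exact code):

-- Average write stall per database file
SELECT DB_NAME (vfs.database_id) AS [Database], mf.type_desc AS [FileType],
    vfs.io_stall_write_ms / NULLIF (vfs.num_of_writes, 0) AS [AvgWriteStallMs]
FROM sys.dm_io_virtual_file_stats (NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id;
GO

-- Top wait types as a percentage of all wait time since the stats were last cleared
SELECT TOP (5) wait_type,
    100.0 * wait_time_ms / SUM (wait_time_ms) OVER () AS [PercentOfWaits]
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
GO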

Let’s look at a perfmon capture during the 10-records-per-batch test with 128 connections:

Let’s go through each counter (top to bottom in the list in the image) and explain what’s going on. I deliberately picked this time-slice, as it shows the effect of contention very clearly:

  • Pages Allocated/sec: this is the light blue line and is the Access Methods part of the Storage Engine (the dev team I used to run) creating new data and index pages for the clustered index we’re populating.
  • Checkpoint pages/sec: this is the pink line at the bottom left and bottom right of the capture. This is the buffer pool writing out dirty pages during a periodic checkpoint.
  • Avg. Disk sec/Write: this is the dark blue line that’s tracking just above the thick black line. It’s the amount of time in ms for a write to complete. You can see that it has a minimum around 12:51:00 and then varies wildly, hitting as high as 50+ms for a single write.
  • Avg. Disk Write Queue Length: this is the highlighted line in thick black. It also has a minimum around 12:51:00 and varies wildly the rest of the time.
  • Disk Write Bytes/sec: this is the dark green line at the top that shows the number of bytes being written to disk from all IO sources. Same story around 12:51:00.
  • Log Growths: a simple cumulative counter of log growths since the database was created/the server started. It’s over 100 and off the chart. (You can also read this counter with the query shown just after this list.)
  • Log Bytes Flushed/sec: this is the red, fairly constant line around 1/3 the way up and is log blocks being flushed to disk because of transaction commits or checkpoints.
  • Lazy writes/sec: this is the light green line at the bottom and is the buffer pool having to force data pages to be written to disk (along with all transaction log flushed up to the point of the last log record to change the page being written) to make space for images of newly created pages.

This time-slice is really cool in that it shows what happens when contention goes away. Just before 12:51:00, a checkpoint ends and the lazy writer has nothing to do – so the only I/Os hitting the disks are those coming from the transaction log flushing out log blocks as transactions commit. You can see the Avg. Disk Write Queue Length drop down to 2-3, the Avg. Disk sec/Write drop to about 5ms, and most beautiful of all (look, I’m a big geek ok? :-), the Disk Write Bytes/sec (the green line) drops down to be exactly equal to the Log Bytes Flushed/sec – proving that it’s just log flushes hitting the disk. This is the no-contention case. It happens again for a brief spell about 10 seconds later – the lazy writer most likely created a temporary surfeit of empty buffers. All the rest of the time, the lazy writer and checkpoints play havoc with the write throughput on the drives by causing contention.

It’s clearly time to try some separation of files to relieve the contention – and that’s what I’ll cover in the next post in the series.

Hope you’re enjoying the series – these take a long time to write up!

DBCC CHECKFILEGROUP bug fixed in 2008 SP1 CU6

Back in September I blogged about an old 2005 bug that prevented DBCC CHECKFILEGROUP from checking the partitions of an object on the specified filegroup unless *all* partitions of the object were on that filegroup (not a smart way to set things up!). The bug was fixed ages ago in 2005 but has only just been fixed in 2008. The post that explains the bug in more detail is DBCC CHECKFILEGROUP bug on SQL Server 2008. This feature is important for splitting the consistency checks of a partitioned VLDB over a series of days – see CHECKDB From Every Angle: Consistency Checking Options for a VLDB for more details.

SQL Server 2008 SP1 Cumulative Update 6 (which you can get here) finally has the bug fix for 2008. The KB article that describes the bug is 975991.

You’re all running regular consistency checks, right?

Benchmarking: 1-TB table population (part 1: the baseline)

(For the hardware setup I’m using, see this post.)

As part of my new benchmarking series I first wanted to play around with different configurations of data files and backup files for a 1-TB database to see what kind of performance gains I can get by invoking the parallelism possible when backing up and restoring the database. To do that I need a way to quickly populate a 1-TB database so that I can mess around with different numbers of files and so on. It doesn’t matter what the data in the database is, as backup doesn’t care – as long as there’s a terabyte of it. Why a terabyte? It’s a nice round number, it’s a common enough database size right now, and I have the storage to play around with it.

But then my plans got derailed. In figuring out how to most quickly populate a 1-TB database, I realized that in itself would be an interesting benchmark to explore, so that’s what I’m doing first.

My aim is to give you improvements you can use in real life. If you think this is going to be boring, skip down to the end of the post where I show a detailed perfmon and explain what’s going on in my overloaded IO subsystem, then come back up here :-)

The baseline for this benchmark is contrived – I’m going to populate a 1-TB clustered index (so I can do other experiments with the index) as quickly as I can. I’m starting with a brain-dead database layout, configuration, and method of populating the table, and then I’m going to alter different things to see the effect on the system. The effects and gains will be the interesting part, as they will expose parts of how SQL Server works that *WILL* be applicable to real-life situations and workloads – the whole point of me doing all of this is to show you improvements, why they work, and how they could be useful to you.

When doing any kind of performance testing it’s *essential* to have a baseline with which to compare – otherwise how do you know what effect a variation is having? This post describes my baseline setup, my measurements, and the limitations I start to notice.

First let me describe the baseline setup:

  • Single data file and log file stored on the same volume, on an 8-drive RAID-10 array (each drive is a 300GB 15k SCSI drive), connected through iSCSI to the server
  • Data file is created to be 1-TB, with instant file initialization enabled
  • Log file is created to be 256MB
  • Database is using the SIMPLE recovery model

Yes, I’m deliberately putting the data and log on the same RAID array. I *want* to see some contention so I can prove to you how separation of data and log can reduce contention and improve performance sometimes.

Database and table creation script:

CREATE DATABASE [BackupPerfTest]
ON
    (NAME = N'BackupPerfTest_Data',
    FILENAME = N'K:\BackupPerfTest.mdf',
    SIZE = 1TB,
    FILEGROWTH = 1GB)
LOG ON
    (NAME = N'BackupPerfTest_Log',
    FILENAME = N'K:\BackupPerfTest.ldf',
    SIZE = 256MB,
    FILEGROWTH = 50MB);
GO

ALTER DATABASE [BackupPerfTest] SET RECOVERY SIMPLE;
GO

USE [BackupPerfTest];
GO

CREATE TABLE [MyBigTable] ([c1] BIGINT IDENTITY, [c2] CHAR (4100) DEFAULT 'a');
GO
CREATE CLUSTERED INDEX [MyBigTable_cl] ON [MyBigTable] ([c1]);
GO

I figured the fastest way to fill the database is to have a single table with one row per page, and that having SQL Server fill the large CHAR column itself through the DEFAULT constraint, rather than me doing a REPLICATE in the INSERT statement, would be quickest.

For the actual population of the table, I worked out that 134,217,728 table rows gives me a terabyte, with each row taking up a single 8KB page.
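
That number is just a terabyte divided by the 8KB page size; here’s a one-line sanity check:

-- 1 TB of 8KB pages, one row per page
SELECT CAST (1024 AS BIGINT) * 1024 * 1024 * 1024 * 1024 / 8192 AS [RowsNeeded]; -- 134,217,728
GO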

The baseline setup of the table population is:

  • Varying numbers of concurrent connections (16, 32, 64, 128, 256) to the server (8-way DELL PowerEdge 1950)
  • Each connection runs a simple script that inserts 134,217,728 / number-of-connections rows into the table, and then logs the elapsed time into a results table
  • Each insert is done as a single-statement implicit transaction (if I don’t do an explicit BEGIN TRAN/COMMIT TRAN, SQL Server does it for me)
  • A monitor connection pings the results table every 5 minutes checking to see if number-of-connections results are there, and sending me email if so

Yes, I’m deliberately using this method to insert each row. Again, I want to be able to make improvements and see the effect of the changes.

Each connection will be running this script:

SET NOCOUNT ON;
GO

DECLARE @counter BIGINT;
DECLARE @start   DATETIME;
DECLARE @end     DATETIME;

SELECT @counter = 0;
SELECT @start = GETDATE ();

WHILE (@counter < $(rows))
BEGIN
    INSERT INTO [MyBigTable] DEFAULT VALUES;
    SELECT @counter = @counter + 1;
END;

SELECT @end = GETDATE ();

INSERT INTO [msdb].[dbo].[Results] VALUES (CONVERT (INTEGER, DATEDIFF (second, @start, @end)));
GO

This is run through SQLCMD, and the number of rows to insert is passed into the T-SQL script using:
sqlcmd -S(local)\SQLDev01 -dBackupPerfTest -i"C:\SQLskills\Populate1TBTest\PopulateTable.sql" -v rows=%1

%1 in the line above is passed from a master CMD that kicks off number-of-connections CMD windows, each of which just runs the SQLCMD line above.

So – a simple setup.

Here’s a graph of the results:

The elapsed times for all connections to complete their work (as there could be up to an hour between the first and last to complete) were:

  • 16-way: 15 hours 25 minutes 5 seconds
  • 32-way: 13 hours 50 minutes 18 seconds (11% faster)
  • 64-way: 10 hours 12 minutes 48 seconds (27% faster)
  • 128-way: 8 hours 8 minutes 27 seconds (20% faster)
  • 256-way: 7 hours 24 minutes 21 seconds (9% faster)

More connections clearly lead to a faster run time, but the improvement from doubling the number of threads isn’t directly proportional to the number of threads. The biggest improvement was from 32 to 64 threads, and then the percentage gain started to tail off. Let’s look at the page allocations per second for each experiment too:

As I’d expect, the pace at which pages are allocated in the database increases with more threads, and the percentage improvements pretty much match those of the elapsed-time graph above. There’s a slight difference in the 128 and 256 gains here because the graph shows what the perf counter number was after the experiment reached a steady state. I noticed that the log grew enormously for the last few tests, which meant the steady-state number wasn’t reached for a while. I’ve already blogged about that phenomenon in Interesting case of watching log file growth during a perf test.

I also watched what was happening in perfmon to see if there were any obvious performance problems going on. Here’s a perfmon graph for the 64-way test once it reached steady-state and the log wasn’t growing:

Analysis:

  • The black line is above 0 when a checkpoint is occurring.
  • The green line represents how much data is being written to the K: volume, where the data and log file are. You can see there’s a constant amount of data being written all the time (transaction log records when the implicit transactions commit) with large spikes whenever a checkpoint occurs and causes the actual data pages to be written out.
  • The light blue line is the pages allocated per second. You can see that it takes a major nose dive whenever a checkpoint occurs. I’m speculating that this is because disk contention prevents the transaction log from being written to disk (thus slowing down the transaction throughput) while the checkpoint is occurring and writing out data pages.
  • The dark blue line at the bottom is the average disk seconds per write. You can see that it hovers around 4-5 milliseconds and spikes to 16-17 when a checkpoint occurs.
  • The brown line in the middle is the average disk write queue length. It hovers around 18-19 and spikes to around 25 when a checkpoint occurs.

Observations:

  • This system is clearly bottlenecked in the I/O subsystem.
  • There is a direct correlation between checkpoints occurring and both increased disk queue length and reduced transaction throughput.

This is what I would expect to see on a system with lots of updates where the log and data are on the same volume. Remember that I’ve got them on a fast RAID-10 array. This debunks the theory I’ve often heard that contention doesn’t matter on RAID arrays. Yes, it does. You can still max out the throughput capabilities of any IO subsystem – it just depends what you’re trying to do with it. Imagine if I had nonclustered indexes on this table too – more logging and more pages being allocated – much worse performance…

So now I’ve got my baseline and there are some obvious things I can try to improve things:

  • Data and log file separation
  • Varying number of data files
  • Varying placement of data files (e.g. different RAID arrays)
  • Manual checkpoints
  • Pre-sizing the log
  • Using explicit transactions with varying batch insert sizes
  • Using -E to get larger allocations

I’m not saying that all of these things are going to help, but over the next few weeks I’m going to try them all and report on the results. You’ll be able to clearly see the effect of changing these on my benchmark, running on production-quality hardware, rather than just taking people’s words for it.

I hope you’re going to find these benchmarks and experiments useful – I’ll be learning (hopefully) as I go along too.

Let me know if there’s anything else you’d like to see me try, and if you’re following along (I don’t want to spend all this time if no-one’s reading the series!)

Thanks!