In my previous post in the series, I described the benchmark I'm optimizing – populating a 1-TB clustered index as fast as possible using default values. I proved to you that I had an IO bottleneck because the IOs to the data file (from checkpoints) and the transaction log file (from transactions committing) were causing contention.
Several people commented that I might have mis-configured the iSCSI IO subsystem – so first off I want to look at that. Fellow MVP Denny Cherry (twitter|blog), who knows a lot more than me about IO subsystems, volunteered to discuss my iSCSI setup with me to make sure I hadn't goofed anywhere (many thanks Denny!). It seems like I haven't. I'm using a single iSCSI array right now, with a single NIC on the server dedicated to the iSCSI traffic – once I move to multiple volumes, I'll add in more NICs.
Looking at Task Manager in the middle of a 6-hour test run to see the network utilization through the NIC shows that it's not saturated, as shown below.
I ran the DELL smcli utility for two hours during the most recent test to see what peak throughput I'm getting, using the following command:
smcli -n Middle_MD3000 -c "set session performanceMonitorInterval=5 performanceMonitorIterations=1440;save storageArray performanceStats file=\"c:\\MiddlePerfStats.csv\";"
I saw around 101MBytes/sec. and earlier when testing the smcli settings I'd seen 106MBytes/sec. I'm sure once I remove some of the contention that this will get better.
On to the test!
The first thing I wanted to try was optimizing my use of the transaction log – i.e. doing as few and as large IOs as possible to the log. My hypothesis is that by changing the batch size from one to, say, 10 or 100 SQL Server can make more efficient use of the log.
I changed my T-SQL script to take a batch size parameter and use explicit transactions inserting <batch-size> records. Everything else remained the same as the baseline. I picked 128 concurrent threads as my test point. In the baseline, the 128-thread insert test took 8 hours 8 minutes 27 seconds to complete (29307 seconds). My simple T-SQL code changed to (with $(rows) and $(batch) being SQLCMD parameters to the script):
DECLARE @counter BIGINT;
DECLARE @inner SMALLINT;
DECLARE @start DATETIME;
DECLARE @end DATETIME;
SELECT @counter = 0;
SELECT @start = GETDATE ();
WHILE (@counter < $(rows))
SELECT @inner = 0;
WHILE (@inner < $(batch))
INSERT INTO MyBigTable DEFAULT VALUES;
SELECT @inner = @inner + 1;
SELECT @counter = @counter + $(batch);
SELECT @end = GETDATE ();
INSERT INTO msdb.dbo.Results VALUES (CONVERT (INTEGER, DATEDIFF (second, @start, @end)));
Below are the results for 128 threads with batch sizes varying from 10 to 10000:
Increasing the batch size from 1 record to 10 records caused the elapsed time to drop from 29307 seconds to 21167 seconds – a 28% improvement! However, increasing the batch size further didn't improve things any more. Hmmm.
Next I tried different numbers of concurrent connections with a batch size of ten to see if the improvement was universal:
So moving from a single-record batch to a ten record batch has the same dramatic effect on performance for varying numbers of connections.
What's going on?
The answer involves some deep internals of the transaction log structure and IO behavior. The transaction log is split up internally into sections called virtual log files (VLFs for short). These VLFs are used to manage which parts of the log can be overwritten with new log records. If this is all unfamiliar, go read my TechNet Magazine article on Understanding Logging and Recovery in SQL Server which explains about the log in more detail and about transactions, otherwise what's coming next may not make much sense.
Whenever a transaction commits, all the transaction log records up to and including the LOP_COMMIT_XACT log record for that transaction (including log records from other transactions that are inter-mingled with those from the one that just committed) must be written out to disk, so the transaction is durable (called write-ahead logging). But it's not individual log records that are written to disk, the unit of IO for the transaction log is a log block. VLFs are split internally into dynamically sized log blocks, ranging from 512-bytes to 60KB in 512-byte increments, as needed. There are algorithms to figure out how big log blocks should be based on what log records are being flushed out.
With the single record batch, the amount of log generated in the transaction totals 4952 bytes. You can find this by doing something like:
CHECKPOINT; — to force the log in the SIMPLE recovery model to clear
INSERT INTO MyBigTable DEFAULT VALUES;
SELECT * FROM fn_dblog (NULL, NULL); — show me all active log
Then add up the values in the Log Record Length column for all records from the first LOP_BEGIN_XACT log record to the LOP_COMMIT_XACT with the matching Transaction ID.
Anyway, a single-record transaction generates 4952 bytes of transaction log, which will be flushed out in our case in a log block that's 5120 bytes (the nearest multiple of 512 above 4952), with a bunch of zero-padding at the end. For ten single-record transactions, that's 10 small IOs.
The ten-record transaction generates 48688 bytes of transaction log and will be flushed out in a log block that's 49152 bytes (the nearest multiple of 512 above 48688). This is clearly more efficient than 10 smaller IOs and is why changing to batches makes things go faster.
A 100-record transaction generates 489628 bytes of transaction log, which is clearly more than the 60KB log-block maximum, so it actually covers multiple log blocks (varying from 14-16 in my tests – I don't remember the exact algorithms). You can see the log block changing when the middle number in the Current LSN column of the fn_dblog output increases. It looks like it's jumping up, and it is – this is the offset of the log block within the current VLF divided by 512.
Because the 100-record transaction is split into multiple log blocks, there's no real IO gain during log flushes over the 10-record transaction – which is illustrated by the results I saw.
Now, even with this speed increase from the increased batch size, the average-disk-write-queue-length is still anywhere from 20-40 when running the 128-connection test, so there's still a significant bottleneck there. In fact, the transaction log grew significantly still during these tests (up to 23GB in one case) – for an explanation of this phenomenon, see Interesting case of watching log file growth during a perf test. I picked up some wait stats queries from fellow-MVP Glenn Berry (twitter|blog) to run while the tests were running. I'm using his as they're published and anyone can download and use them (see his blog post for the queries, and Bob Ward's CSS blog post that's starting to document the wait types).
His queries tell me that with the 10-record per batch and 128-connections:
- Average write-stall to the data file is 37ms
- Average write-stall to the log file is 18ms
- Top wait types are PAGELATCH_EX (55% of all waits), PAGELATCH_SH (28% of all waits), and WRITELOG (14% of all waits)
The first two waits are the Storage Engine waiting for the buffer pool to kick out dirty pages so that buffers can be used to hold newly created pages, and the third is the log manager waiting for log block flushes to disk to complete.
Clearly IO IO IO. Let's look at a a perfmon capture during the 10-records per batch test with 128 connections:
Let's go through each counter (top to bottom in the list in the image) and explain what's going on. I deliberately picked this time-slice, as it really simply shows the effect of contention:
- Pages Allocated/sec: this is the light blue line and is the Access Methods part of the Storage Engine (the dev team I used to run) creating new data and index pages for the clustered index we're populating.
- Checkpoint pages/sec: this is the pink line at the bottom left and bottom right of the capture. This is the buffer pool writing out dirty pages during a periodic checkpoint.
- Avg. Disk sec/Write: this is the dark blue line that's tracking just above the thick black line. It's the amount of time in ms for a write to complete. You can see that it has a minimum around 12:51:00 and then varies wildly, hitting as high as 50+ms for a single write.
- Avg. Disk Write Queue Length: this is the highlighted line in thick black. It also has a minimum around 12:51:00 and varies wildly the rest of the time.
- Disk Write Bytes/sec: this is the dark green line at the top that shows the number of bytes being written to disk from all IO sources. Same story around 12:51:00.
- Log Growths: A simple counter since the database was created/server started. It's over 100 and off the chart.
- Log Bytes Flushed/sec: this is the red, fairly constant line around 1/3 the way up and is log blocks being flushed to disk because of transaction commits or checkpoints.
- Lazy writes/sec: this is the light green line at the bottom and is the buffer pool having to force data pages to be written to disk (along with all transaction log flushed up to the point of the last log record to change the page being written) to make space for images of newly created pages.
This time-slice is really cool in that it shows what happens when contention goes away. Just before 12:51:00, a checkpoint ends and the lazywriter has nothing to do – so the only IOs hitting the disks are those coming from the transaction log flushing out log blocks as transactions commit. You can see the Avg. Disk Write Queue Length drop down to 2-3, the Avg. Disk sec/Write drop to about 5ms, and most beautiful of all (look, I'm a big geek ok? :-), the Disk Write Bytes/sec (the green line) drops down to be exactly equal to the Log Bytes Flushed/sec – proving that it's just log flushes hitting the disk. This is the no-contention case. It happens again for brief spell about 10 seconds later – the lazywriter most likely created a temporary surfeit of empty buffers. All the rest of the time, the lazywriter and checkpoints play havoc with the write throughput on the drives by causing contention.
It's clearly time to try some separation of files to relieve the contention – and that's what I'll cover in the next post in the series.
Hope you're enjoying the series – these take a long time to write up!