(For the hardware setup I'm using, see this post.)

As part of my new benchmarking series I first wanted to play around with different configurations of data files and backup files for a 1-TB database to see what kind of performance gains I can get invoking the parallelism possible when backing up and restoring the database. To do that I need a way to quickly populate a 1-TB database so that I can mess around with different numbers of files and so on. It doesn't matter what the data in the database is, as backup doesn't care – as long as there's a terabyte of it. Why a terabyte? It's a nice round number, it's a common enough database size right now, and I have the storage to play around with it.

But then my plans got derailed. In figuring out how to most quickly populate a 1-TB database, I realized that in itself would be an interesting benchmark to explore, so that's what I'm doing first.

My aim is to give you improvements you can use in real life. If you think this is going to be boring, skip down to the end of the post where I show a detailed perfmon graph and explain what's going on in my overloaded IO subsystem, then come back up here :-)

The baseline for this benchmark is contrived – I'm going to populate a 1-TB clustered index (so I can do other experiments with the index) as quickly as I can. The twist is that I'm starting with a brain-dead database layout, configuration, and method of populating the table, and then altering different things to see the effect on the system. The effects and gains are the interesting part, as they expose parts of how SQL Server works which *WILL* be applicable to real-life situations and workloads – the whole point of me doing all of this is to show you improvements, why they work, and how they could be useful to you.

When doing any kind of performance testing it's *essential* to have a baseline with which to compare – otherwise how do you know what effect a variation is having? This post describes my baseline setup, measurements, and the limitations I start to notice.

First let me describe the baseline setup:

  • Single data file and log file stored on the same volume, on an 8-drive RAID-10 array (each drive is a 300GB 15k SCSI drive), connected through iSCSI to the server
  • Data file is created to be 1-TB, with instant file initialization enabled (a quick way to verify that is sketched just after this list)
  • Log file is created to be 256MB
  • Database is using the SIMPLE recovery model
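As an aside, here's the kind of check I use to confirm that instant file initialization really is working before trusting it for a 1-TB file creation – a minimal sketch, using trace flags 3004 and 3605 to send file-zeroing messages to the error log, and a throwaway database whose name is just a placeholder:

-- Send file-zeroing messages to the error log
DBCC TRACEON (3004, 3605, -1);
GO

-- Throwaway database purely for the check (placeholder name)
CREATE DATABASE IFICheck;
GO

-- With instant file initialization enabled, only the log file should show up in
-- zeroing messages (log files are always zero-initialized, data files are skipped)
EXEC xp_readerrorlog 0, 1, N'Zeroing';
GO

DROP DATABASE IFICheck;
GO
DBCC TRACEOFF (3004, 3605, -1);
GO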

Yes, I'm deliberately putting the data and log on the same RAID array. I *want* to see some contention so I can prove to you how separating data and log can sometimes reduce contention and improve performance.

Database and table creation script:

CREATE DATABASE BackupPerfTest ON
(NAME = 'BackupPerfTest_Data',
   FILENAME = 'K:\BackupPerfTest.mdf',
   SIZE = 1TB,
   FILEGROWTH = 1GB)
LOG ON
(NAME = 'BackupPerfTest_Log',
   FILENAME = 'K:\BackupPerfTest.ldf',
   SIZE = 256MB,
   FILEGROWTH = 50MB);
GO

ALTER DATABASE BackupPerfTest SET RECOVERY SIMPLE;
GO

USE BackupPerfTest;
GO

CREATE TABLE MyBigTable (c1 BIGINT IDENTITY, c2 CHAR (4100) DEFAULT 'a');
GO
CREATE CLUSTERED INDEX MyBigTable_cl ON MyBigTable (c1);
GO

I figured the fastest way to fill the database is to have a single table with one row per page, and that having SQL Server fill the large CHAR column itself through the DEFAULT constraint, rather than me doing a REPLICATE in every INSERT, would be quickest.
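If you want to sanity-check the one-row-per-page assumption, something like this works after inserting a handful of rows (run it in the BackupPerfTest database – at the leaf level of the clustered index, the page count should pretty much match the record count):

-- Leaf level (index_level = 0) of the clustered index: page_count should be
-- (almost) equal to record_count if each row really fills a page
SELECT page_count, record_count
FROM sys.dm_db_index_physical_stats (
   DB_ID (N'BackupPerfTest'), OBJECT_ID (N'MyBigTable'), NULL, NULL, 'DETAILED')
WHERE index_level = 0;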

For the actual population of the table, I worked out that 134,217,728 table rows give me a terabyte, with each row taking up a single 8KB page.
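The arithmetic is simple – a terabyte divided by the 8KB page size:

-- 1TB / 8KB per page = 134,217,728 pages, i.e. 134,217,728 rows at one row per page
-- (the CONVERT keeps the math in BIGINT so it doesn't overflow INT)
SELECT (CONVERT (BIGINT, 1024) * 1024 * 1024 * 1024) / (8 * 1024) AS [RowsNeeded];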

The baseline setup of the table population is:

  • Varying numbers of concurrent connections (16, 32, 64, 128, 256) to the server (8-way DELL PowerEdge 1950)
  • Each connection runs a simple script that inserts 134217728 / number-of-connections rows into the table, and then logs the elapsed time into a results table
  • Each insert is done as a single-statement autocommit transaction (as I don't do an explicit BEGIN TRAN/COMMIT TRAN, SQL Server commits each statement for me)
  • A monitor connection pings the results table every 5 minutes checking to see whether number-of-connections results are there, and sends me an email if so (a minimal sketch of this is below)

Yes, I'm deliberately using this method to insert each row. Again, I want to be able to make improvements and see the effect of the changes.
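For completeness, here's roughly what the results table and the monitor connection look like – a sketch rather than my exact script. The column name, Database Mail profile name, and email address are all placeholders; the only things the population script relies on are the msdb.dbo.Results table and its single integer column.

USE msdb;
GO

-- Single-column results table that each connection inserts its elapsed seconds into
CREATE TABLE dbo.Results (ElapsedSeconds INTEGER);
GO

-- Monitor loop: wake up every 5 minutes and check whether all connections have finished
DECLARE @expected INT;
SELECT @expected = 64;    -- number-of-connections for this run

WHILE ((SELECT COUNT (*) FROM msdb.dbo.Results) < @expected)
BEGIN
   WAITFOR DELAY '00:05:00';
END;

-- Requires Database Mail to be configured; profile name and recipient are placeholders
EXEC msdb.dbo.sp_send_dbmail
   @profile_name = 'PlaceholderProfile',
   @recipients = 'someone@example.com',
   @subject = 'Population test complete',
   @body = 'All connections have logged their results.';
GO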

Each connection will be running this script:

SET NOCOUNT ON;
GO

DECLARE @counter BIGINT;
DECLARE @start   DATETIME;
DECLARE @end     DATETIME;

SELECT @counter = 0;
SELECT @start = GETDATE ();

WHILE (@counter < $(rows))
BEGIN
   INSERT INTO MyBigTable DEFAULT VALUES;
   SELECT @counter = @counter + 1;
END;

SELECT @end = GETDATE ();

INSERT INTO msdb.dbo.Results VALUES (CONVERT (INTEGER, DATEDIFF (second, @start, @end)));
GO

This is run through SQLCMD, and the number of rows to insert is passed into the T-SQL script using:

sqlcmd -S(local)\SQLDev01 -dBackupPerfTest -i"C:\SQLskills\Populate1TBTest\PopulateTable.sql" -v rows=%1

%1 in the line above is passed in from a master CMD script that kicks off number-of-connections CMD windows, each of which just runs the SQLCMD line above.

So – a simple setup.

Here's a graph of the results:

[Graph: elapsed time for each number of connections]

The elapsed times for all connections to complete their work (as there could be up to an hour between the first and last to complete) were:

  • 16-way: 15 hours 25 minutes 5 seconds
  • 32-way: 13 hours 50 minutes 18 seconds (11% faster)
  • 64-way: 10 hours 12 minutes 48 seconds (27% faster)
  • 128-way: 8 hours 8 minutes 27 seconds (20% faster)
  • 256-way: 7 hours 24 minutes 21 seconds (9% faster)

More connections clearly leads to a faster runtime, but the improvement from doubling the number of threads isn't directly proportional to the number of threads. The biggest improvement was from 32 to 64 threads, and then the percentage gain started to tail off. Let's look at the page allocations per second for each experiment too:

[Graph: page allocations per second for each number of connections]

As I'd expect, the pace at which pages are being allocated in the database increases with more threads, and the percentage improvements pretty much match those of the elapsed time graph above. There's a slight difference in the 128 and 256 gains here because the graph shows what the perf counter number was after the experiment reached a steady state. I noticed that the log grew enormously for the last few tests, which meant the steady-state number wasn't reached for a while. I've already blogged about that phenomenon in Interesting case of watching log file growth during a perf test.
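(As a side note, the quickest way I know to keep an eye on the log size and how full it is during a run is the old standby below, run from a separate connection every so often.)

-- Shows log size (MB) and percentage of log space used for every database on the instance
DBCC SQLPERF (LOGSPACE);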

I also watched what was happening in perfmon to see if there were any obvious performance problems going on. Here's a perfmon graph for the 64-way test once it reached steady-state and the log wasn't growing:

[Graph: perfmon capture during the 64-way test at steady state]

Analysis: 

  • The black line is above 0 when a checkpoint is occurring.
  • The green line represents how much data is being written to the K: volume, where the data and log file are. You can see there's a constant amount of data being written all the time (transaction log records when the implicit transactions commit) with large spikes whenever a checkpoint occurs and causes the actual data pages to be written out.
  • The light blue line is the pages allocated per second. You can see that it takes a major nose dive whenever a checkpoint occurs. I'm speculating that this is because of disk contention preventing the transaction log from being written to disk (thus slowing down the transaction throughput) while the checkpoint is occurring and writing out data pages.
  • The dark blue line at the bottom is the average disk seconds per write. You can see that it hovers around 4-5 milliseconds and spikes to 16-17 when a checkpoint occurs.
  • The brown line in the middle is the average disk write queue length. It hovers around 18-19 and spikes to around 25 when a checkpoint occurs.
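If you want to watch a couple of these counters from inside SQL Server rather than firing up perfmon, sys.dm_os_performance_counters exposes them too. Bear in mind that the per-second counters hold cumulative values, so you have to sample twice and take the difference over the interval to get a rate – this is just a sketch of where to look:

-- Cumulative values for two of the counters in the graph above; sample twice and
-- diff over the interval to get a per-second rate
SELECT object_name, counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Checkpoint pages/sec', N'Pages Allocated/sec');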

Observations:

  • This system is clearly bottlenecked in the I/O subsystem
  • There is a direct correlation between checkpoints occurring and both increased disk queue length and reduced transaction throughput

This is what I would expect to see on a system with lots of updates where the log and data are on the same volume. Remember that I've got them on a fast RAID-10 array. This debunks the theory I've often heard that contention doesn't matter on RAID arrays. Yes, it does. You can still max out the throughput capabilities of any IO subsystem – it just depends what you're trying to do with it. Imagine if I had nonclustered indexes on this table too – more logging and more pages being allocated – much worse performance…

So now I've got my baseline and there are some obvious things I can try to improve things:

  • Data and log file separation
  • Varying number of data files
  • Varying placement of data files (e.g. different RAID arrays)
  • Manual checkpoints
  • Pre-sizing the log
  • Using explicit transactions with varying batch insert sizes (sketched below)
  • Using the -E startup parameter to get larger allocations
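As a preview of the explicit-transaction change, here's a minimal sketch of how the insert loop will change – the batch size of 100 is just an illustrative value that I'll be varying:

-- Sketch of the batched version of the insert loop (batch size is illustrative)
DECLARE @counter   BIGINT;
DECLARE @batchsize BIGINT;
DECLARE @i         BIGINT;

SELECT @counter = 0;
SELECT @batchsize = 100;

WHILE (@counter < $(rows))
BEGIN
   BEGIN TRAN;
   SELECT @i = 0;
   WHILE (@i < @batchsize AND @counter < $(rows))
   BEGIN
      INSERT INTO MyBigTable DEFAULT VALUES;
      SELECT @i = @i + 1;
      SELECT @counter = @counter + 1;
   END;
   COMMIT TRAN;
END;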

I'm not saying that all of these things are going to help, but over the next few weeks I'm going to try them all and report on the results. You'll be able to clearly see the effect of changing these on my benchmark, running on production-quality hardware, rather than just taking people's words for it.

I hope you're going to find these benchmarks and experiments useful – I'll be learning (hopefully) as I go along too.

Let me know if there's anything else you'd like to see me try, and if you're following along (I don't want to spend all this time if no-one's reading the series!)

Thanks!