The July edition of TechNet Magazine is available on the web now and has the latest installment of my regular SQL Q&A column.

This month's topics are:

  • Deferred log truncation from concurrent data and log backups
  • Database mirroring monitoring
  • Multiple transaction log files
  • Best use of SSDs in a SQL environment (high-level)

Check it out at http://technet.microsoft.com/en-us/magazine/hh334997.aspx.

It's been a long time since the last blog post on SSD benchmarking - I've been busy! I'm starting up my benchmarking activities again and hope to post more frequently. You can see the whole progression of benchmarking posts here.

You can see my benchmarking hardware setup here, with the addition of the Fusion-io ioDrive Duo 640GB drives that Fusion-io were nice enough to lend me. My test systems now have 16GB each and all tests were performed with the buffer pool ramped up, so memory allocation didn't figure into the performance numbers.

In this recent set of tests I wanted to explore three questions:

  1. What kind of performance gain do I get upgrading from Fusion-io's v1.2 driver to the v2.2 driver?
  2. What is the sweet spot for the number of files on an SSD?
  3. Does a 4Kb block size give any gains in performance for my test?

To keep it simple I'm using one half of the 640GB drive (it's two 320GB drives under the covers). To do this my test harness does the following:

  • Formats the SSDs in one of three ways:
    • Fusion-io basic format (the 320GB drive has 300GB capacity)
    • Fusion-io improved write performance format (the 320GB drive has only 210GB capacity, 70% of normal)
    • Fusion-io maximum write performance format (the 320GB drive has only 151GB capacity, 50% of normal)
  • The SSD format is performed using Fusion-io's ioManager tool
  • Creates 1, 2, 4, 8, 16, 32, 64 or 128 data files, with the file sizes calculated to fill the SSDs
  • My table structure is:

CREATE TABLE MyBigTable (
    c1 UNIQUEIDENTIFIER ROWGUIDCOL DEFAULT NEWID (),
    c2 DATETIME DEFAULT GETDATE (),
    c3 CHAR (111) DEFAULT 'a',
    c4 INT DEFAULT 1,
    c5 INT DEFAULT 2,
    c6 BIGINT DEFAULT 42); 
GO

CREATE CLUSTERED INDEX MyBigTable_cl ON MyBigTable (c1);
GO

  • I have 64 connections each inserting 2.5 million records into the table (with the loop code running server-side) for a total of 160 million records inserted, in batches of 1000 records per transaction. This works out to be about 37 GB of allocated space from the database.

The clustered index on a random GUID is the easiest way to generate random reads and writes, and is a very common design pattern out in the field (even though it performs poorly) - for my purposes it's perfect.

I tested each of the eight data file layouts on the following configurations (all using 1MB partition offsets, 64k NTFS allocation unit size, RAID was not involved):

  • Log and data on a single 320GB SSD with the old v1.2.7 Fusion-io driver (each of the 3 ways of formatting)
  • Log and data on a single 320GB SSD with the new v2.2.3 Fusion-io driver (each of the 3 ways of formatting)
  • Log and data on a single 320GB SSD with the new v2.2.3 Fusion-io driver and a 4Kb block size (each of the 3 ways of formatting)
That's a total of 9 configurations, with 8 data file layouts in each configuration - making 72 separate configurations. I ran each test 5 times and then took an average of the results - so altogether I ran 360 tests, for a cumulative test time of just over 1.43 million seconds (16.5 days) during April.

The test harness takes care of all of this except reformatting the drives, and also captures the wait stats for each test, making note of the most prevalent waits that make up the top 95% of all waits during the test. This uses the script from this blog post.

On to the results... the wait stats are *really* interesting!

Note: the y-axes in the graphs below do not start at zero. All the graphs have the same axes so there is nothing to misunderstand. They are not misleading - and I make no apologies for my choice of axes - I want to show the difference between the various formats more clearly.

v1.2.7 Fusion-io driver and 512-byte sector size

 

The best performance I could get with the old driver was 3580 seconds for test completion, with 4 data files and the Improved Write format - a tiny amount less than the time for 8 data files.

v2.2.3 Fusion-io driver and 512-byte sector size

 

The best performance I could get with the new driver was 2993 seconds for test completion, with 8 data files and the Max Write format - a tiny amount less than the time for other formats for 8 data files, and very close to the times for 4 data files.

On average across all the tests the new v2.2 Fusion-io driver gives a 20.5% performance boost over the old v1.2 driver, and for the regular format, the new v2.2 Fusion-io driver gives a 24% performance boost over the old v1.2 driver. It also (but I didn't measure) reduces the amount of system memory required to use the SSDs. Good stuff!

v2.2.3 Fusion-io driver and 4-Kb sector size

 

The performance using a 4-Kb sector size is roughly the same for my test compared to traditional 512-byte sector size. The most performance gain I got was 3% over using a 512-byte sector size, but on average across all tests the performance was very slightly (0.5%) lower.

Wait Stats

The wait stats were very interesting.

The wait stats are presented in the following format:

WaitType        Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
--------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX  154611.39  128056.51   26554.88  45295507   71.83       0.0034     0.0028    0.0006
PAGELATCH_UP    37948.31   36988.52    959.79    2314370    17.63       0.0164     0.016     0.0004
PAGELATCH_SH    16976      13823.71    3152.3    3751811    7.89        0.0045     0.0037    0.0008

The columns are:

  • WaitType - kind of obvious
  • Wait_S - cumulative wait time in seconds, from a thread being RUNNING, going through SUSPENDED, back to RUNNABLE and then RUNNING again
  • Resource_S - cumulative wait time in seconds while a thread was SUSPENDED (called the resource wait time)
  • Signal_S - cumulative wait time in seconds while a thread was RUNNABLE (i.e. after being signalled that the resource wait has ended and waiting on the runnable queue to get the CPU again - called the signal wait time)
  • WaitCount - number of waits of this type during the test
  • Percentage - percentage of all waits during the test that had this type
  • AvgWait_S - average cumulative wait time in seconds
  • AvgRes_S - average resource wait time in seconds
  • AvgSig_S - average signal wait time in seconds

For a single file, the wait stats for all the various formatting options look like:

WaitType        Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
--------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX  226362.54  193378.56   32983.98  45742117   78.66       0.0049     0.0042    0.0007
PAGELATCH_UP    36701.66   35760.67    940.99    2144533    12.75       0.0171     0.0167    0.0004
PAGELATCH_SH    16775.05   13644.99    3130.06   3542549    5.83        0.0047     0.0039    0.0009

With more files, the percentage of PAGEIOLATCH_EX waits increases, and by the time we get to 8 files, SOS_SCHEDULER_YIELD has started to appear. At 8 files, the wait stats for all the various formatting options look like:

WaitType             Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
-------------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX       244703.77  210859.63   33844.14  45169863   89.47       0.0054     0.0047    0.0007
SOS_SCHEDULER_YIELD  12500.15   0.99        12499.16  823658     4.57        0.0152     0         0.0152
PAGELATCH_SH         5777.54    3749.18     2028.36   476618     2.11        0.0121     0.0079    0.0043

By 16 files, the PAGELATCH waits have disappeared from the top 95%. As the number of files increases to 128, the PAGEIOLATCH_EX waits increase to just over 91% and the wait stats look like this for regular format:

WaitType             Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
-------------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX       304106.32  273671      30435.32  43478489   91.25       0.007      0.0063    0.0007
SOS_SCHEDULER_YIELD  16733      1.35        16731.65  889147     5.02        0.0188     0         0.0188

What does this mean? It's obvious from the wait stats that as I increase the number of data files on the drive, the average resource wait time for each PAGEIOLATCH_EX wait increases from 4.2ms for 1 file up to 6.3ms for 128 files - 50% worse, with the signal wait time static at 0.7ms.

But look at the wait stats for 128 files using the Maximum Write format:

WaitType      Wait_S Resource_S  Signal_s  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
-------------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX      196322.97 166602.88   29720.09  45435521  88.66      0.0043 0.0037   0.0007
SOS_SCHEDULER_YIELD  16182.48 1.1     16181.38  828595  7.31      0.0195 0   0.0195

The average resource wait time for the PAGEIOLATCH_EX waits has dropped from 6.3ms to 3.7ms! But isn't PAGEIOLATCH_EX a wait type that's for a page *read*? Well, yes, but what I think is happening is that the buffer pool is having to force pages out to disk to make space for the pages being read in to be inserted into (which I believe is included in the PAGEIOLATCH_EX wait time) - and when the SSD is formatted with the improved write algorithm, this is faster and so the PAGEIOLATCH_EX resource wait time decreases.

But why the gradual decrease in PAGELATCH waits and increase in SOS_SCHEDULER_YIELD waits as the number of files increases?

I went back and ran a single file test and used the sys.dm_os_waiting_tasks DMV (see this blog post) to see what the various threads are waiting for. Here's some example output:

session_id  wait_duration_ms  wait_type     resource_description
----------  ----------------  ------------  --------------------
79          11                PAGELATCH_UP  5:1:1520544
80          14                PAGELATCH_UP  5:1:1520544
93          16                PAGELATCH_UP  5:1:1520544
94          0                 PAGELATCH_UP  5:1:1520544
101         15                PAGELATCH_UP  5:1:1520544
111         25                PAGELATCH_UP  5:1:1520544
110         17                PAGELATCH_UP  5:1:1520544
75          6                 PAGELATCH_UP  5:1:1520544
78          17                PAGELATCH_UP  5:1:1520544
88          8                 PAGELATCH_UP  5:1:1520544
107         12                PAGELATCH_UP  5:1:1520544
109         25                PAGELATCH_UP  5:1:1520544
113         20                PAGELATCH_UP  5:1:1520544
.
.

Dividing 1520544 by 8088 gives exactly 188. Running the same query a few seconds later gives most of the threads waiting on resource 5:1:1544808, another exact multiple of 8088. These resources are PFS pages! What we're seeing is PFS page contention, just like you can get in tempdb with lots of concurrent threads creating and dropping temp tables. In this case, I have 64 concurrent threads doing inserts that are causing page splits, which requires page allocations. As the number of files increases, the amount of PFS page contention decreases. It disappears after 8 files because I've only got 8 cores, so there can only be 8 threads running at once (one per SQLOS scheduler, with the others SUSPENDED on the waiter list or waiting in the RUNNABLE queue).

From 16 to 128 files, the wait stats hardly change and the performance (in the Improved Write and Max Write formats) only slightly degrades (5%) with each doubling of the number of files. Without deeper investigation, I'm putting this down to increased amounts of metadata to deal with - maybe with more to do when searching the allocation cache for the allocation unit of the clustered index. If I have time I'll dig in and investigate exactly why.

The SOS_SCHEDULER_YIELD waits are just because the threads are able to do more before having to wait, and so they're hitting voluntary yield points in the code - the workload is becoming more CPU bound.

Summary

I've clearly shown that the new Fusion-io driver gives a nice boost compared to the older one - very cool.

I've also shown that the number of files on the SSD does have an affect on performance too - with the sweet spot appearing to be the number of processor cores. I'd love to see someone do similar tests on a 16-way, 32-way or higher (or lend me one to play with :-)

[Edit: I discussed the results with my friend Thomas Kejser on the SQLCAT team and he sees the same behavior on a 64-way running the same benchmark (in fact we screen-shared on a 64-way system with 2TB and 4 640GB Fusion-io cards this weekend). He posted some more investigations on his blog - see here.]

And finally, I showed that for my workload, using a 4-Kb sector size did not improve performance.

I'd call that a successful test - *really* interesting!

Last time I posted about SSDs I presented the findings from sequential inserts with a variety of configurations and basically concluded that SSDs do not provide a substantial gain over SCSI storage (that is not overloaded) - see this blog post for more details.

You can see my benchmarking hardware setup here, with the addition of the Fusion-io ioDrive Duo 640GB drives that Fusion-io were nice enough to lend me. (For the next set of benchmarks I've just upgraded to 16GB of memory and added the second 640GB Fusion-io Duo, for a total of 1.2TB... watch this space!).

In this set of tests I wanted to see how the SSDs behaved for random reads and writes. To do this my test harness does the following:

  • Formats the SSDs in one of three ways:
    • Fusion-io basic format (each 320GB drive has 300GB capacity)
    • Fusion-io improved write performance format (each 320GB drive has only 210GB capacity, 70% of normal)
    • Fusion-io maximum write performance format (each 320GB drive has only 151GB capacity, 50% of normal)
  • The SSD format is performed using Fusion-io's ioManager tool, with their latest publicly-released driver (1.2.7.1).
  • Creates 1, 2, 4, 8, or 16 data files, with the file sizes calculated to fill the SSDs
  • My table structure is:

CREATE TABLE MyBigTable (
    c1 UNIQUEIDENTIFIER ROWGUIDCOL DEFAULT NEWID (),
    c2 DATETIME DEFAULT GETDATE (),
    c3 CHAR (111) DEFAULT 'a',
    c4 INT DEFAULT 1,
    c5 INT DEFAULT 2,
    c6 BIGINT DEFAULT 42); 
GO

CREATE CLUSTERED INDEX MyBigTable_cl ON MyBigTable (c1);
GO

  • I have 16 connections each inserting 2 million records into the table (with the loop code running server-side)

Now before anyone complains, yes, this is a clustered index on a random GUID. It's the easiest way to generate random reads and writes, and is a very common design pattern out in the field (even though it performs poorly) - for my purposes it's perfect.

I tested each of the five data file layouts on the following configurations (all using 1MB partition offsets, 64k NTFS allocation unit size, 128k RAID stripe size - where applicable):

  • Data round-robin between two RAID-10 SCSI (each with 4 x 300GB 15k and one server NIC), log on RAID-10 SATA (8 x 1TB 7.2k)
  • Data on two 320GB SSDs in RAID-0 (each of the 3 ways of formatting), log on RAID-10 SATA (8 x 1TB 7.2k)
  • Log and data on two 320GB SSDs in RAID-0 (each of the 3 ways of formatting)
  • Log and data on two 320GB SSDs in RAID-1 (each of the 3 ways of formatting)  
  • Log and data on single 320GB SSD (each of the 3 ways of formatting)
  • Log and data on separate 320GB SSDs (each of the 3 ways of formatting)
  • Log and data round-robin between two 320GB SSDs (each of the 3 ways of formatting)

That's a total of 19 configurations, with 5 data file layouts in each configuration - making 95 separate configurations. I ran each test 5 times and then took an average of the results - so altogether I ran 475 tests, for a cumulative test time of just less than 250 thousand seconds (2.9 days) at the end of July.

The test harness takes care of all of this except reformatting the drives, and also captures the wait stats for each test, making note of the most prevalent waits that make up the top 95% of all waits during the test. The wait stats will be presented in the following format:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
PAGEIOLATCH_EX                       26833.45       26822.85          10.60      867558      75.88
WRITELOG                              7097.77        6647.26         450.51     3221475       20.0

The columns are:

  • WaitType - kind of obvious
  • Wait_S - cumulative wait time in seconds, from a thread being RUNNING, going through SUSPENDED, back to RUNNABLE and then RUNNING again
  • Resource_S - cumulative wait time in seconds while a thread was SUSPENDED (called the resource wait time)
  • Signal_S - cumulative wait time in seconds while a thread was RUNNABLE (i.e. after being signalled that the resource wait has ended and waiting on the runnable queue to get the CPU again - called the signal wait time)
  • WaitCount - number of waits of this type during the test
  • Percentage - percentage of all waits during the test that had this type

On to the results...

Data on SCSI RAID-10, log on SATA RAID-10

 

Once again this shows what I've shown a few times before - on SCSI having multiple data files on the two RAID arrays gives a performance boost. The two-file case is going from a single RAID array to two RAID arrays - bound to get a performance gain - and it gets a 35% performance boost - 6 times the boost I got from messing around with multiple files for the sequential inserts case last time (see here and here for details).

The best performance I could get from having data on the SCSI arrays was 1595 seconds.

Representative wait stats for a run of this test - one file:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
PAGEIOLATCH_EX                       28993.08       28984.66           8.42      647973      75.53
WRITELOG                              7333.36        6883.82         449.54     3223809      19.10
SLEEP_BPOOL_FLUSH                     1786.18        1781.94           4.24     1147596       4.65

Representative wait stats for a run of this test - two files:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
PAGEIOLATCH_EX                       15306.22       15296.67           9.55      679281      63.87
WRITELOG                              7762.25        7270.79         491.47     3215377      32.39

Representative wait stats for a run of this test - four files:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
PAGEIOLATCH_EX                       26833.45       26822.85          10.60      867558      75.88
WRITELOG                              7097.77        6647.26         450.51     3221475      20.07

Representative wait stats for a run of this test - eight files:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
PAGEIOLATCH_EX                       27556.79       27547.83           8.96      674319      75.09
WRITELOG                              7545.40        7118.93         426.47     3221841      20.56

Representative wait stats for a run of this test - sixteen files:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
PAGEIOLATCH_EX                       37716.72       37705.87          10.85      792189      80.13
WRITELOG                              7150.01        6699.36         450.64     3228609      15.19

These numbers are showing the majority of waits are for data pages to be read into the buffer pool - random reads, and the next most prevalent wait is for log block flushes to complete. The more PAGEIOLATCH_EX waits there are, the worse the performance is.

Data on 640GB RAID-0 SSDs, log on SATA RAID-10

 

Don't let this graph fool you - the top and bottom of the scale are only 30 seconds apart. Basically moving the data files from the SCSI arrays to the RAID-0 SSD got around a 3-5x performance gain, no matter how the SSDs are formatted.

Representative wait stats for a run of this test - one file:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
WRITELOG                              8459.65        7789.91         669.73     3207448      94.48
PAGEIOLATCH_EX                         440.27         392.51          47.77      828420       4.92

Representative wait stats for a run of this test - two, four, eight, or sixteen files:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
WRITELOG                              7957.35        7356.01         601.34     3206855      95.75

The log is the obvious bottleneck in this configuration. 

Data and log on 640GB RAID-0 SSDs

 

And again - high and low values are only 25 seconds apart. Moving log off to the same SSD gave a further 45%-ish improvement across the board, with little difference according to how the SSDs were formatted.

Representative wait stats for a run of this test - any number of files:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
WRITELOG                              2955.69        2184.99         770.69     3203957      89.24
PAGEIOLATCH_EX                         330.11         288.89          41.23      653147       9.97

The percentages fluctuate up and down a few percent depending on write format and number of files, with the maximum write performance format tending to have a slightly higher percentage of WRITELOG waits than the other two formats.

Note that moving the log to the SSD as well as the data files drastically cuts down the number of WRITELOG waits - what we'd expect.

Data and log on single 320GB SSD

 

The performance numbers for having everything on a single 320GB SSD are only a tiny amount slower than those for two 320GB SSDs - which is what I'd expect.

Representative wait stats for a run of this test - one file with basic format or improved write performance format:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
WRITELOG                              2911.22        2121.05         790.17     3204459      81.44
PAGEIOLATCH_EX                         602.11         546.56          55.55      758271      16.84

And for one file with maximum write performance format:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
WRITELOG                              3363.11        2523.63         839.48     3204110      87.54
PAGEIOLATCH_EX                         428.68         406.77          21.92      412081      11.16


You can see that the higher amount of PAGEIOLATCH_EX waits leads to lower overall performance. This makes sense to me.

Data and log on two 320GB RAID-1 SSDs

 

Now, I have an issue with people using SSDs in RAID-0 because it's a single point of failure. In an environment that's going all out on high-availability, if I was using SSDs for performance, depending on the criticality of the data I'd want to at least double-up to RAID-1. For all the various configurations, moving from a single 320GB SSD to two of them in RAID-1 resulted in no more than a 10-15% drop in performance and it's still 3-5x faster than the SCSI setup.

Here's a representative set of wait stats for the entire set of tests:

WaitType                       Wait_S         Resource_S     Signal_S       WaitCount   Percentage
------------------------------ -------------- -------------- -------------- ----------- ----------
WRITELOG                              3949.44        3031.14         918.30     3204694      85.68
PAGEIOLATCH_EX                         608.62         555.98          52.65      692934      13.20

In general the RAID-1 configuration had more waits of both types than the single drive configuration.

Data and log on separate 320GB SSDs

 

Splitting the data and log make for a 5-20% improvement over having everything on a single 320GB SSD.

The wait stats for these configurations show the same trends that we've seen so far - slightly slower performance = slightly more PAGEIOLATCH_EX waits.

Data and log round-robin between separate 320GB SSDs

 

This confused me - the single file case is exactly the same configuration as the test case above, but the results (for each test being run 5 time and then averaged) were almost 10% faster for the first two formats. No significant differences for the other configurations.

The wait stats for these configurations show the same trends that we've seen so far - slightly slower performance = slightly more PAGEIOLATCH_EX waits.

Best-case performance for each number of data files

 

 

 

Well big surprise - the SSDs outperform the SCSI storage for all these tests. The improvement factor varied by the number of data files:

  • 1: SSD was 7.25x faster than SCSI
  • 2: SSD was 4.74x faster than SCSI
  • 4: SSD was 6.81x faster than SCSI
  • 8: SSD was 7.64x faster than SCSI
  • 16: SSD was 9.03x faster than SCSI

The configuration of 4 data files on one SSD and the log on the other SSD, with basic format for both, was the best overall performer, beating the best SCSI configuration (2 data files) by a factor of 4.96.

Summary

Reminder: this test was 32 million inserts with no reads or updates (i.e. no random IO). It is very important to consider the limited scenario being tested and to draw appropriate conclusions.

My conclusions are as follows:

  1. For a random read+write workload, the method of formatting the Fusion-io drives doesn't make much difference. I'd go for the basic format to get the higher capacity, but I'd always to a representative load test to make sure.
  2. For a random read+write workload, the SSDs give at least a 5x performance gain over iSCSI storage
  3. Once again, having multiple data files outperforms having a single data file in most configurations
  4. I can easily correlate IO-subsystem related wait stats to the varying performance of the various configurations

Compared to the sequential insert workload that I benchmarked in the previous set of tests, the random read+write workload makes it worth investigating the investment of moving to SSDs.

Just like last time, these results confirm what I'd heard anecdotally - random operations are the sweet-spot for SSDs.

Brent's playing with the server over the next 4 weeks so I won't be doing any more benchmarking until mid-September at least.

Hope these results are interesting to you!

Back at the start of July I kicked off a survey around your plans for SSDs (see here) and now I present the results to you. There's not much to editorialize here, but the numbers are interesting to see.

 

The "other" answers were (verbatim):

  • 3 x 'have bought and am trying them out'
  • 3 x 'not sure if we need them or not'
  • 2 x 'all production servers are hosted'
  • 1 x 'bought them, tried them..not good enough yet for tempdb'
  • 1 x 'Have some, want more, could you really every have enough?'
  • 1 x 'We get every penny from or spinning media, and have no need for SSD'

The results reflect what I've been hearing when teaching classes and talking to customers/conference attendees over the last six months. People are becoming more interested in SSDs but there's still a lot of wariness about them and of course the whole money issue of being able to buy them. I'm also not surprised (given the general readership demographics of this blog) by the number of people who've analyzed their IOPS requirements and concluded that they don't need SSDs to accomplish that.

 

The "other" answers were (verbatim):

  • 3 x 'not in the budget'
  • 1 x 'I plan to buy expensive drives and throw them at you, paul! love, conor'
  • 1 x 'I'm going to do the same thing Conor will do. Denny'
  • 1 x 'OLAP Scale Out'
  • 1 x 'Use them as cache'
  • 1 x 'Using in an EMC V-MAX SAN to dynamically move high workloads to SSD temporarily'

Ahem - thanks Conor and Denny :-)

Another unsurprising set of results that reflects what I've been hearing. One number I'd be interested in drilling deeper into is answer #3 - are people putting/planning to put tempdb on SSDs because that's what they've heard is the best thing to do, or because tempdb truly is the largest I/O bottleneck that can benefit the most from SSDs? That's a set of experiments I'd like to try out with my Fusion-io drives.

The final "other" answer is also interesting - I was talking to a couple of folks from EMC in Ireland about the V-MAX when we were there earlier this month. Very cool idea to migrate data up and down a set of devices with varying latencies (at the block level, not the file level) - I'd like to see more on how the technology copes with one-off operations like consistency checks or backups - do those IOs affect which layer a block resides in?

Anyway, hope you find these results interesting.

Thanks to all those who responded!

Over the last month we've been teaching in Europe and I haven't had much time to focus on benchmarking, but I've finally finished the first set of tests and analyzed the results.

You can see my benchmarking hardware setup here, with the addition of the Fusion-io ioDrive Duo 640GB drives that Fusion-io were nice enough to lend me.

In this set of tests I wanted to check three things for a sequential write-only workload (i.e. no reads or updates)

  • Best file layout with the hardware I have available
  • Best way to format the SSDs
  • Whether SSDs give a significant performance gain over SCSI storage

The Fusion-io SSDs can be formatted four ways:

  • Regular Windows format
  • Fusion-io's format
  • Fusion-io's improved write performance format
  • Fusion-io's maximum write performance format

I'm using one of the 640GB SSDs in my server, which presents itself as two 320GB drives that I can use individually or tie together in a RAID array. The actual capacity varies depending on how the drives are formatted:

  • With the Windows and normal Fusion-io format, each of the 320GB drives has 300GB capacity
  • With the improved write performance format, each of the 320GB drives has only 210GB capacity, 70% of normal
  • With the maximum write performance format, each of the 320GB drives has only 151GB capacity, 50% of normal

In my tests, I want to determine whether the loss in capacity is worth it in terms of a performance gain. The SSD format is performed using Fusion-io's ioManager tool, with their latest publicly-released driver (1.2.7.1).

My tests involve 16 connections to the server, running server-side code to insert 6.25GB each into a table with a clustered index, one row per page. The database is 160GB with a variety of file layouts:

  • 1 x 160GB file
  • 2 x 80GB files
  • 4 x 40GB files
  • 8 x 20GB files
  • 16 x 10GB files

These drop down to 128/64/32/etc when using a single 320GB drive with the maximum write capacity format. The log file is pre-created at 8GB and does not need to grow during the test.

I tested each of the five data file layouts on the following configurations (all using 1MB partition offsets, 64k NTFS allocation unit size, 128k RAID stripe size - where applicable):

  • Data on RAID-10 SCSI (8 x 300GB 15k), log on RAID-10 SATA (8 x 1TB 7.2k)
  • Data round-robin between two RAID-10 SCSI (each with 4 x 300GB 15k and one server NIC), log on RAID-10 SATA (8 x 1TB 7.2k)
  • Data on two 320GB SSDs in RAID-0 (each of the 4 ways of formatting), log on RAID-10 SATA (8 x 1TB 7.2k)
  • Log and data on two 320GB SSDs in RAID-0 (each of the 4 ways of formatting)
  • Log and data on single 320GB SSD (each of the 4 ways of formatting)
  • Log and data on separate 320GB SSDs (each of the 4 ways of formatting)
  • Log and data round-robin between two 320GB SSDs (each of the 4 ways of formatting)

That's a total of 22 configurations, with 5 data file layouts in each configuration - making 110 separate configurations. I ran each test 5 times and then took an average of the results - so altogether I ran 550 tests, for a cumulative test time of just less than 110 million seconds (12.7 days) over the last 4 weeks.

And yes, I do have a test harness that automates a lot of this so I only had to reconfigure things 22 times manually. And no, for these tests I didn't have wait stats being captured. I've upgraded the test harness and now it captures wait stats for each test - that'll come in my next post.

On to the results... bear in mind that these results are testing a 100GB sequential insert-only workload and are not using the full size of the disks involved!!!

Data on SCSI RAID-10, log on SATA RAID-10

 

I already blogged about these tests here last week. They prove that for this particular workload, multiple data files on the same RAID array does give a performance boost - albeit only 6%.

The best performance I could get from the SCSI/SATA configurations was completing the test in 1755 seconds.

Data and log on 640GB RAID-0 SSDs (Data on 640GB RAID-0 SSDs, log on SATA RAID-10)

 

The performance whether the log file was on SATA or on the SSD was almost identical, so I'm only including one graph, in the interests in making this post a little shorter.

These results clearly show that the SSDs have to be formatted correctly to get any performance out of them. The SSDs performed the same for all data file configurations until performance almost doubles when the number of data files hits 16. I tested 32 and 64 files and didn't get any further increase. My guess here is that I had enough files that when checkpoints or lazywrites occured, the behavior was as if I was doing a random-write workload rather than sequential-write workload.

The best performance I could get here was with 16 files and the maximum-write format when the test completed in 934 seconds, 1.88x faster than the best SCSI time. This is only 13 seconds slower than the normal format which gives 100% more capacity.

Data and log on single 320GB SSD

 

Here the performance truly sucked when the SSD wasn't formatted correctly. Once it was, the performance was roughly the same for 1, 2, or 4 files but degraded by almost 50% with normal formatting for 8 or 16 files. With improved-wait and maximum-write formatting, the performance was the same as for the 640GB RAID-0 SSD array, but the sharp performance increase with 16 files only happened with the maximum-write formatting.

Data and log on separate 320GB SSDs

 

No major difference here - same characteristics as before when formatted correctly, and the best performance coming from maximum-write formatting and 16 data files.

This configuration gave the best overall performance - 909 seconds - 1.93x the bext performance from the SCSI storage.

Data and log round-robin between separate 320GB SSDs

 

No major differences from the previous configuration.

Best-case performance for each number of data files

  

  

 

Clearly the SSDs outperform the SCSI storage for these tests, but not by very much. The improvement factor varied by the number of data files:

  • 1: SSD was 1.11x faster than SCSI
  • 2: SSD was 1.09x faster than SCSI
  • 4: SSD was 1.06x faster than SCSI
  • 8: SSD was 1.04x faster than SCSI
  • 16: SSD was 2.03x faster than SCSI

The configuration of 16 data files on one SSD and the log on the other SSD, with maximum-write format for both, was the best overall performer, beating the best SCSI configuration (8 data files) by a factor of 1.93.

Summary

Reminder: this test was 100GB of sequential inserts with no reads or updates (i.e. no random IO). It is very important to consider the limited scenario being tested and to draw appropriate conclusions

Several things are clear from these tests:

  1. The Fusion-io SSDs do not perform well unless they are formatted with Fusion-io's tool, which takes seconds and is very easy. I don't see this as a downside at all, and it makes sense to me.
  2. For sequential write-only IO workloads, the improved-write and maximum-write SSD formats do not produce a performance gain and so the loss in storage capacity (30% and 50% respectively) is not worth it.
  3. For sequential write-only IO workloads, the SSDs do not provide a substantial gain over SCSI storage (which is not overloaded).

All three of these results were things I'd heard anecdotally and experienced in ad-hoc tests, but now I have the empirical evidence to be able to state them publicly (and now so do you!).

These tests back-up the assertion I've heard over and over that sequential write-only IO workloads are not the best use-case for SSDs.

One very interesting other result came from these tests - moving to 16 data files changed the characteristics of the test to a more random write-only IO workload, and so the maximum-write format produced a massive performance boost - almost twice the performance of the SCSI storage!

The next set of tests is running right now - 64GB of inserts into a clustered index with a GUID key - random reads and writes in a big way. Early results show the SSDs are *hammering* the performance of the SCSI storage - more in a week or so!

Hope you find these results useful and thanks for reading!

It's been a while since I conducted blog surveys so I'm going to do a few through July and August.

First up - I'd like to know what your plans are for purchasing SSDs - do you already have them or are you planning to buy some in the next year?

Secondly - I'd like to know what you are (or will be) doing with them.

I'll report on the results in a couple of weeks. The surveys are completely anonymous. Spread the word so we get a good volume of responses! 

As always, a big Thanks! for contributing to the blog/community by responding. Please shoot me an email (Contact button, bottom left of the blog), or ping me on Twitter (@PaulRandal) if you have an idea for a good survey.

Cheers!

(PS Comments are deliberately disabled.)

Categories:
SSDs | Surveys

In the previous post in the series I introduced SSDs to the mix and examined the relative performance of storing a transaction log on an 8-drive 7.2k SATA array versus a 640-GB SSD configured as a 320-GB RAID-1 array. The transaction log was NOT the I/O bottleneck in the system and so my results showed only a limited 2.5%-3.5% performance gain from substituting the RAID-1 SSD.

Several people pointed out that many people use such an SSD in a RAID-0 configuration instead of RAID-1, so I promised to try the same set of tests using the SSD as a 640-GB RAID-0 array.

Now, the merits of using a single SSD in a RAID-0 configuration are mixed. If used just on its own, there's no redundancy but the performance should be better, so I'd make sure my client understood the risk involved and made sure that for critical data some form of redundancy is added (e.g. synchronous database mirroring).

I re-ran the 100-GB table population tests with the transaction log on the RAID-0 SSD to see what extra performance gain could be had. To be honest, I didn't expect much, as I know the transaction log isn't the limiting factor.

Here are the graphs for the tests. On the top is the SSD configured for RAID-1, on the bottom is the SSD configured for RAID-0. Each test was performed 5 times and the average time used for the graph.

 

 

You can see that in the RAID-0 configuration, there's a higher performance gain - peaking at just over 6%.

The average perf gain for the RAID-1 SSD over the SATA array across all tests was 2.04%.

The average perf gain for the RAID-0 SSD over the SATA array across all tests was 4.25%.

Clearly the RAID-0 SSD outperformed the RAID-1 SSD (and they both outperformed the SATA array in all tests), as we'd expect, but not by much, as again, the transaction log *isn't the bottleneck*. We should see much better gains by altering the data file configuration - so that's what I'll do next.

Thanks for following the series and for all the comments. Please bear in mind that I'm taking things slowly so I won't jump straight to the best configuration - I want to analyze the changes as we go along.

PS I tried increasing the number of SQL Server connections in the test from 16 to 32 as someone suggested in a comment - the performance got 10% worse! Decreasing the number of connections didn't help either.

Well it's been almost 6 weeks since my last benchmarking blog post as I got side-tracked with the Myth-a-Day series and doing real work for clients :-)

In the benchmarking series I've been doing, I've been tuning the hardware we have here at home so I can start running some backup and other tests. In the last post I'd finally figured out how to tune my networking gear so the performance to my iSCSI arrays didn't suck - see Benchmarking: 1-TB table population (part 5: network optimization again).

Now I've decided that instead of using a 1-TB data set, I'm going to use a 100-GB data set so I don't have to wait 6 hours for each test to finish, it also means that I can start to make use of the SSDs that the nice folks at Fusion-io sent me to test out (sorry guys - I did say it would be a while until I got around to it!). See New hardware to play with: Fusion-io SSDs. I've got two 640-GB cards but I can only install one right now as I need the other PCI-E slot for the NIC card so I can continue to make comparisons with the iSCSI gear without compromising network performance. Then I'll move on to some pure iSCSI vs pure SSD tests.

I've been starting to discuss SSDs a lot more in the classes that I teach and there's some debate over where to put an SSD if you can afford one. I've heard many people say that using them for transaction logs and/or tempdb is the best use of them. We all know what the real answer is though, don't we - that's right - IT DEPENDS!

When to consider SSDs?

An SSD effectively removes the variability in latency/seek-time that you experience with traditional disks, and does best for random read/write workloads compared to sequential ones (I'm not going to regurgitate all the specs on the various drives - Google is your friend). Now, for *any* overwhelmed I/O subsystem, putting an SSD in there that removes some of the wait time for each IO is probably going to speed things up and lead to a workload performance boost - but is that the *best* use of an SSD? Or, in fact, is that an appropriate use of hardware budget?

If you have an overwhelmed I/O subsystem, I'd say there are two cases you fall into:

  1. SQL Server is doing more I/Os than it has to because the query workload has not been tuned
    1. You can tune the queries
    2. You can't do anything about the queries but you can add more memory
    3. You can't to anything about the queries and you can't add more memory
  2. The query workload has been tuned, and the I/O subsystem is still overwhelmed.

I'd say that case #2 is when you want to start looking at upgrading the I/O subsystem, whether with SSDs or not.

If you're in case #1, there are three sub-cases:

  1. You can tune the queries
  2. You can't do anything about the queries but you can add more memory
  3. You can't to anything about the queries and you can't add more memory

If you're in #1.1, go tune the queries first before dropping a bunch of money on hardware. Hardware will only get you so far - sooner or later your workload will scale past the capabilities of the hardware and you'll be in the same situation.

If you're in #1.2, consider giving SQL Server more memory to use for its buffer pool so it can hold more of your workload in memory and cut down on I/Os that way. This is going to be cheaper than buying SSDs.

If you're in case #1.3, you're in the same boat as for case #2.

Now you might want to start considering SSDs, as well as more traditional I/O subsystems. But what to put on it? Again, any overwhelmed I/O subsystem will benefit from an SSD, but only if you put it in the right place.

I have some clients who are using SSDs because they can't connect any more storage to their systems and this gives them a lot of flexibility without the complexity of higher-end storage systems. Others are thinking of SSDs as a greener alternative to traditional spinning disks.

The other thing to consider is redundancy. A 640-GB SSD card presents itself as 2 x 320-GB drives when installed in the OS. As you know, I'm big into availability so to start with I'm going to use these as a 320-GB RAID-1 array (not truly redundant as they're both in the same PCI-E slot - but I can't use both PCI-E slots for SSDs yet). I need to make sure that if a drive fails, I don't lose the ability to keep my workload going. This is what many people forget - just because you have an SSD, doesn't mean you should compromise redundancy. Now don't get me wrong, I'm not saying anything here that implies SSDs are any more likely to fail than any other technology - but it's a drive, and prudence dictates that I get some redundancy in there.

The first test: not overwhelmed transaction log

But which of your data and log files to put onto an SSD? It depends on your workload and which portion of your I/O subsystem is overloaded. This is what I'm going to investigate in the next few posts.

For my first set of tests I wanted to see what effect and SSD would have on my workload if the transaction log isn't the bottleneck but I put in on an SSD, as much advice seems to suggest. In other words, is the log *always* the best thing to put on an SSD? I'll do data files in the next test.

This is a narrow test scenario - inserting 100-GB of server-generated data into a single clustered index. The post Benchmarking: 1-TB table population (part 1: the baseline) explains the scripts I'm using. The hardware setup I'm using is explained in Benchmarking hardware setup, with the addition of the PCI-E mounted 640-GB ioDrive Duo.

For the data file, I had it pre-grown to 100GB, on a RAID-10 array with 8 x 15k 300-GB SCSI drives.

For the log file, I had a variety of sizes and auto-growths, both on a RAID-10 array with 8 x 7.2k 1-TB SCSI drives, and on the 320-GB RAID-1 SSD. The log configurations I tested are:

  • Starting size 256MB with 50MB, 256MB, 512MB auto-growth
  • Fixed auto-growth of 256MB, but starting size of 256MB, 512MB , 1GB, 2GB, 4GB, 8GB

In each case I ran five tests and then took the average to use as the result.

Fixed initial size and varying auto-growth

This test case simulates a system where the log starts small and grows to around 6GB, then remains in steady state. Transaction size is optimized for the maximum 60-KB log block size, when the log *must* be flushed to disk, no matter whether a transaction just committed or not. 

The results from the first test are below:

 

At the 50MB growth rate, the SSDs gave me a 5% gain in overall run time (the green line), but this decreased to around 2% when I started to increase the auto-growth size. Remember each data point represents the average of 5 tests.

Varying initialize size with fixed auto-growth

This test is the same as the previous one, but tries to remove the overhead of transaction log growth by increasing the initial size of the log so it reaches steady-state faster.

The result from the second test are below:

 

For initial log sizes up to and including 2-GB, the SSD gives me a 2.5-3.5% gain over the SATA array. This makes sense as the zero initialization that is required by the log whenever it grows will be a little faster on the SSD. Once I hit 4-GB for the initial size they performed almost the same, and at 8-GB, when no autogrowth occured, the SSD was only 0.8% faster than the RAID-10 array.

Summary

What am I trying to show with these tests? For this post I'm not trying to show you wonderful numbers that'll make you want to go out and buy an SSD right away. I wanted to show you first of all that just because I have an SSD in my system, if I put it in the wrong place, I don't get any appreciable performance gain. Clearly the RAID-10 array where my log is was doing just fine when I set the correct log size and growth. You have to think about your I/O workload before making knee-jerk changes to your I/O subsystem.

But one thing I haven't made clear - a single SSD configured in RAID-1 outperformed a RAID-10 array of 8 7.2k SATA drives in every test I did!

For this test, my bottleneck is the RAID-10 array where the data file is. Before I try various configurations of that, I want to try the SSD in a RAID-0 configuration, which most people use, and also see if I can get the RAID-10 array to go any faster by formatting it with a 64K stripe size.

Christmas comes but once a year... really? Then mine just came early on this afternoon's UPS truck.

The very nice folks at Fusion-io just sent me two of their fully-loaded top-of-the-line ioDrive Duos with 640GB of solid-state flash memory in each. This is really extra-nice of them because on Dell's Small Business website they're currently retailing for $12800 *each*. Expensive? Yes. Worth it? That's what I'm hoping to prove.

There's nothing like expensive, pretty hardware to get me excited... here's what they look like:

Now, above I said 'expensive', and these are, but they pack some pretty amazing specs in terms of read/write bandwidth so you're getting a lot of bangs for you bucks. But where does it really make sense to drop the bucks for the biggest bangs? To answer that I'm planning to do a whole series of blog posts as part of my benchmarking efforts to investigate which operations can benefit the most from these drives. With 1.2TB of SSD storage I'll be able to plug these into one of my test systems here and run comparisons against 15k SCSI and 7.2k SATA drives.

Anyway, there's a lot of hype about the speed of SSDs, and also a lot of angst about SSDs not being Enterprise-ready. I don't agree with them not being Enterprise-ready - in fact, fellow-MVP Greg Linwood, who runs (among other things) our partner company SQLskills Australia, already has a bunch of customers with Fusion-io drives deployed in their enterprises successfully. As with any critical hardware infrastructure (especially cutting-edge stuff like this), the key to success is having everything setup correctly so I'll be blogging about all my experiences with them.

To summarize, I'm very excited! I've been wanting to get my hands on some serious SSD hardware for a couple of years now so I can do some *real* testing - it doesn't get better than this!

Shoot me an email or leave a comment if there's something you're interested in seeing tested.

PS Full disclosure: yes, of course Fusion-io sent me these because they're getting publicity from me blogging about them, but we don't have any editorial/veto agreement. I want to be able to recommend these to our enterprise clients and the only way to honestly do that is to play with them myself - so it's a win-win for both of us. And you guys get to test them vicariously through me, so it's a win-win for you too :-)

Stay tuned...

Theme design by Nukeation based on Jelle Druyts