Photos of the USS Silversides WWII submarine


I think this is a fitting blog post for Memorial Day here in the US. Last weekend we had some spare time in Michigan between client visits in Grand Rapids and Houston, so we drove over to Muskegon on the eastern shore of Lake Michigan and visited the Great Lakes Naval Memorial and Museum. It's the home of the USS Silversides, a 311-foot, 1,500-ton submarine that saw extensive service in WWII, sinking 23 enemy ships totaling more than 90,000 tons. You can read an extensive history of the submarine on Wikipedia. The cool thing about the museum is that you can get inside the submarine and go through it at your own pace.

I'm a huge submarine fan, as my Dad spent a long time on submarines (controlling the nuclear reactors) during his 32 years in the Royal Navy, and I grew up very close to the nuclear submarine base at Faslane in Scotland (Helensburgh is very much a Navy town). I spent a lot of time on the base itself over the years and have been on numerous submarines. Dad was very interested when I called him from the deck of the Silversides last weekend, and he's looking forward to these photos.

All photos were taken with a Canon EOS 5D Mark II (shooting RAW) with a Canon EF 24-105mm f/4L IS lens. I then used Lightroom 3 to do some white balancing and also played around with some free Lightroom presets I downloaded from onOne Software. All the photos here use their WOW-Edge 10 effect, which works really well for these photos packed with metal surfaces and the shadowy lighting on the sub. It also adds a vignette effect.

Click each photo for a 1024×768 version.

Below: the forward torpedo room – note the bunks in the room too, making use of all available space.


Below: battle notes in the officers' mess; dedication plaque.


Below: one side of the control room; diving controls.


Below: ballast tank controls; targeting indicator.


Below: engine speed controls (linked to the engine room); hatchways are pretty small for us 6ft+ tall people (although the deckheads were fine).


Below: ballast tank gauges; the other side of the control room.


Below: some crew quarters; engine room. The submarine is driven by electric motors powered from its batteries – the diesel engines just charge the batteries, and the engines are still operable.


Below: engine details.


Below: maneuvering room; aft torpedo room.


Hope you enjoyed these!

Survey: cores vs. MAXDOP

I have a few surveys to editorialize on, but I'd like to kick off another one where you have to run a bit of code and send in the results.

For this one I'm interested in how your systems are configured for parallelism, given the number of processor cores and NUMA nodes they have.

For systems using SQL Server 2005 onwards, the following code will work:

-- Returns the configured MAXDOP for the server, plus the number of NUMA
-- nodes and processor cores that SQL Server is using
SELECT
    [value_in_use] AS [MAXDOP],
    os.*
FROM
    sys.configurations,
    (SELECT
        COUNT (DISTINCT [parent_node_id]) AS [Nodes],
        COUNT (*) AS [Cores]
     FROM
        sys.dm_os_schedulers
     WHERE
        [status] = 'VISIBLE ONLINE') AS os   -- only schedulers running user queries
WHERE
    [name] = 'max degree of parallelism';
GO
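
For illustration – these numbers are made up, not from a real server – on a two-NUMA-node, 16-core box still at the default parallelism setting, the output would have this shape:

MAXDOP  Nodes  Cores
------  -----  -----
0       2      16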

Shoot me the results in email, in a spreadsheet, or as a comment on the blog. Comment #7 below has a PowerShell script from Aaron Nelson (b|t) that runs this against all your servers – thanks!

If you have any info on the general workload/workload mix on the server, that would be helpful too.

Thanks!

Benchmarking: Multiple data files on SSDs (plus the latest Fusion-io driver)

It’s been a long time since the last blog post on SSD benchmarking – I’ve been busy! I’m starting up my benchmarking activities again and hope to post more frequently. You can see the whole progression of benchmarking posts here.

You can see my benchmarking hardware setup here, with the addition of the Fusion-io ioDrive Duo 640GB drives that Fusion-io were nice enough to lend me. My test systems now have 16GB of memory each, and all tests were performed with the buffer pool ramped up, so memory allocation didn’t figure into the performance numbers.

In this recent set of tests I wanted to explore three questions:

  1. What kind of performance gain do I get upgrading from Fusion-io’s v1.2 driver to the v2.2 driver?
  2. What is the sweet spot for the number of files on an SSD?
  3. Does a 4KB sector size give any gains in performance for my test?

To keep it simple I’m using one half of the 640GB drive (it’s two 320GB drives under the covers). To do this my test harness does the following:

  • Formats the SSDs in one of three ways:
    • Fusion-io basic format (the 320GB drive has 300GB capacity)
    • Fusion-io improved write performance format (the 320GB drive has only 210GB capacity, 70% of normal)
    • Fusion-io maximum write performance format (the 320GB drive has only 151GB capacity, 50% of normal)
  • The SSD format is performed using Fusion-io’s ioManager tool
  • Creates 1, 2, 4, 8, 16, 32, 64 or 128 data files, with the file sizes calculated to fill the SSDs
  • My table structure is:

CREATE TABLE MyBigTable (
    c1 UNIQUEIDENTIFIER ROWGUIDCOL DEFAULT NEWID (),
    c2 DATETIME DEFAULT GETDATE (),
    c3 CHAR (111) DEFAULT 'a',
    c4 INT DEFAULT 1,
    c5 INT DEFAULT 2,
    c6 BIGINT DEFAULT 42);
GO

CREATE CLUSTERED INDEX MyBigTable_cl ON MyBigTable (c1);
GO

  • I have 64 connections each inserting 2.5 million records into the table (with the loop code running server-side – see the sketch below) for a total of 160 million records inserted, in batches of 1,000 records per transaction. This works out to be about 37GB of allocated space for the database.
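
The harness code itself isn’t in this post, but to make the workload concrete, here’s a minimal sketch of the two pieces under my own assumptions – the table name is real, while the database name, file paths, file sizes, and variable names are illustrative:

-- Illustration only: a two-data-file layout; the harness scales this pattern
-- up to 128 files, with file sizes calculated to fill the SSD
CREATE DATABASE [SSDTest] ON PRIMARY
    (NAME = N'SSDTest_1', FILENAME = N'S:\SSDTest_1.mdf', SIZE = 140GB),
    (NAME = N'SSDTest_2', FILENAME = N'S:\SSDTest_2.ndf', SIZE = 140GB)
LOG ON
    (NAME = N'SSDTest_log', FILENAME = N'S:\SSDTest_log.ldf', SIZE = 8GB);
GO

-- Illustration only: the per-connection insert loop – 2,500 batches of
-- 1,000 single-row inserts, each batch wrapped in its own transaction
SET NOCOUNT ON;

DECLARE @Batch INT = 0, @Row INT;
WHILE @Batch < 2500
BEGIN
    BEGIN TRANSACTION;
    SET @Row = 0;
    WHILE @Row < 1000
    BEGIN
        INSERT INTO MyBigTable DEFAULT VALUES;   -- every column has a default
        SET @Row = @Row + 1;
    END;
    COMMIT TRANSACTION;
    SET @Batch = @Batch + 1;
END;
GO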

The clustered index on a random GUID is the easiest way to generate random reads and writes, and is a very common design pattern out in the field (even though it performs poorly) – for my purposes it’s perfect.

I tested each of the eight data file layouts on the following configurations (all using 1MB partition offsets and a 64KB NTFS allocation unit size; RAID was not involved):

  • Log and data on a single 320GB SSD with the old v1.2.7 Fusion-io driver (each of the 3 ways of formatting)
  • Log and data on a single 320GB SSD with the new v2.2.3 Fusion-io driver (each of the 3 ways of formatting)
  • Log and data on a single 320GB SSD with the new v2.2.3 Fusion-io driver and a 4KB sector size (each of the 3 ways of formatting)

That’s a total of 9 configurations, with 8 data file layouts in each – making 72 separate combinations. I ran each test 5 times and took an average of the results – so altogether I ran 360 tests, for a cumulative test time of just over 1.43 million seconds (16.5 days) during April.

The test harness takes care of all of this except reformatting the drives, and also captures the wait stats for each test, making note of the most prevalent waits that make up the top 95% of all waits during the test. This uses the script from this blog post.
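
The linked post has the full wait stats script; the core idea is just to clear the cumulative wait stats before each test and report them afterwards, something like this sketch (the real script also filters out benign wait types and computes the 95% cutoff):

-- Clear the cumulative wait stats before the test starts
DBCC SQLPERF (N'sys.dm_os_wait_stats', CLEAR);
GO

-- ...run the test...

-- Then report each wait type's totals and its share of all wait time
SELECT
    [wait_type] AS [WaitType],
    [wait_time_ms] / 1000.0 AS [Wait_S],
    ([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [Resource_S],
    [signal_wait_time_ms] / 1000.0 AS [Signal_S],
    [waiting_tasks_count] AS [WaitCount],
    100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER () AS [Percentage]
FROM sys.dm_os_wait_stats
ORDER BY [Percentage] DESC;
GO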

On to the results… the wait stats are *really* interesting!

Note: the y-axes in the graphs below do not start at zero. All the graphs have the same axes so there is nothing to misunderstand. They are not misleading – and I make no apologies for my choice of axes – I want to show the difference between the various formats more clearly.

v1.2.7 Fusion-io driver and 512-byte sector size

[Graph: test duration in seconds (y-axis) against number of data files, one line per SSD format]

The best performance I could get with the old driver was 3580 seconds for test completion, with 4 data files and the Improved Write format – a tiny amount less than the time for 8 data files.

v2.2.3 Fusion-io driver and 512-byte sector size

[Graph: test duration in seconds (y-axis) against number of data files, one line per SSD format]

The best performance I could get with the new driver was 2993 seconds for test completion, with 8 data files and the Max Write format – a tiny amount less than the time for other formats for 8 data files, and very close to the times for 4 data files.

On average across all the tests, the new v2.2 Fusion-io driver gives a 20.5% performance boost over the old v1.2 driver, and for the regular format the boost is 24%. The new driver also reduces the amount of system memory required to use the SSDs (although I didn’t measure that). Good stuff!

v2.2.3 Fusion-io driver and 4KB sector size

[Graph: test duration in seconds (y-axis) against number of data files, one line per SSD format]

The performance using a 4KB sector size was roughly the same in my test as with the traditional 512-byte sector size. The largest gain I saw was 3% over the 512-byte sector size, but on average across all tests the performance was very slightly (0.5%) lower.

Wait Stats

The wait stats were very interesting.

The wait stats are presented in the following format:

WaitType        Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
--------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX  154611.39  128056.51   26554.88  45295507   71.83       0.0034     0.0028    0.0006
PAGELATCH_UP    37948.31   36988.52    959.79    2314370    17.63       0.0164     0.016     0.0004
PAGELATCH_SH    16976      13823.71    3152.3    3751811    7.89        0.0045     0.0037    0.0008

The columns are:

  • WaitType – kind of obvious
  • Wait_S – cumulative wait time in seconds, from a thread being RUNNING, going through SUSPENDED, back to RUNNABLE and then RUNNING again
  • Resource_S – cumulative wait time in seconds while a thread was SUSPENDED (called the resource wait time)
  • Signal_S – cumulative wait time in seconds while a thread was RUNNABLE (i.e. after being signalled that the resource wait has ended and waiting on the runnable queue to get the CPU again – called the signal wait time)
  • WaitCount – number of waits of this type during the test
  • Percentage – percentage of all waits during the test that had this type
  • AvgWait_S – average cumulative wait time in seconds
  • AvgRes_S – average resource wait time in seconds
  • AvgSig_S – average signal wait time in seconds
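
These columns are related: each wait’s total time is its resource wait plus its signal wait, and the averages are just the totals divided by the wait count. Checking against the PAGEIOLATCH_EX row above:

Wait_S    = Resource_S + Signal_S      154611.39 = 128056.51 + 26554.88
AvgWait_S = Wait_S / WaitCount         0.0034    ≈ 154611.39 / 45295507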

For a single file, the wait stats for all the various formatting options look like:

WaitType        Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
--------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX  226362.54  193378.56   32983.98  45742117   78.66       0.0049     0.0042    0.0007
PAGELATCH_UP    36701.66   35760.67    940.99    2144533    12.75       0.0171     0.0167    0.0004
PAGELATCH_SH    16775.05   13644.99    3130.06   3542549    5.83        0.0047     0.0039    0.0009

With more files, the percentage of PAGEIOLATCH_EX waits increases, and by the time we get to 8 files, SOS_SCHEDULER_YIELD has started to appear. At 8 files, the wait stats for all the various formatting options look like:

WaitType             Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
-------------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX       244703.77  210859.63   33844.14  45169863   89.47       0.0054     0.0047    0.0007
SOS_SCHEDULER_YIELD  12500.15   0.99        12499.16  823658     4.57        0.0152     0         0.0152
PAGELATCH_SH         5777.54    3749.18     2028.36   476618     2.11        0.0121     0.0079    0.0043

By 16 files, the PAGELATCH waits have disappeared from the top 95%. As the number of files increases to 128, the PAGEIOLATCH_EX waits increase to just over 91% and the wait stats look like this for regular format:

WaitType             Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
-------------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX       304106.32  273671      30435.32  43478489   91.25       0.007      0.0063    0.0007
SOS_SCHEDULER_YIELD  16733      1.35        16731.65  889147     5.02        0.0188     0         0.0188

What does this mean? It’s obvious from the wait stats that as I increase the number of data files on the drive, the average resource wait time for each PAGEIOLATCH_EX wait increases from 4.2ms for 1 file up to 6.3ms for 128 files – 50% worse, with the signal wait time static at 0.7ms.

But look at the wait stats for 128 files using the Maximum Write format:

WaitType             Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
-------------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX       196322.97  166602.88   29720.09  45435521   88.66       0.0043     0.0037    0.0007
SOS_SCHEDULER_YIELD  16182.48   1.1         16181.38  828595     7.31        0.0195     0         0.0195

The average resource wait time for the PAGEIOLATCH_EX waits has dropped from 6.3ms to 3.7ms! But isn’t PAGEIOLATCH_EX a wait type for a page *read*? Well, yes, but what I think is happening is that the buffer pool is having to force pages out to disk to make space for the pages being read in (which I believe is included in the PAGEIOLATCH_EX wait time) – and when the SSD is formatted with the improved write algorithm, those forced writes are faster, so the PAGEIOLATCH_EX resource wait time decreases.

But why the gradual decrease in PAGELATCH waits and increase in SOS_SCHEDULER_YIELD waits as the number of files increases?

I went back and ran a single-file test and used the sys.dm_os_waiting_tasks DMV (see this blog post) to see what the various threads were waiting for.
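
The linked post has the full script; a minimal sketch of the query, pulling out just the interesting columns, would be something like:

-- Minimal sketch: what is each waiting user thread waiting for right now?
SELECT
    [session_id],
    [wait_duration_ms],
    [wait_type],
    [resource_description]
FROM sys.dm_os_waiting_tasks
WHERE [session_id] > 50      -- skip system sessions
ORDER BY [session_id];
GO

Here’s some example output: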

session_id  wait_duration_ms  wait_type     resource_description
----------  ----------------  ------------  --------------------
79          11                PAGELATCH_UP  5:1:1520544
80          14                PAGELATCH_UP  5:1:1520544
93          16                PAGELATCH_UP  5:1:1520544
94          0                 PAGELATCH_UP  5:1:1520544
101         15                PAGELATCH_UP  5:1:1520544
111         25                PAGELATCH_UP  5:1:1520544
110         17                PAGELATCH_UP  5:1:1520544
75          6                 PAGELATCH_UP  5:1:1520544
78          17                PAGELATCH_UP  5:1:1520544
88          8                 PAGELATCH_UP  5:1:1520544
107         12                PAGELATCH_UP  5:1:1520544
109         25                PAGELATCH_UP  5:1:1520544
113         20                PAGELATCH_UP  5:1:1520544
.
.

Dividing 1520544 by 8088 gives exactly 188. Running the same query a few seconds later gives most of the threads waiting on resource 5:1:1544808, another exact multiple of 8088. These resources are PFS pages! What we’re seeing is PFS page contention, just like you can get in tempdb with lots of concurrent threads creating and dropping temp tables. In this case, I have 64 concurrent threads doing inserts that are causing page splits, which requires page allocations. As the number of files increases, the amount of PFS page contention decreases. It disappears after 8 files because I’ve only got 8 cores, so there can only be 8 threads running at once (one per SQLOS scheduler, with the others SUSPENDED on the waiter list or waiting in the RUNNABLE queue).
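
A PFS page is page 0 of every data file and then every 8088th page after that, so a quick way to check whether a contended resource like 5:1:1520544 is a PFS page is to see whether the page ID divides evenly by 8088:

-- Is page 1520544 a PFS page? PFS pages are page 0 and every 8088th page after it.
SELECT
    1520544 % 8088 AS [Remainder],     -- 0 means it is a PFS page
    1520544 / 8088 AS [PFSInterval];   -- the 188th PFS interval in the file
GO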

From 16 to 128 files, the wait stats hardly change and the performance (in the Improved Write and Max Write formats) only slightly degrades (5%) with each doubling of the number of files. Without deeper investigation, I’m putting this down to increased amounts of metadata to deal with – maybe with more to do when searching the allocation cache for the allocation unit of the clustered index. If I have time I’ll dig in and investigate exactly why.

The SOS_SCHEDULER_YIELD waits are just because the threads are able to do more before having to wait, and so they’re hitting voluntary yield points in the code – the workload is becoming more CPU bound.

Summary

I’ve clearly shown that the new Fusion-io driver gives a nice boost compared to the older one – very cool.

I’ve also shown that the number of data files on the SSD has an effect on performance too – with the sweet spot appearing to be the number of processor cores (just an observation, definitely NOT a recommendation). I’d love to see someone do similar tests on a 16-way, 32-way, or higher system (or lend me one to play with :-)

[Edit: I discussed the results with my friend Thomas Kejser on the SQLCAT team and he sees the same behavior on a 64-way system running the same benchmark (in fact, we screen-shared on a 64-way system with 2TB of memory and four 640GB Fusion-io cards this weekend). He posted some more investigations on his blog – see here.]

And finally, I showed that for my workload, using a 4KB sector size did not improve performance.

I’d call that a successful test – *really* interesting!