Australia classes in Sydney and Canberra in December

I’ve just released our final classes of the year for registration!

We will be coming to Sydney and Canberra in Australia and teaching our new IEPTO1: Immersion Event on Performance Tuning and Optimization – Part 1:

  • From the curriculum: As well as optimization techniques, this course will also help with design and architecture so you can prevent performance and scalability problems from happening. The cores of this class are comprehensive coverage of indexing and statistics strategies: a SQL Server workload will not perform well unless these are designed, implemented, and tuned correctly. You will also learn why and how to optimize transaction log operations, tempdb and data file configuration, transactions and isolation levels, and locking and blocking.

Our classes in Australia will make use of the excellent Cliftons facilities in both cities, and the price includes breakfast, lunch, and coffee/tea/snacks through the day.

The dates of the classes are:

  • Sydney, NSW, December 8-12
  • Canberra, ACT, December 15-19

The class cost is US$3,995 plus GST, with an early-bird price of US$3,495 plus GST (cut-off dates vary by class).

For both classes, you can register for US$3,195 plus GST through July 31st only – SUPER-early-bird price!

You can get all the logistical, registration, and curriculum details by drilling down from our main schedule page.

We hope to see you there!

July pricing special for new October classes

To celebrate our newly revamped IE1 and IE2 classes becoming IEPTO1 and IEPTO2, through the month of July you can register for either October class in Chicago at the super-early-bird price of $2,995. This price is usually reserved just for our alumni. At the end of July, the price will revert to the normal early-bird price of $3,295.

You can get all the logistical, registration, and curriculum details by drilling down from our main schedule page.

Hope to see you there!

PS The classes in Australia in December will be published on July 1st next week!

Revamped IE1 and IE2 classes open for registration in October

I’ve just released our final US classes this year for registration!

We’ve revamped our IE1 and IE2 classes so they’re now both focused on performance and together they form a comprehensive, 10-day performance tuning and optimization course.

  • IE1 is now IEPTO1: Immersion Event on Performance Tuning and Optimization – Part 1 (see here for the revamped curriculum)
    • From the curriculum: As well as optimization techniques, this course will also help with design and architecture so you can prevent performance and scalability problems from happening. The cores of this class are comprehensive coverage of indexing and statistics strategies: a SQL Server workload will not perform well unless these are designed, implemented, and tuned correctly. You will also learn why and how to optimize transaction log operations, tempdb and data file configuration, transactions and isolation levels, and locking and blocking.
  • IE2 is now IEPTO2: Immersion Event on Performance Tuning and Optimization – Part 2 (see here for the revamped curriculum)
    • From the curriculum: The core of this class is understanding resource usage and we will cover in-depth all the areas of concern for a SQL Server workload: I/O, CPU usage, memory usage, query plans, statement execution, parameter sniffing and procedural code, deadlocking, and the plan cache. You will learn how to use specific tools and techniques for analyzing SQL Server: creating and using performance baselines, benchmarking tools, wait and latch statistics, Extended Events, DMVs, and PerfMon. These techniques will be highly adaptable to whatever situation you encounter and you will understand not just how to capture performance data but also how to interpret it, so you can derive answers to your own performance problems rather than relying on someone giving you the answer.

As before, these classes both stand alone perfectly well, but we strongly recommend taking IEPTO1 before IEPTO2 as all its knowledge is assumed in IEPTO2.

If you’ve taken IE1 previously, you don’t need to go back and take IEPTO1. If you’ve already taken IE1, IEPTO2 is the next course for you. If you’ve already taken IE2 but not IE1, we recommend you take IEPTO1.

Our classes in October will be in Chicago, at our usual hotel in Oakbrook Terrace:

  • IE0: Immersion Event for Junior/Accidental DBAs
    • October 6-8
  • IEPTO1: Immersion Event on P.T.O. – Part 1
    • October 6-10 – special super-early-bird pricing through July!
  • IEPTO2: Immersion Event on P.T.O. – Part 2
    • October 13-17 – special super-early-bird pricing through July!

You can get all the logistical, registration, and curriculum details by drilling down from our main schedule page.

We hope to see you there!

Are mixed pages removed by an index rebuild?

This is a question that came up this morning during our IE1 class that I thought would make an interesting blog post as there are some twists to the answer.

The first 8 pages that are allocated to an allocation unit are mixed pages from mixed extents, unless trace flag 1118 is enabled.
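For reference, here is a minimal sketch of checking and enabling trace flag 1118 instance-wide. It assumes you have sysadmin rights. The -1 argument makes the flag global, and a flag enabled this way does not survive an instance restart unless it is also set as a -T startup parameter.

```sql
-- Report whether trace flag 1118 is currently enabled globally
DBCC TRACESTATUS (1118, -1);
GO

-- Enable trace flag 1118 globally so that new allocations use dedicated extents
DBCC TRACEON (1118, -1);
GO
```

Enabling the flag only affects new allocations. Mixed pages that are already allocated stay where they are.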

See the following blog posts for more info:

Assuming that mixed pages are not disabled with trace flag 1118, does an index rebuild remove all mixed pages or not?

Let’s investigate. First I’ll create a clustered index with 1,000 data pages:

CREATE TABLE [MixedTest] ([c1] BIGINT IDENTITY, [c2] CHAR (8000) DEFAULT 'a');
CREATE CLUSTERED INDEX [MixedTest_CL] ON [MixedTest] ([c1]);
SET NOCOUNT ON;
GO
INSERT INTO [MixedTest] DEFAULT VALUES;
GO 1000

And then make sure that we have mixed pages by examining the first IAM page in the clustered index’s IAM chain. You can get the sp_AllocationMetadata proc here.

EXEC [sp_AllocationMetadata] N'MixedTest';
GO
Object Name   Index ID   Alloc Unit ID       Alloc Unit Type   First Page   Root Page   First IAM Page
------------  ---------  ------------------  ----------------  -----------  ----------  ---------------
MixedTest     1          72057594046185472   IN_ROW_DATA       (1:987)      (1:1732)    (1:988)
DBCC TRACEON (3604);
DBCC PAGE (N'master', 1, 988, 3);
GO

(I’m just including the relevant portion of the DBCC PAGE output here…)

<snip>
IAM: Single Page Allocations @0x00000000227EA08E

Slot 0 = (1:987)                    Slot 1 = (1:989)                    Slot 2 = (1:990)
Slot 3 = (1:991)                    Slot 4 = (1:1816)                   Slot 5 = (1:1817)
Slot 6 = (1:1818)                   Slot 7 = (1:1819)
<snip>

Now I’ll do an offline index rebuild of the clustered index, and look again at the IAM page contents (assume I’m running the proc and DBCC PAGE after the rebuild):

ALTER INDEX [MixedTest_CL] ON [MixedTest] REBUILD;
GO
<snip>
IAM: Single Page Allocations @0x0000000023B0A08E

Slot 0 = (1:1820)                   Slot 1 = (1:446)                    Slot 2 = (1:1032)
Slot 3 = (0:0)                      Slot 4 = (1:1035)                   Slot 5 = (1:1034)
Slot 6 = (1:1037)                   Slot 7 = (1:1036)
<snip>

So the answer is no, an index rebuild does not remove mixed page allocations. Only trace flag 1118 does that.

But this is interesting – there are only 7 mixed pages in the single-page slot array above. What happened? The answer is that the offline index rebuild ran in parallel, with each thread building a partial index, and then these are stitched together. The ‘stitching together’ operation will cause some of the non-leaf index pages to be deallocated as their contents are merged together. This explains the deallocated page that was originally tracked by slot 3 in the slot array.

Let’s try an offline index rebuild that forces a serial plan.

ALTER INDEX [MixedTest_CL] ON [MixedTest] REBUILD WITH (MAXDOP = 1);
GO
<snip>
IAM: Single Page Allocations @0x0000000023B0A08E

Slot 0 = (1:1822)                   Slot 1 = (1:1823)                   Slot 2 = (1:291)
Slot 3 = (1:292)                    Slot 4 = (0:0)                      Slot 5 = (0:0)
Slot 6 = (0:0)                      Slot 7 = (0:0)
<snip>

In this case there is only one index (i.e. no parallel mini indexes) being built so there are no pages being deallocated in the new index as there is no stitching operation. But why aren’t there 8 mixed pages? This is because during the build phase of the new index, the leaf-level pages are taken from bulk-allocated dedicated extents, regardless of the recovery model in use. The mixed pages are non-leaf index pages (which you can prove to yourself using DBCC PAGE).
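As a sketch of that DBCC PAGE check, you can dump just the page header for one of the pages in the slot array and look at the m_type and m_level fields. The page ID below is slot 0 from the serial-rebuild output above, and N'master' is simply the database name used in the earlier example, so substitute your own values. An m_type of 2 is an index page (1 is a data page), and an m_level greater than 0 means the page is above the leaf level.

```sql
-- Send DBCC output to the client connection instead of the error log
DBCC TRACEON (3604);
GO

-- Page (1:1822) is slot 0 from the slot array above;
-- print option 0 dumps only the page header
DBCC PAGE (N'master', 1, 1822, 0);
GO
```

Repeating this for each non-zero slot should show whether every mixed page is a non-leaf index page.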

For parallel and single-threaded online index operations, the same two patterns occur as for offline index rebuilds, even though the algorithm is slightly different.

Enjoy!

Online index rebuild corruption bug in SQL Server 2012 SP1

This is a quick blog post to let you know about a bug in SQL Server 2012 SP1 that can cause data loss when performing index maintenance.

The data loss issue can happen in some circumstances when you do a parallel online rebuild of a clustered index while there are concurrent data modifications happening on the table AND you also hit a deadlock and another error. Nasty when it occurs, but that should hopefully be a rare combination.

The workaround is to limit the online rebuild operation to be single threaded using the WITH (MAXDOP = 1) option.
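A minimal sketch of the workaround is below. The table and index names are hypothetical placeholders, so substitute your own.

```sql
-- Hypothetical table and index names, for illustration only.
-- MAXDOP = 1 forces a serial plan for the online rebuild.
ALTER INDEX [MyTable_CL] ON [MyTable]
REBUILD WITH (ONLINE = ON, MAXDOP = 1);
GO
```

A single-threaded rebuild will usually take longer than a parallel one, so allow for that in your maintenance window.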

There is a hotfix available – see KB 2969896 for more details.

Depending on which build you are on of SQL Server 2012 or 2014, the best option for you will vary. See my friend Aaron Bertrand’s post for comprehensive details.
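If you are not sure which build you are running, this quick sketch returns the version, service pack level, and edition of the instance, which you can then compare against the builds listed in the KB article:

```sql
-- Build number, service pack level, and edition of the current instance
SELECT
    SERVERPROPERTY ('ProductVersion') AS [Build],
    SERVERPROPERTY ('ProductLevel') AS [Level],
    SERVERPROPERTY ('Edition') AS [Edition];
GO
```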

Target and actual SQL Server uptime survey results

Exactly five years ago I published survey results showing target uptime SLAs and actual uptime measurements. I re-ran the survey a few weeks ago to see what’s changed, if anything, in the space of five years, and here are the results.

24×7 Systems

[Chart: 24×7 systems – target uptime]

[Chart: 24×7 systems – actual uptime]

Other responses:

  • 1 x 99.95%

Non-24×7 Systems

[Chart: Non-24×7 systems – target uptime]

Other responses:

  • 7 x “No target or target unknown”
  • 1 x “0830 – 1730 M-Sat”

[Chart: Non-24×7 systems – actual uptime]

Other values:

  • 1 x “n/a”

Summary

Well, the good thing is that this survey had almost twice the number of respondents as the 2009 survey, but that could just be that a lot more people read my blog now than five years ago.

My takeaway from the data is that nothing has really changed over the last five years. Given the really low response rate to the survey (when I usually get more than 200-300 responses for a typical survey), my inference is that the majority of you out there don’t have well-defined uptime targets (or recovery time objective service level agreements, RTO SLAs, or whatever you want to call them) and so didn’t respond to the survey. The same thing happens when surveying something like backup testing frequency – where you *know* you’re supposed to do it, but don’t do it enough so feel guilty and don’t respond to the survey.

For those of you that responded, or didn’t respond and do have targets, well done! For those of you that don’t have targets, I don’t blame you, I blame the environment you’re in. From talking to a bunch of you, most DBAs I know that *want* to do something about HA/DR are prevented from doing so by their management not placing enough importance on the subject. This is also shown by the demand for our various in-person training classes: IE2 on Performance Tuning is usually over-subscribed even though it runs 3-4 times per year, but IE3 on HA/DR has only sold out once even though we generally run it only once per year.

Performance is the number one thing on the collective minds of most I.T. management, not HA/DR planning, and that’s just wrong. Business continuity is so crucial, especially in this day and age of close competition where being down can cause fickle customers to move to a different store/service provider.

If you’re reading this and you know you don’t have well-defined uptime targets then I strongly encourage you to raise the issue with your management, as it’s likely that your entire HA/DR strategy is lacking too. For more information, you can read the results post from the survey five years ago (Importance of defining and measuring SLAs).

Don’t wait until disaster strikes to make HA/DR a priority.

Most common wait stats over 24 hours and changes since 2010

Back in February I kicked off a survey asking you to run code that created a 24-hour snapshot of the most prevalent wait statistics. It’s taken me a few months to provide detailed feedback to everyone who responded and to correlate all the information together. Thanks to everyone who responded!

I did this survey because I wanted to see how the results had changed since my initial wait statistics survey back in 2010.

The results are interesting!

2010 Survey Results

Results from 1823 servers, top wait type since server last restarted (or waits cleared). The blog post for this survey (Wait statistics, or please tell me where it hurts) has a ton of information about what these common wait types mean, and I’m not going to repeat all that in this blog post.

[Chart: 2010 survey – top wait types]

2014 Survey Results

Results from 1708 servers, top wait type over 24 hours

[Chart: 2014 survey – top wait types over 24 hours]

The distribution of the top waits has changed significantly over the last four years, even when taking into account that in the 2010 survey I didn’t filter out BROKER_RECEIVE_WAITFOR.

  • CXPACKET is still the top wait type, which is unsurprising
  • OLEDB has increased to being the top wait type roughly 17% of the time compared to roughly 4% in 2010
  • WRITELOG has increased to being the top wait 10% of the time compared with 6% in 2010
  • ASYNC_NETWORK_IO has decreased to being the top wait 8% of the time compared with 15% in 2010
  • PAGEIOLATCH_XX has decreased to being the top wait 7% of the time compared with 18% in 2010

These percentages remain the same even when I ignore the BROKER_RECEIVE_WAITFOR waits in the 2010 results.

Now I’m going to speculate as to what could have caused the change in results. I have no evidence that supports most of what I’m saying below, just gut feel and supposition – you might disagree. Also, even though the people reading my blog and responding to my surveys are likely to be paying more attention to performance and performance tuning than the general population of people managing SQL Server instances across the world, I think that these results are representative of what’s happening on SQL Server instances across the world.

I think that OLEDB waits have increased in general due to more and more people using 3rd-party performance monitoring tools that make extensive, repeated use of DMVs. Most DMVs are implemented as OLE-DB rowsets and will cause many tiny OLEDB waits (1-2 milliseconds on average, or smaller). This hypothesis is actually borne out by the data I received and confirmation from many people who received my detailed analyses of results they sent me. If you see hundreds of millions or billions of tiny OLEDB waits, this is likely the cause.
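To check whether this pattern applies to your instance, a minimal sketch is to look at the count and average duration of OLEDB waits since the wait statistics were last cleared. The column alias is made up for illustration.

```sql
-- Number of OLEDB waits and their average duration since the
-- wait statistics were last cleared (or the instance restarted)
SELECT
    [wait_type],
    [waiting_tasks_count],
    [wait_time_ms],
    CAST ([wait_time_ms] * 1.0 / NULLIF ([waiting_tasks_count], 0)
        AS DECIMAL (16, 4)) AS [AvgWait_ms]
FROM sys.dm_os_wait_stats
WHERE [wait_type] = N'OLEDB';
GO
```

A very high waiting_tasks_count combined with an average of a millisecond or two (or less) matches the monitoring-tool pattern described above.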

I think WRITELOG waits being the top wait have increased partly because other bottlenecks have become less prevalent, and so the next highest bottleneck is the transaction log, and partly because more workloads are hitting logging bottlenecks inside SQL Server that are alleviated starting in SQL Server 2012 (blog post coming next week!). I also think that WRITELOG waits have been prevented from becoming even more prevalent because of the increased use of solid-state disks for transaction log storage mitigating the increased logging from higher workloads.

Now it could be that the drop in PAGEIOLATCH_XX and ASYNC_NETWORK_IO waits being the top wait is just an effect caused by the increase in OLEDB and WRITELOG waits. It could also be because of environmental changes…

PAGEIOLATCH_XX waits being the top wait might have decreased because of:

  • Increased memory on servers meaning that buffer pools are larger and more of the workload fits in memory, so fewer read I/Os are necessary.
  • Increased usage of solid-state disks meaning that individual I/Os are faster, so when I/Os do occur, the PAGEIOLATCH_XX wait time is smaller and so the aggregate wait time is smaller and it is no longer the top wait.
  • More attention being paid to indexing strategies and buffer pool usage.

ASYNC_NETWORK_IO waits being the top wait might have decreased because of fewer poorly written applications, or fixes to applications that previously were poorly written. This supposition is the most tenuous of the four and I really have no evidence for this at all. I suspect it’s more likely the change is an effect of the changes in prevalence of the other wait types discussed above.

Summary

I think it’s interesting how the distribution of top waits has changed over the last four years and I hope my speculation above rings true with many of you. I’d love to hear your thoughts on all of this in the post comments.

It’s not necessarily bad to have any particular wait type as the most prevalent one in your environment, as waits always happen, so there has to be *something* that’s the top wait on your system. What’s useful though is to trend your wait statistics over time and notice how code/workload/server/schema changes are reflected in the distribution of wait statistics.
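One simple way to trend wait statistics is to snapshot sys.dm_os_wait_stats on a schedule and compare consecutive snapshots. The sketch below is an illustration only, not the code used for the survey. The table name [WaitStatsSnapshot] and the column aliases are hypothetical, and it does not filter out benign wait types.

```sql
-- Hypothetical table to hold periodic snapshots
CREATE TABLE [WaitStatsSnapshot] (
    [capture_time] DATETIME2 NOT NULL,
    [wait_type] NVARCHAR (60) NOT NULL,
    [waiting_tasks_count] BIGINT NOT NULL,
    [wait_time_ms] BIGINT NOT NULL,
    [signal_wait_time_ms] BIGINT NOT NULL);
GO

-- Run this batch on a schedule (e.g. from an Agent job)
DECLARE @now DATETIME2 = SYSDATETIME ();

INSERT INTO [WaitStatsSnapshot]
SELECT @now, [wait_type], [waiting_tasks_count],
    [wait_time_ms], [signal_wait_time_ms]
FROM sys.dm_os_wait_stats;
GO

-- Compare the two most recent snapshots: top 10 waits by time accumulated between them
WITH [Ranked] AS (
    SELECT *, DENSE_RANK () OVER (ORDER BY [capture_time] DESC) AS [rn]
    FROM [WaitStatsSnapshot])
SELECT TOP (10)
    [cur].[wait_type],
    [cur].[wait_time_ms] - [prev].[wait_time_ms] AS [DeltaWait_ms],
    [cur].[waiting_tasks_count] - [prev].[waiting_tasks_count] AS [DeltaCount]
FROM [Ranked] AS [cur]
JOIN [Ranked] AS [prev]
    ON [prev].[wait_type] = [cur].[wait_type]
    AND [prev].[rn] = 2
WHERE [cur].[rn] = 1
ORDER BY [DeltaWait_ms] DESC;
GO
```

The counters are cumulative and reset when the instance restarts or the wait statistics are cleared, so negative deltas mean a reset happened between the two snapshots.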

There is lots of information about wait statistics in my Wait Statistics blog category and there’s a new whitepaper (SQL Server Performance Tuning Using Wait Statistics: A Beginners Guide) on wait statistics written by Jonathan and Erin in conjunction with Red Gate which you can download from our website here.

Enjoy!

Survey: target uptime – planned and actual

It’s been five years(!) since the last time I asked about your target uptimes for your critical SQL Server instances and I think we’d all be interested to see how things have changed.

Edit 6/2/14: The survey is closed now – see here for the results.

So I present four surveys to you. For your most critical SQL Server instance:

  • If it’s a 24×7 system, what’s the target uptime?
  • If it’s a 24×7 system, what’s your measured uptime over the last year?
  • If it’s not a 24×7 system, what’s the target uptime?
  • If it’s not a 24×7 system, what’s your measured uptime over the last year?

You’ll notice that the surveys are termed in percentages. Here’s what the percentages mean for a 24×7 system:

  • 99.999% = 5.26 minutes of downtime per year
  • 99.99% = 52.56 minutes of downtime per year
  • 99.9% = 8.76 hours of downtime per year
  • 99.5% = 1.825 days of downtime per year
  • 99% = 3.65 days of downtime per year
  • 98.5% = 5.475 days of downtime per year
  • 98% = 7.3 days of downtime per year
  • 95% = 18.25 days of downtime per year
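These figures are simply (100 – uptime percentage) / 100 multiplied by the length of a 365-day year. Here is a minimal sketch of the arithmetic, with column aliases made up for illustration and leap years ignored:

```sql
-- Allowed downtime per 365-day year for each uptime percentage
SELECT
    [UptimePct],
    CAST ((100 - [UptimePct]) / 100 * 365 * 24 * 60 AS DECIMAL (12, 2)) AS [DowntimeMinutesPerYear],
    CAST ((100 - [UptimePct]) / 100 * 365 * 24 AS DECIMAL (12, 2)) AS [DowntimeHoursPerYear],
    CAST ((100 - [UptimePct]) / 100 * 365 AS DECIMAL (12, 3)) AS [DowntimeDaysPerYear]
FROM (VALUES (99.999), (99.99), (99.9), (99.5), (99.0), (98.5), (98.0), (95.0))
    AS [t] ([UptimePct]);
GO
```

You can add any other percentage to the VALUES list to see what it means in practice.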

If your target uptime allows for planned maintenance downtime, then that doesn’t count as unplanned downtime, as long as your system was only down for the length of time allowed. But don’t cheat yourself and retroactively classify unplanned downtime as planned, so it doesn’t affect your actual, measured uptime.

For instance, if you have a 99.9% uptime goal for a 24×7 system, with a quarterly 4-hour maintenance window, then I would select 99.9% in the 24×7 target survey. For that same system, if the downtime was limited to the prescribed 4-hour window each quarter, and there was no other downtime *at all*, I would select 99.999% on the 24×7 measured uptime survey.

Basic advice is to use common sense in how you answer. If you say you have a 24×7 system but you have a 12-hour maintenance window each week, I wouldn’t classify that as a 24×7 system.

24×7 Systems

Survey 1: 24×7 system target uptime


Survey 2: 24×7 system measured uptime
Please be honest. Remember if you choose 99.999% that means you’re saying your system was up for all but 5 minutes in the last year.


Non-24×7 Systems

Survey 3: Non-24×7 system target uptime
Use ‘Other’ to answer if your answer is ‘No target or target unknown’.


Survey 4: Non-24×7 system measured uptime
Please be honest.



I’ll editorialize the results in a week or two.

Thanks!

Causes of IO_COMPLETION and WRITE_COMPLETION SQL Server wait types

In many of the sets of wait statistics I’ve been analyzing, the IO_COMPLETION and WRITE_COMPLETION waits show up (but never as the most prevalent wait type).

The official definitions of these wait types are:

  • IO_COMPLETION: Occurs while waiting for I/O operations to complete. This wait type generally represents non-data page I/Os. Data page I/O completion waits appear as PAGEIOLATCH_* waits.
  • WRITE_COMPLETION: Occurs when a write operation is in progress.

I promised many of the people who sent me wait statistics recently that I would write a blog post giving more detailed information on when these wait types occur, so here it is.

I used the Extended Events code in my post How to determine what causes a particular wait type to watch for these wait types occurring and then ran a variety of operations and analyzed the call stacks. There are way too many occurrences to document them all here, so I’ll summarize my findings below.

Note that these lists are not exhaustive, but you get the idea of the kinds of operation where these wait types occur.

IO_COMPLETION

  • Reading log blocks from the transaction log (during any operation that causes the log to be read from disk – e.g. recovery)
  • Reading allocation bitmaps from disk (e.g. GAM, SGAM, PFS pages) during many operations (e.g. recovery, DB startup, restore)
  • Writing intermediate sort buffers to disk (these are called ‘Bobs’)
  • Reading and writing merge results from/to disk during a merge join
  • Reading and writing eager spools to disk
  • Reading VLF headers from the transaction log

WRITE_COMPLETION

  • Writing any page to a database snapshot (e.g. while running DBCC CHECK*, which is often the most common cause of this wait type)
  • Writing VLF headers while creating or growing a transaction log file
  • Writing a file’s header page to disk
  • Writing portions of the transaction log during database startup
  • Writing allocation pages to disk when creating or growing a data file

These aren’t waits that I’d generally be concerned about, and I’d expect the individual resource wait times to be in line with those of the read and write latencies of the instance.
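As a rough sketch of that comparison, the first query below computes the average resource wait time for the two wait types (total wait time minus signal wait time, divided by the number of waits), and the second computes the average read and write latency per database file. The column aliases are made up for illustration, and both sets of numbers are cumulative since the instance started or the wait statistics were cleared.

```sql
-- Average resource wait time for the two wait types
SELECT
    [wait_type],
    [waiting_tasks_count],
    CAST (([wait_time_ms] - [signal_wait_time_ms]) * 1.0
        / NULLIF ([waiting_tasks_count], 0) AS DECIMAL (16, 2)) AS [AvgResourceWait_ms]
FROM sys.dm_os_wait_stats
WHERE [wait_type] IN (N'IO_COMPLETION', N'WRITE_COMPLETION');
GO

-- Average read and write latency for every database file on the instance
SELECT
    DB_NAME ([database_id]) AS [DatabaseName],
    [file_id],
    CAST ([io_stall_read_ms] * 1.0
        / NULLIF ([num_of_reads], 0) AS DECIMAL (16, 2)) AS [AvgReadLatency_ms],
    CAST ([io_stall_write_ms] * 1.0
        / NULLIF ([num_of_writes], 0) AS DECIMAL (16, 2)) AS [AvgWriteLatency_ms]
FROM sys.dm_io_virtual_file_stats (NULL, NULL);
GO
```

If the average resource waits are far higher than the file latencies, that is worth investigating further.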

Enjoy!

SQLskills community mentoring – round 6

We’ve had a mentoring program here at SQLskills for a few years (the brainchild of Jonathan), where each of us can pick someone who’s attended one of our Immersion Events in the past and offer to be their mentor for six months.

It’s time to kick off the sixth mentoring session. You can read about the previous mentees below:

I’ll be the only one mentoring this time. I continue to find it interesting helping my mentees with a variety of personal and professional goals (whatever they want help with) and getting to know them better. This time I’ll be mentoring Dainius Sutkevicius who’s taken IE1, IE2, and IE3 previously. A brief resume of Dainius in his own words:

Dainius Sutkevicius has been working as a Sr. SQL Server DBA for an international natural and organic foods retailer for the past several years.

He is a highly motivated, well-versed IT professional with over 15 years of experience in logistics, telecommunications, trading and retail industries. He possesses diverse skills and experience in troubleshooting, optimizing, maintaining and developing secure, high availability and performance database solutions.

When he is not tinkering with technology he enjoys spending time with his family, digital imaging, javelin throwing and reading about quantum physics, cosmology and philosophy.

He blogs at http://www.SqlSorcerer.com and is on Twitter @SqlSorcerer.

Congratulations to Dainius – I’m looking forward to working with him over the next six months.

PS I’m often asked how to get into our mentoring program – it’s simple: attend one of our Immersion Events and get to know one of us, and then ask and see if there’s a match that works.