The Accidental DBA (Day 26 of 30): Monitoring Disk I/O

This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. If you know someone who would benefit from this class, refer them and earn a $50 Amazon gift card – see class pages for details. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

Database storage can seem like a black box. A DBA takes care of databases, and those databases often reside somewhere on a SAN – space simply presented to a DBA as a drive letter representing some amount of space. But storage is about more than a drive letter and a few hundred GBs. Yes, having enough space for your database files is important. But I often see clients plan for capacity and not performance, and this can become a problem down the road. As a DBA, you need to ask your storage admin not just for space, but for throughput, and the best way to back up your request is with data.

I/O Data in SQL Server
I’ve mentioned quite a few DMVs in these Accidental DBA posts, and today is no different. If you want to look at I/O from within SQL Server, you want to use the sys.dm_io_virtual_file_stats DMV. Prior to SQL Server 2005 you could get the same information using the fn_virtualfilestats function, so don’t despair if you’re still running SQL Server 2000!  Paul has a query that I often use to get file information in his post, How to examine IO subsystem latencies from within SQL Server. The sys.dm_io_virtual_file_stats DMV accepts database_id and file_id as inputs, but if you join over to sys.master_files, you can get information for all your database files. If I run this query against one of my instances, and order by write latency (desc) I get:


output from sys.dm_io_virtual_file_stats

This data makes it look like I have some serious disk issues – a write latency of over 1 second is disheartening, especially considering I have an SSD in my laptop! I include this screenshot because I want to point out that this data is cumulative. It only resets on a restart of the instance. Large I/O operations against a database – such as index rebuilds – can greatly skew your data, and it may take time for the numbers to normalize again. Keep this in mind not only when you view the data at a point in time, but when you share findings with other teams. Joe has a great post that talks about this in more detail, Avoid false negatives when comparing sys.dm_io_virtual_file_stats data to perfmon counter data, and the same approach applies to data from storage devices that your SAN administrators may use.

The information in the sys.dm_io_virtual_file_stats DMV is valuable not only because it shows latencies, but also because it tells you which files have the highest number of reads and writes and MBs read and written. You can determine which databases (and files) are your heavy hitters and trend that over time to see if it changes and how.
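If you just want a quick look, here is a minimal sketch of that idea – this is not Paul's query (his version calculates more, and filters more carefully), but it pulls per-file latency and throughput from the DMV joined to sys.master_files:

-- Minimal sketch: per-file latency and throughput from sys.dm_io_virtual_file_stats.
-- Not Paul's exact query; see his post for the full version.
SELECT
    DB_NAME(vfs.database_id) AS [DatabaseName],
    mf.name AS [LogicalFileName],
    mf.type_desc AS [FileType],
    vfs.num_of_reads,
    vfs.num_of_writes,
    vfs.num_of_bytes_read / 1048576 AS [MBRead],
    vfs.num_of_bytes_written / 1048576 AS [MBWritten],
    CASE WHEN vfs.num_of_reads = 0 THEN 0
         ELSE vfs.io_stall_read_ms / vfs.num_of_reads END AS [AvgReadLatency_ms],
    CASE WHEN vfs.num_of_writes = 0 THEN 0
         ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS [AvgWriteLatency_ms]
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON vfs.database_id = mf.database_id
    AND vfs.file_id = mf.file_id
ORDER BY [AvgWriteLatency_ms] DESC;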

I/O Data in Windows

If you want to capture I/O data from Windows, Performance Monitor is your best bet. I like to look at the following counters for each disk:

  • Avg. Disk sec/Read
  • Avg. Disk Bytes/Read
  • Avg. Disk sec/Write
  • Avg. Disk Bytes/Write

Jon talked about PerfMon counters earlier in this series, and the counters above tell you about latency and throughput.  Latency is how long an I/O request takes, but it can be measured at different points along the layers of a solution. Normally we are concerned with latency as measured from SQL Server. Within Windows, latency is the time from when Windows initiated the I/O request to the completion of the request. As Joe mentioned in his post, you may see some variation between the latency reported by SQL Server and by Windows.

When we measure latency using Windows Performance Monitor, we look at Avg. Disk sec/Read and Avg. Disk sec/Write. Disk cache – whether on the disk itself, a controller card, or the storage system – impacts read and write values. Writes typically go to cache and should complete very quickly. Reads, when not in cache, have to be pulled from disk, and that can take longer.  While it's easy to think of latency as being entirely related to disk, it's not. Remember that we're really talking about the I/O subsystem, and that includes the entire path from the server itself all the way to the disks and back. That path includes things like HBAs in the server, switches, controllers in the SAN, cache in the SAN, and the disks themselves. You can never assume that latency is high because the disks can't keep up. Sometimes the queue depth setting for the HBAs is too low, or perhaps you have an intermittently bad connection with a failing component like a GBIC (gigabit interface converter) or a bad port card. You have to take the information you have (latency), share it with your storage team, and ask them to investigate. And hopefully you have a savvy storage team that knows to investigate all parts of the path.

A picture is worth a thousand words in more complex environments. It is often best to draw out, with the storage administrator, the mapping from the OS partition to the SAN LUN or volume. This should generate a discussion about the server, the paths to the SAN, and the SAN itself. Remember that what matters is getting the I/O to the application. If the I/O leaves the disk but gets stuck along the way, that adds to latency. There could be an alternate path available (multi-pathing), but maybe not.

Our throughput, measured by Avg. Disk Bytes/Read and Avg. Disk Bytes/Write, tells us how much data is moving between the server and storage. This is valuable to understand, and often more useful than counting I/Os, because we can use it to understand how much data our disks will need to be able to read and write to keep up with demand. Ideally you capture this information when the system is optimized – simple things like adding indexes to reduce full table scans can affect the amount of I/O – but often you will just need to work within the current configuration.

Capturing Baselines

I alluded to baselines when discussing the sys.dm_io_virtual_file_stats DMV, and if you thought I was going to leave it at that then you must not be aware of my love for baselines!
You will want to capture data from SQL Server and Windows to provide throughput data to your storage administrator. You need this data to procure storage on the SAN that will not only give you enough space to accommodate expected database growth, but that will also give you the IOPs and MB/sec your databases require.

Beyond a one-time review of I/O and latency numbers, you should set up a process to capture the data on a regular basis so you can identify if things change and when. You will want to know if a database suddenly starts issuing more I/Os (did someone drop an index?) or if the change in I/Os is gradual. And you need to make sure that I/Os are completing in the timeframe that you expect. Remember that a SAN is shared storage, and you don't always know with whom you're sharing that storage. If another application with high I/O requirements is placed on the same set of disks, and your latency goes up, you want to be able to pinpoint that change and provide metrics to your SAN administrator that support the change in performance in your databases.
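One way to set up that process – a minimal sketch only, with a hypothetical table name – is to create a history table and have a SQL Agent job insert a snapshot on a schedule, then diff the cumulative counters between collections:

-- Minimal sketch (hypothetical table name): snapshot the cumulative file stats
-- on a schedule so you can compare deltas between collections.
CREATE TABLE [dbo].[FileStatsHistory] (
   [CaptureDate]       DATETIME NOT NULL DEFAULT (GETDATE()),
   [DatabaseID]        INT NOT NULL,
   [FileID]            INT NOT NULL,
   [NumOfReads]        BIGINT,
   [NumOfBytesRead]    BIGINT,
   [IoStallReadMS]     BIGINT,
   [NumOfWrites]       BIGINT,
   [NumOfBytesWritten] BIGINT,
   [IoStallWriteMS]    BIGINT
);
GO

INSERT INTO [dbo].[FileStatsHistory]
   ([DatabaseID], [FileID], [NumOfReads], [NumOfBytesRead], [IoStallReadMS],
    [NumOfWrites], [NumOfBytesWritten], [IoStallWriteMS])
SELECT
   [database_id], [file_id], [num_of_reads], [num_of_bytes_read], [io_stall_read_ms],
   [num_of_writes], [num_of_bytes_written], [io_stall_write_ms]
FROM sys.dm_io_virtual_file_stats(NULL, NULL);
GO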

Summary

As a DBA you need to know how your databases perform when it comes to reads and writes, and it's a great idea to get to know your storage team. It's also a good idea to understand where your databases really “live” and what other applications share the same storage. When a performance issue comes up, use your baseline data as a starting point, and don't hesitate to pull in your SAN administrators to get more information. While there's a lot of data readily available for DBAs to use, you cannot get the entire picture on your own. It may not hurt to buy your storage team some pizza or donuts and make some new friends. Finally, if you're interested in digging deeper into the details of SQL Server I/O, I recommend starting with Bob Dorr's work:

The Accidental DBA (Day 25 of 30): Wait Statistics Analysis


For the last set of posts in our Accidental DBA series we’re going to focus on troubleshooting, and I want to start with Wait Statistics.  When SQL Server executes a task, if it has to wait for anything – a lock to be released from a page, a page to be read from disk into memory, a write to the transaction log to complete – then SQL Server records that wait and the time it had to wait.  This information accumulates, and can be queried using the sys.dm_os_wait_stats DMV, which was first available in SQL Server 2005.  Since then, the waits and queues troubleshooting methodology has been a technique DBAs can use to identify problems, and areas for optimizations, within an environment.

If you haven’t worked with wait statistics, I recommend starting with Paul’s wait stats post, and then working through Tom Davidson’s SQL Server 2005 Waits and Queues whitepaper.

Viewing Wait Statistics

If you run the following query:

SELECT *
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

You will get back output that isn’t that helpful, as you can see below:


sys.dm_os_wait_stats output

It looks like FT_IFTS_SCHEDULER_IDLE_WAIT is the biggest wait, and SQL Server's waited for 1930299679 ms total.  This is kind of interesting, but not what I really need to know.  How do I really use this data?  It needs some filtering and aggregation.  There are some waits that aren't going to be of interest because they occur all the time and are irrelevant for our purposes; we can filter out those wait types.  To make the most of our wait stats output, I really want to know the highest wait based on the percentage of time spent waiting overall, and the average wait time for that wait.  The query that I use to get this information is the one from Paul's post (mentioned above).  I won't paste it here (you can get it from his post) but if I run that query against my instance, now I get only three rows in my output:


sys.dm_os_wait_stats output with wait_types filtered out

If we reference the various wait types listed in the MSDN entry for sys.dm_os_wait_stats, we see that the SQLTRACE_WAIT_ENTRIES wait type, “Occurs while a SQL Trace event queue waits for packets to arrive on the queue.”

Well, this instance is on my local machine and isn’t very active, so that wait is likely due to the default trace that’s always running.  In a production environment, I probably wouldn’t see that wait, and if I did, I’d check to see how many SQL Traces were running.  But for our purposes, I’m going to add that as a wait type to filter out, and then re-run the query.  Now there are more rows in my output, and the percentage for the PAGEIOLATCH_SH and LCK_M_X waits has changed:


sys.dm_os_wait_stats output with SQLTRACE_WAIT_ENTRIES also filtered out

If you review the original query, you will see that the percentage calculation for each wait type uses the wait_time_ms for the wait divided by the SUM of wait_time_ms for all waits.  But “all waits” are those wait types not filtered by the query.  Therefore, as you change what wait types you do not consider, the calculations will change.  Keep this in mind when you compare data over time or with other DBAs in your company – it’s a good idea to make sure you’re always running the same query that filters out the same wait types.
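To make that calculation concrete, here is a simplified illustration – it is not Paul's query (his filters far more wait types and includes additional columns), but it shows how the percentage is computed over only the non-filtered waits:

-- Simplified illustration only: Paul's query filters many more wait types
-- and calculates additional values such as average wait times.
WITH [Waits] AS (
    SELECT
        [wait_type],
        [wait_time_ms],
        [waiting_tasks_count],
        100.0 * [wait_time_ms] / SUM([wait_time_ms]) OVER() AS [Percentage]
    FROM sys.dm_os_wait_stats
    WHERE [wait_type] NOT IN (
        N'FT_IFTS_SCHEDULER_IDLE_WAIT', N'SQLTRACE_WAIT_ENTRIES',
        N'LAZYWRITER_SLEEP', N'SLEEP_TASK', N'BROKER_TASK_STOP'
        -- ...plus the rest of the benign wait types filtered in Paul's query
    )
)
SELECT
    [wait_type],
    [wait_time_ms],
    [waiting_tasks_count],
    CAST([Percentage] AS DECIMAL(5, 2)) AS [PctOfTotal]
FROM [Waits]
WHERE [Percentage] > 1
ORDER BY [wait_time_ms] DESC;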

Capturing Wait Statistics

So far I’ve talked about looking at wait statistics at a point in time.  As a DBA, you want to know what waits are normal for each instance.  And there will be waits for every instance; even if it’s highly tuned or incredibly low volume, there will be waits.  You need to know what’s normal, and then use those values when the system is not performing well.

The easiest way to capture wait statistics is to snapshot the data to a table on a regular basis, and you can find queries for this process in my Capturing Baselines for SQL Server: Wait Statistics article on SQLServerCentral.com.  Once you have your methodology in place to capture the data, review it on a regular basis to understand your typical waits, and identify potential issues before they escalate.  When you do discover a problem, then you can use wait statistics to aid in your troubleshooting.
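The exact queries are in the article, but as a minimal sketch (with a hypothetical table name), the pattern is simply a history table plus a scheduled INSERT from a SQL Agent job:

-- Minimal sketch (hypothetical table name): snapshot wait stats on a schedule,
-- then compare deltas between snapshots to see what changed.
CREATE TABLE [dbo].[WaitStatsHistory] (
   [CaptureDate]       DATETIME NOT NULL DEFAULT (GETDATE()),
   [WaitType]          NVARCHAR(60) NOT NULL,
   [WaitingTasksCount] BIGINT NOT NULL,
   [WaitTimeMS]        BIGINT NOT NULL,
   [SignalWaitTimeMS]  BIGINT NOT NULL
);
GO

INSERT INTO [dbo].[WaitStatsHistory]
   ([WaitType], [WaitingTasksCount], [WaitTimeMS], [SignalWaitTimeMS])
SELECT [wait_type], [waiting_tasks_count], [wait_time_ms], [signal_wait_time_ms]
FROM sys.dm_os_wait_stats;
GO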

Using the Data

At the time that you identify a problem in your environment, a good first step is to run your wait statistics query and compare the output to your baseline numbers.  If you see something out of the ordinary, you have an idea where to begin your investigation.  But that’s it; wait statistics simply tell you where to start searching for your answer.  Do not assume that your highest wait is the problem, or even that it’s a problem at all.  For example, a common top wait is CXPACKET, and CXPACKET waits indicate that parallelism is used, which is expected in a SQL Server environment.  If that’s your top wait, does that mean you should immediately change the MAXDOP setting for the instance?  No.  You may end up changing it down the road, but a better direction is to understand why that’s the highest wait.  You may have CXPACKET waits because you’re missing some indexes and there are tons of table scans occurring.  You don’t need to change MAXDOP, you need to start tuning.

Another good example is the WRITELOG wait type.  WRITELOG waits occur when SQL Server is waiting for a log flush to complete.  A log flush occurs when information needs to be written to the database’s transaction log.  A log flush should complete quickly, because when there is a delay in a log write, then the task that initiated the modification has to wait, and tasks may be waiting behind that.  But a log flush doesn’t happen instantaneously every single time, so you will have WRITELOG waits.  If you see WRITELOG as your top wait, don’t immediately assume you need new storage.  You should only assume that you need to investigate further.  A good place to start would be looking at read and write latencies, and since I’ll be discussing monitoring IO more in tomorrow’s post we’ll shelve that discussion until then.

As you can see from these two examples, wait statistics are a starting point.  They are very valuable – it’s easy to think of them as “the answer”, but they’re not.  Wait statistics do not tell you the entire story about a SQL Server implementation.  There is no one “thing” that tells you the entire story, which is why troubleshooting can be incredibly frustrating, yet wonderfully satisfying when you find the root of a problem.  Successfully troubleshooting performance issues in SQL Server requires an understanding of all the data available to aid in your discovery and investigation, understanding where to start, and what information to capture to correlate with other findings.

The Accidental DBA (Day 23 of 30): SQL Server HA/DR Features


Two of the most important responsibilities for any DBA are protecting the data in a database and keeping that data available.  As such, a DBA may be responsible for creating and testing a disaster recovery plan, and creating and supporting a high availability solution.  Before you create either, you have to know your RPO and RTO, as Paul talked about a couple weeks ago.  Paul also discussed what you need to consider when developing a recovery strategy, and yesterday Jon covered considerations for implementing a high availability solution.

In today’s post, I want to provide some basic information about disaster recovery and high availability solutions used most often.  This overview will give you an idea of what options might be a fit for your database(s), but you’ll want to understand each technology in more detail before you make a final decision.

Backup/Restore

No matter what type of implementation you support, you need a disaster recovery plan.  Your database may not need to be highly available, and you may not have the budget to create a HA solution even if the business wants one.  But you must have a method to recover from a disaster.  Every version, and every edition, of SQL Server supports backup and restore.  A bare bones DR plan requires a restore of the most recent database backups available – this is where backup retention comes in to play.  Ideally you have a location to which you can restore.  You may have a server and storage ready to go, 500 miles away, just waiting for you to restore the files.  Or you may have to purchase that server, install it from the ground up, and then restore the backups.  While the plan itself is important, what matters most is that you have a plan.
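The T-SQL behind that bare bones plan is nothing exotic – as a sketch with hypothetical database and file names, the restore sequence might look like this:

-- Minimal sketch (hypothetical names): restore the most recent full backup,
-- then each log backup in sequence, then bring the database online.
RESTORE DATABASE [SalesDB]
    FROM DISK = N'\\DRShare\Backups\SalesDB_Full.bak'
    WITH NORECOVERY, REPLACE;

RESTORE LOG [SalesDB]
    FROM DISK = N'\\DRShare\Backups\SalesDB_Log_1.trn'
    WITH NORECOVERY;

-- ...restore the remaining log backups in order, then:
RESTORE DATABASE [SalesDB] WITH RECOVERY;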

Log Shipping

Log shipping exists on a per-user-database level and requires the database recovery model to use either full or bulk-logged recovery (see Paul's post for a primer on the differences).  Log shipping is easy to understand – it's backup from one server and restore on another – and the process is automated through jobs.  Log shipping is fairly straightforward to configure and you can use the UI or script it out (prior to SQL Server 2000 there was no UI).  Log shipping is available in all currently supported versions of SQL Server, and all editions.

You can log ship to multiple locations, creating additional redundancy, and you can configure a database for log shipping if it’s the primary database in a database mirroring or availability group configuration.  You can also use log shipping when replication is in use.

With log shipping you can allow limited read-only access on secondary databases for reporting purposes (make sure you understand the licensing impact), and you can take advantage of backup compression to reduce the size of the log backups and therefore decrease the amount of data sent between locations.  Note: backup compression was first available only in SQL Server 2008 Enterprise, but starting in SQL Server 2008 R2 it was available in Standard Edition.

While Log Shipping is often used for disaster recovery, you can use it as a high availability solution, as long as you can accept some amount of data loss and some amount of downtime.  Alternatively, in a DR scenario, if you implement a longer delay between backup and restore, then if data is changed or removed from the primary database – either purposefully or accidentally – you can possibly recover it from the secondary.

Failover Cluster Instance

A Failover Cluster Instance (also referred to as FCI or SQL FCI) exists at the instance level and can seem scary to newer DBAs because it requires a Windows Server Failover Cluster (WSFC).  A SQL FCI usually requires more coordination with other teams (e.g. server, storage) than other configurations.  But clustering is not incredibly difficult once you understand the different parts involved.  A Cluster Validation Tool was made available in Windows Server 2008, and you should ensure the supporting hardware successfully passes its configuration tests before you install SQL Server, otherwise you may not be able to get your instance up and running.

SQL FCIs are available in all currently supported versions of SQL Server, and can be used with Standard Edition (2 nodes only), Business Intelligence Edition in SQL Server 2012 (2 nodes only), and Enterprise Edition.  The nodes in the cluster share the same storage, so there is only one copy of the data.  If a failure occurs for a node, SQL Server fails over to another available node.

If you have a two-node WSFC with only one instance of SQL Server, one of the nodes is always unused, basically sitting idle.  Management may view this as a waste of resources, but understand that it is there as insurance (that second node is there to keep SQL Server available if the first node fails).  You can install a second SQL Server instance and use log shipping or mirroring with snapshots to create a secondary copy of the database for reporting (again, pay attention to licensing costs).  Or, those two instances can both support production databases, making better use of the hardware.  However, be aware of resource utilization when a node fails and both instances run on the same node.

Finally, a SQL FCI can provide intra-data center high availability, but because it uses shared storage, you do have a single point of failure.  A SQL FCI can be used for cross-data center disaster recovery if you use multi-site SQL FCIs in conjunction with storage replication.  This does require a bit more work and configuration, because you have more moving parts, and it can become quite costly.

Database Mirroring

Database mirroring is configured on a per-user-database basis and the database must use the Full recovery model.  Database mirroring was introduced in SQL Server 2005 SP1 and is available in Standard Edition (synchronous only) and Enterprise Edition (synchronous and asynchronous).  A database can be mirrored to only one secondary server, unlike log shipping.

Database mirroring is extremely easy to configure using the UI or scripting.  A third instance of SQL Server, configured as a witness, can detect the availability of the primary and mirror servers.  In synchronous mode with automatic failover, if the primary server becomes unavailable and the witness can still see the mirror, failover will occur automatically if the database is synchronized.

Note that you cannot mirror a database that contains FILESTREAM data, and mirroring is not appropriate if you need multiple databases to failover simultaneously, or if you use cross-database transactions or distributed transactions.  Database mirroring is considered a high availability solution, but it can also be used for disaster recovery, assuming the lag between the primary and mirror sites is not so great that the mirror database is too far behind the primary for RPO to be met.  If you’re running Enterprise Edition, snapshots can be used on the mirror server for point-in-time reporting, but there’s a licensing cost that comes with reading off the mirror server (as opposed to if it’s used only when a failover occurs).
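As a sketch of that reporting option (the names here are hypothetical, and the logical name must match the mirror database's data file), a point-in-time database snapshot on the mirror server looks like this:

-- Minimal sketch (hypothetical names): a database snapshot on the mirror
-- server for point-in-time reporting (Enterprise Edition).
CREATE DATABASE [SalesDB_Snapshot_AM]
ON ( NAME = N'SalesDB_Data',
     FILENAME = N'M:\Snapshots\SalesDB_Snapshot_AM.ss' )
AS SNAPSHOT OF [SalesDB];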

Availability Groups

Availability groups (AGs) were introduced in SQL Server 2012 and require Enterprise Edition.  AGs are configured for one or more databases, and if a failover occurs, the databases in a group fail over together.  They allow three synchronous replicas (the primary and two secondaries), whereas database mirroring allowed only one synchronous secondary, and up to four asynchronous replicas.  Failover in an Availability Group can be automatic or manual.  Availability Groups do require a Windows Server Failover Cluster (WSFC), but do not require a SQL FCI.  An AG can be hosted on SQL FCIs, or on standalone servers within the WSFC.

Availability Groups provide read-only replicas with low-latency streaming updates, so you can offload reporting to another server and have it be near real-time.  Availability Groups offer some fantastic functionality, but just as with a SQL FCI, there are many moving parts and the DBA cannot work in a vacuum for this solution; it requires a group effort.  Make friends with the server team, the storage team, the network folks, and the application team.

Transactional Replication

Transactional Replication gets a shout out here, even though it is not always considered a high availability solution as Paul discusses in his post, In defense of transaction replication as an HA technology.  But it can work as a high availability solution provided you can accept its limitations.  For example, there is no easy way to fail back to the primary site…however, I would argue this is true for log shipping as well because log shipping requires you to backup and restore (easy but time consuming).  In addition, with transactional replication you don’t have a byte-for-byte copy of the publisher database, as you do with log shipping, database mirroring or availability groups.  This may be a deal-breaker for some, but it may be quite acceptable for your database(s).

Transactional Replication is available in all currently supported versions and in Standard and Enterprise Editions, and may also be a viable option for you for disaster recovery.  It’s important that you clearly understand what it can do, and what it cannot, before you decide to use it.  Finally, replication in general isn’t for the faint of heart.  It has many moving parts and can be overwhelming for an Accidental DBA.  Joe has a great article on SQL Server Pro that covers how to get started with transactional replication.

Summary

As we’ve seen, there are many options available that a DBA can use to create a highly available solution and/or a system that can be recovered in the event of a disaster.  It all starts with understanding how much data you can lose (RPO) and how long the system can be unavailable (RTO), and you work from there.  Remember that the business needs to provide RPO and RTO to you, and then you create the solution based on that information.  When you present the solution back to the business, or to management, make sure it is a solution that YOU can support.  As an Accidental DBA, whatever technology you choose must be one with which you’re comfortable, because when a problem occurs, you will be the one to respond and that’s not a responsibility to ignore.  For more information on HA and DR solutions I recommend the following:

The Accidental DBA (Day 19 of 30): Tools for On-Going Monitoring


In yesterday’s post I covered the basics of baselines and how to get started.  In addition to setting up baselines, it’s a good idea to get familiar with some of the free tools available to DBAs that help with continued monitoring of a SQL Server environment.

Performance Monitor and PAL

I want to start with Performance Monitor (PerfMon).  I’ve been using PerfMon since I started working with computers and it is still one of my go-to tools.  Beginning in SQL Server 2005, Dynamic Management Views and Functions (DMVs and DMFs) were all the rage, as they exposed so much more information than had been available to DBAs before.  (If you don’t believe me, try troubleshooting a parameter sniffing issue in SQL Server 2000.)  But PerfMon is still a viable option because it provides information about Windows as well as SQL Server.  There are times that it’s valuable to look at that data side-by-side.  PerfMon is on every Windows machine, it’s reliable, and it’s flexible.  It provides numerous configuration options, not to mention all the different counters that you can collect.  You have the ability to tweak it for different servers if needed, or just use the same template every time.  It allows you to generate a comprehensive performance profile of a system for a specified time period, and you can look at performance real-time.

If you're going to use PerfMon regularly, take some time to get familiar with it. When viewing live data, I like to use config files to quickly view counters of interest.  If I've captured data over a period of time and I want to quickly view and analyze it, I use PAL.  PAL stands for Performance Analysis of Logs and it's written and managed by some folks at Microsoft.  You can download PAL from CodePlex, and if you don't already have it installed, I recommend you do it now.

Ok, once PAL is installed, set up PerfMon to capture some data for you.  If you don’t know which counters to capture, don’t worry.  PAL comes with default templates that you can export and then import into PerfMon and use immediately.  That’s a good start, but to get a better idea of what counters are relevant for your SQL Server solution, plan to read Jonathan’s post on essential PerfMon counters (it goes live this Friday, the 21st).  Once you’ve captured your data, you can then run it through PAL, which will do all the analysis for you and create pretty graphs.  For step-by-step instructions on how to use PAL, and to view some of those lovely graphs, check out this post from Jonathan, Free Tools for the DBA: PAL Tool.  Did you have any plans for this afternoon?  Cancel them; you’ll probably have more fun playing with data.

SQL Trace and Trace Analysis Tools

After PerfMon, my other go-to utility was SQL Trace.  Notice I said “was.”  As much as I love SQL Trace and its GUI Profiler, they’re deprecated in SQL Server 2012.  I’ve finally finished my mourning period and moved on to Extended Events.  However, many of you are still running SQL Server 2008R2 and earlier so I know you’re still using Trace.  How many of you are still doing analysis by pushing the data into a table and then querying it?  Ok, put your hands down, it’s time to change that.  Now you need to download ClearTrace and install it.

ClearTrace is a fantastic, light-weight utility that will parse and normalize trace files.  It uses a database to store the parsed information, then queries it to show aggregated information from one trace file, or a set of files.  The tool is very easy to use – you can sort queries based on reads, CPU, duration, etc.  And because the queries are normalized, if you group by the query text you can see the execution count for the queries.

A second utility, ReadTrace, provides the same functionality as ClearTrace, and more.  It’s part of RML Utilities, a set of tools developed and used by Microsoft.  ReadTrace provides the ability to dig a little deeper into the trace files, and one of the big benefits is that it allows you to compare two trace files.  ReadTrace also stores information in a database, and normalizes the data so you can group by query text, or sort by resource usage.  I recommend starting with ClearTrace because it’s very intuitive to use, but once you’re ready for more powerful analysis, start working with ReadTrace.  Both tools include well-written documentation.

Note: If you’re a newer DBA and haven’t done much with Trace, that’s ok.  Pretend you’ve never heard of it, embrace Extended Events.

SQLNexus

If you’re already familiar with the tools I’ve mentioned above, and you want to up your game, then the next utility to conquer is SQLNexus.  SQLNexus analyzes data captured by SQLDiag and PSSDiag, utilities shipped with SQL Server that Microsoft Product Support uses when troubleshooting customer issues.  The default templates for SQLDiag and PSSDiag can be customized, by you, to capture any and all information that’s useful and relevant for your environment, and you can then run that data through SQLNexus for your analysis.  It’s pretty slick and can be a significant time-saver, but the start-up time is higher than with the other tools I’ve mentioned.  It’s powerful in that you can use it to quickly capture point-in-time representations of performance, either as a baseline or as a troubleshooting step.  Either way, you’re provided with a comprehensive set of information about the solution – and again, you can customize it as much as you want.

Essential DMVs for Monitoring

In SQL Server 2012 SP1 there are 178 Dynamic Management Views and Functions.  How do you know which ones are the most useful when you're looking at performance?  Luckily, Glenn has a great set of diagnostic queries to use for monitoring and troubleshooting.  You can find the queries on Glenn's blog, and he updates them as needed, so make sure you follow his blog or check back regularly to get the latest version.  And even though I rely on Glenn's scripts, I wanted to call out a few of my own favorite DMVs (there's a quick example that combines a few of them after the list):

  • sys.dm_os_wait_stats – I want to know what SQL Server is waiting on, when there is a problem and when there isn’t.  If you’re not familiar with wait statistics, read Paul’s post, Wait statistics, or please tell me where it hurts (I still chuckle at that title).
  • sys.dm_exec_requests – When I want to see what’s executing currently, this is where I start.
  • sys.dm_os_waiting_tasks – In addition to the overall waits, I want to know what tasks are waiting right now (and the wait_type).
  • sys.dm_exec_query_stats – I usually join to other DMVs such as sys.dm_exec_sql_text to get additional information, but there’s some great stuff in here including execution count and resource usage.
  • sys.dm_exec_query_plan – Very often you just want to see the plan. This DMV has cached plans as well as those for queries that are currently executing.
  • sys.dm_db_stats_properties – I always take a look at statistics in new systems, and when there’s a performance issue, initially just to check when they were last updated and the sample size.  This DMF lets me do that quickly for a table, or entire database (only for SQL 2008R2 SP2 and SQL 2012 SP1).
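Here's that quick example – a sketch that ties sys.dm_exec_requests to the query text and plan for whatever is running right now:

-- Sketch: what's executing right now, its wait type, query text, and plan.
SELECT
    r.session_id,
    r.status,
    r.wait_type,
    r.wait_time,
    t.text AS [QueryText],
    p.query_plan
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
OUTER APPLY sys.dm_exec_query_plan(r.plan_handle) AS p
WHERE r.session_id > 50;  -- skip most system sessions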

Kimberly will dive into a few of her favorite DMVs in tomorrow’s post.

Wrap Up

All of the utilities mentioned in this post are available for free.  But it’s worth mentioning that there are tools you can purchase that provide much of the same functionality and more.  As an Accidental DBA, you may not always have a budget to cover the cost of these products, which is why it’s important to know what’s readily available.  And while the free tools may require more effort on your part, using them to dig into your data and figure out what’s really going on in your system is one of the best ways to learn about SQL Server and how it works.

The Accidental DBA (Day 18 of 30): Baselines


Baselines are a part of our normal, daily life.  It usually takes 25 minutes to get to work?  Baseline.  You need 7 hours of sleep each night to feel human and be productive?  Baseline.  Your weight is…  Ok, we won’t go there, but you get my point.  Your database server is no different, it has baselines as well.  As a DBA it’s critical that you know what they are and how to use them.

The why…

“But wait,” you say, “why do I need baselines for my server?  It’s always working so there’s no commute, it hopefully never sleeps, and its weight never changes (so unfair).”  You need them; trust me.  A baseline of your database server:

  • Helps you find what’s changed before it becomes a problem
  • Allows you to proactively tune your databases
  • Allows you to use historical information when troubleshooting a problem
  • Provides data to use for trending of the environment and data
  • Captures data – actual numbers – to provide to management, and both server and storage administrators, for resource and capacity planning

There are many viable reasons to capture baselines.  The challenge is the time it takes to figure out where to store the information, what to capture, and when to capture it.  You also need to create methods to report on it and really use that data.

Where to store your baseline data

You're a DBA or developer, and you know T-SQL, so the most obvious place to store your baseline information is in a database.  This is your chance to not only exercise your database design skills, but put your DBA skills to work for your own database.  Beyond design, you also need space for the database, you need to schedule regular backups, and you also want to verify integrity and perform index and statistics maintenance regularly.  Most of the posts that have appeared in this Accidental DBA series are applicable for this database, as well as your production databases.

To get you started, here’s a CREATE DATABASE script that you can use to create a database to hold your baseline data (adjust file locations as necessary, and file size and growth settings as you see fit):

USE [master];
GO

CREATE DATABASE [BaselineData] ON PRIMARY
( NAME = N'BaselineData',
  FILENAME = N'M:\UserDBs\BaselineData.mdf',
  SIZE = 512MB,
  FILEGROWTH = 512MB
) LOG ON
( NAME = N'BaselineData_log',
  FILENAME = N'M:\UserDBs\BaselineData_log.ldf',
  SIZE = 128MB,
  FILEGROWTH = 512MB
);

ALTER DATABASE [BaselineData] SET RECOVERY SIMPLE;

What to capture

Now that you have a place to store your data, you need to decide what information to collect.  It’s very easy to start capturing baseline data with SQL Server, particularly in version 2005 and higher.  DMVs and catalog views provide a plethora of information to accumulate and mine.  Windows Performance Monitor is a built-in utility used to log metrics related to not just SQL Server but also the resources it uses such as CPU, memory, and disk.  Finally, SQL Trace and Extended Events can capture real-time query activity, which can be saved to a file and reviewed later for analysis or comparison.

It’s easy to get overwhelmed with all the options available, so I recommend starting with one or two data points and then adding on over time.  Data file sizes are a great place to start.  Acquiring more space for a database isn’t always a quick operation; it really depends on how your IT department is organized – and also depends on your company having unused storage available.  As a DBA, you want to avoid the situation where your drives are full, and you also want to make sure your data files aren’t auto-growing.

With the statements below, you can create a simple table that will list each drive and the amount of free space, as well as the snapshot date:

USE [BaselineData];
GO
IF EXISTS ( SELECT  1
     FROM    [sys].[tables]
     WHERE   [name] = N'FreeSpace' )
  DROP TABLE [dbo].[FreeSpace];
GO

CREATE TABLE [dbo].[FreeSpace] (
   [LogicalVolume] NVARCHAR(256),
   [MBAvailable] BIGINT,
   [CaptureDate] SMALLDATETIME
)
ON [PRIMARY];
GO

Then you can set up a SQL Agent job to capture the data at a regular interval with the query below:

INSERT INTO [dbo].[FreeSpace](
   [LogicalVolume],
   [MBAvailable],
   [CaptureDate]
)
SELECT DISTINCT
   ([vs].[logical_volume_name]),
   ([vs].[available_bytes] / 1048576),
   GETDATE()
FROM [sys].[master_files] AS [f]
CROSS APPLY [sys].[dm_os_volume_stats]([f].[database_id],[f].[file_id]) AS [vs];
GO

There is a catch with the above query – it’s only applicable if you’re running SQL Server 2008 R2 SP1 and higher (including SQL Server 2012).  If you’re using a previous version, you can use xp_fixeddrives to capture the data:

INSERT INTO [dbo].[FreeSpace](
   [LogicalVolume],
   [MBAvailable]
)
EXEC xp_fixeddrives;

UPDATE [dbo].[FreeSpace]
SET [CaptureDate] = GETDATE()
WHERE [CaptureDate] IS NULL;
GO

Capturing free space is a great start, but if you've pre-sized your database files (which is recommended) the free space value probably won't change for quite a while.  Therefore, it's a good idea to capture file sizes and the available space within them as well.  You can find scripts to capture this information in my Capturing Baselines on SQL Server: Where's My Space? article.

When to capture

Deciding when you will collect data will depend on the data itself.  For the file and disk information, the data doesn't change often enough that you need to collect it hourly.  Daily is sufficient – perhaps even weekly if the systems are low volume.  If you're capturing Performance Monitor data, however, then you would collect at shorter intervals, perhaps every 1 minute or every 5 minutes.  For any data collection, you have to find the right balance between capturing it often enough to accumulate the interesting data points, and not gathering so much data that it becomes unwieldy and hard to find what's really of value.

Separate from the interval at which to capture, for some data you also need to consider the timeframes.  Performance Monitor is a great example.  You may decide to collect counters every 5 minutes, but then you have to determine whether you want to sample 24×7, only on weekdays, or only during business hours.   Or perhaps you only want to capture metrics during peak usage times.  When in doubt, start small.  While you can always change your collection interval and timeframe later on, it’s much easier to start small to avoid getting overwhelmed, rather than collect everything and then try to figure out what to remove.

Using baseline data

Once you’ve set up your process for data capture, what’s next?  It’s very easy to sit back and let the data accumulate, but you need to be proactive.  You won’t want to keep data forever, so put a job in place that will delete data after a specified time.  For the free space example above, it might make sense to add a clustered index on the [CaptureDate] column, and then purge data older than three months (or six months – how long you keep the data will depend on how you’re using it).
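As a sketch against the FreeSpace table above (adjust the retention period to match how you actually use the data):

-- Sketch: a clustered index on CaptureDate and a purge of data older than
-- three months (adjust retention to fit how long you need the history).
CREATE CLUSTERED INDEX [CI_FreeSpace_CaptureDate]
    ON [dbo].[FreeSpace] ([CaptureDate]);
GO

DELETE FROM [dbo].[FreeSpace]
WHERE [CaptureDate] < DATEADD(MONTH, -3, GETDATE());
GO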

Finally, you need to use that data in some way.  You can simply report on it – the query below will give you free disk information for a selected volume for the past 30 days:

SELECT
   [LogicalVolume],
   [MBAvailable],
   [CaptureDate]
FROM [dbo].[FreeSpace]
WHERE [LogicalVolume] = 'C'
   AND [CaptureDate] > GETDATE() - 30
ORDER BY [CaptureDate];
GO

This type of query is great for trending and analysis, but to take full advantage of the data, set up a second step in your daily Agent job that queries the current day's values and, if there is less than 10GB of free space, sends you an email to notify you that disk space is low.
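A sketch of that second step, assuming Database Mail is already configured (the profile name and recipient below are placeholders):

-- Sketch: alert when today's capture shows less than 10GB free on any volume.
-- Assumes Database Mail is configured; profile name and recipient are placeholders.
IF EXISTS ( SELECT 1
            FROM [dbo].[FreeSpace]
            WHERE [CaptureDate] >= CAST(GETDATE() AS DATE)
              AND [MBAvailable] < 10240 )  -- less than 10GB free
BEGIN
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = N'DBA Mail Profile',
        @recipients   = N'dba.team@yourcompany.com',
        @subject      = N'Low disk space warning',
        @body         = N'A monitored volume has less than 10GB of free space.';
END;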

Getting started

At this point you should have a basic understanding of baselines, and you have a few queries to get you started.  If you want to learn more you can peruse my Baselines series on SQLServerCentral.com, and for an in-depth review, you can head over to Pluralsight to view my SQL Server: Benchmarking and Baselines course.  Once you’ve set up your baselines, you will be ready to explore quick methods to review or process the data.  There are many free tools that a DBA can use to not only see what happens in real-time, but also review captured data for analysis and trending.  In tomorrow’s post, we’ll look at a few of those utilities in more detail.

The Accidental DBA (Day 13 of 30): Consistency Checking


If you’ve been following along with our Accidental DBA series, you’ll know that the posts for the last week covered topics related to one of the most important tasks (if not the most important) for a DBA: backups.  I consider consistency checks, often referred to as CHECKDB, as one of the next most important tasks for a DBA.  And if you’ve been a DBA for a while, and if you know how much I love statistics, you might wonder why fragmentation and statistics take third place.  Well, I can fix fragmentation and out-of-date/inaccurate statistics at any point.  I can’t always “fix” corruption.  But let’s take a step back and start at the beginning.

What are consistency checks?

A consistency check in SQL Server verifies the logical and physical integrity of the objects in a database. A check of the entire database is accomplished with the DBCC CHECKDB command, but there are other variations that can be used to selectively check objects in the database: DBCC CHECKALLOC, DBCC CHECKCATALOG, DBCC CHECKTABLE and DBCC CHECKFILEGROUP. Each command performs a specific set of validation commands, and it's easy to think that in order to perform a complete check of the database you need to execute all of them. This is not correct.

When you execute CHECKDB, it runs CHECKALLOC, CHECKTABLE for every table and view (system and user) in the database, and CHECKCATALOG. It also includes some additional checks, such as those for Service Broker, which do not exist in any other command. CHECKDB is the most comprehensive check and is the easiest way to verify the integrity of the database in one shot. You can read an in-depth description of what it does from Paul, its author, here.

CHECKFILEGROUP runs CHECKALLOC and then CHECKTABLE for every table in the specified filegroup. If you have a VLDB (Very Large DataBase) you may opt to run CHECKFILEGROUP for different filegroups on different days, and run CHECKCATALOG another day, to break up the work.

How often should I run Consistency Checks?

If you can run a consistency check every day for your database, I recommend that you do so. But it's quite common that a daily execution of CHECKDB doesn't fit into your maintenance window – see Paul's post on how often most people do run checks. In that case, I recommend you run your checks once a week. And if CHECKDB for your entire database doesn't complete in your weekly maintenance window, then you have to figure out what's possible within the time-frame available. I mentioned VLDBs earlier, and Paul has a nice post on options for breaking up checks for large databases. You will have to determine what works best for your system – there isn't a one-size-fits-all solution. You may need to get creative, which is one of the fun aspects of being a DBA. But don't avoid running consistency checks simply because you have a large database or a small maintenance window.
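Whatever schedule you land on, the command itself is simple; a typical form for a scheduled job (the database name below is a placeholder) is:

-- Typical form for a scheduled check: suppress informational messages
-- and return all errors reported per object, not just the first few.
DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;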

Why do I need to run consistency checks?

Consistency checks are critical because hardware fails and accidents happen. The majority of database corruption occurs because of issues with the I/O subsystem, as Paul mentions here. Most of the time these are events that are out of your control, and all you can do is be prepared. If you haven't experienced database corruption yet in your career, consider yourself lucky, but don't think you're exempt. It's much more common than many DBAs realize, and you should expect that it's going to occur in one of your databases, on a day that you have meetings booked back-to-back, need to leave early, and while every other DBA is on vacation.

What if I find corruption?

If you encounter database corruption, the first thing to do is run DBCC CHECKDB and let it finish. Realize that a DBCC command isn’t the only way to find corruption – if a page checksum comes up as invalid as part of a normal operation, SQL Server will generate an error. If a page cannot be read from disk, SQL Server will generate an error. However it’s encountered, make sure that CHECKDB has completed and once you have the output from it, start to analyze it (it’s a good idea to save a copy of the output). Output from CHECKDB is not immediately intuitive. If you need help reviewing it, post to one of the MSDN or StackOverflow forums, or use the #sqlhelp hashtag on Twitter.

Understand exactly what you're facing in terms of corruption before you take your next step, which is deciding whether you're going to run repair or restore from backup. This decision depends on numerous factors, and this is where your disaster recovery run-book comes into play. Two important considerations are how much data you might lose (and CHECKDB won't tell you what data you will lose if you run repair, you'll have to go back and try to figure that out afterwards) and how long the system will be unavailable – either during repair or restore. This is not an easy decision. If you decide to repair, make certain you take a full backup of the database first. You always want a copy of the database, just in case. I would also recommend that if you decide to run repair, run it against a copy of the database first, so you can see what it does. This may also help you understand how much data you would lose. Finally, after you've either run repair or restored from backup, run CHECKDB again. You need to confirm that the database no longer has integrity issues.

Please understand that I have greatly simplified the steps to go through if you find corruption. For a deeper understanding of what you need to consider when you find corruption, and options for recovering, I recommend a session that Paul did a few years ago on Corruption Survival Techniques, as what he discussed still holds true today.

What about CHECKIDENT and CHECKCONSTRAINTS?

There are two additional DBCC validation commands: DBCC CHECKIDENT and DBCC CHECKCONSTRAINTS. These commands are not part of the normal check process. I blogged about CHECKIDENT here, and you use this command to check and re-seed values for an identity column. CHECKCONSTRAINTS is a command to verify that data in a column or table adheres to the defined constraints. This command should be run any time you run CHECKDB with the REPAIR_ALLOW_DATA_LOSS option. Repair in DBCC will fix corruption, and it doesn’t take constraints into consideration; it just alters data structures as needed so that data can be read and modified. As such, after running repair, constraint violations can exist, and you need to run CHECKCONSTRAINTS for the entire database to find them.
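For example, after a repair with REPAIR_ALLOW_DATA_LOSS you can check every constraint in the current database, including disabled ones:

-- Check enabled and disabled constraints across the entire current database
-- after running repair with REPAIR_ALLOW_DATA_LOSS.
DBCC CHECKCONSTRAINTS WITH ALL_CONSTRAINTS;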

What’s next?

We've only scratched the surface of consistency checking. This is a topic worthy of hours of discussion – not just in the how and why, but also what to do when corruption exists. If you plan on attending our Immersion Event for the Accidental DBA, and want to get a jump on the material, I recommend reading through the posts to which I've linked throughout, and also going through Paul's CHECKDB From Every Angle blog category, starting with the oldest post and working your way forward. Hopefully your experience with database corruption will be limited to testing and what you hear about from colleagues…but don't bet on it.

The Nuance of DBCC CHECKIDENT That Drives Me Crazy

When I put together my DBCC presentation a couple years ago I created a demo for the CHECKIDENT command.  I had used it a few times and figured it was a pretty straight-forward command.  In truth, it is, but there is one thing that I don’t find intuitive about it.  And maybe I’m the only one, but just in case, I figured I’d write a quick post about it.

CHECKIDENT is used to check the current value for an identity column in a table, and it can also be used to change the identity value.  The syntax is:

DBCC CHECKIDENT
 (
   table_name
     [, { NORESEED | { RESEED [, new_reseed_value ] } } ]
 )
[ WITH NO_INFOMSGS ]

To see it in action, let’s connect to a copy of the AdventureWorks2012 database and run it against the SalesOrderHeader table:

USE [AdventureWorks2012];
GO

DBCC CHECKIDENT ('Sales.SalesOrderHeader');

In the output we get:

Checking identity information: current identity value '75123', current column value '75123'.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

Hooray, seems pretty basic, right?  Well, did you know that running the command as I did above can change the identity seed if the identity value and column value do not match?  This is what I meant initially when I said it wasn’t intuitive.  I didn’t include any options with the command, therefore I do not expect it to make any changes.  In fact, you have to include an option to ensure you do not make a change.  Let’s take a look.

First we’ll create a table with an identity column and populate it with 1000 rows:

USE [AdventureWorks2012];
GO

CREATE TABLE [dbo].[identity_test] (
   id INT IDENTITY (1,1),
   info VARCHAR(10));
GO

SET NOCOUNT ON;
GO

INSERT INTO [dbo].[identity_test] (
   [info]
   )
   VALUES ('test data');
GO 1000

Now we’ll run CHECKIDENT, as we did above:

DBCC CHECKIDENT ('dbo.identity_test');

Checking identity information: current identity value '1000', current column value '1000'.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

Our results are what we expect.  Now let’s reseed the identity value down to 10:

DBCC CHECKIDENT ('dbo.identity_test', RESEED, 10);

Checking identity information: current identity value '1000'.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

The output doesn’t tell us specifically that the identity has been reseeded, so we’ll run CHECKIDENT again, but this time with the NORESEED option (different than what we ran initially):

DBCC CHECKIDENT ('dbo.identity_test', NORESEED);

Checking identity information: current identity value '10', current column value '1000'.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

Now we can see that the identity value and the current column are different, and because we included the NORESEED option, nothing happened.  And this is my point: if you do not include the NORESEED option, if the identity and column values do not match, the identity will reseed:

--first execution
DBCC CHECKIDENT ('dbo.identity_test');
PRINT ('first execution done');

--second execution
DBCC CHECKIDENT ('dbo.identity_test');
PRINT ('second execution done');

Checking identity information: current identity value '10', current column value '1000'.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
first execution done

Checking identity information: current identity value '1000', current column value '1000'.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
second execution done

So just in case I’m not the only one for whom this isn’t obvious: Make sure to include the NORESEED option when running DBCC CHECKIDENT.  Most of the time, the identity value probably matches the value for the column.  But that one time where it doesn’t, you may not want to reseed it…at least not right away.

SQL Server Training and Conferences for the Fall

There has been a lot of conversation this week in Twitterverse related to training and conferences in the SQL Server community.  I wanted to share some details and my own thoughts related to a few specific events in which I am involved (and it’s all very exciting!).

Training

First, Paul announced a new IE event that will kick off at the end of September: IE0: Immersion Event for the Accidental DBA.  I am thrilled to be an instructor for this course, and I’m really looking forward to teaching with Jonathan.  I worked with so many Accidental DBAs in my previous job – people who were the application administrator and also had to manage the application database.  We had a fairly general class that talked about databases, and we ended up tweaking that content to create a class solely focused on teaching those application administrators what they needed to do to support their SQL Server database.  In the beginning it was a half day class, but we kept coming up with more content we wanted to cover, and had expanded the training to a full day before I left.  How happy am I that Jon and I now have three days to help SQL Server application administrators, Accidental DBAs, and Junior DBAs learn the basics?!

If you’re interested in attending our class, or know someone who might like to attend, please check out the syllabus and registration page.  And if you have any questions about the course, please do not hesitate to contact me or Jon!

Conferences

Second, I am speaking at the SQLIntersection conference in Las Vegas this fall.  Kimberly blogged about it on Monday and you can see the entire lineup of sessions here.  I’ll be presenting three sessions:

  • Making the Leap From Profiler to Extended Events
  • Free Tools for More Free Time
  • Key Considerations for Better Schema Design

SQLintersection is a unique conference because it is paired with DEVintersection and SharePointintersection, and attendees have access to sessions across multiple Windows technologies.  I have more detail about my Extended Events session below, and the Free Tools session will cover usage scenarios for some of the applications I’ve discussed before in my Baselines sessions (e.g. PAL, RML Utilities).  The last session on schema design is geared toward developers – but is also appropriate for DBAs – and I have a lot of great ideas for the content, as I’ve just finished recording my next Pluralsight course, Developing and Deploying SQL Server ISV Applications, which should go live next week!

And finally, I will be speaking at the PASS Summit this October in Charlotte, NC!  I am very honored to have had the following session selected:

Making the Leap From Profiler to Extended Events

You know how you discover something wonderful and you want everyone you meet to try it?  That’s this session.  I had my light bulb moment with Extended Events and believe that everyone else should use it too.  But I get that there’s some hesitation, for a whole host of reasons, so I created this session to help people understand Extended Events better, using what they already know about Profiler and SQL Trace.  Change is hard, I get that, and people have used Profiler and Trace for years…over a decade in some cases!  But both are deprecated as of SQL Server 2012, and Extended Events is here to stay.  You need to learn XEvents not just because it’s what you’ll use for tracing going forward, but also because it can help you troubleshoot issues in ways you’ve never been able to before.

I will also be part of a panel discussion:

How to Avoid Living at Work: Lessons from Working at Home

When I joined SQLskills last summer and started working from home, I had to make significant adjustments.  Some days, working at home was just as challenging as work itself.  But 10 months in, I can’t imagine not working at home.  I’m really looking forward to being able to share my experiences, and also hear what my rock star colleagues have learned.  If you’re thinking of working from home, or even if you currently work from home, please join me, Tom LaRock, Aaron Bertrand, Andy Leonard, Steve Jones, Grant Fritchey, Karen Lopez, and Kevin Kline for what I’m sure will be an invaluable and engaging discussion.

Whew!  It’s going to be a busy fall filled with SQL Server events, but I wouldn’t have it any other way.  I am very much looking forward to all of these events – and I hope to see you at one of them!

T-SQL Tuesday #41: Presenting and Loving It

[T-SQL Tuesday logo]

I’ve been on a TSQL Tuesday hiatus, which was completely unintentional.  When I read Bob Pusateri’s topic for this month I knew I had to end my sabbatical and get back in the swing of these posts.  Bob’s question was, “How did you come to love presenting?”  Oh Bob…how much time do you have? :)

It goes back to graduate school.  I’ve blogged before about my mentor, Susan Brown, and in my original homage I mentioned that I would not be the speaker that I am today, were it not for her.  And I said that then, and still believe it now, because she found and created opportunities for me to speak publicly, and she provided feedback and encouragement – two things absolutely vital for any speaker to improve and succeed.

During my first year of graduate school the School of Kinesiology held a research day, designed to bring the entire department together to showcase our research efforts.  It’s very easy to have no idea what other researchers are doing not just within the University, but even within a small department like Kinesiology.  The idea was to explain our research, what we’d learned, and share ideas.  I gave a 10 minute session on the research we were doing with botulinum toxin (yes, Botox before it was cool for cosmetic reasons) and its effects on upper limb function in children with spasticity.  I was terrified.  I had spoken in front of groups before – I took a Communications (read: public speaking) class my junior year, I was a leader in my sorority (yes, you read that right) and spoke often, and I had done campus tours during my senior year (Bob has a great story about tours in his post).  But speaking to hundreds of people, who were my peers and professors?  That was a whole new ballgame.

I can’t remember how many slides I created, at least 10, before Susan told me that she typically used one slide for each 10 minutes of a talk.  I remember thinking she was crazy…talking for 10 minutes in front of the entire department (and many other researchers from different areas of the University) seemed like an eternity.  [What’s ironic is that I can’t always finish recording a SQLskills Insider Video in less than 10 minutes these days.]

At any rate, I remember standing at the front of the room in the Michigan League Ballroom feeling incredibly uncomfortable.  Not only were there hundreds of people there, but I was wearing a dress (if you know me, you’re laughing).  I made it through my 10 minutes with one slight timing issue – I had someone play a video, which taught me the importance of having the entire presentation under my control – and I cannot remember if it was great or horrible.  But I didn’t walk away thinking, “I’ll never do this again.”

Soon after, Susan asked if I would like to take over teaching the Motor Control portion of the introductory Movement Science course required for all Kinesiology students.  The course was broken into three sections, Motor Control, Biomechanics and Exercise Physiology, with students rotating between the sections and a different instructor for each.  This meant I would teach the same material three times in a semester, which sounds boring but was ideal for a first-time instructor.  And I would get paid.  I said yes.

Susan gave me all of her materials, and I converted all of her overheads (yes, overheads) to PowerPoint.  Then I started reading.  While I had taken the same class myself as a sophomore, had taken many advanced Motor Control classes since then, and was getting a master’s degree in Motor Control, teaching the course was something else entirely.  You have to know the information at a different level.  During those early days I often thought of the learning model followed in medicine, “See one, do one, teach one.”  I’d learned it, I’d been tested on it, now it was time for me to teach it.

Some may state that teaching is not the same as presenting.  If you get down into the details, that’s probably true, but it’s not a topic I want to debate here.  To me, they are one and the same.  For me, when I present at a User Group, a SQLSaturday or a conference like the PASS Summit, I am teaching.

And that is what I love: teaching.  And I realized it in graduate school, when I was teaching that introductory Movement Science course.  It happened in the first semester, in the very first rotation.  I cannot remember the name of the student, but she grasped what I was teaching, she understood.  She asked questions during class, she took notes, and she aced the quizzes and the test.  I taught, she learned.  That was pretty cool.

Now…do I believe that I had that much of an impact on her learning?  No.  Do I believe that if I weren’t such a fantastic teacher that she wouldn’t have done so well?  Absolutely not.  She was a smart kid, motivated, and interested in the material.  She would have done well no matter what.  But in those few weeks I realized that I had something to teach those who wanted to learn, and I realized that I wanted to be good at teaching – for them.

As a student, I appreciated good instructors.  Not every instructor is fully “on” every single day – teaching is hard, and the semester is long.  But there were many instructors whose classes I enjoyed, not just for the material, but for the way they explained it.  Susan was that type of instructor.  I wanted to be that type of instructor.  So I worked at it.  For some, teaching and presenting come naturally.  For many, we have to work at it.  And to work at it, you practice.  I taught that same section of that same course for two years.  Yes, 12 times.  But that experience established a foundation upon which I’ve been building ever since.

In my first technology job I wore many hats, and one of them was software trainer.  In my next job, I sought out opportunities to teach others, and eventually, I found the SQL Community and realized that I could present at User Groups, SQLSaturdays and conferences, like so many others.  And here I am.  I still love teaching, I love it when you see the light bulb go on for someone. I love it when you hear that someone took what they learned and applied it to their environment, and then learned something new.  And I really appreciate it when attendees come back and tell me what they learned – as I have not seen every use case and am always, always learning myself.

One of the things that I value most about SQL Server is that it’s vast and it’s always changing.  As such, my learning never ends, and the opportunity to find new things to teach never ends.  As my good friend Allen White ( b | t ) always says, “I can learn from all of you” (and that means you, dear reader).  If you want to share what you learn, I encourage you to teach.  Don’t think of it as presenting – that word can be scary.  Think of it as teaching.  Everyone has great and interesting experiences.  Learn how to tell a story, and share what you know.

SQL Server Maintenance Plans and Parallelism – Index Rebuilds

In my previous post, SQL Server Maintenance Plans and Parallelism – CHECKDB, we looked at the degree of parallelism used when CHECKDB is run.  It ultimately depends on SQL Server Edition and the max degree of parallelism setting for the instance, and you cannot override it for an individual CHECKDB statement.  That is not the case for index rebuilds (today’s topic, as you probably surmised!).

Index Rebuilds

The max degree of parallelism can be configured for index rebuilds using WITH (MAXDOP = n):

USE [AdventureWorks2012];
GO

ALTER INDEX [IX_SalesOrderDetail_ProductID] ON [Sales].[SalesOrderDetail]
     REBUILD WITH (MAXDOP = 8);
GO

If this option is included, it overrides the max degree of parallelism value configured for the instance. For example, I can rebuild the IX_SalesOrderDetail_ProductID index on Sales.SalesOrderDetail with MAXDOP set to 8, even though MAXDOP is set to 4 for the instance.  If WITH (MAXDOP = n) is not specified for an ALTER INDEX … REBUILD statement, then SQL Server will use the MAXDOP value set for the instance.
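If you’re not sure what the instance-level value actually is before deciding whether to add WITH (MAXDOP = n), a quick check against sys.configurations does the trick (just a sketch; sp_configure shows the same thing once 'show advanced options' is enabled):

SELECT [name], [value], [value_in_use]
FROM sys.configurations
WHERE [name] = N'max degree of parallelism';
GO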

Now, unfortunately, parallel index operations are only permitted in Enterprise Edition.  If you’re running Standard Edition, you’re stuck with single-threaded rebuilds, just like you’re stuck with single-threaded integrity checks.  Despite this sad news, I thought I’d run through a demo that shows the max degree of parallelism used during the index rebuild. I’m going to run ALTER INDEX REBUILD for a selected index in the AdventureWorks2012 database, and I’ll use Extended Events to capture each statement executed (sp_statement_completed event), and the actual query plan for the statement (query_post_execution_showplan event).

**Important note here again: it is NOT recommended to capture the query_post_execution_showplan event against a live, production system.  This event generates significant performance overhead, and you are warned of this when configuring the session via the GUI.  If you repeat any of the demos here, please make sure to execute them against a test environment.  It’s very important to me that you do not bring down your production environment.**

Here are the statements to create the event session, start it, run the ALTER INDEX … REBUILD statements, then stop the event session.  As in my previous post, I am using a file target to capture the output, and the path is C:\temp.  You may need to modify this path for your environment.  I still have max degree of parallelism set to 4 for my instance, but we’ll set it before we run anything just for good measure.

sp_configure 'show advanced options', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO
sp_configure 'max degree of parallelism', 4;
GO
RECONFIGURE WITH OVERRIDE;
GO

CREATE EVENT SESSION [CapturePlans] ON SERVER
ADD EVENT sqlserver.query_post_execution_showplan(
     ACTION(sqlserver.plan_handle,sqlserver.sql_text)),
ADD EVENT sqlserver.sp_statement_completed(
     ACTION(sqlserver.sql_text))
ADD TARGET package0.event_file(SET filename=N'C:\temp\CapturePlans.xel'),
ADD TARGET package0.ring_buffer(SET max_memory=(102400))
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,
     MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=OFF);
GO

ALTER EVENT SESSION [CapturePlans]
     ON SERVER
     STATE=START;
GO

USE [AdventureWorks2012];
GO

ALTER INDEX [IX_SalesOrderDetailEnlarged_ProductID] ON [Sales].[SalesOrderDetailEnlarged]
     REBUILD WITH (MAXDOP = 8);
GO

ALTER INDEX [IX_SalesOrderDetailEnlarged_ProductID] ON [Sales].[SalesOrderDetailEnlarged]
     REBUILD;
GO

ALTER EVENT SESSION [CapturePlans]
     ON SERVER
     STATE=STOP;
GO

Note that I used a different version of the SalesOrderDetail table named SalesOrderDetailEnlarged.  This table has over 4 million rows in it and was populated using Jonathan’s Create Enlarged AdventureWorks Table script to ensure I’d have a table large enough to warrant a parallel rebuild.  After I stopped the event session I opened the .xel file from C:\temp in Management Studio and added the sql_text column to the display so I could easily find the ALTER INDEX statements.
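If you prefer T-SQL over the UI, you can also read the file target directly with sys.fn_xe_file_target_read_file.  This is just a sketch, and it assumes the C:\temp path used in the session definition above; it shreds the sql_text action out of the event XML so the ALTER INDEX statements are easy to spot:

SELECT
     [object_name] AS [event_name],
     CAST([event_data] AS XML).value('(event/action[@name="sql_text"]/value)[1]', 'NVARCHAR(MAX)') AS [sql_text],
     CAST([event_data] AS XML) AS [event_data_xml]
FROM sys.fn_xe_file_target_read_file(N'C:\temp\CapturePlans*.xel', NULL, NULL, NULL);
GO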

The screen shot below is from the ALTER INDEX statement with MAXDOP = 8 included.  The query_post_execution_showplan event is highlighted, you can see the sql_text, and I hovered over the showplan_xml to show the first part of the XML version of the plan.  Note the red box around QueryPlan DegreeOfParallelism…it’s 8, as expected:

[Screenshot: Extended Events output showing QueryPlan DegreeOfParallelism = 8]

ALTER INDEX … REBUILD WITH (MAXDOP = 8)

If you’re playing along at home in your test environment, you can click on the Query Plan to see the graphical view, or double-click the XML to view the plan that way.  Now check out the screen capture below, which is for the ALTER INDEX statement that did not include the MAXDOP option:

[Screenshot: Extended Events output showing QueryPlan DegreeOfParallelism = 4]

ALTER INDEX … REBUILD (default option)

The max degree of parallelism for the plan is 4 because if the MAXDOP option is not included, SQL Server uses the max degree of parallelism set for the instance.  Note that this holds true when parallelism is disabled for an instance (max degree of parallelism = 1):

sp_configure 'max degree of parallelism', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO

ALTER EVENT SESSION [CapturePlans]
 ON SERVER
 STATE=START;
GO

USE [AdventureWorks2012];
GO

ALTER INDEX [IX_SalesOrderDetailEnlarged_ProductID] ON [Sales].[SalesOrderDetailEnlarged]
     REBUILD;
GO

ALTER EVENT SESSION [CapturePlans]
 ON SERVER
 STATE=STOP;
GO

[Screenshot: Extended Events output showing DegreeOfParallelism = 0 and NonParallelPlanReason = MaxDOPSetToOne]

ALTER INDEX … REBUILD (default option) – MAXDOP = 1 for instance

The plan shows a DegreeOfParallelism of 0 – this means that the query did not use parallelism – and that the plan includes a NonParallelPlanReason* of “MaxDOPSetToOne”.  Therefore, if MAXDOP is set to 1 for an instance, and the default ALTER INDEX … REBUILD statements are used to rebuild indexes – where the MAXDOP option is not included – then rebuilds will be single-threaded.  For some well-known applications (e.g. SharePoint, SAP, BizTalk)  it is recommended to set the max degree of parallelism to 1 for the instance.  While that option may be appropriate for application-specific queries, it means that your index rebuild operations may run longer than if parallelism was enabled.  It may be worth modifying your index maintenance script to include the MAXDOP option for ALTER INDEX REBUILD statements.
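For example, a nightly job could leave the instance at a max degree of parallelism of 1 for the application and still rebuild in parallel on Enterprise Edition by specifying the option on the statement.  This is hypothetical, reusing the enlarged demo table; pick the table, index, and value that make sense for your system:

-- The instance-level setting stays at 1 for the application's queries;
-- this rebuild is still allowed to use up to 4 CPUs.
ALTER INDEX ALL ON [Sales].[SalesOrderDetailEnlarged]
     REBUILD WITH (MAXDOP = 4);
GO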

In the event that you have a max degree of parallelism value above 1 specified for the instance, but you’re not sure what the “right” MAXDOP value should be for your index rebuilds, you can let SQL Server decide.  If you include the WITH (MAXDOP = 0) option in your rebuild syntax, then the optimizer will determine how many CPUs to use, which could be anywhere from 1 to all of the CPUs available to SQL Server.  This is the recommended setting per Books Online, but I would caution you to use this option only if you’re comfortable with SQL Server potentially using all CPUs for a rebuild.  If you happen to be running other tasks or processes in the database while the rebuilds run – not ideal, but for a 24×7 solution you often don’t have a choice – then you should specify a MAXDOP value below the total number of CPUs available.
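If you do go that route, one way to see how many CPUs SQL Server can actually use, so that you can pick a value below the total, is to count the visible online schedulers (a quick sketch; what fraction of that number you choose is up to you):

SELECT COUNT(*) AS [visible_online_schedulers]
FROM sys.dm_os_schedulers
WHERE [status] = N'VISIBLE ONLINE';
GO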

Finally, in case you’re wondering about parallelism and reorganizing indexes…the WITH (MAXDOP = n) option is not available for ALTER INDEX REORGANIZE, as index reorganization is always a single-threaded operation.  The final post in this series will cover parallelism and the UPDATE STATISTICS command, and if you’re manually managing statistics and specifying the sample, you don’t want to miss it!

*If you’re interested, Joe talks about the NonParallelPlanReason attribute  in his post, SQL Server 2012 Execution Plan’s NonParallelPlanReason, which may be useful when you’re digging into execution plans in SQL Server 2012 and higher.