The Accidental DBA (Day 26 of 30): Monitoring Disk I/O

This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

Database storage can seem like a black box. A DBA takes care of databases, and those databases often reside somewhere on a SAN – space simply presented to a DBA as a drive letter representing some amount of space. But storage is about more than a drive letter and a few hundred GBs. Yes, having enough space for your database files is important. But I often see clients plan for capacity and not performance, and this can become a problem down the road. As a DBA, you need to ask your storage admin not just for space, but for throughput, and the best way to back up your request is with data.

I/O Data in SQL Server
I’ve mentioned quite a few DMVs in these Accidental DBA posts, and today is no different. If you want to look at I/O from within SQL Server, you want to use the sys.dm_io_virtual_file_stats DMV. Prior to SQL Server 2005 you could get the same information using the fn_virtualfilestats function, so don’t despair if you’re still running SQL Server 2000!  Paul has a query that I often use to get file information in his post, How to examine IO subsystem latencies from within SQL Server. The sys.dm_io_virtual_file_stats DMV accepts database_id and file_id as inputs, but if you join over to sys.master_files, you can get information for all your database files. If I run this query against one of my instances, and order by write latency (desc) I get:

Output from sys.dm_io_virtual_file_stats, ordered by write latency

This data makes it look like I have some serious disk issues – a write latency of over 1 second is disheartening, especially considering I have an SSD in my laptop! I include this screenshot because I want to point out that this data is cumulative. It only resets on a restart of the instance. Large I/O operations against a database – such as index rebuilds – can greatly skew the data, and it may take time for the numbers to normalize again. Keep this in mind not only when you view the data at a point in time, but when you share findings with other teams. Joe has a great post that talks about this in more detail, Avoid false negatives when comparing sys.dm_io_virtual_file_stats data to perfmon counter data, and the same approach applies to data from storage devices that your SAN administrators may use.

The information in the sys.dm_io_virtual_file_stats DMV is valuable not only because it shows latencies, but also because it tells you which files have the highest number of reads and writes and the most MBs read and written. You can determine which databases (and files) are your heavy hitters and trend that over time to see if it changes and how.
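Paul’s post has the full query; as a rough sketch of the idea (not his exact script), something like the following joins the DMV to sys.master_files and computes per-file averages:

-- A minimal sketch: per-file volume and average latency, cumulative since the last restart.
SELECT
    DB_NAME(vfs.database_id) AS [Database],
    mf.physical_name,
    vfs.num_of_reads,
    vfs.num_of_writes,
    vfs.num_of_bytes_read / 1048576 AS [MB Read],
    vfs.num_of_bytes_written / 1048576 AS [MB Written],
    CASE WHEN vfs.num_of_reads = 0 THEN 0
         ELSE vfs.io_stall_read_ms / vfs.num_of_reads END AS [Avg Read Latency ms],
    CASE WHEN vfs.num_of_writes = 0 THEN 0
         ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS [Avg Write Latency ms]
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON vfs.database_id = mf.database_id
    AND vfs.file_id = mf.file_id
ORDER BY [Avg Write Latency ms] DESC;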

I/O Data in Windows

If you want to capture I/O data from Windows, Performance Monitor is your best bet. I like to look at the following counters for each disk:

  • Avg. Disk sec/Read
  • Avg. Disk Bytes/Read
  • Avg. Disk sec/Write
  • Avg. Disk Bytes/Write

Jon talked about PerfMon counters earlier in this series, and the counters above tell you about latency and throughput. Latency is how long an I/O request takes, but it can be measured at different points along the layers of a solution. Normally we are concerned with latency as measured from SQL Server. Within Windows, latency is the time from when Windows initiated the I/O request to the completion of the request. As Joe mentioned in his post, you may see some variation between what you see for latency from SQL Server versus from Windows.

When we measure latency using Windows Performance Monitor, we look at Avg. Disk sec/Read and Avg. Disk sec/Write. Disk cache, whether on the disk itself, a controller card, or the storage system, affects read and write values. Writes are typically written to cache and should complete very quickly. Reads, when not in cache, have to be pulled from disk and that can take longer. While it’s easy to think of latency as being entirely related to disk, it’s not. Remember that we’re really talking about the I/O subsystem, and that includes the entire path from the server itself all the way to the disks and back. That path includes things like HBAs in the server, switches, controllers in the SAN, cache in the SAN, and the disks themselves. You can never assume that latency is high simply because the disks can’t keep up. Sometimes the queue depth setting for the HBAs is too low, or perhaps you have an intermittently bad connection with a failing component like a GBIC (gigabit interface converter) or a bad port card. You have to take the information you have (latency), share it with your storage team, and ask them to investigate. And hopefully you have a savvy storage team that knows to investigate all parts of the path. A picture is worth a thousand words in more complex environments: it is often best to draw out, with the storage administrator, the mapping from the OS partition to the SAN LUN or volume. This should generate a discussion about the server, the paths to the SAN, and the SAN itself. Remember that what matters is getting the I/O to the application. If the I/O leaves the disk but gets stuck along the way, that adds to latency. There could be an alternate path available (multi-pathing), but maybe not.

Our throughput, measured by Avg. Disk Bytes/Read and Avg. Disk Bytes/Write, tells us how much data is moving between the server and storage. This is valuable to understand, and often more useful than counting I/Os, because we can use it to understand how much data our disks will need to be able to read and write to keep up with demand. Ideally you capture this information when the system is optimized – simple things like adding indexes to reduce full table scans can affect the amount of I/O – but often you will need to just work within the current configuration.

Capturing Baselines

I alluded to baselines when discussing the sys.dm_io_virtual_file_stats DMV, and if you thought I was going to leave it at that then you must not be aware of my love for baselines!
You will want to capture data from SQL Server and Windows to provide throughput data to your storage administrator. You need this data to procure storage on the SAN that will not only give you enough space to accommodate expected database growth, but that will also give you the IOPs and MB/sec your databases require.

Beyond a one-time review of I/O and latency numbers, you should set up a process to capture the data on a regular basis so you can identify if things change and when. You will want to know if a database suddenly starts issuing more I/Os (did someone drop an index?) or if the change in I/Os is gradual. And you need to make sure that I/Os are completing in the timeframe that you expect. Remember that a SAN is shared storage, and you don’t always know with whom you’re sharing that storage. If another application with high I/O requirements is placed on the same set of disks, and your latency goes up, you want to be able to pinpoint that change and provide metrics to your SAN administrator that support the change in performance in your databases.
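As a rough sketch of what that capture could look like (the table and column names here are just examples), you snapshot the cumulative DMV data to a table on a schedule:

-- Hypothetical baseline table; the difference between two snapshots is the I/O for that interval.
CREATE TABLE dbo.FileStatsBaseline
(
    CaptureDate          DATETIME NOT NULL DEFAULT GETDATE(),
    database_id          INT      NOT NULL,
    file_id              INT      NOT NULL,
    num_of_reads         BIGINT   NOT NULL,
    num_of_writes        BIGINT   NOT NULL,
    num_of_bytes_read    BIGINT   NOT NULL,
    num_of_bytes_written BIGINT   NOT NULL,
    io_stall_read_ms     BIGINT   NOT NULL,
    io_stall_write_ms    BIGINT   NOT NULL
);

-- Run on a schedule (e.g. every 15 minutes) via a SQL Agent job.
INSERT INTO dbo.FileStatsBaseline
    (database_id, file_id, num_of_reads, num_of_writes,
     num_of_bytes_read, num_of_bytes_written, io_stall_read_ms, io_stall_write_ms)
SELECT database_id, file_id, num_of_reads, num_of_writes,
       num_of_bytes_read, num_of_bytes_written, io_stall_read_ms, io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL);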

Summary

As a DBA you need to know how your databases perform when it comes to reads and writes, and it’s a great idea to get to know your storage team. It’s also a good idea to understand where your databases really “live” and what other applications share the same storage. When a performance issue comes up, use your baseline data as a starting point, and don’t hesitate to pull in your SAN administrators to get more information. While there’s a lot of data readily available for DBAs to use, you cannot get the entire picture on your own. It may not hurt to buy your storage team some pizza or donuts and make some new friends 🙂 Finally, if you’re interested in digging deeper into the details of SQL Server I/O, I recommend starting with Bob Dorr’s work:

Our online training (Pluralsight) courses that can help you with this topic:

The Accidental DBA (Day 25 of 30): Wait Statistics Analysis

This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

For the last set of posts in our Accidental DBA series we’re going to focus on troubleshooting, and I want to start with Wait Statistics.  When SQL Server executes a task, if it has to wait for anything – a lock to be released from a page, a page to be read from disk into memory, a write to the transaction log to complete – then SQL Server records that wait and the time it had to wait.  This information accumulates, and can be queried using the sys.dm_os_wait_stats DMV, which was first available in SQL Server 2005.  Since then, the waits and queues troubleshooting methodology has been a technique DBAs can use to identify problems, and areas for optimizations, within an environment.

If you haven’t worked with wait statistics, I recommend starting with Paul’s wait stats post, and then working through the SQL Server Performance Tuning Using Wait Statistics: A Beginners Guide whitepaper.

Viewing Wait Statistics

If you run the following query:

SELECT *
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

You will get back output that isn’t that helpful, as you can see below:

sys.dm_os_wait_stats output

It looks like FT_IFTS_SCHEDULER_IDLE_WAIT is the biggest wait, and SQL Server’s waited for 1930299679 ms total.  This is kind of interesting, but not what I really need to know.  How do I really use this data?  It needs some filtering and aggregation.  There are some waits that aren’t going to be of interest because they occur all the time and are irrelevant for our purposes; we can filter out those wait types.  To make the most of our wait stats output, I really want to know the highest wait based on the percentage of time spent waiting overall, and the average wait time for that wait.  The query that I use to get this information is the one from Paul’s post (mentioned above).  I won’t paste it here (you can get it from his post) but if I run that query against my instance, now I get only three rows in my output:

sys.dm_os_wait_stats output with wait_types filtered out

If we reference the various wait types listed in the MSDN entry for sys.dm_os_wait_stats, we see that the SQLTRACE_WAIT_ENTRIES wait type “Occurs while a SQL Trace event queue waits for packets to arrive on the queue.”

Well, this instance is on my local machine and isn’t very active, so that wait is likely due to the default trace that’s always running.  In a production environment, I probably wouldn’t see that wait, and if I did, I’d check to see how many SQL Traces were running.  But for our purposes, I’m going to add that as a wait type to filter out, and then re-run the query.  Now there are more rows in my output, and the percentage for the PAGEIOLATCH_SH and LCK_M_X waits has changed:

sys.dm_os_wait_stats output with SQLTRACE_WAIT_ENTRIES also filtered out

If you review the original query, you will see that the percentage calculation for each wait type uses the wait_time_ms for the wait divided by the SUM of wait_time_ms for all waits.  But “all waits” are those wait types not filtered by the query.  Therefore, as you change what wait types you do not consider, the calculations will change.  Keep this in mind when you compare data over time or with other DBAs in your company – it’s a good idea to make sure you’re always running the same query that filters out the same wait types.
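Paul’s post has the complete query and filter list; this simplified sketch (with only a handful of wait types excluded) shows where those percentages come from:

-- Simplified sketch: each wait's share of total wait time, after excluding a few benign waits.
-- Paul's query excludes many more wait types; the filter list drives the percentages.
WITH Waits AS
(
    SELECT wait_type, wait_time_ms, waiting_tasks_count
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN
        (N'FT_IFTS_SCHEDULER_IDLE_WAIT', N'SQLTRACE_WAIT_ENTRIES',
         N'LAZYWRITER_SLEEP', N'CHECKPOINT_QUEUE', N'SLEEP_TASK')
)
SELECT wait_type,
       wait_time_ms,
       CONVERT(DECIMAL(5, 2), 100.0 * wait_time_ms / SUM(wait_time_ms) OVER()) AS [Percentage],
       CASE WHEN waiting_tasks_count = 0 THEN 0
            ELSE wait_time_ms / waiting_tasks_count END AS [Avg Wait ms]
FROM Waits
ORDER BY wait_time_ms DESC;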

Capturing Wait Statistics

So far I’ve talked about looking at wait statistics at a point in time.  As a DBA, you want to know what waits are normal for each instance.  And there will be waits for every instance; even if it’s highly tuned or incredibly low volume, there will be waits.  You need to know what’s normal, and then use those values when the system is not performing well.

The easiest way to capture wait statistics is to snapshot the data to a table on a regular basis, and you can find queries for this process in my Capturing Baselines for SQL Server: Wait Statistics article on SQLServerCentral.com.  Once you have your methodology in place to capture the data, review it on a regular basis to understand your typical waits, and identify potential issues before they escalate.  When you do discover a problem, then you can use wait statistics to aid in your troubleshooting.
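The article walks through the full process; at its core it is a table and a scheduled INSERT, along the lines of this sketch (names are only examples):

-- Hypothetical snapshot table for cumulative wait statistics.
CREATE TABLE dbo.WaitStatsBaseline
(
    CaptureDate         DATETIME     NOT NULL DEFAULT GETDATE(),
    wait_type           NVARCHAR(60) NOT NULL,
    waiting_tasks_count BIGINT       NOT NULL,
    wait_time_ms        BIGINT       NOT NULL,
    signal_wait_time_ms BIGINT       NOT NULL
);

-- Snapshot the DMV on a schedule via a SQL Agent job.
INSERT INTO dbo.WaitStatsBaseline
    (wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms)
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats;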

Using the Data

At the time that you identify a problem in your environment, a good first step is to run your wait statistics query and compare the output to your baseline numbers.  If you see something out of the ordinary, you have an idea where to begin your investigation.  But that’s it; wait statistics simply tell you where to start searching for your answer.  Do not assume that your highest wait is the problem, or even that it’s a problem at all.  For example, a common top wait is CXPACKET, and CXPACKET waits indicate that parallelism is used, which is expected in a SQL Server environment.  If that’s your top wait, does that mean you should immediately change the MAXDOP setting for the instance?  No.  You may end up changing it down the road, but a better direction is to understand why that’s the highest wait.  You may have CXPACKET waits because you’re missing some indexes and there are tons of table scans occurring.  You don’t need to change MAXDOP, you need to start tuning.
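One way to start understanding why a particular wait is high is to look at what is waiting right now, rather than at the cumulative totals. Here is a quick sketch using sys.dm_os_waiting_tasks, joined to the executing requests so you can see the statements involved:

-- What is waiting at this moment, and on what?
SELECT wt.session_id,
       wt.wait_type,
       wt.wait_duration_ms,
       wt.blocking_session_id,
       wt.resource_description,
       r.command,
       t.text AS query_text
FROM sys.dm_os_waiting_tasks AS wt
JOIN sys.dm_exec_requests AS r
    ON wt.session_id = r.session_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE wt.session_id > 50;  -- ignore system sessions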

Another good example is the WRITELOG wait type.  WRITELOG waits occur when SQL Server is waiting for a log flush to complete.  A log flush occurs when information needs to be written to the database’s transaction log.  A log flush should complete quickly, because when there is a delay in a log write, then the task that initiated the modification has to wait, and tasks may be waiting behind that.  But a log flush doesn’t happen instantaneously every single time, so you will have WRITELOG waits.  If you see WRITELOG as your top wait, don’t immediately assume you need new storage.  You should only assume that you need to investigate further.  A good place to start would be looking at read and write latencies, and since I’ll be discussing monitoring IO more in tomorrow’s post we’ll shelve that discussion until then.

You can get detailed information on what the various wait types mean from our comprehensive SQL Server Wait Types and Latch Classes library. Here’s a link to the CXPACKET page, for example.

As you can see from these two examples, wait statistics are a starting point.  They are very valuable – it’s easy to think of them as “the answer”, but they’re not.  Wait statistics do not tell you the entire story about a SQL Server implementation.  There is no one “thing” that tells you the entire story, which is why troubleshooting can be incredibly frustrating, yet wonderfully satisfying when you find the root of a problem.  Successfully troubleshooting performance issues in SQL Server requires an understanding of all the data available to aid in your discovery and investigation, understanding where to start, and what information to capture to correlate with other findings.

Our online training (Pluralsight) courses that can help you with this topic:

The Accidental DBA (Day 23 of 30): SQL Server HA/DR Features

This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

Two of the most important responsibilities for any DBA are protecting the data in a database and keeping that data available.  As such, a DBA may be responsible for creating and testing a disaster recovery plan, and creating and supporting a high availability solution.  Before you create either, you have to know your RPO and RTO, as Paul talked about a couple weeks ago.  Paul also discussed what you need to consider when developing a recovery strategy, and yesterday Jon covered considerations for implementing a high availability solution.

In today’s post, I want to provide some basic information about disaster recovery and high availability solutions used most often.  This overview will give you an idea of what options might be a fit for your database(s), but you’ll want to understand each technology in more detail before you make a final decision.

Backup/Restore

No matter what type of implementation you support, you need a disaster recovery plan.  Your database may not need to be highly available, and you may not have the budget to create an HA solution even if the business wants one.  But you must have a method to recover from a disaster.  Every version, and every edition, of SQL Server supports backup and restore.  A bare bones DR plan requires a restore of the most recent database backups available – this is where backup retention comes into play.  Ideally you have a location to which you can restore.  You may have a server and storage ready to go, 500 miles away, just waiting for you to restore the files.  Or you may have to purchase that server, install it from the ground up, and then restore the backups.  While the plan itself is important, what matters most is that you have a plan.
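At its most basic, that plan is built on backup and restore commands like these (the database name and paths are only examples):

-- On the production server: take regular full (and, depending on recovery model, log) backups.
BACKUP DATABASE SalesDB
TO DISK = N'\\backupshare\SalesDB_Full.bak'
WITH CHECKSUM, INIT;

-- On the recovery server: restore the most recent backups available.
RESTORE DATABASE SalesDB
FROM DISK = N'\\backupshare\SalesDB_Full.bak'
WITH CHECKSUM, REPLACE, RECOVERY;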

Log Shipping

Log shipping exists on a per-user-database level and requires the database recovery model to use either full or bulk-logged recovery (see Paul’s post for a primer on the differences).  Log shipping is easy to understand – it’s backup from one server and restore on another – but the process is automated through jobs.  Log shipping is fairly straightforward to configure and you can use the UI or script it out (prior to SQL Server 2000 there was no UI).  Log shipping is available in all currently supported versions of SQL Server, and all editions.
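Under the covers the jobs are doing nothing more exotic than this (names and paths are only examples; the wizard builds and schedules the real jobs for you):

-- On the primary: back up the transaction log.
BACKUP LOG SalesDB
TO DISK = N'\\logshipshare\SalesDB_001.trn';

-- On the secondary: restore it, leaving the database ready for the next log backup.
-- (WITH STANDBY instead of NORECOVERY allows limited read-only access between restores.)
RESTORE LOG SalesDB
FROM DISK = N'\\logshipshare\SalesDB_001.trn'
WITH NORECOVERY;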

You can log ship to multiple locations, creating additional redundancy, and you can configure a database for log shipping if it’s the primary database in a database mirroring or availability group configuration.  You can also use log shipping when replication is in use.

With log shipping you can allow limited read-only access on secondary databases for reporting purposes (make sure you understand the licensing impact), and you can take advantage of backup compression to reduce the size of the log backups and therefore decrease the amount of data sent between locations. Note: backup compression was first available only in SQL Server 2008 Enterprise, but starting in SQL Server 2008 R2 it was available in Standard Edition.

While Log Shipping is often used for disaster recovery, you can use it as a high availability solution, as long as you can accept some amount of data loss and some amount of downtime.  Alternatively, in a DR scenario, if you implement a longer delay between backup and restore, then if data is changed or removed from the primary database – either purposefully or accidentally – you can possibly recover it from the secondary.

Failover Cluster Instance

A Failover Cluster Instance (also referred to as an FCI or SQL FCI) exists at the instance level and can seem scary to newer DBAs because it requires a Windows Server Failover Cluster (WSFC).  A SQL FCI usually requires more coordination with other teams (e.g. server, storage) than other configurations.  But clustering is not incredibly difficult once you understand the different parts involved.  A Cluster Validation Tool was made available in Windows Server 2008, and you should ensure the supporting hardware successfully passes its configuration tests before you install SQL Server, otherwise you may not be able to get your instance up and running.

SQL FCIs are available in all currently supported versions of SQL Server, and can be used with Standard Edition (2 nodes only), Business Intelligence Edition in SQL Server 2012 (2 nodes only), and Enterprise Edition.  The nodes in the cluster share the same storage, so there is only one copy of the data.  If a failure occurs for a node, SQL Server fails over to another available node.
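From within SQL Server you can confirm whether an instance is clustered and which nodes can host it; a quick check:

-- Is this instance a failover cluster instance, and which node is it currently running on?
SELECT SERVERPROPERTY('IsClustered') AS IsClustered,
       SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS CurrentNode;

-- List the nodes in the underlying Windows cluster.
SELECT NodeName
FROM sys.dm_os_cluster_nodes;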

If you have a two-node WSFC with only one instance of SQL Server, one of the nodes is always unused, basically sitting idle.  Management may view this as a waste of resources, but understand that it is there as insurance: the second node keeps SQL Server available if the first node fails.  You can install a second SQL Server instance and use log shipping or mirroring with snapshots to create a secondary copy of the database for reporting (again, pay attention to licensing costs).  Or, those two instances can both support production databases, making better use of the hardware.  However, be aware of resource utilization when a node fails and both instances run on the same node.

Finally, a SQL FCI can provide intra-data center high availability, but because it uses shared storage, you do have a single point of failure.  A SQL FCI can be used for cross-data center disaster recovery if you use multi-site SQL FCIs in conjunction with storage replication.  This does require a bit more work and configuration, because you have more moving parts, and it can become quite costly.

Database Mirroring

Database mirroring is configured on a per-user-database basis and the database must use the Full recovery model.  Database mirroring was introduced in SQL Server 2005 SP1 and is available in Standard Edition (synchronous only) and Enterprise Edition (synchronous and asynchronous).  A database can be mirrored to only one secondary server, unlike log shipping.

Database mirroring is extremely easy to configure using the UI or scripting.  A third instance of SQL Server, configured as a witness, can detect the availability of the primary and mirror servers.  In synchronous mode with automatic failover, if the primary server becomes unavailable and the witness can still see the mirror, failover will occur automatically if the database is synchronized.
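The heart of the scripted setup is a pair of ALTER DATABASE statements that point each side at the other’s mirroring endpoint. A sketch with example server and database names (creating the endpoints and restoring the database WITH NORECOVERY on the mirror come first, and are omitted here):

-- On the mirror server (database already restored WITH NORECOVERY):
ALTER DATABASE SalesDB
SET PARTNER = N'TCP://PrimaryServer.corp.local:5022';

-- On the principal server:
ALTER DATABASE SalesDB
SET PARTNER = N'TCP://MirrorServer.corp.local:5022';

-- Optionally, on the principal, add a witness for automatic failover:
ALTER DATABASE SalesDB
SET WITNESS = N'TCP://WitnessServer.corp.local:5022';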

Note that you cannot mirror a database that contains FILESTREAM data, and mirroring is not appropriate if you need multiple databases to failover simultaneously, or if you use cross-database transactions or distributed transactions.  Database mirroring is considered a high availability solution, but it can also be used for disaster recovery, assuming the lag between the primary and mirror sites is not so great that the mirror database is too far behind the primary for RPO to be met.  If you’re running Enterprise Edition, snapshots can be used on the mirror server for point-in-time reporting, but there’s a licensing cost that comes with reading off the mirror server (as opposed to if it’s used only when a failover occurs).

Availability Groups

Availability groups (AGs) were introduced in SQL Server 2012 and require Enterprise Edition.  AGs are configured for one or more databases, and if a failover occurs, the databases in a group fail over together.  They allow three synchronous replicas (the primary and two secondaries), whereas database mirroring allowed only one synchronous secondary, and up to four asynchronous replicas.  Failover in an Availability Group can be automatic or manual.  Availability Groups do require a Windows Server Failover Cluster (WSFC), but do not require a SQL FCI.  An AG can be hosted on SQL FCIs, or on standalone servers within the WSFC.
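A heavily simplified sketch of creating an AG (server, database, and group names are only examples; the mirroring endpoints, the listener, and joining/seeding the secondary replicas are omitted):

-- Run on the primary replica; prerequisites (WSFC, endpoints, restored secondaries) not shown.
CREATE AVAILABILITY GROUP SalesAG
FOR DATABASE SalesDB
REPLICA ON
    N'SQLNODE1' WITH (ENDPOINT_URL = N'TCP://SQLNODE1.corp.local:5022',
                      AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                      FAILOVER_MODE = AUTOMATIC),
    N'SQLNODE2' WITH (ENDPOINT_URL = N'TCP://SQLNODE2.corp.local:5022',
                      AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                      FAILOVER_MODE = AUTOMATIC);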

Availability Groups allow read-only replicas that are kept current via low-latency streaming of changes, so you can offload reporting to another server and have it be near real-time.  Availability Groups offer some fantastic functionality, but just as with a SQL FCI, there are many moving parts and the DBA cannot work in a vacuum for this solution; it requires a group effort.  Make friends with the server team, the storage team, the network folks, and the application team.

Transactional Replication

Transactional Replication gets a shout out here, even though it is not always considered a high availability solution as Paul discusses in his post, In defense of transaction replication as an HA technology.  But it can work as a high availability solution provided you can accept its limitations.  For example, there is no easy way to fail back to the primary site…however, I would argue this is true for log shipping as well because log shipping requires you to backup and restore (easy but time consuming).  In addition, with transactional replication you don’t have a byte-for-byte copy of the publisher database, as you do with log shipping, database mirroring or availability groups.  This may be a deal-breaker for some, but it may be quite acceptable for your database(s).

Transactional Replication is available in all currently supported versions and in Standard and Enterprise Editions, and may also be a viable option for you for disaster recovery.  It’s important that you clearly understand what it can do, and what it cannot, before you decide to use it.  Finally, replication in general isn’t for the faint of heart.  It has many moving parts and can be overwhelming for an Accidental DBA.  Joe has a great article on SQL Server Pro that covers how to get started with transactional replication.

Summary

As we’ve seen, there are many options available that a DBA can use to create a highly available solution and/or a system that can be recovered in the event of a disaster.  It all starts with understanding how much data you can lose (RPO) and how long the system can be unavailable (RTO), and you work from there.  Remember that the business needs to provide RPO and RTO to you, and then you create the solution based on that information.  When you present the solution back to the business, or to management, make sure it is a solution that YOU can support.  As an Accidental DBA, whatever technology you choose must be one with which you’re comfortable, because when a problem occurs, you will be the one to respond and that’s not a responsibility to ignore.  For more information on HA and DR solutions I recommend the following: