Baselines for SQL Server and Azure SQL Database

Last week I got an email from a community member who had read this older article of mine on baselining, and asked if there were any updates related to SQL Server 2016, SQL Server 2017, or vNext (SQL Server 2019). It was a really good question. I haven’t visited that article in a while and so I took the time to re-read it. I’m rather proud to say that what I said then still holds up today.

The fundamentals of baselining are the same as they were back in 2012 when that article was first published. What is different today? First, there are a lot more metrics in the current release of SQL Server that you can baseline (e.g. more events in Extended Events, new DMVs, new PerfMon counters, sp_server_diagnostics_component_results). Second, the options for capturing baselines have changed. In the article I mostly talked about rolling your own scripts for baselining. If you’re looking to establish baselines for your servers you still have the option to develop your own scripts, but you can also use a third-party tool, and if you’re running SQL Server 2016+ or Azure SQL Database, you can use Query Store.

As much as I love Query Store, I admit that it is not all-encompassing in terms of baselining a server. It does not replace a third-party tool, nor does it fully replace rolling your own scripts. Query Store captures metrics specific to query execution, and if you’re not familiar with this feature, feel free to check out my posts about it.
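To give a sense of what that query-execution data looks like, here is a minimal sketch, assuming SQL Server 2016+ or Azure SQL Database; the database name is a placeholder, and which columns you pull will depend on what you actually care about:

```sql
-- Enable Query Store for a database (SQL Server 2016+ / Azure SQL Database).
-- [YourDatabase] is a placeholder name.
ALTER DATABASE [YourDatabase]
SET QUERY_STORE = ON (OPERATION_MODE = READ_WRITE);

-- Runtime statistics Query Store has captured, one row per plan per
-- runtime-stats interval. Durations and CPU time are in microseconds.
SELECT
    q.query_id,
    p.plan_id,
    qt.query_sql_text,
    rs.count_executions,
    rs.avg_duration         AS avg_duration_us,
    rs.avg_cpu_time         AS avg_cpu_time_us,
    rs.avg_logical_io_reads
FROM sys.query_store_runtime_stats AS rs
JOIN sys.query_store_plan AS p
    ON rs.plan_id = p.plan_id
JOIN sys.query_store_query AS q
    ON p.query_id = q.query_id
JOIN sys.query_store_query_text AS qt
    ON q.query_text_id = qt.query_text_id
ORDER BY rs.avg_cpu_time DESC;
```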

Consider this core question: What should we baseline in our SQL Server environment? If you have a third-party tool, the data captured is determined by the application, though some tools allow you to customize it and capture additional metrics. But if you roll your own scripts, there are some fundamental things that I think you should capture, such as instance configuration, file space and usage information, and wait statistics.
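As a rough illustration (not a complete baselining solution), the sort of queries involved might look like this; all of them simply read catalog views and DMVs that exist in any supported version:

```sql
-- Instance configuration: current values of the sp_configure settings.
SELECT name, value, value_in_use
FROM sys.configurations
ORDER BY name;

-- File size for the current database (size is reported in 8 KB pages).
SELECT name AS logical_file_name, physical_name, type_desc,
       size * 8 / 1024 AS size_mb
FROM sys.database_files;

-- Aggregated wait statistics since the last restart (or since the stats were cleared).
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
```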

Beyond that, it really goes back to the question of what problem you are trying to solve. If you are looking at implementing In-Memory OLTP, then you want to capture information related to query execution times and frequency, locking, latching, and memory use. After you implement In-Memory OLTP, you capture those exact same metrics and compare the data. If you’re looking at using Columnstore indexes, you need to look at query performance as it stands right now (duration, I/O, CPU) and capture how it changes after you’ve added one or more Columnstore indexes. But to be really thorough you should also look at index usage for the involved tables, as well as query performance for other queries against those tables, to see if and how performance changes after you’ve added the index. Very few things in SQL Server work truly in isolation; they all interact with each other in some way, which is why baselining can feel a little overwhelming and why I recommend that you start small.
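As a sketch of what a “before” capture for something like a Columnstore evaluation might include, you could snapshot plan-cache statistics and index usage for the tables involved; the table name below is just a placeholder, and the plan-cache numbers are cumulative, so you would normally difference two captures:

```sql
-- Query performance from the plan cache: duration, CPU, and I/O per cached statement.
SELECT TOP (25)
    qs.execution_count,
    qs.total_elapsed_time  / qs.execution_count AS avg_elapsed_time_us,
    qs.total_worker_time   / qs.execution_count AS avg_cpu_time_us,
    qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;

-- Index usage for a specific table, to see how access patterns shift after
-- adding (for example) a Columnstore index. dbo.YourTable is a placeholder.
SELECT
    OBJECT_NAME(us.object_id) AS table_name,
    i.name                    AS index_name,
    us.user_seeks, us.user_scans, us.user_lookups, us.user_updates
FROM sys.dm_db_index_usage_stats AS us
JOIN sys.indexes AS i
    ON us.object_id = i.object_id
   AND us.index_id = i.index_id
WHERE us.database_id = DB_ID()
  AND us.object_id = OBJECT_ID(N'dbo.YourTable');
```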

Back to the original question: is there anything new to consider with SQL Server 2016 and higher? Third-party tools continue to improve, and more metrics become available as new features are added and SQL Server continues to evolve, but the only thing that is really new is the addition of Query Store and its ability to capture query performance metrics natively within SQL Server. Hopefully this helps as you either look at different third-party tools that you may want to purchase, or you look at rolling your own set of scripts. If you’re interested in writing your own scripts, I have a set of references that might be of use here.

Lastly, you’ll note that I haven’t said much about Azure SQL Database, and that’s because it’s an entirely different beast.  If you have one or more Azure SQL Databases, then you may know that within the Portal there are multiple options for looking at system performance, including Intelligent Insights and Query Performance Insight.  Theoretically, you could still roll your own scripts in Azure SQL DB, but I would first explore what Microsoft provides to see if it meets your needs.  Have fun!

Collection of Baseline Scripts

The topic of baselines in SQL Server is one that I’ve had an interest in for a long time.  In fact, the very first session I ever gave back in 2011 was on baselines.  I still believe they are incredibly important, and most of the data I capture is still the same, but I have tweaked a couple of things over the years.  I’m in the process of creating a set of baseline scripts that folks can use to automate the capture of this information, in the event that they do not have, or cannot afford, a third-party monitoring tool (note: a monitoring tool such as SQL Sentry’s Performance Advisor can make life WAY easier, but I know that not everyone can justify the need to management).  For now, I’m starting with links to all relevant posts, and then I’ll update this post once I have everything finalized.

These scripts are just a starting point for what to monitor.  One thing I like to point out in our IEPTO2: Performance Tuning and Optimization course is that there is A LOT of data you can capture related to your SQL Server environment.  Your options include Performance Monitor (using Custom Data Collectors), queries via Extended Events or Trace (depending on version; a sketch of a simple event session follows the list below), and any data from the DMVs or system views within SQL Server. You have to decide what to capture based on

1) What problem you might be trying to solve in your environment, and

2) What information is most important for you to have.

Start simple, and then work your way up. Figure out the one or two most critical things to capture, start there, and then add on.
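As one example of the Extended Events option mentioned above, here is a minimal sketch of an event session; the session name, duration threshold, and file target are placeholders to adapt to your environment:

```sql
-- Capture completed statements that run longer than one second.
-- The duration filter is in microseconds; all names here are placeholders.
CREATE EVENT SESSION [Baseline_LongQueries] ON SERVER
ADD EVENT sqlserver.sql_statement_completed (
    ACTION (sqlserver.database_name, sqlserver.sql_text)
    WHERE (duration > 1000000)
)
ADD TARGET package0.event_file (
    SET filename = N'Baseline_LongQueries.xel'
);

ALTER EVENT SESSION [Baseline_LongQueries] ON SERVER STATE = START;
```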

If you find there’s something missing from my scripts, let me know and I’ll try to get it added!

Monitoring in general

Configuration Information

Disk and I/O

Maintenance

Errors

Performance

Pluralsight

The Accidental DBA (Day 26 of 30): Monitoring Disk I/O

This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

Database storage can seem like a black box. A DBA takes care of databases, and those databases often reside somewhere on a SAN – space simply presented to a DBA as a drive letter representing some amount of space. But storage is about more than a drive letter and a few hundred GBs. Yes, having enough space for your database files is important. But I often see clients plan for capacity and not performance, and this can become a problem down the road. As a DBA, you need to ask your storage admin not just for space, but for throughput, and the best way to back up your request is with data.

I/O Data in SQL Server

I’ve mentioned quite a few DMVs in these Accidental DBA posts, and today is no different. If you want to look at I/O from within SQL Server, you want to use the sys.dm_io_virtual_file_stats DMV. Prior to SQL Server 2005 you could get the same information using the fn_virtualfilestats function, so don’t despair if you’re still running SQL Server 2000!  Paul has a query that I often use to get file information in his post, How to examine IO subsystem latencies from within SQL Server. The sys.dm_io_virtual_file_stats DMV accepts database_id and file_id as inputs, but if you join over to sys.master_files, you can get information for all your database files. If I run this query against one of my instances, ordered by write latency (descending), I get:

output from sys.dm_io_virtual_file_stats

This data makes it look like I have some serious disk issues – a write latency of over 1 second is disheartening, especially considering I have an SSD in my laptop! I include this screenshot because I want to point out that this data is cumulative. It only resets on a restart of the instance. Large I/O operations – such as index rebuilds – can greatly skew the data, and it may take time for the numbers to normalize again. Keep this in mind not only when you view the data at a point in time, but when you share findings with other teams. Joe has a great post that talks about this in more detail, Avoid false negatives when comparing sys.dm_io_virtual_file_stats data to perfmon counter data, and the same approach applies to data from storage devices that your SAN administrators may use.

The information in the sys.dm_io_virtual_file_stats DMV is valuable not only because it shows latencies, but also because it tells you which files have the highest number of reads and writes and the most MB read and written. You can determine which databases (and files) are your heavy hitters, and trend that over time to see if and how it changes.
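If you want to roll your own version of that kind of query, here is a rough sketch (my own simplification, not Paul’s exact script); the column math just turns the cumulative stall and byte counts into averages and MB:

```sql
-- Per-file latency and volume from sys.dm_io_virtual_file_stats, joined to
-- sys.master_files for the file names. Values are cumulative since the last
-- restart of the instance.
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.name                  AS logical_file_name,
    mf.type_desc,
    vfs.num_of_reads,
    vfs.num_of_writes,
    vfs.num_of_bytes_read    / 1048576 AS mb_read,
    vfs.num_of_bytes_written / 1048576 AS mb_written,
    CASE WHEN vfs.num_of_reads = 0 THEN 0
         ELSE vfs.io_stall_read_ms / vfs.num_of_reads END   AS avg_read_latency_ms,
    CASE WHEN vfs.num_of_writes = 0 THEN 0
         ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON vfs.database_id = mf.database_id
   AND vfs.file_id = mf.file_id
ORDER BY avg_write_latency_ms DESC;
```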

I/O Data in Windows

If you want to capture I/O data from Windows, Performance Monitor is your best bet. I like to look at the following counters for each disk:

  • Avg. Disk sec/Read
  • Avg. Disk Bytes/Read
  • Avg. Disk sec/Write
  • Avg. Disk Bytes/Write

Jon talked about PerfMon counters earlier, and the counters above tell you about latency and throughput.  Latency is how long it takes for an I/O request, but this can be measured at different points along the layers of a solution. Normally we are concerned with latency as measured from SQL Server. Within Windows, latency is the time from when Windows initiated the I/O request to the completion of the request. As Joe mentioned in his post, you may see some variation between what you see for latency from SQL Server versus from Windows.

When we measure latency using Windows Performance Monitor, we look at Avg. Disk sec/Read and Avg. Disk sec/Write. Disk cache, whether on the disk itself, on a controller card, or in a storage system, impacts read and write values. Writes are typically written to cache and should complete very quickly. Reads, when not in cache, have to be pulled from disk, and that can take longer.

While it’s easy to think of latency as being entirely related to disk, it’s not. Remember that we’re really talking about the I/O subsystem, which includes the entire path from the server itself all the way to the disks and back. That path includes things like HBAs in the server, switches, controllers in the SAN, cache in the SAN, and the disks themselves. You can never assume that latency is high simply because the disks can’t keep up. Sometimes the queue depth setting for the HBAs is too low, or perhaps you have an intermittently bad connection caused by a failing component like a GBIC (gigabit interface converter) or a bad port card. You have to take the information you have (latency), share it with your storage team, and ask them to investigate, and hopefully you have a savvy storage team that knows to investigate all parts of the path.

A picture is worth a thousand words in more complex environments. It is often best to draw out, with the storage administrator, the mapping from the OS partition to the SAN LUN or volume. This should generate a discussion about the server, the paths to the SAN, and the SAN itself. Remember, what matters is getting the I/O to the application. If the I/O leaves the disk but gets stuck along the way, that adds to latency. There could be an alternate path available (multi-pathing), but maybe not.

Our throughput, measured by Avg. Disk Bytes/Read and Avg. Disk Bytes/Write, tells us how much data is moving between the server and storage. This is valuable to understand, and often more useful than counting I/Os, because we can use it to understand how much data our disks need to be able to read and write to keep up with demand. Ideally you capture this information when the system is optimized – simple things like adding indexes to reduce full table scans can affect the amount of I/O – but oftentimes you will need to just work within the current configuration.

Capturing Baselines

I alluded to baselines when discussing the sys.dm_io_virtual_file_stats DMV, and if you thought I was going to leave it at that then you must not be aware of my love for baselines!
You will want to capture data from SQL Server and Windows to provide throughput data to your storage administrator. You need this data to procure storage on the SAN that will not only give you enough space to accommodate expected database growth, but that will also give you the IOPS and MB/sec your databases require.

Beyond a one-time review of I/O and latency numbers, you should set up a process to capture the data on a regular basis so you can identify if and when things change. You will want to know if a database suddenly starts issuing more I/Os (did someone drop an index?) or if the change in I/Os is gradual. And you need to make sure that I/Os are completing in the timeframe that you expect. Remember that a SAN is shared storage, and you don’t always know with whom you’re sharing that storage. If another application with high I/O requirements is placed on the same set of disks, and your latency goes up, you want to be able to pinpoint that change and provide metrics to your SAN administrator that document the change in performance in your databases.
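A minimal sketch of that kind of capture appears below; the table name is hypothetical, and something like a SQL Agent job would run the INSERT on whatever interval you choose (differencing consecutive snapshots then gives you the I/O issued in each interval):

```sql
-- One-time setup: a table to hold snapshots of the cumulative file stats.
CREATE TABLE dbo.FileStatsBaseline (
    capture_time         DATETIME2(0) NOT NULL DEFAULT SYSDATETIME(),
    database_id          INT          NOT NULL,
    file_id              INT          NOT NULL,
    num_of_reads         BIGINT       NOT NULL,
    num_of_bytes_read    BIGINT       NOT NULL,
    io_stall_read_ms     BIGINT       NOT NULL,
    num_of_writes        BIGINT       NOT NULL,
    num_of_bytes_written BIGINT       NOT NULL,
    io_stall_write_ms    BIGINT       NOT NULL
);

-- Run on a schedule (e.g. every 15 minutes); each run stores the current
-- cumulative values, which you can difference later to trend I/O over time.
INSERT INTO dbo.FileStatsBaseline
    (database_id, file_id, num_of_reads, num_of_bytes_read, io_stall_read_ms,
     num_of_writes, num_of_bytes_written, io_stall_write_ms)
SELECT
    database_id, file_id, num_of_reads, num_of_bytes_read, io_stall_read_ms,
    num_of_writes, num_of_bytes_written, io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL);
```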

Summary

As a DBA you need to know how your databases perform when it comes to reads and writes, and it’s a great idea to get to know your storage team. It’s also a good idea to understand where your databases really “live” and what other applications share the same storage. When a performance issue comes up, use your baseline data as a starting point, and don’t hesitate to pull in your SAN administrators to get more information. While there’s a lot of data readily available for DBAs to use, you cannot get the entire picture on your own. It may not hurt to buy your storage team some pizza or donuts and make some new friends 🙂 Finally, if you’re interested in digging deeper into the details of SQL Server I/O, I recommend starting with Bob Dorr’s work:

Our online training (Pluralsight) courses that can help you with this topic: