Important Query Store Fixes – January 2019

The latest CUs for SQL Server 2016 and 2017 contain some important Query Store fixes that I thought worth mentioning for those of you on either version, or those of you looking to upgrade.  As of this writing, the current CU for SQL Server 2016 SP2 is CU5, and for SQL Server 2017 it is CU13.  Many times we see fixes that make it into a SQL Server 2017 CU ported back to a SQL Server 2016 build.  Interestingly enough, there are some Query Store fixes in 2016 CUs that are not in 2017 CUs.  I don’t know if that’s because the issues do not exist in 2017, or if it’s just that they haven’t been fixed yet in 2017.  I’m planning to update this post if the fixes are added down the road.  So here we go, in descending CU order…

SQL Server 2017 CU13 – FIX: A dump file may be generated when you run the DML internal plan on Query Store enabled database in SQL Server 2017

This will occur for anyone using Automatic Plan Correction (which means you will be on Enterprise Edition), as noted by having FORCE_LAST_GOOD_PLAN enabled.  You can read more about Automatic Plan Correction here (I’m a fan; it’s pretty cool and very helpful for those with small DBA teams and lots of databases, or those DBAs who just have too much on their plate and are constantly putting out fires).
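
If you want to check whether this applies to you, a database’s automatic tuning options can be inspected (and changed) with the statements below; the database name is just a placeholder:

USE [WideWorldImporters];
GO

-- Is FORCE_LAST_GOOD_PLAN enabled for this database?
SELECT [name], [desired_state_desc], [actual_state_desc], [reason_desc]
FROM [sys].[database_automatic_tuning_options];
GO

-- Enabling (or disabling) it is a single ALTER DATABASE statement
ALTER DATABASE [WideWorldImporters]
     SET AUTOMATIC_TUNING (FORCE_LAST_GOOD_PLAN = ON);
GO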

SQL Server 2017 CU11 and SQL Server 2016 SP2 CU5 – FIX: Transactions and log truncation may be blocked when you use Query Store in SQL Server 2016 and 2017

I’m aware of multiple companies that have run into this issue, and I’m glad to see that the fix was ported back to 2016.  I highly recommend getting up to this CU if you can, as the resolution requires a restart (I’ve heard that killing the session_id also works, but I haven’t seen that myself).  Also note that the command referenced in the KB article is ALTER DATABASE <databasename> SET QUERY_STORE CLEAR, but I have also seen this when trying to change one of the Query Store options (e.g. changing size or retention days).  I am pretty sure it’s any ALTER DATABASE statement that changes the Query Store configuration.
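
For reference, here is a sketch (using WideWorldImporters as a stand-in database name) of the kinds of statements involved – clearing Query Store data, as referenced in the KB article, or changing one of its options:

USE [master];
GO

-- Clear all Query Store data for the database (the statement referenced in the KB)
ALTER DATABASE [WideWorldImporters] SET QUERY_STORE CLEAR;
GO

-- Change a Query Store option; I have seen this blocked in the same way
ALTER DATABASE [WideWorldImporters]
     SET QUERY_STORE (MAX_STORAGE_SIZE_MB = 2048);
GO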

SQL Server 2016 SP2 CU5 – FIX: Query Store enabled database takes long time on startup after you apply cumulative update for a SQL Server version

Note that this is applicable to systems with Availability Groups, and this is a fix that is not listed for any 2017 CU…I’ll keep watching to see if it shows up.

SQL Server 2016 SP2 CU4 – FIX: Access violation when SQL Server 2016 tries to start Query Store Manager during startup

This is also a fix that is not listed for any 2017 CU…

SQL Server 2017 CU5 – FIX: Access violation occurs when Query Store collects runtime statistics in SQL Server 2017

I don’t see this fix in any SQL Server 2016 CU, but I’ll keep watching.

SQL Server 2016 SP2 CU2 – FIX: Slow performance of SQL Server 2016 when Query Store is enabled

There were many performance-related improvements for Query Store in SQL Server 2017, and huge props to the SQL Server team for getting these back-ported to 2016.  At a bare minimum, this is the build you should be running, but I would rather see you on CU5 at this point.

Note: If you installed CU2 for SQL Server 2017 at any point, please read: Query Store Fix in SQL Server 2017 for an explanation of what you need to do when you apply a higher CU and why.

You may look at this list of fixes and be concerned.  Don’t be.  These fixes are a good thing!  They tell me that more and more people are using Query Store (which is great), and while they have run into issues, I believe it’s because these are bigger systems (see the access violation fix, which is for systems with 256 logical cores) or they are interesting/edge-case workloads, both of which probably don’t exist in Azure, which may be why these issues haven’t been seen previously.  I remain, as always, a big proponent of Query Store.  If you’re seeing something interesting, feel free to comment here or drop me an email.  I’ve had some folks do that, and I was able to confirm they were seeing a bug and let them know that it would be fixed soon.  I’ve had one or two other cases where it’s an issue I haven’t seen and I recommended contacting Product Support.  Either way, if you’re unsure about Query Store I have lots of resources to get you started, to get it configured properly, and to answer the question “what about performance?!”.  I hope this helps!

Workload Tuning Training

I recently booked my flight to the U.K. for SQLBits, and I realized I probably should write a post about my full day training session on workload tuning! It’s on Wednesday, February 27, 2019, and you can read the abstract here: Stop Tuning Your Queries and Start Tuning Your Workload.

I included a fair bit of information about what I’ll cover, so I thought it might be interesting to anyone considering this session to hear the backstory about how it came to be.
First, this is a new full-day session; I haven’t presented it anywhere else previously. It does have pieces of content I’ve discussed elsewhere, but it really represents a different approach to tuning and troubleshooting. Namely: let’s not focus on one or two or five specific queries that are causing problems; let’s focus on the workload as a whole to identify patterns, and then address them in specific queries.

What I have noticed over the past couple of years is that I see similar patterns when I’m looking at a given system. Very often, within a solution there are patterns in how queries are written, the typical performance problems that result, and the typical ways they are addressed or fixed internally.

For example, we engage with a customer and discover they use table variables in a lot of their code, which causes poor cardinality estimates and then a host of subsequent issues. We optimize a couple of queries using temporary tables instead; they then replace all table variables with temporary tables, and start seeing issues related to temp table contention. Both options have a time and a place; the key is to understand when and why.

The ultimate goal of my session is to share my experience working with a wide range of customers and solutions over the years, the patterns I have seen, and options for addressing them. I’ll tell you now that there is no “easy button” in terms of fixes. But you already knew that 😊 A big key is figuring out how to convince your team of the changes that need to be made, and understanding how and why is critical. That’s what we’ll discuss, because I want you to walk away with applicable information that you can use immediately in your environment.

Still have questions? Add a comment or send me an email. I’d love to hear from you. I hope to see you in Manchester in February!

p.s. I should also mention that next week (Jan 15 – Jan 17) I’m presenting my Query Store course online, and it’s the only delivery of it scheduled for the first part of the year. There is still space if you’re interested! Three afternoons next week, each day with two 1.5-hour sessions. I really like this format; it gives people good breaks to absorb information and think of/ask questions.

Different Query Store Settings for a Database in an Availability Group

Last week there was a question on #sqlhelp on Twitter about the status of Query Store for a database in an Availability Group. I’ve written about Query Store and Availability Groups before, so if you’re not familiar with QS behavior in an AG, check out that post first. But this question was, I think, specific to the values that show on a read-only replica, and how there were different Query Store settings between a primary and secondary. Let’s set it up and take a look.

Environment
I have a multi-node cluster set up, and an AG (Avengers) running for the WideWorldImporters database across two nodes (CAP\ROGERS and BUCKY\BARNES). Right now, CAP is primary and BUCKY is secondary:

Avengers AG Status

Query Store
On the primary, if you check the status of Query Store in the UI, you see the following:

Query Store configuration for WideWorldImporters on CAP

If you check the status of Query Store from the secondary, the settings are the same:

Query Store configuration for WideWorldImporters on BUCKY

Now we’ll make a change to the Query Store on the primary using T-SQL:

USE [master];
GO
ALTER DATABASE [WideWorldImporters]
SET QUERY_STORE (MAX_STORAGE_SIZE_MB = 1024);
GO

And if we check the settings again, we see they’re updated on the primary:

Query Store configuration for WideWorldImporters after change on CAP

But on the secondary, you’ll see that the settings show the same values as before:

Query Store configuration for WideWorldImporters after change on BUCKY

This is expected.  When the instance starts up and the database loads on a replica, the Query Store settings are pulled from sys.database_query_store_options, and these are cached in memory. Any change that is subsequently made to the configuration of Query Store on the primary is persisted to disk on the primary and then propagated over to the secondary and written to disk, just like anything else. However, on-disk changes to Query Store settings are only propagated to the memory cache when the secondary replica is restarted, or if there is a failover and the secondary becomes primary.
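
If you want to verify this on your own system, a simple check (run against each replica, assuming the secondary is readable) is to compare what sys.database_query_store_options reports on the primary versus the secondary:

USE [WideWorldImporters];
GO

-- Run on both the primary and the secondary; the secondary returns the cached, in-memory values
SELECT
     [actual_state_desc],
     [readonly_reason],
     [current_storage_size_mb],
     [max_storage_size_mb],
     [interval_length_minutes]
FROM [sys].[database_query_store_options];
GO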

Conclusion

If you’re seeing a disparity between values for Query Store settings on a primary and any (or all) of your replicas, this is expected, and you can be assured that the changes have been written to the secondary copies and committed.

Handling Dates in Query Store

Query Store retains query performance data at the plan level.  This data is then broken out into intervals of time, determined by the INTERVAL_LENGTH_MINUTES setting.  The time intervals are found in the sys.query_store_runtime_stats_interval system view, and the start_time and end_time columns are of the DATETIMEOFFSET data type.  This means that the date is time-zone aware, and in Query Store the data is stored as UTC.  Now why does all of this matter?  Because handling dates in Query Store is important if you’re going to query the data directly.

Here’s what we get with a simple query against the runtime stats interval view:

SELECT *
FROM sys.query_store_runtime_stats_interval
ORDER BY runtime_stats_interval_id;
GO
sys.query_store_runtime_stats_interval output

These are 10-minute intervals, and notice that they all have +00:00 for the offset, which is UTC.  If you want to display data based on the intervals and convert it to your local time zone, you need to use AT TIME ZONE.  We can use a variation of a query from my last post, Finding the Slowest Query in a Stored Procedure, to see this in action.

SELECT
   [qsq].[query_id],
   [qsp].[plan_id],
   [qsq].[object_id],
   [rs].[runtime_stats_interval_id],
   [rsi].[start_time] AT TIME ZONE 'Eastern Standard Time' [EST StartTime],
   [rsi].[end_time] AT TIME ZONE 'Eastern Standard Time' [EST EndTime],
   [rs].[count_executions],
   [rs].[avg_duration],
   [rs].[avg_cpu_time],
   [rs].[avg_logical_io_reads]
FROM [sys].[query_store_query] [qsq]
JOIN [sys].[query_store_query_text] [qst]
   ON [qsq].[query_text_id] = [qst].[query_text_id]
JOIN [sys].[query_store_plan] [qsp]
   ON [qsq].[query_id] = [qsp].[query_id]
JOIN [sys].[query_store_runtime_stats] [rs]
   ON [qsp].[plan_id] = [rs].[plan_id]
JOIN [sys].[query_store_runtime_stats_interval] [rsi]
   ON [rs].[runtime_stats_interval_id] = [rsi].[runtime_stats_interval_id]
WHERE [qsq].[query_id] = 1421
   AND [qsp].[plan_id] = 1426
   AND [rsi].[end_time] > DATEADD(HOUR, -1, GETUTCDATE())
ORDER BY [rs].[runtime_stats_interval_id];
GO
StartTime and EndTime converted

Since I’m in chilly Cleveland, OH, which is in the Eastern Time Zone, I’ve converted the data accordingly.  If you’re not familiar with the exact time zone text you can use, query sys.time_zone_info.  Note that DATETIMEOFFSET does not account for Daylight Saving Time, so you’ll have to handle that in your code (yay for writing this post in January).
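
If you need the exact string for your region, a quick query against sys.time_zone_info (filtering on part of the name) returns the valid values along with the current offset:

SELECT
     [name],
     [current_utc_offset],
     [is_currently_dst]
FROM [sys].[time_zone_info]
WHERE [name] LIKE N'%Eastern%';
GO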

Separate from displaying the date, you may want to query for a particular range, as I’ve done in the above query with AND [rsi].[end_time] > DATEADD(HOUR, -1, GETUTCDATE()).  Notice that I’m using GETUTCDATE(), which is due to an interesting behavior with DATEADD which Aaron Bertrand has detailed in his post, Performance Surprises and Assumptions: DATEADD.  There are other date columns you could use here, such as last_execution_time, and it, as well as every other date column in the Query Store system views, is DATETIMEOFFSET.  Of note: last_execution_time exists in sys.query_store_query, sys.query_store_plan, and sys.query_store_runtime_stats, but the values can and will be different between a query, its plan(s), and the runtime stats intervals.

Hopefully this helps the next time you’re writing queries against the Query Store system views!

Finding the Slowest Query in a Stored Procedure

This post, Finding the Slowest Query in a Stored Procedure, originally appeared on the PASS website.  It has been re-posted here with permission.

Figuring out exactly what causes slow performance for a stored procedure can sometimes feel like trying to unravel a ball of Clark Griswold’s Christmas lights.  It’s not uncommon to see procedures with hundreds, even thousands of lines of code.  You may have been told which stored procedure runs slow by a user or manager, or you might have found it by looking in SQL Server DMVs.  Either way, once you have detected the offending procedure, where do you start?

If you’re running SQL Server 2016, one option is Query Store.  Query Store captures individual queries, but it also captures the object_id, so you can find all the queries that are associated with an object to determine which ones are problematic.

Setup

We’ll work with a restored copy of the WideWorldImporters database, which you can download from GitHub.

First, since this is a demo database, we will enable Query Store and clear out any historical information:

ALTER DATABASE [WideWorldImporters]
     SET QUERY_STORE = ON;
GO

ALTER DATABASE [WideWorldImporters]
     SET QUERY_STORE (
     OPERATION_MODE=READ_WRITE,
     INTERVAL_LENGTH_MINUTES = 10
     );
GO

/*
Do not run in a Production database unless you want
to remove all Query Store data
*/

ALTER DATABASE [WideWorldImporters]
    SET QUERY_STORE CLEAR;
GO

Next, we will create a very simple stored procedure that has multiple queries within.

USE [WideWorldImporters];
GO

DROP PROCEDURE IF EXISTS [Sales].[usp_GetCustomerDetail];
GO

CREATE PROCEDURE [Sales].[usp_GetCustomerDetail]
     @CustomerName NVARCHAR(100)
AS

CREATE TABLE #CustomerList (
     [RowID] INT IDENTITY (1,1),
     [CustomerID] INT,
     [CustomerName] NVARCHAR (100)
     );

INSERT INTO #CustomerList (
     [CustomerID],
     [CustomerName]
     )
SELECT
     [CustomerID],
     [CustomerName]
FROM [Sales].[Customers]
WHERE [CustomerName] LIKE @CustomerName
UNION
SELECT
     [CustomerID],
     [CustomerName]
FROM [Sales].[Customers_Archive]
WHERE [CustomerName] LIKE @CustomerName;

SELECT
     [o].[CustomerID],
     [o].[OrderID],
     [il].[InvoiceLineID],
     [o].[OrderDate],
     [i].[InvoiceDate],
     [ol].[StockItemID],
     [ol].[Quantity],
     [ol].[UnitPrice],
     [il].[LineProfit]
INTO #CustomerOrders
FROM [Sales].[Orders] [o]
INNER JOIN [Sales].[OrderLines] [ol]
     ON [o].[OrderID] = [ol].[OrderID]
INNER JOIN [Sales].[Invoices] [i]
     ON [o].[OrderID] = [i].[OrderID]
INNER JOIN [Sales].[InvoiceLines] [il]
     ON [i].[InvoiceID] =  [il].[InvoiceID]
     AND [il].[StockItemID] = [ol].[StockItemID]
     AND [il].[Quantity] = [ol].[Quantity]
     AND [il].[UnitPrice] = [ol].[UnitPrice]
WHERE [o].[CustomerID] IN (SELECT [CustomerID] FROM #CustomerList);

SELECT
     [cl].[CustomerName],
     [si].[StockItemName],
     SUM([co].[Quantity]) AS [QtyPurchased],
     SUM([co].[Quantity]*[co].[UnitPrice]) AS [TotalCost],
     [co].[LineProfit],
     [co].[OrderDate],
     DATEDIFF(DAY,[co].[OrderDate],[co].[InvoiceDate]) AS [DaystoInvoice]
FROM #CustomerOrders [co]
INNER JOIN #CustomerList [cl]
     ON [co].[CustomerID] = [cl].[CustomerID]
INNER JOIN [Warehouse].[StockItems] [si]
     ON [co].[StockItemID] = [si].[StockItemID]
GROUP BY [cl].[CustomerName], [si].[StockItemName], [co].[InvoiceLineID],
     [co].[LineProfit], [co].[OrderDate], DATEDIFF(DAY,[co].[OrderDate],[co].[InvoiceDate])
ORDER BY [co].[OrderDate];
GO

Viewing the Data

With the stored procedure created, we can now execute it with different input parameters:

EXEC [Sales].[usp_GetCustomerDetail] N'Alvin Bollinger';
GO 10
EXEC [Sales].[usp_GetCustomerDetail] N'Tami Braggs';
GO 10
EXEC [Sales].[usp_GetCustomerDetail] N'Logan Dixon';
GO 10
EXEC [Sales].[usp_GetCustomerDetail] N'Tara Kotadia';
GO 10

If we use the Query Store Top Resource Consuming Queries report within Management Studio, we can see what individual queries have been executed:

Top Resource Consuming Queries

Top Resource Consuming Queries Report

In this example, the queries for our stored procedure are quite evident – but we don’t have much of a workload.  In a production system, all the queries from a stored procedure probably won’t appear in this view.  There is currently no built-in report to find all the queries for a given stored procedure, so it will likely be necessary to use T-SQL.  Using the query below, we can list all queries captured by Query Store for the usp_GetCustomerDetail procedure:

SELECT
     [qsq].[query_id],
     [qsp].[plan_id],
     [qsq].[object_id],
     [qst].[query_sql_text],
     ConvertedPlan = TRY_CONVERT(XML, [qsp].[query_plan])
FROM [sys].[query_store_query] [qsq]
JOIN [sys].[query_store_query_text] [qst]
     ON [qsq].[query_text_id] = [qst].[query_text_id]
JOIN [sys].[query_store_plan] [qsp]
     ON [qsq].[query_id] = [qsp].[query_id]
WHERE [qsq].[object_id] = OBJECT_ID(N'Sales.usp_GetCustomerDetail');
GO
All queries for usp_GetCustomerDetail

This confirms which queries are part of the procedure – but if we are troubleshooting performance, we are really interested in execution statistics.  We can expand this query slightly to get the data we want:

SELECT
     [qsq].[query_id],
     [qsp].[plan_id],
     [qsq].[object_id],
     [rs].[runtime_stats_interval_id],
     [rsi].[start_time],
     [rsi].[end_time],
     [rs].[count_executions],
     [rs].[avg_duration],
     [rs].[avg_cpu_time],
     [rs].[avg_logical_io_reads],
     [rs].[avg_rowcount],
     [qst].[query_sql_text],
     ConvertedPlan = TRY_CONVERT(XML, [qsp].[query_plan])
FROM [sys].[query_store_query] [qsq]
JOIN [sys].[query_store_query_text] [qst]
     ON [qsq].[query_text_id] = [qst].[query_text_id]
JOIN [sys].[query_store_plan] [qsp]
     ON [qsq].[query_id] = [qsp].[query_id]
JOIN [sys].[query_store_runtime_stats] [rs]
     ON [qsp].[plan_id] = [rs].[plan_id]
JOIN [sys].[query_store_runtime_stats_interval] [rsi]
     ON [rs].[runtime_stats_interval_id] = [rsi].[runtime_stats_interval_id]
WHERE [qsq].[object_id] = OBJECT_ID(N'Sales.usp_GetCustomerDetail')
     AND [rs].[last_execution_time] > DATEADD(HOUR, -1, GETUTCDATE())
     AND [rs].[execution_type] = 0
ORDER BY [qsq].[query_id], [qsp].[plan_id], [rs].[runtime_stats_interval_id];
GO
Runtime statistics for usp_GetCustomerDetail

We’ve only retrieved CPU, duration, and IO metrics in this query – much more information is available in Query Store depending on what you’re interested in seeing, or the problem you are trying to solve.  A quick review of the duration and CPU columns helps us identify the second query in the stored procedure as the one that takes the longest to execute.

There are a few important items worth noting about the query we ran, and its output:

1. We restricted the runtime statistics to only those from the last hour with this predicate:
[rs].[last_execution_time] > DATEADD(HOUR, -1, GETUTCDATE())

2. We restricted the output to show only successful executions with this predicate:
[rs].[execution_type] = 0

3.  Runtime statistics are aggregated across the collection interval, which is set with the INTERVAL_LENGTH_MINUTES option.  For this demo, the interval was set to 10 minutes, a value that’s low for a production environment, and we’re only seeing data for one 10-minute interval (19:30 to 19:40 on 2018-11-28) because we’ve only run our set of queries one time.

*If you’re not familiar with Query Store settings and considerations around them, check out my Query Store Settings post.

4.  The statistics represent the average CPU, duration, logical I/O, etc. for a query’s plan, based on how many times the query was executed in that interval (count_executions).  Minimum, maximum, and standard deviation numbers for each counter are also calculated and can be useful in finding plans with high variability.
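
As a sketch of that, the earlier query can be extended with the min, max, and standard deviation columns (shown here for duration only) to surface plans with high variability:

SELECT
     [qsq].[query_id],
     [qsp].[plan_id],
     [rs].[runtime_stats_interval_id],
     [rs].[count_executions],
     [rs].[avg_duration],
     [rs].[min_duration],
     [rs].[max_duration],
     [rs].[stdev_duration]
FROM [sys].[query_store_query] [qsq]
JOIN [sys].[query_store_plan] [qsp]
     ON [qsq].[query_id] = [qsp].[query_id]
JOIN [sys].[query_store_runtime_stats] [rs]
     ON [qsp].[plan_id] = [rs].[plan_id]
WHERE [qsq].[object_id] = OBJECT_ID(N'Sales.usp_GetCustomerDetail')
     AND [rs].[execution_type] = 0
ORDER BY [rs].[stdev_duration] DESC;
GO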

From here, we can drill further into this specific query using the Query Store Tracked Queries report and the query_id (2).

Tracked Queries Report for query_id 2

Again, we’ve only executed the stored procedure 40 times in total, so while it’s great to see this query’s plan and see that we have a missing index recommendation, this data isn’t that interesting compared to what we might see in a production environment.  We can use a simple loop to run the stored procedure repeatedly with different Customer Names, and let that run for a bit to generate some data:

DECLARE @CustomerID INT = 801;
DECLARE @CustomerName NVARCHAR(100);

/*
Loop continuously through CustomerIDs 801 through 1092, executing the
stored procedure with a wildcarded version of each customer's name
*/
WHILE 1=1
BEGIN

     SELECT @CustomerName = SUBSTRING([CustomerName], 1, 10) + '%'
     FROM [Sales].[Customers]
     WHERE [CustomerID] = @CustomerID;

     EXEC [Sales].[usp_GetCustomerDetail] @CustomerName;

     IF @CustomerID < 1092
     BEGIN
          SET @CustomerID = @CustomerID + 1
     END
     ELSE
     BEGIN
          SET @CustomerID = 801
     END

END

Now if we look at our execution statistics, you will see we have more intervals of data (noted by the start_time and end_time, and the respective runtime_stats_interval_id).  In addition, you’ll note that query_id 3 has three plans (plan_ids 3, 16, and 18):

Runtime Statistics for usp_GetCustomerDetail after additional executions

Looking at query performance across time is not that easy in the grid output.  If we look at any individual query again in the Tracked Queries report for a 30-minute window, we get a much better view of the data:

Tracked Queries Report for query_id 3

Important points about this output:

1. This is query_id 3, which has three different plans, represented by the different plan_id values on the right and the different colors on the graph.

2.  The circles represent the average values for successful executions.  The squares represent canceled executions (which we excluded in our query).  If you see a triangle, that means there was an error during execution.

3.  Clicking on the different plan_id values displays the corresponding query plan in the bottom window.  Hold down the SHIFT key to select two plan_id values, then use the ComparePlans button above to open a new window and see both plans side-by-side.

Even though this view allows us to trend individual query performance over time, we don’t get a good sense of overall query performance relative to other queries in the stored procedure.  We have to go back to T-SQL for that, and for this example, we’ll aggregate our metrics based on the query_id and the plan_id, as a query can have multiple plans and they may have wildly different performance characteristics.

SELECT
     [qsq].[query_id],
     [qsp].[plan_id],
     OBJECT_NAME([qsq].[object_id]) AS [ObjectName],
     SUM([rs].[count_executions]) AS [TotalExecutions],
     AVG([rs].[avg_duration]) AS [Avg_Duration],
     AVG([rs].[avg_cpu_time]) AS [Avg_CPU],
     AVG([rs].[avg_logical_io_reads]) AS [Avg_LogicalReads],
     MIN([qst].[query_sql_text]) AS [Query]
FROM [sys].[query_store_query] [qsq]
JOIN [sys].[query_store_query_text] [qst]
     ON [qsq].[query_text_id] = [qst].[query_text_id]
JOIN [sys].[query_store_plan] [qsp]
     ON [qsq].[query_id] = [qsp].[query_id]
JOIN [sys].[query_store_runtime_stats] [rs]
     ON [qsp].[plan_id] = [rs].[plan_id]
JOIN [sys].[query_store_runtime_stats_interval] [rsi]
     ON [rs].[runtime_stats_interval_id] = [rsi].[runtime_stats_interval_id]
WHERE [qsq].[object_id] = OBJECT_ID(N'Sales.usp_GetCustomerDetail')
     AND [rs].[last_execution_time] > DATEADD(HOUR, -1, GETUTCDATE())
     AND [rs].[execution_type] = 0
GROUP BY [qsq].[query_id], [qsp].[plan_id], OBJECT_NAME([qsq].[object_id])
ORDER BY AVG([rs].[avg_cpu_time]) DESC;
GO

The data is ordered by avg_cpu_time descending, so the worst performing query in terms of CPU is listed first.  You can change the order based on what is of interest to you, but you’ll notice that query_id 2 is still our worst offender.

Averaged runtime statistics by query for usp_GetCustomerDetail

Conclusion

There are multiple sources of query performance data within SQL Server, and the DMVs and Extended Events (or Trace, for those of you running SQL Server 2008 R2 and below) are traditional tools used to understand how long it takes for a query to execute.  Query Store is a new option for capturing query metrics, and it shines over both in that its data is persisted in the user database (the DMVs only represent what’s currently in cache, and are cleared on a restart).  In addition, Query Store doesn’t require any setup other than enabling it one time – you need to have an event session defined and running to capture query performance with Extended Events, and the same with Trace.  Lastly, Query Store allows you to look at all the queries for a stored procedure and drill into what is not performing well, either at this moment or over time.

T-SQL code used here can be found in its entirety on GitHub.

Query Store Performance Overhead: What you need to know

“What is the performance overhead of enabling Query Store?”

I get asked this question almost every time I present on a topic related to Query Store.  What people are really asking is “Does enabling Query Store affect the performance of my queries?”  Where “my queries” are user queries, queries from the application, etc.

The short answer:

  • The majority of workloads won’t see an impact on system performance
    • Will there be an increase in resource use (CPU, memory)?  Yes.
    • Is there a “magic number” to use to figure out Query Store performance and the increase in resource use?  No, it will depend on the type of workload.  Keep reading.
  • An impact on system performance can be seen with ad-hoc workloads (think Entity Framework, NHibernate), but I still think it’s worth enabling. With an ad-hoc workload there are additional factors to consider when using Query Store.
  • You should be running the latest CU for SQL Server 2017, or the latest CU for SQL Server 2016 SP2, to get all of the performance-related improvements Microsoft has implemented specific to Query Store (a quick way to check your current build is sketched below)
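
If you are not sure which build you are running, a quick check is below; note that ProductUpdateLevel is only populated on builds that report a CU level:

SELECT
     SERVERPROPERTY('ProductVersion') AS [ProductVersion],
     SERVERPROPERTY('ProductLevel') AS [ProductLevel],             -- e.g. RTM, SP2
     SERVERPROPERTY('ProductUpdateLevel') AS [ProductUpdateLevel]; -- e.g. CU13
GO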

The long answer…

One reason the SQL Server 2016 release was such a solid release was that it was data driven.  “Data Driven” was frequently used in Microsoft marketing materials for 2016, but it wasn’t hype; it was true.  At the time of the SQL Server release, Azure SQL Database had been in existence for over two years and Microsoft had been capturing telemetry and using that data to understand how features were being used, as well as improve existing features.

One of the features that benefited most from the insight provided by the telemetry data was Query Store, which was originally released in private preview for Azure SQL Database in early 2015.  As Query Store was implemented for more databases, the information captured was used to enhance its functionality and improve its performance.  Query Store was made publicly available in late 2015, and included in the SQL Server 2016 release.  The telemetry data was invaluable to Microsoft’s developers as they prepared Query Store for release, but the variety in size and workload that exist in on-premises solutions was not accurately represented.  Much of this was due to limitations in Azure tiers at the time and the limited number (comparatively) of companies that had embraced using a cloud solution.

Thus, while the initial internal thresholds for Query Store were determined based upon Azure SQL Database solutions and appropriate for most on-prem systems, they were not fully suited to every variation of an on-prem solution.  This is not atypical – it’s extremely difficult to develop software that accommodates every single workload in the world both in the cloud and on-prem.

This history is relevant when people ask about solution performance and Query Store.

First, understand that there are differences in how Query Store works in Azure SQL Database compared to on-prem.  A good example is the amount of space that you can allocate to Query Store within the user database (MAX_STORAGE_SIZE_MB). In Azure SQL Database the maximum value you can set for MAX_STORAGE_SIZE_MB is 10GB.  There is no such limit for SQL Server 2016 or 2017.  As a result of this limitation for Azure SQL DB, the amount of data that Query Store has to manage can be significantly less than what we see for an on-prem solution.  There are many production environments with a Query Store that is 10GB or less in size, but I know of Query Stores that are 200-250GB in size on disk, which typically indicates an anti-pattern with the workload.
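
If you are curious how large your Query Store is relative to the configured maximum, a simple check (run in the user database in question) is:

SELECT
     [current_storage_size_mb],
     [max_storage_size_mb],
     [size_based_cleanup_mode_desc],
     [actual_state_desc]
FROM [sys].[database_query_store_options];
GO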

Separate from storing the Query Store in the user database, data is also held in different memory buffers (e.g. query hash map, runtime statistics cache store).  Data is inserted into these memory buffers for new queries, updated for previously-executed queries, and while data is flushed to disk regularly, it is expected that data continuously resides in these buffers.  The data for the query hash map is consistent, but the volume of data in the runtime statistics cache store fluctuates depending on the workload.

There are multiple ways to characterize a workload, but in the case of Query Store, we’re most interested in the number of unique queries generated.  We tend to characterize workloads with a high number of unique queries as ad-hoc – those that use Entity Framework or NHibernate, for example.  But there are other variations, such as multi-versioned tables, which also create a significant number of unique queries.  To be clear, the following are unique queries:

SELECT e.FirstName, e.LastName, d.Name
FROM dbo.Employees e
JOIN dbo.Department d
ON e.department_id = d.department_id
WHERE e.LastName = 'Harbaugh';

SELECT e.FirstName, e.LastName, d.Name
FROM dbo.Employees e
JOIN dbo.Department d ON e.department_id = d.department_id
WHERE e.LastName = 'Winovich';

SELECT e.firstname, e.lastname, d.name
FROM dbo.Employees e
JOIN dbo.Department d
ON e.department_id = d.department_id
WHERE e.lastname = 'Carr';

Just like the plan cache, Query Store identifies each of the above queries as unique (even though they all have the same query_hash) based on the text, and assigns each its own query_text_id in Query Store.  This query_text_id, combined with context_settings_id, object_id, batch_sql_handle, and query_parameterization_type, creates a unique hash for each query, which Query Store uses internally and stores in buffers in memory, along with the runtime statistics for each unique hash.  The more unique query texts in a workload, the more overhead there may be to manage the data.
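
A rough way to gauge how ad-hoc a workload is from the Query Store data itself is to count how many distinct query texts share the same query_hash; the sketch below simply lists the hashes with the most variations:

SELECT TOP 20
     [qsq].[query_hash],
     COUNT(DISTINCT [qsq].[query_text_id]) AS [UniqueQueryTexts],
     COUNT(DISTINCT [qsq].[query_id]) AS [UniqueQueryIds]
FROM [sys].[query_store_query] [qsq]
GROUP BY [qsq].[query_hash]
ORDER BY [UniqueQueryIds] DESC;
GO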

Understand that if you have an ad hoc workload, you already have a system that is prone to performance issues due to high compiles, plan cache bloat, and variability in query performance across queries that are textually the same in terms of query_hash, but have different literal values (as shown above).  For an ad-hoc workload that is also high volume (high number of batch requests/sec), when you enable Query Store it can appear that a performance problem has been introduced.  It is tempting to look at any decrease in performance as a problem with Query Store.  However, it’s a function of the type of the workload and simply the cost of doing business for said workload.  If you want to capture Query Store data about an ad-hoc workload (to then identify query patterns and address them) then you’ll have to expect and plan for the overhead associated with it.

You can control, to a small degree, the number of queries captured using the QUERY_CAPTURE_MODE setting.  The default value of ALL means that every single query executed will be captured. The value of AUTO means that only queries that exceed a threshold (set internally by Microsoft) will be captured.  As noted in my Query Store Settings post, AUTO is the recommendation for a production environment, particularly one that is ad-hoc.
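
Changing the capture mode is a single statement; for example (using a placeholder database name):

ALTER DATABASE [WideWorldImporters]
     SET QUERY_STORE (QUERY_CAPTURE_MODE = AUTO);
GO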

The SQL Server team made performance-related improvements in Query Store in the SQL Server 2017 release, and these were back-ported to SQL Server 2016 SP2.  There have been a few additional fixes in SQL Server 2017, such as this one, in CU11.  I know of a couple people who have run into this issue, so if you’re on SQL Server 2017 and using Query Store, I definitely recommend applying the latest CU.

Final thoughts

Now we can answer, “Can enabling Query Store make some of your queries run slower?”  It depends on your workload, your version of SQL Server, and the settings you have enabled.  For those folks with a mostly procedure-type workload, I haven’t seen many issues.  For those with ad-hoc, high volume workloads, you are now aware of the potential overhead and you can plan accordingly.  If you’re on the fence about it, enable it during a low-volume time and monitor the system.  If you feel there’s a problem, turn it off.  But the data gathered on any system can be used to help make that system better, even if you have to do it incrementally.  Whether your workload is procedure-based, ad-hoc, or a mix, Query Store is an invaluable resource that can be used to capture query metrics, find queries that perform poorly or execute frequently, and force plans to stabilize query performance.

Baselines for SQL Server and Azure SQL Database

Last week I got an email from a community member who had read this older article of mine on baselining, and asked if there were any updates related to SQL Server 2016, SQL Server 2017, or vNext (SQL Server 2019). It was a really good question. I haven’t visited that article in a while and so I took the time to re-read it. I’m rather proud to say that what I said then still holds up today.

The fundamentals of baselining are the same as they were back in 2012 when that article was first published. What is different about today? First, there are a lot more metrics in the current release of SQL Server that you can baseline (e.g. more events in Extended Events, new DMVs, new PerfMon counters,  sp_server_diagnostics_component_results). Second, options for capturing baselines have changed. In the article I mostly talked about rolling your own scripts for baselining. If you’re looking to establish baselines for your servers you still have the option to develop your own scripts, but you also can use a third-party tool, and if you’re running SQL Server 2016+ or Azure SQL Database, you can use Query Store.

As much as I love Query Store, I admit that it is not all-encompassing in terms of baselining a server. It does not replace a third-party tool, nor does it fully replace rolling your own scripts. Query Store captures metrics specific to query execution, and if you’re not familiar with this feature, feel free to check out my posts about it.

Consider this core question: What should we baseline in our SQL Server environment? If you have a third-party tool, the data captured is determined by the application, and some of them allow you to customize and capture additional metrics. But if you roll your own scripts, there are some fundamental things that I think you should capture such as instance configuration, file space and usage information, and wait statistics.
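
As an example of the roll-your-own approach, a minimal wait statistics baseline (the table name here is just a placeholder) can be as simple as snapshotting sys.dm_os_wait_stats on a schedule:

-- One-time setup: a table to hold the snapshots
CREATE TABLE [dbo].[WaitStatsSnapshot] (
     [CaptureDate] DATETIME2(0) NOT NULL DEFAULT (SYSDATETIME()),
     [wait_type] NVARCHAR(60) NOT NULL,
     [waiting_tasks_count] BIGINT NOT NULL,
     [wait_time_ms] BIGINT NOT NULL,
     [signal_wait_time_ms] BIGINT NOT NULL
     );
GO

-- Capture a snapshot (e.g. from an Agent job that runs every hour)
INSERT INTO [dbo].[WaitStatsSnapshot] (
     [wait_type],
     [waiting_tasks_count],
     [wait_time_ms],
     [signal_wait_time_ms]
     )
SELECT
     [wait_type],
     [waiting_tasks_count],
     [wait_time_ms],
     [signal_wait_time_ms]
FROM [sys].[dm_os_wait_stats]
WHERE [waiting_tasks_count] > 0;
GO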

Beyond that, it really goes back to the question of what problem are you trying to solve? If you are looking at implementing In-Memory OLTP, then you want to capture information related to query execution times and frequency, locking, latching, and memory use. After you implement In-Memory OLTP, you look at those exact same metrics and compare the data. If you’re looking at using Columnstore indexes, you need to look at query performance as it stands right now (duration, I/O, CPU) and capture how it changes after you’ve added one or more Columnstore indexes. But to be really thorough you should also look at index usage for the involved tables, as well as query performance for other queries against those tables to see if and/or how performance changes after you’ve added the index. Very few things in SQL Server work truly in isolation, they’re all interacting with each other in some way…which is why baselining can be a little bit overwhelming and why I recommend that you start small.

Back to the original question: is there anything new to consider with SQL Server 2016 and higher? While third-party tools continue to improve and more metrics are available as new features are added and SQL Server continues to evolve, the only thing “really new” is the addition of Query Store and its ability to capture query performance metrics natively within SQL Server. Hopefully this helps as you either look at different third-party tools that you may want to purchase, or you look at rolling your own set of scripts.  If you’re interested in writing your own scripts, I have a set of references that might be of use here.

Lastly, you’ll note that I haven’t said much about Azure SQL Database, and that’s because it’s an entirely different beast.  If you have one or more Azure SQL Databases, then you may know that within the Portal there are multiple options for looking at system performance, including Intelligent Insights and Query Performance Insight.  Theoretically, you could still roll your own scripts in Azure SQL DB, but I would first explore what Microsoft provides to see if it meets your needs.  Have fun!

Removing a database from a replica in an Availability Group

I recently had a scenario in a two-node Availability Group where multiple large-batch modification queries were executed and created a large redo queue on the replica.  The storage on the replica is slower than that on the primary (not a desired scenario, but it is what it is) and the secondary has fallen behind before, but this time it was to the point where it made more sense to remove the database from the replica and re-initialize it, rather than wait several hours for it to catch up.  What I’m about to detail is not an ideal solution.  In fact, your solution should be architected to avoid this scenario entirely (storage of equal capability for all involved nodes is essential).  But, stuff happens (e.g., a secondary database unexpectedly pausing), and the goal was to get the replica synchronized again with no downtime.

In my demo environment I have two nodes, CAP and BUCKY.  CAP is the primary, BUCKY is the replica, and there are two databases, AdventureWorks2012 and WideWorldImporters in my TestLocation AG:

Availability Group (named TestLocation) Configuration

In this case, my WideWorldImporters database is the one that’s behind on the secondary replica, so this is the database we want to remove and then re-initialize.  On the secondary (BUCKY) we will remove WideWorldImporters from the AG with this T-SQL:

USE [master];
GO

ALTER DATABASE [WideWorldImporters]
     SET HADR OFF;
GO

You can also do this in the UI, if you right-click on the database within the AG and select Remove Secondary Database, but I recommend scripting it and then running it (screen shot for reference):

Removing WideWorldImporters from the AG via SSMS

After removing the database, it will still be listed for the AG but it will have a red X next to it (don’t panic).  It will also be listed in the list of Databases, but it will have a status of Restoring…

WideWorldImporters database removed on the secondary replica

If you check the primary, the WideWorldImporters database there is healthy:

Database and AG health on the primary

You can still access WideWorldImporters as it’s part of the AG and using the Listener.  The system is still available, but I’m playing without a net.  If the primary goes down, I will not have access to the WideWorldImporters database.  In this specific case, this was a risk I was willing to take (again, because the time to restore the database was less than the time it would take the secondary to catch up).  Also note that because this database is in an Availability Group by itself, the transaction log will be truncated when it’s backed up.

At this point, you want to kick off a restore of the most recent full backup of the database on the replica (BUCKY):

USE [master];
GO

RESTORE DATABASE [WideWorldImporters]
     FROM  DISK = N'C:\Backups\WWI_Full.bak'
     WITH  FILE = 1,
     MOVE N'WWI_Primary' TO N'C:\Databases\WideWorldImporters.mdf',
     MOVE N'WWI_UserData' TO N'C:\Databases\WideWorldImporters_UserData.ndf',
     MOVE N'WWI_Log' TO N'C:\Databases\WideWorldImporters.ldf',
     MOVE N'WWI_InMemory_Data_1' TO N'C:\Databases\WideWorldImporters_InMemory_Data_1',
     NORECOVERY,
     REPLACE,
     STATS = 5;

GO

Depending on how long this takes, at some point I disable the jobs that run differential or log backups on the primary (CAP), and then manually kick off a differential backup on the primary (CAP).
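
For reference, disabling (and later re-enabling) those jobs can be scripted with msdb.dbo.sp_update_job; the job name below is just a placeholder:

USE [msdb];
GO

-- Disable the log backup job while the restores are in progress
EXEC dbo.sp_update_job @job_name = N'WideWorldImporters - Log Backup', @enabled = 0;
GO

-- Re-enable it once the database is synchronized again
EXEC dbo.sp_update_job @job_name = N'WideWorldImporters - Log Backup', @enabled = 1;
GO

The differential backup on the primary (CAP) looks like this: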

USE [master];
GO

BACKUP DATABASE [WideWorldImporters]
     TO  DISK = N'C:\Backups\WWI_Diff.bak'
     WITH  DIFFERENTIAL ,
     INIT,
     STATS = 10;
GO

Next, restore the differential on the replica (BUCKY):

USE [master];
GO

RESTORE DATABASE [WideWorldImporters]
     FROM  DISK = N'C:\Backups\WWI_Diff.bak'
     WITH  FILE = 1,
     NORECOVERY,
     STATS = 5;
GO

Finally, take a log backup on the primary (CAP):

USE [master];
GO

BACKUP LOG [WideWorldImporters]
     TO  DISK = N'C:\Backups\WWI_Log.trn'
     WITH NOFORMAT,
     INIT,
     STATS = 10;
GO

And then restore that log backup on the replica (BUCKY):

USE [master];
GO

RESTORE LOG [WideWorldImporters]
     FROM  DISK = N'C:\Backups\WWI_Log.trn'
     WITH  FILE = 1,
     NORECOVERY,
     STATS = 5;
GO

At this point, the database is re-initialized and ready to be added back to the Availability Group.

Now, when I ran into this the other day, I also wanted to apply a startup trace flag to the primary replica and restart the instance.  I also wanted to make sure that the AG wouldn’t try to fail over when the instance restarted, so I temporarily changed the primary to manual failover (executed on CAP, screenshot for reference):

USE [master];
GO

ALTER AVAILABILITY GROUP [TestLocation]
     MODIFY REPLICA ON N'CAP\ROGERS' WITH (FAILOVER_MODE = MANUAL);
GO
Change Failover Mode for the AG Temporarily

I restarted the instance, confirmed my trace flag was in play, and then changed the FAILOVER_MODE back to automatic:

USE [master];
GO

ALTER AVAILABILITY GROUP [TestLocation]
     MODIFY REPLICA ON N'CAP\ROGERS' WITH (FAILOVER_MODE = AUTOMATIC);
GO

The last step is to join the WideWorldImporters database on the replica back to the AG:

ALTER DATABASE [WideWorldImporters]
     SET HADR AVAILABILITY GROUP = TestLocation;
GO

After joining the database back to the AG, be prepared to wait for the databases to synchronize before things look healthy.  Initially I saw this:

Secondary database joined to AG, but not synchronized

Transactions were still occurring on the primary between the time of the log being applied on the secondary (BUCKY) and the database being joined to the AG from the secondary.  You can check the dashboard to confirm this:

Secondary database added to AG, transactions being replayed on secondary

Once the transactions had been replayed, everything was synchronized and healthy:

Databases synchronized (dashboard on primary)

Databases synchronized (connected to secondary)

Once the databases are synchronized, make sure to re-enable the jobs that run differential and log backups on the primary (CAP).  In the end, removing a database from a replica in an Availability Group (and then adding it back) is probably not something you will need to do on a regular basis.  This is a process worth practicing in a test environment at least once, so you’re comfortable with it should the need arise.

Plan Forcing in SQL Server

Last month I was in Portugal for their SQLSaturday event, and I spent a lot of time talking about Plan Forcing in SQL Server – both manual and automatic (via the Automatic Plan Correction feature). I had some really great questions from my pre-con and regular session and wanted to summarize a few thoughts on Plan Forcing functionality.

Forcing plans in SQL Server provides a very easy method for DBAs and developers to stabilize query performance.  But plan forcing is not a permanent solution.  Consider the premise on which plan forcing relies: multiple plans exist for a query and one of them provides the most consistent performance.  If I have high variability in query performance, ideally, I want to address that in the code or through schema changes (e.g. indexing).  Forcing a plan for a query is a lot like creating a plan guide – they are similar but they are two separate features – in that it’s a temporary solution.  I also view adding OPTION (RECOMPILE) as a temporary solution. Some of you might be shocked at that, but when I see a RECOMPILE on a query, I immediately ask why it was added, when it was added, and I start looking at what can be done to remove it.

Knowing that this is how I view plan forcing, how do I decide when to force a plan?  When the query has variability in performance.

Consider Query A, which generates multiple plans, but they’re all about the same in terms of duration, I/O, and CPU.  The performance across the different plans is consistent.  I won’t force a plan for that query.

Query with multiple, consistent plans

Next consider Query B, which also generates different plans; some are stable, but a couple are all over the place in terms of duration, I/O, and CPU.  Maybe a couple of plans provide good performance, but the rest are awful.  Would I force one of the “good plans”?  Probably – but I’d do some testing first.

Query with multiple plans that have variable performance

Understand that if I force a plan for a query, that’s the plan that’s going to get used unless forcing fails for some reason (e.g. the index no longer exists).  But does that plan work for all variations of the query?  Does that plan provide consistent performance for all the different input parameters that can be used for that query?  This requires testing…and oh by the way, concurrent with any testing/decision to force a plan I’m talking to the developers about ways to address this long-term.

Now, out of my entire workload, if I have many queries that have multiple plans, where do I start?  With the worst offenders.  If I’m resource-bound in some way (e.g. CPU or I/O), then I would look at queries with the highest resource use and start working through those.  But I also look for the “death by a thousand cuts” scenario – the queries which execute hundreds or thousands of times a minute.  As an aside, during the pre-con in Portugal one of the attendees had me look at a query in Query Store in the production environment.  There was concern because the query had multiple plans.  I pointed out that the query had executed 71,000 times in an hour…which is almost 20 times a second.  While I want to investigate multiple plans, I also want to know why a query executes so often.

Thus far, I’ve talked about a workload…one workload.  What about the environment where you support hundreds of SQL Server instances?  You can obviously take the approach I’ve detailed above, which requires a review of poor-performing queries with multiple plans and deciding which plan (if any) to force until development addresses the issue.  Or, if you’re running SQL Server 2017 Enterprise Edition, you could look at Automatic Plan Correction, which will force a plan for a query (without human intervention) if there’s a regression.  I wrote a post (Automatic Plan Correction in SQL Server) on SQLPerformance.com about this feature, so I’m not going to re-hash the details here.

Whether you force plans manually, or let SQL Server force them with the Automatic Plan Correction feature, I still view plan forcing as a temporary solution.  I don’t expect you to have plans forced for years, let alone months.  The life of a forced plan will, of course, depend on how quickly code and schema changes are ported to production.  If you go the “set it and forget it route”, theoretically a manually forced plan could get used for a very long time.  In that scenario, it’s your responsibility to periodically check to ensure that plan is still the “best” one for the query.  I would be checking every couple weeks; once a month at most.  Whether or not the plan remains optimal depends on the tables involved in the query, the data in the tables, how that data changes (if it changes), other schema changes that may be introduced, and more.

Further, you don’t want to ignore forced plans, because there are cases where a forced plan won’t be used (you can use Extended Events to monitor this).  When you force a plan manually, forcing can still fail.  For example, if the forced plan uses an index and the index is dropped, or its definition is changed to the point where it cannot be used in the plan in the same manner, then forcing will fail.  Important note: if forcing fails, the query will go through normal optimization and compilation and it will execute; SQL Server does not want your query to fail!  If you’re forcing plans and not familiar with the reasons that forcing can fail, note the last_force_failure_reason values listed for sys.query_store_plan.  If you have manually forced a plan for a query and forcing fails, the plan remains forced.  You have to manually un-force it to stop SQL Server from trying to use that plan.  As you can see, there are multiple factors related to plan forcing, which is why you don’t just force a plan and forget it.
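
For reference, manual forcing and un-forcing is done with two system stored procedures, and sys.query_store_plan shows which plans are forced and why the last forcing attempt failed (the query_id and plan_id values below are just placeholders):

-- Force a specific plan for a query
EXEC sp_query_store_force_plan @query_id = 1421, @plan_id = 1426;
GO

-- Check forced plans and the reason for the last forcing failure, if any
SELECT
     [query_id],
     [plan_id],
     [is_forced_plan],
     [last_force_failure_reason_desc],
     [force_failure_count]
FROM [sys].[query_store_plan]
WHERE [is_forced_plan] = 1;
GO

-- Un-force the plan when it is no longer needed
EXEC sp_query_store_unforce_plan @query_id = 1421, @plan_id = 1426;
GO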

This behavior is different if you’re using Automatic Plan Correction (APC).  As mentioned in the Automatic tuning documentation, if a plan is automatically forced, it will be automatically un-forced if:

  • forcing fails for any reason
  • if there is a performance regression using the forced plan
  • if there is a recompile due to a schema change or an update to statistics.

With APC, there is still work to be done – here you want to use Extended Events or sys.dm_db_tuning_recommendations to see what plans are getting forced, and then decide if you want to force them manually.  If you force a plan manually it will never be automatically un-forced.

There are a lot of considerations when you embrace plan forcing – I think it’s an excellent alternative to plan guides (much easier to use, not schema bound) and I think it’s absolutely worth a DBA or developer’s time to investigate what plan to force, and then use that as a temporary solution until a long-term fix can be put in place.  I hope this helps those of you that have been wary to give it a try!

ALTER DATABASE SET QUERY_STORE command is blocked

If you are trying to execute an ALTER DATABASE command to change a Query Store option (e.g. turn it off, change a setting) and it is blocked, take note of the blocking session_id and what that session_id is executing.  If you are trying to execute this ALTER command right after a failover or restart, you are probably blocked by the Query Store data loading.

As a reminder, when a database with Query Store enabled starts up, it loads data from the Query Store internal tables into memory (this is an optimization to make specific capabilities of Query Store complete quickly).  In some cases this is a small amount of data, in other cases, it’s larger (potentially a few GB), and as such, it can take seconds or minutes to load.  I have seen this take over 30 minutes to load for a very large Query Store (over 50GB in size).

Specifically, I was recently working with a customer with an extremely large Query Store.  The customer had enabled Trace Flag 7752, which I have written about, so that queries were not blocked while Query Store loaded asynchronously.  The tricky thing about that load is that there is no way to monitor the progress.  You can track when it starts loading and then when it finishes using Extended Events, but there is no progress bar to stare at on a screen.  When trying to execute an ALTER DATABASE <dbname> SET QUERY_STORE statement while the load was occurring, the statement was blocked by a system session that was running the command Query Store ASYN.  The ALTER DATABASE <dbname> SET QUERY_STORE command did complete once the Query Store data had been loaded.

If you do not have Trace Flag 7752 enabled, then if you try to execute ALTER DATABASE <dbname> SET QUERY_STORE after a restart or failover you might see the QDS_LOADDB wait_type for queries (again, this will depend on the size of the Query Store).  Again, there is no way to monitor the load, and you will see the same behavior if you try to run ALTER DATABASE <dbname> SET QUERY_STORE: the command will not complete until the Query Store load has completed.
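
A quick way to see what is blocking the ALTER DATABASE statement, and what that blocking session is doing, is to check sys.dm_exec_requests (a sketch; filter as needed for your system):

SELECT
     [session_id],
     [blocking_session_id],
     [command],
     [wait_type],
     [wait_time],
     [status]
FROM [sys].[dm_exec_requests]
WHERE [blocking_session_id] <> 0
     OR [command] LIKE N'%Query Store%';
GO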

In summary, regardless of whether the Query Store data is loading synchronously or asynchronously, you will not be able to execute an ALTER DATABASE <dbname> SET QUERY_STORE statement until the load is complete.