Query Store Performance Overhead: What you need to know

“What is the performance overhead of enabling Query Store?”

I get asked this question almost every time I present on a topic related to Query Store.  What people are really asking is “Does enabling Query Store affect the performance of my queries?”  Where “my queries” are user queries, queries from the application, etc.

The short answer:

  • The majority of workloads won’t see an impact on system performance
    • Will there be an increase in resource use (CPU, memory)?  Yes.
    • Is there a “magic number” to use to figure out Query Store performance and the increase in resource use?  No, it will depend on the type of workload.  Keep reading.
  • An impact on system performance can be seen with ad-hoc workloads (think Entity Framework, NHibernate), but I still think it’s worth enabling. With an ad-hoc workload there are additional factors to consider when using Query Store.
  • You should be running the latest CU for SQL Server 2017, or the latest CU for SQL Server 2016 SP2, to get all of the performance-related improvements Microsoft has implemented specific to Query Store.

The long answer…

One reason the SQL Server 2016 release was such a solid release was that it was data driven.  “Data Driven” was frequently used in Microsoft marketing materials for 2016, but it wasn’t hype; it was true.  At the time of the SQL Server 2016 release, Azure SQL Database had been in existence for over two years, and Microsoft had been capturing telemetry and using that data to understand how features were being used, as well as to improve existing features.

One of the features that benefited most from the insight provided by the telemetry data was Query Store, which was originally released in private preview for Azure SQL Database in early 2015.  As Query Store was implemented for more databases, the information captured was used to enhance its functionality and improve its performance.  Query Store was made publicly available in late 2015, and included in the SQL Server 2016 release.  The telemetry data was invaluable to Microsoft’s developers as they prepared Query Store for release, but the variety in size and workload that exists in on-premises solutions was not accurately represented.  Much of this was due to limitations in Azure tiers at the time and the limited number (comparatively) of companies that had embraced using a cloud solution.

Thus, while the initial internal thresholds for Query Store were determined based upon Azure SQL Database solutions and appropriate for most on-prem systems, they were not fully suited to every variation of an on-prem solution.  This is not atypical – it’s extremely difficult to develop software that accommodates every single workload in the world both in the cloud and on-prem.

This history is relevant when people ask about solution performance and Query Store.

First, understand that there are differences in how Query Store works in Azure SQL Database compared to on-prem.  A good example is the amount of space that you can allocate to Query Store within the user database (MAX_STORAGE_SIZE_MB).  In Azure SQL Database the maximum value one can set for MAX_STORAGE_SIZE_MB is 10GB; there is no such limit for SQL Server 2016 or 2017.  As a result of this limitation for Azure SQL DB, the amount of data that Query Store has to manage can be significantly less than what we see for an on-prem solution.  There are many production environments with a Query Store that is 10GB or less in size, but I know of Query Stores that are 200-250GB in size on disk, which typically indicates an anti-pattern with the workload.
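
If you want to see how much of that allocation your Query Store is currently using, you can check sys.database_query_store_options.  A minimal sketch is below; the database name and the 2048MB value in the ALTER statement are placeholders, not recommendations:

/* Run in the context of the user database */
SELECT actual_state_desc,
     current_storage_size_mb,
     max_storage_size_mb
FROM sys.database_query_store_options;
GO

/* Example only: raise the limit if you are getting close to it */
ALTER DATABASE [YourDatabase]
     SET QUERY_STORE (MAX_STORAGE_SIZE_MB = 2048);
GO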

Separate from storing the Query Store in the user database, data is also held in different memory buffers (e.g. query hash map, runtime statistics cache store).  Data is inserted into these memory buffers for new queries, updated for previously-executed queries, and while data is flushed to disk regularly, it is expected that data continuously resides in these buffers.  The data for the query hash map is consistent, but the volume of data in the runtime statistics cache store fluctuates depending on the workload.

There are multiple ways to characterize a workload, but in the case of Query Store, we’re most interested in the number of unique queries generated.  We tend to characterize workloads with a high number of unique queries as ad-hoc – those that use Entity Framework or NHibernate, for example.  But there are other variations, such as multi-versioned tables, which also create a significant number of unique queries.  To be clear, the following are unique queries:

SELECT e.FirstName, e.LastName, d.Name
FROM dbo.Employees e
JOIN dbo.Department d
ON e.department_id = d.department_id
WHERE e.LastName = 'Harbaugh';

SELECT e.FirstName, e.LastName, d.Name
FROM dbo.Employees e
JOIN dbo.Department d ON e.department_id = d.department_id
WHERE e.LastName = 'Winovich';

SELECT e.firstname, e.lastname, d.name
FROM dbo.Employees e
JOIN dbo.Department d
ON e.department_id = d.department_id
WHERE e.lastname = 'Carr';

Just like the plan cache, Query Store identifies each of the above queries as unique (even though they all have the same query_hash) based on the text, and assigns each its own query_text_id.  This query_text_id, combined with context_settings_id, object_id, batch_sql_handle, and query_parameterization_type, creates a unique hash for each query, which Query Store uses internally and stores in buffers in memory, along with the runtime statistics for each unique hash.  The more unique query texts in a workload, the more overhead there may be to manage the data.
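
If you want to verify this, run the three queries above and then check the Query Store catalog views.  A minimal sketch, assuming Query Store is enabled for the database that holds dbo.Employees:

/* Same query_hash, but a separate query_text_id (and query_id) for each statement */
SELECT q.query_id,
     qt.query_text_id,
     q.query_hash,
     qt.query_sql_text
FROM sys.query_store_query_text qt
JOIN sys.query_store_query q
     ON qt.query_text_id = q.query_text_id
WHERE qt.query_sql_text LIKE '%dbo.Employees%';
GO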

Understand that if you have an ad hoc workload, you already have a system that is prone to performance issues due to high compiles, plan cache bloat, and variability in query performance across queries that are textually the same in terms of query_hash, but have different literal values (as shown above).  For an ad-hoc workload that is also high volume (high number of batch requests/sec), when you enable Query Store it can appear that a performance problem has been introduced.  It is tempting to look at any decrease in performance as a problem with Query Store.  However, it’s a function of the type of the workload and simply the cost of doing business for said workload.  If you want to capture Query Store data about an ad-hoc workload (to then identify query patterns and address them) then you’ll have to expect and plan for the overhead associated with it.
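
As a starting point for finding those patterns, you can aggregate by query_hash to see which query shapes generate the most unique query texts.  A rough sketch:

/* Query shapes with many unique texts are candidates for parameterization */
SELECT q.query_hash,
     COUNT(DISTINCT q.query_id) AS [unique query texts],
     SUM(rs.count_executions) AS [total executions]
FROM sys.query_store_query q
JOIN sys.query_store_plan p
     ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats rs
     ON p.plan_id = rs.plan_id
GROUP BY q.query_hash
HAVING COUNT(DISTINCT q.query_id) > 1
ORDER BY [unique query texts] DESC;
GO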

You can control, to a small degree, the number of queries captured using the QUERY_CAPTURE_MODE setting.  The default value of ALL means that every single query executed will be captured.  The value of AUTO means that only queries that exceed a threshold (set internally by Microsoft) will be captured.  As noted in my Query Store Settings post, AUTO is the recommendation for a production environment, particularly one that is ad-hoc.
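
Changing the capture mode is just an ALTER DATABASE statement; for example (the database name is a placeholder):

USE [master];
GO
ALTER DATABASE [YourDatabase]
     SET QUERY_STORE (QUERY_CAPTURE_MODE = AUTO);
GO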

The SQL Server team made performance-related improvements in Query Store in the SQL Server 2017 release, and these were back-ported to SQL Server 2016 SP2.  There have been a few additional fixes in SQL Server 2017, such as this one, in CU11.  I know of a couple people who have run into this issue, so if you’re on SQL Server 2017 and using Query Store, I definitely recommend applying the latest CU.

Final thoughts

Now we can answer, “Can enabling Query Store make some of your queries run slower?”  It depends on your workload, your version of SQL Server, and the settings you have enabled.  For those folks with a mostly procedure-type workload, I haven’t seen many issues.  For those with ad-hoc, high volume workloads, you are now aware of the potential overhead and you can plan accordingly.  If you’re on the fence about it, enable it during a low-volume time and monitor the system.  If you feel there’s a problem, turn it off.  But the data gathered on any system can be used to help make that system better, even if you have to do it incrementally.  Whether your workload is procedure-based, ad-hoc, or a mix, Query Store is an invaluable resource that can be used to capture query metrics, find queries that perform poorly or execute frequently, and force plans to stabilize query performance.

Baselines for SQL Server and Azure SQL Database

Last week I got an email from a community member who had read this older article of mine on baselining, and asked if there were any updates related to SQL Server 2016, SQL Server 2017, or vNext (SQL Server 2019). It was a really good question. I haven’t visited that article in a while and so I took the time to re-read it. I’m rather proud to say that what I said then still holds up today.

The fundamentals of baselining are the same as they were back in 2012 when that article was first published. What is different about today? First, there are a lot more metrics in the current release of SQL Server that you can baseline (e.g. more events in Extended Events, new DMVs, new PerfMon counters,  sp_server_diagnostics_component_results). Second, options for capturing baselines have changed. In the article I mostly talked about rolling your own scripts for baselining. If you’re looking to establish baselines for your servers you still have the option to develop your own scripts, but you also can use a third-party tool, and if you’re running SQL Server 2016+ or Azure SQL Database, you can use Query Store.

As much as I love Query Store, I admit that it is not all-encompassing in terms of baselining a server.  It does not replace a third-party tool, nor does it fully replace rolling your own scripts.  Query Store captures metrics specific to query execution, and if you’re not familiar with this feature, feel free to check out my posts about it.

Consider this core question: What should we baseline in our SQL Server environment? If you have a third-party tool, the data captured is determined by the application, and some of them allow you to customize and capture additional metrics. But if you roll your own scripts, there are some fundamental things that I think you should capture such as instance configuration, file space and usage information, and wait statistics.
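
If you go the roll-your-own route, wait statistics are a good example of something that is easy to snapshot on a schedule.  A minimal sketch (the table name and collection frequency are entirely up to you, not a prescribed approach):

/* Snapshot wait statistics so you can compare collections over time */
CREATE TABLE dbo.WaitStatsBaseline (
     capture_time DATETIME2(0) NOT NULL DEFAULT (SYSDATETIME()),
     wait_type NVARCHAR(60) NOT NULL,
     waiting_tasks_count BIGINT NOT NULL,
     wait_time_ms BIGINT NOT NULL,
     signal_wait_time_ms BIGINT NOT NULL
);
GO

/* Run this on a schedule, e.g. from an Agent job */
INSERT INTO dbo.WaitStatsBaseline
     (wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms)
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE waiting_tasks_count > 0;
GO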

Beyond that, it really goes back to the question of what problem are you trying to solve? If you are looking at implementing In-Memory OLTP, then you want to capture information related to query execution times and frequency, locking, latching, and memory use. After you implement In-Memory OLTP, you look at those exact same metrics and compare the data. If you’re looking at using Columnstore indexes, you need to look at query performance as it stands right now (duration, I/O, CPU) and capture how it changes after you’ve added one or more Columnstore indexes. But to be really thorough you should also look at index usage for the involved tables, as well as query performance for other queries against those tables to see if and/or how performance changes after you’ve added the index. Very few things in SQL Server work truly in isolation, they’re all interacting with each other in some way…which is why baselining can be a little bit overwhelming and why I recommend that you start small.

Back to the original question: is there anything new to consider with SQL Server 2016 and higher? While third-party tools continue to improve and more metrics are available as new features are added and SQL Server continues to evolve, the only thing “really new” is the addition of Query Store and its ability to capture query performance metrics natively within SQL Server. Hopefully this helps as you either look at different third-party tools that you may want to purchase, or you look at rolling your own set of scripts.  If you’re interested in writing your own scripts, I have a set of references that might be of use here.

Lastly, you’ll note that I haven’t said much about Azure SQL Database, and that’s because it’s an entirely different beast.  If you have one or more Azure SQL Databases, then you may know that within the Portal there are multiple options for looking at system performance, including Intelligent Insights and Query Performance Insight.  Theoretically, you could still roll your own scripts in Azure SQL DB, but I would first explore what Microsoft provides to see if it meets your needs.  Have fun!

Removing a database from a replica in an Availability Group

I recently had a scenario in a two-node Availability Group where multiple large-batch modification queries were executed and created a large redo queue on the replica.  The storage on the replica is slower than that on the primary (not a desired scenario, but it is what it is) and the secondary has fallen behind before, but this time it was to the point where it made more sense to remove the database from the replica and re-initialize it, rather than wait several hours for it to catch up.  What I’m about to detail is not an ideal solution.  In fact, your solution should be architected to avoid this scenario entirely (storage of equal capability for all involved nodes is essential).  But, stuff happens (e.g., a secondary database unexpectedly pausing), and the goal was to get the replica synchronized again with no downtime.

In my demo environment I have two nodes, CAP and BUCKY.  CAP is the primary, BUCKY is the replica, and there are two databases, AdventureWorks2012 and WideWorldImporters in my TestLocation AG:

Availability Group (named TestLocation) Configuration

In this case, my WideWorldImporters database is the one that’s behind on the secondary replica, so this is the database we want to remove and then re-initialize.  On the secondary (BUCKY) we will remove WideWorldImporters from the AG with this TSQL:

USE [master];
GO

ALTER DATABASE [WideWorldImporters]
     SET HADR OFF;
GO

You can also do this in the UI, if you right-click on the database within the AG and select Remove Secondary Database, but I recommend scripting it and then running it (screen shot for reference):

Removing WideWorldImporters from the AG via SSMS

After removing the database, it will still be listed for the AG but it will have a red X next to it (don’t panic).  It will also be listed in the list of Databases, but it will have a status of Restoring…

WideWorldImporters database removed on the secondary replica

If you check the primary, the WideWorldImporters database there is healthy:

Database and AG health on the primary

You can still access WideWorldImporters as it’s part of the AG and using the Listener.  The system is still available, but I’m playing without a net.  If the primary goes down, I will not have access to the WideWorldImporters database.  In this specific case, this was a risk I was willing to take (again, because the time to restore the database was less than the time it would take the secondary to catch up).  Also note that because this database is in an Availability Group by itself, the transaction log will be truncated when it’s backed up.

At this point, you want to kick off a restore of the most recent full backup of the database on the replica (BUCKY):

USE [master];
GO

RESTORE DATABASE [WideWorldImporters]
     FROM  DISK = N'C:\Backups\WWI_Full.bak'
     WITH  FILE = 1,
     MOVE N'WWI_Primary' TO N'C:\Databases\WideWorldImporters.mdf',
     MOVE N'WWI_UserData' TO N'C:\Databases\WideWorldImporters_UserData.ndf',
     MOVE N'WWI_Log' TO N'C:\Databases\WideWorldImporters.ldf',
     MOVE N'WWI_InMemory_Data_1' TO N'C:\Databases\WideWorldImporters_InMemory_Data_1',
     NORECOVERY,
     REPLACE,
     STATS = 5;

GO

Depending on how long this takes, at some point I disable the jobs that run differential or log backups on the primary (CAP), and then manually kick off a differential backup on the primary (CAP).

USE [master];
GO

BACKUP DATABASE [WideWorldImporters]
     TO  DISK = N'C:\Backups\WWI_Diff.bak'
     WITH  DIFFERENTIAL ,
     INIT,
     STATS = 10;
GO

Next, restore the differential on the replica (BUCKY):

USE [master];
GO

RESTORE DATABASE [WideWorldImporters]
     FROM  DISK = N'C:\Backups\WWI_Diff.bak'
     WITH  FILE = 1,
     NORECOVERY,
     STATS = 5;
GO

Finally, take a log backup on the primary (CAP):

USE [master];
GO

BACKUP LOG [WideWorldImporters]
     TO  DISK = N'C:\Backups\WWI_Log.trn'
     WITH NOFORMAT,
     INIT,
     STATS = 10;
GO

And then restore that log backup on the replica (BUCKY):

USE [master];
GO

RESTORE LOG [WideWorldImporters]
     FROM  DISK = N'C:\Backups\WWI_Log.trn'
     WITH  FILE = 1,
     NORECOVERY,
     STATS = 5;
GO

At this point, the database is re-initialized and ready to be added back to the Availability Group.

Now, when I ran into this the other day, I also wanted to apply a startup trace flag to the primary replica and restart the instance.  I also wanted to make sure that the AG wouldn’t try to failover when the instance restarted, so I temporarily changed the primary to manual failover (executed on CAP, screenshot for reference):

USE [master];
GO

ALTER AVAILABILITY GROUP [TestLocation]
     MODIFY REPLICA ON N'CAP\ROGERS' WITH (FAILOVER_MODE = MANUAL);
GO
Change Failover Mode for the AG Temporarily

I restarted the instance, confirmed my trace flag was in play, and then changed the FAILOVER_MODE back to automatic:

USE [master];
GO

ALTER AVAILABILITY GROUP [TestLocation]
     MODIFY REPLICA ON N'CAP\ROGERS' WITH (FAILOVER_MODE = AUTOMATIC);
GO

The last step is to join the WideWorldImporters database on the replica back to the AG:

ALTER DATABASE [WideWorldImporters]
     SET HADR AVAILABILITY GROUP = TestLocation;
GO

After joining the database back to the AG, be prepared to wait for the databases to synchronize before things look healthy.  Initially I saw this:

Secondary database joined to AG, but not synchronized

Transactions were still occurring on the primary between the time of the log being applied on the secondary (BUCKY) and the database being joined to the AG from the secondary.  You can check the dashboard to confirm this:

Secondary database added to AG, transactions being replayed on secondary

Once the transactions had been replayed, everything was synchronized and healthy:

Databases synchronized (dashboard on primary)

Databases synchronized (connected to secondary)

Once the databases are synchronized, make sure to re-enable the jobs that run differential and log backups on the primary (CAP).  In the end, removing a database from a replica in an Availability Group (and then adding it back) is probably not something you will need to do on a regular basis.  This is a process worth practicing in a test environment at least once, so you’re comfortable with it should the need arise.

Plan Forcing in SQL Server

Last month I was in Portugal for their SQLSaturday event, and I spent a lot of time talking about Plan Forcing in SQL Server – both manual and automatic (via the Automatic Plan Correction feature). I had some really great questions from my pre-con and regular session and wanted to summarize a few thoughts on Plan Forcing functionality.

Forcing plans in SQL Server provides a very easy method for DBAs and developers to stabilize query performance.  But plan forcing is not a permanent solution.  Consider the premise on which plan forcing relies: multiple plans exist for a query and one of them provides the most consistent performance.  If I have high variability in query performance, ideally, I want to address that in the code or through schema changes (e.g. indexing).  Forcing a plan for a query is a lot like creating a plan guide – they are similar but they are two separate features – in that it’s a temporary solution.  I also view adding OPTION (RECOMPILE) as a temporary solution. Some of you might be shocked at that, but when I see a RECOMPILE on a query, I immediately ask why it was added, when it was added, and I start looking at what can be done to remove it.

Knowing that this is how I view plan forcing, how do I decide when to force a plan?  When the query has variability in performance.

Consider Query A, which generates multiple plans, but they’re all about the same in terms of duration, I/O, and CPU.  The performance across the different plans is consistent.  I won’t force a plan for that query.

Query with multiple, consistent plans

Next consider Query B, which also generates different plans; some are stable, but a couple are all over the place in terms of duration, I/O, and CPU.  Maybe a couple plans provide good performance, but the rest are awful.  Would I force one of the “good plans”?  Probably – but I’d do some testing first.

Query with multiple plans that have variable performance

Understand that if I force a plan for a query, that’s the plan that’s going to get used unless forcing fails for some reason (e.g. the index no longer exists).  But does that plan work for all variations of the query?  Does that plan provide consistent performance for all the different input parameters that can be used for that query?  This requires testing…and oh by the way, concurrent with any testing/decision to force a plan I’m talking to the developers about ways to address this long-term.

Now, out of my entire workload, if I have many queries that have multiple plans, where do I start?  With the worst offenders.  If I’m resource-bound in some way (e.g. CPU or I/O), then I would look at queries with the highest resource use and start working through those.  But I also look for the “death by a thousand cuts” scenario – the queries which execute hundreds or thousands of times a minute.  As an aside, during the pre-con in Portugal one of the attendees had me look at a query in Query Store in the production environment.  There was concern because the query had multiple plans.  I pointed out that the query had executed 71,000 times in an hour…which is almost 20 times a second.  While I want to investigate multiple plans, I also want to know why a query executes so often.
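
If you want a list of candidates to start with, a query along these lines against Query Store works as a rough first pass (a sketch only; order by CPU, reads, or executions depending on what you are bound by):

/* Queries with more than one plan, ranked by total CPU captured in Query Store */
SELECT q.query_id,
     COUNT(DISTINCT p.plan_id) AS [plan count],
     SUM(rs.count_executions) AS [total executions],
     SUM(rs.avg_cpu_time * rs.count_executions) AS [total cpu]
FROM sys.query_store_query q
JOIN sys.query_store_plan p
     ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats rs
     ON p.plan_id = rs.plan_id
GROUP BY q.query_id
HAVING COUNT(DISTINCT p.plan_id) > 1
ORDER BY [total cpu] DESC;
GO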

Thus far, I’ve talked about a workload…one workload.  What about the environment where you support hundreds of SQL Server instances?  You can obviously take the approach I’ve detailed above, which requires a review of poor-performing queries with multiple plans and deciding which plan (if any) to force until development addresses the issue.  Or, if you’re running SQL Server 2017 Enterprise Edition, you could look at Automatic Plan Correction, which will force a plan for a query (without human intervention) if there’s a regression.  I wrote a post (Automatic Plan Correction in SQL Server) on SQLPerformance.com about this feature, so I’m not going to re-hash the details here.

Whether you force plans manually, or let SQL Server force them with the Automatic Plan Correction feature, I still view plan forcing as a temporary solution.  I don’t expect you to have plans forced for years, let alone months.  The life of a forced plan will, of course, depend on how quickly code and schema changes are ported to production.  If you go the “set it and forget it route”, theoretically a manually forced plan could get used for a very long time.  In that scenario, it’s your responsibility to periodically check to ensure that plan is still the “best” one for the query.  I would be checking every couple weeks; once a month at most.  Whether or not the plan remains optimal depends on the tables involved in the query, the data in the tables, how that data changes (if it changes), other schema changes that may be introduced, and more.

Further, you don’t want to ignore forced plans, because there are cases where a forced plan won’t be used (you can use Extended Events to monitor this).  When you force a plan manually, forcing can still fail.  For example, if the forced plan uses an index and the index is dropped, or its definition is changed to the point where it cannot be used in the plan in the same manner, then forcing will fail.  Important note: if forcing fails, the query will go through normal optimization and compilation and it will execute; SQL Server does not want your query to fail!  If you’re forcing plans and not familiar with the reasons that forcing can fail, note the last_force_failure_reason values listed for sys.query_store_plan.  If you have manually forced a plan for a query and forcing fails, the plan remains forced.  You have to manually un-force it to stop SQL Server from trying to use that plan.  As you can see, there are multiple factors related to plan forcing, which is why you don’t just force a plan and forget it.
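
One easy way to keep tabs on manually forced plans is to check sys.query_store_plan on a regular basis; for example:

/* Forced plans, how often forcing has failed, and why */
SELECT p.query_id,
     p.plan_id,
     p.force_failure_count,
     p.last_force_failure_reason_desc
FROM sys.query_store_plan p
WHERE p.is_forced_plan = 1;
GO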

This behavior is different if you’re using Automatic Plan Correction (APC).  As mentioned in the Automatic tuning documentation, if a plan is automatically forced, it will be automatically un-forced if:

  • forcing fails for any reason
  • there is a performance regression using the forced plan
  • there is a recompile due to a schema change or an update to statistics

With APC, there is still work to be done – here you want to use Extended Events or sys.dm_db_tuning_recommendations to see what plans are getting forced, and then decide if you want to force them manually.  If you force a plan manually it will never be automatically un-forced.
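
If you prefer a query to an Extended Events session, a sketch along the lines of the example in the Automatic tuning documentation pulls the query and plan IDs out of the JSON details column (SQL Server 2017+):

/* What has Automatic Plan Correction recommended (or forced)? */
SELECT reason,
     score,
     JSON_VALUE(details, '$.implementationDetails.script') AS [force script],
     planForceDetails.[query_id],
     planForceDetails.[regressed_plan_id],
     planForceDetails.[recommended_plan_id]
FROM sys.dm_db_tuning_recommendations
CROSS APPLY OPENJSON(details, '$.planForceDetails')
     WITH (
          [query_id] INT '$.queryId',
          [regressed_plan_id] INT '$.regressedPlanId',
          [recommended_plan_id] INT '$.recommendedPlanId'
     ) AS planForceDetails;
GO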

There are a lot of considerations when you embrace plan forcing – I think it’s an excellent alternative to plan guides (much easier to use, not schema bound) and I think it’s absolutely worth a DBA or developer’s time to investigate what plan to force, and then use that as a temporary solution until a long-term fix can be put in place.  I hope this helps those of you that have been wary to give it a try!

ALTER DATABASE SET QUERY_STORE command is blocked

If you are trying to execute an ALTER DATABASE command to change a Query Store option (e.g. turn it off, change a setting) and it is blocked, take note of the blocking session_id and what that session_id is executing.  If you are trying to execute this ALTER command right after a failover or restart, you are probably blocked by the Query Store data loading.

As a reminder, when a database with Query Store enabled starts up, it loads data from the Query Store internal tables into memory (this is an optimization to make specific capabilities of Query Store complete quickly).  In some cases this is a small amount of data, in other cases, it’s larger (potentially a few GB), and as such, it can take seconds or minutes to load.  I have seen this take over 30 minutes to load for a very large Query Store (over 50GB in size).

Specifically, I was recently working with a customer with an extremely large Query Store.  The customer had enabled Trace Flag 7752, which I have written about, so that queries were not blocked while Query Store loaded asynchronously.  The tricky thing about that load is that there is no way to monitor its progress.  You can track when it starts loading and when it finishes using Extended Events, but there is no progress bar to stare at on a screen.  When trying to execute an ALTER DATABASE <dbname> SET QUERY_STORE statement while the load was occurring, the statement was blocked by a system session that was running the command Query Store ASYN.  The ALTER DATABASE <dbname> SET QUERY_STORE command did complete once the Query Store data had been loaded.

If you do not have Trace Flag 7752 enabled and you try to execute ALTER DATABASE <dbname> SET QUERY_STORE after a restart or failover, you might see the QDS_LOADDB wait_type for queries (again, this will depend on the size of the Query Store).  Again, there is no way to monitor the load, and you will see the same behavior if you try to run ALTER DATABASE <dbname> SET QUERY_STORE: the command will not complete until the Query Store load has completed.
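
While you cannot see how far along the load is, you can at least confirm that the load is what you are waiting on.  A minimal check for sessions currently waiting on QDS_LOADDB:

SELECT session_id,
     wait_type,
     wait_time,
     blocking_session_id
FROM sys.dm_exec_requests
WHERE wait_type = N'QDS_LOADDB';
GO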

In summary, regardless of whether the Query Store data is loading synchronously or asynchronously, you will not be able to execute an ALTER DATABASE <dbname> SET QUERY_STORE statement until the load is complete.

Query Store Training – Portugal

I am so excited to announce that I am presenting a full day of Query Store Training, in-person, this September in Lisbon, Portugal! The SQLskills team will be in London for two weeks in September for a set of Immersion Events (IEPTO1, IEAzure, IECAG, and IEPTO2). After I’ve finished my teaching for IEPTO2 I’m heading over to Lisbon for a full day (Friday, September 21, 2018)  on Query Store in advance of SQLSaturday Portugal.

This workshop has continued to evolve since its debut at the PASS Summit last fall – specifically, I’ve added more content around performance and workload analysis, but every time I get asked a new question that I think is relevant or interesting, it gets added into a slide or demo. You can read the full abstract below and purchase your ticket here, and if you have any questions about the workshop please email me or post a comment!

*Note: If you’re interested in the Immersion Events listed above, please know that we probably won’t offer them in Europe again until 2020, so if you’re interested please talk to your manager and get signed up. We would love to see you!

Using Query Store to Easily Troubleshoot and Stabilize Your Workload

– Have you upgraded to SQL Server 2016 or higher, but still have databases using the old Cardinality Estimator?
– Do you know that you have queries with inconsistent performance, but you’re just not sure how to find them, or fix them, quickly?
– Are you tired of flailing around in SQL Server, querying DMV after DMV to figure out the *real* problem with performance?

Query Store can help.

We’ll cover Query Store end-to-end in this full day workshop built using real-world examples based on customer issues resolved over the last two years. You’ll understand how to configure it, what data it captures, and how to use it to analyze performance, find regressions, and force plans. The demos will teach you how to find common patterns in query performance using T-SQL, and how to understand your workload.

This class is applicable for those running SQL Server 2016 or higher (or planning to upgrade), or Azure SQL Database, and will provide practical and applicable information you can use whether you’re a new or veteran DBA, a developer that has to troubleshoot query performance, or an application administrator just trying to keep the system afloat. You’ll learn how to find and leverage important information in Query Store to make solving common performance problems easier the moment you walk back into the office.

Monitoring Space Used by Query Store

Last week I presented a session on Query Store, and when talking about the settings I mentioned that monitoring space used by Query Store is extremely important when you first enable it for a database.  Someone asked me how I would do that, and as I provided an explanation I realized that I should document my method…because I give the same example every time and it would be nice to have the code.

For those of you not familiar with the Query Store settings, please check out my post which lists each one, the defaults, and what I would recommend for values and why.  When discussing MAX_STORAGE_SIZE_MB, I mention monitoring via sys.database_query_store_options or Extended Events.  As much as I love Extended Events, there isn’t an event that fires when a threshold is exceeded.  The event related to size is query_store_disk_size_over_limit, and it fires when the space used exceeds the value for MAX_STORAGE_SIZE_MB, which is too late.  I want to take action before the maximum storage size is hit.

Therefore, the best option I’ve found is to create an Agent job which runs on a regular basis (maybe every four or six hours initially) that checks current_storage_size_mb in sys.database_query_store_options, calculates the space used by Query Store as a percentage of the total allocated, and then sends an email if that exceeds the threshold you set.  The code that you can put into an Agent job is below.  Please note you want to make sure the job runs in the context of the user database with Query Store enabled (as sys.database_query_store_options is a database view), and configure the threshold to a value that makes sense for your MAX_STORAGE_SIZE_MB.  In my experience, 80% has been a good starting point, but feel free to adjust as you see fit!

Once your Query Store size has been tweaked and stabilized, I would leave this job in place as a safety to alert you should anything change (e.g. someone else changes a Query Store setting which indirectly affects the storage used).

/* Change DBNameHere as appropriate */
USE [DBNameHere]

/* Change Threshold as appropriate */
DECLARE @Threshold DECIMAL(4,2) = 80.00
DECLARE @CurrentStorage INT
DECLARE @MaxStorage INT

SELECT @CurrentStorage = current_storage_size_mb, @MaxStorage = max_storage_size_mb
FROM sys.database_query_store_options

IF (SELECT CAST(CAST(current_storage_size_mb AS DECIMAL(21,2))/CAST(max_storage_size_mb AS DECIMAL(21,2))*100 AS DECIMAL(4,2))
FROM sys.database_query_store_options) >= @Threshold
BEGIN

     DECLARE @EmailText NVARCHAR(MAX) = N'The Query Store current space used is ' + CAST(@CurrentStorage AS NVARCHAR(19)) + 'MB
     and the max space configured is ' + CAST(@MaxStorage AS NVARCHAR(19)) + 'MB,
     which exceeds the threshold of ' + CAST(@Threshold AS NVARCHAR(19) )+ '%.
     Please allocate more space to Query Store or decrease the amount of data retained (stale_query_threshold_days).'

     /* Edit profile_name and recipients as appropriate */
     EXEC msdb.dbo.sp_send_dbmail
     @profile_name = 'SQL DBAs',
     @recipients = 'DBAs@yourcompany.com',
     @body = @EmailText,
     @subject = 'Storage Threshold for Query Store Exceeded' ;
END

Updating Statistics with Ola Hallengren’s Script

I am a HUGE fan of updating statistics as part of regular maintenance.  In fact, if you don’t know whether you have a step or job that updates out-of-date statistics on a regular basis, go check now!  This post will still be here when you get back 😊

At any rate, for a long time the default options for updating statistics were pretty much a sledgehammer.  Within the maintenance plan options, the Update Statistics Task only provides the option to update Index statistics, Column statistics, or both.  You can also specify whether it is a full scan or a sample for the update, but that’s about it:

Update Statistics Task (Maintenance Plan)

I don’t like this option because it means that statistics that have had little or no change will be updated.  I could have a 10 million row table where only 1000 rows change, and yet the statistics for that table will update.  This is a waste of resources.  For a small database, or system that’s not 24×7, that isn’t such a big deal.  But in a database with multiple 10 million row tables, it is a big deal.

The sp_updatestats command isn’t a favorite of mine either.  I’ve written about that here, so I won’t re-hash it.

If you have used Ola Hallengren’s scripts for maintenance, you hopefully know that the IndexOptimize procedure will also update statistics using the @UpdateStatistics parameter.  The default value for this is NULL, which means do not update statistics.  To be clear, if you drop in Ola’s scripts and have them create the jobs for you, and then you start running the “IndexOptimize – USER_DATABASES” job, by default you’re not updating statistics.  The code the IndexOptimize – USER_DATABASES job has, by default, is:

EXECUTE [dbo].[IndexOptimize]
@Databases = 'USER_DATABASES',
@LogToTable = 'Y'

If you want to have the job also update statistics, you need:

EXECUTE [dbo].[IndexOptimize]
@Databases = 'USER_DATABASES',
@UpdateStatistics = 'ALL',
@LogToTable = 'Y'

With this variation, we are updating index and column statistics, which is great.  But…we are updating them regardless of whether it’s needed.  Statistic with no rows modified? Update it.  Statistic with 10 rows modified? Update it.

There has always been an option to only update statistics that have changed: the @OnlyModifiedStatistics option, which gives us behavior just like sp_updatestats.

EXECUTE [dbo].[IndexOptimize]
@Databases = 'USER_DATABASES',
@UpdateStatistics = 'ALL',
@OnlyModifiedStatistics = 'Y',
@LogToTable = 'Y'

With this option, if no rows have changed, the statistic will not be updated.  If one or more rows have changed, the statistic will be updated.

Since the release of SP1 for 2012, this has been my only challenge with Ola’s scripts.  In SQL Server 2008R2 SP2 and SQL Server 2012 SP1 they introduced the sys.dm_db_stats_properties DMV, which tracks modifications for each statistic.  I have written custom scripts to use this information to determine if stats should be updated, which I’ve talked about here.  Jonathan has also modified Ola’s script for a few of our customers to look at sys.dm_db_stats_properties to determine if enough data had changed to update stats, and a long time ago we had emailed Ola to ask if he could include an option to set a threshold.  Good news, that option now exists!

Using Ola’s script to update statistics based on a threshold of change

With the IndexOptimize stored procedure Ola now includes the option of @StatisticsModificationLevel.  You can use this to set a threshold for modifications, so that only statistics with a specific volume of change are updated.  For example, if I want statistics updated if 5% of the data has changed, use:

EXECUTE [dbo].[IndexOptimize]
@Databases = 'USER_DATABASES',
@UpdateStatistics = 'ALL',
@StatisticsModificationLevel= '5',
@LogToTable = 'Y'

Take note: the @OnlyModifiedStatistics option is not included here…you cannot use both options; it has to be one or the other.

This is great!  I can further customize this for different tables.  Consider a database that has a very volatile table, maybe dbo.OrderStatus, where auto-update may or may not kick in during the day, so I want to make sure stats are updated nightly:

EXECUTE [dbo].[IndexOptimize]
@Databases = 'USER_DATABASES',
@Indexes = 'ALL_INDEXES, -SalesDB.dbo.OrderStatus',
@UpdateStatistics = 'ALL',
@StatisticsModificationLevel= '10',
@LogToTable = 'Y'

This will address fragmentation and update statistics for all tables in the SalesDB database except dbo.OrderStatus, and it will update statistics if 10% or more of the rows have changed.

I would then have a second job to address fragmentation and stats for OrderStatus:

EXECUTE [dbo].[IndexOptimize]
@Databases = 'USER_DATABASES',
@Indexes = 'SalesDB.dbo.OrderStatus',
@UpdateStatistics = 'ALL',
@StatisticsModificationLevel= '1',
@LogToTable = 'Y'

For the dbo.OrderStatus table, statistics would be updated when only 1% of the data had changed.

I love the flexibility this provides!

You might be wondering why I chose 1%…take a close look at this important note which is included in Ola’s documentation:

Statistics will also be updated when the number of modified rows has reached a decreasing, dynamic threshold, SQRT(number of rows * 1000)

This is critical to understand because if the threshold I have set for @StatisticsModificationLevel ends up having a number of rows HIGHER than the formula above, statistics will update sooner than I expect.

For example, if I have 1 million rows in a table and I have @StatisticsModificationLevel = 10, then 10% of the rows, or 100,000, have to change in order to update statistics.  HOWEVER, if you plug 1 million into SQRT(1,000,000 * 1000), you get 31,623, which means Ola’s script will update statistics after 31,623 rows have changed…well before 100,000.

This may be important for some of you to understand in terms of these thresholds, so I dropped the information into a table to make it easier to comprehend (at least, it’s easier for me!).

Thresholds for Statistics Updates (percentage and SQRT algorithm)

Using my original example, if dbo.OrderStatus has about one million rows, then with 1% as the threshold, only 10,000 rows need to change before stats are updated.  If the SQRT algorithm were used, over 30,000 rows would need to change before stats were updated, and depending on the data skew, that might be too high.

Understand that as tables get larger, statistics will likely be updated before the set percentage value is reached because the SQRT algorithm has a lower threshold.  (Yes, I’m driving this point home.)  Consider a table with 10 million rows.  If I set the threshold to 5%, I would expect statistics to update after 500,000 modifications, but in fact they will update after 100,000.
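
If you want to see where that crossover happens for your own table sizes, you can compute both thresholds side by side.  A quick sketch (the 5% value is just an example):

/* Compare a fixed percentage threshold to the SQRT(rows * 1000) threshold */
SELECT [rows],
     CAST([rows] * 0.05 AS BIGINT) AS [5 percent threshold],
     CAST(SQRT([rows] * 1000.0) AS BIGINT) AS [sqrt threshold]
FROM (VALUES (10000), (100000), (1000000), (10000000), (100000000)) AS t([rows]);
GO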

If you’re wondering where the SQRT algorithm comes from, please review Microsoft’s Statistics documentation.  This threshold was originally introduced with trace flag 2371 to lower the threshold for automatic updates.  It is applied by default starting in SQL Server 2016 when using compatibility level 130.  My assumption is that Ola determined this was a good threshold to use as a fail-safe/catch-all for his script, and I think it was a smart move on his part.  In general, I’d rather have statistics update too often, rather than not often enough.  However, using the new @StatisticsModificationLevel option gives us better control than we’ve had previously, unless we write a custom script (which is still an option…do what works best for you!).

Can you force a plan for a different query with Query Store?

This is a question I’ve gotten a few times in class…Can you force a plan for a different query with Query Store?

tl;dr

No.

Assume you have two similar queries, but they have different query_id values in Query Store.  One of the queries has a plan that’s stable, and I want to force that plan for the other query.  Query Store provides no ability to do this in the UI, but you can try it with the stored procedure.  Let’s take a look…

Testing

Within WideWorldImporters we’ll execute an ad-hoc query with two different input values:

USE [master];
GO
ALTER DATABASE [WideWorldImporters] SET QUERY_STORE = ON;
GO
ALTER DATABASE [WideWorldImporters] SET QUERY_STORE (OPERATION_MODE = READ_WRITE);
GO

USE [WideWorldImporters];
GO

DECLARE @CustomerID INT;
SET @CustomerID = 972;

SELECT o.OrderDate, o.ContactPersonID, ol.StockItemID, ol.Quantity
FROM Sales.Orders o
JOIN Sales.OrderLines ol
ON o.OrderID = ol.OrderID
WHERE o.CustomerID = @CustomerID;
GO

DECLARE @CustomerID2 INT;
SET @CustomerID2 = 972;

SELECT o.ContactPersonID, o.OrderDate, ol.StockItemID, ol.Quantity
FROM Sales.Orders o
JOIN Sales.OrderLines ol
ON o.OrderID = ol.OrderID
WHERE o.CustomerID = @CustomerID2;
GO

Let’s see what’s in Query Store:


SELECT qt.query_text_id, q.query_id, qt.query_sql_text, p.plan_id, TRY_CAST(p.query_plan AS XML)
FROM sys.query_store_query_text qt
JOIN sys.query_store_query q
ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan p
ON q.query_id = p.query_id
WHERE qt.query_sql_text LIKE '%Sales.Orders%';
GO

Query information from Query Store

We see that we have two different queries and one plan for each. We can force the plan for the first query:

EXEC sp_query_store_force_plan @query_id = 3, @plan_id = 3;
GO

This works.  If we try to force that same plan for the other query:

EXEC sp_query_store_force_plan @query_id = 4, @plan_id = 3;
GO
Error when trying to force a different plan for a query

Trying to force plan_id 3 for query_id 4 throws this error:

Msg 12406, Level 11, State 1, Procedure sp_query_store_force_plan, Line 1 [Batch Start Line 34]
 Query plan with provided plan_id (2) is not found in the Query Store for query (4). Check the plan_id value and rerun the command.

Summary

Within Query Store, the relationship between query_id and plan_id is managed internally (i.e. there are no foreign key constraints for the underlying tables), and there is validation that any plan_id you want to force for a query_id must have been generated for that specific query.

In this type of scenario, you have to get the plan shape you want for the query, which may require trying different input parameters.  The example I’ve provided is very simple, but when in doubt, check the input parameters for the plan that you want, then try those with the other query (that doesn’t yet have the plan you want to force).  Of course, if you have to use a query or index hint to get the plan that you want, then it’s going to be a little trickier to get the plan you want for the original query.  Good luck!

Query Store and the Plan Cache Flushing

I’ve had two comments recently on my blog about Query Store causing the plan cache to be flushed. There was a known issue related to the plan cache flushing after Query Store was enabled, but this was fixed in CU2 for SQL Server 2016 SP1. So I did some testing and here is what I think is causing the confusion:

When you enable Query Store, which is done with an ALTER DATABASE SET statement, the plan cache for the database is flushed.

Now, before anyone writes up a UserVoice item, understand that there are several ALTER DATABASE SET commands that cause the plan cache for a database to be flushed. For example, taking a database OFFLINE causes the database plan cache to be flushed. That one seems intuitive, right?  So why is the plan cache cleared when you enable Query Store, or change one of the settings?  To ensure that new Query Store data is not lost.  This relates to the internals of how Query Store works, which aren’t essential to dig into; the point is that this behavior is known by Microsoft and expected.

If you review the ALTER DATABASE SET documentation, and specifically review the Query Store options, you won’t find any mention of the database plan cache clearing.  But you can test it to see that it occurs…

Testing

First, disable Query Store for the WideWorldImporters database, and then free the plan cache:

USE [master];
GO
ALTER DATABASE [WideWorldImporters] SET QUERY_STORE = OFF;
GO

DBCC FREEPROCCACHE;
GO

USE [WideWorldImporters];
GO

SELECT o.OrderDate, o.ContactPersonID, ol.StockItemID, ol.Quantity
FROM Sales.Orders o
JOIN Sales.OrderLines ol
	ON o.OrderID = ol.OrderID
WHERE o.CustomerID = 972;
GO
SELECT o.OrderDate, o.ContactPersonID, ol.StockItemID, ol.Quantity
FROM Sales.Orders o
JOIN Sales.OrderLines ol
	ON o.OrderID = ol.OrderID
WHERE o.CustomerID = 123;
GO

Now query the plan cache to confirm those plans are in cache:


SELECT t.dbid, t.text, s.creation_time, s.execution_count, p.query_plan
FROM sys.dm_exec_query_stats s
CROSS APPLY sys.dm_exec_query_plan(s.plan_handle) p
CROSS APPLY sys.dm_exec_sql_text(s.sql_handle) t
WHERE t.text LIKE '%Sales%';
GO
SQL Server’s plan cache after initial query execution

Great, they’re there. Now enable Query Store, then check the plan cache again.


USE [master];
GO
ALTER DATABASE [WideWorldImporters] SET QUERY_STORE = ON;
GO
ALTER DATABASE [WideWorldImporters] SET QUERY_STORE (OPERATION_MODE = READ_WRITE);
GO

SELECT t.dbid, t.text, s.creation_time, s.execution_count, p.query_plan
FROM sys.dm_exec_query_stats s
CROSS APPLY sys.dm_exec_query_plan(s.plan_handle) p
CROSS APPLY sys.dm_exec_sql_text(s.sql_handle) t
WHERE t.text LIKE '%Sales%';
GO
SQL Server’s plan cache after enabling Query Store (plans have been cleared)

The plan cache for the database has been cleared. Note that this only clears the plan cache for that database – plans for other databases still remain in cache. Run a few more queries to add some plans back to the plan cache, and confirm they’re there.


USE [WideWorldImporters];
GO

SELECT o.OrderDate, o.ContactPersonID, ol.StockItemID, ol.Quantity
FROM Sales.Orders o
JOIN Sales.OrderLines ol
	ON o.OrderID = ol.OrderID
WHERE o.CustomerID = 972;
GO
SELECT o.OrderDate, o.ContactPersonID, ol.StockItemID, ol.Quantity
FROM Sales.Orders o
JOIN Sales.OrderLines ol
	ON o.OrderID = ol.OrderID
WHERE o.CustomerID = 123;
GO

SELECT t.dbid, t.text, s.creation_time, s.execution_count, p.query_plan
FROM sys.dm_exec_query_stats s
CROSS APPLY sys.dm_exec_query_plan(s.plan_handle) p
CROSS APPLY sys.dm_exec_sql_text(s.sql_handle) t
WHERE t.text LIKE '%Sales%';
GO
SQL Server’s plan cache after enabling Query Store AND running some queries

This time, change one of the settings for Query Store. It’s already enabled, but perhaps we want to change the INTERVAL_LENGTH_MINUTES setting from the default of 60 minutes to 30 minutes.


USE [master]
GO
ALTER DATABASE [WideWorldImporters] SET QUERY_STORE (INTERVAL_LENGTH_MINUTES = 30)
GO

SELECT t.dbid, t.text, s.creation_time, s.execution_count, p.query_plan
FROM sys.dm_exec_query_stats s
CROSS APPLY sys.dm_exec_query_plan(s.plan_handle) p
CROSS APPLY sys.dm_exec_sql_text(s.sql_handle) t
WHERE t.text LIKE '%Sales%';
GO
SQL Server’s plan cache after changing a Query Store setting (plans have been cleared)

Checking the plan cache again confirms that the ALTER DATABASE SET statement cleared the database’s cache.

Summary

As you can see, the database plan cache is cleared after you enable Query Store, or change any settings related to Query Store. This is the same behavior we see with other ALTER DATABASE SET commands (e.g. changing the recovery model).  Unfortunately, this is not documented, nor is anything written to the ERRORLOG.

Of note: I don’t expect that you are changing settings often (if you are, I’d like to understand that thought process, as once you find the right values for space and interval, I expect those settings to be static…and if you’re not sure where to start, feel free to check out my post discussing the different options). I also don’t expect that you are turning Query Store on and off throughout the day; that completely defeats the purpose of the feature. It should be enabled, and left enabled, all the time. You don’t know when a problem might occur, right?