Edit: November 12, 2020 – Updated information about Query Store performance overhead can be found in this post.
“What is the performance overhead of enabling Query Store?”
I get asked this question almost every time I present on a topic related to Query Store. What people are really asking is “Does enabling Query Store affect the performance of my queries?” Where “my queries” are user queries, queries from the application, etc.
The short answer:
- The majority of workloads won’t see an impact on system performance
- Will there be an increase in resource use (CPU, memory)? Yes.
- Is there a “magic number” to use to figure out Query Store performance and the increase in resource use? No, it will depend on the type of workload. Keep reading.
- An impact on system performance can be seen with ad-hoc workloads (think Entity Framework, NHibernate), but I still think it’s worth enabling. With an ad-hoc workload there are additional factors to consider when using Query Store.
- You should be running the latest version CU for SQL Server 2017 and latest CU for SQL Server 2016 SP2 to get all performance-related improvements Microsoft has implemented specific to Query Store
The long answer…
One reason the SQL Server 2016 release was such a solid release was that it was data driven. “Data Driven” was frequently used in Microsoft marketing materials for 2016, but it wasn’t hype; it was true. At the time of the SQL Server release, Azure SQL Database had been in existence for over two years and Microsoft had been capturing telemetry and using that data to understand how features were being used, as well as improve existing features.
One of the features that benefited most from the insight provided by the telemetry data was Query Store, which was originally released in private preview for Azure SQL Database in early 2015. As Query Store was implemented for more databases, the information captured was used to enhance its functionality and improve its performance. Query Store was made publicly available in late 2015, and included in the SQL Server 2016 release. The telemetry data was invaluable to Microsoft’s developers as they prepared Query Store for release, but the variety in size and workload that exist in on-premises solutions was not accurately represented. Much of this was due to limitations in Azure tiers at the time and the limited number (comparatively) of companies that had embraced using a cloud solution.
Thus, while the initial internal thresholds for Query Store were determined based upon Azure SQL Database solutions and appropriate for most on-prem systems, they were not fully suited to every variation of an on-prem solution. This is not atypical – it’s extremely difficult to develop software that accommodates every single workload in the world both in the cloud and on-prem.
This history is relevant when people ask about solution performance and Query Store.
First, understand that there are differences in how Query Store works in Azure SQL Database compared to on-prem. A good example is the amount of space that you can allocate to Query Store within the user database (MAX_STORAGE_SIZE_MB). In Azure SQL Database the maximum value one can set for MAX_STORAGE_SIZE_MB is 10GB. There is no a limit for SQL Server 2016 or 2017. As a result of this limitation for Azure SQL DB, the amount of data that Query Store has to manage can be significantly less than what we see for an on-prem solution. There are many production environments with a Query Store that is 10GB or less in size, but I know of Query Stores that are 200-250GB in size on disk, which typically indicates an anti-pattern with the workload.
Separate from storing the Query Store in the user database, data is also held in different memory buffers (e.g. query hash map, runtime statistics cache store). Data is inserted into these memory buffers for new queries, updated for previously-executed queries, and while data is flushed to disk regularly, it is expected that data continuously resides in these buffers. The data for the query hash map is consistent, but the volume of data in the runtime statistics cache store fluctuates depending on the workload.
There are multiple ways to characterize a workload, but in the case of Query Store, we’re most interested in the number of unique queries generated. We tend to characterize workloads with a high number of unique queries as ad-hoc – those that use Entity Framework or NHibernate, for example. But there are other variations, such as multi-versioned tables, which also create a significant number of unique queries. To be clear, the following are unique queries:
d.Name FROM dbo.Employees e JOIN dbo.Department d ON e.department_id = d.department_id WHERE e.LastName = ‘Harbaugh’;
d.Name FROM dbo.Employees e JOIN dbo.Department d
ON e.department_id = d.department_id
WHERE e.LastName = ‘Winovich’;
d.Name FROM dbo.Employees e JOIN dbo.Department d ON e.department_id = d.department_id WHERE e.lastname = ‘Carr’;
Just like the plan cache, Query Store identifies each of the above queries as unique (even though they all have the same query_hash) based on the text, and assigns each its own query_text_id in Query Store. This query_text_id, combined with context_settings_id, object_id, batch_sql_handle, and query_parameterization_type create a unique hash for each query which Query Store uses internally, and is stored in buffers in memory, along with the runtime statistics for each unique hash. The more unique query texts in a workload, the more overhead there may be to manage the data.
Understand that if you have an ad hoc workload, you already have a system that is prone to performance issues due to high compiles, plan cache bloat, and variability in query performance across queries that are textually the same in terms of query_hash, but have different literal values (as shown above). For an ad-hoc workload that is also high volume (high number of batch requests/sec), when you enable Query Store it can appear that a performance problem has been introduced. It is tempting to look at any decrease in performance as a problem with Query Store. However, it’s a function of the type of the workload and simply the cost of doing business for said workload. If you want to capture Query Store data about an ad-hoc workload (to then identify query patterns and address them) then you’ll have to expect and plan for the overhead associated with it.
You can control, to a small degree, the number of queries captured using the QUERY_CAPTURE_MODE setting. The default value of ALL means that every single executed will be captured. The value of AUTO means that only queries that exceed a threshold (set internally by Microsoft) will be captured. As noted in my Query Store Settings post, AUTO is the recommendation for a production environment, particularly one that is ad-hoc.
The SQL Server team made performance-related improvements in Query Store in the SQL Server 2017 release, and these were back-ported to SQL Server 2016 SP2. There have been a few additional fixes in SQL Server 2017, such as this one, in CU11. I know of a couple people who have run into this issue, so if you’re on SQL Server 2017 and using Query Store, I definitely recommend applying the latest CU.
Now we can answer, “Can enabling Query Store make some of your queries run slower?” It depends on your workload, your version of SQL Server, and the settings you have enabled. For those folks with a mostly procedure-type workload, I haven’t seen many issues. For those with ad-hoc, high volume workloads, you are now aware of the potential overhead and you can plan accordingly. If you’re on the fence about it, enable it during a low-volume time and monitor the system. If you feel there’s a problem, turn it off. But the data gathered on any system can be used to help make that system better, even if you have to do it incrementally. Whether your workload is procedure-based, ad-hoc, or a mix, Query Store is an invaluable resource that can be used to capture query metrics, find queries that perform poorly or execute frequently, and force plans to stabilize query performance.
34 thoughts on “Query Store Performance Overhead: What you need to know”
ever seen query store bury a server Had a prod 2016 server only one database become unusable (could not open in mgmgt studio or in query window with that database name) rebooted instance and vm host, nothing in sql log or windows or cluster. One thing that did show was the query mode had switched to read only, plenty of space on server
I’ve seen performance issues with Query Store enabled when the workload is extremely ad hoc and if it’s an older build of SQL Server. Definitely check out my best practices post (https://www.sqlskills.com/blogs/erin/query-store-best-practices/) to verify configuration, etc. If it switched to read-only it sounds like the space used by Query Store exceeded the value set for MAX_STORAGE_SIZE, and at that point, Query Store would try to run clean up. There were issues with earlier versions of SQL Server 2016, which is is why I recommend the latest SP and CU, but even then, if you have an extremely ad hoc workload, you may not want to run QS.
Have you heard anything about performance issue (takes up to 30-50% CPU) during Cleanup process in Query Store(QS)?
The worst part of that you cannot stop it (it run under system sessions) and you cannot do anything with QS (clean all, change settings or turn it off). A current_storage_size_mb exceeds max_storage_size_mb in times during these process.
From my experience, it happens only when two Cleanup Modes (by date and by size) run at the same time (guessing).
A session with command “query store background flush db” takes hours (several hours for 2Gb max size QS) and blocks other process with QS.
There is an issue related to that, I talk about it here:
What version of SQL Server are you running?
Thanks for the quick response.
Why a “query store background flush db” process blocked everything? It’s a pretty strange design.
When you say “it blocks everything”, do you mean you experience systemic blocking across the entire instance, or just that it blocks changes to Query Store? If it’s the latter, that is by design.
It blocks only QS but QS takes a lot of CPU during this operation, and there is not any way to fix it quickly.
It may be beneficial to size Query Store a bit larger, and also change the retention so that the size-based cleaned up doesn’t have to run. The clean up policies are different, and run differently, and the would not run simultaneously. Depending on your workload (e.g. ad hoc) you may need a larger Query Store.
I’ve increased from 2 to 10Gb, saving just for 5 days. Thanks for your advice.
What is frustrating, why Cleanup process takes so long (6-8 hours when max size limit is just 2Gb). It usually grows up to 6Gb during this process without changing QS to READ ONLY mode. It looks like s/th is wrong in design.
I can’t explain why the cleanup process takes so long without looking at your system. It doesn’t sound correct, and if it’s an issue I’d recommend contacting support as you’re running 13.0.5337.0 which has the fix for the issue I mentioned earlier.
When enabling and disabling querystore (and when it goes to read only mode), would this cause any blocking on the DB nowadays?
We enabled querystore in 2017 and it literally caused our prod DB to grind to a halt due to the amount of ad hoc queries we have. Max size was set to 1GB, but we have worked out that our max size should maybe be 54GB. I think the cleanup process was constantly running as we never reached 1GB and auto setting to read only
We were seeing exclusive locks on select queries and when we changed QS to read only to stop collecting the data, we saw blocking..I expect we should’ve changed capture mode to none and then disabled QS at a quieter time?
We were also thinking that we may need to set paramterization to forced on the DB, but I expect AUTO should be good for QS to handle this (we had it set to ALL in 2017)
Yes, Query Store can struggle in an ad hoc workload. Changing ANY Query Store requires a lock over the Query Store database object, which can block other Query Store background tasks over that database. So yes, any changes should be made when the system is not busy, and yes, you should have the catpure mode set to AUTO. I talk about best practices in this post: https://www.sqlskills.com/blogs/erin/query-store-best-practices/, including settings. There are some ad hoc workloads for which I would not recommend Query Store, even if they are running the latest version of SQL Server. Also, the “ideal” size for a Query Store is 10GB or less, you may have to change retention time in order to keep the QS that small. There are changes in SQL 2019 that allow you to fine tune what Query Store captures. Hope that helps.
Yes that is great…thanks for the speedy reply..I’ve been reading all your query store posts…they are a fantastic read and really helpful…these have been my go to blogs for Querystore info.
Glad to hear someone is reading this stuff and it has helped! 🙂
We have put querystore into our pre prod environment and we are doing load testing. We are seeing quite a performance hit with querystore on.
When we are testing with it off, I ran
ALTER DATABASE [DBName] SET QUERY_STORE (OPERATION_MODE = READ_WRITE, QUERY_CAPTURE_MODE = NONE)
but it is being blocked by an sa spid running
QUERY STORE BACK with WRITELOG waits
This has been running for over 20 minutes now…have you ever seen this before?
Yes, when you run ALTER DATABASE to change a setting related to Query Store, it has to acquire a lock, and that can be blocked by another operation from Query Store. I wrote about an example here:
I’m a little bit confused, however, as you state you are testing with Query Store OFF, and trying to turn it ON first time, but NOT capture any queries, it is blocked?
Sorry I mean we have tested it a few times with no issues, both on and off.
Today we tested it off…then I turned it on, tested again, and tried to turn off…but we get this system process with status ‘background’ last running ‘QUERY STORE BACK’ which is not allowing me to turn QS off. We have not had this issue in the past when turning it off and on…this spid has been active for 6 hours now but it’s not running anything and this is what is not letting me turn querystore off. I can’t kill this either as it’s a system process spid
Both traceflags 7745, 7752 are enabled and we are on the SQL2016 Sp2 CU8
As you noted in another comment, you probably need the latest CU and may be running into a known issue. Since this is a TEST/DEV environment, I would restart, turn off QS, update to the latest CU, then run that exact same workload/test again.
Ah, I see this
We are above the CU that this problem was fixed in
Maybe we have come across a new issue as I can’t see the latest CU10 mentioning anything about QS fixes….I’ll try it and see.
If you’re already on CU8, then from my review of the release notes, I don’t see any other QS fixes in later releases. However, as you said, perhaps you are running into a different issue. If it’s something that you can recreate then I would definitely recommend opening a case with Microsoft.
Thank you for detailed explanation of the QS behavior.
We are experiencing the similar issue, as you mentioned QueryStore might consume CPU while performing data flush. We are on SQl2016(SP2-CU12)every time (15 minute interval) QS does “QUERY STORE BACKGROUND FLUSH DB” we are noticing a significant utilization in CPU and DiskIO causing application to session timeout.
As others mentioned it is not a crazy big QueryStore, it is only 800MB but that 15 minutes flush shows impact on a 12 core machine. Everyone suggests to turnoff QS. But is there any recommendation without shutting down QS.
Thank you in Advance.
What data demonstrates the increased use in CPU and DiskIO? Are you looking at PerfMon, DMVs, or something else? I’d be interesting in trying to collect waits during that time to see if that lines up with additional data. Is your workload ad hoc or procedural? And have you looked to see how much memory the Query Store memory clerks are consuming and how that changes before and after the flush?
I am capturing cpu, waittype, command and physical_io from sysprocesses every 2 seconds and dumping it into a table to check if there is any system processes that is causing CPU spike. Like a clock exactly at 20 minutes(we changed from 15 minute to 20 min to rule out QS is not the issue) there is uptick in the CPUTime and Physical_io until the QS does the QUERY STORE BACKGROUND FLUSH DB command and it will go back to normal once it is done. I have posted the results in the below share.
All the workload on the primary is Procedural and parameterized. And every time we see the CPU tick, MS JDBC driver trying to establish a session gets session time out.
I did not monitor the memory clerks. But I will keep you posted on it.
Ok, and how long does that spike last? I see two sets of 8 entries in the spreadsheet for QUERY STORE BACK command, and they all show a LastBatchTime that’s the same (either 10:18AM or 10:38AM). Since I do not see another set of entries, and you said you are capturing every 2 seconds, does that mean that the spike is less than 2 seconds?
Spike lasts anywhere from 5 seconds to 40 seconds depends on the how long the command is executing on the server. I guess my event capture is not that perfect to show the complete events. This time I used perform counters. I have also included the available QS counters, buffer manager, CPU and AAG counters (if any helpful). I just filtered it out to just during the CPU spike duration of 17 seconds.