Query Store Best Practices

Regular readers may know that I’m a huge fan of Query Store, but there’s still a need to write a bit more about Query Store best practices.  This isn’t a “you must use this feature” post; this is a “here is what you must know if you want to use this feature” post.

I have a lot of content about Query Store, but maybe what’s really important gets lost amongst everything else.  Glenn prompted me to write this, after I worked with two customers last week that ran into issues related to their configuration of Query Store.  Listed below are the things you must know before you enable Query Store.  If you already have Query Store enabled and running, I recommend reviewing the information to ensure you are following best practices.

Settings

  • Review my post on Query Store Settings to understand what different settings exist for Query Store, what values are recommended for each setting, and why.
    • Most important: set QUERY_CAPTURE_MODE to AUTO and MAX_STORAGE_SIZE_MB to 10GB at the absolute maximum, ideally something less (you may need to adjust CLEANUP_POLICY to keep less data, depending on your workload).  A sample configuration is sketched below this list.
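
As a point of reference, here is a minimal sketch of enabling Query Store with those settings.  The database name and the specific MAX_STORAGE_SIZE_MB and CLEANUP_POLICY values are placeholders; pick values appropriate for your workload based on the settings post above.

-- Enable Query Store with a capped size and automatic, size-based cleanup.
ALTER DATABASE [YourDatabase]
SET QUERY_STORE = ON
    (
        OPERATION_MODE = READ_WRITE,
        QUERY_CAPTURE_MODE = AUTO,
        MAX_STORAGE_SIZE_MB = 2048,                          -- well under the 10GB ceiling
        CLEANUP_POLICY = (STALE_QUERY_THRESHOLD_DAYS = 14),  -- keep two weeks of data
        SIZE_BASED_CLEANUP_MODE = AUTO
    );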

Trace Flags

  • Review my post on Query Store Trace Flags to understand the two Trace Flags which are Query Store-related, what they do, and why you want to enable them.
    • I definitely recommend both 7752 and 7745; a quick example of enabling them follows this list.  If you have 7752 enabled and you are trying to make changes to the Query Store configuration after a restart, please review my post discussing the Query Store data load.  There is no shortcut to make that load go faster, and no way to kill it.  This is why proper settings are important.
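
For reference, here is one way to enable both trace flags globally for the current instance uptime; to make them persist across restarts, you would add them as startup parameters (-T7745 and -T7752) instead.

-- Enable both Query Store-related trace flags globally.
DBCC TRACEON (7745, 7752, -1);

-- Verify which trace flags are currently enabled.
DBCC TRACESTATUS (-1);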

Performance

  • Review my post on Query Store Performance Overhead. If you have a high-volume, ad hoc workload, it is quite possible that you should not enable Query Store.  Jonathan told me about a customer of his that enabled Query Store and filled 10GB of Query Store space within two hours.  We recommended that they not use Query Store at this time.  There are improvements coming in SQL Server 2019 that will hopefully help manage Query Store with this type of workload.  A query you can use to keep an eye on Query Store space follows this list.
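
If you do enable Query Store with a workload like this, it’s worth watching how much of the configured space is actually in use.  This check runs against the current database and uses nothing specific to any one environment:

-- How much of the configured Query Store space is in use, and what state Query Store is in.
-- If the current size reaches the max, Query Store flips to READ_ONLY (see readonly_reason).
SELECT  actual_state_desc,
        readonly_reason,
        current_storage_size_mb,
        max_storage_size_mb,
        query_capture_mode_desc,
        size_based_cleanup_mode_desc
FROM    sys.database_query_store_options;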

Version

  • I highly recommend that you run SQL Server 2016 SP2 CU7 or SQL Server 2017 CU15 (the current CUs at the time of writing; I’ll try to keep this updated). There is a very important fix in SQL Server 2016 SP2 CU7 that I strongly suggest you have in place.  I don’t see this fix in a CU for SQL Server 2017 yet, so it’s possible you could run into the problem it addresses if you have an ad hoc workload.

Ad hoc?

If you’re wondering what qualifies as an ad hoc workload: it’s one that generates a large number of unique, unparameterized query texts, most of which execute only once or twice.  Query Store captures each distinct query text, along with its plans and runtime statistics, so that pattern makes Query Store grow very quickly, which is why the settings above matter so much.

If you still have questions about how to configure Query Store or what to watch out for, please leave a comment!  I’m happy to share as much information and experience as possible.

Important Query Store Fixes – Summer 2019

In January I blogged about important Query Store fixes that were available in current CUs. Check my January post for fixes through SQL Server 2016 SP2 CU5 and SQL Server 2017 CU13.

I’m using this post to track Query Store fixes/enhancements since those releases, so you can find updates here as additional CUs are released throughout the summer.

SQL Server 2016 SP2 CU7

FIX: “Non-yielding Scheduler” occurs when you clean up in-memory runtime statistics of Query Store in SQL Server 2016

I definitely recommend that you update to this CU if you are using Query Store with SQL Server 2016 SP2, especially if you have an ad hoc or mixed workload.

Remember that runtime statistics and some query information are stored in memory, and this CU addresses an issue with cleanup of the runtime statistics based on the CLEANUP_POLICY and MAX_STORAGE_SIZE_MB settings.

I have seen issues where the space used for Query Store exceeds the MAX_STORAGE_SIZE_MB setting, and this CU should help address that behavior.  Again, it’s important to make sure your Query Store is sized appropriately, based on the type of workload you have (an ad hoc workload will require more space) and how long you want to keep your data.  While it’s nice to keep 30 days of data, in some scenarios that doesn’t fit within the 10GB “ideal” size for Query Store.

I don’t see this fix in a 2017 CU.

Troubleshooting a Change in Query Performance

This is a tale of troubleshooting…

When you unexpectedly or intermittently encounter a change in query performance, it can be extremely frustrating and troublesome for a DBA or developer. If you’re not using Query Store, a third-party application, or your own method to capture query data, then you probably don’t have information about what query performance looked like when things were good…you just know how it’s running now. I was working with a customer of Tim’s last week that had been chasing this exact problem for, in their words, years. They had recently upgraded to SQL Server 2016, and the problem was still occurring.

First Steps When There’s a Change in Query Performance

If there is a change in query performance, we typically look at differences in the execution statistics, as well as differences in the plan in terms of its shape and operators. The execution statistics confirm that performance is worse: duration is higher, and CPU and/or I/O is higher. With this client, after we verified the execution data, we used Query Store to compare the plans, and they were different. When I see different plans, I immediately check the compiled values (input parameters) for each plan. Typically, I expect that the different plans are the result of different values, because the query is sensitive to its input parameters. Interestingly enough, in the case of this customer, the different plans had the exact same compiled parameters. My next step? Statistics.
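
If you want to do the same comparison yourself, here is a sketch of pulling the plans and average runtime statistics for a single query from Query Store.  The @query_id value is a placeholder, and the compiled parameter values live in the ParameterList element of each plan’s XML.

-- Compare the plans Query Store has captured for one query (query_id is a placeholder).
-- Open plan_xml and check the ParameterCompiledValue attributes in <ParameterList>
-- to see the input parameters each plan was compiled with.
DECLARE @query_id BIGINT = 42;

SELECT  p.plan_id,
        p.last_execution_time,
        TRY_CONVERT(XML, p.query_plan) AS plan_xml,
        rs.avg_duration_us,
        rs.avg_cpu_time_us,
        rs.avg_logical_reads
FROM    sys.query_store_plan AS p
CROSS APPLY (
        SELECT  AVG(qsrs.avg_duration)         AS avg_duration_us,
                AVG(qsrs.avg_cpu_time)         AS avg_cpu_time_us,
                AVG(qsrs.avg_logical_io_reads) AS avg_logical_reads
        FROM    sys.query_store_runtime_stats AS qsrs
        WHERE   qsrs.plan_id = p.plan_id
) AS rs
WHERE   p.query_id = @query_id;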

The customer had told me that the only way they could consistently resolve the issue was to rebuild indexes for a few of the tables that were involved in the query. Rebuilding indexes causes an update to statistics with a FULLSCAN. I asked if they had ever tried just updating statistics with FULLSCAN to fix the issue, and they reported that it caused problems when they did. I tabled that issue, as I had a theory about what was happening and wanted to prove that out before digging into a secondary issue.

Collecting Query-Related Data

I set up an Extended Events session to capture the auto_stats event (which fires when statistics automatically update) and filtered on the objects in question.
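
A minimal sketch of that kind of session is below.  The session name, file path, database_id, and object_id values are all placeholders for the customer’s actual objects.

-- Capture automatic statistics updates for specific tables.
CREATE EVENT SESSION [TrackAutoStats] ON SERVER
ADD EVENT sqlserver.auto_stats
(
    ACTION (sqlserver.database_name, sqlserver.sql_text)
    WHERE sqlserver.database_id = 10                          -- placeholder database_id
      AND (object_id = 111111111 OR object_id = 222222222)    -- placeholder object_ids
)
ADD TARGET package0.event_file
(
    SET filename = N'C:\Temp\TrackAutoStats.xel'
);
GO

ALTER EVENT SESSION [TrackAutoStats] ON SERVER STATE = START;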

I also walked them through the statistics information within SQL Server. This was extremely fun. I love statistics, and I love explaining the output of DBCC SHOW_STATISTICS to people. When I’m teaching, I can typically see the light bulbs come on for people as I talk through the stats header, the density vector, and the histogram (especially the histogram!). On the call, I could hear them say “oh!” and “that makes so much more sense!” as I was talking…so fun. But I digress. So, we looked at the statistics for the tables in question, and we could see that because they had rebuilt indexes earlier in the day to resolve the issue, the sample was 100% (from the FULLSCAN). Query performance was good at this point, so we had our baseline.
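
If you want to follow along at home, this is all it takes to see the stats header, density vector, and histogram for a single statistic (the table and statistic names here are placeholders):

-- Stats header, density vector, and histogram for one statistic.
DBCC SHOW_STATISTICS ('dbo.YourTable', 'IX_YourTable_LeadingColumn');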

Then I showed them how to view the stats header information for all statistics for the tables in question using the sys.dm_db_stats_properties DMV and requested that they query the DMV if they had a change in query performance.  I explained that they should look to see if statistics had updated, and if so, take note of the sample size.
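
Here is a sketch of the kind of query I walked them through, using sys.dm_db_stats_properties to list the stats header information for every statistic on a table (the table name is a placeholder):

-- Last update time, rows, rows sampled, sample percent, and modifications
-- for every statistic on the table.
SELECT  s.name AS statistic_name,
        sp.last_updated,
        sp.rows,
        sp.rows_sampled,
        CAST(100.0 * sp.rows_sampled / NULLIF(sp.rows, 0) AS DECIMAL(5, 2)) AS sample_pct,
        sp.modification_counter
FROM    sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE   s.object_id = OBJECT_ID(N'dbo.YourTable')
ORDER BY sp.last_updated DESC;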

Sidebar

My hypothesis was that statistics were getting automatically updated because Auto Update Statistics was enabled for the database, and enough data had changed to trigger the update (realize they are on SQL Server 2016 with the database at compatibility level 130, which means the threshold for large tables is lower than the old 20% + 500 rows formula; it’s roughly SQRT(1000 * rows)). After statistics updated, all queries that used those statistics were recompiled, and because the sample was the default, it wasn’t an accurate representation of the data in the table, so a different plan was generated, even though the same input parameters were used.
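
To put some illustrative numbers on that (this is just arithmetic, not data from the customer): for a 10-million-row table, the old formula requires about 2 million modifications before an automatic update, while the dynamic threshold kicks in around 100,000.

-- Illustrative comparison of the old and dynamic auto-update thresholds
-- for a hypothetical table with 10 million rows.
DECLARE @rows BIGINT = 10000000;

SELECT  @rows                                AS table_rows,
        500 + CAST(0.20 * @rows AS BIGINT)   AS old_threshold_modifications,      -- 2,000,500
        CAST(SQRT(1000.0 * @rows) AS BIGINT) AS dynamic_threshold_modifications;  -- 100,000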

Back to the story

I took a few moments and explained my hypothesis, and then we looked at the data in one of the tables. We looked at the distribution of data in each of the leading columns in the indexes, and in one of them, there was extreme skew.  If query performance degraded, and statistics had updated with the default sample, I requested that they run UPDATE STATISTICS with FULLSCAN to see if that resolved the issue.
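
Checking for that kind of skew doesn’t require anything fancy; a simple aggregate over the leading column shows how unevenly the values are distributed (table and column names are placeholders):

-- Row counts per value in the leading index column; a handful of values
-- owning most of the rows is the kind of skew we were looking for.
SELECT  LeadingColumn,
        COUNT(*) AS row_count
FROM    dbo.YourTable
GROUP BY LeadingColumn
ORDER BY row_count DESC;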

Proof!

Sure enough, within the next 24 hours query performance changed, and statistics for one of the indexes had been updated automatically with the default sample. I correlated this finding with the Extended Events output as an extra validation step.  When they updated statistics with the FULLSCAN, the “good” query plan was used again, query performance improved, and CPU dropped immediately.

A Better Solution

Now that we knew the issue, we had to decide how to address it. I recommended that they manually update statistics again with the FULLSCAN, but this time with the PERSIST_SAMPLE_PERCENT option set to ON. With this option enabled, statistics retain the sample rate used by the manual update for subsequent updates, including automatic ones. In this scenario, we wanted to ensure that statistics would always be updated with a FULLSCAN.
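
For reference, that recommendation looks like this (the table and statistic names are placeholders):

-- Update with a full scan and persist the sample rate so that later updates,
-- including automatic ones, also use a 100 percent sample.
UPDATE STATISTICS dbo.YourTable IX_YourTable_LeadingColumn
WITH FULLSCAN, PERSIST_SAMPLE_PERCENT = ON;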

The customer was slightly concerned about the implications of the FULLSCAN update running during the day, because the tables were large and the update would generate overhead. This is absolutely a valid concern for tables that are extremely large.  To address this, we included the NORECOMPUTE option in the UPDATE STATISTICS command for the statistic in question, ensuring that the statistic would not automatically update even if the number of modifications crossed the threshold.
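
Something along these lines, again with placeholder names, adds NORECOMPUTE so the statistic is no longer updated automatically:

-- Full scan now, and NORECOMPUTE so the statistic is excluded from automatic updates;
-- it must then be maintained by a scheduled job.
UPDATE STATISTICS dbo.YourTable IX_YourTable_LeadingColumn
WITH FULLSCAN, NORECOMPUTE;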

I explained to the customer that they would definitely need a SQL Agent job that runs on a regular basis to update any statistics that use the NORECOMPUTE option, so that the optimizer has current information about the distribution of the data. Using NORECOMPUTE is rare, I find, but it definitely has its use cases. If you use that option for any of your statistics, it’s essential to have a job that manually updates them.

The Take Away

I haven’t heard from the customer since we determined the problem and created a solution, and I take that as a good sign. It’s ironic, because so often when I’m talking about changes in query performance, I mention that updating statistics is not the solution. It can just look like the solution, because it causes a recompile, and for a parameter-sensitive query that often “fixes” the issue. But it hides the root problem in that scenario, and updating statistics (via the index rebuild) was hiding the root problem here as well. It wasn’t that the query was parameter sensitive; it was that the optimizer needed a histogram from a FULLSCAN of the column, due to the skew in the data.

Most of the time, the default sample for a stats update is good enough. But when it isn’t, then you either need a FULLSCAN, or if that doesn’t do it, you need to look at filtered statistics across ranges of the data to give the optimizer better information.
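
If you do end up going the filtered statistics route, the syntax is straightforward; here is a sketch with placeholder names and an arbitrary range:

-- A filtered statistic over one range of a skewed column, giving the optimizer
-- a histogram that is specific to that slice of the data.
CREATE STATISTICS Stat_YourTable_LeadingColumn_HotRange
ON dbo.YourTable (LeadingColumn)
WHERE LeadingColumn >= 1000 AND LeadingColumn < 2000
WITH FULLSCAN;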