Important Query Store Fixes – Summer 2019

In January I blogged about important Query Store fixes that were available in current CUs. Check my January post for fixes through SQL Server 2016 SP2 CU5 and SQL Server 2017 CU13.

I’m using this post to track Query Store fixes/enhancements since those releases, so you can find updates here as additional CUs are released throughout the summer.

SQL Server 2016 SP2 CU7

FIX: “Non-yielding Scheduler” occurs when you clean up in-memory runtime statistics of Query Store in SQL Server 2016

I definitely recommend that you update to this CU if you are using Query Store with SQL Server 2016 SP2, especially if you have an ad hoc or mixed workload.

Remember that runtime statistics and some query information is stored in-memory, and this CU addresses an issue with cleanup of the runtime statistics, based on the CLEANUP_POLICY and MAX_STORAGE_SIZE_MB.

I have seen issues where the space used for Query Store exceeds the MAX_STORAGE_SIZE_MB settings, and this CU should help address that behavior.  Again, it’s important to make sure your Query Store is sized appropriately, based on the type of workload you have (ad hoc will require more space) and how long you want to keep your data.  While it’s nice to keep 30 days of data, in some scenarios that doesn’t fit within the 10GB “ideal” size for Query Store.

I don’t see this fix in a 2017 CU.

Troubleshooting a Change in Query Performance

This is tale of troubleshooting…

When you unexpectedly or intermittently encounter a change in query performance, it can be extremely frustrating and troublesome for a DBA or developer. If you’re not using Query Store, a third-party application, or your own method to capture query data, then you probably don’t have information about what query performance looked like when things were good…you just know how it’s running now. I was working with a customer of Tim’s last week that had been chasing this exact problem for, in their words, years. They had recently upgraded to SQL Server 2016, and the problem was still occurring.

First Steps When There’s a Change in Query Performance

If there is a change in query performance, we typically look at differences in the execution statistics, as well as differences in the plan in terms of the shape and operations. The execution statistics confirm that performance is worse; duration is higher, CPU and/or IO is higher. With this client, after we verified the execution data, we used Query Store to compare the plans, and they were different. When I see different plans I immediately check the compiled values (input parameters) for each plan. Typically, I expect the different plans are a result of different values – because the query is sensitive to different input parameters. Interestingly enough, in the case of this customer, the different plans had the exact same parameters. My next step? Statistics.

The customer had told me that the only way they could consistently resolve the issue was to rebuild indexes for a few of the tables that were involved in the query. Rebuilding indexes causes an update to statistics with a FULLSCAN. I asked if they had ever tried just updating statistics with FULLSCAN to fix the issue, and they reported that it caused problems when they did. I tabled that issue, as I had a theory about what was happening and wanted to prove that out before digging into a secondary issue.

Collecting Query-Related Data

I set up an Extended Events session to capture the auto_stats event (which fires when statistics automatically update) and filtered on the objects in question.

I also walked them through the statistics information within SQL Server. This was extremely fun. I love statistics, and I love explaining the output of DBCC SHOW_STATISTICS to people. When I’m teaching, I can typically see the light bulbs come on for people as I talk through the stats header, the density vector, and the histogram (especially the histogram!). On the call, I could hear them say “oh!” and “that makes so much more sense!” as I was talking…so fun. But I digress. So, we looked at the statistics for the tables in question, and we could see that because they had rebuilt indexes earlier in the day to resolve the issue, the sample was 100% (from the FULLSCAN). Query performance was good at this point, so we had our baseline.

Then I showed them how to view the stats header information for all statistics for the tables in question using the sys.dm_db_stats_properties DMV and requested that they query the DMV if they had a change in query performance.  I explained that they should look to see if statistics had updated, and if so, take note of the sample size.

Sidebar

My hypothesis was that stats were getting automatically updated because Auto Update Statistics was enabled for the database, and enough data had changed to fire the update (realize they are on 2016 with 130 compatibility mode, which means the threshold is lower than it used to be…20% + 500 rows). After statistics updated, all queries that used those statistics were recompiled, and because the sample was the default, it wasn’t an accurate representation of the data in the table, and so a different plan was being generated, even though the same input parameters were used.

Back to the story

I took a few moments and explained my hypothesis, and then we looked at the data in one of the tables. We looked at the distribution of data in each of the leading columns in the indexes, and in one of them, there was extreme skew.  If query performance degraded, and statistics had updated with the default sample, I requested that they run UPDATE STATISTICS with FULLSCAN to see if that resolved the issue.

Proof!

Sure enough, within the next 24 hours query performance changed and statistics for one of the indexes had been updated automatically with the default sample. I correlated this finding with the  the Extended Events output as an extra validation step.  When they updated statistics with the FULLSCAN, the “good” query plan was used again, query performance improved, and CPU dropped immediately.

A Better Solution

Now that we knew the issue, we had to decide how to address it. I recommended that they manually update statistics again with the FULLSCAN, but this time with the PERSIST_SAMPLE_PERCENT option set to ON. With this option enabled, statistics will retain the sample rate used with the manual update of statistics, even if it’s an automatic update. In this scenario, we wanted to ensure that statistics would always be updated with a FULLSCAN.

The customer was slightly concerned about the implications of the FULLSCAN update running during the day, because the tables were large and the update would generate overhead. This is absolutely a valid concern for tables that are extremely large.  To address this, we included the NORECOMPUTE option in the UPDATE STATISTICS command for the statistic in question, ensuring that the statistic would not automatically update even if the number of modifications crossed the threshold.

I explained to the customer that they would definitely need to run a SQL Agent job on a regular basis to update statistics for those which used the NORECOMPUTE option so that the optimizer had updated information about the distribution of data. Using NORECOMPUTE is rare, I find, but it definitely has its use cases. It’s essential to have a job to manually update those stats if you use that option for any of your statistics.

The Take Away

I haven’t heard from the customer since we determined the problem and created a solution, and I take that as a good sign. It’s ironic, because so often when I’m talking about changes in query performance, I mention that updating statistics is not the solution. It can just look like the solution because it causes a recompile and for a parameter-sensitive query, it often “fixes” the issue. But it hides the root problem in that scenario, and updating statistics (via the index rebuild) was hiding the root problem here as well. It wasn’t that the query was parameter sensitive, it’s that the optimizer needed a histogram from a FULLSCAN of the column, due to the skew in the data.

Most of the time, the default sample for a stats update is good enough. But when it isn’t, then you either need a FULLSCAN, or if that doesn’t do it, you need to look at filtered statistics across ranges of the data to give the optimizer better information.

SQL Server Plan Cache Limits

Last week on SQLPerformance.com I blogged about ad hoc workloads and how they impact performance, noting plan cache bloat, and I had a comment asking about SQL Server plan cache limits.  Kimberly mentions them in her classic post Plan cache and optimizing for adhoc workloads, but I thought I’d put the information into a blog post because I always have trouble finding the original references: Plan Caching in SQL Server 2008 (which is still accurate for SQL Server 2019, as far as I know) and Plan Cache Internals.

There are two limitations related to the plan cache:

  • Number of entries
  • Total size in MB (based on server memory)

By default, the plan cache is limited to 160,036 total entries (40,009 entries per bucket), and size based on max server memory (for SQL Server 2008+ and SQL Server 2005 SP2):

75% of visible target memory from 0 to 4GB                                                                                                             + 10% of visible target memory from 4GB to 64GB                                                                                                   + 25% of visible target memory > 64GB

Here’s the math:

 

SQL Server default plan cache size limits

SQL Server default plan cache size limits

 

If you have an ad hoc workload, you’ll often hit the max number of entries before you hit the space limit, particularly if you have the optimize for ad hoc workloads server option enabled, which stores the plan stub for an ad hoc query on initial execution, rather than the entire plan (the plan stub consumes less space, so this is an attempt to reduce bloat).

For pure ad hoc workloads, if you’re seeing SOS_CACHESTORE spinlock contention, you can use trace flag 174 to increase the number of entries to 160,001 per bucket (640,004 total), which is applicable for:

  • SQL Server 2012 SP1 CU14+
  • SQL Server 2012 SP2 CU5+
  • SQL Server 2014 CU2+
  • SQL Server 2014 SP1 CU1+
  • SQL Server 2016+

Very rarely, I have seen instances where customers have used trace flag 8032 to increase the total size of the plan cache.  This trace flag uses the plan cache limits for SQL Server 2005 RTM:

75% of visible target memory from 0 to 4GB                                                                                                             + 50% of visible target memory from 4GB to 64GB                                                                                                   + 25% of visible target memory > 64GB

The math is below, but note that this can significantly affect the memory available for the buffer pool and this is something I don’t typically recommend…use with caution.

SQL Server plan cache limits with TF 8032

SQL Server plan cache limits with TF 8032

Summary

If you’re seeing plan cache bloat due to an ad hoc workload, take note of the space consumed by your plan cache and consider enabling the optimize for ad hoc setting, and possibly enabling TF 174.  Both of those are short term fixes – ideally you look to parameterize stable queries that are executed most frequently to reduce the number of entries in the plan cache.