This is tale of troubleshooting…
When you unexpectedly or intermittently encounter a change in query performance, it can be extremely frustrating and troublesome for a DBA or developer. If you’re not using Query Store, a third-party application, or your own method to capture query data, then you probably don’t have information about what query performance looked like when things were good…you just know how it’s running now. I was working with a customer of Tim’s last week that had been chasing this exact problem for, in their words, years. They had recently upgraded to SQL Server 2016, and the problem was still occurring.
First Steps When There’s a Change in Query Performance
If there is a change in query performance, we typically look at differences in the execution statistics, as well as differences in the plan in terms of the shape and operations. The execution statistics confirm that performance is worse; duration is higher, CPU and/or IO is higher. With this client, after we verified the execution data, we used Query Store to compare the plans, and they were different. When I see different plans I immediately check the compiled values (input parameters) for each plan. Typically, I expect the different plans are a result of different values – because the query is sensitive to different input parameters. Interestingly enough, in the case of this customer, the different plans had the exact same parameters. My next step? Statistics.
The customer had told me that the only way they could consistently resolve the issue was to rebuild indexes for a few of the tables that were involved in the query. Rebuilding indexes causes an update to statistics with a FULLSCAN. I asked if they had ever tried just updating statistics with FULLSCAN to fix the issue, and they reported that it caused problems when they did. I tabled that issue, as I had a theory about what was happening and wanted to prove that out before digging into a secondary issue.
Collecting Query-Related Data
I set up an Extended Events session to capture the auto_stats event (which fires when statistics automatically update) and filtered on the objects in question.
I also walked them through the statistics information within SQL Server. This was extremely fun. I love statistics, and I love explaining the output of DBCC SHOW_STATISTICS to people. When I’m teaching, I can typically see the light bulbs come on for people as I talk through the stats header, the density vector, and the histogram (especially the histogram!). On the call, I could hear them say “oh!” and “that makes so much more sense!” as I was talking…so fun. But I digress. So, we looked at the statistics for the tables in question, and we could see that because they had rebuilt indexes earlier in the day to resolve the issue, the sample was 100% (from the FULLSCAN). Query performance was good at this point, so we had our baseline.
Then I showed them how to view the stats header information for all statistics for the tables in question using the sys.dm_db_stats_properties DMV and requested that they query the DMV if they had a change in query performance. I explained that they should look to see if statistics had updated, and if so, take note of the sample size.
My hypothesis was that stats were getting automatically updated because Auto Update Statistics was enabled for the database, and enough data had changed to fire the update (realize they are on 2016 with 130 compatibility mode, which means the threshold is lower than it used to be…20% + 500 rows). After statistics updated, all queries that used those statistics were recompiled, and because the sample was the default, it wasn’t an accurate representation of the data in the table, and so a different plan was being generated, even though the same input parameters were used.
Back to the story
I took a few moments and explained my hypothesis, and then we looked at the data in one of the tables. We looked at the distribution of data in each of the leading columns in the indexes, and in one of them, there was extreme skew. If query performance degraded, and statistics had updated with the default sample, I requested that they run UPDATE STATISTICS with FULLSCAN to see if that resolved the issue.
Sure enough, within the next 24 hours query performance changed and statistics for one of the indexes had been updated automatically with the default sample. I correlated this finding with the the Extended Events output as an extra validation step. When they updated statistics with the FULLSCAN, the “good” query plan was used again, query performance improved, and CPU dropped immediately.
A Better Solution
Now that we knew the issue, we had to decide how to address it. I recommended that they manually update statistics again with the FULLSCAN, but this time with the PERSIST_SAMPLE_PERCENT option set to ON. With this option enabled, statistics will retain the sample rate used with the manual update of statistics, even if it’s an automatic update. In this scenario, we wanted to ensure that statistics would always be updated with a FULLSCAN.
The customer was slightly concerned about the implications of the FULLSCAN update running during the day, because the tables were large and the update would generate overhead. This is absolutely a valid concern for tables that are extremely large. To address this, we included the NORECOMPUTE option in the UPDATE STATISTICS command for the statistic in question, ensuring that the statistic would not automatically update even if the number of modifications crossed the threshold.
I explained to the customer that they would definitely need to run a SQL Agent job on a regular basis to update statistics for those which used the NORECOMPUTE option so that the optimizer had updated information about the distribution of data. Using NORECOMPUTE is rare, I find, but it definitely has its use cases. It’s essential to have a job to manually update those stats if you use that option for any of your statistics.
The Take Away
I haven’t heard from the customer since we determined the problem and created a solution, and I take that as a good sign. It’s ironic, because so often when I’m talking about changes in query performance, I mention that updating statistics is not the solution. It can just look like the solution because it causes a recompile and for a parameter-sensitive query, it often “fixes” the issue. But it hides the root problem in that scenario, and updating statistics (via the index rebuild) was hiding the root problem here as well. It wasn’t that the query was parameter sensitive, it’s that the optimizer needed a histogram from a FULLSCAN of the column, due to the skew in the data.
Most of the time, the default sample for a stats update is good enough. But when it isn’t, then you either need a FULLSCAN, or if that doesn’t do it, you need to look at filtered statistics across ranges of the data to give the optimizer better information.