Last week I presented a session, Demystifying Statistics in SQL Server, at the PASS Community Summit, and I had a lot of great questions; so many that I’m creating multiple posts to answer them. This first post is dedicated to questions specific to creating statistics in SQL Server.
- Q: Any book recommendations that teach most of this info?
- The online documentation, Statistics, is a good place to start.
- You can download, for free, a PDF version of Inside the SQL Server Query Optimizer by Benjamin Nevarez, which includes information specific to statistics. It is from 2010, but some information remains relevant.
- Q: If there are any AG distinctions you can make that would be awesome
- There is nothing different about how statistics are created or updated for a read-write database on the primary replica in an Availability Group.
- However, on a readable secondary replica, statistics can be created to support queries against the read-only copy of the database. SQL Server cannot create statistics in a read-only database, so it creates them in tempdb instead; the query below shows how to find them.
- Additional reading: Offload read-only workload to secondary replica of an Always On availability group
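- A quick check you can run against the readable secondary: the temporary statistics are flagged in sys.stats, and their names carry the _readonly_database_statistic suffix.

```sql
-- Run against the readable secondary: list the temporary statistics created in
-- tempdb to support read-only queries. They are flagged as is_temporary and
-- their names end with the _readonly_database_statistic suffix.
SELECT  OBJECT_NAME(s.[object_id]) AS table_name,
        s.[name]                   AS stat_name,
        s.auto_created,
        s.is_temporary
FROM    sys.stats AS s
WHERE   s.is_temporary = 1;
```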
- Q: Are all stats created equal between MI and Azure SQL?
- There are no documented differences at this time related to statistics and their creation/existence for on-premises installations, Azure SQL, and Managed Instance.
- Q: Should I use these settings with SharePoint database?
- During the session I shared the following information about database settings related to statistics (a script to check and change them is below):
- AUTO_CREATE_STATISTICS – enabled by default; recommend leaving it enabled
- AUTO_UPDATE_STATISTICS – enabled by default; recommend leaving it enabled
- AUTO_UPDATE_STATISTICS_ASYNC – disabled by default; enable if desired, but understand what it does
- AUTO_CREATE_STATISTICS (INCREMENTAL = ON) – disabled by default; recommend enabling it if you use partitioning
- For SharePoint, the documentation states to disable AUTO_CREATE_STATISTICS.
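- If you want to check where a database currently stands on these settings, or change them, something like the following works (YourDatabase is a placeholder):

```sql
-- Check the statistics-related settings for your databases
SELECT  [name],
        is_auto_create_stats_on,
        is_auto_update_stats_on,
        is_auto_update_stats_async_on,
        is_auto_create_stats_incremental_on
FROM    sys.databases;

-- Change them as needed (YourDatabase is a placeholder)
ALTER DATABASE [YourDatabase] SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE [YourDatabase] SET AUTO_UPDATE_STATISTICS ON;
ALTER DATABASE [YourDatabase] SET AUTO_UPDATE_STATISTICS_ASYNC OFF;
ALTER DATABASE [YourDatabase] SET AUTO_CREATE_STATISTICS ON (INCREMENTAL = ON);
```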
- Q: You mentioned stats are created when columns are queried or joined on. Do you specifically mean columns in the WHERE clause and ON? Or will columns in SELECT get auto created stats?
- A statistic will be automatically created for a column in a predicate (WHERE clause) or a JOIN, if that column is not the leading column in an existing index (for which a statistic already exists). Statistics are not automatically created for columns that appear only in the SELECT list.
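- A quick way to see this for yourself is to filter on a non-indexed column and then look at sys.stats. This is a minimal sketch using a throwaway demo table (dbo.StatsDemo is hypothetical); with AUTO_CREATE_STATISTICS enabled you should see a statistic with the _WA_Sys naming convention appear for the predicate column, and nothing for the column that was only selected:

```sql
-- Hypothetical demo table; AUTO_CREATE_STATISTICS must be ON for the database.
CREATE TABLE dbo.StatsDemo (ID int IDENTITY(1,1) PRIMARY KEY, City varchar(50), Total money);
INSERT dbo.StatsDemo (City, Total)
SELECT TOP (10000)
       'City' + CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) % 50 AS varchar(10)),
       10.00
FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b;

-- Predicate on City (not indexed); Total appears only in the SELECT list.
SELECT Total FROM dbo.StatsDemo WHERE City = 'City7';

-- Expect an auto-created (_WA_Sys...) statistic on City, but none on Total.
SELECT s.[name], s.auto_created, c.[name] AS column_name
FROM sys.stats AS s
JOIN sys.stats_columns AS sc ON sc.[object_id] = s.[object_id] AND sc.stats_id = s.stats_id
JOIN sys.columns AS c ON c.[object_id] = sc.[object_id] AND c.column_id = sc.column_id
WHERE s.[object_id] = OBJECT_ID('dbo.StatsDemo');
```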
- Q: How can we tell which statistics were created for a search predicate and which for a join, given they have the same naming convention?
- All column statistics created automatically, whether because they were in the WHERE clause or JOIN, follow the same naming convention, and there is no way to know if they were created because they were in the query predicate or used in a join condition.
- Q: Do you know why MS disabled “auto create incremental statistics” by default? Is there any downside to enabling it by default?
- I do not know why it is disabled by default; my guess would be that they would rather allow folks to opt-in. A high percentage of databases do not use partitioning, so this is useful for a small number of implementations. I only recommend enabling it if you are using partitioning.
- Q: If you rename an index, does the stat also rename?
- Yes.
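- As a quick illustration (the table and index names here are hypothetical): the statistic that backs an index shares the index's name, and renaming the index with sp_rename carries the new name through to sys.stats.

```sql
-- Hypothetical names: rename an index, then confirm the statistic name follows.
EXEC sp_rename @objname = N'dbo.StatsDemo.IX_StatsDemo_City',
               @newname = N'IX_StatsDemo_City_Renamed',
               @objtype = N'INDEX';

SELECT [name]
FROM sys.stats
WHERE [object_id] = OBJECT_ID('dbo.StatsDemo');
```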
- Q: Can we assume that a lot of these automatically created stats mean a lack of indexes for a given table?
- I think that’s a very broad assumption and not entirely accurate. Imagine a table with 30 columns and 5 different indexes that lead on 5 different columns. If every query has at least one of those columns in the predicate, but other columns are also in the predicate, and those columns vary because users can query on different criteria, then you could end up with an automatically created column-level statistic for each of the other 25 columns. But if the 5 indexes mostly satisfy the queries, you wouldn’t necessarily need more indexes, and the 25 auto-created stats are not a bad thing.
- Q: Can you provide an example of when you would want to create a stat manually instead of an index (which would create a statistic anyway)?
- Not every column needs to be indexed, even if it is in a predicate. There is a LOT more to that statement, but in general, the only time I create statistics is when I create filtered statistics (see the next question for more info).
- Q: If the table is large, there are not enough steps for the data distribution to be accurately described by 200 histogram steps. Is there a pattern for dealing with this?
- Create filtered statistics to better represent the data distribution in the table. Kimberly has a great session (Skewed Data, Poor Cardinality Estimates, and Plans Gone Bad) that discusses this and a set of scripts that I recommend.
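- As a minimal sketch (the table and column names are hypothetical): a filtered statistic limits the histogram to a subset of rows, so a skewed slice of the data gets its own 200 steps instead of sharing them with the entire table.

```sql
-- Hypothetical example: give the optimizer a dedicated histogram for just the
-- skewed slice of the data, rather than 200 steps spread across the whole table.
CREATE STATISTICS Stat_Orders_OrderDate_Backordered
ON dbo.Orders (OrderDate)
WHERE [Status] = 'Backordered'
WITH FULLSCAN;
```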
- Q: Are there reasons/cases when you should create custom stats? Can these custom stats be used as a tool to help speed up query performance?
- As noted above, in cases where tables are very large and the histogram does not accurately represent the distribution of data in a column, filtered statistics can provide the optimizer with significantly more information.
- Sometimes it’s helpful to create filtered statistics for partitioned tables. Even though you can update statistics at the partition level, those stats are folded into the histogram for the table, and those partition-level stats are not used by the Query Optimizer. But if you create a filtered statistic for each partition, that stat is used by the optimizer.
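- Here is a sketch of what that might look like for a table partitioned by OrderDate (names and boundaries are hypothetical); each filtered statistic aligns with one partition’s range.

```sql
-- Hypothetical partition-aligned filtered statistics, one per partition range.
CREATE STATISTICS Stat_Sales_CustomerID_2023 ON dbo.Sales (CustomerID)
WHERE OrderDate >= '20230101' AND OrderDate < '20240101' WITH FULLSCAN;

CREATE STATISTICS Stat_Sales_CustomerID_2024 ON dbo.Sales (CustomerID)
WHERE OrderDate >= '20240101' AND OrderDate < '20250101' WITH FULLSCAN;
```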
- Q: What is your opinion on duplicate auto-created stats when you then deploy a supporting index that has the same leading column? I tried to drop them as they seem to interfere with execution plans, but when using Kendal’s script, I get an access error under sa rights on SQL Server 2019 EE.
- I’ve seen this scenario occur – where a column statistic exists because it was created automatically, and then later on an index is created that leads on that column, so now there are two statistics in the system for the same column. These do not necessarily get updated at the same time, so the information for one can be more accurate than the other.
- I recommend dropping duplicate statistics (in this case, the original, column statistic). I pulled the code from Kendal’s script and recreated a scenario, then ran it and did not have an issue on SQL Server 2019. I’d have to see the exact error to hypothesize the issue.
- Jason Brimhall (who was the moderator for my session and did an amazing job) wondered if maybe the script was pulling an index statistic for the DROP statement in error. You would not be able to drop an index statistic (you could only drop the index, which would remove the statistic).
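- For reference, this is a simplified sketch (not Kendal’s script) of how you could find auto-created column statistics whose column is also the leading key column of an index on the same table; those are the candidates I would consider dropping with DROP STATISTICS, while the index statistic stays with its index.

```sql
-- Simplified sketch: auto-created column statistics whose column is also the
-- leading key column of an index on the same table (i.e., likely duplicates).
SELECT  OBJECT_NAME(s.[object_id]) AS table_name,
        s.[name]                   AS auto_stat_name,
        c.[name]                   AS column_name,
        i.[name]                   AS overlapping_index_name
FROM    sys.stats AS s
JOIN    sys.stats_columns AS sc
        ON  sc.[object_id] = s.[object_id]
        AND sc.stats_id = s.stats_id
        AND sc.stats_column_id = 1            -- leading column of the statistic
JOIN    sys.columns AS c
        ON  c.[object_id] = sc.[object_id]
        AND c.column_id = sc.column_id
JOIN    sys.index_columns AS ic
        ON  ic.[object_id] = s.[object_id]
        AND ic.column_id = c.column_id
        AND ic.key_ordinal = 1                -- leading key column of the index
JOIN    sys.indexes AS i
        ON  i.[object_id] = ic.[object_id]
        AND i.index_id = ic.index_id
WHERE   s.auto_created = 1;
```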
Up next: questions related to updating statistics.