This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. If you know someone who would benefit from this class, refer them and earn a $50 Amazon gift card – see class pages for details. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

When you execute a query that’s going to process a single row, the plan to access that row might be very simple – use an index to find the row and then look up the data. If you execute a query that’s going to process thousands of rows, the plan to gather that data might be more complicated. The real question here isn’t the plan itself but how did SQL Server know that there was going to be one row or thousands of rows to access? To create a good plan, SQL Server needs to know your data (before it goes to the data) in order to access the data efficiently.  This is why statistics exist.

What are statistics?

Statistics are objects in the database, stored as a BLOB (binary large object). Generally, you don’t create them directly; they are created with indexes or auto-created by SQL Server when the query optimizer (the system that decides the most efficient way to access the data) needs better information about your data than what it has currently. The latter creation scenario is tied to a database option: auto create statistics. This database option is on by default and for Accidental DBAs, I recommend that this stay on. As for manually creating statistics, there are cases where creating statistics can be extremely useful but they tend to be warranted for VLTs (very large tables). For today, I’ll save that discussion as it’s out of scope for a typical Accidental DBA.*

Statistics give information about the data distribution of the keys described by that statistic (in key order). Statistics exist for all indexes and [column-level] statistics can exist on their own. For example, let’s review the AdventureWorks2012 database. The person.person table has an index called IX_Person_LastName_FirstName_MiddleName on the LastName, FirstName, and MiddleInitial columns of the table. What do the statistics on this index tell me?

USE AdventureWorks2012;
go
DBCC SHOW_STATISTICS ('Person.Person', 'IX_Person_LastName_FirstName_MiddleName');
go

There are 3 results sets returned from the DBCC SHOW_STATISTICS command.

The header

Name Updated Rows Rows Sampled Steps Density Average key   length String Index Filter   Expression Unfiltered Rows
IX_Person_LastName_FirstName_MiddleName Oct 31 2012 12:47PM 19972 8935 200 0.6730038 28.32502 YES NULL 19972

The most important information from the header is when the statistics were last Updated (or when they were created if they’ve never been updated). The second most important is the Rows vs. Rows Sampled columns. Neither of these directly indicates a problem but if queries against this table are not performing and the estimates the queries are using for optimization are not correct, it could be the statistics that are incorrect.

The density vector

All   density Average Length Columns
0.001362398 11.22798 LastName
5.05E-05 23.09927 LastName, FirstName
5.03E-05 24.32502 LastName, FirstName,   MiddleName
5.01E-05 28.32502 LastName, FirstName,   MiddleName, BusinessEntityID

The density vector tells us information about the average distribution of our data. If you multiply the All density * Rows (of the table) you can get some insight into the average distribution of the column(or columns) described by Columns above.

Using LastName alone: 0.001362398 * 19972 = 27.209812856. What this tells me is that the Average number of rows returned for queries that supply JUST a LastName is 27.

Using LastName & FirstName: 5.05E-05 * 19972 = 1.008586. What this tells me is that the combination of LastName and FirstName is almost unique. If I supply BOTH a FirstName and a LastName in my query (using equality), then I should get back 1 row.

This is interesting information – especially for the combinations of the columns beyond the first – because this tells us how much more selective a query can be if we add these additional columns in our WHERE clauses. But, it’s not perfect for LastName alone because we all know that each last name is not going to return 27 rows, right? And, this is where the histogram comes in…

The histogram

RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
Abbas 0 1 0 1
Adams 7.016288 71.19462 3 2.326017
Alan 9.355051 11.12416 3 3.101357
Alexander 18.7101 111.2416 6 3.103471
Zhu 72.50165 68.96979 1 71.9106
Zugelder 23.38763 1 4 5.817025
Zwilling 0.1117823 3.552485 0 27.20037

The histogram can contain up to 201 rows (see the Steps column in the statistics header). These 201 rows are made up of up-to 200 distinct (and actual values) from the table itself AND one row if this leading column allows Nulls. In this case, because our LastName column does not allow Nulls, our histogram has 200 rows (side note: even if your leading column has more than 200 values, it does not guarantee that SQL Server will have 200 steps).

The histogram tells us the most detailed information about our leading column (often referred to as the “high-order element” of the index). It’s surprisingly easy to read:

Abbas 0 1 0 1
Adams 7.016288 71.19462 3 2.326017
Alan 9.355051 11.12416 3 3.101357

For the LastName Abbas there is 1 row equal to this value (EQ_ROWS) and no rows prior to it (no rows in the range).

For the LastName of Adams, there are 71 rows that equal this value (EQ_ROWS) and 7 rows between Abbas and Adams (not including the rows that equal Abbas [1] and Adams [71]) and between these values there are 3 other LastName values. The average number of rows per name between these values is 2.32.

What does this tell me – it tells me that any query requesting rows with a LastName value between Abbas and Adams, will have an estimate of 2.32 rows.

Are statistics accurate?

Well… it depends. There are many factors that affect the accuracy of a statistic. Size of the table, skew of the data, volatility of the table – they all affect the accuracy. At the time of creation, they can be incredibly accurate. But, as data gets modified, they might become less accurate. Because of the nature of how they’re created and what they represent, there’s no way to keep them up to date as individual rows are modified. The only way to update them is when you’re viewing a large amount of the data. When an index is rebuilt, SQL Server updates the index’s statistic with the equivalent of a full scan of the data. Statistics on an index are most accurate after an index rebuild. However, an index reorganize does not update statistics at all because the entire table is not analyzed in one go (only pages with fragmentation are reorganized). So, if you find that your index maintenance scripts are regularly reorganizing indexes then you’ll want to make sure that you also add in statistics maintenance. And, your statistics maintenance should not only include statistics on indexes but any of the other statistics that SQL Server may have created.

Statistics Maintenance

Now that you know statistics provide a valuable role in optimization, it’s also important that this information be accurate. Just as Jonathan mentioned in his post yesterday (The Accidental DBA (Day 14 of 30): Index Maintenance), I also often recommend custom scripts. And, Ola’s scripts even have an option where you only update statistics where data has changed. So, if you run the statistics maintenance after having run index maintenance (and no data has been modified since), then you will only update statistics where there has been data change. This is the most efficient way to update only the statistics that might need to be changed.

Summary

In order for the query optimizer to do a good job, it has to have accurate and up-to-date statistics. My general recommendations for Accidental DBAs is that they should leave both the auto create statistics option and the auto update statistics option on (they are both on by default). However, I would also recommend a specific maintenance routine that updates the statistics manually – off hours – so that the default auto updating mechanism isn’t your primary method for updating statistics. For the optimizer to do a good job at optimizing your queries, statistics have to both exist and be accurate. Proper statistics maintenance is a critical task for helping the optimizer do its job. While there are multiple options for automating this task, custom scripts provide the best method of minimizing the work to be done by performing the most efficient updates based only on data change.

This has only been a short introduction into statistics; there’s a lot more to them. If you’re interested in reading more about statistics check out the whitepaper, Statistics Used by the Query Optimizer in Microsoft SQL Server 2008. Then, check out our SQLskills blogs by category: Statistics.

Thanks for reading!
kt

*Sidenote, as an Accidental DBA, if you have individual tables reaching 100GB or more, we’ll, you might want to talk to management to allow you more time, more knowledge, more administration, more tweaking/tuning of these larger tables. Maybe it’s time your environment considered a full-time DBA (maybe you?) with more dedicated time to managing/tuning your servers. It’s hard to wear so many hats in general but if you’re starting to administer databases with large tables and large databases (1TB+), then maybe you’re not an Accidental DBA any more?