One of the areas I managed in SQL Server had to do with the code that automatically builds statistics and uses them during query optimization to come up with a good plan. Today I’m going to talk a bit about how statistics are built and how this works with parallelism and partitioning.
First things first. There are two ways in which statistics are created in SQL Server:
1. You create an index.
2. You run a query that needs statistics when they do not exist and the server has “auto-create statistics” enabled. (That’s the simple definition – there are actually caveats I am skipping)
Why does creating an index also create statistics? Well, the main cost in creating statistics is reading all the pages into memory. The thought is that if you want to create an index, you have already paid the expensive cost and the server might as well go ahead and create the statistics object for you instead of re-reading those pages later. So, you get these “for free” (well, almost free).
When you run a query that tries to perform an operation where cardinality estimates would be improved by having statistical information about a column, the server can try to create statistics during the compilation process. Since compilation time needs to be kept to a minimum, these are usually done over a _sample_ of the pages to avoid making the compilation of a simple query as expensive as an index build.
So the basic plan to create an index is a singlethreaded plan that is something like this:
INSERT (New B-Tree)
SCAN(Heap or some index)
So during this plan’s execution, there is an implicit side-effect to also create this statistics object.
For auto-stats, there is a separate query that is run. The syntax is not public, but you can see artifacts of this if you look at the profiler and you’ve turned on the various auto-stats and plan outputs:
drop table stat1 create table stat1(col1 int, col2 int, col3 binary(6000)) declare @i int set @i=0 while @i < 1000 begin insert into stat1(col1, col2) values (rand()*1000, rand()*1000) set @i=@i+1 end select * from stat1 where col2 > 5
|–Sort(ORDER BY:([t1].[dbo].[stat1].[col2] ASC))
What is this? Well, this is the plan that is used to generate the statistics object. It scans a table, sorts it (into ascending order), and then feeds it into this magical thing called statman.
Now, the details of statman are undocumented, but you can infer that it is a special internal aggregate function that is being run inside of a group by operation (stream aggregate in the plan). This means that it is collapsing all the rows into some BLOB and then it does something with this.
Next time I hope to talk about parallel statistics build.