I started off tonight playing with the new page compression feature.  So far I like it.  I haven’t found something yet about which I want to post (which is code for “I’m still looking for the seams ;)”), but I have some other things you can try to learn a few things about how the SQL Server QP makes assumptions about various kinds of predicates in queries during cardinality estimations.

So, you may or may not know much about how the SQL Server QP figures out what plans to run.  For the uninitiated, it almost seems like some form of magic.  In some ways, it is – it’s very powerful and poorly understood by many, and it usually requires very little effort by someone skilled in the area to make something amazing happen.  SQL Server merely needs to make itself sparkle when I fix a query plan and I’m set for life :).

In reality, SQL Server uses a cost-based optimizer, which means that it keeps track of all sorts of interesting statistical information, row counts, page counts, etc.  It uses all of these in formulas to come up with numbers for each plan fragment and then it weighs the relative costs of all of these to pick a plan that has the “least cost”.  That sounds nice and absolute until you get to go actually try to make that work, and then you are left with all sorts of nasty questions like:

* What should the cost formulas be?
* Do the numbers need to differ based on the customer’s hardware? How do we calibrate all of this stuff?  What do we do as machines get faster?
* How do I estimate how many rows are going to come back from one predicate in my WHERE clause or join in time less than running the query to figure it out?
* Same question when I have a bunch of preciates?

Eventually, the QP has to make a set of assumptions so that they can come up with a plan in a reasonable amount of time, both because customers don’t like things to ever take time and because managers don’t like customers to tell them about how much time something should take..  One assumption might be that, data is uniformly distributed over a data type’s possible values when you don’t have any better information.  This can help make it possible to come up with solutions that work well most of the time.  The problem is that estimates can be wrong, and that can cause the QP to come up with a different plan than had it had correct information to use when selecting the plan.

So, I’ll show you an example here.  To be clear, I’m not saying that this is something that is “broken”.  This just exposes a place where 2 different assumptions rub up against each other in a way that will SEEM odd to the outside observer.  When you consider the average customer use cases, these assumptions are not bad and work very well the vast majority of the time…

To the example:

drop table comp1
create table comp1(col1 int identity, col3 nvarchar(3000))

declare @i int 
set @i=0
while @i < 70000
begin
insert into comp1 (col3) values (N'123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890')
set @i=@i+1
end

So I create a table with the same long string in it 70,000 times. 

Then I run a query with a where clause just to get some statistics created:

select  * from comp1 where col3 like '%4567890%'
dbcc show_statistics ('comp1', col3)

Once we have all of this stuff, we can look at the estimates for two very similar queries:

select  * from comp1 where col3 like '123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890%'

select  * from comp1 where col3 like '%123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890%'

(one is LIKE ‘abc%’.  The other is LIKE ‘%abc% where abc is the value we have inserted 70,000 times).

So, both queries will return 70,000 rows. 

Well, the abc% pattern query estimates 70,000 rows (good!).  The second query estimates 69.6985 rows.  that’s a bit under ;).

Let’s talk about this a bit more so you can understand why.  In the first query, there is an exact string match against a column represented in the statistics histogram.  So, the likely outcome is to take that cardinality count to determine the number of rows that will likely be returned from the query.  In this case, we expect all rows to come back.

In the second one, there is no mechanism to estimate the cardinality of ANY string of this size (SQL Server does have a feature that does smaller strings which is exposed as the “STRING INDEX” in the histogram, but you can’t see the details of this object in 2005 and I haven’t seen that change in 2008 either).  So, for really large strings, it is left with… guessing.

So, that 69.6985 number is an estimate that is partially based on the length of the string.  Now, the QP could try to walk through each statitistics object and try to find substrings against any existing piece of statistical data and then try to adjust its estimate.  In practice, though, the cost of doing that is expensive.  The various statistics objects are run at different times and have different sample rates, so even then they will vary somewhat.  Finally, for most cases it may just not impact the plan choice that much.  Odds are, though, that this will bite at least one of my readers at some point.  So, this is good to know – it can help you find that spot where the assumption in the QP is causing your query plan to be wrong.  This is the sort of case where you will want to consider a query hint to help the QP out.

There are more assumptions (and seams between them) in the cardinality estimation code.  I’ll let you guys go hunt for them a bit to find them.

Happy Querying!
Conor Cunningham