Last Thursday I presented a session at the PASS Winter 2012 Performance Palooza. It was a great experience and I appreciated the opportunity. The topic was “Troubleshooting Query Plan Quality Issues” and I received a few email questions after the presentation, so I thought I would walk through the full scenario, weaving in a few additional points that were motivated by the questions I received.
First, let me set up the scenario. I used the Credit database, which you can download here.
The first T-SQL I executed updated the member table with ten different city and state_prov combinations:
USE Credit;
GOUPDATE [dbo].[member]
SET [city] = ‘Minneapolis’,
[state_prov] = ‘MN’
WHERE [member_no] % 10 = 0;UPDATE [dbo].[member]
SET [city] = ‘New York’,
[state_prov] = ‘NY’
WHERE [member_no] % 10 = 1;UPDATE [dbo].[member]
SET [city] = ‘Chicago’,
[state_prov] = ‘IL’
WHERE [member_no] % 10 = 2;UPDATE [dbo].[member]
SET [city] = ‘Houston’,
[state_prov] = ‘TX’
WHERE [member_no] % 10 = 3;UPDATE [dbo].[member]
SET [city] = ‘Philadelphia’,
[state_prov] = ‘PA’
WHERE [member_no] % 10 = 4;UPDATE [dbo].[member]
SET [city] = ‘Phoenix’,
[state_prov] = ‘AZ’
WHERE [member_no] % 10 = 5;UPDATE [dbo].[member]
SET [city] = ‘San Antonio’,
[state_prov] = ‘TX’
WHERE [member_no] % 10 = 6;UPDATE [dbo].[member]
SET [city] = ‘San Diego’,
[state_prov] = ‘CA’
WHERE [member_no] % 10 = 7;UPDATE [dbo].[member]
SET [city] = ‘Dallas’,
[state_prov] = ‘TX’
WHERE [member_no] % 10 = 8;
GO
Next, with “Include Actual Execution Plan” enabled, I executed the following query:
SELECT [lastname],
[firstname]
FROM [dbo].[member]
WHERE [city] = ‘Minneapolis’;
GO
Looking at the estimated rows versus actual (using SQL Sentry Plan Explorer), I see that the estimate is spot-on with 1,000 rows estimated and 1,000 rows actual:
Now my database has statistics auto-updates enabled, so even though I don’t have a supporting index on the city column, I do have supporting statistics (which were created in conjunction with my query execution):
EXEC dbo.sp_helpstats ‘member’;
Looking at the statistics information via DBCC SHOW_STATISTICS, I see the following STAT_HEADER, DENSITY_VECTOR, and HISTOGRAM information (highlighting the Minneapolis histogram step):
DBCC SHOW_STATISTICS(‘member’, ‘_WA_Sys_00000006_0CBAE877’);
So we see the DENSITY_VECTOR shows an “all density” value of 0.1 and we also see a Minneapolis RANGE_HI_KEY histogram step with an EQ_ROWS value of 1000.
Next, I executed the following query, looking at city AND state_prov:
SELECT [lastname],
[firstname]
FROM [dbo].[member]
WHERE [city] = ‘Minneapolis’ AND
[state_prov] = ‘MN’
OPTION (RECOMPILE);
GO
Now I personally know that these two columns are correlated, but the query optimizer does not. The query optimizer assumes that these two columns are independent. Here is the estimated versus actual for this query:
We see an estimate of 100 rows, versus the actual 1,000 rows.
We also have new statistics generated for the state_prov column:
EXEC dbo.sp_helpstats ‘member’;
And notice that the statistics_keys are for single-column statistics. SQL Server does not automatically generate multi-column statistics.
Looking at the statistics information via DBCC SHOW_STATISTICS, I see the following STAT_HEADER, DENSITY_VECTOR, and HISTOGRAM information (highlighting the MN histogram step):
DBCC SHOW_STATISTICS(‘member’, ‘_WA_Sys_00000007_0CBAE877’);
The “all density” value for state_prov is 0.125 (and notice that unlike with city, we have one state that has 3,000 EQ_ROWS value, for TX).
So the MN step shows 1,000 rows out of 10,000 rows (10%). And the Minneapolis step shows 1,000 rows out of 10,000 rows (10%). But SQL Server is not assuming that these two columns are correlated, and so we end up with an estimate of 1% of the rows (0.10 * 0.10). And while this is a small scale example, imagine this for much larger skews. What kind of impact could this have on the query execution plan? And how many times do you have predicates referencing correlated columns in your query?
Now, to help out the query optimizer, I can manually create multi-column statistics on city and state_prov:
CREATE STATISTICS [member_city_state_prov]
ON [dbo].[member]([city],[state_prov]);
GO
If I re-execute my original query with city and state_prov predicates, I see that my estimates are now exactly correct:
But this isn’t the end of the story, because if you look at the STAT_HEADER, DENSITY_VECTOR, and HISTOGRAM of the manually created statistics, you’ll see the following:
Notice that the DENSITY_VECTOR shows two rows – one with city and one with city, state_prov. Both show an “all density” of 0.1 – which reflects our correlation between the two columns.
But also notice that the HISTOGRAM does NOT show multi-column steps. It just shows the leading statistics key column, city – with steps equal to the various city values. So in the case of Minneapolis, MN – the “all density” value was correct.
What about a scenario where I pick a mismatched city and state_prov combination (Minneapolis and Texas)?
SELECT [lastname],
[firstname]
FROM [dbo].[member]
WHERE [city] = ‘Minneapolis’ AND
[state_prov] = ‘TX’
OPTION (RECOMPILE);
GO
This time we get the following cardinality estimate skew:
We estimated the rows based on DENSITY_VECTOR, but without a multi-column HISTOGRAM, the query optimizer doesn’t know that there are no Minneapolis city and Texas state_prov rows. So while multi-column statistics can be helpful, there are limits.
Now what if I drop my statistics and add in a multi-column index instead?
DROP STATISTICS [dbo].[member].[member_city_state_prov];
GOCREATE INDEX [member_city_state_prov]
ON [dbo].[member]([city],[state_prov]);
GO
The multi-column index will provide me with the same results for the Minneapolis / MN combination – as well as the same skew for the Minneapolis / TX combo.
One question I received was around the index choice after I created the index on city and state_prov. Why didn’t that index get used via an index seek?
Well, the new index did get used from a cardinality estimate perspective, but the final plan choice involved a clustered index scan. The warning indicator you see on the SELECT was for the following missing index:
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[member] ([city],[state_prov])
INCLUDE ([lastname],[firstname])
GO
This is where we should explore the cost alternatives. Below shows the side-by-side costs of a Clustered Index Scan, a (forced) bookmark lookup using the index I created earlier just on city and state, and then finally the missing index suggestion with the additional INCLUDE columns (using INDEX hints to force the three different options):
— Clustered index scan
SELECT [lastname],
[firstname]
FROM [dbo].[member]
WITH (INDEX = [member_ident])
WHERE [city] = ‘Minneapolis’ AND
[state_prov] = ‘MN’
OPTION (RECOMPILE);
GO— Non-covering NCI
SELECT [lastname],
[firstname]
FROM [dbo].[member]
WITH (INDEX = [member_city_state_prov])
WHERE [city] = ‘Minneapolis’ AND
[state_prov] = ‘MN’
OPTION (RECOMPILE);
GO— Covering NCI
SELECT [lastname],
[firstname]
FROM [dbo].[member]
WITH (INDEX = [member_city_state_prov_2])
WHERE [city] = ‘Minneapolis’ AND
[state_prov] = ‘MN’
OPTION (RECOMPILE);
GO
So while the query optimizer used the index statistics for my cardinality estimate, the estimated cost of a Clustered Index Scan was 0.011 versus using the non-covering nonclustered index and key lookup cost of 0.158. And of course, the fully covering index was the cheapest estimated cost out of the three – at 0.001, although whether it makes sense holistically to accommodate that one query with a covering index is another topic altogether.
3 thoughts on “Exploring Column Correlation and Cardinality Estimates”
I always wondered why multi-column statistics do not exist. It seems like such an obvious extension. And a helpful one as proven by this blog post.
Hi Thymin1 – Agreed!
Comments are closed.