Last Thursday I presented a session at the PASS Winter 2012 Performance Palooza<\/a>.\u00a0 It was a great experience and I appreciated the opportunity.\u00a0 The topic was \u201cTroubleshooting Query Plan Quality Issues\u201d and I received a few email questions after the presentation, so I thought I would walk through the full scenario, weaving in a few additional points that were motivated by the questions I received.<\/p>\n First, let me set up the scenario.\u00a0 I used the Credit database, which you can download here<\/a>.<\/p>\n The first T-SQL I executed updated the member table with ten different city and state_prov combinations:<\/p>\n USE Credit; UPDATE\u00a0 [dbo].[member] UPDATE\u00a0 [dbo].[member] UPDATE\u00a0 [dbo].[member] UPDATE\u00a0 [dbo].[member] UPDATE\u00a0 [dbo].[member] UPDATE\u00a0 [dbo].[member] UPDATE\u00a0 [dbo].[member] UPDATE\u00a0 [dbo].[member] UPDATE\u00a0 [dbo].[member] Next, with \u201cInclude Actual Execution Plan\u201d enabled, I executed the following query:<\/p>\n SELECT\u00a0 [lastname], Looking at the estimated rows versus actual (using SQL Sentry Plan Explorer<\/a>), I see that the estimate is spot-on with 1,000 rows estimated and 1,000 rows actual:<\/p>\n Now my database has statistics auto-updates enabled, so even though I don\u2019t have a supporting index on the city column, I do<\/em> have supporting statistics (which were created in conjunction with my query execution):<\/p>\n EXEC dbo.sp_helpstats ‘member’;<\/p><\/blockquote>\n Looking at the statistics information via DBCC SHOW_STATISTICS, I see the following STAT_HEADER, DENSITY_VECTOR, and HISTOGRAM information (highlighting the Minneapolis histogram step):<\/p>\n DBCC SHOW_STATISTICS(‘member’, ‘_WA_Sys_00000006_0CBAE877’);<\/p><\/blockquote>\n So we see the DENSITY_VECTOR shows an \u201call density\u201d value of 0.1 and we also see a Minneapolis RANGE_HI_KEY histogram step with an EQ_ROWS value of 1000.<\/p>\n Next, I executed the following query, looking at city AND state_prov:<\/p>\n SELECT\u00a0 [lastname], Now I personally know that these two columns are correlated, but the query optimizer does not<\/em>.\u00a0 The query optimizer assumes that these two columns are independent<\/em>.\u00a0 Here is the estimated versus actual for this query:<\/p>\n We see an estimate of 100 rows, versus the actual 1,000 rows.<\/p>\n We also have new statistics generated for the state_prov column:<\/p>\n EXEC dbo.sp_helpstats ‘member’;<\/p><\/blockquote>\n And notice that the statistics_keys are for single-column statistics.\u00a0 SQL Server does not automatically generate multi-column statistics.<\/p>\n Looking at the statistics information via DBCC SHOW_STATISTICS, I see the following STAT_HEADER, DENSITY_VECTOR, and HISTOGRAM information (highlighting the MN histogram step):<\/p>\n DBCC SHOW_STATISTICS(‘member’, ‘_WA_Sys_00000007_0CBAE877’);<\/p><\/blockquote>\n The \u201call density\u201d value for state_prov is 0.125 (and notice that unlike with city, we have one state that has 3,000 EQ_ROWS value, for TX).<\/p>\n So the MN step shows 1,000 rows out of 10,000 rows (10%).\u00a0 And the Minneapolis step shows 1,000 rows out of 10,000 rows (10%).\u00a0 But SQL Server is not assuming that these two columns are correlated, and so we end up with an estimate of 1% of the rows (0.10 * 0.10).\u00a0 And while this is a small scale example, imagine this for much larger skews.\u00a0 What kind of impact could this have on the query execution plan?\u00a0 And how many times do you have predicates referencing correlated columns in your query?<\/p>\n Now, to help out the query optimizer, I can manually create multi-column statistics on city and state_prov:<\/p>\n CREATE STATISTICS [member_city_state_prov] If I re-execute my original query with city and state_prov predicates, I see that my estimates are now exactly correct:<\/p>\n But this isn\u2019t the end of the story, because if you look at the STAT_HEADER, DENSITY_VECTOR, and HISTOGRAM of the manually created statistics, you\u2019ll see the following:<\/p>\n Notice that the DENSITY_VECTOR shows two rows \u2013 one with city and one with city, state_prov.\u00a0 Both show an \u201call density\u201d of 0.1 \u2013 which reflects our correlation between the two columns.<\/p>\n But also notice that the HISTOGRAM does NOT show multi-column steps.\u00a0 It just shows the leading statistics key column, city \u2013 with steps equal to the various city values.\u00a0 So in the case of Minneapolis, MN \u2013 the \u201call density\u201d value was correct.<\/p>\n What about a scenario where I pick a mismatched city and state_prov combination (Minneapolis and Texas)?<\/p>\n SELECT\u00a0 [lastname], This time we get the following cardinality estimate skew:<\/p>\n We estimated the rows based on DENSITY_VECTOR, but without a multi-column HISTOGRAM, the query optimizer doesn\u2019t know that there are no Minneapolis city and Texas state_prov rows.\u00a0 So while multi-column statistics can be helpful, there are limits.<\/p>\n Now what if I drop my statistics and add in a multi-column index instead?<\/p>\n DROP STATISTICS\u00a0 [dbo].[member].[member_city_state_prov]; CREATE INDEX [member_city_state_prov] The multi-column index will provide me with the same results for the Minneapolis \/ MN combination \u2013 as well as the same skew for the Minneapolis \/ TX combo.<\/p>\n One question I received was around the index choice after I created the index on city and state_prov.\u00a0 Why didn\u2019t that index get used via an index seek?<\/p>\n Well, the new index did get used from a cardinality estimate perspective, but the final plan choice involved a clustered index scan.\u00a0 The warning indicator you see on the SELECT was for the following missing index:<\/p>\n CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>] This is where we should explore the cost alternatives.\u00a0 Below shows the side-by-side costs of a Clustered Index Scan, a (forced) bookmark lookup using the index I created earlier just on city and state, and then finally the missing index suggestion with the additional INCLUDE columns (using INDEX hints to force the three different options):<\/p>\n — Clustered index scan — Non-covering NCI
\nGO<\/p>\n
\nSET\u00a0\u00a0\u00a0\u00a0 [city] = ‘Minneapolis’,
\n[state_prov] = ‘MN’
\nWHERE\u00a0\u00a0 [member_no] % 10 = 0;<\/p>\n
\nSET\u00a0\u00a0\u00a0\u00a0 [city] = ‘New York’,
\n[state_prov] = ‘NY’
\nWHERE\u00a0\u00a0 [member_no] % 10 = 1;<\/p>\n
\nSET\u00a0\u00a0\u00a0\u00a0 [city] = ‘Chicago’,
\n[state_prov] = ‘IL’
\nWHERE\u00a0\u00a0 [member_no] % 10 = 2;<\/p>\n
\nSET\u00a0\u00a0\u00a0\u00a0 [city] = ‘Houston’,
\n[state_prov] = ‘TX’
\nWHERE\u00a0\u00a0 [member_no] % 10 = 3;<\/p>\n
\nSET\u00a0\u00a0\u00a0\u00a0 [city] = ‘Philadelphia’,
\n[state_prov] = ‘PA’
\nWHERE\u00a0\u00a0 [member_no] % 10 = 4;<\/p>\n
\nSET\u00a0\u00a0\u00a0\u00a0 [city] = ‘Phoenix’,
\n[state_prov] = ‘AZ’
\nWHERE\u00a0\u00a0 [member_no] % 10 = 5;<\/p>\n
\nSET\u00a0\u00a0\u00a0\u00a0 [city] = ‘San Antonio’,
\n[state_prov] = ‘TX’
\nWHERE\u00a0\u00a0 [member_no] % 10 = 6;<\/p>\n
\nSET\u00a0\u00a0\u00a0\u00a0 [city] = ‘San Diego’,
\n[state_prov] = ‘CA’
\nWHERE\u00a0\u00a0 [member_no] % 10 = 7;<\/p>\n
\nSET\u00a0\u00a0\u00a0\u00a0 [city] = ‘Dallas’,
\n[state_prov] = ‘TX’
\nWHERE\u00a0\u00a0 [member_no] % 10 = 8;
\nGO<\/p><\/blockquote>\n
\n[firstname]
\nFROM\u00a0\u00a0\u00a0 [dbo].[member]
\nWHERE\u00a0\u00a0 [city] = ‘Minneapolis’;
\nGO<\/p><\/blockquote>\n
<\/a><\/p>\n
<\/a><\/p>\n
<\/a><\/p>\n
\n[firstname]
\nFROM\u00a0\u00a0\u00a0 [dbo].[member]
\nWHERE\u00a0\u00a0 [city] = ‘Minneapolis’ AND
\n[state_prov] = ‘MN’
\nOPTION (RECOMPILE);
\nGO<\/p><\/blockquote>\n
<\/a><\/p>\n
<\/a><\/p>\n
<\/a><\/p>\n
\nON [dbo].[member]([city],[state_prov]);
\nGO<\/p><\/blockquote>\n
<\/a><\/p>\n
<\/a><\/p>\n
\n[firstname]
\nFROM\u00a0\u00a0\u00a0 [dbo].[member]
\nWHERE\u00a0\u00a0 [city] = ‘Minneapolis’ AND
\n[state_prov] = ‘TX’
\nOPTION (RECOMPILE);
\nGO<\/p><\/blockquote>\n
<\/a><\/p>\n
\nGO<\/p>\n
\nON [dbo].[member]([city],[state_prov]);
\nGO<\/p><\/blockquote>\n
<\/a><\/p>\n
\nON [dbo].[member] ([city],[state_prov])
\nINCLUDE ([lastname],[firstname])
\nGO<\/p><\/blockquote>\n
\nSELECT\u00a0 [lastname],
\n[firstname]
\nFROM\u00a0\u00a0\u00a0 [dbo].[member]
\nWITH (INDEX = [member_ident])
\nWHERE\u00a0\u00a0 [city] = ‘Minneapolis’ AND
\n[state_prov] = ‘MN’
\nOPTION (RECOMPILE);
\nGO<\/p>\n
<\/a><\/p>\n
\nSELECT\u00a0 [lastname],
\n[firstname]
\nFROM\u00a0\u00a0\u00a0 [dbo].[member]
\nWITH (INDEX = [member_city_state_prov])
\nWHERE\u00a0\u00a0 [city] = ‘Minneapolis’ AND
\n[state_prov] = ‘MN’
\nOPTION (RECOMPILE);
\nGO<\/p>\n