This post is a continuation of the SQL Server 2014 Cardinality Estimator enhancements exploration series:
- A first look at the query_optimizer_estimate_cardinality XE event
- “CSelCalcCombineFilters_ExponentialBackoff” Calculator
- “CSelCalcCombineFilters_ExponentialBackoff” Calculator– Part II
- The CSelCalcAscendingKeyFilter Calculator
- Cardinality Estimation Model Version
- Comparing Root-Level Skews in the new Cardinality Estimator
In the previous post, I looked at root-level cardinality estimates vs. actual row counts across a set of 28 queries). 9 plans had improved cardinality but two of them had significant changes. One of the plans for the new CE version had a root-level estimate of 11,199 rows vs. 1,468 actual rows. Compare that to the original estimate of 348 vs. 1,468.
The query in-question was as follows:
SELECT AVG(OrderQty) AS 'Average Quantity', NonDiscountSales = (OrderQty * UnitPrice) FROM Sales.SalesOrderDetail GROUP BY (OrderQty * UnitPrice) ORDER BY (OrderQty * UnitPrice) DESC;
The pre-SQL Server 2014 CTP2 query plan was as follows (via SQL Sentry Plan Explorer, showing estimated rows):
And here was the new version of the plan (using the new Cardinality Estimator):
You can see that at the leaf-level, the Cluster Index Scan estimates remain the same, as does the consecutive Compute Scalar. Where it changes is in the Hash Match (Aggregate) operator – with the new CE estimating 11,200 rows versus the old estimate of 348.
Out of curiosity, I changed the original query to group by just OrderQty and another query to just group by UnitPrice (in other words – was the estimate issue related to our OrderQty * UnitPrice expression, or were the skews introduces at the single-column level?) Here was the OrderQty grouping query and plan tree (ignoring the useless nature of averaging the column you’re already group by):
SELECT AVG(OrderQty) AS 'Average Quantity', OrderQty FROM Sales.SalesOrderDetail GROUP BY OrderQty ORDER BY OrderQty DESC;
And here was the UnitPrice query and plan tree:
SELECT AVG(OrderQty) AS 'Average Quantity', UnitPrice FROM Sales.SalesOrderDetail GROUP BY UnitPrice ORDER BY UnitPrice DESC;
No large skews for the prior two queries, compared to our original (OrderQty * UnitPrice) skew – shown below in plan tree form:
Aside from the standard methods of evaluating single-column statistics (do they exist, are they up to date, do I need to change the sampling) – I then wondered if some of the legacy methods for helping improve estimates would still apply? For example, would adding a calculated column help?
ALTER TABLE Sales.[SalesOrderDetail] ADD [NonDiscountSales] AS (OrderQty * UnitPrice);
Re-executing the original query, I saw the following plan tree (with auto-created stats to the rescue associated with the calculated column):
I wish I could look at the query_optimizer_estimate_cardinality results for the old version of the CE to compare the event output to the new CE. I know it isn’t possible, but it would have been more helpful to understand what changed regarding the Hash Match (Aggregate) estimate. As an aside, the exact math behind the estimates are hidden both in the new and old versions – so perhaps a comparison of calculators in this scenario would still be limited?
Regarding XE information, there were 12 query_optimizer_estimate_cardinality events raised for query with the larger estimate skew. The second event out of the 12 events raised had the following calculator output (and it is interesting to see the SingleColumnStat reference in this output followed by two column names – but I believe this just means both single-column stats were loaded):
<DistinctCountCalculator CalculatorName=”CDVCPlanLeaf” SingleColumnStat=”OrderQty,UnitPrice” />
And the stats collection was as follows:
<StatsCollection Name=”CStCollGroupBy” Id=”3″ Card=”11199.81“>
<StatsInfo DbId=”7″ ObjectId=”898102240″ StatsId=”5″ />
<StatsInfo DbId=”7″ ObjectId=”898102240″ StatsId=”6” />
The stats collection was interesting to me as it was our first appearance out of the 12 events of the GROUP BY estimate – 11,199.81 (show earlier in the plan tree rounded up). The loaded stats StatsId of 5 was for the UnitPrice column and the StatsId of 6 was for the OrderQty column.
On the subject of SingleColumnStat, I wondered if creating multi-column stats (another legacy method for improving CE) on UnitPrice and OrderQty would be helpful. I dropped the old calculated column and created the following multi-column stats instead:
CREATE STATISTICS [stats_SalesOrderDetail_OrderQty_UnitPrice] ON Sales.[SalesOrderDetail](UnitPrice, OrderQty);
This answer is, multi-column stats were indeed helpful:
We do see now a reference to MultiColumnStat in the CalculatorList for one of the query_optimizer_estimate_cardinality events:
<DistinctCountCalculator CalculatorName=”CDVCPlanLeaf” MultiColumnStat=”OrderQty,UnitPrice” />
And we see the new stats object was loaded instead of looking at the two individual stats objects:
<StatsCollection Name=”CStCollGroupBy” Id=”3″ Card=”1600.00″>
<StatsInfo DbId=”7″ ObjectId=”898102240″ StatsId=”10″ />
So we see that with skews introduced in the new CE – similar to previous versions and depending on the query and schema, multi-column stats and calculated columns can still be helpful in reducing the CE skew.
To be continued…