For a number of years, marketing from SQL Server’s competitors claimed that SQL Server couldn’t handle “Enterprise”-scale deployments. Over time, those claims became much quieter as Microsoft used SQL Server extensively to run its own business, published multi-terabyte reference cases, and delivered the hugely successful TerraServer project, where Jim Gray led a team building a huge database of satellite images on SQL Server.
This doesn’t mean that there aren’t challenges for all of the major vendors in adding features to help their products scale. Each of them has shortcomings somewhere in its product offerings, so please don’t interpret this as a “Microsoft SQL Server can’t do the job” post. On the contrary, I think there is little it can’t do, but I know where the limitations get hit and I have some expertise in working around the current limitations. Every data warehouse problem is a bit different, so the kinds of problems I’ll discuss here may or may not show up in any given deployment.
In my previous post, I discussed how subqueries are approached by the query optimizer (as joins), and how this can lead to problems when the cardinality or cost estimates are incorrect. Often when an estimate is wrong, “unrolling” the subquery by hand and materializing a portion of the query in a temporary table will correct the mistake. Splitting a larger query like this makes the operation more procedural and makes each piece smaller. It also gives the optimizer better cardinality estimates for the portion that has been unrolled, and those better estimates may help the rest of the query perform closer to your expectations (there’s a short T-SQL sketch of this technique after the list below). This all assumes that:
1. You have the skills and desire to unroll a subquery
2. You know how to unroll it to give the optimizer information to help avoid cardinality estimate errors
3. The query is known a priori (dynamically generated queries are harder to rewrite or hint)
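Here’s a minimal sketch of what that unrolling looks like. The table and column names are hypothetical, and where to split depends on where your estimates actually go wrong:

```sql
-- Original form: the subquery is folded into one big join tree, so a bad estimate
-- for the DimCustomer filter can poison the join order and join choices downstream.
SELECT f.OrderKey, f.SalesAmount
FROM dbo.FactSales AS f
WHERE f.CustomerKey IN
    (SELECT c.CustomerKey
     FROM dbo.DimCustomer AS c
     WHERE c.Region = N'Northwest' AND c.Segment = N'Retail');

-- "Unrolled" form: materialize the filtered customers first.  The second statement
-- then compiles against a temp table whose row count (and statistics) reflect what
-- the filter actually returned, rather than an estimate.
SELECT c.CustomerKey
INTO #filtered_customers
FROM dbo.DimCustomer AS c
WHERE c.Region = N'Northwest' AND c.Segment = N'Retail';

SELECT f.OrderKey, f.SalesAmount
FROM dbo.FactSales AS f
JOIN #filtered_customers AS fc
    ON fc.CustomerKey = f.CustomerKey;

DROP TABLE #filtered_customers;
```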
The best way to track down estimation errors is to find a window when you can run a problematic query with “set statistics profile on” enabled. While this does make the query run more slowly, it captures useful information to help identify the areas of a query where the estimated and actual cardinality are materially different. I’ve blogged about this before (on previous blogs, not this one), so I’ll put together a post linking to some useful background material on this area.
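For example (the query itself is just a hypothetical stand-in), you’d wrap the run like this and then compare the EstimateRows and Rows columns in the extra result set that comes back, looking for operators where they differ by orders of magnitude:

```sql
SET STATISTICS PROFILE ON;

-- the problematic query goes here
SELECT f.CustomerKey, SUM(f.SalesAmount) AS total_sales
FROM dbo.FactSales AS f
GROUP BY f.CustomerKey;

SET STATISTICS PROFILE OFF;
```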
Data warehouses are usually the largest of the large in terms of SQL Server deployments, and they usually have a dedicated DBA to keep them working. This is partially because any business problem worth hundreds of thousands of dollars in hardware is also probably worth the salary of a DBA to keep it running. However, the other side of this coin is that there aren’t enough data warehouses in the world to make them a truly commodity software purchase. As such, lots of little things can and do go wrong in data warehouse deployments, and they often require expertise to fix.
Let’s talk about some of the reasons why the query optimizer may make incorrect cardinality guesses on data warehouse tables, starting with some of the easier ones:
1. Missing statistics. I’ve seen a bunch of customers disable auto-create statistics because “someone told them to do so” or “I don’t trust those things”. Unless you are going to hand-tune every query, you should leave auto-create enabled (a short sketch of the checks for items 1–4 follows this list).
2. Out-of-date statistics. This is a bit trickier. Statistics are automatically updated only after a certain number of changes to the column, and sometimes that threshold hasn’t been reached even though the data has shifted. You can manually update statistics in those cases. For the customers who disable auto-update statistics (<2%, but this happens disproportionately often on the higher-end deployments), you need a scheduled job to update those stats so that the optimizer has a fighting chance.
3. A statistics sample that does not reflect the whole data set. There are a number of assumptions made during statistics creation that are more pragmatic than mathematically proven. To keep auto-stats cheap, only a small percentage of the pages is actually read to build statistics on non-indexed columns. If those pages don’t cover the whole data range well, queries over the missing ranges can get poor estimates.
4. Fundamental limits in the statistics object. Currently the histogram can contain at most 200 steps. If your table’s data has fewer than 200 interesting spikes in its distribution, you are probably fine; a large fact table, however, may need more. Unfortunately, this can be a problem even with fullscan statistics in SQL 2005 and, from what I have seen so far, in the CTPs of SQL 2008.
5. Model limits in the optimizer. Optimizers make assumptions so that they can compile a query in a reasonable amount of time. For example, an optimizer may assume that columns are, in general, statistically independent of each other for the purposes of cardinality estimation, because that is the only computationally feasible assumption to make. This assumption is very often not true, and it can introduce errors in cardinality estimation that lead to poor plan selection (the second sketch after this list walks through the arithmetic). As databases get larger, these “out of model” errors can get larger and require more hand-holding to make sure that a given query gets a good plan from the optimizer.
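For items 1–4, the basic checks are easy to script. A rough sketch, with hypothetical database, table, and index names:

```sql
-- Item 1: make sure the optimizer is allowed to create (and refresh) statistics.
ALTER DATABASE SalesDW SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE SalesDW SET AUTO_UPDATE_STATISTICS ON;

-- Item 2: refresh statistics on a table yourself.  FULLSCAN reads every row, which
-- is expensive on a big fact table but also removes sampling error (item 3).
UPDATE STATISTICS dbo.FactSales WITH FULLSCAN;

-- Items 2-4: inspect a statistics object.  The header shows when it was last
-- updated, how many rows were sampled, and how many histogram steps were built
-- (capped at 200); the histogram rows show which ranges those steps cover.
DBCC SHOW_STATISTICS ('dbo.FactSales', IX_FactSales_OrderDateKey);
```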
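And to make item 5 concrete, here is a made-up example of the independence assumption at work; the table, predicates, and selectivities are all hypothetical:

```sql
-- Imagine 1,000,000 rows where Make and Model are strongly correlated
-- (every 'Corolla' is a 'Toyota').
SELECT COUNT(*)
FROM dbo.FactVehicleSales
WHERE Make  = N'Toyota'     -- selectivity 0.10 on its own (100,000 rows)
  AND Model = N'Corolla';   -- selectivity 0.02 on its own  (20,000 rows)

-- Treating the columns as independent, the optimizer estimates the combined
-- selectivity as 0.10 * 0.02 = 0.002, or about 2,000 rows.  Because every Corolla
-- is a Toyota, the real answer is closer to 20,000 rows -- a 10x underestimate
-- that can push the plan toward join orders and join types that fall over at the
-- actual row count.
```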
I hope this gives some insight into how cardinality estimation impacts join optimization and thus subquery optimization in SQL Server 200x.
Thanks,
Conor