Horizontal Table and Index Partitioning, where different rows in the same logical table are split into different physical tables, is an essential data warehousing tool. It allows large tables to be split into multiple physical heaps or b-trees based on a key in the data. This often makes big tables more manageable because many operations take longer as a function of the size of the table. Splitting those tables can allow a database administrator to take the cost of an index rebuild or statistics update in pieces instead of all at once.
Partitioned tables have somewhat increased overhead because there’s all these extra physical structures to handle. They often also have different optimization characteristics. The optimizer can often determine that whole partitions are not needed for a particular query, and it then applies
logic to eliminate those partitions from the query plan entirely. This is conceptually similar to index matching, and it provides an order-of-magnitude performance improvement when it works.
Partition elimination relies on a fairly complex and expensive part of the optimizer that reasons about the valid domains for each column participating in a query. For example, if you have a query like:
SELECT * FROM A INNER JOIN B on A.a=B.b where A.a < 20 and B.b > 50;
If you stare at this while touching your nose, you may eventually figure out that no rows should be returned from such a query because logically the conditions introduce a contradiction. In this case, the query will be simplified during optimization and you’ll see a plan like this:
I will cover constant tables in a future “operator of the day” post, but you could imagine that not much will be returned from this query once this simplification has occurred.
Partition elimination, at least in SQL Server 2005, is conceptually similar. Each access method (heap, clustered index, secondary index) that contains a partitioning is tracked and considered against the valid ranges for that query. If it is determined that one or more partitions is not needed, these are pruned from the final plan and your query just got a lot faster.
It would be nice if that were the end of it, but it gets far more complicated. Let’s say that you write the query like this:
SELECT * FROM MyTable WHERE ptncol=?
Well, I can reason that the query will touch one and only one partition in this case. Dynamic partition pruning deals with a number of cases like this, and boy was that fun to implement.
Once you get your head wrapped around dynamic partition pruning, you then start to worry about a range of other nasty, nasty problems that are only somewhat visible via the output plan. The SQL 2005 query tree representation for partitioned tables unfortunately is not orthogonal with how parallelism is applied to the query tree, so there are cases when you can’t get parallel plans against partitioned tables in a plan. Additionally, the index matching and access path selection code also has to deal with backwards compatiblity when selecting the access path, and there are lots of details about costing and index selection that are very core to compilation time performance and plan quality. Finally, there are some operators in the query tree that can’t/don’t flow information about the algebraic logic in enough detail to prune partitions.
While some of the details of this are beyond what is visible in the output plans (and therefore I won’t be talking about it), I hope you can at least accept that it’s a complex problem to make it all work perfectly. SQL 2005 did get a lot of the pieces working, and I think everyone would say that it’s a big, big step forward compared to (distributed) partitioned views from SQL 2000.
SQL 2008 contains some additional changes to partitioning to continue to integrate it tightly into the rest of the query optimizer, and I’ll do a post on the external query representation differences between SQL 2005 and SQL 2008. I’ll also be doing a post or two on some specific pruning issues that real customers have seen in SQL 2005.
Bedtime for me – happy querying!