I don’t know about you, but groupby is one of my favorite operators. There are a TON of interesting optimizations that a QP considers when you start adding group by into queries, especially when you have joins and then a group by. TPC-H benchmark wars among the large database vendors are won and lost on many such optimizations.
So, if you are doing relational OLAP (ROLAP) or are otherwise running group by queries over lots of well-normalized data, then I suggest you brush up a little on your knowledge of group by – it will help you understand when queries are behaving and when they are not.
Here’s the paper I tell people to read on the subject. It’s written for the person implementing a database, but anyone who can read query plans should get the basics from the paper. It is the basis for most of the hard-core optimizations (and tricky problems) that face all query processors today.
The basic idea is caused by understanding that an aggregate function can be split into multiple operations and done in parts. Some parts can be done earlier in a query, saving a lot of work. If these results can be combined later, you might be able to speed up a query by computing these partial results early and then combining them at the end of the query. The usual savings is that you don’t have to materialize the results of a join when you only care about the aggregate over some piece of it.
Most of the core aggregate functions defined in SQL can be decomposed. If you have a set of rows {x} := concat({y}, {z}), then
SUM({x}) == SUM({y}) + SUM({z}).
COUNT == COUNT + COUNT
MAX == MAX (MAX(), MAX())
…
Not all aggregate operations can be decomposed in this manner, but many of them can.
If you take this concept and then apply it a query with some joins:
select sum(col1) from a join b join c
Then the idea of local-global aggregation is that you can do part of the sum before joins and pass up the SUM for each group instead of all the rows from that group.
This idea becomes more powerful when you start throwing more complex operations into a query processor, such as partitioning or parallelism. Often, you want to perform partial aggregations on these groups to minimize the amount of data you have to send between threads or perhaps between nodes on a NUMA machine. All of this is the fun stuff that makes databases interesting – you can
Not every aggregate can be pushed below every join – there are rules about what can and can’t be done and maintain the same results from a query. For example, you may need to consider whether the aggregate function can handle seeing additional NULL values and return the same result or not.
If you go look at the SQL CLR user-defined aggregate definition, you’ll see the exposed pieces of some of this capability in the SQL Server query optimizer. I won’t spoil all of your fun, but go take a look.
Happy Querying!
Conor