Jack Li, one of the Senior Escalation Engineers in Product Support, just posted details of an interesting case over on the CSS blog – his article is here. It talks about index builds and rebuilds, but the issue is the same for both, so I'll just talk about rebuilds.
The gist of the problem is that index rebuilds can parallelize, but sometimes they don't parallelize very well. Each thread gets a certain range of the index to rebuild, with the existing index statistics used to divide the key ranges roughly equally between the threads. If there's massive data skew, then one thread can end up doing the majority of the work, leading to a long run-time. The case in Jack's post involved a 250 million row index where 150 million rows had the same (NULL) key value. That range has to be processed by a single thread – a single key value can't be divided between two or more threads.
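If you want to check whether a large index has this kind of skew before rebuilding it, one way is to look at the histogram of the statistics behind the index – a single histogram step with an enormous row count (like the NULL step in Jack's case) is the warning sign. A quick sketch, with placeholder table and index names:

-- Show the histogram for the statistics behind the index.
-- Look for one step (e.g. the NULL step) with a huge EQ_ROWS value
-- compared to all the other steps.
DBCC SHOW_STATISTICS ('dbo.BigTable', 'IX_BigTable_SkewedKey') WITH HISTOGRAM;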
Now, this is understandable behavior by the database engine, but it relies on the statistics being up-to-date. That's a bit of a catch-22 – rebuilding an index updates the statistics, but if the statistics aren't up-to-date then the index rebuild might parallelize badly! I guess the solution is that if you know you have massive data skew in your large indexes, update statistics BEFORE doing an index rebuild. And given what I've been hearing this week at SQL Connections about how badly statistics keep biting people, I'm leaning towards a different recommendation for those people who have lots of perf trouble caused by statistics and the potential for skewed data – update all your statistics regularly, and only rebuild/reorganize indexes that are actually fragmented. Statistics just seem to cause so many problems.
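If you go the update-statistics-first route, the pattern looks something like the following (again, the names are placeholders, and the FULLSCAN and MAXDOP choices are just examples – pick what fits your system):

-- Bring the statistics behind the index up to date first, so the rebuild
-- can divide the key ranges between the threads more evenly.
UPDATE STATISTICS dbo.BigTable IX_BigTable_SkewedKey WITH FULLSCAN;

-- Then rebuild the index (optionally capping the degree of parallelism).
ALTER INDEX IX_BigTable_SkewedKey ON dbo.BigTable
    REBUILD WITH (MAXDOP = 8);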
Thanks
PS Kimberly has a lot more info about statistics over on her blog – I'm just starting to venture into that minefield