Well, I've promised to blog more and I'm really going to try to do so. This morning I got the perfect question/comment (in email) to respond to and after working through a response that was taking me upwards of 3 hours (you'll learn later why I have 3 "spare" hours :)… I figured that it was time to turn the response into a blog post. ;)
Background: The Clustered Index Debate
In the years since the storage engine was re-architected (SQL Server 7.0+) there's been constant debate on how to appropriately choose the clustered index for your tables. I've generally recommended an ever-increasing key as the clustered index and many find that counterintuitive. The primary reason people feel it's counterintuitive is that it creates a hotspot of activity. [If "hotspot" is not a familiar term – a hotspot is simply an active place within your table.] Hotspots were something we tried hard to avoid PRIOR to SQL Server 7.0 because of page-level locking (and this is where the term hotspot became a negative term). In fact, it doesn't have to be a negative term. Since the storage engine was rearchitected/redesigned (in SQL Server 7.0) and now includes true row-level locking, this motivation (to avoid hotspots) is no longer there. In fact (and probably even more counterintuitive), the opposite is true. Hotspots (specifically hot PAGES, not hot ROWS) can be very beneficial because they minimize the number of pages needed in cache, improve the likelihood of the required page already being in cache and, in general, minimize the overall amount of cache required. So, this is why many of us have changed our recommendation on where to create the clustering key in 7.0+. Instead of focusing on range queries, we now focus on placing the clustering key on an ever-increasing key. In earlier releases, focusing on range queries for the clustered index reduced hotspots for insert/update – and that, in fact, was the PRIMARY motivation to choose them, NOT range query performance! But there are even MORE reasons to choose an ever-increasing key, and they are based on internals as well – specifically, on the significant changes made to the storage engine for 7.0+. For a quick start on these, I went through them in the blog entry here.
And, today's email is not uncommon. This is the basis for the title: the clustered index debate. In general, there are still a lot of questions related to creating clustered indexes to improve "range query" performance. Don't get me wrong, there's definitely a performance benefit for some range queries, but the first thing to remember is that you get only one clustered index per table (therefore only one type of range query can benefit). In the real world, it's not likely that you want to see your data exactly the same way all the time. Therefore it's very challenging to come up with the "right" clustered index if you're using range queries as your strategy. Even worse, the effect of choosing the clustering key to improve range queries causes problems for modifications against that table (INSERTs/DELETEs/UPDATEs). So… this is what started my day today. A great email from a reader that brought up these points. The question/comment (modified to hit only the highlights and to protect their identity :) was this:
The most important characteristic for a Clustered Index key is to satisfy range queries. More often than not, if a sufficient range of data will be scanned, the Optimizer will choose the Clustered Index over all others due to the excessive cost of Bookmark Lookup operations. As such, the table KEY is a more suitable clustered index candidate than any surrogate (few ever query a database by range of surrogate keys). [kt note: this second sentence is not entirely true… SQL Server will certainly choose a clustered index scan over non-clustered indexes that require bookmark lookups, but there are A LOT of algorithms that SQL Server can use instead of either of these, and my examples later show this… non-clustered covering seekable indexes, non-clustered scannable indexes, index intersection, etc.]
Now, when the default behavior for SQL Server was designed such that the PRIMARY KEY was chosen as the default clustered index, it was exactly for this reason. It is the business key. It would satisfy uniqueness (by definition of a logical KEY). And, it is well suited for a wide variety of range scans. However, this is when the PRIMARY KEY is defined on the Business Key of the data.

But, when you introduce the usage of surrogate keys (i.e., IDENTITY) as a physical implementation, and thus transfer the PRIMARY KEY definition to it, two things must be considered. First, the Business Key this IDENTITY will be a proxy for must still exist, as it is still a part of the logical design. As part of the physical design, the logical key needs to be implemented as a physical constraint to maintain logical uniqueness. Second, just because a proxy has been defined does not make it a natural candidate for the clustered index. The business key still maintains this distinction.

What is often cited as the "reason" for IDENTITY PRIMARY KEY clustered index definitions is its monotonic nature, thus minimizing page splits. However, I argue that this is the only "reason" for defining the clustered index as such, and is the poorest reason in the list. Page splits are managed by proper FILLFACTOR, not by ever-increasing INSERTs. Range scans are the most important "reason" when evaluating clustered index key definitions, and IDENTITY columns do not solve this problem. Moreover, although clustering the IDENTITY surrogate key will minimize page splits and logical fragmentation due to its monotonic nature, it will not reduce EXTENT FRAGMENTATION, which can cause query performance problems just as severe as page splitting.
In short, the argument runs shallow.
Luckily, this email arrived with perfect timing for me as I'm sitting in a "bootcamp" event on Always On technologies and I'm not speaking this morning (my colleague Bob Beauchemin is doing lectures on Scale Out technologies: Scalable Shared Databases, Service Broker, DPVs, etc.). Anyway, in addition to listening to Bob, I've decided to continue the blog series on "the clustered index debate". The first and most important point to stress is that minimizing page splits is NOT the only reason, nor is it the most important one. In fact, the most important factors in choosing a clustered index are that it's unique, narrow and static (ever-increasing has benefits beyond minimizing splits).
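To make "unique, narrow, static and ever-increasing" concrete, here's a minimal sketch of what I mean. The table and column names here are purely hypothetical (not from any reader's schema); the point is the shape of the design:

```sql
-- Hypothetical Sales table: the surrogate key is unique, narrow (4 bytes),
-- static (never updated) and ever-increasing - a good clustering key.
CREATE TABLE dbo.Sales
(
    SalesID   int IDENTITY(1,1) NOT NULL,
    OrderNo   varchar(20)       NOT NULL,   -- the business key
    SalesDate datetime          NOT NULL,
    Amount    money             NOT NULL,
    CONSTRAINT SalesPK PRIMARY KEY CLUSTERED (SalesID),
    -- the logical (business) key is still enforced - just not as the clustering key
    CONSTRAINT SalesOrderNoUQ UNIQUE NONCLUSTERED (OrderNo)
)
```

Note that the business key doesn't disappear just because the clustering key moved to the surrogate – it's still enforced, just through a nonclustered unique constraint.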
The Clustered Index Debate Continued
First, there are many angles from which to look at "the clustered index debate", and it's not until all of the issues are reviewed that this strategy (a monotonically increasing key) becomes obvious. So, I think it will probably take a couple of blog posts to really prove this. I'll start up this debate again here… When you look at a general-purpose table (which is most) where the table sees ALL DML (SELECT/INSERT/DELETE/UPDATE), then you are best off with an ever-increasing key (again, you have to look at the overall impact of all operations against the table, not just SELECTs, because INSERTs/DELETEs/UPDATEs will also impact SELECTs in the long term). So, I'll break this down into each DML operation here. If you don't look at the overall impact, then large tables can end up having a tremendous number of problems once they're put into production. I've certainly heard this concern/debate before (and most people are skeptical at first glance), but when you look at the situation overall, you'll find that "finding the right balance" includes not just looking at range queries. In fact, here's a quick list of the things/tests/numbers/scenarios that help to prove my strategy:
- Inserts are faster in a clustered table (but only in the "right" clustered table) than in a heap. The primary problem here is that lookups in the IAM/PFS pages to determine the insert location in a heap are slower than in a clustered table (where the insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (clustered) and where that order is ever-increasing. I have some simple numbers but I'm thinking about creating a much larger/more complex scenario and publishing those; simple/quick tests on a laptop are not always as "exciting". But this is a well-documented issue (IAM/PFS lookups), and poor performance on a heap is also referenced in this KB: PRB: Poor Performance on a Heap. Note: this KB is quite dated and I don't actually agree with everything in it; however, the general concern of poor insert performance on a heap is still true on SQL Server 2005.
- Updates are often faster in a clustered table (when the row needs to be relocated) and for the same reason (IAM/PFS lookups), BUT there are many types of updates and not all updates cause records to be relocated. Here are a few things to think about with regard to updates:
- Updates that are completely in-place (some examples are updates to a fixed-width column, or to variable-width columns where the row size doesn't change, etc.). These types of updates aren't really affected either way.
- Updates that cause record relocation (where the row size changes) definitely benefit from a clustering key, because the record relocation (which will be handled by a split) is defined by the clustering key.
- Updates to the clustering key itself are the WORST case, which is one of the key reasons for choosing a clustering key that is static (so we have to keep this in mind when we choose a clustering key).
- Deletes aren't nearly as big of a concern, BUT deletes in heaps create more gaps, and more gaps create more work in PFS/IAM lookups. While this helps to reduce wasted space, it still requires the time to find the space – hence the slowed performance of inserts/updates. I've also written some blog entries that cover very interesting test cases for large-scale deletes and why you'd want to consider partitioning to optimize for the "sliding window" scenario; see this blog entry: MSDN Webcast Q&A: Index Defrag Best Practices – Fragmentation, Deletes and the "Sliding Window" Scenario (it's the LAST one!).
- Selects… now this is the hardest one to go through in just a couple of bullets (ah, I guess this will lead to another one or two posts :), BUT I'll start by saying that the best way to tune the vast majority of range queries is through non-clustered [covering] indexes. It's also important for me to stress that I do NOT advocate covering every query (it's impossible to do). What's important to realize in terms of covering is that SQL Server 7.0 and up includes internal algorithms to improve performance when you don't have the "perfect" non-clustered covering seekable index, and these still give better performance than going to the base table (or performing bookmark lookups – as mentioned in the mail… and I completely agree that these [bookmark lookups] can be evil!). To start this discussion, I'll give one of my favorite examples: a large-scale aggregate. The absolute best way to improve its performance is through an indexed view, but the data can be gathered through many other algorithms – ideally through a non-clustered covering index that is in order by the GROUP BY and that includes the column(s) being aggregated. For example, take this query:
SELECT c.member_no AS MemberNo,
       SUM(c.charge_amt) AS TotalSales
FROM dbo.charge AS c
GROUP BY c.member_no
On a charge table of 1.6 million rows here are the performance numbers to handle this aggregation:
- Clustered table scan (CL PK on Charge_no) with a hash aggregate = 2.813 seconds
- Index scan (non-clustered covering but NOT in order of the group by) with a hash aggregate = 1.436 seconds
- Index scan (non-clustered covering in order of the group by) with a hash aggregate = 0.966 seconds
- Indexed view = 0.406 seconds
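For reference, the non-clustered covering index and the indexed view behind timings like these could be created roughly as follows. The index and view names are mine (hypothetical), and the indexed view assumes charge_amt is defined NOT NULL (an indexed view can't SUM a nullable expression):

```sql
-- Covering index in the order of the GROUP BY: member_no leads the key,
-- charge_amt rides along so the index covers the query.
CREATE INDEX ChargeMemberNoChargeAmt
    ON dbo.charge (member_no, charge_amt)
GO

-- Indexed view: the aggregate is pre-computed and materialized
-- by the unique clustered index on the view.
CREATE VIEW dbo.ChargeTotals
WITH SCHEMABINDING
AS
SELECT  member_no,
        SUM(charge_amt) AS TotalSales,
        COUNT_BIG(*)    AS RowCnt      -- required in an indexed view with GROUP BY
FROM    dbo.charge
GROUP BY member_no
GO
CREATE UNIQUE CLUSTERED INDEX ChargeTotalsCL
    ON dbo.ChargeTotals (member_no)
GO
```

(In SQL Server 2005 you could instead put charge_amt in an INCLUDE clause to keep it out of the index key; in 2000, both columns go in the key.)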
Now this was a pretty small table (narrow rows and only 1.6 million rows) AND I didn't have any concurrent activity; concurrent activity would have made the hash aggregates even slower. Regardless, it proves the point (at least generally). Now, if I wanted to improve this range query then I'd have to cluster on the member_no column (and this is an ideal example because I often hear people say that clustering on a foreign key column helps to improve range/join queries – which can be true as well)… But this strategy has a few problems in addition to a few benefits (and we have to look at everything to be sure of our choice/decision). First, member_no is not unique (in the charge table), so SQL Server has to "uniquify" the rows. The process of "uniquification" costs both time (on insert) and space (the duplicate rows will be wider to store each row's uniquifier). Also, theoretically the value could change (in this case that's not true). Anyway, the time the query takes with this clustered index is 2.406 seconds, which is better than clustered on the PK (of course), but if I were to also start modifying the rows (which creates splits) or even just insert 15% more rows… then my table would become fragmented. At that point, the query performance gets worse in the table clustered by member_no, and it gets even worse in the table clustered by charge_no (because of the worktable created in tempdb by the hash aggregate), BUT it won't be all that much worse in the non-clustered index examples (especially the covering index that's in the order of the GROUP BY – because this doesn't require a worktable)…
- CL on member_no = 4.906 seconds
- CL on charge_no = 6.173 seconds
- Index scan (non-clustered covering but NOT in order of the group by) with a hash aggregate = 3.906 seconds
- Index scan (non-clustered covering in order of the group by) with a hash aggregate = 1.250 seconds
- Indexed view = 0.516 seconds
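If you do want to experiment with clustering on the non-unique member_no, the syntax is just a non-unique clustered index – SQL Server adds the 4-byte uniquifier to duplicates behind the scenes – and you can then watch fragmentation build as rows are modified. A hedged sketch (the index name is mine, and it assumes the existing clustered PRIMARY KEY on charge_no has first been dropped or recreated as non-clustered):

```sql
-- Non-unique clustered index: duplicate member_no values get an internal
-- 4-byte uniquifier, widening those rows (and every non-clustered index row
-- that references them).
CREATE CLUSTERED INDEX ChargeMemberNoCL
    ON dbo.charge (member_no)
GO

-- Check fragmentation after running your modifications/inserts:
DBCC SHOWCONTIG ('dbo.charge') WITH ALL_INDEXES
GO
```

(On SQL Server 2005 you could also query sys.dm_db_index_physical_stats instead of DBCC SHOWCONTIG.)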
This is a great start to furthering the clustered index debate, but I do have to admit that it's a counterintuitive and difficult issue to tackle, because isolated tests often lead you to different conclusions. In this case though, the non-clustered indexes are better for this range query and the indexed view is the best (but I wouldn't consider the indexed view unless this were more of a read-focused database rather than read/write). [and – of course, that statement warrants yet another blog post :)]
So, depending on the tests that you do – especially if you focus only on selects and you don't have modifications (i.e. fragmentation) – the results will make "creating the clustered index for range queries" appear to be best. Again, I'm not just saying this to prevent fragmentation; I'm saying this because I wouldn't use the clustered index OR a non-clustered index with bookmark lookups to handle this query. I'd consider a non-clustered covering index that's seekable, OR even a non-clustered covering index that's scannable, before I'd choose the clustered index (and that's what the optimizer would prefer as well). In the end, it's really a bit of an art and a science to "find the right balance" of indexing.
Oh – and if you arbitrarily add a column to use for clustering (maybe not as the primary key), that can help, but many would prefer to use actual data… which means [potentially] creating your primary key with a new identity [or similar] column, and this can impact your business logic (absolutely). I'm sure that some tests can show that range queries are faster, and it's absolutely correct that business application/usage can be a concern, but when you look at the big picture (and the impact on INSERTs/DELETEs/UPDATEs), the benefits of the monotonically increasing key significantly outweigh these concerns. Simply put, a small/narrow key can help join performance and an ever-increasing key can also help lookups for rows! (yes, definitely more coming)
Happy Friday! Have a great weekend. I'll try to continue more threads on this debate shortly!