Explicitly naming CL key columns in NC indexes – when and why

This morning, as I was prepping for my PASS Data Architecture Virtual Chapter presentation on “Why Physical Database Design Matters,” I went to tweet that folks could join the presentation. I stumbled on a discussion that led to Robert Davis (@SQLSoldier) asking this: I was searching to see if you had a blog post on missing indexes recommending CI keys as included columns in NCIs.

And, there was no way I’d be able to do that in 140 characters… so, I thought a blog post would be easier!

First off, you do NOT need to explicitly state clustering key columns in your nonclustered index definition; those that aren’t already part of the index – ARE ADDED automatically by SQL Server. So, it seems weird that the missing index DMVs always add the clustering key to the recommended nonclustered indexes – even if it’s not required?

SIDE NOTE: The “green hint” / missing index recommendation shown in showplan does this as well (since it uses the missing index DMVs behind the scenes).

But, let’s go back to why it’s in there to begin with..

Why does SQL Server NEED a lookup value?

SQL Server uses the clustering key (if the table has a rowstore clustered index defined) as the way to “lookup” a corresponding row when accessing the data from a nonclustered index. The main reason to do this is that the nonclustered index does not have all of the information the query needs so SQL Server has to look up the rest of the data by accessing the data row.

SIDE NOTE: If the table is a HEAP (a table without a clustered index) then SQL Server uses a “fixed RID” which consists of an 8 byte row-ID (2 bytes for the fileID, 4 bytes for the pageID, and 2 bytes for the slot number). There are pros/cons to heaps vs. clustered tables but that’s not the main point of this post; I’m solely going to focus on what happens when a table has a clustered index.

If you think of an index in the back of a book on animals, you have duplicate data (like a “common name”) that’s ordered in the index. We use that sorted order to find the animal in which we’re interested and then the index tells us on what page we can find the rest of the information about that animal.

Nonclustered indexes in SQL Server work similarly. However, there isn’t a person “scanning” the sorted data (as we would a regular index). Instead, SQL Server has TWO structures to an index. One structure (the leaf level) is VERY much like an index in the back of a book (in this case, the common name + a lookup value to get to the rest of the data). The other part of an index structure is called the b-tree; this structure (0 to n levels) is solely used for efficient navigation to the leaf level. (If you’re wondering when an index tree would have 0 levels – only when the data of the entire table is only 1 page. SQL Server will not add a navigational level to a table that’s only one page. Yes, rare. But, yes, I expected someone would have asked! And, again, I’m somewhat simplifying things but I want to get to the main point of the question… :))

So, now we know WHY SQL Server needs the clustering key columns added to the index. And, we know that SQL Server adds them when they’re not explicitly defined… but, is it a problem and why do they do it?

Is it a problem to EXPLICITLY define clustered key columns in a nonclustered index?

No. But, I personally don’t recommend it. If the query doesn’t need it nor is the query going to use it (and, isn’t that the point of a missing index recommendation), then there’s no reason to add it. I never recommend adding anything that isn’t explicitly needed. I can only speculate on why they do it AND it does have one benefit.

First, no, it’s completely unnecessary. But, it does NOT cause any problems (SQL Server won’t add clustering key columns to the index more than once; they’re either explicitly added or internally added – but, they’re never in the index twice).

And, it does provide ONE benefit in that columns that are explicitly defined show up in sp_helpindex and in the management studio tools. So, this helps people to better see what’s actually in their indexes (but, this is kind of a weird way around the inadequacies of both the tools as well as sp_helpindex). So, while there is a benefit; it’s only making up for shortcomings in the tools.

SIDE NOTE: Whether or not you believe me is also part of the problem because none of the standard utilities / sps, etc. expose that SQL Server is doing this. But, you can check table size with sp_spaceused and you should see the size is the same whether the clustering key column is explicitly added or not. And, there are [3rd party] ways to expose that these columns ARE part of the index (I’ve written about my sp_helpindex rewrites before and I know there are a bunch of other tools out there that show it as well! I’m a bit overdue for yet-another sp_helpindex rewrite / update but I do have one planned. So, I hope to get to that within the next few days. Stay tuned. But, the 2012 version still works in 2014/2016 for clustered/nonclustered indexes. I have a new version that works with ALL index types but it still needs a bit more testing – email me if you have columnstore, in-memory indexes, full-text, spatial, or hypothetical indexes and want to do some tests for me!)

When should you explicitly define clustered key columns in a nonclustered index?

This is the real reason I wrote this post was to get to this point. And, it’s something I hadn’t thought about until I ran into it personally…

So, when SHOULD you explictly define the clustering key columns in a nonclustered index? When they ARE needed by the query.

This sounds rather simple but if the column is USED by the query then the index MUST have the column explicitly defined. Yes, I realize that SQL Server will add it… so it’s not necessary NOW but what if things change? (this is the main point!)

What could go wrong?

If you don’t explicitly add clustering key columns in a nonclustered index and the queries actually rely on that clustering key column to be there (for better performance)… then your queries will not be able to use the nonclustered index if the clustered index changes.

What if you want to change your clustering key? What if you later want to create a clustered columnstore index. In either case, all of your nonclustered indexes will no longer include those [former] clustering key columns and the queries that relied on their being present, won’t perform.

So, the simple answer is – always define the index explicitly with the columns needed (in the key, in INCLUDE, wherever they are REQUIRED). If a column is not needed by that query then do not explicitly name it (it might happen behind the scenes but it won’t add extra data where it’s not needed if the clustering key were to change at a later time).

Using this strategy when you’re doing query tuning will make your indexes more readable, more effective / scalable and should work even if there are later changes to your schema!

SIDE NOTE: Indexing is a very difficult topic to discuss in only a blog post and “query tuning” is not the same as “server tuning.” Sometimes the index that’s best for a query is not ideal to add to your server… but, once you get to the point where you want to create an index on your server – explicitly state only the columns absolutely necessary. Don’t just add things “because SQL Server would have done it anyway.”

Hope this helps! And, thanks for reading!!
Kimberly

 

11 thoughts on “Explicitly naming CL key columns in NC indexes – when and why

  1. I have several tables that have clustered composite primary keys (e.g., col1, col2, col3). Usually Int, SmallInt, and Datetime. Many of the queries are for col1 = intvalue, col3 between dateval1 and dateval2.

    Many other queries are strictly for col3 between dateval1 and dateval2 or similar. So I have a NCI on col3.

    It seems to me that the NCI for query 2 would have to repeat the datetime column. First as the actual index key then again in the clustered key to locate lookup records.

    Most of these tables are very large (>100M rows, 10GB) so I don’t want a lot of indexes or wasted space.

    BTW, col2 is an irritant. It is required for uniqueness but seldom used in where. Mostly used in Join conditions.

    Thanks for the interesting article.

    1. I’m not entirely sure of what your question is here but I have a few suggestions for you to think about.

      First, in reply to “It seems to me that the NCI for query 2 would have to repeat the datetime column”

      Repeat it in the index ONCE – yes. But, not twice.

      If the CL index is c1, c2, c3 and you then create a non-unique nonclustered index on c2 then the structure for c2 would be as follows:
      LEAF Level: c2, c1, c3
      BTREE: c2, c1, c3

      Having said that, you have two range-based queries AND one of them is a date. Depending on the fragmentation created by leading with c1 AND the frequency/importance of the queries that are c1=intvalue AND c3 between dateval1 and dateval2, I’d suggest a few other options to evaluate. But, there are MANY factors to decide which is actually best…

      One option – keep what you have (let me see if I can recreate it here):
      Unique CL index on c1, c2, c3
      NC1: Non-unique nonclustered on c3
      Leaf level: c3, c1, c2
      NC2: Non-uniqye nonclustered on c2
      Leaf level: c2, c1, c3

      There are really a couple of CL index options here and MULTIPLE things on which I would determine which is best.
      Option 1 – your current option
      Option 2 (to evaluate)
      Unique CL index on c3, c1, c2
      NC1: Non-unique nonclustered on c1 (or, would this be unique?)
      Leaf level: c3, c1, c2
      NC2: Non-uniqye nonclustered on c2
      Leaf level: c2, c1, c3

      The reason I *might* consider the 2nd option would be:
      * the table might be less fragmented (which will help range query performance + disk space + cache + logging)
      * you might be able to get decent query performance out of both the range queries BUT it would depend more on whether or not the table had a bunch of additional columns. If the table DOES have a lot of additional columns then option 2 might be better IF the queries that supply c1 = intvalue are selective enough to actually use the new NC index in this 2nd option.

      However, if your table does NOT have a lot of additional columns (or, your queries ONLY care about c1, c2, and c3 – meaning that these indexes COVER ALL of your queries), then both option 1 and option 2 will be about the same for query performance but option 2 might be better for fragmentation.

      Really, there are just too many variables here for me to give an answer that’s 100% certain but from this info, your scenario does have other options IF you’re trying to tune it.

      Having said all of that, if you’re only wondering if c2 is duplicated in the NC index TWICE (re: first as the actual index key then again in the clustered key to locate lookup records) the answer here is DEFINITELY NOT!

      Hope that helps!
      Kimberly

    1. Actually, I did mention that a non-unique, nonclustered index would include the lookup value all the way up the tree. And, you can read more about this here: “Nonclustered indexes require the “lookup” key in the b-tree when?” where I explain this in greater depth as well as explain why.

      I think what might help is to understand that I have separated the leaf-level and the b-tree in my discussions because this is what changes/differs based on whether or not the nonclustered index has been defined as unique. But, for the part about the internals, yes, Kalen and I are saying the same thing. My post was actually about WHEN you MUST add these columns. The internals were kind of secondary but I’m just explaining them differently (with the same result ;-) ).

      Just to reiterate…
      If you have a unique CL index on c2, c3

      Then, you create a UNIQUE, nonclustered index (NC1) on c4 the structure of the nonclustered index would be:
      – in the leaf level: c4, c2, c3
      – in the b-tree: c4 ONLY (c2, c3 are NOT needed because c4 is already unique)

      If you were to create a NON-unique, nonclustered index (NC2) on c4 then the structure of this nonclustered would be:
      – in the leaf level: c4, c2, c3
      – in the b-tree: c4, c2, c3

      To be honest, I’m not sure if there is a question here? Let me know if this doesn’t make sense!

      Cheers,
      k

  2. Hi Kalen,

    I have to disagree on this one. For readability, I think it’s best to have them included, as otherwise, it’s not obvious to someone looking at the index definition why it covers the query. I try to avoid anything obscure like that.

    And more concerning, if the clustering key for the table is changed, your covering indexes then stop working because you didn’t include the clustering key column and just assumed it’s there. And the person changing it won’t realise it’s an issue because it’s not written into the index and it’s just assumed to be there.

    Just my 2c

    Hope you’re well

    Regards,

    Greg

    1. Hey there Greg – Actually, your comment doesn’t entirely make sense to me. If the column in NEEDED to cover the query then it should be EXPLICITLY part of the index definition. If it’s just in the index because it’s part of the clustering key then it should NOT be explicitly added to the index.

      What you might be saying though is that OTHER queries could *accidentally* benefit from having that column in the index (well, that is true) but that’s not something I’m doing when I’m *query* tuning. When I do *query* tuning, I look for the absolute minimum that makes that QUERY perform (which might be through covering). Having said that though, that still might NOT be what I end up creating on the server. If I end up doing index consolidation / SERVER tuning, then I might come up with a slightly different index – one where other columns are needed/added to the index so that it covers additional queries. If those columns become significant then I WOULD explicitly put them in the index.

      To be honest, we’re saying the same thing, if the column is significant for covering then it SHOULD be in the index.

      Cheers,
      k

  3. Oh no, just realised I wrote “Kalen” not “Kimberley” :-) I think I followed a link from something Kalen wrote, to this blog post.

    Hope you’re well Kimberley (and that Paul heals quickly)

    Regards,

    Greg

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

SQLskills SQL101: Partitioning

Continuing on our path to understanding the basics and core concepts, there’s a big topic that’s often greatly misunderstood and that’s partitioning. I know I’m

Explore

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.