Ever-increasing clustering key – the Clustered Index Debate……….again!

As I'm preparing for my Tech*Ed session on Indexing Best Practices in SQL Server 2005, I'm reminded that there are a ton of best practices that really apply to both SQL Server 2000 as well as SQL Server 2005. When it comes to indexing, there are many dependencies on the storage structures. These dependencies are the basis for why I recommend a very specific type of clustering key – for all versions of SQL Server, 7.0 and higher!

I'm going to start with my recommendation for the Clustering Key – for a couple of reasons. First, it's an easy decision to make and second, making this decision early helps to proactively prevent some types of fragmentation. If you can prevent certain types of base-table fragmentation then you can minimize some maintenance activities (some of which, in SQL Server 2000 AND less of which, in SQL Server 2005) require that your table be offline. OK, I'll get to the rebuild stuff later…..

Let's start with the key things that I look for in a clustering key:

  • Unique
  • Narrow
  • Static

Why Unique?
A clustering key should be unique because a clustering key (when one exists) is used as the lookup key from all non-clustered indexes. Take for example an index in the back of a book – if you need to find the data that an index entry points to – that entry (the index entry) must be unique otherwise, which index entry would be the one you're looking for? So, when you create the clustered index – it must be unique. But, SQL Server doesn't require that your clustering key is created on a unique column. You can create it on any column(s) you'd like. Internally, if the clustering key is not unique then SQL Server will “uniquify” it by adding a 4-byte integer to the data. So if the clustered index is created on something which is not unique then not only is there additional overhead at index creation, there's wasted disk space, additional costs on INSERTs and UPDATEs, and in SQL Server 2000, there's an added cost on a clustereD index rebuild (which because of the poor choice for the clustering key is now more likely).

Why Narrow?
A clustering key should be narrow for some of the same reasons it should be unique. If the clustering key is used as the lookup key from all non-clustered indexes, then the clustering key is duplicated in all non-clustered indexes. If the clustering key is really wide, then all of the non-clustered indexes will be [unnecessarily] wide. This will waste disk space, create additional costs on INSERTs and UPDATEs, and require more time (because of size) when rebuilding these index structures. So, what does narrow mean – as few bytes as possible to help uniquely define your rows. A narrow numeric when possible.

Why Static?
A clustering key should be static for some of the same reasons it should be unique and narrow. If the clustering key is used as the lookup key from all non-clustered indexes, then the clustering key is duplicated in all non-clustered indexes. In fact, for a given table the clustering key will be the most duplicated data. If this data changes then they'll need to update the value in the base table as well as in EVERY non-clustered index. And, if the key changes, it will cause the record to move. When a record moves, it creates fragmentation. This will waste disk space, create additional costs on INSERTs and UPDATEs, and require more time (because of record relocation and [the likely] subsequent splits) and require more maintenance. 

OK, so it sounds like I want a narrow, unique and static value… What about a guid?
Typically, I recommend a numeric IDENTITY column as the clustering key but I always get this question. In fact, I often wait to see how long it's going to take before I get this question ;). Anyway, a guid does meet the criteria fairly well – it's certainly unique, it's usually static and it's relatively narrow. So, what's wrong with it? In SQL Server 2000, the guid function (newid()) is built using a value that does not create an ever increasing pattern (an IDENTITY column would). But wait, I didn't say that you needed to have an ever-increasing pattern…..

OK, so the final criteria I look for in a clustering key is: an ever-increasing pattern!
If the clustering key is ever-increasing then new rows have a specific location where they can be placed. If that location is at the end of the table then the new row needs space allocated to it but it doesn't have to make space in the middle of the table. If a row is inserted to a location that doesn't have any room then room needs to be made (e.g. you insert based on last name then as rows come in space will need to be made where that name should be placed). If room needs to be made, it's made by SQL Server doing something called a split. Splits in SQL Server are 50/50 splits – simply put – 50% of the data stays and 50% of the data is moved. This keeps the index logically intact (the lowest level of an index – called the leaf level – is a douly-linked list) but not physically intact. When an index has a lot of splits then the index is said to be fragmented. Good examples of an index that is ever-increasing are IDENTITY columns (and they're also naturally unique, natural static and naturally narrow) or something that follows as many of these things as possible – like a datetime column (or since that's NOT very likely to be unique by itself datetime, identity). But wait, what about that a guid.

Well, in SQL Server 2000 the only SQL Server function for guids is newid – that does not create an ever increasing pattern. In SQL Server 2005, you can use a new guid function called newsequentialid() to populate your uniqueidentifier column. Here's an example of how you can use it:

CREATE TABLE Test
(
TestID uniqueidentifier CONSTRAINT Test_TestID_Default DEFAULT newsequentialid(),
Inserted datetime CONSTRAINT Test_Inserted_Default DEFAULT getdate()
)
go

INSERT Test DEFAULT VALUES
go

SELECT * FROM Test
go

Is there a way to create a sequential guid in SQL Server 2000?
YES, use Gert Drapers wrote an xp to generate sequential guids! Check it out here. He just published this recently (May 2005) and it's a good change for you in SQL Server 2000 databases/applications.
Full title with link: XPGUID.DLL – Sequential GUID generation and GUID helper functions XP

OK, so I've tackled a few things here today and I've answered a few questions related to indexes before. Check out my Indexes Category of blog entries here and if you're at Tech*Ed today, I hope to see you at 5PM.

Thanks for reading,
kt

7 thoughts on “Ever-increasing clustering key – the Clustered Index Debate……….again!

  1. Hi,
    Checking out your testscript I stumbled into a litle confusion.
    regarding the "ever increasing" sequence of newsequentialId() :~

    Declare @Test table (
    TestID uniqueidentifier DEFAULT newsequentialid(),
    Inserted datetime DEFAULT getdate()
    );

    set nocount on;
    declare @x integer;
    set @x = 0;
    while @x < 1000
    begin
    INSERT into @Test DEFAULT VALUES;
    set @x = @x + 1;
    end;
    set nocount off;

    with cteMyGuidTest (MySeqNo, TestID, Inserted, TestID_String3)
    as (
    SELECT row_number() OVER(ORDER BY substring(convert(char(36) , TestID),1,3), TestID ASC)
    , TestID, Inserted
    , substring(convert(char(36) , TestID),1,3) as x
    FROM @Test
    )
    select *
    from cteMyGuidTest T1
    inner join cteMyGuidTest T2
    on T1.TestID_String3 = T2.TestID_String3
    and T1.MySeqNo = T2.MySeqNo – 1
    order by T1.MySeqNo, T1.TestID, T2.TestID;
    go
    This results in :
    MySeqNo TestID MySeqNo TestID
    ——————– ———————————— ——————– ————————————
    2 0060B1B4-129A-DA11-B4A8-0014227408B8 3 0061B1B4-129A-DA11-B4A8-0014227408B8
    3 0061B1B4-129A-DA11-B4A8-0014227408B8 4 0062B1B4-129A-DA11-B4A8-0014227408B8
    6 0160B1B4-129A-DA11-B4A8-0014227408B8 7 0161B1B4-129A-DA11-B4A8-0014227408B8
    7 0161B1B4-129A-DA11-B4A8-0014227408B8 8 0162B1B4-129A-DA11-B4A8-0014227408B8
    10 0260B1B4-129A-DA11-B4A8-0014227408B8 11 0261B1B4-129A-DA11-B4A8-0014227408B8

    Notice the increment is in the beginning of the guid-string ??
    Let’s hope the guids are handled insideout or we’ll get a bunch of splits wich we may not expect ?

  2. newsequentialid does NOT create a perfectly sequential id. But, it does siginifcatly reduce the overall fragmentation in a table – usually by 80-90 percent. You should see an index that has newsequentialid as it’s leading column – result in 5-10% fragmentation. I thought that one of us had done a post of this but alas, I can’t find one :). So, I might do one to reiterate what you’re seeing and why.

    Long story short, you’re correct. newsequentialid does not eliminate fragmentation but it significantly reduces it.

    Cheers,
    kt

  3. Hi Kimberly, That´s a great post! Thanks…

    Don´t you ever consider checking the index usage in a table when choosing the which index must be the clustered one?

    For example, if i have an index which was checked in sys.dm_db_index_usage_stats with almost 300,000 seeks, but a little wider than other index with 16,000 seeks. Couldn´t i choose the one with more seeks?

    Regards

    1. You could, but it depends on the access and insert patterns on the data. Remember – narrow, static, unique, ever increasing – and usually a surrogate key instead of a natural key.

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

SQLskills SQL101: Partitioning

Continuing on our path to understanding the basics and core concepts, there’s a big topic that’s often greatly misunderstood and that’s partitioning. I know I’m

Explore

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.