
GUIDs as PRIMARY KEYs and/or the clustering key

(Be sure to join our community to get our monthly newsletter with exclusive content, advance notice of classes with discount codes, and other SQL Server goodies!)

Expanding on the topic of “are you kidding me”… one of the MOST PREVALENT problems I see today is the dreaded “GUIDs as PKs” problem. However, just to be clear, it’s not [as much of a] problem that your PRIMARY KEY is a GUID as much as it is a problem that the PRIMARY KEY is probably your clustering key. They really are two things BUT the default behavior in SQL Server is that a PRIMARY KEY uses a UNIQUE CLUSTERED INDEX to enforce entity integrity. So, I thought I’d take this post to really dive into why this is a problem and how you can hope to minimize it.

Relational Concepts – What is a PRIMARY KEY? (quick and basic reminder for what is what and why)

Starting at the very beginning… a primary key is used to enforce entity integrity. Entity integrity is the very basic concept that every row is uniquely identifiable. This is especially important in a normalized database because you usually end up with many tables and a need to reference rows across those tables (i.e. relationships). Relational theory says that every table MUST have a primary key. SQL Server does not have this requirement. However, many features – like replication – often have a requirement on a primary key so that they can guarantee which row to modify on a related database/server (like the subscriber in a replication environment). So, most people think to create one. However, not always…

What happens when a column(s) is defined as a PRIMARY KEY – in SQL Server?

The first thing that SQL Server checks is that ALL of the columns that make up the PRIMARY KEY constraint do not allow NULLs. This is a requirement of a PRIMARY KEY but not a requirement of a UNIQUE KEY. SQL Server also checks (if the table has data) that the existing data meets the uniqueness requirement. If there are any duplicate rows, the addition of the constraint will fail. And, to check this as well as to enforce this for [future] new rows – SQL Server builds a UNIQUE index. More specifically, if you don’t specify the index type when adding the constraint, SQL Server makes the index a UNIQUE CLUSTERED index. So, why is that interesting…
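To make that distinction concrete, here’s a minimal T-SQL sketch (table, column, and constraint names are made up purely for illustration) showing the default behavior, and how to separate the PRIMARY KEY decision from the clustering-key decision:

```sql
-- Default: the PRIMARY KEY is enforced with a UNIQUE CLUSTERED index.
CREATE TABLE dbo.OrdersDefault
(
    OrderID   INT IDENTITY NOT NULL
        CONSTRAINT PK_OrdersDefault PRIMARY KEY,   -- clustered by default
    OrderDate DATETIME NOT NULL
);

-- Separating the two decisions: the PK is enforced with a NONCLUSTERED
-- index, and the clustering key is chosen independently.
CREATE TABLE dbo.OrdersExplicit
(
    OrderGUID UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT PK_OrdersExplicit PRIMARY KEY NONCLUSTERED,
    OrderID   INT IDENTITY NOT NULL,
    OrderDate DATETIME NOT NULL
);

CREATE UNIQUE CLUSTERED INDEX CL_OrdersExplicit_OrderID
    ON dbo.OrdersExplicit (OrderID);
```

The second pattern is the one discussed throughout this post: entity integrity is still enforced by the GUID PK, but the table is physically ordered by the narrow, static, ever-increasing identity.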

What is a clustered index?

In SQL Server 7.0 and higher the internal dependencies on the clustering key CHANGED. (Yes, it’s important to know that things CHANGED in 7.0… why? Because there are still some folks out there that don’t realize how RADICAL of a change occurred in the internals (with respect to the clustering key) in SQL Server 7.0). It’s always (in all releases of SQL Server) been true that the clustered index defines the order of the data in the table itself (yes, the data of the table becomes the leaf level of the clustered index) and, it’s always been a [potential] source of fragmentation. That’s really not new. Although it does seem like it’s more of a hot topic in recent releases, that may be solely because there are more and more databases out there in general AND they’ve gotten bigger and bigger… and you feel the effects of fragmentation more when databases get really large.

What changed is that the clustering key gets used as the “lookup” value from the nonclustered indexes. Prior to SQL Server 7.0, SQL Server used a volatile RID structure. This was problematic because as records moved, ALL of the nonclustered indexes would need to get updated. Imagine a page that “splits” where half of the records are relocated to a new page. If that page has 20 rows then 10 rows have new RIDs – that means that 10 rows in EACH (and ALL) of your nonclustered indexes would need to get updated. The more nonclustered indexes you had, the worse it got (this is also where the idea that nonclustered indexes are TERRIBLY expensive comes from). In 7.0, the negative effects of record relocation were addressed in BOTH clustered tables and heaps. In heaps they chose to use forwarding pointers. The idea is that the row’s FIXED RID is defined at insert and even if the data for the row has to relocate because the row no longer fits on the original page – the row’s RID does not change. Instead, SQL Server just uses a forwarding pointer to make one extra hop (never more) to get to the data. In a clustered table, SQL Server uses the clustering key to look up the data. As a result, this puts some strain on the clustering key that was never there before. It should be narrow (otherwise it can make the nonclustered indexes UNNECESSARILY wide). The clustering key should be UNIQUE (otherwise the nonclustered indexes wouldn’t know “which” row to look up – and, if the clustering key is not defined as unique then SQL Server will internally add a 4-byte uniquifier to each duplicate key value… this wastes time and space – both in the base table AND the nonclustered indexes). And, the clustering key should be STATIC (otherwise it will be costly to update because the clustering key is duplicated in ALL nonclustered indexes).
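A quick sketch of the uniquifier point (the table and index names here are hypothetical): clustering on a non-unique column forces SQL Server to add the hidden 4-byte uniquifier to every duplicate key value, whereas widening the key with the identity keeps it unique without the hidden overhead:

```sql
CREATE TABLE dbo.Sales
(
    SalesID  INT IDENTITY NOT NULL,
    SaleDate DATETIME NOT NULL
);

-- Non-unique clustered index: duplicate SaleDate values each get a hidden
-- 4-byte uniquifier, stored in the base table AND in every nonclustered index.
CREATE CLUSTERED INDEX CL_Sales_SaleDate ON dbo.Sales (SaleDate);

-- Better: make the key unique yourself by appending the identity column.
DROP INDEX CL_Sales_SaleDate ON dbo.Sales;
CREATE UNIQUE CLUSTERED INDEX CL_Sales_SaleDate_SalesID
    ON dbo.Sales (SaleDate, SalesID);
```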

In summary, the clustering key really has all of these purposes:

  1. It defines the lookup value used by the nonclustered indexes (should be unique, narrow and static)
  2. It defines the table’s order (physically at creation and logically maintained through a linked list after that) – so we need to be careful of fragmentation
  3. It can be used to answer a query (either as a table scan – or, if the query wants a subset of data (a range query) and the clustering key supports that range, then yes, the clustering key can be used to reduce the cost of the scan – it can seek with a partial scan)

However, the first two are the two that I think about the most when I choose a clustering key. The third is just one that I *might* be able to leverage if my clustering key also happens to be good for that. So, some examples of GOOD clustering keys are:

  • An identity column
  • A composite key of date and identity – in that order (date, identity)
  • A pseudo-sequential GUID (using the NEWSEQUENTIALID() function in SQL Server OR a “homegrown” function that builds sequential GUIDs – like Gert’s “built originally to use in SQL 2000” xp_GUID here: http://sqldev.net/xp/xpguid.htm)
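If the GUIDs have to be generated server-side, the NEWSEQUENTIALID() approach can be sketched like this (names are illustrative; note that NEWSEQUENTIALID() is only valid as a column DEFAULT – unlike newid(), it cannot be called directly in a SELECT):

```sql
CREATE TABLE dbo.Customers
(
    CustomerGUID UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_Customers_GUID DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_Customers PRIMARY KEY CLUSTERED,
    CustomerName NVARCHAR(100) NOT NULL
);
```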

But, a GUID that is not sequential – like one that has its values generated in the client (using .NET) OR generated by the newid() function (in SQL Server) – can be a horribly bad choice, primarily because of the fragmentation that it creates in the base table but also because of its size. It’s unnecessarily wide (it’s 4 times wider than an int-based identity – which can give you 2 billion (really, 4 billion) unique rows). And, if you need more than 2 billion you can always go with a bigint (8-byte int) and get 2^63−1 rows. And, if you don’t really think that 12 bytes wider (or 8 bytes wider) is a big deal – estimate how much this costs on a bigger table and one with a few indexes…

  • Base Table with 1,000,000 rows (3.8MB vs. 15.26MB)
  • 6 nonclustered indexes (22.89MB vs. 91.55MB)

So, we’re looking at 25MB vs 106MB – and, just to be clear, this is JUST for 1 million rows and this is really JUST overhead. If you create an even wider clustering key (something horrible like LastName, FirstName, MiddleInitial – which let’s say is 64 bytes) then you’re looking at 427.25MB *just* in overhead… And, then think about how bad that gets with 10 million rows and 6 nonclustered indexes – yes, you’d be wasting over 4GB with a key like that.
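The arithmetic behind those numbers can be sketched directly in T-SQL (the clustering key is stored once in the base table and once in each nonclustered index; the row counts and key widths below match the examples above):

```sql
DECLARE @Rows BIGINT = 1000000, @NCIndexes INT = 6;

SELECT  KeyBytes,
        CAST(@Rows * KeyBytes              / 1048576.0 AS DECIMAL(10,2)) AS BaseTableMB,
        CAST(@Rows * KeyBytes * @NCIndexes / 1048576.0 AS DECIMAL(10,2)) AS NonclusteredMB
FROM (VALUES (4),    -- int identity
             (16),   -- uniqueidentifier (GUID)
             (64)    -- wide composite key (LastName, FirstName, MiddleInitial)
     ) AS k(KeyBytes);
-- 4  bytes ->  3.81 MB +  22.89 MB
-- 16 bytes -> 15.26 MB +  91.55 MB
-- 64 bytes -> 61.04 MB + 366.21 MB (427.25 MB total)
```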

And, fragmentation costs you even more in wasted space and time because of splitting. Paul’s covered A LOT about fragmentation on his blog so I’ll skip that discussion for now BUT if your clustering key is prone to fragmentation then you NEED a solid maintenance plan – and this has its own costs (and potential for downtime).

So…………… choosing a GOOD clustering key EARLY is very important!

Otherwise, the problems can start piling up!

kt

42 thoughts on “GUIDs as PRIMARY KEYs and/or the clustering key”

  1. I have a real world case for this. We had a new product being worked on for over a year with no DBA involvement. At random times throughout the year, 5 different DBAs happened to see the schema and remarked that the GUIDs as PKs would cause performance issues, focusing more on write performance and fragmentation than read performance. We were finally able to convince the dev team to do load testing comparing read performance with GUIDs vs. BIGINTs as PKs. For the tests we used a simple [orders] and [order_items] schema, with [orders] containing 1 million rows and [order_items] 10 million (an even 1:10 ratio). The first test was for single-row select statements; queries using GUIDs did not take up any additional measurable resources over the BIGINTs. The next test was JOINING parent and child tables for one key value. The GUID PK/FK combination consumed approximately 18% additional resources over the BIGINT PK/FK combination. The third test was to query based on a range of data (we used order_date) combined with an INNER JOIN. In this third test the GUID combination consumed approximately 80% additional resources over the BIGINTs. 80%.

    Work is now underway by the dev team to refactor the entire schema to use bigints, and now DBAs are being involved in the dev cycle.

  2. Great post, as always, Kimberly!

    I, too, encountered this in the ‘real world.’ In a 2 billion row, wide table, to be exact. A GUID was a PK and FK in each of these tables. Part of the redesign was to remove the GUIDs and use an identity column as an arbitrary clustered key and FK instead. I also implemented partitioning at the same time, so it’s hard to give an exact number, but read performance improved somewhere in the vicinity of 88%.

    I now use a GUID clustered key and insert records using NewID() to illustrate fragmentation, because it works *so* well for that. :)

    Side note: for some reason, I didn’t see any blog posts from you in my Google Reader account, then yesterday, 8 showed up. RSS issues?

  3. Bnaya – If you add an int clustering key then it’s best – if you can – to also use that as the FK. You can certainly look up the rows by the PK but make the PK a nonclustered index instead.

    Cheers,
    kt

  4. Whoa.

    I’m reading this post for the third time, I read it for the first time yesterday and I’m still reeling. I’m a programmer and I’m shuddering with the realization of just how big the mistakes are that we as a team made with a now-cancelled proof-of-concept project about two years ago. We used GUIDs as the primary key on almost all of our tables and we were running into performance issues with only a few hundred rows.

    @Kimberly: I just want to be sure about what I understand from this post:
    – A primary key is normally the main unique identifier for the rows in the table. Other tables would ordinarily have foreign keys in them that linked to the table’s PK. (This is from varsity.)
    – Clustered indices dictate how the data is stored on the disk.
    – "Narrow" PKs like ints are good options.
    – If one expects that the PK will fragment very much or if the PK might be wide, like if the (unique) PK was something along the lines of a varchar or GUID, specify that the PK should be non-clustered and add an incrementing int/bigint column and create a non-unique, CLUSTERED index on that column so as to decrease I/O.

    Some code so you can follow what I am trying to say:

    CREATE TABLE [dbo].[test](
    	[primaryIdentifier] [varchar](50) NOT NULL,
    	[id] [bigint] IDENTITY(1,1) NOT NULL,
    	[lotsOfOtherColumnsOfVariousTypes] [varchar](25) NOT NULL,
     CONSTRAINT [PK_test_1] PRIMARY KEY NONCLUSTERED
    (
    	[primaryIdentifier] ASC
    )WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 65) ON [PRIMARY]
    ) ON [PRIMARY]
    
    GO
    
    CREATE UNIQUE CLUSTERED INDEX [IX_Clustered_id] ON [dbo].[test]
    (
    	[id] ASC
    )WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 65) ON [PRIMARY]
    GO
    

    Am I understanding the principles *behind* the "GUID PKs are usually very bad" message that the masses seem to be trumpeting?

    Will performance be better with the code above than with ditching the id column and just using primaryIdentifier as the clustered PK given, say, four million rows that are inserted occasionally (I realise that this is moving into fill factor territory)?

    Thank you

  5. So then why is it that a very popular Microsoft product, SharePoint, makes all of its PKs GUIDs and clustered?

  6. Kutta, for the simple reason that the product was written by developers who knew precious little about databases.

    Everything about Sharepoint stinks from a database point of view.

  7. If the GUID is the intrinsic identifier for the entity being modelled (i.e. used by selects) then it should be the clustered primary key without question. The reason is that adding a surrogate identity key (of int or bigint) and demoting the GUID primary key to a column with an index/unique constraint requires 2 indexes to be maintained and slows down, in my experience, by a factor of 2.

    1. Hey there Herman – Actually, I disagree. For point-based queries using a nonclustered index does not add a significant amount of *costly* IOs. And, the maintenance of a nonclustered index that’s heavily fragmented is a lot cheaper than the required maintenance on a heavily fragmented clustered index. Additionally, the GUID might make your nonclustered indexes unnecessarily wide – making them take: more log space, more disk space, more cache as well as adding time on insert and access (especially in larger queries/joins). So, while you might not feel like an arbitrary/surrogate key is useful (because you never directly query against it) it can be incredibly efficient to use indirectly through your nonclustered indexes. There’s definitely an element of “it depends” here but if you have even just a few nonclustered indexes then it’s likely to be more beneficial than negative and often significantly so.

      Cheers,
      kt

  8. hi Kimberly,

    I appreciate that “choosing a GOOD clustering key EARLY is very important!” but I’m a contractor and it’s a luxury I don’t have often enough. So, on that topic, what’s the best way to remove all those GUIDs as clustered keys I have encountered in a poorly performing production DB that’s 300+ tables with, on average, a dozen FK constraints on the GUID clustered keys? The bigger tables are upwards of 10M rows… oh, it also happens to be the publisher in transactional replication and a 24×7 DB. The fill factor is 0, as you might have guessed (just to compound the error).

    I’d like to see it fixed and performing properly, so I’m not considering calling my agent for a new contract just yet.

    Regards.

    1. Start with the 6 or 8 monsters first and slowly clean up the problem. Unfortunately there just isn’t a quick fix. Just make sure you write a script to do it and test it thoroughly (ideally on a copy of the db that’s been restored to a test server) so that you know how long it will take to run.

      Hope that helps!
      kt

      1. cheers,

        Going to start with the smaller tables, I think, and build up a best practice for rolling out to the largest tables! Thinking of a new identity field (if the app can support it), using a new filegroup, rebuilding the index with a move, and re-clustering on a separate physical disk.

        If the app won’t support a new column without a major release, what do you think about the NEWSEQUENTIALID() function? I’ve an idea I can just change the default on the table to NEWSEQUENTIALID() as a short-term fix. The old data will still fragment on update but at least new records won’t impact performance in the same way. The maintenance plan should then be able to keep up with the reindexing of the updated data.

        What’s your experience of utilising NEWSEQUENTIALID() as the default?

        Regards.

        1. Hey there Colin – Sure, if you’re generating the GUIDs server-side then switching from NEWID() to NEWSEQUENTIALID() will help a lot. It’s not perfectly sequential but it’s significantly better than NEWID which is random. But, my experience (in general) is often that the GUIDs are getting generated at the client. So this isn’t always an option.

          And, starting with the smaller tables might be easier for you to get used to the process, etc. (so that’s good). But, it’s the bigger tables that will give you better gains. So, a bit of a trade-off in terms of time.

          Keep me posted!
          kt

          1. Kimberly,

            I’m having a few issues testing my short-term improvements to my clustering-on-GUIDs problem. I’m using the Trans/sec counter and I’m seeing a significant drop – wondering if you can help with a few pointers?

            As a short-term fix until I can arrange downtime (due to tran replication and constraints), I’ve changed the fill factors on the affected tables from 0 to 95% full (the idea being to slow down the level of fragmentation). This has worked, as it now takes 3 days to fragment significantly, giving me time to manually intervene with re-indexing.

            Before changing the fill factor I was getting transactions per sec averaging 350; this has now dropped to the low 100s. Could the drop be explained by a lower number of page splits etc., and do I actually have a better-performing DB? Any idea how I can identify good and bad transaction activity?

            There are some positives: my log backups are getting larger (suggesting there’s more data in there now for the same time period) and my replication agent rates have increased from 4,000 per sec to 6,000 per second. However, this raises the question: is this improved replication rate hurting my genuine transactions per second, and my front-end OLTP application?

            Am I missing something important when changing the fill factor on the indexes that are clustering on GUIDs? I saw improvements to fragmentation rate at 97% fill, then reduced to 95%, before noticing the drop in transactions per second. This is stopping me dropping it further till I can explain what I’m seeing.

            Colin.

  9. How good is it to have a PK on a GUID and a clustered index on an identity column? Will this improve performance on 20 million records?

    1. It’s hard to say. 20 million rows could be only 100 MB or it could be 20 GB. It’s more about the row size than anything. But, yes, a [nonclustered] PK on a GUID with a CL index on an identity is often a better choice. There are some cases where this can start to bottleneck and where having a few insertion points would be better than only 1 but having random insertion points (with a GUID) is almost always a problem (requiring all of the data to live in cache when the usage patterns often don’t warrant it).

      Cheers,
      kt

      1. Hi Kimberly,

        Could you elaborate on when this could start to bottleneck? I am looking to refactor a few tables like this (adding an int identity column and making it the clustered index), but want to make sure it is the best solution.

        Thanks,
        Nick

        1. Hey there Nick – Unfortunately there are multiple factors here and it’s really long-term scalability that’s the biggest problem. It’s not a single problem – it’s the combined added costs of wider rows, more cache, and more logging. But, part of that is tied to how many nonclustered indexes you have. EACH nonclustered is wider and the more nonclustered indexes you have the more costly it’s going to get. And, this is also tied to how the data is accessed and how much data you have. If the table is large then the entire table needs to be in cache (because the inserts are so evenly distributed). For example, if you had 10 years of sales data ordered by GUIDs then you’d need to keep all 10 years in cache in order to insert sales for 2014. And, queries would span many different pages because of the lack of order.

          In the end, there are just too many factors to give you a number. The decision to use GUIDs just isn’t scalable.

          Cheers,
          kt

          PS – If you have a Pluralsight subscription – you should check out my course on “SQL Server: Why Physical Database Design Matters” http://www.pluralsight.com/courses/sqlserver-why-physical-db-design-matters. I go into more details there as well.

  10. Hello,

    I ran across this article from SQLCAT, discussing hot latches when sequential keys are used:
    http://www.microsoft.com/en-us/download/confirmation.aspx?id=26665

    Beginning with page 40 is the discussion of hot pages and latch waits. They report a 690% improvement moving away from sequential keys on high-insert tables. The improvements are largely due to reduced latch waits, dropping from an avg latch wait time of 36ms to <1 ms per insert.

    Developers and data architects who choose GUIDs usually point to this as their reason. As a DBA I compensate with more frequent rebuilds/reorgs.

    What's your take on latch waits due to hot latches and using guids to "spread the data"?

    1. Unfortunately ANYONE can create a test that shows problems with BOTH approaches. What’s rarely done is a comprehensive analysis of EVERYTHING that a production database has. A table is NEVER just inserts alone. It has foreign keys, indexes, and is often wider than these tests show too (the narrower the row – the faster latching becomes a problem).

      Anyway, the best link is Pedro’s: http://blogs.msdn.com/b/blogdoezequiel/archive/2013/05/23/pagelatch-ex-waits-and-heavy-inserts.aspx#.UmR-iNLn8-U and I do talk about things like hash partitioning in class and for high-end tables. This discussion is actually really hard but it’s also not as simple as just use GUIDs either.

      I do hope Pedro’s link helps! Let me know if you have more Qs.

      Cheers,
      kt

      1. Not when it’s a Data Vault warehouse…in that case, it truly IS a write-only bunch of tables and the referential integrity is handled by the simplistic load routines, not in the table setup…I think a case could be made for GUID/HASH as a surrogate key in a data vault. Just a thought.

        1. I’m not sure if I understand the comment (and, sorry for the delay – I found this among a massive amount of blog spam [which is really a nightmare these days]).

          Anyway, I’m not saying that you shouldn’t have GUIDs as a PK or a UK for the table. Generally, I just don’t want them as the clustering key. The PK does NOT have to be clustered. That default behavior is really what you need to look at and decide if it’s beneficial. Usually, a GUID as a CL key is a bad thing. But, it can be a PK or UK – but, usually best as a nonclustered. There are certainly exceptions to the rule but in this case – not often.

          Hope that helps!
          k

        1. I’m not sure if there’s a question here but in general – always / never – can rarely be used in database design. There are pros and cons to everything that you could possibly come up with…

          Cheers,
          k

  11. Thanks for a great article. I was having performance problems with inserts into a table – anywhere from 100 ms to 60 seconds! – and after making sure my database was ok and that the server CPU usage was low, it had lots of RAM and free disk space, etc., I looked in the SQL log and found multiple entries like this:

    SQL Server has encountered 1226 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file […] in database […].

    What was interesting was that this was in a different database! I had to ask the owners of that database for access and found out that it had a table with millions of rows, being added to constantly, and the primary key was a clustered index on a guid field. After stopping their application, and doing maintenance on their database (it needed to grow and have a log backup), inserts into my database went back to normal – 8 to 21 ms every time. So this issue was bogging down SQL so much it was affecting other databases on the same SQL instance, even though the server hardware wasn’t under a heavy load.

    1. Yeah… this is a constant battle that I’m fighting. In test scenarios (without foreign keys, other indexes, realistic tables, etc.) people get all sorts of things to look good (even inserts into a GUID because YES inserts into an identity CAN have their own problems…). But, MORE OFTEN, GUIDs are a HUGE problem. Fixing them with maintenance (as a short-term solution) and ultimately through better design and better indexing – that’s what gives you long term scalability and performance!

      Keep us posted on how your improvements are working!

      Cheers,
      Kimberly

  12. HELLO Kimberly

    I am Naresh; I recently joined a firm as a junior SQL developer.
    Could you please suggest how to get started with SQL and what things to consider while designing a database – I mean a normalized database? Give me some tips to design a normalized database. My team leader gave me a task: I have to design a database for an educational organisation.
    Please give me your suggestions.
    Thanks a lot in advance

  13. Hi Kimberly, great article! I have a question about foreign keys when using the [guid/pk/ non-clustered] + [int/clustered/unique index] approach. If I’m joining on guid will there be any performance issues or is this the correct way?
    My reason for using guids as pk and fk’s is so the data is portable across databases.

    1. Hey there Alwin – You need these for portability yes and the entry point for your code (the value you use for the lookup) can be the main GUID / PK value. However, you could still have an underlying ID on which the table is clustered and then use that for joins. However, that does mean that those specific rows are specific to that server. So, if you really need rows to move around and be highly portable then you don’t really have a choice, right? There are more efficient data types on which to join but you have other priorities in terms of portability. So, you’ll have to choose your trade off there.

      hth,
      k

  14. I am pretty late to this discussion, but this seems like a good place to start…

    I have a new project… I am using an int identity for PKs, but all the tables also have a UUID column as an alternate key. All the UUIDs are created client side.

    I use the alternate key (UUID) for all joins… not the primary key. I am planning on adding a unique constraint to each alternate UUID key… will this help?

    The problem is that in the client, records from different databases, with the same table names, are combined in the same array… so an identity int key is really not possible.

    If my PKs are all identity int, but I am never using the PKs for joins, instead using an alternate key that is a client-side created UUID (not sequential), is this better than just using UUIDs for PKs?

    Mark

    1. Hey there Mark – I’m not sure I follow you. Are you saying that users never know the key that you’ve used internally and so they always need to join on the UUID? My point there would be that the UUID can get them into a core / critical table but YOUR code behind the scenes could still do all of the joins with an internal (and more efficient) key like an ID. And, yes, don’t use the UUID for the clustered key.

      Remember, the PK does NOT have to be the clustered key. And, it’s the clustered key that’s the concern here. Your PK can be the UUID – just make it nonclustered.

      hth,
      k

  15. GUIDs are not truly guaranteed to be unique.

    I don’t like GUIDs in database tables because they are much harder to follow than auto-increment integers and big integers.

    GUIDs are not sequential, which confuses things if one were to use the GUID to sort the results of a SQL query.

    GUIDs not being sequential adds unnecessary confusion in the tables that I have run across where you join multiple tables which all have GUIDs. It is easy to get lost in the GUID jungle in such cases.

    My preference is to keep things simple. Sorting by auto-increment keys keeps things in the order that the records were created in the database table.

    No thank you on using GUIDs for me… give me auto-incrementing integers and big integers and things will work out better in the long run.

