OK, two in two days? What's wrong with me (is what Paul said) after he said – who are you and where is my wife?
But, this one is a short one, specifically a follow-on to my prior post about "Disk space is cheap…"
I did a bit of simple math on the internal overhead and costs of inefficient clustering keys and I wanted to share those as well. If a table is clustered then each and every nonclustered index must include ALL columns of the clustering key in [at least] the leaf level of the clustered index. When a nonclustered index is non-unique then it's even worse in that all columns of the clustering key must go all the way up the tree (into the non-leaf levels). I re-wrote sp_helpindex to give you better insight into what was happening in your nonclustered indexes here: A new and improved sp_helpindex (jokingly sp_helpindex8). I've actually re-written it again to include a column that shows whether or not an index is disabled but it still needs the base procedure [sp_SQLskills_ExposeColsInIndexLevels] from the other post – make sure you download that first before trying to use this sp_helpindex9: sp_SQLskills_SQL2008_helpindex (sp_helpindex9).sql (11.06 kb).
Having said that – what's the overhead in terms of some of your bigger tables?
If you have a 10 million row table with 8 nonclustered indexes then *just* the internal overhead is going to cost you the following with keys of the varying sizes:
Description |
Width of CL key |
Rows |
NC Indexes |
MB |
int |
4 |
10,000,000 |
8 |
305.18 |
datetime |
8 |
10,000,000 |
8 |
610.35 |
datetime, int |
12 |
10,000,000 |
8 |
915.53 |
guid |
16 |
10,000,000 |
8 |
1,220.70 |
composite |
32 |
10,000,000 |
8 |
2,441.41 |
composite |
64 |
10,000,000 |
8 |
4,882.81 |
And, what if you have a bigger table – say a 100 million row table with 12 nonclustered indexes… the internal overhead is as follows:
Description |
Width of CL key |
Rows |
NC Indexes |
MB |
int |
4 |
100,000,000 |
12 |
4,577.64 |
datetime |
8 |
100,000,000 |
12 |
9,155.27 |
datetime, int |
12 |
100,000,000 |
12 |
13,732.91 |
guid |
16 |
100,000,000 |
12 |
18,310.55 |
composite |
32 |
100,000,000 |
12 |
36,621.09 |
composite |
64 |
100,000,000 |
12 |
73,242.19 |
So, if you're talking about disk space, IOs, backups, etc. and you have a very inefficient clustering key (say a 5 column "natural key") that's 64 bytes then you have 71.53GB of OVERHEAD where it could have been only 4.47GB. That's a tad over 67GB of wasted disk space, memory (really – do you have 67GB+ of memory to waste??) and what about the long term costs of backing all of this up?? And, I haven't even started to talk about the inefficiencies with the FKs too. Or, the inefficiencies within the nonunique nonclustered index trees as well.
It's a COMPLETE NIGHTMARE!
Please… everyone repeat after me…………….. DESIGN MATTERS!!! :-)
Thanks for reading!
kt
15 thoughts on “How much does that key cost? (plus sp_helpindex9)”
Are there any general tools for estimating the size of a database with a known schema?
Hey there Andy – There are lots of tools to use to see the average, min and max row size of existing tables and indexes. The best is sys.dm_db_index_physical_stats. With the “detailed” mode you can actually see the size of every level. This can be really helpful. For a quick glance at space in general you can use sp_spaceused (without any parameters) to see total data and index size OR you can use sp_spaceused tablename to see just the data and index size for a specific table. And, as of SQL Server 2005 the new standard reports have a good one that I often use that shows the disk space usage for top tables. This gives you a list of your tables from the largest down for the top 1000 tables in a database. To be honest, most of us only *really* care about the top 10-20.
Hope that helps!
kt
I have always advocated using Identity (1-1) columns as the PK. My argument has been that [u]within[/u] the associated tables, this saves space. You have now added a second reason for my argument and, interestingly enough, it is essentially based upon the same reasoning. The same bloat that compound PK/Clustered Indexes experience is also going to impact the tables that have to have the PK included in their structure in order to maintain referential integrity.
Am I correct in understanding this is only the disk space overhead for storing the CL key at the leaf level for a unique OR a non unique index and does not include the actual disk space required to store the columns actually in the nonclustered?
In other words, the additional overhead if this is a non unique nonclustered is potentially (I say potentially because if there is a column that exists in both the CL key and nonclustered, it won’t be duplicated) going to be the same per row up the tree because of the need to include the CL key at non leaf levels?
Does anyone have any good suggestions for other resources I can reference if I really want to learn this subject inside and out, down to the bare bones? I love indexes. They are so flippin rad. :)
(moved up as a response)
Hey there Andy – There are lots of tools to use to see the average, min and max row size of existing tables and indexes. The best is sys.dm_db_index_physical_stats. With the "detailed" mode you can actually see the size of every level. This can be really helpful. For a quick glance at space in general you can use sp_spaceused (without any parameters) to see total data and index size OR you can use sp_spaceused tablename to see just the data and index size for a specific table. And, as of SQL Server 2005 the new standard reports have a good one that I often use that shows the disk space usage for top tables. This gives you a list of your tables from the largest down for the top 1000 tables in a database. To be honest, most of us only *really* care about the top 10-20.
Hope that helps!
kt
Hey there tanner.robin (couldn’t decide if you were Robin Tanner or Tanner Robin :) )…
But, first and foremost – yes, indexes are completely and totally flippin rad! For a really comprehensive deep dive into indexes I’d suggest the SQL Server 2008 Internals title (www.amazon.com/…/0735626243). I’m somewhat baised though as I wrote the Index Internals chapter. For that chapter, I also put *every* sample line of code into a solution with 3 projects. You might check out the demo scripts (here:www.sqlskills.com/…/…rver-2008-Internals.aspx) and if you find you want to fill in the gaps – go back and check out the book!
But, back to your question – YES! That’s the overhead NOT including the actual columns of the index (the key + include) and I was ONLY talking about the leaf level. But, you’re also correct in that when a nonclustered index is nonunique then the CL key columns must go all the way up the tree. So, a wide clustering key can be VERY expensive for sure!
Cheers,
kt
Yes, design matters. One thing you neglected is the binary tree of the index. Only the FIRST field of an index is stored in the binary tree. Early in my work with SQL, I didn’t realize this and totally messed up a table by putting the partioning field of a partitioned view as the first field in the clustered primary key.
To find a record required a scan of the data table layer until the second field is found.
The binary tree contains a list of the first value and the physical address to the next binary tree page or leaf page. The leaf page of a nonclustered index will contain the values of all the fields defined and in addition either the physical address of the data table page in a heap table or all the field values of the clustered index.
In the second case, you then have to traverse the binary tree of the clustered index to find the data table row you are looking for.
Hey there Ken – I think you’re misunderstanding the correct uses of clustered vs. nonclustered indexes and index internals. SQL Server [b]always[/b] stores the [u][b]entire key[/b][/u] and possibly the lookup value (the lookup value [either the heap’s RID or the clustering key] goes all the way up the tree when you create a nonunique nonclustered index). But, you’re definitely right on the nonseekable aspects of an index. All indexes are only seekable by left-based subsets of the key. So, it’s true that if you’re looking for a second value of a an index key – the index must be scanned. It would be the same case when using the phone book and saying that you always want to search for people by *firstname* – you can do it but it’s painfully slow because as it requires a scan. And, if that index is the clustered index then that’s a table scan.
But, in general, this is a discussion that requires a few more details, I’ll probably end up doing a blog post because this is a bit more complicated than a simple/quick answer but I just want to make sure you know that SQL Server *always* stores the entire key in the btree not just the first column! The problem you are talking about is just that the index wasn’t good for what you were doing. To be honest, if the second value of you CL key was unique (like an identity) then it would have (probably) been better to have the following:
CL PK on Date, Identity
Nonclustered unique on Identity, Date
And, if it was a partitioned view (as you mention) then the next thing I would have done is to create TWO constraints. One constraint on date and then a second on the identity (as they probably closely mimic the date in that each partition has a specific range – maybe there’s a slight overlap at the beginning or end but in general they should closely mirror each other). Yes, you might not know the specific range until each month is added but for all past months you do and for the current month you have a >=. As a result, you can do partitioning elimination over one or both of the keys. This is really cool!
So, all I was talking about in these posts are the internal ramifications of CL key choice – not the index usage patterns. That’s a different discussion. :) But, I often choose (and HIGHLY recommend) that the CL PK in partitioning scenarios be a combination of (and in this order) the partitioning key (when date-relatied) and then the row-identifier (ideally an int/bigint). That has lots of other benefits for sorting as well.
Cheers,
kt
First name Robin, last name Tanner…
In reference to the last paragraph in your last post, shouldn’t we use a GUID as the row-identifier and partition with that?? Maybe we should add a couple of them, just to make sure its EXTRA unique. :D
Also, I will be getting your book on Internals, I am excited to read it. At the moment (maybe I put the cart before this horse here), I’m knee deep in The Guru’s Guide to SQL Server Architecture and Internals by Ken Henderson. Memory Allocation is fun stuff!
Hey there Robin – Yes, GUIDs for all columns. Even quanity is good as a GUID. But, somehow coming up with a "random" GUID and then using the last 1-2 digits as a quantity might cause the customer to buy more items than they want. Hmmm… maybe that’s a good thing? ;-) ;-)
I’ll look forward to more questions! And, The Guru’s Guides are good too!!
Cheers,
kt
I am a regular visitor of your blog. Learnt a lot about indexes and tried out many of your ideas. I am currently working on refining Non clustered indexes on one our legacy tables (with around 15 million rows and growing). We have a couple of NC indexes where the key columns are repeated in more than one index. I am trying to remove the redundant key columns from these NC indexes.
Since the table is a legacy table I am not really aware about how the processes query the table (most of the sql calls against this table are inline statements in Java code), so I wanted to take a route that is least risky but I am afraid it might not buy me any major benefit. We have three key columns being repeated in three of the NC indexes. I am planning on creating one new NC index on the three repeated columns and remove the three columns from the existing NC indexes. But I am worried that if the person who created those initial NC indexes repeated those columns because they wanted to cover a particular query so I am planing on actually removing the three repeated key columns and include those columns so that all existing queries would still be covered.
But by creating a new NC index on the three columns and including those three columns in the three existing NC indexes, I am not sure if I would get any performance benefit on DML operations. I am thinking I should atleast get a benefit on CPU resources as sorting of data is no longer required for the existing indexes. Are there any other gains I can expect with this approach.
In this regard I wanted to see if you have done this or if you have any better ideas to see more performance gains…
As for key columns that are in more than one index – that might be more than appropriate. Removing a column just because it exists in more than one index is NOT recommended. You really need to do deeper analysis of your workload to determine what the correct indexes should be rather than trying to come up with a simple rule (which in this case is not really appropriate).
And, your second paragraph further confirms this. Without workload analysis you may remove columns that other queries DO NEED! And, you’re correct, removing those columns could affect queries that were previously covered so this is definitely NOT a good idea!
Simply put – you should consider looking at index usage stats to firust gauge how much these indexes are being used. Then, through workload analysis (and probably some time with DTA) you should start to determine which indexes are being used by which queries. It’s this specific analysis that will help you to determine exactly which indexes should remain and which can be removed/combined.
However, I have to admit – this is a VERY difficult question to answer in just a comment on a blog post. I do hope this helps though!
Cheers,
kt
Hey Kimberly,
Extremely useful script, use it almost every other day. :)
Why not have another column, giving the index size as well?
-Y :)