There’s a lot of confusion around duplicate indexes and I think a lot of it has to do with how sp_helpindex (or even SSMS) display what’s in an index. What’s actually in the index – and how it’s structured – are not always what they seem. This was the original motivation behind my rewrite of sp_helpindex but even since then, I’ve still seen a lot of confusion. In today’s blog post I’m going to first explain EXACTLY which indexes are the same and which aren’t – as well as the faults in the tools.
So, starting with index structures… (it all starts with internals :) )
The clustered index IS the data. The KEY to the clustered index (what I often refer to as the clustering key) defines the way that the data is ordered (not necessarily truly, physically ordered on disk – but ordered LOGICALLY). And, no, I’m not going to rehash every bit of internals here… just a few reminders.
A nonclustered index is duplicated data (similar to an index in the back of a book). This duplicated data can be used to help reference the actual data (exactly like an index in the back of a book) OR can be used to actually respond to requests (for example, if you’re solely looking for a count of people with a name that begins with ” then an index that has LastName in the index could be used to count them – without actually “looking up” the actual data). So, there are some very powerful uses to indexes. But, alas, this isn’t a post on using indexes or indexing strategies – this is ALL about internals (and understanding the structure of an index). So, I’m going to cut to the chase!
A nonclustered index always has:
- Key (this is what defines the order of the index)
- A leaf-level entry (this is the actual data stored in the index + the lookup value* + any included columns) – however, ALL of these columns are only stored once (and they’ve ALWAYS only been stored once here so, even if you reference a column that’s part of the lookup value, SQL Server will NOT duplicate it again).
*So, what is this lookup value?
The lookup value is what SQL Server uses to reference the actual data row. If a table has a clustered index then the lookup value is the clustering key (and EVERY column defined by it). If a table does not have a clustered index (it’s then a HEAP) then SQL Server uses something called a RID. A RID is an 8-byte structure consisting of 2:4:2 bytes which breakdown into 2 for the FileID, 4 for the PageID and 2 for the slot number. And while RID’s (and their history) are very interesting – it’s not entirely relevant here (how they specifically work and/or their structures) but if/when they’re in an index, I’ll list is as RID.
Let’s now put all of this together with an example (or two) starting with a relatively easy one.
USE [JunkDB]; GO CREATE TABLE Test ( TestID int identity, [Name] char(16) ); GO CREATE UNIQUE CLUSTERED INDEX TestCL ON Test (TestID); GO CREATE INDEX TestName ON Test([Name]); GO sp_helpindex Test GO
The output shows:
index_name index_description index_keys TestCL clustered, unique located on PRIMARY TestID TestName nonclustered located on PRIMARY Name
And, while that looks right, it’s horribly misleading. The index that’s on TestName also includes TestID in the index as well. And, not just in the leaf level but in the tree (for ordering purposes). So, that should really show Name, TestID. But, it’s a bit more confusing if you throw in this:
CREATE UNIQUE INDEX TestNameUnique ON Test([Name]); GO EXECUTE sp_helpindex Test; GO
The output shows:
index_name index_description index_keys TestCL clustered, unique located on PRIMARY TestID TestName nonclustered located on PRIMARY Name TestNameUnique nonclustered, unique located on PRIMARY Name
And, at this point, it doesn’t look like there’s any difference at all between the second and third indexes (well, except that the third index requires that the values be unique – shown in the description). But, as for the “index_keys” they look the same. However, they’re actually not the same (in all parts of the tree). So, this is why I tend to differentiate between the “leaf” and the non-leaf levels of an index (when I describe them). And, it only gets more complicated when you throw in included columns (2005+).
So, how to do you tell the difference? Unfortunately, there are NO tools within SQL Server (or even any 3rd party tools that I know of) that display this through the UI, etc. But, you could start by using my replacement to sp_helpindex. I’ve updated and rewritten it a few times but the latest one will always be in the sp_helpindex reqrites category here: https://www.sqlskills.com/blogs/kimberly/category/sp_helpindex-rewrites.aspx. I personally think that’s the easiest (and is the reason why I originally wrote it!). Using it you can see how the output is more detailed.
The output shows (specifically showing ONLY the last 2 columns):
columns_in_tree columns_in_leaf [TestID] All columns "included" – the leaf level IS the data row. [Name], [TestID] [Name], [TestID] [Name] [Name], [TestID]
Now, we’re getting somewhere. We can definitely see that these two indexes ARE different. A nonunique nonclustered needs to have the lookup value pushed up into the tree (for navigation). A unique nonclustered index does not.
Next, let’s go with a [just slightly more challenging] example:
USE [JunkDB]; GO CREATE TABLE Member ( MemberNo int identity, FirstName varchar(30) NOT NULL, LastName varchar(30) NOT NULL, RegionNo int ); GO CREATE UNIQUE CLUSTERED INDEX MemberCL ON Member(MemberNo); GO CREATE INDEX MemberIndex1 ON Member(FirstName, RegionNo, MemberNo) INCLUDE (LastName); GO CREATE INDEX MemberIndex2 ON Member(FirstName, RegionNo) INCLUDE (LastName); GO CREATE INDEX MemberIndex3 ON Member(FirstName, RegionNo) INCLUDE (MemberNo, LastName); GO CREATE UNIQUE INDEX MemberIndex4 ON Member(Firstname, RegionNo) INCLUDE (MemberNo, lastname); GO
First, let’s review with sp_helpindex:
EXECUTE sp_helpindex Member; GO
The output shows:
index_name index_description index_keys MemberCL clustered, unique located on PRIMARY MemberNo MemberIndex1 nonclustered located on PRIMARY FirstName, RegionNo, MemberNo MemberIndex2 nonclustered located on PRIMARY FirstName, RegionNo MemberIndex3 nonclustered located on PRIMARY FirstName, RegionNo MemberIndex4 nonclustered, unique located on PRIMARY FirstName, RegionNo
Looking solely at sp_helpindex it LOOKS like the first nonclustered index is different from the others and that the 2nd, 3rd and 4th nonclustered indexes are the same. However, that’s actually NOT true. Next, let’s use my version of sp_helpindex.
EXECUTE sp_SQLskills_SQL2008_helpindex Member; GO
The output (again, just showing the last two columns of output):
columns_in_tree columns_in_leaf [MemberNo] All columns "included" – the leaf level IS the data row. [FirstName], [RegionNo], [MemberNo] [FirstName], [RegionNo], [MemberNo], [LastName] [FirstName], [RegionNo], [MemberNo] [FirstName], [RegionNo], [MemberNo], [LastName] [FirstName], [RegionNo], [MemberNo] [FirstName], [RegionNo], [MemberNo], [LastName] [FirstName], [RegionNo] [FirstName], [RegionNo], [MemberNo], [LastName]
From this, you can see that all 4 indexes have the same leaf level but index 4 has a slightly different tree structure. Ultimately, it’s indexes 1, 2 and 3 that are the same and index 4 is actually [slightly] different. How they’re different (outside of the fact that the 4th index guarantees uniqueness) is a bit beyond the scope of this post. But, yes, there are some [relatively minor in this case] differences. And, since I’m ONLY looking for indexes that are the same then only 1, 2 and 3 fit that requirement.
And, things get MUCH more complicated if you have multicolumn clustering key and/or more complicated INCLUDEs.
Having said that – how do you find duplicate indexes?
Well… I had started this by putting together a simple way for you to check for dupes with my version of sp_helpindex but then I figured out an issue with included columns. I’m showing the structures as they’re defined (and stored). But, in terms of usage – the order of the columns in the INCLUDE does not matter. As a result, two indexes with different orders for their included columns will show as two different indexes (technically they are and they are stored differently). However, there are absolutely NO differences (in terms of usage) so I’ll need to write code to adjust for that (to truly find duplicates).
For now, here’s some quick code to get you closer. I had written some of this when we discussed this in our London Immersion Event. However, I’ve tweaked it even further here after thinking about some interesting exceptions. This code will find absolute duplicates (where the order of the structure is completely identical). To use it you’ll need to enter in the schemaname and tablename (in bold below).
IF (SELECT count(*) FROM #FindDupes) IS NOT NULL DROP TABLE #FindDupes; GO CREATE TABLE #FindDupes ( index_id int, is_disabled bit, index_name sysname, index_description varchar(210), index_keys nvarchar(2126), included_columns nvarchar(max), filter_definition nvarchar(max), columns_in_tree nvarchar(2126), columns_in_leaf nvarchar(max) ); GO DECLARE @SchemaName sysname, @TableName sysname, @ExecStr nvarchar(max); SELECT @SchemaName = N'schemaname', — MODIFY @TableName = N'tablename'; — MODIFY SELECT @ExecStr = 'EXECUTE sp_SQLskills_SQL2008_helpindex ''' + QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName) + N''''; INSERT #FindDupes EXEC (@ExecStr); SELECT t1.index_id, COUNT(*) AS 'Duplicate Indexes w/Lower Index_ID', N'DROP INDEX ' + QUOTENAME(@SchemaName, N']') + N'.' + QUOTENAME(@TableName, N']') + N'.' + t1.index_name AS 'Drop Index Statement' FROM #FindDupes AS t1 INNER JOIN #FindDupes AS t2 ON t1.columns_in_tree = t2.columns_in_tree AND t1.columns_in_leaf = t2.columns_in_leaf AND ISNULL(t1.filter_definition, 1) = ISNULL(t2.filter_definition, 1) AND PATINDEX('%unique%', t1.index_description) = PATINDEX('%unique%', t2.index_description) AND t1.index_id > t2.index_id GROUP BY t1.index_id, N'DROP INDEX ' + QUOTENAME(@SchemaName, N']') + N'.' + QUOTENAME(@TableName, N']') + N'.' + t1.index_name; GO
I’m planning to bulid a sproc around this next as well as get rid of cases where the included columns are just in different orders next. It will be another sp_ for master and I do hope to wrap it up shortly! That code will REALLY find ALL unnecessary duplicates and help you to drop them! Stay tuned!!
Cheers and thanks for reading!!
17 thoughts on “How can you tell if an index is REALLY a duplicate?”
Thank you for post, it’s very informative. I look forward to your sprocs next week.
At what point do you say that indexes are duplicates (and therefore can be removed) when a wider index contains the same keys as a smaller index.
CREATE INDEX MemberIndex5 ON Member(FirstName) INCLUDE (LastName)
Assuming regionNo is a fairly narrow column, would you consider it to be a good candidate to be dropped because queries that use it can be satisfied with MemberIndex2?
The answer is probably going to be "it depends", but I’m really interested in what it depends on, and what are your general rules around this area.
Duplicate indexes are exactly that – completely redundant and exactly the same.
Indexes that are left-based subsets of others are not the same. And there are a lot of issues around whether or not an index which is a left-based subset of another should be removed. The two things that are the most important that I look for in the additional/wider indexes are:
How much wider is it?
When the columns in the wider index are not all that much wider then it might not be a big deal. However, there are a bunch of factors here. For example, are the queries that use the narrower index doing scans? And, the queries accessing the wider index doing seeks? I’d almost argue against the wider index in that case. But, if they’re all doing scans and the wider index isn’t that much wider. And, here it’s really hard to assign a percentage to it – it really needs to be tied to the cost of the narrower queries… are the scans against the wider index costing the narrower queries 10%, 20%, 30% more IOs? That you’d have to test. In many cases it’s probably LESS costly than you might think and if the wider queries are MORE common and the narrower queries less common then there’s your answer.
How often are the indexes being used?
If the narrower queries are 90% of the queries and wider ones only 10 then you might actually want BOTH?? So, for this part, you’ll need to review the usage patterns. To get information about the usage patterns of the indexes (to help you determine how often the narrower index is being used compared to the wider version) then you can also use dm_db_index_usage_stats. That can be really helpful here for reviewing/tracking certain patterns.
But, yes, this is a big "it depends" but it’s all tied to understanding your workloads, understanding your data AND understanding how SQL Server works. It’s this last one that Paul and I spend a lot of time teaching and writing about! Lifting the hood and seeing what’s going on inside the engine is what really answers a lot of questions!
Great reply. Look forward to seeing you and Paul at in immersion event in Australia one day :)
@David-Absolutely, we love Oz. We’re hoping to head down there in 2012!
Thanks for the great post.
I just discovered this alternative to sp_helpindex, excellent stuff. Thank you!
What are some other good alternative tools?
A few I know and use are:
Adam Machanic’s sp_whoisactive
Aaron Bertrand’s sp_foreachdb
SQL Sentry’s (free) Plan Explorer
This is mind blowing blog post. Thank you for providing the code snippets and the blog post.
I have not run across an example of this, but… Would a different sort direction (ASCending or DESCending) of the values for a column(s) also make an index a non-duplicate?
Yes, they would have different benefits. Mostly for queries that had something like:
ORDER BY Category, Price DESC
Meaning – multicolumn order by clauses where one or more of the columns used DESC. Having said that though, there are A LOT of factors that can influence which index(es) get used and many happen before an index to support ORDER BY is used. A join and the WHERE clause can often better use an index to eliminate rows and then the result (being a lot smaller) is sorted in memory.
Nice post. I came across your posting today and is very useful. I have a situation where I need your help in understanding the index tree structure for this table stucture below.
CREATE CLUSTERED INDEX IX1
CREATE UNIQUE NONCLUSTERED INDEX IX2
Am i right in saying this is how the structure is stored internally? If so aren’t they similar? Please advise
(COL1,COL2,COL3) All columns “included” – the leaf level IS the data row.
Similar yes but not the same. The clustered index has all of the data columns in the leaf level where as the nonclustered ONLY has the columns of the index. So, I’d really need to know what tbl1 looks like. If tbl1 ONLY has c1, c2, c3 then the indexes would actually be the same. But, if the table has 47 columns then this is how the two structures differ:
b-tree columns (for navigation and searching) c1, c2, c3
Leaf level c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, … c47
b-tree columns (for navigation and searching) c1, c2, c3
Leaf level c1, c2, c3
A nonclustered index on the SAME columns as the clustered (even in the same order) does provide SOME uses. In fact, this is the narrowest index you can create. The coolest thing this can provide is that SMALLEST index to scan for a count(*) query. But, it also has other uses in “covering” any queries that want any combination of c1, c2, c3 – for any subset or even the entire set (as this takes fewer pages to read).
So, while I don’t always do this and there aren’t always uses for it. They are different enough (on a wider table) to consider it!
Hope that helps!
I have Clustered Index on a Primary key. If I create a non clustered index on the same column. Non clustered index is not needed here. But your script is not displaying it. Am I missing anything?
Vijay Anand Madhuranayagam
They are very different indexes and there are some fantastic reasons to create nonclustered indexes on the same column as the clustering key. Remember, the clustered index has every column in the leaf level. A nonclustered has only the columns defined by the index plus the clustering key (when not already present). So, the nonclustered index will be very narrow compared to the table and may help with a variety of queries.
Hope that helps!