I received a great question in email and it’s something I cover in our IEPTO1 (Immersion Event on Performance Tuning, Part 1) so I thought I’d write a post about it…

Question: 

When you have a non-unique key value in a nonclustered index, SQL Server adds the RID / Row Identifier (if the NC is on a heap table) or the clustered key (if the table has a clustered index) to the navigational structure (the b-tree) of the nonclustered index to make non-unique index rows unique. What is the reason behind that implementation?

Answer: 

SQL Server requires the lookup value (the RID or the clustering key) to go up the tree when a nonclustered key is non-unique in order to efficiently locate rows in the leaf level.

Scenario: Take a non-unique, nonclustered index on gender (on an employee table that’s clustered by EmployeeID). To show this, I’ll create a simple tree structure with only 24 rows (12 female and 12 male employees).

Note: this is NOT exactly what the index looks like but for simplicity, I’ll use this. Also, this data set is small for simplicity (you can understand this problem with a small data set); the problem is even more exaggerated for larger and larger data sets. Finally, if you’re new to index internals, consider checking out my online course here.

Possible Option 1: Only push the non-unique nonclustered key up the tree

If the leaf level is sorted only by the nonclustered key of gender then there’s no order to the EmployeeID values within each gender grouping:

Imagine a nonclustered index on gender with a clustered index on EmployeeID. If the leaf level were not sorted then every request (DELETE / UPDATE) would have to scan ALL of the Fs to find the female row they were searching for… This is a “What If” picture; this is NOT how nonclustered indexes are stored.

In this case, the index has very few uses. We can get a count of Female or Male employees easily but not much else is efficient. A SELECT will have to scan (from the start – every time); it can stop once the record is found (EmployeeID is unique) but the cost of always starting at page 1 (for female) or page 4 (for male) would be too expensive with many rows / pages.

Having said that though, you might even start by wondering what an index on gender would benefit anyway? The most useful requests are exactly that (counts by gender, list by gender). So, for most queries this structure could work and it really wouldn’t matter for queries. Inserts could just go to the end of each section. But, what about deletes? What if an employee were to leave? If you were to execute the following:

DELETE [Employee] WHERE [EmployeeID] = 27;
GO

How would SQL Server find that row? They would know that the employee is female when they went to the data row itself. But, finding this row within the female grouping of this nonclustered index would ALWAYS require a scan through all female rows. The same would be true for an update.

So, while this could work for inserts as well as some queries, it’s horribly inefficient for updates and deletes.

Possible Option 2: Sort the leaf level but still ONLY push the non-unique nonclustered key up the tree

Even if they were sorted – if you don’t “push” the clustered key value (EmployeeID) up the tree, the index cannot be efficiently searched / modified, we still have to start at the lowest value and scan until we find the row to DELETE / UPDATE. This is a “What If” picture; this is NOT how nonclustered indexes are stored.

Here we don’t save much; we can still stop a scan once a value was found. But, we can do that in the structure for Option 1 (as long as EmployeeID was defined as unique).  About the only added benefit of this structure is for an ORDER BY query.

SELECT [EmployeeID], [Gender]
FROM [Employee]
ORDER BY [Gender], [EmployeeID];
GO

For Option 1 this query would have to add an additional sort. For Option 2, we’d save this. But, we’d still have the same problem for a DELETE / UPDATE; we’d have to scan to find the specific row to modify.

So, we’re really back to the same issues as Option 1: this could work for inserts as well as even a few more queries, it’s horribly inefficient for updates and deletes.

SQL Server pushes the “lookup value” up the tree

This makes each row in a nonclustered tree unique. This makes every query directly seekable to the specific page on which that row resides. We can support queries with additional predicates (such as WHERE EmployeeID >). Again, this index doesn’t provide a lot of uses BUT it’s still MORE useful when it’s sorted AND seekable by the composite key and where that key is unique up the tree.

A delete can go directly to the leaf page.

DELETE [Employee] WHERE [EmployeeID] = 27;
GO

After going to the row, we’d find that EmployeeID 27 is female. Using the root page we can see that the first pointer is for EmployeeID 2 or higher. The second pointer is for EmployeeID 59 or higher. Since we’re looking for 27 – we’d use the first pointer which points to page 1 for data.

In the same way, an update can go directly to the leaf page. An insert has a specific place to insert (yes, these can cause splits).

With both the non-unique nonclustered key as well as the clustered key, you have a tree structure that can be efficiently searched / modified / updated.

So, while I could have used a larger and more realistic example for this – this definitely shows the need for the clustering key in the b-tree portion of an index.

Have fun and keep those questions coming!

Cheers,
k