SQL Server has a horrible habit of letting bad habits move forward (for the sake of backward compatibility [e.g. autoshrink]). And, I do understand *some* of the reasons for allowing some of these "features/options" to move forward. However, there are a few that frustrate me and I've talked about this one quite often.

SQL Server lets you create completely redundant and totally duplicate indexes.

Yes, you can even create the same index 999 times! Try it:

SET NOCOUNT ON
go

CREATE TABLE TestTable
(
     col1 int  identity
)
go

DECLARE @IndexID smallint,
        @ExecStr nvarchar(500)
SELECT @IndexID = 1
WHILE @IndexID <= 999
BEGIN
  SELECT @ExecStr = 'CREATE INDEX [Test' + right('00' + convert(varchar(5), @IndexID), 3) + '] ON TestTable (col1)'
  EXEC(@ExecStr)
  SELECT @IndexID = @IndexID + 1
END
go

SELECT count(*)
FROM sys.indexes
WHERE object_id = object_id('TestTable')
go

--DROP TABLE TestTable
go

For somewhat obvious reasons having 999 duplicate indexes does seem completely useless... so, why does SQL Server allow it at all? And, what's the backward compatibility argument? Well, I've written about that before on our SQL Server Magazine Q&A blog (Kimberly & Paul - Questions Answered): Why SQL Server Lets You Create Redundant Indexes.

Having said that - and, for those of you that are only checking YOUR code, you don't use index hints and/or you don't care about backward compat - then, you can use my proc to help you find duplicate indexes (and drop them)! It's all based on my "sp_helpindex rewrites" stored procedures but I had to do some "tweaking" to get the procedures to understand that the order of the columns in the INCLUDE list are irrelevant (in terms of order). So, you can't use the sps as written. They give you the exact (and correct) structure - and, that's exactly how SQL Server stores them. For example, the following two indexes have a different structure. However, they are completely redundant and you do NOT need both.

create index test1 on member(lastname) include (firstname, middleinitial)
create index test2 on member(lastname) include (middleinitial, firstname)
go

Here's what you need:

First, my modified sp_ that exposes all of the columns in all levels of the index. This one does NOT replace the one needed by sp_SQLskills_SQL2008_helpindex. You need both. This version is used to reorder the INCLUDEd columns so that the "finddupes" proc can remove indexes that are different solely because of the order of the columns in the INCLUDE clause (which is irrelevant for navigation/index usage).

(1) Install this one first: 20110715_sp_SQLskills_ExposeColsInIndexLevels_INCLUDE_UNORDERED.sql (6.95 kb)

Second, my modifed sp_helpindex that is used SOLELY by finddupes. Basically, it's just modified to use the UNORDERED version of "exposing the columns" in the nonclustered indexes. Again, this does NOT replace the other sp - you need both.

(2) Install this one second: 20110715_sp_SQLskills_SQL2008_finddupes_helpindex.sql (10.95 kb)

Finally, here's the "finddupes" procedure.

(3) Install this one last: 20110720_sp_SQLskills_SQL2008_finddupes.sql (5.30 kb)

NOTES: At some point in the not-too-distant future I'll modify this for the following:

1) One set of code that will work for 2005/2008

2) I'll combine the two versions of the "expose columns in index levels" so that there's only one.

3) I'll make this work for indexed views

Finally, how can you use this:

To find the duplicates for just one table - you can use one-part or two-part naming:

USE AdventureWorks2008
go

EXEC sp_SQLskills_SQL2008_finddupes 'Production.Document'
go

OR... to find all duplicates in the entire database:

USE AdventureWorks2008
go

EXEC sp_SQLskills_SQL2008_finddupes
go

OR... if you're really motivated, you can run it for all databases - but, before you do - see Aaron Bertrand's comment to this post for his replacement for sp_msforeachdb and use that instead!!

sp_msforeachdb 'USE ?; exec sp_SQLskills_SQL2008_finddupes'
go

Have fun and let me know how many duplicates you find? I found one in AdventureWorks2008 and a few in Northwind. I suspect you'll find a few! This isn't something that will find tons of indexes (to drop) but if there are even a couple of dupes (especially on large tables) you'll have savings in database modification statements, logging, caching, maintenance, storage (and therefore backups, etc.) and this is why dupes are HORRIBLE! I do wish that SQL Server had an option to prevent their being created! Maybe someday?!

Enjoy,
kt

There's a lot of confusion around duplicate indexes and I think a lot of it has to do with how sp_helpindex (or even SSMS) display what's in an index. What's actually in the index - and how it's structured - are not always what they seem. This was the original motivation behind my rewrite of sp_helpindex but even since then, I've still seen a lot of confusion. In today's blog post I'm going to first explain EXACTLY which indexes are the same and which aren't - as well as the faults in the tools.

So, starting with index structures... (it all starts with internals :) )

The clustered index IS the data. The KEY to the clustered index (what I often refer to as the clustering key) defines the way that the data is ordered (not necessarily truly, physically ordered on disk - but ordered LOGICALLY). And, no, I'm not going to rehash every bit of internals here... just a few reminders.

A nonclustered index is duplicated data (similar to an index in the back of a book). This duplicated data can be used to help reference the actual data (exactly like an index in the back of a book) OR can be used to actually respond to requests (for example, if you're solely looking for a count of people with a name that begins with 'S' then an index that has LastName in the index could be used to count them - without actually "looking up" the actual data). So, there are some very powerful uses to indexes. But, alas, this isn't a post on using indexes or indexing strategies - this is ALL about internals (and understanding the structure of an index). So, I'm going to cut to the chase!

A nonclustered index always has:

  • Key (this is what defines the order of the index)
  • A leaf-level entry (this is the actual data stored in the index + the lookup value* + any included columns) - however, ALL of these columns are only stored once (and they've ALWAYS only been stored once here so, even if you reference a column that's part of the lookup value, SQL Server will NOT duplicate it again).

*So, what is this lookup value?

The lookup value is what SQL Server uses to reference the actual data row. If a table has a clustered index then the lookup value is the clustering key (and EVERY column defined by it). If a table does not have a clustered index (it's then a HEAP) then SQL Server uses something called a RID. A RID is an 8-byte structure consisting of 2:4:2 bytes which breakdown into 2 for the FileID, 4 for the PageID and 2 for the slot number. And while RID's (and their history) are very interesting - it's not entirely relevant here (how they specifically work and/or their structures) but if/when they're in an index, I'll list is as RID.

Let's now put all of this together with an example (or two) starting with a relatively easy one.

USE JunkDB
go

CREATE TABLE Test
(
    TestID  int identity,
    [Name]  char(16)
)
go

CREATE UNIQUE CLUSTERED INDEX TestCL ON Test (TestID)
go

CREATE INDEX TestName ON Test([Name])
go

sp_helpindex Test
go

The output shows:

index_name      index_description                               index_keys
TestCL          clustered, unique located on PRIMARY            TestID
TestName        nonclustered located on PRIMARY            
     Name

And, while that looks right, it's horribly misleading. The index that's on TestName also includes TestID in the index as well. And, not just in the leaf level but in the tree (for ordering purposes). So, that should really show Name, TestID. But, it's a bit more confusing if you throw in this:

CREATE UNIQUE INDEX TestNameUnique ON Test([Name])
go

sp_helpindex Test
go

The output shows:

index_name      index_description                               index_keys
TestCL          clustered, unique located on PRIMARY            TestID
TestName        nonclustered located on PRIMARY                 Name
TestNameUnique  nonclustered, unique located on PRIMARY         Name

And, at this point, it doesn't look like there's any difference at all between the second and third indexes (well, except that the third index requires that the values be unique - shown in the description). But, as for the "index_keys" they look the same. However, they're actually not the same (in all parts of the tree). So, this is why I tend to differentiate between the "leaf" and the non-leaf levels of an index (when I describe them). And, it only gets more complicated when you throw in included columns (2005+).

So, how to do you tell the difference? Unfortunately, there are NO tools within SQL Server (or even any 3rd party tools that I know of) that display this through the UI, etc. But, you could start by using my replacement to sp_helpindex. I've updated and rewritten it a few times but the latest one will always be in the sp_helpindex reqrites category here: http://www.sqlskills.com/BLOGS/KIMBERLY/category/sp_helpindex-rewrites.aspx. I personally think that's the easiest (and is the reason why I originally wrote it!). Using it you can see how the output is more detailed.

The output shows (specifically showing ONLY the last 2 columns):

columns_in_tree              columns_in_leaf
[TestID]                     All columns "included" - the leaf level IS the data row.
[Name], [TestID]             [Name], [TestID]
[Name]                       [Name], [TestID]

Now, we're getting somewhere. We can definitely see that these two indexes ARE different. A nonunique nonclustered needs to have the lookup value pushed up into the tree (for navigation). A unique nonclustered index does not.

Next, let's go with a [just slightly more challenging] example:

USE JunkDB
go

CREATE TABLE Member
(
    MemberNo    int     identity,
    FirstName   varchar(30)     NOT NULL,
    LastName    varchar(30)     NOT NULL,
    RegionNo    int
)
go

CREATE UNIQUE CLUSTERED INDEX MemberCL ON Member(MemberNo)
go
   
CREATE INDEX MemberIndex1
ON Member(FirstName, RegionNo, MemberNo)
INCLUDE (LastName)
go

CREATE INDEX MemberIndex2
ON Member(FirstName, RegionNo)
INCLUDE (LastName)
go

CREATE INDEX MemberIndex3
ON Member(FirstName, RegionNo)
INCLUDE (MemberNo, LastName)
go

CREATE UNIQUE INDEX MemberIndex4
ON Member(Firstname, RegionNo)
INCLUDE (MemberNo, lastname)
go

First, let's review with sp_helpindex:

sp_helpindex Member
go

The output shows:

index_name      index_description                           index_keys
MemberCL        clustered, unique located on PRIMARY        MemberNo
MemberIndex1    nonclustered located on PRIMARY             FirstName, RegionNo, MemberNo
MemberIndex2    nonclustered located on PRIMARY             FirstName, RegionNo
MemberIndex3    nonclustered located on PRIMARY             FirstName, RegionNo
MemberIndex4    nonclustered, unique located on PRIMARY     FirstName, RegionNo

Looking solely at sp_helpindex it LOOKS like the first nonclustered index is different from the others and that the 2nd, 3rd and 4th nonclustered indexes are the same. However, that's actually NOT true. Next, let's use my version of sp_helpindex.

sp_SQLskills_SQL2008_helpindex Member
go

The output (again, just showing the last two columns of output):

columns_in_tree                           columns_in_leaf
[MemberNo]                                All columns "included" - the leaf level IS the data row.
[FirstName], [RegionNo], [MemberNo]       [FirstName], [RegionNo], [MemberNo], [LastName]
[FirstName], [RegionNo], [MemberNo]       [FirstName], [RegionNo], [MemberNo], [LastName]
[FirstName], [RegionNo], [MemberNo]       [FirstName], [RegionNo], [MemberNo], [LastName]
[FirstName], [RegionNo]                   [FirstName], [RegionNo], [MemberNo], [LastName]

From this, you can see that all 4 indexes have the same leaf level but index 4 has a slightly different tree structure. Ultimately, it's indexes 1, 2 and 3 that are the same and index 4 is actually [slightly] different. How they're different (outside of the fact that the 4th index guarantees uniqueness) is a bit beyond the scope of this post. But, yes, there are some [relatively minor in this case] differences. And, since I'm ONLY looking for indexes that are the same then only 1, 2 and 3 fit that requirement.

And, things get MUCH more complicated if you have multicolumn clustering key and/or more complicated INCLUDEs.

Having said that - how do you find duplicate indexes?

Well... I had started this by putting together a simple way for you to check for dupes with my version of sp_helpindex but then I figured out an issue with included columns. I'm showing the structures as they're defined (and stored). But, in terms of usage - the order of the columns in the INCLUDE does not matter. As a result, two indexes with different orders for their included columns will show as two different indexes (technically they are and they are stored differently). However, there are absolutely NO differences (in terms of usage) so I'll need to write code to adjust for that (to truly find duplicates).

For now, here's some quick code to get you closer. I had written some of this when we discussed this in our London Immersion Event. However, I've tweaked it even further here after thinking about some interesting exceptions. This code will find absolute duplicates (where the order of the structure is completely identical). To use it you'll need to enter in the schemaname and tablename (in bold below).

IF (SELECT count(*) FROM #FindDupes) IS NOT NULL
    DROP TABLE #FindDupes
go

CREATE TABLE #FindDupes
(
 index_id           int,
 is_disabled        bit,
 index_name         sysname,
 index_description  varchar(210),
 index_keys         nvarchar(2126),
 included_columns   nvarchar(max),
 filter_definition  nvarchar(max),
 columns_in_tree    nvarchar(2126),
 columns_in_leaf    nvarchar(max)
)
go

DECLARE @SchemaName sysname,
        @TableName  sysname,
        @ExecStr    nvarchar(max)

SELECT @SchemaName = N'schemaname',     -- MODIFY
       @TableName = N'tablename'        -- MODIFY
       
SELECT @ExecStr = 'EXEC sp_SQLskills_SQL2008_helpindex '''
                    + QUOTENAME(@SchemaName)
                    + N'.'
                    + QUOTENAME(@TableName)
                    + N''''

INSERT #FindDupes
EXEC (@ExecStr)

SELECT t1.index_id, COUNT(*) AS 'Duplicate Indexes w/Lower Index_ID',
        N'DROP INDEX '
            + QUOTENAME(@SchemaName, N']')
            + N'.'
            + QUOTENAME(@TableName, N']')
            + N'.'
            + t1.index_name AS 'Drop Index Statement'
FROM #FindDupes AS t1
    JOIN #FindDupes AS t2
        ON t1.columns_in_tree = t2.columns_in_tree
            AND t1.columns_in_leaf = t2.columns_in_leaf
            AND ISNULL(t1.filter_definition, 1) = ISNULL(t2.filter_definition, 1)
           
AND PATINDEX('%unique%', t1.index_description) = PATINDEX('%unique%', t2.index_description)
           
AND t1.index_id > t2.index_id
GROUP BY t1.index_id, N'DROP INDEX ' + QUOTENAME(@SchemaName, N']')
                    + N'.'
                    + QUOTENAME(@TableName, N']')
                    + N'.' + t1.index_name
go

I'm planning to bulid a sproc around this next as well as get rid of cases where the included columns are just in different orders next. It will be another sp_ for master and I do hope to wrap it up shortly! That code will REALLY find ALL unnecessary duplicates and help you to drop them! Stay tuned!!

Cheers and thanks for reading!!
kt

Well, it's been a crazy few weeks! Paul and I have been booked solid with client engagements, Immersion Events, conferences and other things (like chapter editing for the upcoming SQL Server Deep Dives Volume 2) and so I've been a bit behind with blogging. I have a ToDo list of items to blog about but I've got one that's fairly urgent. I'll mention a few issues here but above all - don't trust anyone.

OK, I know that sounds a bit paranoid. Not meaning to be but this is a post that I'm adding to my "Just because" series as well as my "Are you kidding me?" series. That's how bad this is... almost as bad as these "stupid bacon related tatoos" - really, someone has this (forever!):

As a DBA (and/or SQL Architect/Database Designer/Developer) we're often tasked with making changes to databases and ideally, we want to use tools to make it easier. While I do agree that many tools help tremendously, I've been shocked lately by a few that have done the wrong thing. Paul blogged about a third party tool that didn't analyze indexes correctly here: Beware of advice from 3rd-party tools around dropping indexes (ultimately, the recommendations for what to drop - were wrong). And, recently, I've been shocked to learn that some of the schema change options in the database designer (in SSMS) are less than optimal when changing index definitions (especially those that handle changes to the clustering key).

Specifically, we had a customer that was doing everything right by testing the changes on a secondary system but SSMS was doing everything wrong (the order in which it made the changes and how it made the changes - were horrible). The end result is that if you're going to make schema changes, you really need to get more insight into what the application is doing. And, many tools have an option to script out the changes rather than execute them. So, my "quick tip" is to do that and see what's going on.

And, if you're going to make schema changes in ANY tool - then work through what those changes are going to do. Most important - consider re-arranging the SQL (especially wrt indexes). If you're ever changing the CL index you need to do the following:

  1. Drop nonclustered indexes first
  2. Drop the clustered next
    • IMPORTANT NOTES: If you're completely changing it you definitely need to drop the nonclustered and then drop the CL. BUT, if you're keeping the name and only changing the definition (and it's NOT a constraint-based index) then you can use WITH DROP_EXISTING. If you use DROP_EXISTING you do NOT need to drop the nonclustered indexes first.
  3. Create the new clustered
  4. Recreate the nonclustered indexes

Some of the weird things that we've seen:

  1. Some changes to table structure (changing from a heap to a clustered or a clustered to a heap) cause disabled nonclustered to get rebuilt (and re-enabled). That's not even SSMS - that's a general SQL Server problem. In my opinion this means that disabling indexes is somewhat broken. So... gotta use drop/create (which is a bummer!).
  2. Changes to the clustering key in SSMS are sometimes done by dropping the clustered and then recreating the clustered. This will cause your nonclustered indexes to be rebuilt TWICE!
  3. If you have col1 that's currently clustered and col2 is currently nonclustered but you want to swap them... you can do that in the UI and then save but here's what they do:
    1. They drop the nonclustered
    2. They drop the clustered (this causes all of the remaining nonclustered to get rebuilt)
    3. They create the nonclustered
    4. They create the clustered (this causes all of the remaining AND the new nonclustered to get rebuilt... again)

This is only a short list of some of the crazy things that we've seen. I'm working on a more comprehensive overview of all of these things but I wanted to get a few of them out there. BEWARE OF THE TOOLS and always use scripting and testing to make sure that things are doing what you think. If you're EVER going to make changes to a critical system - this is exactly what I'd do:

  1. Script out *JUST* the objects and their definitions from the production environment
  2. Take *just* the schema and then go through and make your changes in the designer. Instead of saving the changes (which would immediately implement them), script out the changes and review the SQL.
  3. Run through the SQL and see what it does (but, this is an empty database so time isn't going to be as much of a factor here). If you think there's something wrong - ask around (colleagues/twitter/forums)...
  4. Then, once you feel you have a good version of the script THEN backup/restore the production database to your test system (I hope that you have this??) and THEN see how long it takes. If you have new concerns then ask around again!
  5. Next, before you consider if for production - you need to thoroughly test your applications. Are they affected by these changes?
  6. Then, and only when you've thoroughly tested it - you can consider it for production.

Picky yes... surprises NO. On a critical system you CANNOT afford surprises that create downtime - or worse, data loss.

Thanks for reading!!
kt

I've decided to create a new series of posts - just because you can, doesn't mean you should! These are going to go in a new category called "Just because" and I've already got a few lined up in this series. The first one is one I've spoken about often and one that's near/dear to my heart - indexes.

I've often received the question - why did they increase the limit from 249 nonclustered indexes to 999 nonclustered in SQL Server 2008? Does that mean that I have a need for more - potentially, many more - indexes? Simply put - yes and no. (a new spin on "it depends" :)). However, for most environments, NO is the correct answer. And, a VERY STRONG NO at that. To help give you some insight, I've answered the question "how many indexes should a table have" in this SQL Server Magazine Q&A blog post titled How Many Indexes Should I Create? (however, it's a difficult question to answer in general). But if everyone says that tables should have minimal indexes then why would the SQL team increase the limit from 249 to 999? It just doesn't seem to make sense.

And, it's a great start to my "just because" series in that this is one area where you can really shoot yourself in the foot if you create too many indexes (which, btw, has become a VERY common problem). So... I want to break this down into a few parts (and therefore posts).

Why shouldn't I create a lot of indexes?

  • Indexes have overhead for every INSERT/DELETE on the table. Every INSERT must add a row to each/every nonclustered index. Every DELETE must remove a row from each/every nonclustered index.
  • Indexes require disk space. While nonclustered indexes are generally a lot smaller than the table itself their total size (of all indexes together) can often exceed the size of the table by 2 or 3 times. This can be quite normal. However, a table that is poorly indexed and severely over indexed might have index space which is 6-10 times the size of the base table. This is wasted space on disk and also in cache. While disk space is relatively cheap, memory is not. But, that's not even the most important point. There's a lot more to it... For more insight, check out the post titled: Disk space is cheap... (that's NOT the point!)
  • Even if you have the disk space to store all of these indexes, it's unlikely that you can fit all of your tables and all of your excessive indexes in cache. Most environments have a hard time fitting all of their data in general in cache - let alone a poorly/overindexed set of data. As a result, you're going to be constantly marshalling data in and out of memory and causing your system to perform excessive IOs.
  • You can cause the optimizer to waste time in evaluating indexes as well as cause compilation plans to be bigger. This can in turn waste both time and waste even more cache. Especially if you have a lot of plans being generated through adhoc statements. Each statement might waste quite a bit of cache. For more insight, check out the post titled: Clearing the cache - are there other options?

Why is this a common problem?

I'll go into more details in a moment but there are really three reasons:

  1. SHOWPLAN shows missing index recommendations when evaluating a plan. Again, some folks take that to mean that they absolutely should create the index.  Often, I've seen multiple developers working on a project all get their index recommendations in a vacuum (per se) and then they just provide scripts to the DBA for the indexes to implement. At this point, someone should evaluate the overall combination but often no one does. And, this leads to another reason why this is such a big problem.
  2. SQL Server 2005 introduced the missing index DMVs and some people believe that all of the indexes it recommends should be created (which is far from the truth).
  3. SQL Server (every version) will allow you to create redundant indexes. Not just similar (and redundant) but completely duplicate. There is a reason and I wrote about it in this SQL Server Magazine Q&A titled: Why SQL Server Lets You Create Redundant Indexes

The end result of this combination is that more and more servers I see - have too many indexes and often redundant or similar indexes. This wastes resources, time, cache and ultimately adds to a servers inability to scale as the data set grows. See, during development when the data sets were smaller, none of this really matters. Almost none is detected until more data and more users (and unfortunately, more code and more applications) exist. And, at that point it's too late to make schema changes BUT, it's not too late to make index changes. Phew. Believe me, many problems are MUCH harder to solve after an application is in production but indexes are one of the easier ones.

Common Problem 1: Index Recommendations from SHOWPLAN (the green hint)

 

Don't get me wrong - I *LOVE* that this exists. And I *always* look at what they recommend. It's a good way to see what tables might have holes in my indexing strategy. But, there's a potential problem here. These recommendations are tied to the Missing Index DMVs and so BOTH have this fault but it's A LOT more apparent when using the "green hint" than using the DMVs. My main reason to think twice about these recommendations is that the Missing Index DMVs recommendations are based SOLELY on the plan that was executed. SQL Server did not analyze the statements for different algorithms. If a hash join was used then the index recommended will be an index that will help the hash join. However, your query might be SIGNIFICANTLY better off using a merge join instead. No, please understand that this is NOT a generalization (I'm not saying that merge is better than hash) it's just an example of a situation where I saw a query go from many minutes to HALF with the best index for the hash join (hey, that's good, right?) BUT, the query went down to only 4 seconds with the merge join's index. There was NO comparison here.

There are many cases where the plan that was executed IS the best plan but there are also cases where it's not. As a result, if I'm about to create the index from the "green hint" in SHOWPLAN, I will always run the individual query through DTA (the Database Tuning Advisor) when possible. This will more thoroughly evaluate the query, providing me with alternatives. Yes, it's "focused" tuning but if you're already going to create an index for that query then you might as well work out the best one. Having said that, I will ONLY create the index recommended for the table that the "green hint" recommended and I will see if that gives me enough of a gain. DTA will often make additional recommendations for both indexes and stats. For stats, I would HIGHLY recommend creating all of the statistics it recommends but ONLY the table(s) where you actually create the indexes it recommends. Personally, I don't always create all of the recommended indexes. If I'm going to create them I'll do so iteratively - testing the improvements as I go so I know what indexes give me the best gains. But, I will always add the statistics for the tables where I create their recommended indexes. Why? Because the stats (and usually multi-column stats) are there to help the indexes and their uses. So, it's pretty much always a good thing to create those stats.

Common Problem 2: The Missing Index DMVs

Don't get me wrong - I *LOVE* that these exist too and just like SHOWPLAN, I use these as well. There are some great examples out there of ways to better evaluate what's in them, what should be created and there are even some cool posts about clever ways to corroborate what's recommended by looking at showplan XML, etc. Check out these links:

However, there's a dark side to these missing index DMVs and these points aren't often mentioned and - they might SIGNIFICANTLY change the way that you view these recommendations. First, they're not perfect (ok, nothing is but I had to remind you :). The Missing Index DMVs might recommend indexes that you have already created - even after you created them, even after a reboot, etc. They will just continue to recommend the index as if it didn't exist. Check out Paul's post titled: Missing index DMVs bug that could cost your sanity. And, second, my issue ties in with Common Problem 1. Again, these recommendations are NOT for the best index but for indexes that will help the plans as they were executed.

However, there's a HUGE win that comes from the Missing Index DMVs OVER SHOWPLAN. And, that's that they tune for query classes. What this means is that a query that's higher up in the "user impact" category will likely affect many queries and therefore it is much higher on the list to consider. But, even armed with this information, I would NOT automate the creation of these indexes based on the results of queries that access these DMVs. Instead, as an application runs in production and as workload characteristics are being evaluated - save this information and use it when you're tuning to help you know better what indexes might give you the best improvements.

Common Problem 3: Duplicate/redundant Indexes

I can't help you with this one. Except feel free to increase the vote count on this on connect. I found that someone recently (3/17/2011) re-added this as a DCR here: https://connect.microsoft.com/SQLServer/feedback/details/652071/duplicate-indexes#tabs. I've made a comment to further explain it and so I hope that they will reevaluate it.

In Summary

The tools are great. You definitely WANT to use them. However, you must understand their caveats and pitfalls and add indexes only when it's appropriate. Adding more and more indexes does NOT improve performance. And, above all - be sure to review your existing indexes to see if you can drop an existing index OR make your index slightly different to consolidate index definitions into one index that does a lot more with less overhead, etc. I call this Index Consolidation. And, this is another topic for another day. :)

So - why would you want to create lots and lots of indexes?

Ha, I bet you never thought we'd get here? And, for today - I'd rather you just DON'T do it. Just because you can, doesn't mean you should. I'll give you some reasons to consider LOTS of indexes...later...after you've cleaned up all of these redundant indexes! And, to help you find redundant indexes - use my latest version of sp_helpindex from my category: sp_helpindex rewrite. This will help you to more accurately see what's really in those indexes.

Thanks for reading!
kt

Become a SQLskills Insider NOW!

I realize that I've created quite a few posts (and rewrites) around sp_helpindex and recently I've been recommending that most folks use my latest version (jokingly, this is sp_helpindex9) but it's not so obvious where to get it and/or what to do to install it. To make it easier... I'm just going to do a post like this EVERY TIME I have a new version and so there won't be anything else to review/read.

All versions use a base procedure that builds the columns needed to produce the detailed output. So, you need to setup TWO procedures.

Step 1: setup sp_SQLskills_ExposeColsInIndexLevels

Create this procedure first: sp_SQLskills_ExposeColsInIndexLevels.sql (6.73 kb).

This is what gives us the tree/leaf definitions. And, this works for both SQL Server 2005 and SQL Server 2008.

Step 2: setup the replacement procedure for sp_helpindex. This IS version specific:

On SQL Server 2005, use: sp_SQLskills_SQL2005_helpindex.sql (10.50 kb) to create sp_SQLskills_SQL2005_helpindex.

On SQL Server 2008, use: sp_SQLskills_SQL2008_helpindex (sp_helpindex9).sql (11.06 kb) to create sp_SQLskills_SQL2008_helpindex.

Step 3: Optionally, setup this procedure to be invoked through a keyboard shortcut using Tools, Options, Environment/Keyboard. I usually make it Ctrl+F1 and I described how to do this here.

Enjoy!
Kimberly

OK, I've talked about the clustering key many times. Here, I thought I'd bring together one final series of details (and links) to help you really understand why I'm so adamant about DESIGNING a clustering key and not just letting SQL Server pick it for you (for example when it defaults to making the primary key clustered). Just because SQL Server defaults to making the primary key clustered doesn't actually mean it's a GOOD clustering key!

The key things that I've always recommended about a GOOD clustering key is that it is: unique, narrow, static and ever-increasing. For more details on the reasons behind this, check out these posts:

And, today, there are two additional items that I want to add to this list: your clustering key should be non-nullable and fixed-width!

First, why non-nullable?

In a regular data row you will ALWAYS have a null block. This is 2 bytes for a column count (in that row) and 1 bit per column (to store the actual null values). If you want to get more details on the internals of a data row, see Paul's post: Inside the Storage Engine: Anatomy of a record. However, the btree of the clustered index and the nonclustered index leaf/non-leaf levels do NOT have to have this "null block" of information if the columns in the index do NOT allow nulls. While this may be only 3 or 4 bytes (depending on the number of columns), it's still 3 or 4 bytes that would have to be added to EVERY nonclustered index's leaf level (for every row). And, it just doesn't have to be there. Why waste space when you don't have to!

Second, and along the same lines, is the need to use fixed-width columns!

Just like in data rows, index rows only have a variable block (offsets + end of row) when there are variable-width columns. If a clustered index is chosen that has one or more variable-width columns then you are adding at least 4 bytes to every row that might not have otherwise needed to be there. In fact, often people forget that uniquifiers are stored in the variable block portion of the row and as a result, a uniquifier really needs 8 bytes (4 bytes for the int + 2 bytes in the variable block for the offset + 2 bytes for the end of row marker).

To show you this, I've created four tables each with the same 3 columns DATA TYPES (table 4 has col1 as an int that's nullable and you can't do that with an identity so I used INSERT/SELECT to copy the data over):

col1 int identity not null,
col2 datetime2(7) not null default sysdatetime(),
col3 datetime2(7) null default sysdatetime()

And, each of the tables has these four nonclustered indexes:

  • Non-unique nonclustered on col2
  • Unique nonclustered on col2
  • Non-unique nonclustered on col3
  • Unique nonclustered on col3

The difference is solely within the definition of the clustering key as well as weather or not the values are unique (or nullable)!

CLTable1 is clustered and non-nullable but NOT defined as unique (the values are unique)

CLTable2 is clustered and non-nullable but there are duplicate values in the clustering key

CLTable3 is defined as a unique clustered index

CLTable4 is clustered, nullable and there are duplicate values in the clustering key

The end result is the sizes of all of the keys and the wasted space from different keys!

For CLTable1, CLTable2 and CLTable3 - all of the minimums are the same. For CLTable1 and CLTable3 - the maximums are the same as the minimums. In CLTable3 it's because the records MUST be unique. In CLTable1 it's because the records just happen to be unique right now (meaning there's no overhead unless there are actual dupes).

        record_size_in_bytes
index_id           min          max          avg
1 27 27 27
1 11 11 11
2 13 13 13
2 19 19 19
3 13 13 13
3 15 15 15
4 16 16 16
4 22 22 22
5 16 16 16
5 18 18 18

For CLTable2 the maximums are each 8 bytes larger (highlighted in yellow) EXCEPT where in the btrees of UNIQUE nonclustered indexes where SQL Server does NOT need to duplicate the clustering key (pale pink):

        record_size_in_bytes
index_id         min         max          avg
1 27 35 27.73
1 11 19 11.76
2 13 21 13.73
2 19 27 20.09
3 13 21 13.73
3 15 15 15.00
4 16 24 16.73
4 22 30 22.92
5 16 24 16.73
5 18 18 18.00

NOTE: If you're not familar with index internals and when/why SQL Server duplicates the clustering key in nonclustered indexes, check out my Index Internals chapter from the SQL Server 2008 Internals title (here's the Amazon link: http://www.amazon.com/Microsoft-SQL-Server-2008-Internals/dp/0735626243/ref=sr_1_1?ie=UTF8&qid=1291001371&sr=8-1) and my companion content from Chapter 6 here: Companion content for Chapter 6 (Index Internals) of SQL Server 2008 Internals. And, if you really want to see what's in your indexes, check out my updated versions of sp_helpindex here: http://www.sqlskills.com/BLOGS/KIMBERLY/category/sp_helpindex-rewrites.aspx.

Finally, CLTable4 has the most interesting results...

       record_size_in_bytes
index_id         min         max          avg
1 27 35 27.73
1 14 22 14.78
2 16 24 16.73
2 22 30 22.92
3 16 24 16.73
3 15 15 15
4 16 24 16.73
4 22 30 22.92
5 16 24 16.73
5 18 18 18

Nonclustered indexes 2 and 3 (and the clustered index's btree) are all 3 bytes larger... why? Because the clustering key allows nulls and each of these structures has only 2-3 columns. As a result, the space needed for the null block is 3 bytes (2 bytes for the column count [NCol] and 1 byte for the actual null bitmap [less than 9 columns]).

The btree for index id 3 only needs the nonclustered key column (col2) which does not allow NULLs and the index is unique so the tree is the same as before.

Nonclustered indexes 4 and 5 already had a nullable column and therefore already had a null block.

My point - it all adds up and if it's not truly necessary, then avoid it!

Thanks for reading!
kt

NOTE: I've released other versions of sp_helpindex since this post. Check out the category: sp_helpindex rewrites for all of the options!

OK, I first blogged about re-writing sp_helpindex here as sp_helpindex2 (April 2008). Shortly thereafter (Aug 2008), a reader found a bug and when I went digging, so did I (a couple of others). I did a few tweaks and the version that's stuck for the past couple of years is in this blog post: Updates (fixes) to sp_helpindex2.

However, as much as I've used that - I've always been frustrated by its output. And, so, over the months+, I wrote sp_helpindex3 (this solely added the index_id as the first column and then ordered the output; I've always hated the weird way in which sp_helpindex/2 ordered the output). Then, I started writing sp_helpindex4 which listed the clustering key columns with each index so that I could more easily identify columns on large tables without a lot of scrolling. Ultimately, I didn't like this because the clustering key (or some of the columns in a multi-column clustering key - might already be in the nonclustered index and so I still had to figure things out more manually). And, so I finally sat down and created what's jokingly being called sp_helpindex8.

Geeky note: Er, you might wonder with how many geeky SQL people I've joked about sp_helpindex8 but in fact, in my last few events (both onsite with a couple of customers as well as when teaching the Indexing Strategies and Performance Tuning portion of the Microsoft Certified Masters - SQL Server course (two weeks ago)), I've actually joked - and promised - that I'd have this blogged before this particular 3 week MCM rotation ends. And, well, that's this week. So, I've actually joked with quite a few folks on this and I'm close to the end of time on blogging it. So, here I am - sitting in a hotel room in Houston, after having presented at the Houston Area User Group (7-9pm) (yes, with Paul) and after having been onsite all day with a customer - blogging this darn thing. I just have to get it out there. ;-)

Having said that - I'm pretty happy with this version but it was pretty painful to write. And, before I get into what it does - and why I wanted this version - let me explain what I didn't do and why. There are a lot of other things that could be added. And, quite a few folks have said - why don't you do this or why don't you do that... Specific suggestions have been made such as adding index usage stats (from the current state shown by the DMV) or add index operational stats (to show if an index is suffering from latch IO waits, for example) or add index physical stats (to show if an index is fragmented), etc. And while many of these suggestions would make for lots of interesting uses of this output, I've stayed away from these. Why? Some are expensive to run - such as physical stats - even a limited scan can be painful on a really large table. Or, the results might not be able to be truly trusted  - such as with index_operational_stats. Why? Because the information from that DMV might only cover a short period of time. The information help by them is cleared whenever SQL Server is restarted (or if a database is taken offline or restored, etc.). So, when were those usage or operational stats last cleared? Moreover, is this single "look" at the DMV information really the best on which to base our decisions (even just in general).

In the end, I think that physical stats, operational stats and even usage stats - should be used wisely and carefully (definitely used but not by this). In fact, we often recommend that the best way to use these is by studying them over time and then analyzing their trends not just the point in time values. And, there are LOTS of posts that can help you with understanding this information; here's just a couple to get you started:

So, having said that - I've kept my sp_helpindex8 more pure ;-). What I want to see is EXACTLY what's stored in that index. In fact, this has always been my main goal. The first version (sp_helpindex2) added INCLUDED columns (2005+) as well as filter definitions (2008) to the output. So, what was missing? The added columns that SQL Server might have to add if a nonclustered is nonunique and/or when the clustering key is nonunique. The reason this is so important to me is that there are many things coming together and recommending indexes (DTA, the missing index DMVs, application tools, etc.) and while many of these tools are solely there to "help" us get better performance and "help" us create more effective indexing strategies - these tools sometimes recommend the same index over and over again (see Paul's recent blog post: Missing index DMVs bug that could cost your sanity... about the bug in the missing index DMVs) OR the tools might recommend similar indexes to what we already have.

Additionally, the output from sp_helpindex/2 only shows the columns that are EXPLICITLY defined in the key and in the leaf (using INCLUDE) but they don't show what SQL adds for you. Most folks know that SQL Server adds the clustering key to the nonclustered indexes and when you have a simple single-column clustering key it's not difficult to know that this IS "there" in your index even though it doesn't show up. But, even with that, I think there are still a few things that are lesser known. For example, if your nonclustered index is nonunique - did you know that the clustering key is not only added to the leaf level of the index (like INCLUDE does) but that the clustering key is added into the tree (as part of the key and ordering?). And, if have one index that lists this clustering key explicitly and another that doesn't - are you actually going to be able to easily tell that these are in fact the SAME. Ultimately, I want you to be able to better recognize what's REALLY in that index. What you might end up seeing is that two indexes are more similar than you first thought and as a result - maybe you can drop one of them. Or, you might be able to combine them to reduce your overall indexes (I call this index consolidation). And, all of this becomes significantly more challenging when you have more complex clustering keys (and, yes, they do exist!). The other thing I did is delimit the columns and index names properly using brackets. This is also nice because you might have identifiers that make the output of sp_helpindex/2 hard to read. This should help. However, I do have to admit that you probably have other problems if your index/column names have commas, spaces or brackets in them. But, this should still help/work.

And, that's why I wrote "sp_helpindex8" it's a MUCH better version of sp_helpindex but it requires a few other components. And, it was a lot more challenging to write. Basically, this information is more internal to SQL Server and not directly exposed anywhere. So, we had to do a bit of digging. Paul started the process (thanks!) but, I ended up spending about 5 hours getting it exactly the way I wanted it. And, I also ended up breaking it up into two parts: one sp that gets column definitions and the other that builds the final output. And, I've actually named it with a longer name (sorry!). But, I rarely type in the name anyway. What I typically do is go to SSMS, Tools | Options, then under Environment | Keyboard, I set the keyboard shortcut for Ctrl+F1 to my sp_SQLskills_SQL2008_helpindex.

Now, when I want to see the true defintions of indexes on a table - I highlight the tablename and hit Ctrl+F1. But, if you want to rename this to sp_helpindex8 - I won't be offended. ;-)

So, how do you create this?

  1. You need to create sp_SQLskills_ExposeColsInIndexLevels (sp_SQLskills_ExposeColsInIndexLevels.sql (6.73 kb)). This is what gives us the tree/leaf definitions. And, this works for both SQL Server 2005 and SQL Server 2008.
  2. You need to create the version specific sp_helpindex8. These are named: sp_SQLskills_SQL2005_helpindex (sp_SQLskills_SQL2005_helpindex.sql (10.50 kb)) and  sp_SQLskills_SQL2008_helpindex (sp_SQLskills_SQL2008_helpindex.sql (10.73 kb)). Again, I know they're long names but using the keyboard shortcuts can help. And, if you do rename these make sure you do a replace ALL. There are multiple places where there are comments and/or statements that need to execute to get everything to work (for example the sp_MS_marksystemobject execution).

NOTE: The very first version of the 2005 script had a bug in it (thanks Calvin!). Funnily one I already knew about and thought I had caught (it was fixed in the 2008 version). Anyway, as of 3:15pm CT on May 19, it's fixed in the scripts here. There could be others but so far - it's looking good!

Enjoy!
kt

OK, I'll definitely take a beating from all of you for having gone so long between my survey posts and now. I won't even go into the details but between some crazy work schedules, multiple sinus problems and even migraines... well, I've been a bit behind. Let's just say that April/May were rough at best. I'm feeling better and well, now I'm trying to catch up. I had really gotten the blogging bug in March but I completely lost it in April. But, this tipping point series is in dire need of lots of explaining so I'm really hoping to get a few posts done in this area for sure!

First, I started the discussion around this in a few surveys:

Survey/Question 1

Q1 was described as this: if a table has 1 million rows at 20 rows per page (50,000 pages), at what percentage (roughly) of the data would a nonclustered index no longer be used. Blogged here. Here's what the survey said as of today:

And, for Q1 the correct result (Between 0-2% of the rows) is actually the best result (but, by no means the overwhelming majority at only 28%). However, often people just "think" the answer is very small. So... I did a few more questions/surveys. 

Survey/Question 2

Q2 was described as this: if a table has 1 million rows at 100 rows per page (10,000 pages), at what percentage (roughly) of the data would a nonclustered index no longer be used. Blogged here. Here's what the survey said as of today:

And, for Q2 the correct result (Less than .5% of the rows) is actually at a tie for the best (but, again, even a small percentage at only 22%). Again, often people just "think" the answer is very small. So... I did one more question/survey. 

Survey/Question 3

Q3 was described as this: if a table has 1 million rows at 2 rows per page (500,000 pages), at what percentage (roughly) of the data would a nonclustered index no longer be used. Blogged here. Here's what the survey said as of today:

And, for Q3 the correct result (Between 10-20% of the rows) is actually NOT the highest answer. And, this is even more convincing that there's confusion around what's going on and why.

The Tipping Point

What is the tipping point?

It's the point where the number of rows returned is "no longer selective enough". SQL Server chooses NOT to use the nonclustered index to look up the corresponding data rows and instead performs a table scan.

When does the tipping point occur?

It depends... it's MOSTLY tied to the number of pages in the table. Generally, around 30% of the number of PAGES in the table is about where the tipping point occurs. However, parallelism, some server settings (processor affinity and I/O affinity), memory and table size - all can have an impact. And, since it can vary - I typically estimate somewhere between 25% and 33% as a rough tipping point (and, you'll see from a bunch of my examples, that number is not EXACT). Then, I translate that into rows.

Math for Tipping Point Query 3: If a table has 500,000 pages then 25% = 125,000 and 33% = 166,000. So, somewhere between 125,000 and 166,000 ROWS the query will tip. Turning that into a percentage 125,000/1million = 12.5% and 166,000/1million = 16.6%. So, if a table has 500,000 pages (and 1 million rows) then queries that return less than 12.5% of the data are likely to USE the nonclustered index to lookup the data and queries over 16.6% of the data are LIKELY to use a table scan. For this table, that percentage seems "reasonable". But, most of us say that the tipping point happens at a much lower percentage... why? Because row size - which determines table size (and therefore pages) is really what has the greatest impact. So, let's look at Tipping Point Query 2... 

Math for Tipping Point Query 2: If a table has 10,000 pages then 25% = 2,500 and 33% = 3,333. So, somewhere between 2,500 and 3,333 ROWS the query will tip. Turning that into a percentage 2,500/1million = .25% and 3,333/1million = .33% (not even 1%). So, if a table has only 10,000 pages (and 1 million rows) then queries that return less than a quarter of 1% of the data are likely to USE the nonclustered index to lookup the data and queries over one third of one percent are LIKELY to use a table scan. For this table, that percentage seems really low BUT, at the same time it makes sense (to a point) that a small table would be scanned... but, for less than 1%. 1% is NOT selective enough. For small tables, it might not matter all that much (they're small, they fit in cache, etc.) but for bigger tables - it might be a big performance problem. 

Math for Tipping Point Query 1: If a table has 50,000 pages then 25% = 12,500 and 33% = 16,666. So, somewhere between 12,500 and 16,666 ROWS the query will tip. Turning that into a percentage 12,500/1million = 1.25% and 16,666/1million = 1.66% (under 2%). So, if a table has 50,000 pages (and 1 million rows) then queries that return less than 1.25% of the data are likely to USE the nonclustered index to lookup the data and queries over 1.66% are LIKELY to use a table scan. Again, this seems like a low number. Again, for small tables, it might not matter all that much (they're small, they fit in cache, etc.) but as tables get larger and larger - it CAN be a big performance problem. 

Why is the tipping point interesting?

  • It shows that narrow (non-covering) nonclustered indexes have fewer uses than often expected (just because a query has a column in the WHERE clause doesn't mean that SQL Server's going to use that index)
  • It happens at a point that's typically MUCH earlier than expected... and, in fact, sometimes this is a VERY bad thing!
  • Only nonclustered indexes that do not cover a query have a tipping point. Covering indexes don't have this same issue (which further proves why they're so important for performance tuning)
  • You might find larger tables/queries performing table scans when in fact, it might be better to use a nonclustered index. How do you know, how do you test, how do you hint and/or force... and, is that a good thing?

Real example of an interesting tipping point

Earlier today, I went on facebook and twitter and gave the following information - very vaguely - and I asked "why" is Q2 so much slower than Q1 if Q2 returns only 10 more rows. Same table and no hints (other than MAXDOP)...

Q1: SELECT * FROM table WHERE colx < 597420 OPTION (MAXDOP 1)

  • returns 197,419 rows
  • takes 116,031 ms (1 minute, 52 seconds)
  • 1,197,700 logical reads, 5 physical reads, 137,861 read-ahead reads
  • 7,562 ms CPU time

    Q2: SELECT * FROM table WHERE colx < 597430 OPTION (MAXDOP 1)

  • returns 197,429 rows
  • takes 244,094 ms (4 minutes, 4 seconds)
  • 801,685 logical reads, 1410 physical reads, 801,678 read-ahead reads
  • 9,188 ms CPU time

There were lots of great guesses... but, it's the tipping point. SQL Server chose to "tip" the second query because it was "over the line". But, it's important to realize that there are cases when that's NOT a good idea. And, what are your options?

In SQL Server 2005 - the only option is to force the nonclustered index to be used:

Q2: SELECT * FROM table WITH (INDEX (NCInd)) WHERE colx < 597430 OPTION (MAXDOP 1)

But, this can be TERRIBLY bad on some machines where the IOs could be a lot faster (and where data might already be in cache). These specific numbers are exactly that - specific to this HARDWARE (and, I chose not-so-optimal HW in this case to highlight this problem). And, depending on what number you use (what if this is a parameter in sps?) you might force SQL Server to do WAY more IOs by forcing the index than allowing the tipping point to do its job. But, depending on your hardware (and/or what you know to be in cache at the time of execution), it might be better to force an index instead of letting SQL Server choose. So, should I force the index? Be careful, if you're wrong - it could take more time and actually be slower.

In SQL Server 2008 - there's a new hint - FORCESEEK:

-- Cut/paste error with this hint... no INDEX

Q2: SELECT * FROM table WITH (INDEX (FORCESEEK)) WHERE colx < 597430 OPTION (MAXDOP 1)

Q2: SELECT * FROM table WITH (FORCESEEK) WHERE colx < 597430 OPTION (MAXDOP 1)  -- Thanks to @SQL_Kiwi

FORCESEEK is better because it doesn't tie you to a particular index directly but it also doesn't let SQL Server tip to a table scan. However, just like forcing an index - you can be wrong!

So, what should you do? It depends. If you know your data well and you do some extensive testing you might consider using a hint (there are some clever things you can do programmatically in sps, I'll try and dedicate a post to this soon). However, a much better choice (if at all possible) is to consider covering (that's really my main point :). In my queries, covering is unrealistic because my queries want all columns (the evil SELECT *) but, if your queries are narrower AND they are high-priority, you are better off with a covering index (in many cases) over a hint because an index which covers a query, never tips.

That's the answer to the puzzle for now but there's definitely a lot more to dive into. The Tipping Point can be a very good thing - and it usually works well. But, if you're finding that you can force an index and get better performance you might want to do some investigating and see if it's this. Then consider how likely a hint is to help and now you know where you can focus.

Thanks for reading,
kt

OK, so this is interesting. I've got a few answers to my last survey (Tipping Point Query #1) and well, there's a good mix of answers (and, yes, some are correct! ;)). Be sure to go back and review that last post so that you can evaluate it and these two tipping point questions completely. So, now I want to see if people really know the basis of "the tipping point".

Try these two:

Tipping Point Query #2

Table1 (t1) has 1 million rows at 100 rows per page. The table has 10,000 pages. A nonclustered index exists (on name) but it does not cover the query. At what percentage (of the table) is this nonclustered index no longer selective enough to use:

Tipping Point Query #3

Table2 (t2) has 1 million rows at 2 rows per page. The table has 500,000 pages. A nonclustered index exists (on name) but it does not cover the query. At what percentage (of the table) is this nonclustered index no longer selective enough to use:

OK, so I'd really love to see quite a few responses to these *3* "tipping point" questions. I PROMISE to do a nice long (and detailed) post for what is the actual tipping point AND the answers to all three of these questions. I'll explain the math as well as how you can generalize "what is selective enough" so that you can better create your nonclustered indexes!!!

Thanks for reading - and responding to these brain teasers!!

Cheers,
kt

PS - It's snowing here (ah...again)... maybe I'll spend the day creating brain teasers??! Do you guys like this kind of a post? (well, I suppose you won't really know until I post the answer part of it... but, just in general??). I think it's pretty cool. But, don't worry, I won't (nor will Paul) make all of my posts surveys. But, I think this is a really good one. I'm anxious to see if the asnwers come in correctly for these two as well! Have at it!

Along the same lines of improving database design and getting better performance on SQL Server (which [IMO] DOES take an experienced SQL Server database developer - but, we'll talk more about "whose job this really is" in many more posts and probably even a RunAs - which Richard and I just setup to record on Thursday (Mar 12)), I started thinking about how I could convince people of why they NEED a database developer. So, I thought I'd ask this VERY important question...

What percentage of data IS selective enough to use a nonclustered index which doesn't cover the query... in other words (just in case you're not entirely sure of what I mean :)), think of indexes in the back of a book... if you need to go to the back of the book to reference a bunch of data (this is called a [bookmark] lookup in SQL Server), there's a point where the randomness of the lookups (especially if you think in terms of many rows on a page) becomes too expensive. For example, imagine that the index is customer name and the data (the book) is customer orders - and, each page (of this rather weird book ;)), has 20 orders on it. Doing a query to lookup customer number 12's orders might be really easy (if they only have only a few orders) BUT, what if the query is "show me all of the orders for people that have an 'e' in their name". First, the number of people have have an 'e' in their name is probably better than 50% (that's TOTALLY a guess) and, if there's 20 orders per page then a lookup from the index into the book would require SQL Server to touch every page roughly 10 times. If the table has 50,000 pages (therefore 1 million rows - at 20 rows per page), then to find the 500,000 rows (remember, I'm estimating half), SQL Server would have to do 500,000 bookmark lookups. For a table with only 50,000 pages that's terribly expensive.

So, here's the question - what's the tipping point? When is a nonclustered index on customer name NOT going to be used to lookup rows of sales orders? I'm going to use a survey to see what you think and then within a week, I'll give the specific SQL Server math AND a query you can run within your own DBs to see EVERY one of your table's "tipping points". It's really interesting and I think will really help you to understand why SQL Server might not be using those nonclustered indexes.............

Cheers,
kt

Theme design by Nukeation based on Jelle Druyts