SQLskills introduced our new SQL101 recently and well… indexing is something that everyone needs to get right. But, it’s not a simple task. And, as I start to sit down to write a SQL101 post on indexing, I suspect I’m going to struggle keeping it simple? However, there are some core points on which I will focus and I’ll be sure to list a bunch of additional resources to get you more information from here! Remember, the point of each of our SQL101 posts is to make sure that everyone’s on the same page and has the same basic idea / background about a topic. And, for indexing, that’s incredibly important (and, often, misunderstood).
What is an Index?
Simply, it’s a structure that’s applied to a set of [or, subset of] data to enforce ordering – either to quickly check uniqueness or to aid in accessing data quickly. Simply, that’s why you create an index. You’re either wanting to enforce data integrity (such as uniqueness) or you’re trying to improve performance in some way.
How Does an Index Enforce Uniqueness?
If you want to maintain uniqueness over a column (or a combination of columns), SQL Server takes the required data (and very likely more data than you specifically chose) and sorts that data in an index. By storing the data in sorted order, SQL Server is able to quickly determine if a value exists (by efficiently navigating the index structure). For this intro post, it doesn’t entirely matter exactly what’s in the index but it does matter which index you choose and for what purpose.
Relational Rules Enforced by Indexing
In a relational database, relational rules rule the world. And, many learn some of the rules rather quickly. One of these rules is that relational theory says that every table must have a primary key. A primary key can consist of multiple columns; however, none of the columns can allow NULLs and the combination of those columns must be unique. While I agree that every table should have a primary key, what’s chosen AS the primary key can be more complex than the relational rules allow. Behind the scenes, a primary key is enforced by an index (to enforce and check for uniqueness). The type of index that is used depends on whether or not you explicitly state the index type or not. If you do not explicitly state an index type, SQL Server will default to trying to enforcing your primary key constraint with a unique clustered index. If a clustered index already exists, SQL Server will create a nonclustered index instead.
And, this is where things get tricky… a clustered index is a very important index to define. Internally, the clustering key defines how the entire data set is initially ordered. If not well chosen then SQL Server might end up with a structure that’s not as efficient as it could be. There’s quite a bit that goes into choosing a good clustering and I’ll stress that I think it’s one of the most important decisions to be made for your tables. And, it also needs to be made early as later changes to your clustering key can be difficult at best (often requiring downtime and complex coordinated scripting after already suffering poor performance before you make the decision to change).
So, let’s keep this simple… you’ll want to choose the clustering key wisely and early. And, you’ll want to get a good understanding on the things that depend on the clustering key. I did a discussion and demo in my Pluralsight course: Why Physical Database Design Matters and I’ve discussed this quite a bit in my blog category: Clustering key.
The key point is that the primary does NOT have to be enforced with a clustered index. Sometimes your primary key is not an ideal choice as the clustering key. Some great clustering key choices are:
- Composite key: EntryDate, RowID where EntryDate is an ever-increasing date value that follows the insert pattern of your data. For example, OrderDate for a table that stores Orders. RowID should be something that helps to uniquely identify the rows (something like an identity column is useful). Key points: choose the smallest (but reasonable) data types for both the date and the ID. Ideally, use DATETIME2(p) where p is the level of precision desired. And, for an identity column – choose INT if you know you’ll never get anywhere near 2 billion rows. However, if you even think you’ll have “hundreds of millions” of rows, I’d probably go straight for BIGINT so that you never have to deal with the problems that you’ll have if you run out.
- Identity column: When I don’t have a good composite key like that above, I’ll often consider an identity column for clustering – even if my queries aren’t specifically using this value. Even if you’re not explicitly using this value, SQL Server is using it behind the scenes in its nonclustered indexes. Please note that this is both a good thing and a bad thing. If your clustering key is very narrow then you’re not unnecessarily widening your nonclustered indexes. That isn’t to say that you won’t have a few wider nonclustered indexes but choosing a wide clustering key makes all of your nonclustered indexes wide when they might not need to be.
OK, I feel like I’ve started to open a car of worms with this one. But, the key points are:
- The primary key does NOT have to be clustered (and sometimes it’s better not to be)
- The clustering key needs to be chosen early and based on many factors – there’s no single right answer ALL the time… for example, if you don’t need any nonclustered indexes then the width of the clustering key becomes less of an issue.
At the end of this post, I’ll point you to more resources to help you to make a better decision.
What about Indexing for Performance?
In addition to enforcing uniqueness (and, allowing SQL Server to quickly determine whether or not a value already exists), indexes are used to help performance. And here’s where there are some very simple yet important things to understand. There are two types of performance tuning methods that I want to describe here: query tuning and database tuning. What’s often done most is query tuning. While that might be [temporarily] good for that query, it’s NOT a good long-term strategy for the server. I always START with query tuning but that’s not an ideal strategy to implement directly on your production server.
Query TuningQuery tuning is where you focus on nothing but the query to come up with a better indexing strategy. You might use the Database Engine Tuning Advisor on the specific query right from SQL Server Management Studio (see image). Or, you might use the “green hint” found in the show actual executing plan window. While these might significantly help you with that query, they may not be ideal to implement in your database (where other users are executing, where other indexes exist, and where resource access/usage are at a premium).
Don’t get me wrong, query tuning is a MUST. But, it’s just a starting point. I always start my tuning process by determining the best indexes for a query. But, you can’t stop there. You MUST do “database tuning” if you want your production database to truly scale.
Before you create a desired index in production (or, while you’re doing testing / analysis in development / QA [quality assurance]) you really want to check to see if this index is going to be good for production.
Are there other similar indexes?
Maybe you can consolidate some of these indexes into one. Yes, this consolidated index might not be the best for the individual queries but by creating one index instead of three, you’ll be reducing the cost of this index for data modifications, maintenance, storage, etc.
Are there any suggested missing indexes?
Again, before I create a new index, I want to see if I can get more uses out of it. Can I consolidate this new index with existing and/or missing recommendations? If I can then I’ll get more uses out of this index.
Are there good maintenance strategies in place?
Before you go and create more indexes, make sure that your existing indexes are being maintained. You should also check that the indexes being maintained are actually being used. To be honest, you should do that BEFORE you do any tuning at all.
SUMMARY: Steps for Database Tuning and Server Scalability
- Get rid of the dead weight. Clean up unused indexes. Consolidate similar indexes.
- Make sure your index maintenance strategy is in place. There’s no point in adding indexes if you’re not cleaning up fragmentation and reducing splits.
- Then, you can consider adding indexes BUT only after you’ve done the following:
- Query tuning
- Existing index consolidation
- Missing index consolidation
- And, of course, TESTING!
Well… that was much longer than I had hoped. But, there are a lot of good concepts here. Unfortunately, indexing for performance is just NOT a simple discussion. You can’t just put an index on every column and expect things to work well. Conversely, some of these tools seem helpful but they mostly do query tuning and not database tuning. If you really want to get better performance diving into indexing is a fantastic way to do this! So, if you’re motivated – here are a ton of resources to consider!
Learning more about Indexing
If you want to learn more about index structures, check out this older (but still useful page of videos). On it, watch them in this order.
- Index Internals
- Index Internals Demo
- Index Fragmentation
- Index Fragmentation Demo
- Even better – skip 3 and 4 and go to Paul’s Pluralsight course on SQL Server: Index Fragmentation Internals, Analysis, and Solutions
- Index Strategies
- Index Strategies Demonstration
Also, check out these blog posts:
- First, setup the latest version of sp_SQLskills_helpindex from this post
- Nonclustered indexes require the “lookup” key in the b-tree when?
- Explicitly naming CL key columns in NC indexes – when and why
- How can you tell if an index is REALLY a duplicate?
- Removing duplicate indexes (this was created for 2008 but the code still works on higher versions… but, yes, this is due an update! :))
I’m also working on a much more extensive course on Indexing for Pluralsight, this should be available within the next few months.
Thanks for reading!