SQLskills SQL101: Partitioning

Continuing on our path to understanding the basics and core concepts, there’s a big topic that’s often greatly misunderstood: partitioning. I know I’m going to struggle to keep this one introductory, but there are a few critical questions to ask and a few very important things to consider before choosing a partitioning strategy / design / architecture. I’ll start with those key points!

Partitioning is Not About Performance

I’m often asked how partitioning can be used to improve this or that… and “this or that” is usually a query. I remember one case where someone had a 3 billion row table and one of their queries was taking 15 minutes to run. Their question was – how can they best partition to improve query performance? I asked if I could look at the query and review their schema (secretly, that meant I wanted to see their indexes ;-)). And, sure enough, I asked if they were open to trying something. I gave them an index to create and asked them to run their query again. Without ANY changes to their table’s structure and only one added index, their query ran in 9 seconds.

Partitioning, like indexing, isn’t all unicorns and rainbows… (artwork by tomperwomper)

Now, don’t forget – indexes aren’t all unicorns and rainbows… indexes have a downside:

  • An index requires disk space / memory / space in your backups
  • An index adds overhead to your inserts / updates / deletes
  • An index needs to be maintained

So, you’d still need to determine if this is the right approach. But, the main point – partitioning really isn’t designed to give incredible gains to your queries. It’s meant to be better for data management and maintenance. However, some partitioning designs can lead to query performance benefits too.

Partitioning for Manageability / Maintenance

The primary reason to consider some form of partitioning is to give you options in dealing with large tables; I tend to start thinking about partitioning as a table approaches 100GB AND it’s not going away (but, even if some of it will regularly “go away” then partitioning is definitely an option to help manage that regular archiving / deleting). I wrote a detailed post about this in response to an MSDN Webcast question I was asked about range-based deletes on a large table. It’s a very long post but it revolves around “how to make range-based deletes faster.” You can review that here. (NOTE: That post was written before we converted our blog to the new format and I’ve not yet gone through and manually converted all of my old posts… so, the formatting isn’t great. But, the content / concepts still apply!)
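As one sketch of the general idea (using a partitioned table – a feature covered later in this post – and made-up names), switching a partition out to a staging table turns a massive range-based delete into a near-instant, metadata-only operation:

-- Assumes [dbo].[Sales] is partitioned by month and partition 1 holds the
-- oldest month. The staging table must match [dbo].[Sales] exactly
-- (columns, indexes, constraints) and sit on the same filegroup.
CREATE TABLE [dbo].[SalesStaging]
(
    [SalesID]   BIGINT        NOT NULL,
    [SalesDate] DATETIME2(0)  NOT NULL,
    [Amount]    DECIMAL(18,2) NOT NULL,
    CONSTRAINT [SalesStagingPK] PRIMARY KEY ([SalesDate], [SalesID])
);

-- Metadata-only: the partition's pages simply change ownership.
ALTER TABLE [dbo].[Sales] SWITCH PARTITION 1 TO [dbo].[SalesStaging];

-- The "delete" is now just a drop (or archive the staging table first).
DROP TABLE [dbo].[SalesStaging];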

So the real reasons to consider partitioning are:

  • Your table is getting too large for your current system to manage but not all of that data is accessed regularly (access patterns are really the key). There isn’t a magic size for when this happens; it’s really relative to your hardware. I’ve seen tables of only 40GB or 50GB cause problems on systems that have only 32GB or 64GB of memory. No, you don’t always need to put your entire table in memory but there are many operations that cause the table to end up there. In general, as your tables get larger and larger – many problems start to become more prominent as well. Often, a VLDB (Very Large Database) is considered 1TB or more. But, what I think is the more common problem is the size of your critical tables; when do they start causing you grief? What is a VLT (Very Large Table)? For me, as your table really starts to head toward 100GB – that’s when partitioning should be considered.
  • You have varying access patterns to the data:
    • Some data is recent, critical, and very active
    • A lot of data is not quite as recent, mostly only reads but with some modifications (often referred to as “read-mostly”)
    • The bulk of your data is historical and not accessed often but must be kept for a variety of reasons (including some large / infrequent analysis queries)

And that second point is the most common reason to consider partitioning… but, it’s the first point that’s probably more noticeable. :-)

What Feature(s) Should Be Used for Partitioning?

As I’ve been discussing partitioning, I’ve been trying to generalize it as more of a concept than a feature (NOT tied directly to either SQL Server feature: partitioned tables [2005+] or partitioned views [7.0+]). Instead, I want you to think of it as a way of breaking down a table into smaller (more manageable) chunks. This is almost always a good thing. But, while it sounds great at first – the specific technologies bring many considerations to weigh before choosing (especially depending on which version of SQL Server you’re working with).

Partitioned Views (PVs)

These have been available in SQL Server since version 7.0. They were very limited in SQL Server 7.0 in that they were query-only. As of SQL Server 2000, they allow modifications (through the PV and to the base tables) but with a few restrictions. There have been some bug fixes and improvements since their introduction but their creation and uses are largely the same. And, even with SQL Server 2016 (and all of the improvements for Partitioned Tables), there are still scenarios where PVs make sense – sometimes as the partitioning mechanism by itself and sometimes combined with partitioned tables.
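As a quick sketch (with hypothetical table names): a PV is just a view that unions smaller tables together, and the CHECK constraints on the partitioning column are what allow SQL Server to touch only the tables that could contain the rows a query needs:

CREATE TABLE [dbo].[Sales2016]
(
    [SalesID]   BIGINT        NOT NULL,
    [SalesDate] DATETIME2(0)  NOT NULL
        CONSTRAINT [Sales2016CK] CHECK ([SalesDate] >= '2016-01-01' AND [SalesDate] < '2017-01-01'),
    [Amount]    DECIMAL(18,2) NOT NULL,
    CONSTRAINT [Sales2016PK] PRIMARY KEY ([SalesDate], [SalesID])
);

CREATE TABLE [dbo].[Sales2017]
(
    [SalesID]   BIGINT        NOT NULL,
    [SalesDate] DATETIME2(0)  NOT NULL
        CONSTRAINT [Sales2017CK] CHECK ([SalesDate] >= '2017-01-01' AND [SalesDate] < '2018-01-01'),
    [Amount]    DECIMAL(18,2) NOT NULL,
    CONSTRAINT [Sales2017PK] PRIMARY KEY ([SalesDate], [SalesID])
);
GO

-- The view makes the two tables look (and behave) like one:
CREATE VIEW [dbo].[SalesAll]
AS
    SELECT [SalesID], [SalesDate], [Amount] FROM [dbo].[Sales2016]
    UNION ALL
    SELECT [SalesID], [SalesDate], [Amount] FROM [dbo].[Sales2017];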

Partitioned Tables (PTs)

Partitioned tables were introduced in SQL Server 2005 and have had numerous improvements from release to release. This list is by no means exhaustive but some of the biggest improvements have been:

  • SQL Server 2008 introduced partition-aligned indexed views so that you could do fast switching in / out of partitioned tables even when the PT had an indexed view. And, SQL Server 2008 introduced partition-level lock escalation (however, some architectures [like what I recommend below] can naturally reduce the need for partition-level lock escalation).
  • SQL Server 2008 SP2 and SQL Server 2008 R2 SP1 offered support for up to 15,000 partitions (personally, I’m not a fan of this one).
  • SQL Server 2012 allowed index rebuilds to be performed as online operations even if the table has LOB data. So, if you want to switch from a non-partitioned table to a partitioned table (while keeping the table online / available), you can do this even when the table has LOB columns.
  • SQL Server 2014 offered better partition-level management with online partition-level rebuilds and incremental stats. These features reduce what’s read for statistics updates and place the threshold to update at the partition level rather than at the table level. And, they offer the option to rebuild just a partition while keeping the table and the partition online. However, the query optimizer still uses table-level statistics for optimization (Erin wrote an article about this titled: Incremental Statistics are NOT used by the Query Optimizer and Joe Sack wrote an article about partition-level rebuilds titled: Exploring Partition-Level Online Index Operations in SQL Server 2014 CTP1.)
  • SQL Server 2016 further reduced the threshold for triggering statistics updates to a more dynamic threshold (the same as trace flag 2371) but only for databases with a compatibility mode of 130 (or higher). See KB 2754171 for more information.
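For reference, here’s a minimal sketch of what creating a PT involves (hypothetical names; one partition per year): a partition function defines the boundary points, a partition scheme maps partitions to filegroups, and the table is created on the scheme:

CREATE PARTITION FUNCTION [SalesPF] (DATETIME2(0))
AS RANGE RIGHT FOR VALUES ('2016-01-01', '2017-01-01');
-- RANGE RIGHT: each boundary value is the first value IN its partition.

CREATE PARTITION SCHEME [SalesPS]
AS PARTITION [SalesPF] ALL TO ([PRIMARY]);
-- In production, you'd likely map partitions to separate filegroups.

CREATE TABLE [dbo].[SalesPT]
(
    [SalesID]   BIGINT        NOT NULL,
    [SalesDate] DATETIME2(0)  NOT NULL,
    [Amount]    DECIMAL(18,2) NOT NULL,
    CONSTRAINT [SalesPTPK] PRIMARY KEY ([SalesDate], [SalesID])
) ON [SalesPS] ([SalesDate]);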

However, even with these improvements, PTs still have limitations that smaller tables don’t have. And, often, the best way to deal with a very large table is to not have one. Don’t misunderstand – what I’m suggesting is to use PVs with multiple smaller tables (possibly even PTs) unioned together. The end result is an architecture with few, if any, real restrictions (by physically breaking your table into smaller ones, you remove a lot of the limitations that exist with PTs). However, you don’t want all that many smaller tables either, as the process of optimization is more complicated with larger numbers of tables. The key is to find a balance. My general recommendation is to have a separate table per year of data (imagine “sales” data), and then for the older years – just leave each of those as a single, standalone table. For the current and future years, use PTs to separate the “hot” data from the more stable data. By using standalone tables for the critical data you can do full table-level rebuilds online (in EE) and you can update statistics more frequently (and they’re more accurate on smaller tables). Then, as these months stabilize, you can switch them into the year-based PTs to further simplify their management moving forward.

Partitioning a VLT into smaller tables (some of which are PTs) can be a HIGHLY effective strategy for optimizing all aspects of table management, availability, and performance. But, it’s definitely a more complex and difficult architecture to get right…

This architecture is complex but highly effective when designed properly. Doing some early prototyping with all of the features you plan to leverage is key to your success.

Key Points Before Implementing Partitioning

Partitioning isn’t directly tied to performance, but indirectly it can be extremely beneficial. So, for a SQL101 post, the most important point is that partitioned views still have benefits; they should not be discounted only because they’re an older feature. Both partitioning strategies provide different benefits; the RIGHT solution takes understanding ALL of their pros / cons. You need to evaluate both PVs and PTs against your availability, manageability, and performance requirements – and, in the most likely case, use them together for the most gains.

Since it’s concepts only, I still feel like it’s a SQL101 post. But, getting this right is a 300-400 level prototyping task after quite a bit of design.

Thanks for reading!
Kimberly

SQLskills SQL101: Indexes on Foreign Keys

As many of you know, we started our SQLskills SQL101 series a couple of weeks ago… it’s been great fun for the whole team to go back through our most common questions / concerns and set the record straight, per se. We’ve still got a lot of things to discuss, and indexing is an area full of questions, discussions, and unfortunately, misunderstandings.

I’m going to tie today’s post to a question I received recently: if I have a table that has 6 foreign key columns / references, should I create one index with all 6 foreign key columns in it, or should I create 6 individual indexes – one for each foreign key reference?

This is an interesting question with a few tangents to cover for our SQL101 series. I consider indexing foreign keys as part of my “Phase 1 of index tuning.” But, let’s go back to some basics before we dive into the answer on this one.

What Does a Foreign Key Enforce?

Imagine you have two tables: Employees and Departments. The Employees table has a column called DepartmentID and it represents the department of which that employee is a member. The department ID must be a valid department. So, to guarantee that the department ID exists and is valid – we create a foreign key to the DepartmentID column of the Departments table. When a row is inserted or updated in the Employees table, SQL Server will check to make sure that the value entered for DepartmentID is valid. This check is very inexpensive because the foreign key MUST reference a column which is unique (which is, in turn, enforced by a unique index).
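A minimal sketch of that example (the column definitions are illustrative):

CREATE TABLE [dbo].[Departments]
(
    [DepartmentID]   INT NOT NULL
        CONSTRAINT [DepartmentsPK] PRIMARY KEY,  -- the unique index the FK requires
    [DepartmentName] NVARCHAR(60) NOT NULL
);

CREATE TABLE [dbo].[Employees]
(
    [EmployeeID]   INT NOT NULL
        CONSTRAINT [EmployeesPK] PRIMARY KEY,
    [DepartmentID] INT NOT NULL
        CONSTRAINT [EmployeesDepartmentsFK]
        FOREIGN KEY REFERENCES [dbo].[Departments] ([DepartmentID])
);

-- Fails (error 547) if department 42 doesn't exist in [Departments]:
INSERT [dbo].[Employees] ([EmployeeID], [DepartmentID]) VALUES (1, 42);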

What Must Exist in Order to Create a Foreign Key Reference?

A foreign key can be created on any column(s) that has a unique index on the referenced table. That unique index can be created with a CREATE INDEX statement OR it could have been created as part of a constraint (either a UNIQUE or PRIMARY KEY constraint). A foreign key can reference ANY column(s) that has a UNIQUE index; it does not have to have been created by a constraint. And, this can be useful during performance tuning. A UNIQUE index offers options that constraints do not. For example, a UNIQUE index can have included columns and filters. A foreign key CAN reference a UNIQUE index with included columns; however, it cannot reference a UNIQUE index with a filter (I wish it could).

A good example of this might occur during database tuning and specifically during index consolidation (something I do after query tuning and when I’m determining the best indexes for the database / for production). I often review existing indexes as well as any missing index recommendations, etc. Check out my SQLskills SQL101: Indexing Basics post for more information about these concepts.

Imagine I have the following:

Table: Employees, column: NationalID (SocialSecurityNumber)

Column NationalID: this is an alternate key for Employees, as their primary key is EmployeeID. Because it’s another column on which you will look up employees and you want to make sure it’s unique, you decide to enforce it with a UNIQUE constraint. You may even reference it from other tables.

However, later, while doing database tuning, you decide that you need the following index:

CREATE INDEX [QueryTuningIndex]
ON [dbo].[Employees] ([NationalID])
INCLUDE ([LastName], [FirstName])

This index would be similar to and redundant with the existing constraint-based index on NationalID. But, you really want this new index to help performance (you’ve tested that this index is helpful to some frequently executed and important queries, so you’ve already decided that the benefits outweigh the costs).

And, this is where the excellent feature to reference a unique index comes in… instead of adding this new one and keeping the existing constraint, change the index to the following:

CREATE UNIQUE INDEX [QueryTuningIndex]
ON [dbo].[Employees] ([NationalID])
INCLUDE ([LastName], [FirstName])

The uniqueness is always enforced ONLY on the key portion of the index. So, this new index – even with included columns – still does this. The only bad news is that SQL Server has already associated the foreign key with the original constraint-based index, so you’ll still need to remove the foreign key to drop the constraint (I wish this weren’t true). But, you’ll still have data integrity handled by the new unique index – as long as you create the new index before you drop the foreign key and the original unique constraint. Having said that, there’s more bad news – because there will be a short timeframe where the foreign key does not exist, you should do this off hours when little-to-no activity is occurring. This will reduce the possibility of rows being inserted / updated that do not have a valid reference. You’ll certainly find out when you add the referential constraint again, as the default behavior of adding the foreign key is to verify that all rows have a referenced row. NOTE: there is a way to skip this checking but it is NOT recommended, as your constraint will be marked as untrusted. It is ALWAYS PREFERRED to create a foreign key WITH CHECK. Here’s a script that will walk you through the entire example – leveraging the default behavior to recheck the data when the constraint is created. Be sure to run this in a test / junk database.
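That linked script isn’t reproduced here, but a minimal sketch of the sequence might look like this (the [Payroll] referencing table and the constraint names are hypothetical):

-- 0) Starting point: a UNIQUE constraint on NationalID, referenced by a foreign key.
ALTER TABLE [dbo].[Employees]
    ADD CONSTRAINT [EmployeesNationalIDUQ] UNIQUE ([NationalID]);
ALTER TABLE [dbo].[Payroll]
    ADD CONSTRAINT [PayrollEmployeesFK]
    FOREIGN KEY ([NationalID]) REFERENCES [dbo].[Employees] ([NationalID]);

-- 1) Create the wider UNIQUE index FIRST so uniqueness is never unprotected.
CREATE UNIQUE INDEX [QueryTuningIndex]
    ON [dbo].[Employees] ([NationalID])
    INCLUDE ([LastName], [FirstName]);

-- 2) Drop the foreign key (required before the constraint can be dropped),
--    then drop the now-redundant constraint-based index.
ALTER TABLE [dbo].[Payroll] DROP CONSTRAINT [PayrollEmployeesFK];
ALTER TABLE [dbo].[Employees] DROP CONSTRAINT [EmployeesNationalIDUQ];

-- 3) Re-create the foreign key. WITH CHECK (the default) re-verifies every
--    row, catching any orphaned rows inserted during the gap.
ALTER TABLE [dbo].[Payroll] WITH CHECK
    ADD CONSTRAINT [PayrollEmployeesFK]
    FOREIGN KEY ([NationalID]) REFERENCES [dbo].[Employees] ([NationalID]);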

Creating an Index on a Foreign Key Column

Foreign keys can reference any column(s) that have a UNIQUE index (regardless of whether it was created by a constraint).

Now that you know the options for the column being referenced, let’s consider what’s required for the referencing column. The column on which the foreign key is created will not have an index by default. I wrote about this in a prior post: When did SQL Server stop putting indexes on Foreign Key columns? and the main point is that SQL Server has NEVER automatically created indexes on foreign key columns. But, many of us recommend that you do! (But, I also wish indexing were just that simple because this might not be an index you keep forever…)

See, if EVERY foreign key column automatically had an index created for you – then SQL Server might end up requiring it to always be there. This would then remove the option of consolidating this index with others when you’re later running through performance tuning techniques. So, while it’s generally a good idea to have an index on a foreign key column, it might not ALWAYS be a good idea to keep that narrow index as you add other (probably slightly wider) indexes.

However, initially, creating this index is what I call: Phase 1 of tuning for joins.

But, there are 3 phases of tuning for joins and these phases are all during query tuning and not database tuning. So, as you do deeper tuning, you might end up consolidating this foreign key index with another index(es) to reduce the overall number of indexes on your table.

Finally, the most important point (and this answers the original question): the indexes must be ONE per foreign key (each with only the column(s) of that specific foreign key); you will create one index for each foreign key reference.
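To make that concrete, here’s a sketch with hypothetical table / column names (showing three of the six references; the rest follow the same pattern):

-- One narrow index per foreign key reference:
CREATE INDEX [OrdersCustomerFK] ON [dbo].[Orders] ([CustomerID]);
CREATE INDEX [OrdersEmployeeFK] ON [dbo].[Orders] ([EmployeeID]);
CREATE INDEX [OrdersShipperFK]  ON [dbo].[Orders] ([ShipperID]);

-- A single composite index on ([CustomerID], [EmployeeID], [ShipperID])
-- would NOT do the job: only checks on the leftmost column (CustomerID)
-- could seek into it; the other foreign keys would still have to scan.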

Indexes on Foreign Keys Can Help Improve Performance

There are two ways in which these indexes can improve performance.

First, they can help the foreign key reference maintain integrity on the referenced table. Take the example of Employees and Departments. Not only must SQL Server check that a DepartmentID is valid when inserting / updating an Employee row but SQL Server must also make sure that referential integrity is maintained when DepartmentIDs are removed from the Departments table. An index on the Employees table (on the DepartmentID column) can be used to quickly check if any rows reference the DepartmentID being deleted from the Departments table. Without an index on DepartmentID in the Employees table, SQL Server would potentially have to scan the Employees table; this can be quite expensive.
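Continuing the hypothetical sketch from earlier:

CREATE INDEX [EmployeesDepartmentIDFK] ON [dbo].[Employees] ([DepartmentID]);

-- The referential check behind this DELETE is now an index seek on
-- [EmployeesDepartmentIDFK] instead of a scan of the Employees table:
DELETE [dbo].[Departments] WHERE [DepartmentID] = 42;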

Second, and this doesn’t always work, SQL Server may be able to use the index to help improve join performance. And, this is where I’ll cut the ideas a bit short, as the other phases of join tuning are too complex for this SQL101 post. So while there are other strategies that can be used to tune joins when this doesn’t work, it’s still a fantastic starting point. In fact, I generally recommend indexing foreign keys as part of your very early / development phase for a database. But, again, these indexes might be consolidated later in favor of other indexes.

Summary

Indexing for performance has many steps and many strategies. I hope to keep uncovering these in our SQL101 series but between this post and the Indexing Basics post, you’re well on your way to kicking off a better tuning strategy for your tables!

And, don’t forget to check out all of our SQL101 posts here!

Thanks for reading,
Kimberly

SQLskills SQL101: Indexing Basics

SQLskills introduced our new SQL101 series recently and, well… indexing is something that everyone needs to get right. But, it’s not a simple task. And, as I sit down to write a SQL101 post on indexing, I suspect I’m going to struggle keeping it simple. However, there are some core points on which I will focus and I’ll be sure to list a bunch of additional resources to get you more information from here! Remember, the point of each of our SQL101 posts is to make sure that everyone’s on the same page and has the same basic idea / background about a topic. And, for indexing, that’s incredibly important (and, often, misunderstood).

What is an Index?

Simply, it’s a structure that’s applied to a set of [or, subset of] data to enforce ordering – either to quickly check uniqueness or to aid in accessing data quickly. That’s why you create an index: you’re either wanting to enforce data integrity (such as uniqueness) or you’re trying to improve performance in some way.

How Does an Index Enforce Uniqueness?

If you want to maintain uniqueness over a column (or a combination of columns), SQL Server takes the required data (and very likely more data than you specifically chose) and sorts that data in an index. By storing the data in sorted order, SQL Server is able to quickly determine if a value exists (by efficiently navigating the index structure). For this intro post, it doesn’t entirely matter exactly what’s in the index but it does matter which index you choose and for what purpose.
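A tiny sketch (the [Email] column here is just a hypothetical alternate key you might want to keep unique):

-- Sorted by Email; uniqueness is checked by navigating the index structure.
CREATE UNIQUE INDEX [EmployeesEmailUQ] ON [dbo].[Employees] ([Email]);

-- If the value already exists, this fails with error 2601 (duplicate key
-- row) after a quick seek – no table scan required.
INSERT [dbo].[Employees] ([EmployeeID], [Email])
    VALUES (2, N'someone@example.com');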

Relational Rules Enforced by Indexing

In a relational database, relational rules rule the world. And, many learn some of the rules rather quickly. One of these rules is that relational theory says that every table must have a primary key. A primary key can consist of multiple columns; however, none of the columns can allow NULLs and the combination of those columns must be unique. While I agree that every table should have a primary key, what’s chosen AS the primary key can be more complex than the relational rules allow. Behind the scenes, a primary key is enforced by an index (to enforce and check for uniqueness). The type of index that is used depends on whether or not you explicitly state the index type. If you do not, SQL Server will default to trying to enforce your primary key constraint with a unique clustered index. If a clustered index already exists, SQL Server will create a unique nonclustered index instead.
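A quick illustration of those defaults (hypothetical tables):

-- No index type stated: the PRIMARY KEY becomes a unique CLUSTERED index.
CREATE TABLE [dbo].[Example1]
(
    [ID] INT NOT NULL CONSTRAINT [Example1PK] PRIMARY KEY
);

-- Stated explicitly: the PRIMARY KEY is enforced with a unique NONCLUSTERED
-- index, leaving the clustered index free for a better-suited key.
CREATE TABLE [dbo].[Example2]
(
    [ID] INT NOT NULL CONSTRAINT [Example2PK] PRIMARY KEY NONCLUSTERED
);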

And, this is where things get tricky… a clustered index is a very important index to define. Internally, the clustering key defines how the entire data set is initially ordered. If not well chosen then SQL Server might end up with a structure that’s not as efficient as it could be. There’s quite a bit that goes into choosing a good clustering key and I’ll stress that I think it’s one of the most important decisions to be made for your tables. And, it also needs to be made early as later changes to your clustering key can be difficult at best (often requiring downtime and complex coordinated scripting after already suffering poor performance before you make the decision to change).

So, let’s keep this simple… you’ll want to choose the clustering key wisely and early. And, you’ll want to get a good understanding on the things that depend on the clustering key. I did a discussion and demo in my Pluralsight course: Why Physical Database Design Matters and I’ve discussed this quite a bit in my blog category: Clustering key.

The key point is that the primary key does NOT have to be enforced with a clustered index. Sometimes your primary key is not an ideal choice as the clustering key. Some great clustering key choices are:

  • Composite key: EntryDate, RowID where EntryDate is an ever-increasing date value that follows the insert pattern of your data. For example, OrderDate for a table that stores Orders. RowID should be something that helps to uniquely identify the rows (something like an identity column is useful). Key points: choose the smallest (but reasonable) data types for both the date and the ID. Ideally, use DATETIME2(p) where p is the level of precision desired. And, for an identity column – choose INT if you know you’ll never get anywhere near 2 billion rows. However, if you even think you’ll have “hundreds of millions” of rows, I’d probably go straight for BIGINT so that you never have to deal with the problems that you’ll have if you run out. (There’s a sketch of this pattern right after this list.)
  • Identity column: When I don’t have a good composite key like that above, I’ll often consider an identity column for clustering – even if my queries aren’t specifically using this value. Even if you’re not explicitly using this value, SQL Server is using it behind the scenes in its nonclustered indexes. Please note that this is both a good thing and a bad thing. If your clustering key is very narrow then you’re not unnecessarily widening your nonclustered indexes. That isn’t to say that you won’t have a few wider nonclustered indexes but choosing a wide clustering key makes all of your nonclustered indexes wide when they might not need to be.
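Here’s a sketch of that composite pattern (a hypothetical Orders table):

CREATE TABLE [dbo].[Orders]
(
    [OrderID]    BIGINT IDENTITY NOT NULL,  -- BIGINT: no worries about running out
    [OrderDate]  DATETIME2(0)    NOT NULL,  -- smallest reasonable precision
    [CustomerID] INT             NOT NULL,
    CONSTRAINT [OrdersPK] PRIMARY KEY NONCLUSTERED ([OrderID])
);

-- Ever-increasing, narrow, and unique: follows the insert pattern of the data.
CREATE UNIQUE CLUSTERED INDEX [OrdersCL]
    ON [dbo].[Orders] ([OrderDate], [OrderID]);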

OK, I feel like I’ve started to open a can of worms with this one. But, the key points are:

  1. The primary key does NOT have to be clustered (and sometimes it’s better not to be)
  2. The clustering key needs to be chosen early and based on many factors – there’s no single right answer ALL the time… for example, if you don’t need any nonclustered indexes then the width of the clustering key becomes less of an issue.

At the end of this post, I’ll point you to more resources to help you to make a better decision.

What about Indexing for Performance?

In addition to enforcing uniqueness (and, allowing SQL Server to quickly determine whether or not a value already exists), indexes are used to help performance. And here’s where there are some very simple yet important things to understand. There are two types of performance tuning methods that I want to describe here: query tuning and server tuning. What’s done most often is query tuning. While that might be [temporarily] good for that query, it’s NOT a good long-term strategy for the server. I always START with query tuning but that’s not an ideal strategy to implement directly on your production server.

Query Tuning

Indexing for performance using the DTA [Database Tuning Advisor] for “query tuning” from SSMS

Query tuning is where you focus on nothing but the query to come up with a better indexing strategy. You might use the Database Engine Tuning Advisor on the specific query right from SQL Server Management Studio (see image). Or, you might use the “green hint” found in the actual execution plan window. While these might significantly help you with that query, they may not be ideal to implement in your database (where other users are executing, where other indexes exist, and where resource access / usage are at a premium).

Don’t get me wrong, query tuning is a MUST. But, it’s just a starting point. I always start my tuning process by determining the best indexes for a query. But, you can’t stop there. You MUST do “server tuning” if you want your production database to truly scale.

Server Tuning

Before you create a desired index in production (or, while you’re doing testing / analysis in development / QA [quality assurance]) you really want to check to see if this index is going to be good for production.

Are there other similar indexes?

Maybe you can consolidate some of these indexes into one. Yes, this consolidated index might not be the best for the individual queries but by creating one index instead of three, you’ll be reducing the cost of this index for data modifications, maintenance, storage, etc.
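For example (hypothetical index definitions):

-- Three similar indexes, perhaps each created for a single query:
--   [IX1] ON ([CustomerID]) INCLUDE ([OrderDate])
--   [IX2] ON ([CustomerID]) INCLUDE ([TotalDue])
--   [IX3] ON ([CustomerID], [OrderDate])
-- One consolidated index can usually serve all three access patterns:
CREATE INDEX [OrdersCustomerConsolidated]
    ON [dbo].[Orders] ([CustomerID], [OrderDate])
    INCLUDE ([TotalDue]);

DROP INDEX [IX1] ON [dbo].[Orders];
DROP INDEX [IX2] ON [dbo].[Orders];
DROP INDEX [IX3] ON [dbo].[Orders];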

Are there any suggested missing indexes?

Again, before I create a new index, I want to see if I can get more uses out of it. Can I consolidate this new index with existing indexes and/or missing index recommendations? If I can, then the single index serves more queries.
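The missing-index DMVs are one place to review these suggestions (remember: they’re candidates for consolidation, not a shopping list to create blindly). For example:

SELECT d.[statement] AS [TableName],
       d.equality_columns,
       d.inequality_columns,
       d.included_columns,
       s.user_seeks,
       s.avg_user_impact
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g
    ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS s
    ON s.group_handle = g.index_group_handle
ORDER BY s.user_seeks * s.avg_user_impact DESC;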

Are there good maintenance strategies in place?

Before you go and create more indexes, make sure that your existing indexes are being maintained. You should also check that the indexes being maintained are actually being used. To be honest, you should do that BEFORE you do any tuning at all.
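A quick way to spot that dead weight is the index usage DMV (note: indexes that haven’t been touched since the last SQL Server restart won’t appear here at all):

-- Indexes with many writes but few (or no) reads are being maintained
-- without ever helping a query:
SELECT OBJECT_NAME(s.[object_id]) AS [TableName],
       i.[name] AS [IndexName],
       s.user_seeks + s.user_scans + s.user_lookups AS [Reads],
       s.user_updates AS [Writes]
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
    ON i.[object_id] = s.[object_id] AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
ORDER BY [Reads] ASC, [Writes] DESC;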

SUMMARY: Steps for Server Tuning and Scalability

  1. Get rid of the dead weight. Clean up unused indexes. Consolidate similar indexes.
  2. Make sure your index maintenance strategy is in place. There’s no point in adding indexes if you’re not cleaning up fragmentation and reducing splits.
  3. Then, you can consider adding indexes BUT only after you’ve done the following:
    1. Query tuning
    2. Existing index consolidation
    3. Missing index consolidation
    4. And, of course, TESTING!

Well… that was much longer than I had hoped. But, there are a lot of good concepts here. Unfortunately, indexing for performance is just NEVER a simple discussion. You can’t just put an index on every column and expect things to work well. Conversely, some of these tools seem helpful but they mostly do query tuning and not server tuning. If you really want to get better performance, diving into indexing is a fantastic way to do it! So, if you’re motivated – here are a ton of resources to consider!

Learning more about Indexing

If you want to learn more about index structures, check out this older (but still useful) page of videos. Watch them in this order:

  1. Index Internals
  2. Index Internals Demo
  3. Index Fragmentation
  4. Index Fragmentation Demo
  5. Even better – skip 3 and 4 and go to Paul’s Pluralsight course on SQL Server: Index Fragmentation Internals, Analysis, and Solutions
  6. Index Strategies
  7. Index Strategies Demonstration

I’m also working on a much more extensive course on Indexing for Pluralsight; it should be available within the next few months.

UPDATE: Check out my Pluralsight course – SQL Server: Indexing for Performance for more details! Enjoy!!

Thanks for reading!
k