Wednesday, April 16, 2008

In my blog post on my new sp_helpindex proc (sp_helpindex2), I mentioned that the indexes in my sample were not necessarily a recommended set of indexes - just a test set of indexes. So... in this post, I thought I'd start a series on indexes, limitations and best practices/uses... Especially, why/how to best choose when to use INCLUDE v. having columns in your key. To start, I thought I'd give some background, as well as limitations that exist in various releases from 2005 to 2008 CTP6 (Feb CTP), plus what's expected in the SQL Server 2008 RTM (ah... I did say "expected" so don't come back and yell at me if/when I'm wrong ;-))

First, let's go through a few rules and limitations and background:

SQL Server 2005

  • 250 total indexes per table: 1 clustered index and up to 249 nonclustered indexes (no, this is not a goal!)
  • The index key can be up to 16 columns OR 900 bytes - whichever comes first...
  • The leaf level is limited just as a table is limited to 1024 columns (and, all column types are acceptable in the leaf level of an index - even LOB columns)
  • Statistics are kept for every index (so, up to 250 index-related stats) and there can also be statistics on columns or sets of columns in addition to the index-related stats. In earlier releases, statistics used index ids and as a result, the number of statistics were limited to 250 total minus the statistics used by indexes... in SQL Server 2005, they changed to having statistics kept/managed separately (use sys.stats to see them). As a result of using sys.stats, you can now have 2000 statistics on a table, in addition to the 250 (total) indexes and their statistics. If you want to test this out (and check it on various versions of SQL Server), use this script to setup a test database, a test table and then use dynamic string execution to loop through (until it errors) with creating nonclustered indexes and statistics: testindexmax.zip (.47 KB)

SQL Server 2008 CTP6

  • So far, it seems as though most of the maximums have not yet been lifted...
  • 250 total indexes per table: 1 clustered index and up to 249 nonclustered indexes (and this number  - for CTP6 - includes filtered indexes AND spatial indexes too!)
  • The index key limit hasn't changed (it can be up to 16 columns OR 900 bytes - whichever comes first)
  • The leaf level is still limited just as a table is limited to 1024 columns (and, all column types are still acceptable in the leaf level of an index)
  • Statistics in CTP6 seem to be limited to only 2000 stats per table...

SQL Server 2008 RTM (expected/target... no guarantees on this one :)

  • 30,000 columns per table (mostly to allow sparse columns)
  • 1,000 total indexes per table: 1 clustered index and up to 999 nonclustered indexes. This is also not a goal BUT, it makes sense because of both sparse columns and filtered indexes. Both Paul and I will try to post some entries about sparse columns and filtered indexes in the coming days...
  • The index key limit won't change
  • The leaf level is will be limited just as a table is limited to 30,000 columns (and, all column types are still acceptable in the leaf level of an index)
  • Statistics are also said to be increasing and likely to 30,000... And, for having extra statistics just sitting around and possibly not being used - well, outside of a minimal amount of disk space taken by the stat blob (which does start to get interesting at 1,000s I suppose), even stats that don't get used don't really create much of a problem. So, I'm OK with this one increasing - even significantly - but I have to admit I'm somewhat nervous about the significan't increase in indexes.........

So... you can have A LOT more indexes in SQL Server 2008 but just because you can - DOES it mean that you should?!

And on that - I'll leave you hanging for my next post where I start to talk about WHY they're increasing this (hint: sparse columns and filtered indexes = more columns/more indexes)....

Have fun,
kt

Wednesday, April 16, 2008 1:44:41 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Wednesday, April 02, 2008

OK - so this has been frustrating me for many months... when you create indexes with included columns (which was a new feature of SQL Server 2005), they're not shown by sp_helpindex or by DBCC SHOW_STATISTICS. I understand this not showing for statistics because included columns are not factored into the histogram (that's only the high order element which is the first column in the index) OR the density vector (which only shows the densities (or averages) for the left-based subsets of the key). So, why doesn't sp_helpindex show it? Well... I guess it just didn't get updated for SQL 2005. So, in SQL 2008, I was hoping I'd not only see included columns but also filtered indexes... well, neither is there and sp_helpindex is still the same old proc. Don't get me wrong, you can use SSMS to see all of the index properties for a single index (pane, by pane for each property) OR you can run queries to find the included columns for a given index:

SELECT
(CASE ic.key_ordinal WHENTHEN CAST(AS tinyint) ELSE ic.key_ordinal END) AS [ID],
clmns.name AS [Name],
CAST(COLUMNPROPERTY(ic.object_id, clmns.name, N'IsComputed') AS bit) AS [IsComputed],
ic.is_descending_key AS [Descending],
ic.is_included_column AS [IsIncluded]
FROM sys.tables AS tbl
   
INNER JOIN sys.indexes AS
      
ON (i.index_id >AND i.is_hypothetical = 0) AND (i.object_id = tbl.object_id)
   INNER
JOIN sys.index_columns AS ic 
      
ON (ic.column_id >AND (ic.key_ordinal >OR ic.partition_ordinal =OR ic.is_included_column != 0)) 
         
AND (ic.index_id = CAST(i.index_id AS int) AND ic.object_id = i.object_id)
   INNER
JOIN sys.columns AS clmns 
   
ON clmns.object_id = ic.object_id AND clmns.column_id = ic.column_id
WHERE (i.name = N'[MyIndex]') AND ((tbl.name = N'[MyTable]' AND SCHEMA_NAME(tbl.schema_id) = N'[MySchema]'))
ORDER BY IsIncluded, [ID] ASC

but, there isn't a nice clean way to show all of the included columns for all indexes for a particular table... until now :)

A couple of weeks ago I sat down and rewrote sp_helpindex. I was actually on a plane from Hyderabad to Frankfurt or from Frankfurt to San Fran or from San Fran to Seattle (it was a long day :) and I was using (and well, forcing myself to learn how to use :) my new Vista laptop. OK, that's a HUGE story in and of itself and it definitely warrants its own post but I'll sum up the story with the fact that I had to purchase a new laptop while in Hyderabad because BOTH my primary laptop (T61p) AND my backup laptop (T60p) BOTH (yes, BOTH!!!) suffered catastrophic disk failures on their boot drives within 24 hours of each other. In the end, I really cannot believe the "coincidence" of two laptops crashing within 24 hours of each other. Yes, I thought MTBF too (at first) but the laptops were two Lenovos - one Lenovo (the T60p) was purchased in Feb 2007 and the second, a Lenovo T61p was purchase in Oct 2007. And, it was the T61p that went first. The only thing I can even begin to speculate about and/or think to attribute it to (as I was in India for 17 days from Mar 3 through Mar 20 and this all started on Mar 17) was an overactive metal detector at the hotel at which I was staying (or something related to St. Patrick but I've since ruled that out - and no, I wasn't drinking green beer either...). OK, I really need to do another post to give you all of the details about this trip BUT, I did get a new laptop... and, having just bought it only shortly before I flew back I felt like I really needed to get my money's worth so I just *had* to work on the flights home (ah, security with *3* laptops was NOT fun and I'm *VERY* glad that none of them asked me to "boot" my laptops to prove they were working... that could have been a VERY bad situation... lol).

OK - so back to the story... I was working on the flights and I was preparing to deliver some content on the Friday after I returned (yes, I taught a full day in India on Wednesday then flew back leaving India at 2:15am Thursday morning so that I could arrive back in Redmond at roughly 7pm Thursday night - about 30 hours later - and then teach Friday morning for an 8:30 start time... ah, I was *really* tired on Friday night :). Anyway, in preparing, I decided that I finally needed to re-write sp_helpindex. When I was first writing it, I was only thinking of SQL Server 2005. So, here's the 2005 version that I wrote: sp_helpindex2_2005.zip (2.71 KB).

So, I had wanted to blog that when I got back to Redmond but in preparing for the trip we're on now AND rebuilding my primary and backup laptops, well, it got tabled. So now, today, Paul and I are in Iceland (working with our great friends at Miracle Iceland) and we're teaching "the Accidental DBA" (this past Monday) and SQL Server 2008 New Features in Database Infrastructure and Scalability (Tue through Thursday)... I was giving a lecture on Filtered Indexes in SQL Server 2008 and I, once again, found myself needing a better sp_helpindex. So, when Paul got up to talk about Compression (which is no short lecture for him :), I had time to rewrite sp_helpindex... again. And, here's what I ended up with...

exec sp_helpindex2 'member'

index_name index_description index_keys included_columns filter_definition
member_corporation_link nonclustered located on PRIMARY corp_no NULL NULL
member_ident clustered, unique, primary key located on PRIMARY member_no NULL NULL
member_region_link nonclustered located on PRIMARY region_no NULL NULL
NCIndexCoveringLnFnMiIncludePhone nonclustered located on PRIMARY lastname, firstname, middleinitial phone_no NULL
NCIndexCoversAll4Cols nonclustered located on PRIMARY lastname, firstname, middleinitial, phone_no NULL NULL
NCIndexLNinKeyInclude3OtherCols nonclustered located on PRIMARY lastname firstname, middleinitial, phone_no NULL
NCIndexLNOnly nonclustered located on PRIMARY lastname NULL NULL
QuickFilterTest nonclustered located on PRIMARY lastname phone ([lastname]>'S' AND [lastname]<'T')

So, in the end, I can quickly see whether or not my index has included_columns and/or a filter_definition. Don't get me wrong, these indexes above are NOT necessarily a good combination of indexes (or recommendation of ANY kind) to have - these were just created to make sure that my code works. And, as my good friend Gunnar would say - "it's not my best code but it's not my worst code either" <G>. And, so, here it is: sp_helpindex2_2008.zip (2.75 KB).

Pretty darn useful for sure! Oh, and I used the undoc'ed sp_MS_marksystemobject so that I could still create the sp_ in master but then execute it in all other databases. It's frustrating that this behavior (with sp_ named objects) no longers works in 2005/2008 but at least the sp_MS_marksystemobject still sets the behavior so that we can create this one proc in master but use it in all other databases.

Have fun!
kt

Wednesday, April 02, 2008 10:22:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4]  | 
Tuesday, February 05, 2008

I know that Paul and I recommended that you subscribe to Conor's blog... but have you? He's posted some great details on Partitioning (Part 1 and Part 2) as well as statistics and it always reminds me of how much I can learn from other people's perspectives!

And, just to dove-tail on some of his statistics comments... I, too, have found that as tables get significantly larger AND have non-standard distributions of more than 200 distinct values (and un-even distribution between those values as well), that the optimizer just cannot possibly do a perfect job. The only way an optimizer can be good is when it can "find a good plan fast" (which I first heard from Nigel Ellis (former Development Manager of the Query Processor team) - back when he delivered a Pacific Northwest SQL Server User Group meeting many moons ago). The most important thing to realize is that it's just not possible to waste time to find the absolutely best plan... mathematically analyzing all permutations would be prohibitive - you'd have to take a vacation between query executions (wait, that's not a bad idea... I digress :).

The point:

  1. Make sure that statistics are up-to-date (either through the database option: auto update stats OR by manually updating statistics)
  2. Consider re-evaluating statistics over large tables (and, when poor performance occurs - look at the estimated rows v. the actual rows - if the estimate/actual are off by a fact of 10, then it could be the statistics). I'd try updating stats first and then if that doesn't work, updating with a fullscan. If neither of those work, I'd also re-evaluate other possible indexes (there are some distributions between tables being joined that just can't show a correct correlation between the values when in multiple indexes... sometimes the best index is a multi-column (ie. composite index)). 
  3. Consider breaking very large tables down into smaller chunks (not just table index partitioning but possibly Partitioned Tables AND Partitioned Views) as this can give the optimizer additional details about partiticularly interesting data sets. Even in SQL Server 2008, statistics are still table-level (filtered indexes can provide some, but not complete, relief... I'll give more details in a later post) but I'd often argue that some of the best table designs are not just for a single table. Consider the statistical, locking, and indexing implications for mixed workloads against a single table (and the tremendous amount of blocking that could occur in addition to varying access patterns). And, even while 2008 will offer Partition-level lock escalation, well-designed tables may not need it! I know I've mentioned this before but different perspectives on statistics, optimizers and the fact that a good optimizer has to be efficient in-and-of itself, remind me of some of the most basic things that are also the most common problems contributing to poor query performance.

Returning to the basics and optimizing a system from the ground up always leads to better scalability!

Enjoy!
kt

 

Tuesday, February 05, 2008 12:06:01 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1]  | 
Friday, February 02, 2007

Well, I've promised to blog more and I'm really going to try to do so. This morning I got the perfect question/comment (in email) to respond to and after working through a response that was taking me upwards of 3 hours (you'll learn later why I have 3 "spare" hours :)......... I figured that it was time to turn the response into a blog post. ;)

Background: The Clustered Index Debate
In the years since the storage engine was re-architected (SQL Server 7.0+) there's been constant debate on how to appropriately choose the clustered index for your tables. I've generally recommended an ever-increasing key to use as a clustered index and many find that counterintuitive. The primary reason people feel it's counterintuitive is that it creates a hotspot of activity. [If "hotspot" is not a familar term - a hotspot is solely an active place within your table.] Hotspots were something that we greatly tried to avoid PRIOR to SQL Server 7.0 because of page level locking (and this is where the term hot spot became a negative term). In fact, it doesn't have to be a negative term. However, since the storage engine was rearchitected/redesigned (in SQL Server 7.0) and now includes true row level locking, this motivation (to avoid hotspots) is no longer there. In fact (and probably even more counterintuitive), the opposite is true. Hotspots (specifically hot PAGES not hot ROWS) can be very beneficial because they; minimize the number of pages needed in cache, improve the likelihood of the required page already being in cache and in general, they minimize the overall amount of cache required. So, this is why many of us have changed our recommendation on where to create the clustering key in 7.0+. Instead of focusing on range queries we now focus on placing the clustering key on an ever-increasing key. In earlier releases, focusing on range queries for the clustered index reduced hotspots for insert/update and this in fact was the PRIMARY motivation to choose them, NOT range query performance! But - there are even MORE reasons to choose an ever-increasing key and they are based on internals as well. These internals are based on the significant changes made in the storage engine for 7.0+. For a quick start on these, I went through them in the Blog entry here.

And, today's email is not uncommon. This is the basis for the title clustered index debate. In general, there are still a lot of questions related to creating clustered indexes to improve "range query" performance. Don't get me wrong, there's definitely a benefit in performance for some range queries but the first thing to remember is that you get only one CL index per table (therefore only one type of range query can benefit). In the real world, t's not likely that you want to see your data exactly in the same way all the time. Therefore it's very challenging to come up with the "right clustered" index if you're using range queries as your strategy. Even worse, the affect of choosing the clustering key to improve range queries causes problems for modifications against that table (INSERTs/DELETEs and UPDATEs). So.............. this is what started my day today. A great email from a reader that brought up these points. The question/comment (modified to hit only the highlights and to protect their identity :) was this:

The most important characteristic for a Clustered Index key is to satisfy range queries. More often than not, if a sufficient range of data will be scanned, the Optimizer will choose the Clustered Index over all others due to the excessive cost of Bookmark Lookup operations. As such, the table KEY is a more suitable clustered index candidate than any surrogate (few every query a database by range of surrogate keys).  [kt note: this second sentence is not entirely true... SQL Server will certainly choose a clustered index over non-clustered that require table scans but there are A LOT of algorithms that SQL Server can use instead of either of these and my examples later show this... non-clustered covering seekable indexes, non-clustered scanable indexes, index-intersection, etc. ] 

Now, when the default behavior for SQL Server was designed such that the PRIMARY KEY was chosen as the default clustered index, it was exactly for this reason.  It is the business key.  It would satisfy uniqueness (by definition of logical KEY).  And, it is well suited for a wide variety of range scans.  However, this is when the PRIMARY KEY is defined on the Business Key of the data.

But, when you introduce the usage of surrogate keys (i.e., IDENTITY) as a physical implementation, and thus transfer the PRIMARY KEY definition to it, two things must be considered.  First, the Business Key this IDENTITY will be a proxy for must still exist as it is still apart of the logical design.  As part of the physical design, the logical key needs to be implemented as a physical constraint to maintain logical uniqueness.  Second, just because a proxy has been defined does not make it a natural candidate for the clustered index.  The business key still maintains this distinction.

What is often cited as the “reason” for IDENTITY PRIMARY KEY clustered index definitions is its monotonic nature, thus minimizing page splits.  However, I argue that this is the only “reason” for defining the clustered index as such, and is the poorest reason in the list.  Page Splits are managed by proper FILLFACTOR not increasing INSERTS.  Range Scans are the most important “reason” when evaluating clustered index key definitions and IDENTITies do not solve this problem.

Moreover, although clustering the IDENTITY surrogate key will minimize page splits and logical fragmentation due to its monotonic nature, it will not reduce EXTENT FRAGMENTATION, which can cause just as problematic query performance as page splitting.

In short, the argument runs shallow.

Luckily, this email arrived with perfect timing for me as I'm sitting in a "bootcamp" event on Always On technologies and I'm not speaking this morning (my colleague Bob Beauchemin is doing lectures on Scale Out technologies: Scalable Shared Databases, Service Broker, DPVs, etc.). Anyway, in addition to listening to Bob, I've decided to continue the blog series on "the clustered index debate". The first and most important point to stress is that minimizing page splits is NOT the only reason nor is it the most important. In fact, the most important factors in choosing a clustered index are that it's unique, narrow and static (ever-increasing has other benefits to minimizing splits).

The Clustered Index Debate Continued
First, there are many angles to look at wrt to "the clustered index debate" and it's not until all of the issues are reviewed, that this strategy (a monotonically increasing key) becomes obvious. So, I think it will probably take a couple of blog posts to really prove this. I'll start up this debate again here...... When you look at a general purpose table (which is most) where the table has ALL DML (S/I/D/U) then you are best off with an ever-increasing key (again, you have to look at the overall impact of all operations against the table - not just select... because I/D/U will also impact select in the long term). So, I'll break this down into each DML operation here. If you don't look at the overall impact, then large tables can end up having a tremendous number of problems once they're put into production. I've certainly heard this concern/debate before (and most people are skeptical at first glance) but when you look at the situation overall, you'll find that "finding the right balance" includes not just looking at range queries. In fact, here's a quick list of the things/tests/numbers/scenarios that help to prove my strategy:

  • Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is that lookups in the IAM/PFS to determine the insert location in a heap are slower than in a clustered table (where insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (CL) and where that order is ever-increasing. I have some simple numbers but I'm thinking about creating a much larger/complex scenario and publishing those. Simple/quick tests on a laptop are not always as "exciting". But - this is a well documented issue (IAM/PFS lookups) and poor performance on a heap is also referenced in this KB: PRB: Poor Performance on a Heap. note: this KB is quite dated and I don't actually agree with everything in this article however, the general concern of poor performance for inserts is still true on SQL Server 2005.
  • Updates are often faster (when the row needs to be relocated) and for the same reason (IAM/PFS lookups) BUT there are many types of updates and not all updates cause records to be relocated. Here are a few things to think about wrt to updates:
    • Updates that are completely in-place (some examples are where the update is updating a fixed-width column OR to variable-width columns where the row size doesn't change, etc.). These types of updates don't really care.
    • Updates that cause record relocation (where the row size changes) are definitely better by having a clustering key because the record relocation (which will be handled by a split) is defined by the clustering key
    • Updates to the clustering key are the WORST (in this case) which is one of the key reasons for having a cl key that is static (so we have to keep this in mind when we choose a clustering key).
  • Deletes aren't nearly as big of a concern BUT deletes in heaps create more gaps and more gaps creates more work in PFS/IAM lookups and while this helps to reduce wasted space, it still requires the time to find the space........ hence the slowed performance of Inserts/Updates. I've also written some blog entries that cover very interesting test cases for large scale deletes and why you'd want to consider partitioning to optimize for the "sliding window scenario" in this blog entry: MSDN Webcast Q&A: Index Defrag Best Practices - Fragmentation, Deletes and the “Sliding Window” Scenario and it's the LAST one!.
  • Selects.............. now this is the hardest one to go through in just a couple of bullets (ah, I guess this will lead to another one or two posts :) BUT I'll start by saying that the best way to tune the vast majority of range queries is through non-clustered [covering] indexes. But, it's also important for me to stress that I do NOT advocate covering every query (it's impossible to do). What's important to realize in terms of covering is that SQL Server 7.0 and up continues to include internal algorithms to improve performance when you don't have the "perfect" non-clustered covering seekable index and instead still gives better performance than going to the base table (or performing bookmark lookups - as mentioned in the mail...and I completely agree that these [bookmark lookups] can be evil!). To start this discussion, I'll give one of my favorite examples of a large-scale aggregate. The absolute best way to improve the performance is through an indexed view but the data can be gathered through many other algorithms - ideally through a non-clustered covering index that is in order by the group by and that includes the column(s) being aggregated. For example, take this query:

SELECT c.member_no AS MemberNo,
 sum(c.charge_amt) AS TotalSales
FROM dbo.charge AS c
GROUP BY c.member_no

On a charge table of 1.6 million rows here are the performance numbers to handle this aggregation:

  • Clustered table scan (CL PK on Charge_no) with a hash aggregate = 2.813 seconds
  • Index scan (non-clustered covering but NOT in order of the group by) with a hash aggregate = 1.436 seconds
  • Index scan (non-clustered covering in order of the group by) with a hash aggregate = .966 seconds
  • Indexed view = .406 seconds

Now this was a pretty small table (narrow rows and only 1.6 million rows) AND I didn't have any concurrent activity. The concurrent activity would have caused this to be even slower for hash aggregates, etc. Regardless, it proves the point (at least generally). Now, if I wanted to improve this range query then I'd have to cluster on the member_no column (and this is an ideal example because I often hear people say that clustering on a foreign key column helps to improve range/join queries - which can be true as well)......... But - this strategy has a few problems in addition to a few benefits (and we have to look at everything to be sure of our choice/decision). First, member_no is not unique (in the charge table) so SQL Server has to "uniquify" the rows. The process of "uniquification" impacts both time (on insert) and space (the rows will be wider to store each duplicate row's uniqufier). Also, theoretically it could change (in this case that's not true). Anyway, the time it takes for the clustered index is 2.406 seconds which is better than the clustered on the PK (of course) but if I were to also start modifying the rows (which creates splits) or even just insert 15% more rows........ then my table would become fragmented. At that point, the query performance should get worse in the table clustered by member_no table and it will continue to get even worse in the table clustered by charge_no (because of the worktable created in tempdb by the hash aggregate) BUT it won't be all that much worse in the non-clustered index examples (especially the covering index that's in the order of the group by - because this doesn't require a worktable).........

  • CL on member_no = 4.906 seconds
  • CL on charge_no = 6.173 seconds
  • Index scan (non-clustered covering but NOT in order of the group by) with a hash aggregate = 3.906 seconds
  • Index scan (non-clustered covering in order of the group by) with a hash aggregate = 1.250 seconds
  • Indexed view = .516 seconds

This is a great start to furthering the clustered index debate but I do have to admit that it's a counterintuitive and difficult issue to tackle because often isolated tests lead you to different conclusions. In this case though, the non-clustered indexes are better for this range query and the indexed view is the best (but I wouldn't consider the Indexed unless this were more of a read focused database rather than read/write). [and - of course, that statement warrants yet another blog post :)]

So, depending on the tests that you do - especially if you focus only on selects and you don't have modifications (i.e. fragmentation) - then they will make "creating the clustered index for range queries" appear to be best. Again, I'm not just saying this to prevent fragmentation, I'm saying this because I wouldn't use the clustered index OR a non-clustered index with bookmark lookups to handle this query. I'd consider a non-clustered covering that's seekable OR even a non-clustered covering that's scanable before I'd even choose the clustered (and that's what the optimizer would prefer as well). In the end it's really a bit of an art and a science to "finding the right balance" of indexing.

Oh - and if you arbitrarily add a column to use for clustering (maybe not as the primary key) that can help but many would prefer to use actual data... which means [potentially] creating your primary key with a new identity [or similar] column and this can impact your business logic (absolutely). I'm certain that certain tests can show that range queries are faster and it's absolutely correct that business application/usage can be a concern but when you look at the big picture (and the impact on I/D/U) then the benefits of the monotonically increasing key significantly outweigh these concerns. Simply put, a small/narrow key can help join performance and an ever increasing key can also help lookups for rows! (yes, definitely more coming)

Happy Friday! Have a great weekend. I'll try to continue more threads on this debate shortly!
kt

Friday, February 02, 2007 11:39:12 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 
Thursday, May 04, 2006

On Wednesday I had the pleasure of chatting with Greg Low - a fellow RD and MVP who is also excited to be focused and working in the SQL Server space. He's been doing a few podcasts on database technologies and we were finally able to hook up today. He caught me off guard with a couple of questions on my hobbies (the hobbies that I have outside of SQL Server ;-) but more than anything he got me talking about indexes (well, now that's not all that hard, eh? ;-). It was fun - thanks for the chat Greg.

Check it out:

SQL Down Under: http://www.sqldownunder.com/
The show in mp3 format: http://www.sqldownunder.com/SDU15FullShow.mp3
The show in wma format: http://www.sqldownunder.com/SDU15FullShow.wma

Enjoy - and thanks Greg!
kt

Events | Indexes | Tips
Thursday, May 04, 2006 1:31:49 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Saturday, November 05, 2005
On Friday, November 4, Kimberly L. Tripp completed part 10 of a 10-part series on developing Reliable and Scalable applications for an MSDN Webcast series. Each session had the format of a 90 minute content session followed by a 30 minute content specific Q&A. This session was a summary session for the entire series and included a lot of great questions. To get you started with many of the links to other sessions and even for some of the scripts used throughout this series - check out this blog entry. Enjoy!
Saturday, November 05, 2005 8:53:49 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Sunday, September 04, 2005
Part 5 is done and we're half way through the series. Another great series of questions from part 5 of 10 in the series titled: Effectively Designing a Scalable and Reliable Database. Looking forward to another session this Friday! Have a great holiday weekend/Monday (for those of you/us in the US)!
Sunday, September 04, 2005 6:41:41 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 
Tuesday, August 30, 2005
OK, so here are the technical questions answered from part 4 of the webcast series. Still 6 more sessions to go. See you on Friday!
Tuesday, August 30, 2005 9:41:42 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2]  | 
Friday, August 26, 2005
Lots of questions in today's MSDN Webcast titled: Indexing Best Practices. And - lots of great resources to continue learning! To allow you to check out a few of the other resources right away, I'm posting this "Resources ONLY" part of the Q&A. Wow, talk about a lot of homework... Enjoy. :-)
Friday, August 26, 2005 1:43:20 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
And - here's the written Q&A from Part 3 of the MSDN Webcast Series: Effectively Designing a Scalable and Reliable Database, A Primer to Proper SQL Server Database Development. Part 3's focus was "Designing Tables that Scale, Best Practices in Data Types and Initial Table Structures." As for the Q&A, surprisingly, it's delivered before part 4...just barely! Lots of great questions (as usual!) and quite a few links and things to reference. Enjoy!
Thursday, August 25, 2005 11:09:11 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 
Thursday, August 25, 2005
Most people I know - especially those who don't know a lot about Indexes (but even those who do :) - can't believe that I have a two-day class on NOTHING but Indexing... They ask me, "How can you talk about one topic for two days?" The scary part is my answer...
Thursday, August 25, 2005 7:43:42 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Friday, June 24, 2005

Many of you in the .NET Community are already familar with theServerSide.NET but many of you in the SQL Server community are not...well, here's your chance to bridge the gap over to .NET and hear a bunch of great .NET interviews! My interview is more SQL-centric but there are lots of great ones on Development in general and even one on SQL Server 2005 Reporting Services by Jason Carlson.

My interview was recorded with them back in February when I was speaking at VSLive in San Francisco and they've been saving them up, editing them and then posting an interview roughly each week (mine is the 29th posted!). The interview was great fun and Paul Ballard asked some really interesting questions about tuning, indexes, stored procedures and SQL Server 2005. Mostly it's an interview about optimization tips in both SQL Server 2000 and 2005 and just in general - things to look forward to in SQL Server 2005.

Here's a link to all of theServerSide.NET interviews: http://www.theserverside.net/talks/index.tss 
Here's a link specifically to mine: http://www.theserverside.net/talks/videos/KimberlyTripp/interview.tss?bandwidth=dsl
Here's the link to Jason Carlson's Reporting Services interview: http://www.theserverside.net/talks/videos/JasonCarlson/interview.tss?bandwidth=dsl and finally, here's a TechNet Webcast that Jason did titled: Authoring Reports in SQL Server 2000 Reporting Services (Level 200)

Enjoy!

Friday, June 24, 2005 10:09:59 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Monday, June 06, 2005

As I'm preparing for my Tech*Ed session on Indexing Best Practices in SQL Server 2005, I'm reminded that there are a ton of best practices that really apply to both SQL Server 2000 as well as SQL Server 2005. When it comes to indexing, there are many dependencies on the storage structures. These dependencies are the basis for why I recommend a very specific type of clustering key - for all versions of SQL Server, 7.0 and higher!

I'm going to start with my recommendation for the Clustering Key - for a couple of reasons. First, it's an easy decision to make and second, making this decision early helps to proactively prevent some types of fragmentation. If you can prevent certain types of base-table fragmentation then you can minimize some maintenance activities (some of which, in SQL Server 2000 AND less of which, in SQL Server 2005) require that your table be offline. OK, I'll get to the rebuild stuff later.....

Let's start with the key things that I look for in a clustering key:

  • Unique
  • Narrow
  • Static

Why Unique?
A clustering key should be unique because a clustering key (when one exists) is used as the lookup key from all non-clustered indexes. Take for example an index in the back of a book - if you need to find the data that an index entry points to - that entry (the index entry) must be unique otherwise, which index entry would be the one you're looking for? So, when you create the clustered index - it must be unique. But, SQL Server doesn't require that your clustering key is created on a unique column. You can create it on any column(s) you'd like. Internally, if the clustering key is not unique then SQL Server will “uniquify” it by adding a 4-byte integer to the data. So if the clustered index is created on something which is not unique then not only is there additional overhead at index creation, there's wasted disk space, additional costs on INSERTs and UPDATEs, and in SQL Server 2000, there's an added cost on a clustereD index rebuild (which because of the poor choice for the clustering key is now more likely).

Why Narrow?
A clustering key should be narrow for some of the same reasons it should be unique. If the clustering key is used as the lookup key from all non-clustered indexes, then the clustering key is duplicated in all non-clustered indexes. If the clustering key is really wide, then all of the non-clustered indexes will be [unnecessarily] wide. This will waste disk space, create additional costs on INSERTs and UPDATEs, and require more time (because of size) when rebuilding these index structures. So, what does narrow mean - as few bytes as possible to help uniquely define your rows. A narrow numeric when possible.

Why Static?
A clustering key should be static for some of the same reasons it should be unique and narrow. If the clustering key is used as the lookup key from all non-clustered indexes, then the clustering key is duplicated in all non-clustered indexes. In fact, for a given table the clustering key will be the most duplicated data. If this data changes then they'll need to update the value in the base table as well as in EVERY non-clustered index. And, if the key changes, it will cause the record to move. When a record moves, it creates fragmentation. This will waste disk space, create additional costs on INSERTs and UPDATEs, and require more time (because of record relocation and [the likely] subsequent splits) and require more maintenance. 

OK, so it sounds like I want a narrow, unique and static value... What about a guid?
Typically, I recommend a numeric IDENTITY column as the clustering key but I always get this question. In fact, I often wait to see how long it's going to take before I get this question ;). Anyway, a guid does meet the criteria fairly well - it's certainly unique, it's usually static and it's relatively narrow. So, what's wrong with it? In SQL Server 2000, the guid function (newid()) is built using a value that does not create an ever increasing pattern (an IDENTITY column would). But wait, I didn't say that you needed to have an ever-increasing pattern.....

OK, so the final criteria I look for in a clustering key is: an ever-increasing pattern!
If the clustering key is ever-increasing then new rows have a specific location where they can be placed. If that location is at the end of the table then the new row needs space allocated to it but it doesn't have to make space in the middle of the table. If a row is inserted to a location that doesn't have any room then room needs to be made (e.g. you insert based on last name then as rows come in space will need to be made where that name should be placed). If room needs to be made, it's made by SQL Server doing something called a split. Splits in SQL Server are 50/50 splits - simply put - 50% of the data stays and 50% of the data is moved. This keeps the index logically intact (the lowest level of an index - called the leaf level - is a douly-linked list) but not physically intact. When an index has a lot of splits then the index is said to be fragmented. Good examples of an index that is ever-increasing are IDENTITY columns (and they're also naturally unique, natural static and naturally narrow) or something that follows as many of these things as possible - like a datetime column (or since that's NOT very likely to be unique by itself datetime, identity). But wait, what about that a guid.

Well, in SQL Server 2000 the only SQL Server function for guids is newid - that does not create an ever increasing pattern. In SQL Server 2005, you can use a new guid function called newsequentialid() to populate your uniqueidentifier column. Here's an example of how you can use it:

CREATE TABLE Test
(
TestID uniqueidentifier CONSTRAINT Test_TestID_Default DEFAULT newsequentialid(),
Inserted datetime CONSTRAINT Test_Inserted_Default DEFAULT getdate()
)
go

INSERT Test DEFAULT VALUES
go

SELECT * FROM Test
go

Is there a way to create a sequential guid in SQL Server 2000?
YES, use Gert Drapers wrote an xp to generate sequential guids! Check it out here. He just published this recently (May 2005) and it's a good change for you in SQL Server 2000 databases/applications.
Full title with link: XPGUID.DLL - Sequential GUID generation and GUID helper functions XP

OK, so I've tackled a few things here today and I've answered a few questions related to indexes before. Check out my Indexes Category of blog entries here and if you're at Tech*Ed today, I hope to see you at 5PM.

Thanks for reading,
kt

Monday, June 06, 2005 9:48:35 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1]  | 
Monday, October 25, 2004

The title is a common question I've received in the past and I thought I'd take a few minutes to explain a bit about keys and indexes...

This is by no means a lot of detail regarding relational theory, etc. but there are a few things that we should quickly review to make sure that the basis for indexes being created makes sense. First, from a relational theory perspective every table must have a primary key. From SQL Server's perspective it's not a requirement but it's generally a good idea. A primary/unique key are entity identifiers. Each are a unique way of identifying a row. There are subtle differences between the two - in implementation:

Primary Key

  • In SQL Server the Primary Key is enforced through a Primary Key Constraint.
  • None of the columns that make up the primary key allow nulls.
  • The values in the Primary Key must be unique - to enforce uniqueness (as well as make it efficient), SQL Server creates a unique clustered [composite] index on the column(s) that make up the key.

Unique Key

  • In SQL Server the Primary Key is enforced through a Unique Key Constraint.
  • The columns that make up the unique key CAN allow nulls but not for more than one complete key. Meaning that allowing Nulls on a single column unique key really only allows NULL (only one NULL value). Overall, allowing Nulls in a column (like Social Security Number) doesn't really make much sense but if you need to then you can't go with a unique constraint - instead consider a unique index. At that point, I typically get the question of what's the difference between a unique key and a unique index...
  • Allowing Nulls values in the columns that mak