Well, it's been a crazy few weeks! Paul and I have been booked solid with client engagements, Immersion Events, conferences and other things (like chapter editing for the upcoming SQL Server Deep Dives Volume 2) and so I've been a bit behind with blogging. I have a ToDo list of items to blog about but I've got one that's fairly urgent. I'll mention a few issues here but above all - don't trust anyone.

OK, I know that sounds a bit paranoid. Not meaning to be but this is a post that I'm adding to my "Just because" series as well as my "Are you kidding me?" series. That's how bad this is... almost as bad as these "stupid bacon related tatoos" - really, someone has this (forever!):

As a DBA (and/or SQL Architect/Database Designer/Developer) we're often tasked with making changes to databases and ideally, we want to use tools to make it easier. While I do agree that many tools help tremendously, I've been shocked lately by a few that have done the wrong thing. Paul blogged about a third party tool that didn't analyze indexes correctly here: Beware of advice from 3rd-party tools around dropping indexes (ultimately, the recommendations for what to drop - were wrong). And, recently, I've been shocked to learn that some of the schema change options in the database designer (in SSMS) are less than optimal when changing index definitions (especially those that handle changes to the clustering key).

Specifically, we had a customer that was doing everything right by testing the changes on a secondary system but SSMS was doing everything wrong (the order in which it made the changes and how it made the changes - were horrible). The end result is that if you're going to make schema changes, you really need to get more insight into what the application is doing. And, many tools have an option to script out the changes rather than execute them. So, my "quick tip" is to do that and see what's going on.

And, if you're going to make schema changes in ANY tool - then work through what those changes are going to do. Most important - consider re-arranging the SQL (especially wrt indexes). If you're ever changing the CL index you need to do the following:

  1. Drop nonclustered indexes first
  2. Drop the clustered next
    • IMPORTANT NOTES: If you're completely changing it you definitely need to drop the nonclustered and then drop the CL. BUT, if you're keeping the name and only changing the definition (and it's NOT a constraint-based index) then you can use WITH DROP_EXISTING. If you use DROP_EXISTING you do NOT need to drop the nonclustered indexes first.
  3. Create the new clustered
  4. Recreate the nonclustered indexes

Some of the weird things that we've seen:

  1. Some changes to table structure (changing from a heap to a clustered or a clustered to a heap) cause disabled nonclustered to get rebuilt (and re-enabled). That's not even SSMS - that's a general SQL Server problem. In my opinion this means that disabling indexes is somewhat broken. So... gotta use drop/create (which is a bummer!).
  2. Changes to the clustering key in SSMS are sometimes done by dropping the clustered and then recreating the clustered. This will cause your nonclustered indexes to be rebuilt TWICE!
  3. If you have col1 that's currently clustered and col2 is currently nonclustered but you want to swap them... you can do that in the UI and then save but here's what they do:
    1. They drop the nonclustered
    2. They drop the clustered (this causes all of the remaining nonclustered to get rebuilt)
    3. They create the nonclustered
    4. They create the clustered (this causes all of the remaining AND the new nonclustered to get rebuilt... again)

This is only a short list of some of the crazy things that we've seen. I'm working on a more comprehensive overview of all of these things but I wanted to get a few of them out there. BEWARE OF THE TOOLS and always use scripting and testing to make sure that things are doing what you think. If you're EVER going to make changes to a critical system - this is exactly what I'd do:

  1. Script out *JUST* the objects and their definitions from the production environment
  2. Take *just* the schema and then go through and make your changes in the designer. Instead of saving the changes (which would immediately implement them), script out the changes and review the SQL.
  3. Run through the SQL and see what it does (but, this is an empty database so time isn't going to be as much of a factor here). If you think there's something wrong - ask around (colleagues/twitter/forums)...
  4. Then, once you feel you have a good version of the script THEN backup/restore the production database to your test system (I hope that you have this??) and THEN see how long it takes. If you have new concerns then ask around again!
  5. Next, before you consider if for production - you need to thoroughly test your applications. Are they affected by these changes?
  6. Then, and only when you've thoroughly tested it - you can consider it for production.

Picky yes... surprises NO. On a critical system you CANNOT afford surprises that create downtime - or worse, data loss.

Thanks for reading!!
kt

OK, two in two days? What's wrong with me (is what Paul said) after he said - who are you and where is my wife?

But, this one is a short one, specifically a follow-on to my prior post about "Disk space is cheap..."

I did a bit of simple math on the internal overhead and costs of inefficient clustering keys and I wanted to share those as well. If a table is clustered then each and every nonclustered index must include ALL columns of the clustering key in [at least] the leaf level of the clustered index. When a nonclustered index is non-unique then it's even worse in that all columns of the clustering key must go all the way up the tree (into the non-leaf levels). I re-wrote sp_helpindex to give you better insight into what was happening in your nonclustered indexes here: A new and improved sp_helpindex (jokingly sp_helpindex8). I've actually re-written it again to include a column that shows whether or not an index is disabled but it still needs the base procedure [sp_SQLskills_ExposeColsInIndexLevels] from the other post - make sure you download that first before trying to use this sp_helpindex9: sp_SQLskills_SQL2008_helpindex (sp_helpindex9).sql (11.06 kb).

Having said that - what's the overhead in terms of some of your bigger tables?

If you have a 10 million row table with 8 nonclustered indexes then *just* the internal overhead is going to cost you the following with keys of the varying sizes:

 

 Description

 Width of CL key

 Rows

 NC Indexes

 MB 

 int

                            4

       10,000,000

               8

           305.18

 datetime

                            8

       10,000,000

               8

           610.35

 datetime, int

                          12

       10,000,000

               8

           915.53

 guid

                          16

       10,000,000

               8

        1,220.70

 composite

                          32

       10,000,000

               8

        2,441.41

 composite

                          64

       10,000,000

               8

        4,882.81

And, what if you have a bigger table - say a 100 million row table with 12 nonclustered indexes... the internal overhead is as follows:

 

 Description

 Width of CL key

 Rows

 NC Indexes

 MB 

 int

                            4

     100,000,000

             12

        4,577.64

 datetime

                            8

     100,000,000

             12

        9,155.27

 datetime, int

                          12

     100,000,000

             12

      13,732.91

 guid

                          16

     100,000,000

             12

      18,310.55

 composite

                          32

     100,000,000

             12

      36,621.09

 composite

                          64

     100,000,000

             12

      73,242.19

So, if you're talking about disk space, IOs, backups, etc. and you have a very inefficient clustering key (say a 5 column "natural key") that's 64 bytes then you have 71.53GB of OVERHEAD where it could have been only 4.47GB. That's a tad over 67GB of wasted disk space, memory (really - do you have 67GB+ of memory to waste??) and what about the long term costs of backing all of this up?? And, I haven't even started to talk about the inefficiencies with the FKs too. Or, the inefficiencies within the nonunique nonclustered index trees as well.

It's a COMPLETE NIGHTMARE!

Please... everyone repeat after me................. DESIGN MATTERS!!! :-)

Thanks for reading!
kt

(Be sure to join our community to get our monthly newsletter with exclusive content, advance notice of classes with discount codes, and other SQL Server goodies!)    

Expanding on the topic of "are you kidding me"... one of the MOST PREVALENT problems I see today is the dreaded "GUIDs as PKs" problem. However, just to be clear, it's not [as much of a] problem that your PRIMARY KEY is a GUID as much as it is a problem that the PRIMARY KEY is probably your clustering key. They really are two things BUT the default behavior in SQL Server is that a PRIMARY KEY uses a UNIQUE CLUSTERED INDEX to enforce entity integrity. So, I thought I'd take this post to really dive into why this is a problem and how you can hope to minimize it.

Relational Concepts - What is a PRIMARY KEY? (quick and basic reminder for what is what and why)

Starting at the very beginning... a primary key is used to enforce entity integrity. Entity integrity is the very basic concept that every row is uniquely identifiable. This is especially important in a normalized database because you usually end up with many tables and a need to reference rows across those tables (i.e. relationships). Relational theory says that every table MUST have a primary key. SQL Server does not have this requirement. However, many features - like replication - often have a requirement on a primary key so that they can guarantee which row to modify on a related database/server (like the subscriber in a replication environment). So, most people think to create one. However, not always...

What happens when a column(s) is defined as a PRIMARY KEY - in SQL Server?

The first thing that SQL Server checks is that ALL of the columns that make up the PRIMARY KEY constraint do not all NULLs. This is a requirement of a PRIMARY KEY but not a requirement of a UNIQUE KEY. They also check to make sure (if the table has data) that the existing data meets the uniqueness requirement. If there are any duplicate rows, the addition of the constraint will fail. And, to check this as well as to enforce this for [future] new rows - SQL Server builds a UNIQUE index. More specifically, if you don't specify index type when adding the constraint, SQL Server makes the index a UNIQUE CLUSTERED index. So, why is that interesting...

What is a clustered index?

In SQL Server 7.0 and higher the internal dependencies on the clustering key CHANGED. (Yes, it's important to know that things CHANGED in 7.0... why? Because there are still some folks out there that don't realize how RADICAL of a change occurred in the internals (wrt to the clustering key) in SQL Server 7.0). It's always (in all releases of SQL Server) been true that the clustered index defines the order of the data in the table itself (yes, the data of the table becomes the leaf level of the clustered index) and, it's always been a [potential] source of fragmentation. That's really not new. Although it does seem like it's more of a hot topic in recent releases but that may solely because there are more and more databases out there in general AND they've gotten bigger and bigger... and you feel the effects of fragmentation more when databases get really large.

What changed is that the clustering key gets used as the "lookup" value from the nonclustered indexes. Prior to SQL Server 7.0, SQL Server used a volatile RID structure. This was problematic because as records moved, ALL of the nonclustered indexes would need to get updated. Imagine a page that "splits" where half of the records are relocated to a new page. If that page has 20 rows then 10 rows have new RIDs - that means that 10 rows in EACH (and ALL) of your nonclustered indexes would need to get updated. The more nonclustered indexes you had, the worse it got (this is also where the idea that nonclustered indexes are TERRIBLY expensive comes from). In 7.0, the negative affects of record relocation were addressed in BOTH clustered tables and heaps. In heaps they chose to use forwarding pointers. The idea is that the row's FIXED RID is defined at insert and even if the data for the row has to relocate because the row no longer fits on the original page - the rows RID does not change. Instead, SQL Server just uses a forwarding pointer to make one extra hop (never more) to get to the data. In a clustered table, SQL Server uses the clustering key to lookup the data. As a result, this puts some strain on the clustering key that was never there before. It should be narrow (otherwise it can make the nonclustered indexes UNNECESSARILY wide). The clustering key should be UNIQUE (otherwise the nonclustered indexes wouldn't know "which" row to lookup - and, if the clustering key is not defined as unique then SQL Server will internally add a 4-byte uniquifier to each duplicate key value... this wastes time and space - both in the base table AND the nonclustered indexes). And, the clustering key should be STATIC (otherwise it will be costly to update because the clustering key is duplicated in ALL nonclustered indexes).

In summary, the clustering key really has all of these purposes:

  1. It defines the lookup value used by the nonclustered indexes (should be unique, narrow and static)
  2. It defines the table's order (physically at creation and logically maintained through a linked list after that) - so we need to be careful of fragmentation
  3. It can be used to answer a query (either as a table scan - or, if the query wants a subset of data (a range query) and the clustering key supports that range, then yes, the clustering key can be used to reduce the cost of the scan (it can seek with a partial scan)

However, the first two are the two that I think about the most when I choose a clustering key. The third is just one that I *might* be able to leverage if my clustering key also happens to be good for that. So, some examples of GOOD clustering keys are:

  • An identity column
  • A composite key of date and identity - in that order (date, identity) 
  • A pseudo sequential GUID (using the NEWSEQUENTIALID() function in SQL Server OR a "homegrown" function that builds sequential GUIDs - like Gert's "built originally to use in SQL 2000" xp_GUID here: http://sqldev.net/xp/xpguid.htm

But, a GUID that is not sequential - like one that has it's values generated in the client (using .NET) OR generated by the newid() function (in SQL Server) can be a horribly bad choice - primarily because of the fragmentation that it creates in the base table but also because of its size. It's unnecessarily wide (it's 4 times wider than an int-based identity - which can give you 2 billion (really, 4 billion) unique rows). And, if you need more than 2 billion you can always go with a bigint (8-byte int) and get 263-1 rows. And, if you don't really think that 12 bytes wider (or 8 bytes wider) is a big deal - estimate how much this costs on a bigger table and one with a few indexes...

  • Base Table with 1,000,000 rows (3.8MB vs. 15.26MB)
  • 6 nonclustered indexes (22.89MB vs. 91.55MB)

So, we're looking at 25MB vs 106MB - and, just to be clear, this is JUST for 1 million rows and this is really JUST overhead. If you create an even wider clustering key (something horrible like LastName, FirstName, MiddlieInitial - which let's say is 64bytes then you're looking at 427.25MB *just* in overhead..... And, then think about how bad that gets with 10 million rows and 6 nonclustered indexes - yes, you'd be wasting over 4GB with a key like that.

And, fragmentation costs you even more in wasted space and time because of splitting. Paul's covered A LOT about fragmentation on his blog so I'll skip that discussion for now BUT if your clustering key is prone to fragmentation then you NEED a solid maintenance plan - and this has it's own costs (and potential for downtime).

So............... choosing a GOOD clustering key EARLY is very important!

Otherwise, the problems can start piling up!

kt

Well, today has definitely been enlightening. I've been around and around with a few folks (mostly Tim Huckaby and Richard Campbell - who are BOTH great friends and people whose opinions I greatly respect!!!) and I've come a conclusion... Paul started down this line of commentary here: http://www.sqlskills.com/BLOGS/PAUL/post/Mandatory-SQL-breathalyzer-test.aspx and this was in reference to something I was saying in our long thread...

But, where I'm at is somewhere that I'm not sure entirely understand AND I need your help. I need to understand where and with whom do you think each role should be.... Here's what I think:

The DBA has to be able to trust the code that a developer submits – plain and simple (yes, this is absolutely true). But, (IMO) it’s still the database developer that writes the code and they’re the ones that need to know what goes on IN the db and the best practices of database design/development. The database developer should know how to create scalable schemas, they should know best practices, they should have a fundamental understanding about their specific platform as well (whether it's SQL Server or DB2 or even that O one :)). At a minimum, they should have a basic understanding of things to avoid... But, I'll give you this - it's not easy and it's not like the best practices really jump out at you. I'll even argue that some of the defaults INSIDE of SQL Server itself aren't ideal (ah, let me give a list: database size and autogrowth settings have HORRIBLE defaults, the PRIMARY KEY of a table is - by default - enforced with a UNIQUE CLUSTERED INDEX (and, well, there are many cases when that's not the best CLUSTERING KEY even though it might be a completely legitimate PRIMARY KEY), databases default to the FULL recovery model but run in a pseudo simple mode where log truncation on checkpoint occurs automatically (ah, until you do your first backup...)). Ugh, many of these things can catch you out and can completely cause you grief... and some are horribly complex to change after the fact (like what is your PRIMARY KEY and/or what is the index type of a key). So, it's definitely NOT easy. BUT, back to my original point - someone needs to know this and I need to find out who YOU think should? I think it's the database developer and NOT the DBA. I think there should be a somewhat blurred line between the database developer and the DBA. IMO, the DBA needs to know operational stuff (backup/restore, availability, security, maintenance… and, this is a HUGE list of things to really understand in and of itself - especially when you're managing this for many servers), it’s not their job (IMO) to know TSQL coding, error handling, stored procedure recompilations (although I can debate both sides of this last one), I would even argue in many cases it's not the DBA that should make the primary decisions on base indexing - I *REALLY* feel that it is the database developer - they should know the database schema, know the general access patterns AND know how the platform uses those objects so that they can make better design decisions EARLY. My favorite line is "the sooner you begin to code, the longer it's going to take" and it's really true.

However, with all of the discussions I've had today (and, I'm debating on posted an edited version of each of our rants... it's colorful in places :)), it feels to me like (and what I’m seeing more and more and more) is something that got mentioned early on in the thread – the database is “just plumbing” and the real application is in the client. Is this true... are there really only two players:

The Application Developer (who doesn't touch (because they're not allowed) OR possibly because they don't want to) the database
The DBA (who doesn't allow the AppDev to touch the database BUT doesn't touch it for them either because they're too busy and/or too protective over changes??)

Seriously, IS there a database developer in this formula above? And, is it a distinct person? I don't think so (personally)... I think the Application Developer SHOULD be the database developer. I think the AppDev should do EVERYTHING in terms of schema, keys, indexes, procedures, etc. AND once in production the two should work together to handle efficient and effective change control policies and procedures. There is a way to make this work. But, when good design doesn't exist - the database performance/scalability suffers, bad practices fester and then everyone blames SQL Server and no one really makes a change. In some cases, people just accept the poor performance as "how it works".

Thoughts??
kt

**** UPDATE March 17, 2009 ****
Be sure to read the comments.
This doesn't appear to be as bad as it seemed...still glad I did the rant though; I learned more about what it's actually doing! :) 
And, it's still good to have good database design - NO MATTER WHAT!
****

OK... today's been a bit of a weird day. I've been doing research for some upcoming events and in general, I'm learning a few things. However, as one does with the web - you can get side-tracked. And, in watching the video (recommended by my good friend Timmie) highlighed in this blog post about DBAs and Developers: http://www.benhblog.com/2009/03/linq-dba-vs-developer.html, I followed a few other links as well and found this direct quote:

"I was surprised to learn that EF decided that since there was not a primary key it would just use all the non-nullable columns as a concatenated primary key.  This might not be what you want."

This quote is also from Ben's blog (and I agree) and specifically this post: http://www.benhblog.com/2008/11/entity-framework-and-tables-with-no.html. And, David Yack (a fellow RD) makes a GREAT comment that you can deselect this but I honestly cannot believe that this is the default???!

My response is....... Seriously, are you FREAKING kidding me?

OK, I don't want to go on a total rant here but this is so bad that I almost cannot believe it. And, well, it's somewhat par for the course with the day that I'm having (with other things that I've learned and been "working on"). I'll come back to more topics that fall under the category of "are you kidding me" in a second. Ah ha... there's a new category for me!!!

Anyway, why is that so bad?

  1. There's no guarantee that all non-null columns are even going to be unique (but, this is the LEAST of the problems)
  2. Making a WIDE composite PRIMARY KEY means that the underlying structure which is a UNIQUE CLUSTERED INDEX won't be as efficient. SQL Server creates this index to enforce the relational rule for a PRIMARY KEY-that it is UNIQUE. So, SQL Server will make this WIDE COMPOSITE SET OF COLUMNS the table's one and only clustered index (ugh - I hope at least it changes the index type to nonclustered.... someone please let me know if this is the case)?? However, if it's not.... then, the clustered index will now force SQL Server to order this table by this combination of columns (which probably don't have any form of pattern) so, this clustered index is probably going to be very prone to fragmentation (which in turn can negatively affect performance). And, even worse, ALL of the non-clustered indexes will ALSO be rebuilt to include ALL of the columns of the clustered index which are not already in the non-clustered index. In fact, this step alone - if a clustered index IS created - will cause ALL of the nonclustered indexes to be rebuilt in addition to the table being rebuilt.

So, please tell me this isn't the case? Sorry if this is a rant but I honestly cannot believe this one.

This is another reason where people just weren't even thinking about what was going on in the backend. Another scenario where a little bit of background knowledge of the "data store" would have REALLY helped.

Oh my?
kt

Theme design by Nukeation based on Jelle Druyts