Monday, March 12, 2007

Last week while Paul and I were in the UK delivering a one day seminar on Crucial Database Maintenance Techniques, we met David McMahon from the Next Generation User Group. They're doing some exciting things in the UK and even for the wider community - for example - podcasts. Paul and I were interviewed for one and it's ready for download here.

Enjoy!
kt

Monday, March 12, 2007 6:59:19 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Saturday, March 10, 2007

This is a bit of a rare post for me as most of my posts are technical. I guess it's a bit of a change in attitude, as a few things (from my personal life) have come to light recently. (and, I thank Richard and Carl for this :) :).

Anyway, I'm sitting in Tokyo/Narita airport waiting to board my final flight home to Seattle. This is after a trip which has brought me over every longitude possible - in only 8 days. It sounds "glamorous" as many people always tell me "how lucky I am that I get to see the world" and while many trips allow me to see some amazing things AND I absolutely LOVE what I do... this trip really wasn't all that much fun. The long story short is that my Father became very ill where he was vacationing. It really didn't sound all that good when we heard (which was Wednesday - in London - where Paul and I had just completed a one-day workshop on Database Maintenance on Tuesday). Anyway, we thought that even if he were to make it, we knew he couldn't do the return travel alone. I was a bit of a basket case and in the short period of time we had to make a decision the only logical one was to get to him to help him get home (and for Paul to help me through it). Fortunately, for me, everyone (Miracle, Paul, Tony Rogerson, Tony's wife Alex, etc...) were extremely supportive and everyone just said they'd take care of what we left behind in our wake of crazy travels. And... well, that's where we are now (on our way home). We're amazingly happy to say that he's getting his strength back (slowly) and that we're making our way back home today (which is why we're in Narita). In the end, we (Paul and I) had to miss the OpenWorld event in Lalandia (to which we were truly looking forward) but I can't say that we regret the decision in any way.

Having said all of that, we (Miracle, Paul and I) are looking at options to give the OpenWorld attendees something special for our not having been there. We're looking at options that might include a special LiveMeeting presentation with an open Q&A as well as options for later in the year when both Paul and I are back in Europe (which is at least once or twice more this year).

So, while I'm not so fond of lots of [my] personal [life] posts (I certainly don't mind other people's posts :), we hope that you'll accept our sincerest apologies for missing the event. But - we have to admit - without Paul there, there must have been more beer.

Cheers! And, thanks for reading......
kt

Saturday, March 10, 2007 11:21:21 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 
Tuesday, March 06, 2007

Another great DotNetRocks interview has been completed. It's Paul Randal's session on Disaster Recovery, DBCC, Index fragmentation (and defrag) and [unfortunately for me] a lot more. All I can say is that I was ambushed...

thanks Richard
   thanks Carl...

Enjoy: http://www.dotnetrocks.com/default.aspx?showNum=217
kt

Tuesday, March 06, 2007 3:47:08 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 
Monday, March 05, 2007

I had a discussion earlier today (with Paul Randal) about many misconceptions that exist about upgrading databases and more importantly, about "downgrading" databases. Really, the issue is that I've heard people get frustrated when they find that things like backup/restore works FROM SQL 7.0/2000 TO SQL Server 2005 but not the other way around - even if the database is in SQL Server 2000 (80) compatibility mode. First and foremost, compatibility mode only affects parsing, query processing, and general data manipulation; it does not affect physical storage (well, there's more to it than that but that's a general overview). When you upgrade a database to SQL Server 2005, you WILL benefit immediately from changes in the storage engine, etc. regardless of compatibility mode. Compatibility modes are there to give you time in updating/upgrading your code - if/when necessary. Most code will work when upgrading but some code may not be supported because of changes to keywords, syntax changes, etc... The best thing to do is check your application compatibility with the Upgrade Advisor. I did a a couple of webcasts on Installation/Upgrade as part of my 11-part series on TechNet. See the blog entry for the entire series here. Part 3 and part 4 are focused on Installation and Upgrade and their associated blog entries have a lot of additional links (including links to the Upgrade Advisor as well as a series of things you might want to do before you upgrade). Also, be sure to checkout the upgrade site off of the main Microsoft SQL Server site.

How to move USER databases around - a quick list of what's supported between versions

Backup/Restore from 7.0 to SQL Server 2000
Detach from 7.0, copy the files, then attach to SQL Server 2000
Backup/Restore from 7.0 to SQL Server 2005
Detach from 7.0, copy the files, then attach to SQL Server 2005
Backup/Restore from 2000 to SQL Server 2005
Detach from 2000, copy the files, then attach to SQL Server 2005

Why use Backup/Restore?

PROs

  1. Because you have a backup! This will allow you to go back to the version from which you came. However, without any changes made on the uplevel version.
  2. Because it doesn't require the database to be taken "offline" when the backup is performed (note: that this is both good and bad - bad because you don't really know the exact point in time to which the database reconciles...which may not matter if you're just testing).
  3. Because the backup will be the size of data only and will not include database free space. Free space is not backed up (e.g. a database with a 100GB data file with only 20GB of data should yield a file that's roughly 20GB in size). I say "roughly 20 GB" because the internals of a backup require that the transaction log records for the activity that occured during the backup process are also backed up with the full database (or differential) backup. This is actually the basis for why transaction log backups are not supported during a full/differential backup in SQL Server 2000 (they are in SQL Server 2005). However, this is the reason why the transaction log cannot be cleared while a full or differential is ALSO running in SQL Server 2005.

CONs

  1. You don't know the exact point in time to which the database reconciles (it will be the time that the backup completed) AND logs CAN be restored uplevel as well. NOTE: If you're interested in creating an exact point in time version of the database - consider putting the database into "restricted user" mode or "single user" mode (so that user transactions are not allowed during the backup). Again, this may not be a concern.
  2. It takes time to complete the backup (there are four phases of a restore: create/initialization, copy, redo, undo). Make the create/initialization *much* faster by enabling Instant Initialization. See my Instant Initialization blog post for more details.

Why use detach/attach?

PROs

  1. It's simple, it's fast... but once detached then the database is OFFLINE.
  2. You know the exact point in time to which it reconciles because no transactions are allowed into the database once it is offline. Again, this may not be a concern.

CONs

  1. You must copy the entire file - including the free space to the other location and the network copy might be the most expensive (meaning time consuming) part of the entire process. However, once copied, the files do NOT need to be created on the destination because on attach, these files will be used.
  2. The database is offline once detached and during copy.
  3. If you don't COPY the files and instead you attach the detached files, you will have ABSOLUTELY NO WAY of getting back to the version from which you detached. (ah, this is probably the single most important reason for why I prefer backup/restore!)

Summary for "How to move USER databases around"
Between these versions "upgrades" are supported ONLY to the uplevel version. There is NO single (or simple) feature that can be used to get back to the version from which you started (without exporting/importing all of the data). There is also no undocumented back-door to do this either (no trace flags, no DBCC commands, NA DA!!! as per Paul).

What about System Databases?
This is a whole other can of worms to open and the easiest thing I can say here is that you generally should not move/upgrade system databases across machines. These are upgraded through "in-place" upgrades of SQL Server (on the same machine) or through manual migrations (to different machines) of the users/objects (SQL Agent Jobs, user-defined system procedures in master, logins in master, etc.). This is not an easy process (manual migration) but may prove to be a better choice over an upgrade in place if something were to go horribly wrong (which is unlikely but I'm a "what's the worst case scenario" person when it comes to availability :). The other benefit of NOT upgrading in place - and instead MOVING databases from one version to another on an upgrade - is that you get to complete some basic "spring cleaning". New hardware, freshly formatted, freshly installed/configured OS, clean disks, etc. This can often alleviate some of the strangest, hard-to-determine-problems, that have plagued you for weeks/months. Like I said, this is a whole other can of worms to open!

But - if you're interested in moving system databases around on the SAME machine, here's a great KB that covers the required options, syntax, rules and restrictions: How to move SQL Server databases to a new location by using Detach and Attach functions in SQL Server

And - if you're interested in transferring logins and passwords between instances (for upgrade or for sync'ing two servers used to create a standby partnership - with Database Mirroring and/or Log Shipping), here's a great KB article that includes links to other articles even uplevel transfering of logins (like 2000 to 2005): How to transfer logins and passwords between instances of SQL Server

And - that's it for this week (probably)... two in a row is not likely to become three in a row (just setting expectations :) :) :),
kt

Monday, March 05, 2007 3:22:45 PM (Pacific Standard Time, UTC-08:00)  #    Comments [1]  | 
Sunday, March 04, 2007

Instant Initialization is a new feature of SQL Server 2005 that is based on an NTFS feature that was added to Windows XP (and therefore is also available in Windows 2003 Server). It's a feature that's seemingly simple; it allows file allocation requests to skip zero initialization on creation. As a result, file allocation requests can occur instantly - no matter what the file size. You might wonder why this is interesting or why this make a difference? Most file allocation requests are small requests, with small incremental changes (like .doc files, .xls files, etc.) but database files can be rather large. In fact, they should be rather large as pre-allocation of a reasonable file size is a best practice to reduce file fragmentation. Additionally, autogrowth causes performance delays (more so in 2000 than 2005) but it's generally something that you want to avoid when possible. As as result, database creation times can take minutes to hours to days, depending on file allocation request. But - it's not just for database creation. ALL file requests can leverage this feature: file creation for a new database, adding a file to an existing database, manually or automatically growing a file and (IMO - the best) restoring a database where the file (or files) being restored does not already exist. The reason I think the last feature is the best is that it can reduce downtime if a database is damaged and allow you to get back up and running more quickly. This is especially important for databases that cannot leverage partial database availability, which is an Enterprise Engine feature. So, to give you some motivation, here is a test that I performed just to have some interesting and comparable numbers.

Performance Test with Zero Initialization
Hardware: Dell Precision 670 Dual Proc (x64) with Dual Core, 4 GB Memory, RAID 1+0 array w/4-142 GB, 15000rpm disks
   CREATE DATABASE with 20 GB Data file = 14:02 minutes
   ALTER DATABASE BY 10 GB = 7:01 minutes
   RESTORE 30 GB DATABASE (EMPTY Backup) = 21:07 minutes
   RESTORE 30 GB DATABASE (11GB Backup) = 38:28 minutes

Performance Test with Instant Initialization
Hardware: Dell Precision 670 Dual Proc (x64) with Dual Core, 4 GB Memory, RAID 1+0 array w/4-142 GB, 15000rpm disks
   CREATE DATABASE with 20 GB Data file = 1.3 seconds
   ALTER DATABASE BY 10 GB = .4 seconds
   RESTORE 30 GB DATABASE (EMPTY Backup) = 5 seconds
   RESTORE 30 GB DATABASE (11GB Backup) = 19:42 minutes

SQL Server can leverage this feature for DATA file requests ONLY; the transaction log must be zero initialized because of its circular nature... which brings me to why this is not "on by default" or more specifically - HOW do you get this feature. First, there's absolutely no syntax change required - SQL Server will use it if it has access to it (so what does that mean?). The SQL Server service must have been granted the Windows permission - "Perform Volume Maintenance Tasks". By default, Windows Administrators have this permission but as yet-another-best-practice, we recommend that your SQL Server run under an account that is a "lower privileged" account (i.e. NOT an administrator). Other ideal options include running as "network service" or running as a dedicated user/domain account that has very few permissions except to SQL Server and it's required resources. A lot of folks recommend using network service for its simplicity (it doesn't have a password and it has limited network/local rights) and I agree with this as long as it's truly dedicated to SQL Server. If network service is used by other services on the same machine then you could compromise security of your SQL Server (or the other services) with the elevated permissions that SQL Server grants or visa versa by the permissions that other applications may have granted to network service. Again, I'm not against network service BUT I would check your local permissions to see if there's anything that jumps out at you. If you've installed other applications/services/etc. you may have already compromised the security of the network service account. I would love to know if anyone has a quick/easy way to check windows permissions to see what may have been granted in addition to the default permissions OR even a link to where the defaults are listed online... I've had trouble doing exactly this when searching, etc... feel free to post links/comments in your comments!. Anyway, with a dedicated user account, you can make sure that it's not compromised because you only use it for SQL Server. But, even these have negative issues - like required passwords that networks invalidate after n days and that you must change on a server/service basis. From a management perspective, this can be difficult.

SIDE NOTE: Managing the service account password is a lot easier in SQL Server 2005 with the SQL Server Configuration Manager (SQL-CM). The SQL-CM allows you to change the password to a service without an active connection (meaning even if the service isn't started) and it invalidates the login token so that password changes don't require a restart of the service. SQL-CM also has a command-line interface and is scriptable with WMI. The WMI Provider allows server settings, client and server network protocols, and aliases to be scripted through the WMI Provider by means of simple VBScript code (or by using the command-line tool). What you could end up doing is creating a script that changes the password of your services on all of your servers (for example, when a password policy is enforced that requires that the passwords of service accounts be changed). I've recently completed a whitepaper that highlights the Management tools (it's really just an overview but even then it turned out to be quite large as we looked at the tools in many different ways). I'll certainly let you know when the whitepaper is published (which should be within the next couple of weeks).

Granting the permission "Perform Volume Maintenance Tasks"
To use instant initialization, your SQL Server service must be running with an account that has the required privilege. If your SQL Server service is running as a local administrator this permission already exists. For a service account which is not a local administrator (again, recommended!), the necessary privilege to grant is Perform Volume Maintenance Tasks. This permission can be granted by an administrator through the Local Security Policy tool (Start, All Programs, Administrative Tools) and once granted, SQL Server automatically uses instant initialization. IMPORTANT NOTE: If this permission is given while SQL Server is running, then SQL Server must be stopped and restarted. However, once the server is restarted, no other syntax or permissions are needed. All data file creation/extension options will use instant initialization for data files created on NTFS volumes when SQL Server 2005 is running on Windows XP or Windows 2003 Server.

Why isn't this on by default?
OK, so after all of this... the gains that you see and the lack of changes to syntax, etc. You're probably wondering why this isn't on by default? It's a security issue. The biggest vulnerability is with SQL Server Administrators who are NOT also Windows Administrators. Windows Administrators have access to local files and can easily see all files stored on the local server. For files that are not encrypted (and are not already open to another process), an Administrator can open and/or modify these files using an appropriate editor. For files that are encrypted, an Administrator can at least view the encrypted information using a hex editor. By granting “Perform Volume Maintenance Tasks” to a SQL Server instance, you are giving administrators of the instance the ability to read the encrypted contents of a recently deleted file (ONLY IF the file system decides to use this newly freed space on the creation of a new database - created with instant initialization) with the undocumented DBCC PAGE command.

SIDE NOTE: The format for DBCC PAGE is undocumented in the Books Online but you will find tips and tricks on many “official” Microsoft blogs. The SQL Server Storage Engine blog (http://blogs.msdn.com/sqlserverstorageengine/ has some very good blog posts on internals and often describes undocumented commands. Specifically, check out the blog entry titled: How to use DBCC PAGE. The first three components are fairly straightforward: database id (or name), file id, and page id. The fourth component is the tricky one: printopt (or print option). The print options for DBCC PAGE are as follows (taken almost verbatim from the SQL Server Storage Engine blog - and Paul said I could :):

0 - print just the page header
1 - page header plus per-row hex dumps and a dump of the page slot array (unless it’s a page that doesn't have one, like allocation bitmaps)
2 - page header plus whole page hex dump
3 - page header plus detailed per-row interpretation (in this case, this option is not available - even if the data you are trying to read is from a previously deleted database because the metadata is not accessible only the raw page data)

Bear in mind, even if you can access this [encrypted] information, making sense out of this data will be challenging if not incredibly difficult.

In production environments, database files should NOT be located on file server drives - especially those where restricted and/or sensitive files are stored. As a result of prudent security measures, the true impact of using instant initialization is low. However, because this vulnerability exists, this feature is off by default.

Want to try this?
I've written a lab on Instant Initialization AND it has an interesting sequence of exercises (using multiple instances):

  • You create a database on instance: SQLDev01
  • Populate a large chunk of pages with very contrived (and easy to find) "junk" data

USE TestWithZeroInitialization
go

CREATE TABLE JunkData
(
   
JunkDataID int identity,
   
JunkDataValue char(8000) 
      DEFAULT REPLICATE('Junk', 2000)
)
go

SET NOCOUNT ON
go

DECLARE @Counter int
SELECT
@Counter = 0
WHILE @Counter < 20000
BEGIN
   
INSERT JunkData DEFAULT VALUES
   
SELECT @Counter = @Counter + 1
END
go

Drop the database from instance: SQLDev01

  • Create a new database on a different instance: SQLDev02 (which has been configured to use Instant Initialization) and hope that it uses the freed space by having dropped the first database
  • Start walking various pages (using DBCC PAGE) to see if you can view the "junk" data from the dropped database.

DBCC PAGE ('TestSecurityExposure', 1, 200, 2)
DBCC PAGE ('TestSecurityExposure', 1, 400, 2)
DBCC PAGE ('TestSecurityExposure', 1, 600, 2)
DBCC PAGE ('TestSecurityExposure', 1, 800, 2)
DBCC PAGE ('TestSecurityExposure', 1, 1000, 2)
DBCC PAGE ('TestSecurityExposure', 1, 1500, 2)
DBCC PAGE ('TestSecurityExposure', 1, 2000, 2)
DBCC PAGE ('TestSecurityExposure', 1, 2500, 2)
DBCC PAGE ('TestSecurityExposure', 1, 3000, 2)
DBCC PAGE ('TestSecurityExposure', 1, 3500, 2)
DBCC PAGE ('TestSecurityExposure', 1, 4000, 2)
DBCC PAGE ('TestSecurityExposure', 1, 4500, 2)

In most cases the very first output (for page 200) will return Junk data. If this is not the case, simply drop the TestSecurityExposure database and recreate it again. Sometimes it’s a timing issue and sometimes it could be a background process (like Windows Update) that uses the expected pages. Regardless, if you do get the same pages again - our contrived data should be easy to find.

You can certainly create the environment on your own and see if you can get it to work. OR, you can get a copy of our AlwaysOn DVD that has the appropriate lab environment. I tend to give away the AlwaysOn DVD at events I speak at (on Availability/Disaster Recovery) but I'm also happy to send a few out over snail mail (it needs 2GB of memory and 10GB of disk space for the virtual environment AND you need to have Virtual PC or Virtual Server installed - which are freely downloadable from Microsoft). Paul asked for DBCC CHECKDB information here (to get a free DVD sent to you) and I'm going to ask for instant initialization numbers and how this has helped. Post a comment here and then send me an email with your snail mail address information as well. I'm willing to do this for the first 10 responses... go!

Have fun...and thanks for reading!
kt

PS - For those of you in our UK (London) Event tomorrow, we're giving away the AlwaysOn DVD (and an even cooler SQLskills pen... lol). Now there's motivation if the content doesn't (NOT!).

Sunday, March 04, 2007 2:28:31 PM (Pacific Standard Time, UTC-08:00)  #    Comments [5]  | 
Monday, February 19, 2007

OK - I feel like I know a fair amount about SQL Server but sometimes I also feel like I don't :) :) I'm continuously amazed at how big a product SQL Server is... today was one of those days when I felt "I don't"!

I've been wanting to know more and more about the new features in the tools and the direction in which the tools are headed so... I setup a meeting with Paul Mestemaker (it helps that I live in Redmond and I'm working on some SP2 resources for the team :); it was a great meeting. Some exciting new features and some great new directions in which the tools are headed. I like the way they're thinking and I especially like the options that are now in place to discover, use and customize the "reports" feature within Management Studio (just to name one!). What I learned (that was the highest on my "I didn't know that" list) is that quite a few gems are released as part of the Feature Pack for SQL Server. The Feature Pack "is a collection of standalone install packages that provide additional value for SQL Server 2005. It has been updated for SP2." From that description, it doesn't leap out at me as exciting AND I often know about many of these tools through other channels - but usually it's just "a tool at a time". The thing that's nice about the FP page is that it seems to be a nice and central, single location for ALL of the "add-ons" for SQL Server. It includes things like the Upgrade Advisor (which I typically point people to individually on it's main page) and will include (once it ships) things like the BPA (SQL Server 2005 Best Practices Analyzer) BUT it also includes things like a standalone download for SQLCMD so that you don't need to install all of the tools if you just want this lightweight client for automation. Additionally, it includes the SQL Server 2000 DTS Designer Components if you want to edit/modify/maintain DTS packages in 2005 before you rewrite/convert them to SSIS.

So - the point is that there's lots of great stuff out there and sometimes it just takes another person, a blog entry, or a few minutes hitting a company site to see what's new. I'd strongly suggest that you and your team pick a morning - maybe once or twice a month (and round robin who brings the coffee/doughnuts :)) to just browse around and see if there's anything new on your hardware, software and other supporting sites - especially those that don't already have a blog/rss feed (or other form for notifications). No one needs to know everything but knowing where to look can really make all the difference when you do need something (or when you have a concern/problem). And browsing a few sites (occasionally) might make the difference in applying a hotfix/patch before something becomes a big problem. Staying current with hotfixes, service packs, bios updates, firmware updates, etc. is difficult so make it a team effort.

Speaking of service packs, here's the primary page for SQL Server 2005 SP2: http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/servicepacks/sp2.mspx
And don't forget the Books Online Update as it is NOT installed when you update an instance to SP2: http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx

Also, be sure to watch the SQL Server Manageability Blog (aka Paul Mestemaker's blog) moving forward as he'll have the first news about BPA and many other tips/tricks with regard to the tools.

Finally, (and this is great timing too), Paul Randal - prolific author at the SQL Server Storage Engine Blog - blogged about all of the active "SQL Server Product Team"-related blogs here.

And........ if that doesn't keep you busy, I'm not sure what will! :)

Instant Initialization technical details are next and then I'll get back to the Clustered Index Debate. Thanks for reading!
kt

Monday, February 19, 2007 7:58:37 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 
Tuesday, February 13, 2007

Enterprise only.

OK - I really need to blog more and well - I'm starting today by blogging a "quickie" blog entry on something that I just learned recently and that most of us (who speak/write/whatever on SQL Server) have been saying incorrectly...even marketing :). What's been said is that the new SQL Server 2005 feature "Instant Initialization" is an Enterprise Only feature (remember that Enterprise Only includes ALL skus that have the enterprise engine (EE) - the EE is in Enterprise Edition, Enterprise Eval and Developer Edition). Well... that's not the case. And - personally, I never tried it on standard edition as most of my customers are enterprise customers OR we're doing development work on the Developer Edition. All I have to say here is COOL! Now - I'll post another entry (shortly) that tells you all about Instant Initialization as I think it's a very simple and important feature to allow (no, it's not necessarily on by default - this is part of why I need another blog entry).

As for upcoming events - there are 3 in March.

March 6 - Reading, UK
   
One day workshop on Crucial Database Maintenance Techniques hosted by Tony Rogerson of sqlserverfaq.com.

March 8-10 - Lalandia, Denmark
   
SQL Server OpenWorld hosted by Miracle Denmark.

March 25-29 - Orlando, Florida - USA
   
SQLConnections hosted by Penton Publishing and SQL Server Magazine.

Many of the above events are focusing on HA/DR and Database Maintenance and are copresented with Paul Randal. He blogged about these events here.

And - I'll be back soon! I promise!! (notice that the time between blogging is decreasing - in general :)
kt

Tuesday, February 13, 2007 1:48:16 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Friday, February 02, 2007

Well, I've promised to blog more and I'm really going to try to do so. This morning I got the perfect question/comment (in email) to respond to and after working through a response that was taking me upwards of 3 hours (you'll learn later why I have 3 "spare" hours :)......... I figured that it was time to turn the response into a blog post. ;)

Background: The Clustered Index Debate
In the years since the storage engine was re-architected (SQL Server 7.0+) there's been constant debate on how to appropriately choose the clustered index for your tables. I've generally recommended an ever-increasing key to use as a clustered index and many find that counterintuitive. The primary reason people feel it's counterintuitive is that it creates a hotspot of activity. [If "hotspot" is not a familar term - a hotspot is solely an active place within your table.] Hotspots were something that we greatly tried to avoid PRIOR to SQL Server 7.0 because of page level locking (and this is where the term hot spot became a negative term). In fact, it doesn't have to be a negative term. However, since the storage engine was rearchitected/redesigned (in SQL Server 7.0) and now includes true row level locking, this motivation (to avoid hotspots) is no longer there. In fact (and probably even more counterintuitive), the opposite is true. Hotspots (specifically hot PAGES not hot ROWS) can be very beneficial because they; minimize the number of pages needed in cache, improve the likelihood of the required page already being in cache and in general, they minimize the overall amount of cache required. So, this is why many of us have changed our recommendation on where to create the clustering key in 7.0+. Instead of focusing on range queries we now focus on placing the clustering key on an ever-increasing key. In earlier releases, focusing on range queries for the clustered index reduced hotspots for insert/update and this in fact was the PRIMARY motivation to choose them, NOT range query performance! But - there are even MORE reasons to choose an ever-increasing key and they are based on internals as well. These internals are based on the significant changes made in the storage engine for 7.0+. For a quick start on these, I went through them in the Blog entry here.

And, today's email is not uncommon. This is the basis for the title clustered index debate. In general, there are still a lot of questions related to creating clustered indexes to improve "range query" performance. Don't get me wrong, there's definitely a benefit in performance for some range queries but the first thing to remember is that you get only one CL index per table (therefore only one type of range query can benefit). In the real world, t's not likely that you want to see your data exactly in the same way all the time. Therefore it's very challenging to come up with the "right clustered" index if you're using range queries as your strategy. Even worse, the affect of choosing the clustering key to improve range queries causes problems for modifications against that table (INSERTs/DELETEs and UPDATEs). So.............. this is what started my day today. A great email from a reader that brought up these points. The question/comment (modified to hit only the highlights and to protect their identity :) was this:

The most important characteristic for a Clustered Index key is to satisfy range queries. More often than not, if a sufficient range of data will be scanned, the Optimizer will choose the Clustered Index over all others due to the excessive cost of Bookmark Lookup operations. As such, the table KEY is a more suitable clustered index candidate than any surrogate (few every query a database by range of surrogate keys).  [kt note: this second sentence is not entirely true... SQL Server will certainly choose a clustered index over non-clustered that require table scans but there are A LOT of algorithms that SQL Server can use instead of either of these and my examples later show this... non-clustered covering seekable indexes, non-clustered scanable indexes, index-intersection, etc. ] 

Now, when the default behavior for SQL Server was designed such that the PRIMARY KEY was chosen as the default clustered index, it was exactly for this reason.  It is the business key.  It would satisfy uniqueness (by definition of logical KEY).  And, it is well suited for a wide variety of range scans.  However, this is when the PRIMARY KEY is defined on the Business Key of the data.

But, when you introduce the usage of surrogate keys (i.e., IDENTITY) as a physical implementation, and thus transfer the PRIMARY KEY definition to it, two things must be considered.  First, the Business Key this IDENTITY will be a proxy for must still exist as it is still apart of the logical design.  As part of the physical design, the logical key needs to be implemented as a physical constraint to maintain logical uniqueness.  Second, just because a proxy has been defined does not make it a natural candidate for the clustered index.  The business key still maintains this distinction.

What is often cited as the “reason” for IDENTITY PRIMARY KEY clustered index definitions is its monotonic nature, thus minimizing page splits.  However, I argue that this is the only “reason” for defining the clustered index as such, and is the poorest reason in the list.  Page Splits are managed by proper FILLFACTOR not increasing INSERTS.  Range Scans are the most important “reason” when evaluating clustered index key definitions and IDENTITies do not solve this problem.

Moreover, although clustering the IDENTITY surrogate key will minimize page splits and logical fragmentation due to its monotonic nature, it will not reduce EXTENT FRAGMENTATION, which can cause just as problematic query performance as page splitting.

In short, the argument runs shallow.

Luckily, this email arrived with perfect timing for me as I'm sitting in a "bootcamp" event on Always On technologies and I'm not speaking this morning (my colleague Bob Beauchemin is doing lectures on Scale Out technologies: Scalable Shared Databases, Service Broker, DPVs, etc.). Anyway, in addition to listening to Bob, I've decided to continue the blog series on "the clustered index debate". The first and most important point to stress is that minimizing page splits is NOT the only reason nor is it the most important. In fact, the most important factors in choosing a clustered index are that it's unique, narrow and static (ever-increasing has other benefits to minimizing splits).

The Clustered Index Debate Continued
First, there are many angles to look at wrt to "the clustered index debate" and it's not until all of the issues are reviewed, that this strategy (a monotonically increasing key) becomes obvious. So, I think it will probably take a couple of blog posts to really prove this. I'll start up this debate again here...... When you look at a general purpose table (which is most) where the table has ALL DML (S/I/D/U) then you are best off with an ever-increasing key (again, you have to look at the overall impact of all operations against the table - not just select... because I/D/U will also impact select in the long term). So, I'll break this down into each DML operation here. If you don't look at the overall impact, then large tables can end up having a tremendous number of problems once they're put into production. I've certainly heard this concern/debate before (and most people are skeptical at first glance) but when you look at the situation overall, you'll find that "finding the right balance" includes not just looking at range queries. In fact, here's a quick list of the things/tests/numbers/scenarios that help to prove my strategy:

  • Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is that lookups in the IAM/PFS to determine the insert location in a heap are slower than in a clustered table (where insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (CL) and where that order is ever-increasing. I have some simple numbers but I'm thinking about creating a much larger/complex scenario and publishing those. Simple/quick tests on a laptop are not always as "exciting". But - this is a well documented issue (IAM/PFS lookups) and poor performance on a heap is also referenced in this KB: PRB: Poor Performance on a Heap. note: this KB is quite dated and I don't actually agree with everything in this article however, the general concern of poor performance for inserts is still true on SQL Server 2005.
  • Updates are often faster (when the row needs to be relocated) and for the same reason (IAM/PFS lookups) BUT there are many types of updates and not all updates cause records to be relocated. Here are a few things to think about wrt to updates:
    • Updates that are completely in-place (some examples are where the update is updating a fixed-width column OR to variable-width columns where the row size doesn't change, etc.). These types of updates don't really care.
    • Updates that cause record relocation (where the row size changes) are definitely better by having a clustering key because the record relocation (which will be handled by a split) is defined by the clustering key
    • Updates to the clustering key are the WORST (in this case) which is one of the key reasons for having a cl key that is static (so we have to keep this in mind when we choose a clustering key).
  • Deletes aren't nearly as big of a concern BUT deletes in heaps create more gaps and more gaps creates more work in PFS/IAM lookups and while this helps to reduce wasted space, it still requires the time to find the space........ hence the slowed performance of Inserts/Updates. I've also written some blog entries that cover very interesting test cases for large scale deletes and why you'd want to consider partitioning to optimize for the "sliding window scenario" in this blog entry: MSDN Webcast Q&A: Index Defrag Best Practices - Fragmentation, Deletes and the “Sliding Window” Scenario and it's the LAST one!.
  • Selects.............. now this is the hardest one to go through in just a couple of bullets (ah, I guess this will lead to another one or two posts :) BUT I'll start by saying that the best way to tune the vast majority of range queries is through non-clustered [covering] indexes. But, it's also important for me to stress that I do NOT advocate covering every query (it's impossible to do). What's important to realize in terms of covering is that SQL Server 7.0 and up continues to include internal algorithms to improve performance when you don't have the "perfect" non-clustered covering seekable index and instead still gives better performance than going to the base table (or performing bookmark lookups - as mentioned in the mail...and I completely agree that these [bookmark lookups] can be evil!). To start this discussion, I'll give one of my favorite examples of a large-scale aggregate. The absolute best way to improve the performance is through an indexed view but the data can be gathered through many other algorithms - ideally through a non-clustered covering index that is in order by the group by and that includes the column(s) being aggregated. For example, take this query:

SELECT c.member_no AS MemberNo,
 sum(c.charge_amt) AS TotalSales
FROM dbo.charge AS c
GROUP BY c.member_no

On a charge table of 1.6 million rows here are the performance numbers to handle this aggregation:

  • Clustered table scan (CL PK on Charge_no) with a hash aggregate = 2.813 seconds
  • Index scan (non-clustered covering but NOT in order of the group by) with a hash aggregate = 1.436 seconds
  • Index scan (non-clustered covering in order of the group by) with a hash aggregate = .966 seconds
  • Indexed view = .406 seconds

Now this was a pretty small table (narrow rows and only 1.6 million rows) AND I didn't have any concurrent activity. The concurrent activity would have caused this to be even slower for hash aggregates, etc. Regardless, it proves the point (at least generally). Now, if I wanted to improve this range query then I'd have to cluster on the member_no column (and this is an ideal example because I often hear people say that clustering on a foreign key column helps to improve range/join queries - which can be true as well)......... But - this strategy has a few problems in addition to a few benefits (and we have to look at everything to be sure of our choice/decision). First, member_no is not unique (in the charge table) so SQL Server has to "uniquify" the rows. The process of "uniquification" impacts both time (on insert) and space (the rows will be wider to store each duplicate row's uniqufier). Also, theoretically it could change (in this case that's not true). Anyway, the time it takes for the clustered index is 2.406 seconds which is better than the clustered on the PK (of course) but if I were to also start modifying the rows (which creates splits) or even just insert 15% more rows........ then my table would become fragmented. At that point, the query performance should get worse in the table clustered by member_no table and it will continue to get even worse in the table clustered by charge_no (because of the worktable created in tempdb by the hash aggregate) BUT it won't be all that much worse in the non-clustered index examples (especially the covering index that's in the order of the group by - because this doesn't require a worktable).........

  • CL on member_no = 4.906 seconds
  • CL on charge_no = 6.173 seconds
  • Index scan (non-clustered covering but NOT in order of the group by) with a hash aggregate = 3.906 seconds
  • Index scan (non-clustered covering in order of the group by) with a hash aggregate = 1.250 seconds
  • Indexed view = .516 seconds

This is a great start to furthering the clustered index debate but I do have to admit that it's a counterintuitive and difficult issue to tackle because often isolated tests lead you to different conclusions. In this case though, the non-clustered indexes are better for this range query and the indexed view is the best (but I wouldn't consider the Indexed unless this were more of a read focused database rather than read/write). [and - of course, that statement warrants yet another blog post :)]

So, depending on the tests that you do - especially if you focus only on selects and you don't have modifications (i.e. fragmentation) - then they will make "creating the clustered index for range queries" appear to be best. Again, I'm not just saying this to prevent fragmentation, I'm saying this because I wouldn't use the clustered index OR a non-clustered index with bookmark lookups to handle this query. I'd consider a non-clustered covering that's seekable OR even a non-clustered covering that's scanable before I'd even choose the clustered (and that's what the optimizer would prefer as well). In the end it's really a bit of an art and a science to "finding the right balance" of indexing.

Oh - and if you arbitrarily add a column to use for clustering (maybe not as the primary key) that can help but many would prefer to use actual data... which means [potentially] creating your primary key with a new identity [or similar] column and this can impact your business logic (absolutely). I'm certain that certain tests can show that range queries are faster and it's absolutely correct that business application/usage can be a concern but when you look at the big picture (and the impact on I/D/U) then the benefits of the monotonically increasing key significantly outweigh these concerns. Simply put, a small/narrow key can help join performance and an ever increasing key can also help lookups for rows! (yes, definitely more coming)

Happy Friday! Have a great weekend. I'll try to continue more threads on this debate shortly!
kt

Friday, February 02, 2007 11:39:12 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 
Wednesday, January 31, 2007

Ok - a strange title indeed but it's been a strange couple of months. It all started with a "much needed" vacation and I thought that would get me through the persistent "cold" that I was having all through my travels. Anyway, vacation didn't help and I came back to find that what I had was a sinus infection. Janaury has been filled with antibiotics, sleep and well......still a lingering cough even though the month is over today. The long story short is that I'm starting to come out of it and I promise to start blogging a lot more frequently starting now. In fact, I have 3 or 4 entries in the queue that I'm plotting for upcoming posts.

To get you started - there are some great and NEW resources that were posted just this week by some of my SQLskills colleagues. If you read their blogs then you've probably already seen this but for completeness, I'm going to post them here:

Bob Beauchemin's Blog Entry: http://www.sqlskills.com/blogs/bobb/2007/01/30/TheFirstOfMyScaleoutWhitepapersIsAvailable.aspx
Bob's Whitepaper: Planning, Implementing, and Administering Scaleout Solutions with SQL Server 2005

Liz Vitt's Blog Entry: http://www.sqlskills.com/blogs/liz/2007/01/30/AnalysisServicesPerformanceTuningWhitepaperHasArrived.aspx  
Liz's Whitepaper: Analysis Services 2005 Performance Guide

Ok - there's my first post of 2007 and NOT my last. More to come. Thanks for reading and HAPPY NEW YEAR (at least I got that in January :),
Kimberly

Wednesday, January 31, 2007 9:41:13 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2]  | 
Thursday, November 30, 2006

OK, it's been a heck of a long time since I blogged... and for that I apologize. I'm also WAY overdue in my posting my demo scripts from a TON of conferences BUT... now everything has been posted. Check out the past events page on SQLskills and you can find the demo scripts that you're looking for......lots of fun stuff and TONS of scripts to play with and test.

Now - as for the reason(s).... many are business and for that I blame the following (yes, 17 flights [yes, one boarding pass is missing] over ONE 5 week trip with 7 events and 5 continent changes):

The other reason(s) are personal...suffice it to say that the last 6 months have been some life changing times for me and what I'm finding (or trying to find) is that ever important balance between work and life. During this holiday season (and always), I wish you and your loved ones well and I hope that you too can find (and cherish) what's most important to you.

So, you won't see anything else from me for this year but I do hope to be better (and more frequent) with blogging in the New Year and I especially hope to see you again at an upcoming conference. Let me leave you with the most exciting picture I've witnessed this year...it was during my one day of site seeing in Cape Town - where I went cage diving with Great White Sharks (and got horribly sea sick - which is rare for me) but where I was able to witness these amazing and powerful creatures....

Have a happy and safe holiday season!
kt

Thursday, November 30, 2006 10:42:45 PM (Pacific Standard Time, UTC-08:00)  #    Comments [4]  | 
Wednesday, September 27, 2006

There are SO many new features in the updated DasBlog that I don't even know where to start. Here's just a quick list of the things I really love:

1) Scrolling through the entries - if you click a specific entry's title, you will get a "last blog entry" | main | "next blog entry" section above the entry's title and you can scroll through the blog entries. It's really a nice way of reading through a series of blog entries.

2) the SUPER cool complete archive by category - here's a link: http://www.sqlskills.com/blogs/kimberly/Archives.aspx

And - those are just a couple of my favorite new features with 1.9 but you can read more about it from the source - Scott Hanselman - here. Definitely worth the upgrade, definitely worth the consideration to move from some other blog engines (depending on how much you blog) OR consider creating your own blog (here's why)........

Enjoy,
kt

Wednesday, September 27, 2006 6:31:38 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 
Friday, September 08, 2006

I am asked often.....

  • How can I get into presenting?
  • How do I get into external consulting?
  • How can I do something similar to what you do?

It's a challenge to work for yourself...

  • There are periods of feast and periods of famine
  • There are [way too many] times when the only person you can blame is yourself ;)
  • You need to do current work, work on getting future work and do the admin for past work (billing/receivables/paying subcontractors)
  • No one pays you when you are sick or when you want to vacation or when you just want to veg... and you really need to schedule all of that way in advance (which is challenging)

BUT - it's also very rewarding...

  • If you plan ahead - you can take time for yourself and you don't have to deal with a corporate limitation in vacation days
  • If the work is light you can take the morning to mess about and blog ;)
  • You feel really good when things go well

So...what's my point?

If you've been thinking that you'd like to expand your horizons and possibly get out and do something on your own, I'd *really* suggest NOT doing it without planning ahead. First, you should plan to have some cash reserves (the first 6-12 months are the hardest as you'll have a lot of setup costs (hardware, network, accounting and legal fees, etc.) AND it might take time to get customers and even more time to get paid). Second, you should already know from where your business will come. What I mean by this one is that you should start trying to figure out who might hire you before you take the leap. I'm not suggesting that you try to steal customers...not at all. You just want to think a lot about where you will get business before you have 0 income. And - that really leads me to the reason for this post. One of the best ways to get business is to be desirable. In this industry that means many things:

(1) Get really good with a technology that you like... which #2 will help you know if you are good or not AND whether or not you really do like it.

(2) Make a name for yourself - create a blog, write magazine articles, post on other people's blogs, answer newsgroup postings...start to learn more but also get more involved. Do a bit of research before posting and even test and/or create sample code. And present at conferences - this is a bit challenging BUT it can be done. Some conferences are community based (like SQLPASS) and they look for new and exciting presentations from the field... Others expect a lot more prior history and sometimes have a smaller number of speakers in general BUT, TechEd ITForum (week of Nov 13 in Barcelona) is doing something that's pretty interesting called "Speaker Idol" and this might be the break you need?

OK, so Speaker Idol is really the reason for why I've posted this blog entry. I wasn't really sure what the heck it was when I first bumped into it (accidentally actually - after someone asked me to verify something totally unrelated). Anyway, it's a competition for attendees of TechEd ITForum to submit a presentation that will be voted on as "best presentation" and the winner will get free attendance to next year's TechEd ITForum (and a VERY cool item to put on your blog, resume, etc.).

(3) Research the whole process of starting a business in your neck of the woods A LOT more before you actually do it!!! I truly wish you well if you do!

So, check out TechEd ITForum's Speaker Idol and have fun! It's certainly an interesting way to jump start the process if you've been thinking about this but just hadn't made the leap. GOOD LUCK!

Cheers,
kt

Friday, September 08, 2006 8:30:50 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 
Friday, September 01, 2006

Hey there everyone - Been a LONG time since I last blogged (sorry!)... key reason (fyi) is that I'm trying to find the ever-challenging work/life balance during the best months of the year (here in Seattle July/Aug are GREAT! months - September is almost always good too). Anyway, it's been a few weeks and I thought I'd catch you up... It all started with some travel (of course) and I was in Chicago for a SQLskills Immersion event and then off to London for another event with SQL Server FAQ (aka Tony Rogerson). I returned during the CRAZY travel restrictions and had to check two laptops (sigh) BUT they both made it back without damage after I purchased kitchen towels, bed sheets, a blanket and a duvet (and two new [cheap] suitcases) in which to pack them. When I got back, I relaxed! I've been up to see the Athabasca Glacier and Johnston Canyon and just this week I took off a day to do the "Lotus Experience" Advanced Driving Course at Pacific Raceways/ProFormance Racing School. I've always LOVED driving so driving 700 miles to Johnston Canyon didn't bother me at all (especially with great company and fantastic tunes - have you heard the latest Snow Patrol?) and the driving course in the Lotus was not only great fun but VERY informational. I've taken the Advanced Driving course before and I think EVERYONE should consider continuing their education in driving - things like Collision Avoidance, Advanced Braking (understanding proper driving under ABS), "high eyes" and so many other things...just make you safer on the road. OK, enough about all that fun/practical stuff.....let's get back to SQL.

Some great new resources are out:

And - it's now time to start enjoying Labor Day weekend, there ain't no labor going on here this weekend. Enjoy - and check out those links next week. ;-)

Cheers everyone,
kt

Friday, September 01, 2006 5:37:52 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2]  | 
Monday, July 10, 2006

Been thinking a lot about something that was mentioned in a few of my most recent posts... Especially when I get comments like "that's another item to add to our checklist" or "that's a good trick to add to our arsenal" and well, I thought in this blog entry I'd ask for your tricks that fall under the umbrella of designing for performance.

For example - do you change collations? I had a recommendation here.
For example - do you have a view that you want ordered? I had a recommendation (with caution) here. But - Adam Mechanic came back and said that he's used that trick to improve performance... and, I'm sure that's the case as well!
For example - do you have stored procedure parameters that are giving you grief? I had a series of recommendations in my Optimizing Procedural Code category here.

In fact, sometimes the best form of "hint" to SQL Server is NOT an optimizer hint but instead a more subtle change to the join (derived tables for example) or the infamous subquery -> join rewrite or the join -> subquery rewrite. I'm always asked "which is better - a subquery or a join" and I always answer YES. ;-)  OR taking a complex process and breaking it down into temp tables (I'd try to create views instead of temp tables first and see if the optimizer figures it out but there are cases when sometimes they just don't). Remember, it's not the optimizer's job to find the absolutely BEST plan; it's their job to find a good plan fast. And - they typically do. Really, no general "tricks" work ALL of the time and often they don't help at all but there are LOTS of things that I'm sure you've done and you really want to tell someone about it. How about here? I'm going to try to compile these tips/tricks into a best of...

Monday, July 10, 2006 7:27:35 PM (Pacific Standard Time, UTC-08:00)  #    Comments [4]  | 
Monday, July 03, 2006

Hey there everyone - If you're into Analysis Services, Integration Services, Reporting Services and BI in general - you'll want to check out Elizabeth Vitt's new blog. Liz has been specializing in BI since SQL Server started adding BI-centric components. She's got a wealth of information to share and many great insights into performance tuning as she's working on a BI Performance Tuning resource that will probably hit 100 pages (from current guestimates).

And - no surprise from Liz, she's out the gates running with her first entry on Influencing Aggregation Candidates.

Subscribe now!

And a big welcome to blogging for Liz!
kt

Monday, July 03, 2006 10:20:36 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1]  | 
Friday, June 30, 2006

Well, it's been a GREAT week here in Switzerland while working with my partner Trivadis. Today, we wrapped up a two-day course on Designing for Performance (in Geneva) while on Monday/Tuesday we did a two-day course on Indexing for Performance in Zurich. The food, the wine, the cheese, the butter, yum! Oh... and the questions/comments/technical focus, etc. has all been great. :) I'm flying home today (Sat) so wish me luck on having internet access at 36,000 feet again (probably not...I'm flying United instead of Lufthansa - and it's only Lufthansa that has FlyNet). Wow - can you imagine where we're going to be in only a couple more years? Internet access everywhere! (hmmm.. how do we escape? well, that's another blog entry for another day :)

Anyway, one of the great things about teaching is that I get to meet all sorts of people and work through all sorts of interesting problems... And - this blog entry is based on a discussion I had with [a very blogless ;-] Meinrad Weiss - a Trivadis employee/consultant AND a fellow RD. (I was bullied into blogging by CV so now I do my part and do the same to others)

I can't remember how it started but somehow a discussion started on Top 100 PERCENT being used in views. I mentioned that while this was a good trick in SQL Server 2000, it has been REMOVED from SQL Server 2005 (meaning that TOP 100 PERCENT does NOT order the data within a view). Theoretically, I agree with this decision - data within a view should NOT be ordered. A view should SOLELY define a tabular set. It is up to the query which is accessing the view to define the presentation of the view. Using TOP within a view should be limited to ONLY when it is used to further define the data set (i.e. TOP 10 PERCENT...