Friday, February 02, 2007

Well, I've promised to blog more and I'm really going to try to do so. This morning I got the perfect question/comment (in email) to respond to and after working through a response that was taking me upwards of 3 hours (you'll learn later why I have 3 "spare" hours :)......... I figured that it was time to turn the response into a blog post. ;)

Background: The Clustered Index Debate
In the years since the storage engine was re-architected (SQL Server 7.0+) there's been constant debate on how to appropriately choose the clustered index for your tables. I've generally recommended an ever-increasing key to use as a clustered index and many find that counterintuitive. The primary reason people feel it's counterintuitive is that it creates a hotspot of activity. [If "hotspot" is not a familar term - a hotspot is solely an active place within your table.] Hotspots were something that we greatly tried to avoid PRIOR to SQL Server 7.0 because of page level locking (and this is where the term hot spot became a negative term). In fact, it doesn't have to be a negative term. However, since the storage engine was rearchitected/redesigned (in SQL Server 7.0) and now includes true row level locking, this motivation (to avoid hotspots) is no longer there. In fact (and probably even more counterintuitive), the opposite is true. Hotspots (specifically hot PAGES not hot ROWS) can be very beneficial because they; minimize the number of pages needed in cache, improve the likelihood of the required page already being in cache and in general, they minimize the overall amount of cache required. So, this is why many of us have changed our recommendation on where to create the clustering key in 7.0+. Instead of focusing on range queries we now focus on placing the clustering key on an ever-increasing key. In earlier releases, focusing on range queries for the clustered index reduced hotspots for insert/update and this in fact was the PRIMARY motivation to choose them, NOT range query performance! But - there are even MORE reasons to choose an ever-increasing key and they are based on internals as well. These internals are based on the significant changes made in the storage engine for 7.0+. For a quick start on these, I went through them in the Blog entry here.

And, today's email is not uncommon. This is the basis for the title clustered index debate. In general, there are still a lot of questions related to creating clustered indexes to improve "range query" performance. Don't get me wrong, there's definitely a benefit in performance for some range queries but the first thing to remember is that you get only one CL index per table (therefore only one type of range query can benefit). In the real world, t's not likely that you want to see your data exactly in the same way all the time. Therefore it's very challenging to come up with the "right clustered" index if you're using range queries as your strategy. Even worse, the affect of choosing the clustering key to improve range queries causes problems for modifications against that table (INSERTs/DELETEs and UPDATEs). So.............. this is what started my day today. A great email from a reader that brought up these points. The question/comment (modified to hit only the highlights and to protect their identity :) was this:

The most important characteristic for a Clustered Index key is to satisfy range queries. More often than not, if a sufficient range of data will be scanned, the Optimizer will choose the Clustered Index over all others due to the excessive cost of Bookmark Lookup operations. As such, the table KEY is a more suitable clustered index candidate than any surrogate (few every query a database by range of surrogate keys).  [kt note: this second sentence is not entirely true... SQL Server will certainly choose a clustered index over non-clustered that require table scans but there are A LOT of algorithms that SQL Server can use instead of either of these and my examples later show this... non-clustered covering seekable indexes, non-clustered scanable indexes, index-intersection, etc. ] 

Now, when the default behavior for SQL Server was designed such that the PRIMARY KEY was chosen as the default clustered index, it was exactly for this reason.  It is the business key.  It would satisfy uniqueness (by definition of logical KEY).  And, it is well suited for a wide variety of range scans.  However, this is when the PRIMARY KEY is defined on the Business Key of the data.

But, when you introduce the usage of surrogate keys (i.e., IDENTITY) as a physical implementation, and thus transfer the PRIMARY KEY definition to it, two things must be considered.  First, the Business Key this IDENTITY will be a proxy for must still exist as it is still apart of the logical design.  As part of the physical design, the logical key needs to be implemented as a physical constraint to maintain logical uniqueness.  Second, just because a proxy has been defined does not make it a natural candidate for the clustered index.  The business key still maintains this distinction.

What is often cited as the “reason” for IDENTITY PRIMARY KEY clustered index definitions is its monotonic nature, thus minimizing page splits.  However, I argue that this is the only “reason” for defining the clustered index as such, and is the poorest reason in the list.  Page Splits are managed by proper FILLFACTOR not increasing INSERTS.  Range Scans are the most important “reason” when evaluating clustered index key definitions and IDENTITies do not solve this problem.

Moreover, although clustering the IDENTITY surrogate key will minimize page splits and logical fragmentation due to its monotonic nature, it will not reduce EXTENT FRAGMENTATION, which can cause just as problematic query performance as page splitting.

In short, the argument runs shallow.

Luckily, this email arrived with perfect timing for me as I'm sitting in a "bootcamp" event on Always On technologies and I'm not speaking this morning (my colleague Bob Beauchemin is doing lectures on Scale Out technologies: Scalable Shared Databases, Service Broker, DPVs, etc.). Anyway, in addition to listening to Bob, I've decided to continue the blog series on "the clustered index debate". The first and most important point to stress is that minimizing page splits is NOT the only reason nor is it the most important. In fact, the most important factors in choosing a clustered index are that it's unique, narrow and static (ever-increasing has other benefits to minimizing splits).

The Clustered Index Debate Continued
First, there are many angles to look at wrt to "the clustered index debate" and it's not until all of the issues are reviewed, that this strategy (a monotonically increasing key) becomes obvious. So, I think it will probably take a couple of blog posts to really prove this. I'll start up this debate again here...... When you look at a general purpose table (which is most) where the table has ALL DML (S/I/D/U) then you are best off with an ever-increasing key (again, you have to look at the overall impact of all operations against the table - not just select... because I/D/U will also impact select in the long term). So, I'll break this down into each DML operation here. If you don't look at the overall impact, then large tables can end up having a tremendous number of problems once they're put into production. I've certainly heard this concern/debate before (and most people are skeptical at first glance) but when you look at the situation overall, you'll find that "finding the right balance" includes not just looking at range queries. In fact, here's a quick list of the things/tests/numbers/scenarios that help to prove my strategy:

  • Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is that lookups in the IAM/PFS to determine the insert location in a heap are slower than in a clustered table (where insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (CL) and where that order is ever-increasing. I have some simple numbers but I'm thinking about creating a much larger/complex scenario and publishing those. Simple/quick tests on a laptop are not always as "exciting". But - this is a well documented issue (IAM/PFS lookups) and poor performance on a heap is also referenced in this KB: PRB: Poor Performance on a Heap. note: this KB is quite dated and I don't actually agree with everything in this article however, the general concern of poor performance for inserts is still true on SQL Server 2005.
  • Updates are often faster (when the row needs to be relocated) and for the same reason (IAM/PFS lookups) BUT there are many types of updates and not all updates cause records to be relocated. Here are a few things to think about wrt to updates:
    • Updates that are completely in-place (some examples are where the update is updating a fixed-width column OR to variable-width columns where the row size doesn't change, etc.). These types of updates don't really care.
    • Updates that cause record relocation (where the row size changes) are definitely better by having a clustering key because the record relocation (which will be handled by a split) is defined by the clustering key
    • Updates to the clustering key are the WORST (in this case) which is one of the key reasons for having a cl key that is static (so we have to keep this in mind when we choose a clustering key).
  • Deletes aren't nearly as big of a concern BUT deletes in heaps create more gaps and more gaps creates more work in PFS/IAM lookups and while this helps to reduce wasted space, it still requires the time to find the space........ hence the slowed performance of Inserts/Updates. I've also written some blog entries that cover very interesting test cases for large scale deletes and why you'd want to consider partitioning to optimize for the "sliding window scenario" in this blog entry: MSDN Webcast Q&A: Index Defrag Best Practices - Fragmentation, Deletes and the “Sliding Window” Scenario and it's the LAST one!.
  • Selects.............. now this is the hardest one to go through in just a couple of bullets (ah, I guess this will lead to another one or two posts :) BUT I'll start by saying that the best way to tune the vast majority of range queries is through non-clustered [covering] indexes. But, it's also important for me to stress that I do NOT advocate covering every query (it's impossible to do). What's important to realize in terms of covering is that SQL Server 7.0 and up continues to include internal algorithms to improve performance when you don't have the "perfect" non-clustered covering seekable index and instead still gives better performance than going to the base table (or performing bookmark lookups - as mentioned in the mail...and I completely agree that these [bookmark lookups] can be evil!). To start this discussion, I'll give one of my favorite examples of a large-scale aggregate. The absolute best way to improve the performance is through an indexed view but the data can be gathered through many other algorithms - ideally through a non-clustered covering index that is in order by the group by and that includes the column(s) being aggregated. For example, take this query:

SELECT c.member_no AS MemberNo,
 sum(c.charge_amt) AS TotalSales
FROM dbo.charge AS c
GROUP BY c.member_no

On a charge table of 1.6 million rows here are the performance numbers to handle this aggregation:

  • Clustered table scan (CL PK on Charge_no) with a hash aggregate = 2.813 seconds
  • Index scan (non-clustered covering but NOT in order of the group by) with a hash aggregate = 1.436 seconds
  • Index scan (non-clustered covering in order of the group by) with a hash aggregate = .966 seconds
  • Indexed view = .406 seconds

Now this was a pretty small table (narrow rows and only 1.6 million rows) AND I didn't have any concurrent activity. The concurrent activity would have caused this to be even slower for hash aggregates, etc. Regardless, it proves the point (at least generally). Now, if I wanted to improve this range query then I'd have to cluster on the member_no column (and this is an ideal example because I often hear people say that clustering on a foreign key column helps to improve range/join queries - which can be true as well)......... But - this strategy has a few problems in addition to a few benefits (and we have to look at everything to be sure of our choice/decision). First, member_no is not unique (in the charge table) so SQL Server has to "uniquify" the rows. The process of "uniquification" impacts both time (on insert) and space (the rows will be wider to store each duplicate row's uniqufier). Also, theoretically it could change (in this case that's not true). Anyway, the time it takes for the clustered index is 2.406 seconds which is better than the clustered on the PK (of course) but if I were to also start modifying the rows (which creates splits) or even just insert 15% more rows........ then my table would become fragmented. At that point, the query performance should get worse in the table clustered by member_no table and it will continue to get even worse in the table clustered by charge_no (because of the worktable created in tempdb by the hash aggregate) BUT it won't be all that much worse in the non-clustered index examples (especially the covering index that's in the order of the group by - because this doesn't require a worktable).........

  • CL on member_no = 4.906 seconds
  • CL on charge_no = 6.173 seconds
  • Index scan (non-clustered covering but NOT in order of the group by) with a hash aggregate = 3.906 seconds
  • Index scan (non-clustered covering in order of the group by) with a hash aggregate = 1.250 seconds
  • Indexed view = .516 seconds

This is a great start to furthering the clustered index debate but I do have to admit that it's a counterintuitive and difficult issue to tackle because often isolated tests lead you to different conclusions. In this case though, the non-clustered indexes are better for this range query and the indexed view is the best (but I wouldn't consider the Indexed unless this were more of a read focused database rather than read/write). [and - of course, that statement warrants yet another blog post :)]

So, depending on the tests that you do - especially if you focus only on selects and you don't have modifications (i.e. fragmentation) - then they will make "creating the clustered index for range queries" appear to be best. Again, I'm not just saying this to prevent fragmentation, I'm saying this because I wouldn't use the clustered index OR a non-clustered index with bookmark lookups to handle this query. I'd consider a non-clustered covering that's seekable OR even a non-clustered covering that's scanable before I'd even choose the clustered (and that's what the optimizer would prefer as well). In the end it's really a bit of an art and a science to "finding the right balance" of indexing.

Oh - and if you arbitrarily add a column to use for clustering (maybe not as the primary key) that can help but many would prefer to use actual data... which means [potentially] creating your primary key with a new identity [or similar] column and this can impact your business logic (absolutely). I'm certain that certain tests can show that range queries are faster and it's absolutely correct that business application/usage can be a concern but when you look at the big picture (and the impact on I/D/U) then the benefits of the monotonically increasing key significantly outweigh these concerns. Simply put, a small/narrow key can help join performance and an ever increasing key can also help lookups for rows! (yes, definitely more coming)

Happy Friday! Have a great weekend. I'll try to continue more threads on this debate shortly!
kt

Friday, February 02, 2007 11:39:12 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 
Wednesday, January 31, 2007

Ok - a strange title indeed but it's been a strange couple of months. It all started with a "much needed" vacation and I thought that would get me through the persistent "cold" that I was having all through my travels. Anyway, vacation didn't help and I came back to find that what I had was a sinus infection. Janaury has been filled with antibiotics, sleep and well......still a lingering cough even though the month is over today. The long story short is that I'm starting to come out of it and I promise to start blogging a lot more frequently starting now. In fact, I have 3 or 4 entries in the queue that I'm plotting for upcoming posts.

To get you started - there are some great and NEW resources that were posted just this week by some of my SQLskills colleagues. If you read their blogs then you've probably already seen this but for completeness, I'm going to post them here:

Bob Beauchemin's Blog Entry: http://www.sqlskills.com/blogs/bobb/2007/01/30/TheFirstOfMyScaleoutWhitepapersIsAvailable.aspx
Bob's Whitepaper: Planning, Implementing, and Administering Scaleout Solutions with SQL Server 2005

Liz Vitt's Blog Entry: http://www.sqlskills.com/blogs/liz/2007/01/30/AnalysisServicesPerformanceTuningWhitepaperHasArrived.aspx  
Liz's Whitepaper: Analysis Services 2005 Performance Guide

Ok - there's my first post of 2007 and NOT my last. More to come. Thanks for reading and HAPPY NEW YEAR (at least I got that in January :),
Kimberly

Wednesday, January 31, 2007 9:41:13 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2]  | 
Thursday, November 30, 2006

OK, it's been a heck of a long time since I blogged... and for that I apologize. I'm also WAY overdue in my posting my demo scripts from a TON of conferences BUT... now everything has been posted. Check out the past events page on SQLskills and you can find the demo scripts that you're looking for......lots of fun stuff and TONS of scripts to play with and test.

Now - as for the reason(s).... many are business and for that I blame the following (yes, 17 flights [yes, one boarding pass is missing] over ONE 5 week trip with 7 events and 5 continent changes):

The other reason(s) are personal...suffice it to say that the last 6 months have been some life changing times for me and what I'm finding (or trying to find) is that ever important balance between work and life. During this holiday season (and always), I wish you and your loved ones well and I hope that you too can find (and cherish) what's most important to you.

So, you won't see anything else from me for this year but I do hope to be better (and more frequent) with blogging in the New Year and I especially hope to see you again at an upcoming conference. Let me leave you with the most exciting picture I've witnessed this year...it was during my one day of site seeing in Cape Town - where I went cage diving with Great White Sharks (and got horribly sea sick - which is rare for me) but where I was able to witness these amazing and powerful creatures....

Have a happy and safe holiday season!
kt

Thursday, November 30, 2006 10:42:45 PM (Pacific Standard Time, UTC-08:00)  #    Comments [4]  | 
Wednesday, September 27, 2006

There are SO many new features in the updated DasBlog that I don't even know where to start. Here's just a quick list of the things I really love:

1) Scrolling through the entries - if you click a specific entry's title, you will get a "last blog entry" | main | "next blog entry" section above the entry's title and you can scroll through the blog entries. It's really a nice way of reading through a series of blog entries.

2) the SUPER cool complete archive by category - here's a link: http://www.sqlskills.com/blogs/kimberly/Archives.aspx

And - those are just a couple of my favorite new features with 1.9 but you can read more about it from the source - Scott Hanselman - here. Definitely worth the upgrade, definitely worth the consideration to move from some other blog engines (depending on how much you blog) OR consider creating your own blog (here's why)........

Enjoy,
kt

Wednesday, September 27, 2006 6:31:38 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 
Friday, September 08, 2006

I am asked often.....

  • How can I get into presenting?
  • How do I get into external consulting?
  • How can I do something similar to what you do?

It's a challenge to work for yourself...

  • There are periods of feast and periods of famine
  • There are [way too many] times when the only person you can blame is yourself ;)
  • You need to do current work, work on getting future work and do the admin for past work (billing/receivables/paying subcontractors)
  • No one pays you when you are sick or when you want to vacation or when you just want to veg... and you really need to schedule all of that way in advance (which is challenging)

BUT - it's also very rewarding...

  • If you plan ahead - you can take time for yourself and you don't have to deal with a corporate limitation in vacation days
  • If the work is light you can take the morning to mess about and blog ;)
  • You feel really good when things go well

So...what's my point?

If you've been thinking that you'd like to expand your horizons and possibly get out and do something on your own, I'd *really* suggest NOT doing it without planning ahead. First, you should plan to have some cash reserves (the first 6-12 months are the hardest as you'll have a lot of setup costs (hardware, network, accounting and legal fees, etc.) AND it might take time to get customers and even more time to get paid). Second, you should already know from where your business will come. What I mean by this one is that you should start trying to figure out who might hire you before you take the leap. I'm not suggesting that you try to steal customers...not at all. You just want to think a lot about where you will get business before you have 0 income. And - that really leads me to the reason for this post. One of the best ways to get business is to be desirable. In this industry that means many things:

(1) Get really good with a technology that you like... which #2 will help you know if you are good or not AND whether or not you really do like it.

(2) Make a name for yourself - create a blog, write magazine articles, post on other people's blogs, answer newsgroup postings...start to learn more but also get more involved. Do a bit of research before posting and even test and/or create sample code. And present at conferences - this is a bit challenging BUT it can be done. Some conferences are community based (like SQLPASS) and they look for new and exciting presentations from the field... Others expect a lot more prior history and sometimes have a smaller number of speakers in general BUT, TechEd ITForum (week of Nov 13 in Barcelona) is doing something that's pretty interesting called "Speaker Idol" and this might be the break you need?

OK, so Speaker Idol is really the reason for why I've posted this blog entry. I wasn't really sure what the heck it was when I first bumped into it (accidentally actually - after someone asked me to verify something totally unrelated). Anyway, it's a competition for attendees of TechEd ITForum to submit a presentation that will be voted on as "best presentation" and the winner will get free attendance to next year's TechEd ITForum (and a VERY cool item to put on your blog, resume, etc.).

(3) Research the whole process of starting a business in your neck of the woods A LOT more before you actually do it!!! I truly wish you well if you do!

So, check out TechEd ITForum's Speaker Idol and have fun! It's certainly an interesting way to jump start the process if you've been thinking about this but just hadn't made the leap. GOOD LUCK!

Cheers,
kt

Friday, September 08, 2006 8:30:50 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 
Friday, September 01, 2006

Hey there everyone - Been a LONG time since I last blogged (sorry!)... key reason (fyi) is that I'm trying to find the ever-challenging work/life balance during the best months of the year (here in Seattle July/Aug are GREAT! months - September is almost always good too). Anyway, it's been a few weeks and I thought I'd catch you up... It all started with some travel (of course) and I was in Chicago for a SQLskills Immersion event and then off to London for another event with SQL Server FAQ (aka Tony Rogerson). I returned during the CRAZY travel restrictions and had to check two laptops (sigh) BUT they both made it back without damage after I purchased kitchen towels, bed sheets, a blanket and a duvet (and two new [cheap] suitcases) in which to pack them. When I got back, I relaxed! I've been up to see the Athabasca Glacier and Johnston Canyon and just this week I took off a day to do the "Lotus Experience" Advanced Driving Course at Pacific Raceways/ProFormance Racing School. I've always LOVED driving so driving 700 miles to Johnston Canyon didn't bother me at all (especially with great company and fantastic tunes - have you heard the latest Snow Patrol?) and the driving course in the Lotus was not only great fun but VERY informational. I've taken the Advanced Driving course before and I think EVERYONE should consider continuing their education in driving - things like Collision Avoidance, Advanced Braking (understanding proper driving under ABS), "high eyes" and so many other things...just make you safer on the road. OK, enough about all that fun/practical stuff.....let's get back to SQL.

Some great new resources are out:

And - it's now time to start enjoying Labor Day weekend, there ain't no labor going on here this weekend. Enjoy - and check out those links next week. ;-)

Cheers everyone,
kt

Friday, September 01, 2006 5:37:52 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2]  | 
Monday, July 10, 2006

Been thinking a lot about something that was mentioned in a few of my most recent posts... Especially when I get comments like "that's another item to add to our checklist" or "that's a good trick to add to our arsenal" and well, I thought in this blog entry I'd ask for your tricks that fall under the umbrella of designing for performance.

For example - do you change collations? I had a recommendation here.
For example - do you have a view that you want ordered? I had a recommendation (with caution) here. But - Adam Mechanic came back and said that he's used that trick to improve performance... and, I'm sure that's the case as well!
For example - do you have stored procedure parameters that are giving you grief? I had a series of recommendations in my Optimizing Procedural Code category here.

In fact, sometimes the best form of "hint" to SQL Server is NOT an optimizer hint but instead a more subtle change to the join (derived tables for example) or the infamous subquery -> join rewrite or the join -> subquery rewrite. I'm always asked "which is better - a subquery or a join" and I always answer YES. ;-)  OR taking a complex process and breaking it down into temp tables (I'd try to create views instead of temp tables first and see if the optimizer figures it out but there are cases when sometimes they just don't). Remember, it's not the optimizer's job to find the absolutely BEST plan; it's their job to find a good plan fast. And - they typically do. Really, no general "tricks" work ALL of the time and often they don't help at all but there are LOTS of things that I'm sure you've done and you really want to tell someone about it. How about here? I'm going to try to compile these tips/tricks into a best of...

Monday, July 10, 2006 7:27:35 PM (Pacific Standard Time, UTC-08:00)  #    Comments [4]  | 
Monday, July 03, 2006

Hey there everyone - If you're into Analysis Services, Integration Services, Reporting Services and BI in general - you'll want to check out Elizabeth Vitt's new blog. Liz has been specializing in BI since SQL Server started adding BI-centric components. She's got a wealth of information to share and many great insights into performance tuning as she's working on a BI Performance Tuning resource that will probably hit 100 pages (from current guestimates).

And - no surprise from Liz, she's out the gates running with her first entry on Influencing Aggregation Candidates.

Subscribe now!

And a big welcome to blogging for Liz!
kt

Monday, July 03, 2006 10:20:36 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1]  | 
Friday, June 30, 2006

Well, it's been a GREAT week here in Switzerland while working with my partner Trivadis. Today, we wrapped up a two-day course on Designing for Performance (in Geneva) while on Monday/Tuesday we did a two-day course on Indexing for Performance in Zurich. The food, the wine, the cheese, the butter, yum! Oh... and the questions/comments/technical focus, etc. has all been great. :) I'm flying home today (Sat) so wish me luck on having internet access at 36,000 feet again (probably not...I'm flying United instead of Lufthansa - and it's only Lufthansa that has FlyNet). Wow - can you imagine where we're going to be in only a couple more years? Internet access everywhere! (hmmm.. how do we escape? well, that's another blog entry for another day :)

Anyway, one of the great things about teaching is that I get to meet all sorts of people and work through all sorts of interesting problems... And - this blog entry is based on a discussion I had with [a very blogless ;-] Meinrad Weiss - a Trivadis employee/consultant AND a fellow RD. (I was bullied into blogging by CV so now I do my part and do the same to others)

I can't remember how it started but somehow a discussion started on Top 100 PERCENT being used in views. I mentioned that while this was a good trick in SQL Server 2000, it has been REMOVED from SQL Server 2005 (meaning that TOP 100 PERCENT does NOT order the data within a view). Theoretically, I agree with this decision - data within a view should NOT be ordered. A view should SOLELY define a tabular set. It is up to the query which is accessing the view to define the presentation of the view. Using TOP within a view should be limited to ONLY when it is used to further define the data set (i.e. TOP 10 PERCENT... ORDER BY TotalSales DESC makes perfect sense).

Now, having said that - it was a cool trick - but with Pros/Cons. The obvious Pro is simplicity in access. While adding the ORDER BY to the query against the view really isn't all that difficult, it does make it a bit easier for quick/simple query access. BUT - there's a HUGE con here too. If the view starts getting used for other purposes (like in joins to other tables), then the being ordered before the joins, etc. can cause you an additional step that is NOT necessary. As a result, performance was compromised.

Long story short, I generally recommended against it but it was still cool. So - then Meinrad started playing and came up with - what about 99.9 on a table that has < 100 rows OR 99.99% on a table that has < 1000 rows, etc. And - yes - that DOES work, because SQL Server rounds to 100%. So, you are back to getting 100% of your data, ORDERED within a view. But - you need to set your percentage to an appropriate percentage based on rows - but what if you don't know the row count? How about TOP n where n = the max value for a bigint (9,223,372,036,854,775,807)?? That should always work...and it does.

OK - so what's the point? Yes, we DO have a workaround for the removal of TOP 100 PERCENT in SQL Server 2005 - but be CAREFUL - you are potentially shooting yourself in the foot. If this view is NEVER used for anything but SELECT * FROM View, you're OK. If you start adding joins, etc. then you might get into trouble. In the showplan below - the data returned is EXACTLY the same.

SELECT C.ContactName, Sub.*
FROM OrderSubTotalsViewOrdered2 AS Sub
   
JOIN Orders AS O ON O.orderid = Sub.Orderid
   JOIN Customers AS c ON c.customerid = o.customerid
WHERE C.City = 'Madrid'
ORDER BY SubTotal DESC
go

SELECT C.ContactName, Sub.*
FROM OrderSubTotalsViewNOTOrdered AS Sub
   
JOIN Orders AS O ON O.orderid = Sub.Orderid
   JOIN Customers AS c ON c.customerid = o.customerid
WHERE C.City = 'Madrid'
ORDER BY SubTotal DESC
go

BUT - the first plan of execution queries against the ORDERED set and the second against the un-ordered. Check out the showplans below:

This is a VERY COMPELLING reason to BE CAREFUL ordering data within a view. While the trick does work, please use it sparingly.

If you want to play with the views created above, you'll need a copy of the Northwind Database and you'll need this script: Top 100 Percent in SQL Server 2005.sql (4.46 KB).

Have fun!
kt

PS - I'm adding my blog to Technorati so I need to post their link to start generating my profile... here we go: Technorati Profile

Friday, June 30, 2006 3:14:37 PM (Pacific Standard Time, UTC-08:00)  #    Comments [5]  | 
Tuesday, June 27, 2006

OK - so, lots of exciting things going on. I've just completed my first two day partner event in Zurich (first of two in Switzerland). Today I head over to Geneva by train so that I can deliver the second two day partner event there. The partner I've been working with in Switzerland is Trivadis and it's always been fun coming over here and working with them (this is my third trip to Switzerland since Mar last year and second with Trivadis)! Trivadis is home to one of my fellow RDs Meinrad Weiss and he's always great to chat with and very clever too! In fact, I have a blog post coming based on something we chatted about during class (it's a cool trick... stay tuned). But, regardless, one of the things that came up at the end of the day was the type of training that I did this week... (well, really the type of training that we do at SQLskills). The discussion ended up talking about the complete learning/training cycle and how do you choose from ALL of the different choices??! First, (if we're talking about SQL Server), it takes a layered approach and a variety of different techniques, technologies and even a lot of planning/practice/coordination among staff, etc. to keep a system running WELL. In the end, it takes a combination of things (and probably training) to make that happen. As a result and based on all of these choices, we talked about training in three primary buckets...

Conferences
Conferences are great when you want get a wide variety of topics covered. They're great when you want to network with large groups on a variety of levels. They're great when you want to see a variety of products and how they work and/or what they do. Conferences really are great when you want to go wide more than deep. Don't get me wrong, I like doing 300-400 level sessions at conferences (and there are a lot of other speakers that do as well) BUT, in only a 60-75-90 minute session, you can only go so deep. You can do two sessions or three sessions in a topic but sometimes that's hard to coordinate and to be honest, there's always so much going on that you tend to lose people that are getting distracted by other tempting sessions so I'm not all that fond of going beyond two for a conference. Having said that, you can certainly hit a lot of best practices and technical depth BUT I think a conference is BEST at giving you breadth and there's nothing wrong with that! In fact, some of the best conferences I've been to in a long time were TechEd in Boston and SQL Connections in Orlando... lots of great sessions and I got some great ideas of different architectures, etc. and I also learned about some great upcoming products, etc.

Training
Depending on the vendor there are lots of options in this bucket. Probably the most obvious is MOC (Microsoft Official Curriculum). MOC has been around for a long time... in fact, that's where it all started for me! Back in '90 I got my MCP in something (hmmm...DOS?) and then shortly thereafter got my MCP in OS/2, LAN Manager and SQL Server and in '91, I became an MCT (all of this while I was working for a small ATEC in Chicago called LAN\mind). Then, I went to work for Microsoft as a trainer for Microsoft University (the building 0 of Microsoft even though by that time Microsoft was a lot larger and had moved to it's main campus in Redmond). It was a GREAT place to work. Both of these training centers (LAN\mind and then Microsoft) were a BLAST to work at and I learned an amazing amount with my team members. It seems like all of the trainers had great fun working together...one of my best friends actually rollerbladed through my class once (those were the good old days!). My point, is that this kind of training is a lot more focused... it allows you to focus for 3-4-5 days on one area of the product and it gives you some hands-on. This can be exactly what you need. The only problem is that you're somewhat at the mercy of the instructor (well, this is always true!) and you have to hope you get someone who has really worked with the product and/or customers and really knows depth (I’ve heard EVERYTHING in terms of comments – both good and bad about MOC and most of the time it’s about the instructor). To add to this, there are some companies that “host” MOC training but don’t even really have instructors on site (which is also good and bad). In fact, I’ve received a few emails that are clearly being blasted at a ton of people (I guess MCTs) and the mails are saying things like “Is anyone available NEXT week to teach course XYX” and that just completely blows me away. Don’t get me wrong I used to jump in when possible and say “yes” I’m available and I have to think that it’s still a good class BUT… it just seems like the RANGE of quality is the most widely varying here and that’s the part that can be disappointing. With regard to the courses, these are generally good. However, I do think that MOST of the courses try to cover too many topics. Basically, each class tends to focus on a discipline - like DBA or Dev or BI and while that can be good...sometimes you need more AND sometimes certain topics really hit at least two and possibly all three of those areas. That's where I *think* (well, I hope!), that I (we) come in with "Immersion" events.

Immersion Events
Other companies have things which are similar but my thoughts on Immersion Events are that some areas just need time and incredible depth to really understand. I always make fun of this old Chinese proverb but it's so true: give a man a fish and he eats for a day, teach a man to fish and he eats for a lifetime. The point is that anyone can tell you a command and how to implement something... but if I can really tell you why something works the way that it does and show you how you can see it and troubleshoot it, then you're going to be able to use that and leverage that when other things go right/wrong. I believe heavily in internals and the lower levels of how pages, extents, files, filegroups, data v. log, locking... if you start to get a feel for how all of these things are interrelated, you can make better architectural decisions AND you’ll know what the other impacts are of a variety of decisions. For example, transaction design (esp. long running transactions) impacts the types and lengths of time that locks are held (ok, that’s a “well duh”) but poorly written transactions not only cause blocking (which in turn, affects performance) but can also impact the size of your transaction log and whether or not the log can be cleared. This can end up not only causing performance problems but can impact recovery and even some of your secondary systems (or cluster) down the road in terms of restart recovery (because a long UNDO can take time and in SQL Server 2000 and SQL Server 2005 (in 2005 it’s ONLY the non-Enterprise Engine editions), users cannot access a database until UNDO has completed. (sidenote: the SQL Server 2005 Enterprise Engine editions (Developer, Enterprise Eval and Enterprise) have Fast Restart which allows users to access the database after REDO has completed and while UNDO is processing. The good news is that data integrity is maintained because the records being processed by UNDO are locked and inaccessible (unless lock hints are used).) Finally, if you’re waitinf for UNDO, this can lead to the perception of potentially A LOT of downtime. And – all of this could be because of a poorly written application. So, my point is that knowing the impacts of something from end to end and really diving into how things work – that’s what some classes do. That’s what Immersion Events do. We take one area and we really try to focus on a subset of features (new features/critical features) and then we just try to go deep. We’ll often hit on best practices in other areas and other options but we really try to go deep in the more critical areas. The other thing that we do with Immersion Events is that we try to extend the courses with other types of learning – group troubleshooting, guest speakers, hands-on, *interesting* demos and then we end the day with BYOP (“Bring Your Own Problem”). BYOP sessions are unstructured sessions that typically run from 4:30 - 6pm after an 8:30am-4:30pm course day. BYOP is designed so that you can apply what you’ve learned during the class day to your own systems and challenges. If it’s a tuning class, we strongly urge you to bring your laptops and external drives – loaded with database backups, sample applications/workloads and Profiler Traces – so that you have real world data/problems for real world (and specifically YOUR WORLD
J) tuning scenarios. If it’s an architectural class (like Bob’s classes in SODA), then we want you to come with your ideas and current architectures so that we can see if other features and/or alternatives could provide better x, y, z (insert performance, flexibility, control, etc…). The BYOP sessions begin with group discussions – focusing on similar systems and/or challenges and then the groups will move into deep troubleshooting discussion groups from there. Basically, the aim of an Immersion Event is to have an intense and focused event where not only do you come back with a few new/critical features to leverage but you may have even solved some of your most frustrating challenges.

OK, this blog entry really wasn’t planned to be this long but it’s a discussion that I got into last night that started to motivate this. At the end of my partner event at Trivadis (not quite a full blown Immersion Event but *very* close – certainly in terms of technical depth but just not the really long days and BYOP), they were really pleased that we had just spent two full days on nothing but Indexing (Internals to Statistics to Advanced Indexing Strategies to Maintenance). They told me that they felt like they were ready to tackle a few problems and a couple folks said that they already knew what they were going to do to fix a thing or two. One person came to me telling me that they really didn’t think that I could tell them WHY one of their more complex problems “worked” the way that it did but after the module on statistics he said he knew exactly what was going on. And – I love when I get mail from folks after class… I just chatted with someone from my last class who told me that they had a query that was taking 15 mins (over a 2.8 billion row table) and now (after some of my Advanced Indexing Strategies) the query is taking 9 seconds. Can’t complain about that, eh? J

Anyway, learning can be fun AND there are lots of ways to learn. I think the best way to a better system is to combine all of the above (and don’t forget webcasts, blogs, whitepapers, Hands-On Labs, etc.). The basic point is that you should attend some conferences to get a wider variety of topics and architectural options, attend some training to get started with a variety of technologies and then attend Immersion Events to really get the depth and focus to solve some of your more complex challenges. And – you can certainly substitute webcasts, blogs, whitepapers and a few other options for some of these training options but sometimes you really need to get away and “Immerse Yourself” in the technology!

And – OK, I have another reason for blogging on this topic today… we’ve (SQLskills) just announced our new Immersion Event lineup for summer/fall 2006. We’re hitting three cities in the US (Chicago[land], Redmond, and New York City) and we’re targeting some intense classes around Performance Tuning, Service Oriented Database Architectures, Tuning BI (IS/AS/RS) Environments AND Always On Technologies. They’re going to be great fun, in-depth, hands-on courses with long days so that we can all really focus and “get it done”. And we’re still adding some guest speakers (Gert (aka Data Dude) is joining us in New York and it looks like the famous “SQL Apprentice” will join us in Chicago for some BYOP on SQL – details to be posted shortly). Check out the complete lineup of events here.

Let me know your thoughts on training in general and/or in specific. Really, I think all types are good and all have their time/place. I just happen to like the deep “Immersion” ones more than anything!

Hope to see you there!
kt

Tuesday, June 27, 2006 10:52:29 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2]  | 
Friday, June 23, 2006

So, I've now spent the last couple of hours playing with Database Mail and HTML formatted messages being sent to the SQLskills subscribers. It's been a fun learning experience as I think I've found a bug with the email account name length...let's just put it this way - don't be too descriptive with your account names.

Outside of that - it's amazing how well queue based email works. The old mapi based mail would take a LONG time to complete the batch mail processing but now - with queue based mail it's done in seconds.

The best part is that I'm on a flight over the Atlantic right now...on my way to Frankfurt. I think this is the MOST productive flight I've ever had!

Have a great weekend,
kt

Friday, June 23, 2006 8:35:59 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 

I've posted all of my demo content from TechEd 2006 and wow - it was a lot of fun! I created many new and fun demos as I tried to keep everyone awake through the sessions (cause it seems like there were way too many evening events - didn't it? ;-)). All of the content is posted here.

Finally, can I just say that Paul (and team) has been on a roll. They're blogging machines. If you're not reading this blog... you're DEFINITELY missing out.

Have a great weekend... I'm off to Switzerland today.
kt

Friday, June 23, 2006 6:24:17 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Friday, June 16, 2006

What a GREAT week here in Boston. I'm just about to pack up and go home (kind of). Right now it's early and I think I'm still feeling really good from last night's fun. The big party at Fenway, dinner on the town, walking around the city, etc... It was a beautiful spring (almost summer) night last night and the stars were out. We couldn't have asked for a better day (weather-wise) to end the week. There's still one day left at TechEd but this is when a few of us start heading out. I'm going to Chicago for the night to hang out with my Father (don't forget it's Father's Day this weekend) and then on Sat I head back to Seattle. I want to get back fairly soon as I'm only home for a bit under a week before I head to Switzerland to deliver a couple of master classes with my partner Trivadis.

But - part of why I'm writing this is that this was one of the best weeks I can remember at TechEd. Everyone seemed to be enjoying themselves and for once (in a long time) I really liked the layout of the venue. Eventhough TechEd is a MONSTER conference (12K people), the venue was easy to navigate and (most of) the sessions never seemed too full. Even the Expo hall didn't feel overly crowded. Don't get me wrong - there were a lot of people in there a few times BUT, it always seemed manageable. And - even when sessions did fill, there were overflow rooms setup and capacity seemed to be managed well. So, I do feel like I walked quite a bit this year but it felt reasonable (compared to Orlando last year, this was a dream).

And - the WAY more fun part of TechEd (for me) is meeting people. I've meet quite a few customers - many of you - who read my blog and/or have attended some of my webcasts and it's GREAT to put faces with the names.

So, that was my week and now it's done (and if you didn't make it - check out Virtual TechEd). In my next few posts, I'll try to find some time to chat a lot more about the technology that I saw and the technology that I demoed. In fact, all of my demo scripts and sample code will be "tweaked" and posted over the weekend and then I have to move on to planning some of our (SQLskills) Immersion Events... At least I don't have to plan an event the size of TechEd! Our events are 5-days, hands-on, and intense technical events where we dive deep into one technology (and you can count the number of people with 2 digits instead of 5). In fact, I think that's the combination I like. Go to conferences for depth in small chunks and over a wide variety of topics and technologies (and futures too, so that's always fun/helpful). Go to training/"immersion events" for deep, technically-focused, hands-on courses to really become fluent in a specific area. I'll post all of the final details when we're all set to open for registrations (next week). But, if you think you might want to attend - subscribe on SQLskills and you'll get an early mail to let you know the details. Subscribe here!

Have a great weekend and HAPPY FATHER'S DAY!
kt

Friday, June 16, 2006 3:04:12 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1]  | 
Thursday, June 08, 2006

Another excellent podcast by Greg Low from SQLDownUnder. This week he hosted/interviewed Gert Drapers - the original Data Dude (even though he tells the real story in the podcast).

Download the show here: SDU17FullShow.mp3 or here: SDU17FullShow.wma.

Enjoy,
kt

Thursday, June 08, 2006 8:26:03 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Wednesday, June 07, 2006

Well, if you're wondering why I've been so quiet this week... it's a myriad of events all coming together and/or being finalized right in time for TechEd. In working really hard (especially crazy was today) for some final TechEd content, I realized that a lot of people don't really know what goes on behind the scenes of some of these really huge events. Brian Marble has been blogging about this and you can learn some interesting things by checking out his blog. And for some fun statistics related to TechEd, here is an idea of the quantity of food and drink that will be consumed:

  • 1,250,000 pieces of "Mikes & Ikes" will be consumed over the course of the Tech Ed 2006 week
  • 18,750 pounds of salad will be prepared and offered at meals
  • 83,700 ice cream novelty/fruit and yogurt bars have been ordered for this function
  • The total amount of fruit ordered will fill 3/4 of full size tractor-trailer
  • 60,000 eggs will be eaten by attendees at breakfast (this is equal to 4,800 dozen cartons of eggs)
  • It will take 4 semis to transport the 150,000 bottles of water consumed
  • 1.6 million ounces of coffee will be poured and consumed (conservative estimate)
  • More than 50,000 pounds of carbohydrates will be consumed at Tech*Ed (Atkins who?)
  • 1,500 table cloths will be used and re-set on a daily basis (7,500 for the week)
  • A minimum of 2,000 antacid tablets are likely to be consumed at this event

As for the technical content, well that's not too shabby either. There are over 900 breakout sessions, chalk talks, ILLs (Instructor-led labs), HOLs (Hands-on labs) and general/keynote sessions. There's a lot of technology that comes together for a show like this and there's even a DVD that's available after the show with all of the breakout sessions on it. The key point is that there's a lot going on and I'd have a hard time believing that you couldn't find something to do during every timeslot (for me there are multiple time slots where I'm torn between delivering my own session and attending another...but, I have a feeling I know where I'll end up :).

One thing that you can do in almost every timeslot is an HOL (Hands-on Lab) and for SQL Server there are more than 10 of them. Each HOL is focused and technical and each covers a specific technology or topic. For TechEd 2006, I've written two of the HOLs: DAT007 and DAT010. Specifically, DAT007 is Database Mirroring in SQL Server 2005 SP1 and DAT010 is Table and Index Partitioning. These HOLs cover everything from design to implemenation to failover to monitoring - for Database Mirroring and for Partitioning the lab goes from design to implementation to performance to the sliding window scenario. They were a lot of fun to write and I hope a lot of fun to go through. If you're interested in hearing more about them, Mark Penaroza did a couple of interviews about them. He blogged about it here and mentioned that the interviews are available on Commnet (the Microsoft TechEd attendee website). I've also posted the interviews here (DAT007 Interview (4MB mp3 file) and DAT010 Interview (2.75MB mp3 file)) so that you can get some insight into the things we're doing to help get you started and ready with these new technologies.

Finally, since TechEd is sold out, I know that not all of you will be there. As a result, there's "Virtual TechEd". Virtual TechEd is a site dedicated to getting some of the content and resources out to folks that just couldn't attend. The Virtual TechEd site is here: http://virtualteched.com/default.aspx

So, I think that's it for now. Still enjoying the comments you're making on the last blog entry about the version you're running and why. Seems like we all have the same problem - time and money ;). Keep those comments coming!

Thanks for reading,
kt

Wednesday, June 07, 2006 9:05:23 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2]  | 
Sunday, June 04, 2006

Hey there everyone - Out of curiosity, can you let me know (via comments to this blog entry) what version you're using in production? AND - why? I guess I'd like to know some of the many reasons that you are staying with SQL Server 6.x or 7.0, if not SQL Server 2000 or SQL Server 2005. If you're staying with SQL Server 2000 - let me know when you plan to upgrade?

Just wondering... and thanks for your time!
kt

Sunday, June 04, 2006 11:59:52 AM (Pacific Standard Time, UTC-08:00)  #    Comments [23]  | 
Friday, June 02, 2006

Ha... did that get your attention? Well, what I really hope to do is make everyone aware of what's made the Developer Community rounds this week. On Wednesday, Microsoft announced "Data Dude" (aka the Visual Studio sku for Database Developers). This was an announcement that may be glanced over by many DBAs thinking it's just another tool for developers...what can it offer me? And, well, that's where I think there are some VERY cool things to point out. I've been following Data Dude for a couple of months now (ah... a little birdy told me :) and at first I wasn't sure how much it would impact me. However, after starting to get a better feel for their future directions, I've realized that even though I'm not their initial and/or direct target audience that I'll definitely find some great uses for their product. In fact, in getting ready for their announcement and in chatting with a few press folks, I wrote up a small amount of text. Some of this was quoted in the eWeek article here but there's a few more things that I really think you'll (yes, even DBAs) be interested in. This is the second half of the content that was quoted:

For Administration and Operations, I especially like their direction with regard to unit testing and sample data generation. I work with a myriad of customers who do not let development/operations perform testing/tuning on real production data (even a copy) due to data sensitivity requirements/policies. As a result, performance testing can be horribly flawed. With the ability to generate large sample volumes of statistically "real" data, real-world tuning will be possible without compromising data sensitivity. This is the area that I'm most interested in initially but refactoring and schema comparisons are very interesting as well. One of my favorite sayings is "The sooner you start to code, the longer the program will take." (Roy Carlson) as schema changes can be challenging at best and often things can be missed (data types, columns names, etc.). Often the changes are made on alternative systems and then they need to be integrated in - often through comparing schemas and with hand-created alteration scripts. With the ability to have intelligent refactoring, application and database logic can be fixed through a straightforward and flexible interface. This will help to minimize later errors or even harder to recognize performance problems caused by implicit conversions.

So, in the end, I'm not their primary target but I think I'll probably get really into it and try to consider a variety of ways to leverage it for Operations/Administrations teams for "after the fact" tuning cases. However, I do think teams will be even more productive if they adopt it earlier in their production lifecycle.
Now, if you're even slightly interested. You can get a lot more information about Data Dude already. Here's a beginning list:

And, if you're going to be at Microsoft TechEd 2006 - in Boston in a couple of weeks, there's a LOT more information coming. In fact, my pre-conference workshop co-presenter Brian Randell has authored some Hands-on Labs for Data Dude and those will be available in the HOL area. He's on DotNetRocks this week (to be released on DNR's site next Tuesday) and he's chatted with them in the past about Virtual Server/Virtual PC (for hours...now you know how I feel ;). Honestly though, we use VPC/VS a lot in our HOLs and Brian is REALLY knowledgeable about how to optimize them, compact drives, etc. Anyway, here's the link to his last show on DNR and here's Brian's blog entry on Data Dude. And, just as a small hint... you should consider making a Sched+ for the "Live from TechEd" show from DNR. All I have to say is that it might have some great guests on it (maybe even related to Data Dude, hint, hint)!

Now, the last thing that's the most exciting for me to announce is that there are some new bloggers as a result of the Data Dude annoucement. FINALLY, one of my best friends - Gert Drapers - has started blogging (don't forget his already awesome content site: http://www.SQLDev.Net). If you're at all interested in geeky database development stuff, subscribe now! And - many of his team members are great friends too (