I've always been concerned with security and I've always stressed the importance of auditing the REAL user context not just the current user (see this post on EXECUTE AS and auditing). So, I generally try to avoid using dynamic string execution and if necessary create well tested/protected parameters (fyi - using QUOTENAME can be a fantasic solution to protectng identifiers as input parameters but it can't protect more complex strings).

Having said that, what if I'm looking at a database for the first time... just poking around trying to see if there's anything that needs further attention? I've come up with a quick query... And, while it's not going to "solve" your problem (as that's going to take some re-writing of code) or even truly verify if you're vulnerable, it gives you a "quick list" of where you should look first! If your code uses dynamic strings AND it's elevated - then start there! 

SELECT OBJECT_NAME(object_id) AS [Procedure Name],
  CASE
      WHEN sm.definition LIKE '%EXEC (%' OR sm.definition LIKE '%EXEC(%' THEN
'WARNING: code contains EXEC'
      WHEN sm.definition LIKE '%EXECUTE (%' OR sm.definition LIKE '%EXECUTE(%' THEN
'WARNING: code contains EXECUTE'
  END AS [Dynamic Strings]
,
  CASE
     
WHEN execute_as_principal_id IS NOT NULL THEN N'WARNING: EXECUTE AS ' + user_name(execute_as_principal_id
)
      ELSE
'Code to run as caller - check connection context'
  END AS [Execution Context Status]
FROM sys.sql_modules AS sm
ORDER BY [Procedure Name]

Is this enough? Anything else you'd check? What do you think?

THANKS!
kt

OK, I'll definitely take a beating from all of you for having gone so long between my survey posts and now. I won't even go into the details but between some crazy work schedules, multiple sinus problems and even migraines... well, I've been a bit behind. Let's just say that April/May were rough at best. I'm feeling better and well, now I'm trying to catch up. I had really gotten the blogging bug in March but I completely lost it in April. But, this tipping point series is in dire need of lots of explaining so I'm really hoping to get a few posts done in this area for sure!

First, I started the discussion around this in a few surveys:

Survey/Question 1

Q1 was described as this: if a table has 1 million rows at 20 rows per page (50,000 pages), at what percentage (roughly) of the data would a nonclustered index no longer be used. Blogged here. Here's what the survey said as of today:

And, for Q1 the correct result (Between 0-2% of the rows) is actually the best result (but, by no means the overwhelming majority at only 28%). However, often people just "think" the answer is very small. So... I did a few more questions/surveys. 

Survey/Question 2

Q2 was described as this: if a table has 1 million rows at 100 rows per page (10,000 pages), at what percentage (roughly) of the data would a nonclustered index no longer be used. Blogged here. Here's what the survey said as of today:

And, for Q2 the correct result (Less than .5% of the rows) is actually at a tie for the best (but, again, even a small percentage at only 22%). Again, often people just "think" the answer is very small. So... I did one more question/survey. 

Survey/Question 3

Q3 was described as this: if a table has 1 million rows at 2 rows per page (500,000 pages), at what percentage (roughly) of the data would a nonclustered index no longer be used. Blogged here. Here's what the survey said as of today:

And, for Q3 the correct result (Between 10-20% of the rows) is actually NOT the highest answer. And, this is even more convincing that there's confusion around what's going on and why.

The Tipping Point

What is the tipping point?

It's the point where the number of rows returned is "no longer selective enough". SQL Server chooses NOT to use the nonclustered index to look up the corresponding data rows and instead performs a table scan.

When does the tipping point occur?

It depends... it's MOSTLY tied to the number of pages in the table. Generally, around 30% of the number of PAGES in the table is about where the tipping point occurs. However, parallelism, some server settings (processor affinity and I/O affinity), memory and table size - all can have an impact. And, since it can vary - I typically estimate somewhere between 25% and 33% as a rough tipping point (and, you'll see from a bunch of my examples, that number is not EXACT). Then, I translate that into rows.

Math for Tipping Point Query 3: If a table has 500,000 pages then 25% = 125,000 and 33% = 166,000. So, somewhere between 125,000 and 166,000 ROWS the query will tip. Turning that into a percentage 125,000/1million = 12.5% and 166,000/1million = 16.6%. So, if a table has 500,000 pages (and 1 million rows) then queries that return less than 12.5% of the data are likely to USE the nonclustered index to lookup the data and queries over 16.6% of the data are LIKELY to use a table scan. For this table, that percentage seems "reasonable". But, most of us say that the tipping point happens at a much lower percentage... why? Because row size - which determines table size (and therefore pages) is really what has the greatest impact. So, let's look at Tipping Point Query 2... 

Math for Tipping Point Query 2: If a table has 10,000 pages then 25% = 2,500 and 33% = 3,333. So, somewhere between 2,500 and 3,333 ROWS the query will tip. Turning that into a percentage 2,500/1million = .25% and 3,333/1million = .33% (not even 1%). So, if a table has only 10,000 pages (and 1 million rows) then queries that return less than a quarter of 1% of the data are likely to USE the nonclustered index to lookup the data and queries over one third of one percent are LIKELY to use a table scan. For this table, that percentage seems really low BUT, at the same time it makes sense (to a point) that a small table would be scanned... but, for less than 1%. 1% is NOT selective enough. For small tables, it might not matter all that much (they're small, they fit in cache, etc.) but for bigger tables - it might be a big performance problem. 

Math for Tipping Point Query 1: If a table has 50,000 pages then 25% = 12,500 and 33% = 16,666. So, somewhere between 12,500 and 16,666 ROWS the query will tip. Turning that into a percentage 12,500/1million = 1.25% and 16,666/1million = 1.66% (under 2%). So, if a table has 50,000 pages (and 1 million rows) then queries that return less than 1.25% of the data are likely to USE the nonclustered index to lookup the data and queries over 1.66% are LIKELY to use a table scan. Again, this seems like a low number. Again, for small tables, it might not matter all that much (they're small, they fit in cache, etc.) but as tables get larger and larger - it CAN be a big performance problem. 

Why is the tipping point interesting?

  • It shows that narrow (non-covering) nonclustered indexes have fewer uses than often expected (just because a query has a column in the WHERE clause doesn't mean that SQL Server's going to use that index)
  • It happens at a point that's typically MUCH earlier than expected... and, in fact, sometimes this is a VERY bad thing!
  • Only nonclustered indexes that do not cover a query have a tipping point. Covering indexes don't have this same issue (which further proves why they're so important for performance tuning)
  • You might find larger tables/queries performing table scans when in fact, it might be better to use a nonclustered index. How do you know, how do you test, how do you hint and/or force... and, is that a good thing?

Real example of an interesting tipping point

Earlier today, I went on facebook and twitter and gave the following information - very vaguely - and I asked "why" is Q2 so much slower than Q1 if Q2 returns only 10 more rows. Same table and no hints (other than MAXDOP)...

Q1: SELECT * FROM table WHERE colx < 597420 OPTION (MAXDOP 1)

  • returns 197,419 rows
  • takes 116,031 ms (1 minute, 52 seconds)
  • 1,197,700 logical reads, 5 physical reads, 137,861 read-ahead reads
  • 7,562 ms CPU time

    Q2: SELECT * FROM table WHERE colx < 597430 OPTION (MAXDOP 1)

  • returns 197,429 rows
  • takes 244,094 ms (4 minutes, 4 seconds)
  • 801,685 logical reads, 1410 physical reads, 801,678 read-ahead reads
  • 9,188 ms CPU time

There were lots of great guesses... but, it's the tipping point. SQL Server chose to "tip" the second query because it was "over the line". But, it's important to realize that there are cases when that's NOT a good idea. And, what are your options?

In SQL Server 2005 - the only option is to force the nonclustered index to be used:

Q2: SELECT * FROM table WITH (INDEX (NCInd)) WHERE colx < 597430 OPTION (MAXDOP 1)

But, this can be TERRIBLY bad on some machines where the IOs could be a lot faster (and where data might already be in cache). These specific numbers are exactly that - specific to this HARDWARE (and, I chose not-so-optimal HW in this case to highlight this problem). And, depending on what number you use (what if this is a parameter in sps?) you might force SQL Server to do WAY more IOs by forcing the index than allowing the tipping point to do its job. But, depending on your hardware (and/or what you know to be in cache at the time of execution), it might be better to force an index instead of letting SQL Server choose. So, should I force the index? Be careful, if you're wrong - it could take more time and actually be slower.

In SQL Server 2008 - there's a new hint - FORCESEEK:

Q2: SELECT * FROM table WITH (INDEX (FORCESEEK)) WHERE colx < 597430 OPTION (MAXDOP 1)

FORCESEEK is better because it doesn't tie you to a particular index directly but it also doesn't let SQL Server tip to a table scan. However, just like forcing an index - you can be wrong!

So, what should you do? It depends. If you know your data well and you do some extensive testing you might consider using a hint (there are some clever things you can do programmatically in sps, I'll try and dedicate a post to this soon). However, a much better choice (if at all possible) is to consider covering (that's really my main point :). In my queries, covering is unrealistic because my queries want all columns (the evil SELECT *) but, if your queries are narrower AND they are high-priority, you are better off with a covering index (in many cases) over a hint because an index which covers a query, never tips.

That's the answer to the puzzle for now but there's definitely a lot more to dive into. The Tipping Point can be a very good thing - and it usually works well. But, if you're finding that you can force an index and get better performance you might want to do some investigating and see if it's this. Then consider how likely a hint is to help and now you know where you can focus.

Thanks for reading,
kt

Whenever I setup a new machine, I like to set a few default options in SSMS. Here are my favorites:

Tools, Options

  • Environment
    • Fonts and Colors
      • Text Editor font: Lucida Console (a bit thicker and it's a fixed-width font)
      • Text Editor: Selected Text (under display items)
        • Item foreground: Black
        • Item background: Yellow (looks like a highlighter)
      • Sometimes I'll also make the results windows have larger fonts - especially if it's a presentation machine
    • Keyboard
      • Keyboard scheme: SQL Server 2000 (ok, maybe I'm old-school but the QA keyboard shortcuts still seem a lot more natural than the VS keyboard shortcuts... but, if you're more of a VS person, then stick with Standard)
      • Query shortcuts:
        • Ctrl+F1: sp_helpindex2 (if you highlight an object and then hit Ctrl+F1, then it passes the highlighted object in as a parameter. The only negative is that it doesn't delimit it so you can't highlight schema.object unless it's already quoted for the sp 'schema.object')
        • Ctrl+3: SELECT object_name(object_id) AS ObjName, * FROM sys.dm_db_index_physical_stats(db_id(), NULL, NULL, NULL, NULL) ORDER BY 1, 4
        • Ctrl+4: other common queries/procedures...
  • Text Editor
    • All Languages
      • Line numbers
    • Editor Tab and Status Bar - I usually change the colors of the "Group connections" setting to something very noticeable like Fuschia. I've already blogged why here.
  • Query Execution
    • By default, open new queries in SQLCMD mode (however, if you set this - you will disable Intellisense and Transact-SQL Debugging... and, there's no "hint" or warning that you're doing it.)
  • Query Results
    • SQL Server, Results to Grid
      • Include the query in the result set (this will show you what you executed in the results window while you wait for completion... also confirms what you think you executed. This gets annoying for large batches but is sometimes useful to see what you last executed)
      • Include column headers when copying or saving the results (VERY useful if you're pushing data over to Excel, etc.)
      • Display results in a separate tab (this gives you Tabbed mode instead of split-window mode meaning that results will go to a results Tab instead of the lower half of your window. I prefer this when I have a lot of results to review AND when I'm presenting and typically run with a lower screen res.)
        • Switch to results tab after the query executes (I prefer this so that I'm waiting for the results in the results window)
    • SQL Server, Results to Text
      • I set most of the same settings as Results to Grid (Include column headers, Include the query, Display results in separate tab and Switch to the results tab) BUT, I usually turn off Scroll results as received. If I'm looking at rows at the top of the set (while it's still processing) it often takes my cursor down to the end. I also turn this off in Profiler.

The primary reason for this post though - is to make sure that you realize that one specific option - Tools, Options, Query Execution, "By default, open new queries in SQLCMD mode" - can really surprise you. I used to think that this option was really helpful (and benign) because SQLCMD is a superset of Transact-SQL. So, I typically recommended that you turn it on. However, turning on this option disables Intellisense and the Transact-SQL Debugger. It's documented here: http://msdn.microsoft.com/en-us/library/ms174187.aspx (thanks Paul Mestemaker! :) and I thought this was a bit unintuitive at first but it really does make sense. Both the Transact-SQL Debugger and Intellisense are debugging and/or prompting you with Transact-SQL (not SQLCMD commands). So, disabling it seems right (well, I guess I could argue that it could work with Intellisense but debugging would be much harder). However, I just wish there were some better information in the Tools/Options dialog. I'd like my 2 hours back (trying to figure out which option disabled Intellisense, chatting with my friends on fb to see if they knew, filing a connect bug when I figured out that it was SQLCMD, and then learning from Paul Mest that it's actually expected behavior (duh!), changing the Connect bug to say that the dialog might be better with a warning... it was definitely one of those mornings!  :) :) :)

So, my main point for this post - save you the time I lost in trying to figure out where Intellisense went. It's expected behavior (and it does make sense........now :).

Cheers,
kt

Categories:
Manageability | Tools

I've seen (and heard of a few other cases where people received) the following error:

Index (zero based) must be greater than or equal to zero and less than the size of the argument list.

And, the problem (when I got this error) was the compatibility mode of the database I was in didn't support what was being attempted. In SQL Server 2005, the reports used to generate an "Incorrect syntax near '('." error. For what I was trying to do (execute/access some of the standard reports from within SQL Server 2008 Management Studio), I received this error and immediately thought it might be the compat mode. Sure enough, it was.

So, I don't know how often this will help you but just in case - I thought I'd post this!

Cheers,
kt

OK, I know I don't blog all that often but when I do, I do try and post as much useful information as I can :). I've got a few posts in the queue and a few more tests to do and code to write before I can wrap them up. In the interim, Paul and I have both decided to throw a bit of our "spare time" to keeping up with friends and family more and just staying "more connected." In that effort, we've both joined facebook and twitter (www.twitter.com/KimberlyLTripp and www.twitter.com/PaulRandal). Our end goal is:

  • Blogging: large/complex posts with detailed info/code, etc.
  • Twitter: short, quick posts on things to check out, review, etc. Maybe I'll do a weekly summary of tweets that I do and/or followed? Would you be interested in that? And, as for my interface, I'm currently using TweetDeck and Twitter.com and I've also joined via WeFollow.com: #sqlserver, #mvp and #womenintech.
  • Facebook: more fun stuff and keeping up with friends, etc. I have to admit that I really love the interface and I'm constantly impressed at how easy it is to upload video, photos, etc... It's just NOT what I expected before I tried it out. I'd truly recommend this to anyone that wants to asynchronously keep up with a large number of people and wants to share photos, comments, video, etc. It's really well done. However, beware of many of the facebook apps. They tend to spam your friends list - sometimes even when they don't ask. Outside of better requirements on fb app developers, I don't have a lot of complaints there.

Having said that, neither of us is doing anything else (no Plaxo, no LinkedIn, etc.) so if you want to find us - we're definitely around but we're going to stay somewhat focused. ;-)

And, now that I'm back home again, I hope to have a few more of my longer posts done.

Cheers everyone!
kt

Kimberly L. Tripp's Facebook profile Paul Randal's Facebook profile

Categories:
Opinions | Personal

Recently, the SQL Server 2008 Internals title was released (and only in the past few days have people actually received their copies)! In fact, I still haven't seen the book in person... soon though!

Anyway, Kalen, Paul, Conor, Adam and I worked to create a comprehensive resource on SQL Server 2008 internals and to supplement the written content, many of us created demo scripts and examples. I've now gone back and created a sample script based on ALL of the code in the entire chapter (and in many cases I extended the code in the samples). And, while this updated content is going to be located on the companion content site, I thought I’d also release it here with some notes.

The zip contains a solution with 3 projects, each with a few scripts:

I know the names seem a bit strange but everything is ordered EXACTLY as it is shown in the book. And, in the book, I referenced "a" script called EmployeeCaseStudy-AnalyzeStructures.sql but that script was so big that I broke it down into 7 parts (hence the naming convention of 06...01, 06...02, etc.). Regardless, each script contains a brief header and a few details about the script. To get the most from the script, do not just open the script and execute it. If you really work your way through the script, you should see all of the comments and they will help you to make instance specific changes so that everything runs without error. Just take your time and really try to step back and think about each command (and what you're expecting the output to be) to test yourself while your working through the results. Taking your time and really grapsing all of these internals is what makes it fun!

Quite a bit of this content can stand alone but it's really best as companion content to the title as there's a lot more "text" and detailed information in the book. But, the scripts are really a great way to dive deeper, learn documented/undocumented commands and really get to know what the heck is really going on internally.

Finally, I only worked on Chapter 6 so here it is: 20090413_IndexInternals-Chapter6-Resources.zip (12.22 mb). As for the other companion content, you'll need to get the links from the book. Ah, or from Kalen :). 

And, certainly, if you find a typo or anything that you think needs more clarification, let me know! I'm more than happy to post updates to this content. 

Have fun,
kt

UPDATES/ERRORLOG:
2009-Apr-13 (8am): Updated the zip after remembering in my sleep (yes, sad, but true!) that one of my comments that referenced some line numbers didn't get updated in the final version. So, no errors and if you don't get this update, it's not going to break anything. But, the script that's been tweaked is script: 05_EmployeeCaseStudy-TableDefinition.sql. Enjoy! kt
2009-Apr-13 (4:30pm): Ha... guess what arrived today. Yes - our copies of the book. Wow, it's great to see it in person. Again, enjoy!

If you haven't seen Paul's weekly survey, please take a look. We're trying to get a feel for the general state of database structures out there... Here's the link: http://www.sqlskills.com/BLOGS/PAUL/post/Weekly-survey-does-size-really-matter-or-is-it-what-you-do-with-it.aspx.

THANKS and have a great weekend!
kt

Categories:
Survey

A couple of weeks ago I wrote a blog post titled Whose job is it anyway? It's an interesting debate and something I've been hearing more and more - that SQL Server is a "set it and forget it" technology - a black box where you just don't need to know how it works to do well with it. In fact, I've even had a few folks comment that they think it would be better to "roll their own" database rather than have to learn how to work in a "general purpose" database. And, while there are certainly lots of different angle to this debate - one fact remains... if you don't know anything about the database on which you're developing (whether it's SQL Server, mySQL, Oracle, whatever), I *PROMISE* you won't have a truly scalable, optimal solution. Why do you think there are so many knobs? It's because there are so many different ways to work with data. There is more than one way to query, more than one way to design. This is also why every answer to a "how should I do this" question starts with "It depends". And, while that seems like a scary response it's actually a good one. It means that you have lots of options - options that can offer many different pros/cons. And, as a result of knowing these pros/cons, you can make better decisions - decisions that will ultimately determine how well you can scale.

So..... while I don't think this debate will EVER be finished (as to WHOSE job it is to know these things), I do think a lot of folks are seeing the effects of not knowing more about their store (and, again, this is NOT limited to SQL Server in any way, shape or form).

At a minimum, hear the discussion on RunAsRadio with Richard, Greg and I and let us know what you think!

Kim Tripp on the Roles of Developers and DBAs with the Database!

Cheers,
kt

I started the series here: http://www.sqlskills.com/BLOGS/KIMBERLY/post/Spring-cleaning-your-indexes-Part-I.aspx and I want to continue with Part II today by clarifying some great questions/comments that have come up on the series. In Part III, I'll give you a few more ways to get rid of (or consolidate) indexes. And, I think there's still a bit more that Paul and I will investigate further (wrt to operational stats) but, I want to address a few comments and a few interesting things that both Paul and I have found.

In the Part I post, I talked about using sys.dm_db_index_usage_stats to see if there are any indexes that just aren't being used at all... A few comments asked why I didn't use operational_stats instead. To address that first, there are a few key differences:

  • dm_db_index_operational_stats is persisted only as long as an object is in cache (however, it's not cleared when objects are forced out of cache with DBCC DROPCLEANBUFFERS). If you want to clear ALL DMVs for a specific database, then a relatively easy way to do this (IN TESTING) is to take the database offline and then immediately bring it online again.
    • ALTER DATABASE <dbname> SET OFFLINE
    • ALTER DATABASE <dbname> SET ONLINE
      • NOTE: If there are any suspect files, you will NOT be able to bring your database back online without FIRST taking all suspect files OFFLINE. And, if you take a FILE offline then it's even more important to know that THERE IS NO WAY TO BRING A FILE ONLINE without restoring it from backups. So, it's VERY important to understand that OFFLINE/ONLINE for a database is really easy IF AND ONLY IF there are no other problems with the DB. You really need to resolve those problems first (or at least know that you're going to need to resolve those problems later through backup/restore) before you take a database offline.
  • dm_db_index_operational_stats is (from BOL) neither persistent nor transactionally consistent. This means you cannot use these counters to determine whether an index has been used or not, or when the index was last used. For information about this, see sys.dm_db_index_usage_stats.

Having said that though, none of these are really any guarantee of perfect information. And, they're not meant to be. I look at these DMVs as being a quick and easy way to get some relatively descent insight into what is or is not happening in my environement. However, even though the sys.dm_db_index_operational_stats might give you insight that you have a problem it still doesn't give you good insight into exactly what that problem might be. For example, it *might* be splits that causes some of your wait times to increase (column: page_io_latch_wait_in_ms) but, it could be something else too (some other system issue).

The main point, you can use these to get insight into which tables have the biggest problems (i.e. the biggest waits) and where they might have a lot of splits (column:nonleaf_allocation_count) but, in all honesty, that's not a guarantee. In fact, the reason I said "might" is that pages that are allocated at the end of the leaf level STILL allocate a page and require an entry to be made in the next level up in the index. So, a lot of nonleaf_allocations COULD be for a perfectly unfragmented index. So, it still doesn't tell you how fragmented the objects are or what the REAL problem is (or even if it is a problem yet).

Basically, these just tell you where you have the most activity and give you a starting point for problem solving. But, none of these (usage or operational) really tell you how to solve the problem. However, sometimes even knowing where to start IS the problem in and of itself. So, I'm not against these DMVs and I really do think you can get some good insight from them. Just use them as a tool to help focus your investigations. Use better tools like sys.dm_db_index_physical_stats to really see if you have fragmentation and where it's the worst.

hth,
kt

This morning, I woke to a wonderful email and I thought I would share it:

Good morning Kimberly, I just wanted to send you an email to let you know how awesome you are! I have applied some of your indexing strategies to our website’s homegrown CMS and we are experiencing a significant performance increase. WOW....amazing tips delivered in an exciting way! 

Thanks again,
Jim

OK, I seriously can't complain about the awesome part ;-)... but, what I think is so cool is that he sent the email. No, I don't expect everyone to send me mail after they implement a tip/trick BUT, it's really nice when I do get an email (especially like this :). There are so many ways to improve performance and some can lead to significant gains (or minimized downtime/data loss, etc.) and this is exactly why we (speakers/RDs/MVPs, etc.) do this stuff!

The long story short is that it does cost you money to go to a conference (or, at least, time to read blogs, etc.) but what we try to do at events like Connections is distill down the key points into our sessions so that you can [hopefully] apply these techniques immediately. It makes for fast-paced sessions and an information packed week but our conference just ended yesterday and I already have email from it. That's just really cool!

So, I know the economy is in a tough state and I know that quantifying the gain of a conference is hard (especially when the cost of the conference is so easy to quantify) but the end result of some of this time away might be more than worthwhile!

THANKS Jim!
kt

Categories:
Conferences | Personal

First and foremost, happy spring! I truly hope we're on the path to summer (although who would know it here - we're in Florida for SQL Connections and the weather is a bit chilly and it's been raining off/on today - I hope this is short term (no, I don't want to look at the forecast as I don't want to jinx it :)). But, wherever you are - I hope you're on your way to nicer weather and minimal cold weather (ok, I guess I only have warm wishes for the northern hem... for you southies - I hope your fall is lovely!!).

But, for everyone - now's a good time to start thinking about cleaning out the [non-literal index] closet... and getting rid of some of those dusty indexes?

Why/when should you get rid of some of your indexes?

  1. It's possible that some of them aren't being used at all...
  2. Especially when they're not being used but even when they're "redundant" (or minimally useful) they're still costing you in many ways:
    1. Wasting space on disk
    2. Wasting space in memory (well, if they are being used then they're cluttering up your cache)
    3. Wasting space/time in your maintenance routines (so, here they're cluttering up your cache for sure!)
    4. Wasting space in your backups
  3. You might be able to reduce your overall indexes with index consolidation...

So, for this post, I'm going to target #1 - are there any indexes that just aren't being used at all...

First, how do you know if your indexes are being used?
In SQL Server 2005 and higher, there's a DMV (dynamic management view) called sys.dm_db_index_usage_stats and it's there to track index usage patterns. However, it's not persisted since the beginning of time and as a result, if you look at this and believe that it's telling you ALL of the indexes that have been used in your database - then you might be mistaken. The index usage stats DMV is cleared when SQL Server is restarted as well as when you detach/attach the database or when you backup/restore the database.

Therefore, you don't want to just run the following query and drop all of the indexes that aren't being used. A better way to "trust" this information is to periodically persist the data from the DMV in your own table and then query it after you've completed a business cycle's worth of activity - logging all of the usage stats. Then you can trust this much more. Again, here are a couple of negatives:
1) it's not persisted
2) it only keeps the database_id and the index_id (which could change over time). You're right in thinking it probably shouldn't change but, a nonclustered index's ID is not permanent so, it's better to track the index name in addition to the index_id. And, when you run your queries to determine what to delete, you can easily verify indexes against the current indexes because your comparison is within the same db (more on this below).

If you want to persist this, then you have two ways to do this:

Store the index usage patterns in a table within the specific database you're tracking:

Pro: it goes with the database when you back it up, etc. and, it's easier to reverse engineer which actual indexes you're referring to (grabbing the names and not just the IDs).

Con: it's a bit more complex of a query to run and you'll need to run it for all of your databases (ok, it's really not all that bad - but, using something like sp_msForEachDB will really help)

Store the index usage patterns from all databases in a table within master or your own "performance database":

Pro: you only need one job to handle all the index usage info AND object_name *does* support TWO parameters (object_id and database_id) so, as long as you trap the name at the time of insert then you'll be good.

Con: it doesn't go with the database (e.g. backup/restore - and if you're restoring to a test system and you want to see what the usage patterns were then you'll need to get this information as well...)

Here's a simple query that you can run that shows all the indexes used right now - and adds the databasename/objectname into the results - in a persisted table you'll also want to add the runtime:

SELECT getdate() AS RunTime
, DB_NAME(i.database_id) as DatabaseName
, OBJECT_NAME(i.object_id, i.database_id) as ObjectName
, *
FROM sys.dm_db_index_usage_stats AS i
WHERE object_id > 100

And, if you want to get a few more insghts into how to persist this on Paul's blog here: http://www.sqlskills.com/BLOGS/PAUL/post/Indexes-From-Every-Angle-How-can-you-tell-if-an-index-is-being-used.aspx.

OK, so, you have a few options to think about and I have a few more parts to post!
kt

PS - The Tipping Point is coming soon too. I'm still adding a few things to that one!!! ;-)

OK, I've been pretty quiet this week... I had high hopes that I'd blog at least once or twice this week but the week turned out to be crazy busy as I presented T-W-Th at the SQL MCM (Microsoft Certified Masters) and then Fr at the Sharepoint MCM. The groups were really fun and wanted to go deeper and deeper into the technologies so on most days, I didn't even end lecture (which started at 8am) until 6:30+. So, it was a looooonnnggg week.

If you're interested, here's a link to the "very intense but fun" MCM programs: Microsoft Certified Master Program

Tomorrow, Paul and I fly out at 6am (to Florida) for SQLConnections. It's a packed week at Connections (a precon, a postcon and 5 sessions between us) and we're really looking forward to seeing many of you there! I'll be back to blogging soon as I know I still owe you the results/answers for the Tipping Point queries.

In the interim, check out Sean's Midnight DBA videos... he seriously needs to get some sleep! ;-)

Cheers to you all!
kt

Categories:
Resources

OK, I was tagged by Ward Pond here. Ward was tagged by Jimmy May here. Jimmy was tagged by Kevin Kline here. Kevin was tagged by Chris Shaw here. Chris was tagged by Michelle Uford here. And, Michelle was tagged by Mike Walsh here.

Ah... at least I know who to blame. Ah, thanks Mike! ;-)

So, as for some things I know now that I wish I knew then and/or what I value the most. First, I took a peek at a lot of the other posts. There are some really good tips and tricks in all of them and so it's hard to come up with something really clever. But, I had to admit the following:

You don't know what you don't know..

Yeah, I know. Then you don't know. So, does it matter?! Actually, yes. There have been times where I wished I had listened better before I jumped to what turned out to be the wrong solution. Sometimes taking more time at design - can REALLY help. This is something that I've been working really hard to prove in a lot of my recent posts. We've all thrown together a schema... got things working... and then found out it wasn't ideal... But, how much can you really know if you don't know the workload? Well, this is where the listening part comes in. Talk to your users (as painful as it may be ;). Talk to the stakeholders. Even if you can't figure out the specific queries that are going to run, give them sheets of paper on which they can draw their own dialogs and tell you what they'd LIKE to see (tell them to visualize what they want to see on the screen - on the paper). This is an ideal way of getting to the meat of what they want/need - without getting lost in the "glitz" of the UI. Even if you only have some insight into what they'll be doing - some is WAY better than none and the more you listen - the better you'll design. The better you design; the better it will scale/perform.

You can't know it all!!!

Give up now. Don't try to know everything - you'll never sleep. I work really hard knowing a lot about a lot of different things - but, I still specialize AND, I'm HAPPY to say "I don't know" (and say it often). In fact, what I pride myself on the most is that I'm never afraid to prototype and/or try something out. One of my favorite sayings for SQL Server is: "Jack of all trades, master of SOME." Know enough to know you need to learn more (because you plan to use that feature)... And, know enough to know when a feature is NOT appropriate. But, don't get lost in all of the minutia. We just don't have time for it. Choose your battles.

And, since I'm starving and I know when to quit. ;-) I'm going to tag a few folks including a couple of NON-SQL folks and hear what they say!! And, Paul was also tagged - check out his post here: http://www.sqlskills.com/BLOGS/PAUL/post/Things-you-know-now.aspx.

And, I'd love to hear what you know now!

Thanks,
kt

Categories:
Opinions | Tips

We've opened up the call for abstracts for the FALL Connections show and Paul's already blogged all about it. Because most of you read both of our blogs, I won't repeat it here but - if you don't read Paul's blog (ah, really - you don't? Why not?? ;-)) then here's the link: http://www.sqlskills.com/BLOGS/PAUL/post/SQL-Connections-Fall-2009-Call-for-Abstracts.aspx.

Thanks and we look forward to your submissions!
kt

Categories:
Conferences | Events

This is a tough topic. It's a big topic and more than any other - I think there are a lot of misunderstandings about what the log is for, why it's so critical and ESPECIALLY when/why it gets extrememly large. Simply put, it gets large when it's not managed correctly. OK, there are times when it can become large - even if it is well managed. But, more often than not, when a transaction log is wildly out of control (orders of magnitude larger than the data itself) it indicates a management/maintenance problem.

There are a lot of places where you can go to find out the technical details behind the transaction log but I'm going to target this blog post to the relatively straightforward easy (no, really easy!!) facts about transaction log maintenance.

What kind of transaction log management is right for YOUR database?

First and foremost, you MUST decide whether or not you need to do log backups. SQL Server *requires* you to make some form of decision. Well, I take that back. They don't tell you anywhere that you need to make this decision but the transaction log can get wildly out of control if you don't (see the next section for more details on this one :)).

Why? Transaction log backups will allow you better recovery options in the event of a disaster. If you create a good backup strategy, you should be able to recover from a disaster very close (possibly even up-to-the-minute) to the time of the disaster. Howevre, you are not required to do log backups. Instead you can do only database-level backups and recover with those. That's fine. There's really nothing wrong with that strategy. However, it does mean that you have a greater potential for data loss. Basically, if you decide that you're doing to do weekly full backups - then you need to be OK with losing everything that's happened since your last full backup. If that's OK, then performing full database backups (and never worrying about the log) is absolutely fine.

However, if you want more granular control and more recovery options (again, possibly even up-to-the-minute recovery - which is transactional recovery up to the time of the disaster), then you MUST add transaction log backups into your disaster recovery strategy.

So, make this decision FIRST:

  1. Am I OK with some data loss? (then you're probably OK with just database-level backups... but, you will need to do something else! be sure to keep reading!!!)
  2. Do I want to minimize data loss to the smallest amount possible? (then you're going to want to AUTOMATE transaction log backups)

But I didn't do anything - why is the log WAY out of control (in terms of size)?

OK, even if you consciously make the decision to ONLY do database-level backups, you are NOT DONE!!! In fact, this is actually what led me to do this post. I found these two (relatively dated but interesting nonetheless) MSDN forum discussions for TFS (Team Foundation Server) databases:

    MSDN Forum discussion "Recommended SQL Maintenance Plan": http://social.msdn.microsoft.com/forums/en-US/tfsadmin/thread/b23f7018-3eaa-4596-96e4-728b02cf6211/ 
    MSDN Forum discussion "Huge log files":
http://social.msdn.microsoft.com/forums/en-US/tfsadmin/thread/605d51f7-23fd-470c-945e-53fa7ed5aa87/

And, I know EXACTLY what happened in ALL of these cases (and MANY more... Paul and I see this ALL the time, in fact). In the "Huge log files" thread, there's a database mentioned (TfsWareHouse) with a 124MB mdf and a transaction log of 61.8GB. It didn't mention whether or not there were other data files but my guess is that there weren't. My guess is that they were completely shocked by why the data portion had grown to a size that's 510 TIMES the size of the database... The reason is actually somewhat simple (no pun intended). If you're not going to do transaction log maintenance (meaning transaction log backups), then you need to tell SQL Server that. (This is the part that's completely unexpected.)

When a database is created, SQL Server runs that database in a "pseudo simple recovery model". (Yes, I know - that didn't help.) What that means is that SQL Server automatically clears inactive records from the transaction log once it knows that it no longer needs them. It no longer needs them to be stored in the log because no one is using the log (i.e. you're not doing ANY backups). However, once you do start to do backups (and, people generally start by doing a full database backup), then SQL Server looks to your recovery model to determine what to do with log records. If the recovery model is set to full (and, yes, this is the default), then SQL Server gives you the "full feature set" with regard to backup/restore. SQL Server is expecting YOU to manage the transaction log by backing it up. Once it's backed up, SQL Server can remove the inactive records from the transaction log (and when you do a transaction log backup, it automatically clears the inactive records by default).

So, there are really two choices - and ONLY two choices here:

  1. Perform transaction log backups as part of your maintenance plan
  2. Change the recovery model to the SIMPLE recovery model so that SQL Server clears inactive transactions from the log automatically

Is there anything else to do for the transaction log? 

Yes! If you decide that you want to do transaction log backups then I would recommend a few things. I'd first recommend reading 8 Steps to Better Transaction Log Throughput and when you decide how large your transaction log needs to be, then also read Transaction Log VLFs - too many or too few?. These two posts will help you to create a more appropriately sized log as well as one that won't be prone to performance problems (such as internal VLF fragmentation).

If you want to learn more about the transaction log, I'd suggest a few of Paul's resources (it's probably because he has such a fantastic tech editor... oh, I'm asking for trouble with this comment!! ;-):

  1. Read Paul’s blog post to his TechNet article on Logging & Recovery. It’s a great article that covers a lot of different aspects of logging. He also did a great short video on why the transaction log grows wildly out of control. Here’s a link to the blog post that pulls all of the TechNet resources together: http://www.sqlskills.com/BLOGS/PAUL/post/TechNet-Magazine-feature-article-on-understanding-logging-and-recovery.aspx.
  2. Read Paul’s blog post to his TechNet article on Database Maintenance. It’s a great overview of all of these maintenance tasks and will give you a good overview of what each one does. Here’s a link to the blog post that pulls all of the TechNet resources together: http://www.sqlskills.com/BLOGS/PAUL/post/TechNet-Magazine-Effective-Database-Maintenance-article-and-August-SQL-QA-column.aspx

OK, so, I think that sums up part III. I think that's the last one in the series for now. I'll go through and explain "The Tipping Point" next. However, I was hoping for more results to my brain teasers (in those two posts)!!

Cheers,
kt

When you decide to rebuild or reorganzie an index, you have an option to set something called FILLFACTOR. This is probably the MOST IMPORTANT thing to understand about index maintenance and reducing fragmentation (especially in databases that are prone to it). Unfortunately, we need to dive into some internals to really understand why this is so helpful...

What is FILLFACTOR?

FILLFACTOR defines how full the leaf level pages of the index are filled when rebuilding or reorganizing an index. The leaf level of an index is the largest structure of an index and it's the most prone to fragmentation. A good analogy would be to use the phone book (ah, loosely for this one) and imagine that new people come to the area. The *entire* phone book needs to get reprinted. Obviously this is terribly expensive and therefore it's not done often (well, we all use the internet anyway...). But, imagine if EACH page of the phone book were only filled to 90%. When new people come to the area we could just add their information - on the specific page where it should reside - and the phone book stays relatively intact and we don't have to kill a bunch of trees (OK, you could argue that you would have had to kill more trees to make a phone book that has 10% whitespace...but, let's not go there right now. :). OK, I know this analogy is weak at best but think back to "a page in the phone book" and think of it representing a "page" within SQL Server (quite literally here - SQL Server actually stores data on 8KB pages and indexes are doubly-linked lists of these 8KB pages).

So, if the pages are full and there's nowhere to put a particular row (let's say someone with a lastname of 'Tripp'), how does SQL Server "insert" this new row? They can't possibly shift all of the data down and rewrite all of the pages AFTER 'Tripp' could they? I suppose they could but it would get REALLY expensive - especially if someone with the last name of 'Anderson' moves to town...

How does SQL Server make room for new rows? They do something called a page split. Now, this can be A LOT more technical than I'm going to explain here but simply put, they split the page (on which the row is to be located) into half. The first 50% of the rows STAYS on the current page and the 50% MOVES to a new page (and this new page is probably nowhere near the page being split). However, they keep this list LOGICALLY in order by maintaining a doubly-linked list. So, SQL Server knows where to go when it's reading the data - even if the pages are not physically/sequentially in order. If this happens once or twice, it's not a huge problem. But as it continues to happen, it makes scans of your data more and more expensive and the process of the split can be expensive in and of itself. The biggest problems are:

  1. The data is not physically in order (and can create lot random IOs when scanning)
  2. The pages are no longer very dense (they've gone from 100% to 50%). Splitting does has a positive side in that it's going to take awhile before the page splits again BUT, splitting is meant as a temporary fix and ideally, it shouldn't be happening to every page. But, this also negatively impacts cache - you're only using 50% of the page that's split so now you're wasting half of that page in memory (and, on disk). So, while making a table 10% wider from the start seems bad; it reduces the need for splitting and therefore your table will stay more intact over time. 

What indexes have fragmentation?

Some indexes are prone to more fragmentation than others. For example, indexes with a high-order element (the first column of the index) that is NOT in ever-increasing order. This does NOT mean that ALL indexes should be created such that their first column is ever-increasing (that's NOT a good way to create indexes!). But, it does mean that you should pay special attention to some of your larger tables and therefore larger indexes (meaning ESPECIALLY the clustered index since the clustered index is over ALL of the columns). 

In terms of index management, here are the main things to think about for each index type: 

Rebuilding the clustered index is the most expensive index to rebuild. It's also the one that you'd rather avoid if you can. It has the most limitations... A clustered index CANNOT be rebuilt as an online operation IF the table has ANY LOB columns in it at all. What this means is that the clustered index - if fragmented - must be rebuilt using an offline operation. This translates into downtime for that table - while it's being maintained. So, if you can create a good clustering key EARLY, then you might be able to minimize clustered index maintenance.

Rebuilding a nonclustered index is much easier and likely to be an ONLINE operation. Because nonclustered indexes contain a much smaller number of columns - and rarely ever include a LOB column - they can be managed/maintained through ONLINE index operations. So, even if these are prone to more fragmentation (and, in general, they are) then it's not as big of a deal.

How do I solve this problem?

So, the BEST thing to do is to visually picture the ORDER of the data in the index (just think of it like a phone book - a LIST of records on PAGES - ordered by the KEY of the index). If you can visualize that structure then you can visualize what happens on an INSERT, UPDATE and a DELETE. Once you have a good visual you should be able to define a FILLFACTOR based on these two things:

  1. How much fragmentation is happening (is it becoming massively fragmented quickly) - then set a lower fillfactor (like 70 or 80 - depending on how big the table is)
  2. How frequently are you going to be running maintenance (if you can do this daily or weekly - for some objects) then you might not need much free space. If you can run maintenance against the table frequently then you won't need as much free space.

And, of course, you can directly check the level of fragmentation your table has by using the DMV: dm_db_index_physical_stats (which I referenced in Part I) and ideally, you'll create a procedure that runs through your objects determining how much fragmentation they have and the appropriate FILLFACTOR. Sometimes the best way to do this is to create your own "management table" which lists all of your tables and then have definitions for when to rebuild, when to reorg and what fillfactor to use. There are a few folks that have done things like this. Check out Ola's site here. He's put a lot of effort into a more robust and flexible combination of options for database maintenance. It might be more than you need but it's a GREAT EXAMPLE of all of the things that you can evaluate and consider wrt to maintenance.

Solving this using Database Maintenance Plans

Finally, if you CHOOSE to use a maintenance plan you ONLY have two options:

Reorganize pages with the default amount of free space
Does NOT sound like what it actually does (IMO). It sounds like it uses some form of SQL Server default value. However, it actually uses the "OrigFillFactor" value that's set for the index. This is actually a good thing IF you've set the FILLFACTOR on the CREATE INDEX statement OR on a rebuild. Reorganizing does NOT allow you to set this ONLY a rebuild does. So, if you have set this then this choice is probably best. If you run a query against sysindexes (yes, the backward compatibility view NOT the new sys.indexes catalog view), then you can see the OrigFillFactor setting for ALL of your indexes. (And, it's probably 0 - which sounds strange... 0 actually means 100... I could explain this but just trust me on this one.)

Change free space per page percentage to _____ %
Well, this is certainly an option you can consider. If you've never set the FILLFACTOR (or, even if you have...), this would overwrite it AND change the OrigFillFactor value to the percent you use here. If you decide to take this "sledgehammer" approach then I'd suggest something quite high - like 90-95% so that you don't end up wasting a lot of space. Some indexes (ones that have a high-order element that IS ever-increasing) don't really need free space and this might be completely wasted. So, I think it's better to set this indivually using stored procedures and/or a more custom maintenance procedure.

NOTE/TIP FROM COMMENTS POSTED: While I explain this concept (above) using the index option/setting (called FILLFACTOR) the UI is wanting the FREE SPACE percentage. So, if you want to use 90-95% FULL then be sure to set this value to 5-10%. Using the syntax, you will set page fullness.

OK, I still have at least 2 more in this series and EVEN THOUGH I've been completely sucked into facebook (as Paul blogged here), I'm still trying to manage to work and save SQLskills. Paul, on the other hand, just uploaded more photos. ;-)

Cheers,
kt

OK, it seems as though there's A LOT of confusion about what steps are required for proper database maintenance. And, it seems as though some recommendations are being given as "quick fixes" without any real recommendation for root cause analysis. I'm not saying that the generalizations are horribly wrong but in many cases they're just too broad and/or unspecific to actually be useful (and, well, in all honesty, some of them are just really bad recommendations because they’re so ambiguous). And, in my random internet trolling for the day, I found 4 different references that I want to go through (which is why this is only Part I). For this post, I’ll focus just on Sharepoint.

First, what did I see that’s motivating this post?
I found the following KB article – which was referenced by numerous sites as recommended reading. And, without knowing a lot about SQL (and, that’s NOT meant as a dig at all – it’s just that most apps that sit on SQL don’t ever even recommend that you need to know SQL and I can argue certain aspects of that point as well BUT, wrt to maintenance, it can really become a problem if you don't know a few things about these tasks), it does seem like good reading: Information about the Maintenance Plan Wizard in SQL Server 2005 and about tasks that administrators can perform against SharePoint databases

Here is the part that over-simplifies picking what maintenance tasks to run vs. what not to run:

DIRECTLY TAKEN FROM THIS KB ARTICLE IS THE FOLLOWING:

We have tested these tasks and the effects that these tasks have on database schema and performance. The following table summarizes the results of these tests.

Task                                        Safe to perform this task?
Check database Yes
Reduce a database Yes
Reorganize an index Yes
Clean up the history Yes
Update statistics Yes. However, this task is unnecessary because the SharePoint Timer service performs this task automatically.
Rebuild an index No. The task does not restore existing index options before the rebuild operation. However, you can use scripts that restore index options.
Note This problem was corrected in SQL Server 2005 Service Pack 2.

We used the following criteria to determine whether a task was safe to perform:

  • Whether the task modified the database schema from its default state
  • Whether the task decreased performanceResults may vary depending on the environment.

However, if you use the Maintenance Plan Wizard to perform the tasks that are listed in the table as "safe to perform," you are likely to experience increased performance in SQL Server 2005.

The big problem is: this is just too little information about too many VERY important tasks!

Let me break this down task by task and give you a few other places to go for more information.

Check database

The check database task refers to DBCC CHECKDB. This is definitely an important part of any maintenance plan. And, it really is a safe task to run as it’s NOT corrective by default. However, there is nothing mentioned about how this command may completely flush your buffer pool as it reads all of the pages of all of the objects it’s checking. So, this might impact performance but, of all of the tasks, this is the safest to run and it’s definitely a recommended task.

If you want to learn more about DBCC CHECKDB, check out these things: 

  1. Read Paul’s blog post to his TechNet article on Database Maintenance. It’s a great overview of all of these maintenance tasks and will give you a good overview of what each one does. Here’s a link to the blog post that pulls all of the TechNet resources together: http://www.sqlskills.com/BLOGS/PAUL/post/TechNet-Magazine-Effective-Database-Maintenance-article-and-August-SQL-QA-column.aspx  
  2. Read Paul’s blog post on Myths around causing corruption – so that you can get better insight into where/why the actual corruptions are occurring. 
  3. Finally, if you’re really interested in the internals of CHECKDB and how it works – Paul’s written a ton about it in his CHECKDB from Every Angle category. FYI, 3 of his 9 years on the SQL Server Development Team were spent writing CHECKDB and repair for SQL Server 2005 (so, he definitely knows how it works J). Here’s the link to the category: http://www.sqlskills.com/BLOGS/PAUL/category/CHECKDB-From-Every-Angle.aspx

Reduce a database 

OK, I’m sure I’ll get a lot of responses to this one but IMO, a database maintenance plan SHOULD NEVER INCLUDE A SHRINK.

Let me explain… J
To be honest, I'm not even a fan of manually running database-level shrinks (DBCC SHRINKDATABASE) either. Don't get me wrong - there are ACCEPTABLE times to shrink parts of a database but, in general, I'd recommend only using DBCC SHRINKFILE for individual file-level shrinks. I wouldn't schedule shrinks nor would I EVER turn on [the database option] autoshrink. I don't think shrinks should EVER be automated - either through the database option OR through maintenance plans.

If you need to do regular shrinks - then it's likely that you have some other problem. And, without DIRECTLY addressing this problem, you *might* be making things worse.

This is a bigger discussion and I’ve found a few other references that I want to pull together. I’ll post another post about this within the next day or so – and link to it from here BUT, for right now…Know this – free space is generally GOOD. Excessive free space has happened FOR A REASON. Maybe there’s a pattern to it but often shrinking is worse than just leaving the free space for the next data explosion (a bunch of data comes in, the database grows, the data is archived, the free space remains for the next set of data that comes in).

If you shrink the database you might make things worse by fragmenting everything. Paul’s video that goes with the TechNet article on Database Maintenance shows you the [shocking if you didn’t know this] effect of shrinking a database on indexes.

Reorganize an index, Update statistics and Rebuild an index

These need to be grouped together to start because this KB article does NOT address the impact of running these together. In fact, the problem, if you run these together – is that you MIGHT make things worse. First, let me give you an overview of each:

Reoganizing an index removes fragmentation in the largest part of an index (it’s called the leaf level of the index) and removing fragmentation in this level has the greatest (and positive) effect on range query scans and cache. So, this is really the most important type of fragmentation to remove. However, this is NOT the only way to do it… 

Rebuilding an index completely and totally removes ALL forms of fragmentation in all levels of an index; however, this is the most expensive (yet most effective) way to do it. As a result of rebuilding an index, SQL Server also updates the statistics for the indexes that were rebuilt. Therefore you do not need to update statistics OR reorganize an index if it gets rebuilt.

Updating statistics is important for query processing and optimization. The query processor uses statistics on your data to help determine how many rows will be processed by your query/statement. If SQL Server can accurately estimate the rows, then it can choose a more effective plan. However, if it doesn’t have good statistics, then it may not do as good of a job at accurately estimating rows and therefore it might not come up with as optimal of a plan. So, this is an integral part to good database health. However, some of this might be done via the database option: auto update statistics which is ON by default in SQL Server (and, YES, you should leave this on). Check out this post on: Auto update statistics and auto create statistics - should you leave them on and/or turn them on??

However, if you use a maintenance plan then I really see two problems: 

  1. You’ll end up doing maintenance on things that may not need it. The default behavior for these tasks is just to run them on the selected objects. And, since many people will choose all objects (possibly even of all databases) then you’ll probably select objects that won’t really need this as frequently as you run this maintenance plan. 
  2. You might end up running a combination of things that either – wastes cycles/CPU and a MASSIVE amount of log space (which can translate into all sorts of concerns for DR technologies like database mirroring which will need to send all log rows to the secondary server). For example, if you run ALL three of these things then they’ll have to be run in a certain order (you can change this in a maintenance plan). However, the default order is: Reorganize Index(es), Rebuid Index(es), Update Statistics. This means that the work that’s done by reorganizing is effectively wasted as the rebuild would have taken care of it AND the work that’s done for updating statistics could mean that they update statistics TWICE (during the rebuild AND after) and the end result is WORSE because the update statistics command might use a sampling mechanism to generate statistics (which can lead to LESS EFFECTIVE statistics information). However, this is ONLY if you change the wizard’s default. The default is for the updating statistics command to do a “full scan”. So, even if the statistics will end up being the same – it’s still problematic because it means that for all indexes you’ve just rebuilt – you’ve now updated their statistics TWICE.

SUMMARY

A database maintenance plan is CRITICAL for best performance (especially for databases that are prone to some of the problems corrected by these maintenance tasks (yes, you can read SHAREPOINT into that statement). Sharepoint uses GUIDs as PRIMARY KEYs (read this post to hear more about the side effects of this choice) and as a result, as clustering keys. This means that many Sharepoint tables are prone to [potentially a MASSIVE amount of] fragmentation.

You absolutely need to have a maintenance plan. But, what should it be?

My absolute preference is to NOT use the Database Maintenance Plan Wizard UNLESS you really know what you’re doing. It just doesn’t give enough prescriptive advice. And, if you just select the defaults, you will end up with an inoptimal maintenance plan.

A better approach would be to create your own maintenance plan. If you write the code yourself (or leverage one of the custom ones that are already out on the web) then you can strategically target ONLY the objects that have the warning signs and/or are out of date and you can set when to rebuild vs. when to reorganize (generally people rebuild if a table has more than 30% fragentation and they reorganize when it's less than 30%). Fragmentation is something that can be detected programmatically using the DMV: dm_db_index_physical_stats (in SQL 2005/2008) or by using DBCC SHOWCONTIG (in SQL 7.0/2000). Here are a few places to go to see the more flexible and programmatic way of rebuilding/reorganizing indexes:

Smart Indexing Part II - Conditional Rebuilding a blog post (with conditional index rebuild code) from SQLMCA Bob Duffy (a good friend who is located in Dublin, Ireland and whose wife (Carmel) just had a baby last week – congrats again Bob!! You guys are seriously outnumbered now!!!) here: http://blogs.msdn.com/boduff/archive/2007/06/08/smart-indexing-part-ii-conditional-rebuilding.aspx

Custom Index Defrag / Rebuild Procedures - a blog post with some posted code as well. http://www.sqlstuff.info/post/2008/03/Custom-Index-Defrag--Rebuild-Procedures.aspx

Rebuild and Reorganize Indexes in SQL 2005 – an article (with conditional index rebuild code) from SQL Server Central here: http://www.sqlservercentral.com/scripts/31857/  (NOTE: You will need to become a subscriber to get to this article.)

Rebuild Only the Indexes that Need Help - an article by Andrew Kelly (SQL MVP) on SQL Server Magazine here: http://www.sqlmag.com/articles/index.cfm?articleid=99019&pg=1 (NOTE: You will need to become a suscriber to get to the full text of the article.)

Or, build your own! Check out the BOL topic for the sys.dm_db_index_physical_stats for SQL 2005 here: http://msdn.microsoft.com/en-us/library/ms188917(SQL.90).aspx, Example D has sample code to help you get started! For SQL 2008 it’s here: http://msdn.microsoft.com/en-us/library/ms188917.aspx. It’s still Example D for the sample code to leverage. J

The most important thing I can tell you is that a SMALL amount of time getting familiar with what’s really happening in SQL as well as WHY it’s happening to you WILL BE A LOT MORE PRODUCTIVE then just slamming in a maintenance plan that solves some problems but probably creates others.

Hope this helps!
kt

OK, so this is interesting. I've got a few answers to my last survey (Tipping Point Query #1) and well, there's a good mix of answers (and, yes, some are correct! ;)). Be sure to go back and review that last post so that you can evaluate it and these two tipping point questions completely. So, now I want to see if people really know the basis of "the tipping point".

Try these two:

Tipping Point Query #2

Table1 (t1) has 1 million rows at 100 rows per page. The table has 10,000 pages. A nonclustered index exists (on name) but it does not cover the query. At what percentage (of the table) is this nonclustered index no longer selective enough to use:

Tipping Point Query #3

Table2 (t2) has 1 million rows at 2 rows per page. The table has 500,000 pages. A nonclustered index exists (on name) but it does not cover the query. At what percentage (of the table) is this nonclustered index no longer selective enough to use:

OK, so I'd really love to see quite a few responses to these *3* "tipping point" questions. I PROMISE to do a nice long (and detailed) post for what is the actual tipping point AND the answers to all three of these questions. I'll explain the math as well as how you can generalize "what is selective enough" so that you can better create your nonclustered indexes!!!

Thanks for reading - and responding to these brain teasers!!

Cheers,
kt

PS - It's snowing here (ah...again)... maybe I'll spend the day creating brain teasers??! Do you guys like this kind of a post? (well, I suppose you won't really know until I post the answer part of it... but, just in general??). I think it's pretty cool. But, don't worry, I won't (nor will Paul) make all of my posts surveys. But, I think this is a really good one. I'm anxious to see if the asnwers come in correctly for these two as well! Have at it!

Along the same lines of improving database design and getting better performance on SQL Server (which [IMO] DOES take an experienced SQL Server database developer - but, we'll talk more about "whose job this really is" in many more posts and probably even a RunAs - which Richard and I just setup to record on Thursday (Mar 12)), I started thinking about how I could convince people of why they NEED a database developer. So, I thought I'd ask this VERY important question...

What percentage of data IS selective enough to use a nonclustered index which doesn't cover the query... in other words (just in case you're not entirely sure of what I mean :)), think of indexes in the back of a book... if you need to go to the back of the book to reference a bunch of data (this is called a [bookmark] lookup in SQL Server), there's a point where the randomness of the lookups (especially if you think in terms of many rows on a page) becomes too expensive. For example, imagine that the index is customer name and the data (the book) is customer orders - and, each page (of this rather weird book ;)), has 20 orders on it. Doing a query to lookup customer number 12's orders might be really easy (if they only have only a few orders) BUT, what if the query is "show me all of the orders for people that have an 'e' in their name". First, the number of people have have an 'e' in their name is probably better than 50% (that's TOTALLY a guess) and, if there's 20 orders per page then a lookup from the index into the book would require SQL Server to touch every page roughly 10 times. If the table has 50,000 pages (therefore 1 million rows - at 20 rows per page), then to find the 500,000 rows (remember, I'm estimating half), SQL Server would have to do 500,000 bookmark lookups. For a table with only 50,000 pages that's terribly expensive.

So, here's the question - what's the tipping point? When is a nonclustered index on customer name NOT going to be used to lookup rows of sales orders? I'm going to use a survey to see what you think and then within a week, I'll give the specific SQL Server math AND a query you can run within your own DBs to see EVERY one of your table's "tipping points". It's really interesting and I think will really help you to understand why SQL Server might not be using those nonclustered indexes.............

Cheers,
kt

Expanding on the topic of "are you kidding me"... one of the MOST PREVALENT problems I see today is the dreaded "GUIDs as PKs" problem. However, just to be clear, it's not [as much of a] problem that your PRIMARY KEY is a GUID as much as it is a problem that the PRIMARY KEY is probably your clustering key. They really are two things BUT the default behavior in SQL Server is that a PRIMARY KEY uses a UNIQUE CLUSTERED INDEX to enforce entity integrity. So, I thought I'd take this post to really dive into why this is a problem and how you can hope to minimize it.

Relational Concepts - What is a PRIMARY KEY? (quick and basic reminder for what is what and why)

Starting at the very beginning... a primary key is used to enforce entity integrity. Entity integrity is the very basic concept that every row is uniquely identifiable. This is especially important in a normalized database because you usually end up with many tables and a need to reference rows across those tables (i.e. relationships). Relational theory says that every table MUST have a primary key. SQL Server does not have this requirement. However, many features - like replication - often have a requirement on a primary key so that they can guarantee which row to modify on a related database/server (like the subscriber in a replication environment). So, most people think to create one. However, not always...

What happens when a column(s) is defined as a PRIMARY KEY - in SQL Server?

The first thing that SQL Server checks is that ALL of the columns that make up the PRIMARY KEY constraint do not all NULLs. This is a requirement of a PRIMARY KEY but not a requirement of a UNIQUE KEY. They also check to make sure (if the table has data) that the existing data meets the uniqueness requirement. If there are any duplicate rows, the addition of the constraint will fail. And, to check this as well as to enforce this for [future] new rows - SQL Server builds a UNIQUE index. More specifically, if you don't specify index type when adding the constraint, SQL Server makes the index a UNIQUE CLUSTERED index. So, why is that interesting...

What is a clustered index?

In SQL Server 7.0 and higher the internal dependencies on the clustering key CHANGED. (Yes, it's important to know that things CHANGED in 7.0... why? Because there are still some folks out there that don't realize how RADICAL of a change occurred in the internals (wrt to the clustering key) in SQL Server 7.0). It's always (in all releases of SQL Server) been true that the clustered index defines the order of the data in the table itself (yes, the data of the table becomes the leaf level of the clustered index) and, it's always been a [potential] source of fragmentation. That's really not new. Although it does seem like it's more of a hot topic in recent releases but that may solely because there are more and more databases out there in general AND they've gotten bigger and bigger... and you feel the effects of fragmentation more when databases get really large.

What changed is that the clustering key gets used as the "lookup" value from the nonclustered indexes. Prior to SQL Server 7.0, SQL Server used a volatile RID structure. This was problematic because as records moved, ALL of the nonclustered indexes would need to get updated. Imagine a page that "splits" where half of the records are relocated to a new page. If that page has 20 rows then 10 rows have new RIDs - that means that 10 rows in EACH (and ALL) of your nonclustered indexes would need to get updated. The more nonclustered indexes you had, the worse it got (this is also where the idea that nonclustered indexes are TERRIBLY expensive comes from). In 7.0, the negative affects of record relocation were addressed in BOTH clustered tables and heaps. In heaps they chose to use forwarding pointers. The idea is that the row's FIXED RID is defined at insert and even if the data for the row has to relocate because the row no longer fits on the original page - the rows RID does not change. Instead, SQL Server just uses a forwarding pointer to make one extra hop (never more) to get to the data. In a clustered table, SQL Server uses the clustering key to lookup the data. As a result, this puts some strain on the clustering key that was never there before. It should be narrow (otherwise it can make the nonclustered indexes UNNECESSARILY wide). The clustering key should be UNIQUE (otherwise the nonclustered indexes wouldn't know "which" row to lookup - and, if the clustering key is not defined as unique then SQL Server will internally add a 4-byte uniquifier to each duplicate key value... this wastes time and space - both in the base table AND the nonclustered indexes). And, the clustering key should be STATIC (otherwise it will be costly to update because the clustering key is duplicated in ALL nonclustered indexes).

In summary, the clustering key really has all of these purposes:

  1. It defines the lookup value used by the nonclustered indexes (should be unique, narrow and static)
  2. It defines the table's order (physically at creation and logically maintained through a linked list after that) - so we need to be careful of fragmentation
  3. It can be used to answer a query (either as a table scan - or, if the query wants a subset of data (a range query) and the clustering key supports that range, then yes, the clustering key can be used to reduce the cost of the scan (it can seek with a partial scan)

However, the first two are the two that I think about the most when I choose a clustering key. The third is just one that I *might* be able to leverage if my clustering key also happens to be good for that. So, some examples of GOOD clustering keys are:

  • An identity column
  • A composite key of date and identity - in that order (date, identity) 
  • A pseudo sequential GUID (using the NEWSEQUENTIALID() function in SQL Server OR a "homegrown" function that builds sequential GUIDs - like Gert's "built originally to use in SQL 2000" xp_GUID here: http://sqldev.net/xp/xpguid.htm

But, a GUID that is not sequential - like one that has it's values generated in the client (using .NET) OR generated by the newid() function (in SQL Server) can be a horribly bad choice - primarily because of the fragmentation that it creates in the base table but also because of its size. It's unnecessarily wide (it's 4 times wider than an int-based identity - which can give you 2 billion (really, 4 billion) unique rows). And, if you need more than 2 billion you can always go with a bigint (8-byte int) and get 263-1 rows. And, if you don't really think that 12 bytes wider (or 8 bytes wider) is a big deal - estimate how much this costs on a bigger table and one with a few indexes...

  • Base Table with 1,000,000 rows (3.8MB vs. 15.26MB)
  • 6 nonclustered indexes (22.89MB vs. 91.55MB)

So, we're looking at 25MB vs 106MB - and, just to be clear, this is JUST for 1 million rows and this is really JUST overhead. If you create an even wider clustering key (something horrible like LastName, FirstName, MiddlieInitial - which let's say is 64bytes then you're looking at 427.25MB *just* in overhead..... And, then think about how bad that gets with 10 million rows and 6 nonclustered indexes - yes, you'd be wasting over 4GB with a key like that.

And, fragmentation costs you even more in wasted space and time because of splitting. Paul's covered A LOT about fragmentation on his blog so I'll skip that discussion for now BUT if your clustering key is prone to fragmentation then you NEED a solid maintenance plan - and this has it's own costs (and potential for downtime).

So............... choosing a GOOD clustering key EARLY is very important!

Otherwise, the problems can start piling up!

kt

Theme design by Nukeation based on Jelle Druyts