SSMU Web Seminar: Performance Tuning Series
Part II: Optimizing Query Performance

Kimberly L. Tripp

 

Q & A

 

Q: Is there a reason that the query optimizer doesn't manipulate queries to isolate the column name for you?

AND

Q: So, just to confirm, regarding SARGs like LEFT(MyField,5) will that always scan?  Is there no way to exploit indices?

Both of these questions are similar enough that I'll give one answer for both… This is historical behavior; EVERY release of SQL Server, and even a few competing products, behaves this way. The primary reason is that NOT all queries can be rewritten automatically with a guarantee of producing the same results. Yes, the rewrite is fairly obvious for certain types of operations – especially mathematical ones – but the server takes the "safe" route at the expense of performance. At least that's how I've always understood it, and SQL Server has always done this…
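For the mathematical case, here's a sketch of the kind of rewrite you can do by hand (the table and column names are hypothetical):

```sql
-- Computation on the column forces a scan:
SELECT * FROM Orders WHERE Total * 1.1 > 100

-- Algebraically equivalent, but the column now stands alone,
-- so an index on Total can be used:
SELECT * FROM Orders WHERE Total > 100 / 1.1
```

The optimizer won't do this algebra for you – you have to isolate the column yourself.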

 

If you have ANY computation on the column, SQL Server will absolutely force a scan. I took the specific case listed – LEFT(LastName, 5) = 'Tripp' – and compared it to LastName LIKE 'Tripp%', and the performance difference was staggering!

 

Here is the result:

SELECT * FROM Member
WHERE LEFT(LastName, 5) = 'Tripp'
-- Causes a TABLE SCAN
-- Performs 143 I/Os (small table, luckily :))
-- Showplan estimate: 92.71% of the overall cost between these two queries

SELECT * FROM Member
WHERE LastName LIKE 'Tripp%'
-- Uses an Index
-- Performs 9 I/Os
-- Showplan estimate: 7.29% of the overall cost between these two queries

 

Simple rule – try rewriting SARGs to isolate the column, and get users to understand these benefits!

 

Q: Does it matter if you isolate column on left or right?

No. As long as the column stands alone on one side of the operator, the optimizer can treat it as a SARG regardless of which side it's on.
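In other words, both of these (using the Member table from above) are equally SARGable:

```sql
-- The column may appear on either side of the operator
SELECT * FROM Member WHERE LastName = 'Tripp'
SELECT * FROM Member WHERE 'Tripp' = LastName
```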

 

Q: If "*" is slower, would selecting a constant (like "1") work even better? To clarify, in an EXISTS subquery, would selecting a constant value (like "1") work better than selecting "*"?

I’ve had this question before! So – I’ll post my previous questions/answers. These are from the Indexing Session’s Q&A.

 

What about using "*" in this case: IF EXISTS (SELECT * FROM Users WHERE UserID = X)?

Another very common question, and it also relates to some "folklore" that seems to be going around… An existence test does NOT return columns, so the * (or a constant, for that matter) is irrelevant. In the ANSI standard, an existence test is introduced with * – this is the preferred syntax and does NOT cause any performance problems. There is no benefit to using a constant.
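A sketch of both variants, using the Users table from the question (the variable value is illustrative):

```sql
DECLARE @UserID int
SET @UserID = 42   -- illustrative value

-- ANSI-preferred existence test; no columns are actually returned
IF EXISTS (SELECT * FROM Users WHERE UserID = @UserID)
    PRINT 'Found'

-- A constant behaves identically; there is no benefit
IF EXISTS (SELECT 1 FROM Users WHERE UserID = @UserID)
    PRINT 'Found'
```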

 

What about using * in COUNT(*)?

Ah – this used to be a big one… People would often put in a column, and SQL Server would be forced to read that data – even when the column did not allow NULL values. However, SQL Server 2000 recognizes that COUNT(*) – and COUNT(column) where the column does not allow NULLs – is just a row count. When it needs a row count, it knows the cheapest place to find it: a leaf-level scan of the smallest nonclustered index. So yes, * is just fine here; in fact, it is preferred.
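A sketch against the Member table (MemberNo is a hypothetical NOT NULL column):

```sql
-- Both are recognized as a simple row count in SQL Server 2000;
-- each can be satisfied from the smallest nonclustered index
SELECT COUNT(*) FROM Member
SELECT COUNT(MemberNo) FROM Member   -- MemberNo does not allow NULLs
```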

 

OK – you guys made some good points… * is not ALWAYS evil; in these cases it's just fine. I'd also add that it's appropriate when you're selecting from something that already limits the column list (i.e., a view, stored procedure, or table-valued function). I'm sure we could come up with a few more… just remember that less is more when it comes to optimization!

 

Q: So in general you weigh cost (execution plan-wise) more than I/O?

This is a great question! Yes, I usually do – as does SQL Server… SQL Server uses a "cost-based" optimizer whose primary goal is to make the "overall" cost of a query's plan as low as possible. That means it has to account for I/O, memory, and other resources; a plan that appears to have fewer I/Os may be more costly in terms of temporary resources. The Showplan output gives you the "overall" estimated cost – a better view of the whole picture – but more than anything you need to look at all the numbers. When I/O counts are within a few thousand of each other, the difference probably isn't significant; a few thousand logical I/Os can be performed in milliseconds.

 

Q: What columns to add when looking for scans in Profiler?  Can I filter for specific tables?

Well – you'd think you could just add "ObjectName" as a data column and all would be well… but it's not that simple. Some events don't produce values for every data column. In Books Online there are topics for the Profiler event types, and each one lists the data columns for which values are provided. Here is a BOL link: mk:@MSITStore:C:\Program%20Files\Microsoft%20SQL%20Server\80\Tools\Books\adminsql.chm::/ad_mon_perf_8h2x.htm. Specifically, the section is titled "Scans Event Category."

 

To see a list of all of the categories, with links to their data column lists, check out the link mk:@MSITStore:C:\Program%20Files\Microsoft%20SQL%20Server\80 titled "Monitoring with SQL Profiler Event Categories."

 

Here are the KEY things I would do to get better information about SCANs:

- Make sure you use FILTERS – one that you REALLY need if you use Profiler to look at scans is "Exclude system IDs" (the checkbox on the Filters tab). This will SIGNIFICANTLY reduce the excessive information produced.

- As for data columns – you must add ObjectID, NOT ObjectName. Unfortunately, filtering here is a bit harder, BUT you can use either a system table OR a system function to determine your tables' IDs. The system table is sysobjects and the function is object_id(). Here are two sample queries:

SELECT Name, ID FROM sysobjects WHERE type = 'U'

-- or to get just the ID for a specific table

SELECT object_id('member')

- Finally, make sure you set a filter for the specific database you want to profile.
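To get the DatabaseID value for that filter, you can use the db_id() function (the database name here is illustrative):

```sql
-- Returns the ID to use in Profiler's DatabaseID filter
SELECT DB_ID('MyDatabase')
```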

 

Q: We have 5,000 to 8,000 I/Os per second at peak. Unsure how much of that is from INSERTs? These inserts go into a heap.

Not sure where you're going with this one, other than that you're trying to decide whether you'd be better off with a clustered index instead of a heap. I'd need more info to really decide – mainly because you might be exceeding the 400-inserts-per-second rule here. BUT typically, insert performance is better with a clustered index than with a heap (the exception is bulk loading into an empty heap – it's better to load into an empty table and create the indexes after the load than to load into an already-indexed table). So, in your specific case, I would set up a reproducible test environment. I would:

  1. Back up the database and load thousands of rows (via simulated application clients and/or osql executions) into the heap
  2. Restore the database, make the table clustered – on an identity column, just to start – and then try loading the same thousands of rows. If performance is worse, try one other option...
  3. Re-restore the first version of the database and make the table clustered on a combination of columns that creates multiple insertion points – possibly a foreign key column plus the primary key column, where the foreign key column has only 10-20 distinct values (i.e., lots of duplicates). I would aim for something that creates 10-20 hot spots. Finally, try loading the same thousands of rows. If performance is still worse, you might be one of the rare cases where the heap structure is better than a clustered table
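The clustering steps in 2 and 3 might look like this (table and column names are hypothetical):

```sql
-- Step 2: cluster on an identity column, just to start
ALTER TABLE MyHeap ADD MyHeapID int IDENTITY NOT NULL
CREATE CLUSTERED INDEX MyHeapCL ON MyHeap (MyHeapID)

-- Step 3 (on a re-restored copy): cluster on a low-density foreign key
-- plus the primary key; if RegionID has only 10-20 distinct values,
-- inserts spread across 10-20 hot spots
CREATE CLUSTERED INDEX MyHeapCL ON MyHeap (RegionID, MyHeapPK)
```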

 

But I'll stress that insert performance is only ONE piece of the puzzle – a heap has other issues. For example, scans are likely to be VERY expensive if updates have created forwarding pointers. You might want to find out how many pages the table has and then run a query that scans it, to see how many forwarding pointers you have.

To determine pages:

     EXEC sp_spaceused 'tname', 'true'

Take the data value (reported in KB) and divide it by 8 to get the number of pages.
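For example, with the Member table from earlier:

```sql
EXEC sp_spaceused 'Member', 'true'
-- Suppose the data column comes back as 1144 KB:
-- 1144 KB / 8 KB per page = 143 pages
```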

 

Execute:

SET STATISTICS IO ON

SELECT count(*) FROM tname WITH (index (0))

 

If the number of pages scanned by the SELECT is higher than the number of pages calculated from sp_spaceused (the data value divided by 8), then you have forwarding pointers – possibly lots of them. Each one adds unnecessary overhead to a scan. There are many pros and cons here – this is just one example – but a clustered table wouldn't have forwarding pointers.

 

Finally, if you want to estimate the number of I/Os needed for each insert, use SET STATISTICS IO in Query Analyzer and run the same insert process while capturing the I/Os.

 

Q: When I see Clustered Index Scan, I have NO IDEA how to coerce it instead to Seek?

The easiest way to start is to highlight the query and run it through the Index Tuning Wizard (ITW). This works very well on most queries. When ITW doesn't give any advice and you still want to move forward manually with implementing indexes, you'll need to evaluate quite a bit about the query to get the "right" index(es) and the right balance of indexing pros versus maintenance-and-overhead cons. There's a lot of information needed to fully answer this, so I'd recommend the online web seminar series on Indexing: Internals, Performance and Maintenance – as this is what I talked about for four sessions! Here are the links:

Indexing for Performance, Part I: Index Overview and Internals http://www.sqlmag.com/SSMU/Seminars/Index.cfm?Event_ID=18

 

Indexing for Performance, Part II: SARGs and Joins http://www.sqlmag.com/SSMU/Seminars/Index.cfm?Event_ID=19

 

Indexing for Performance, Part III: Aggregates and Indexed Views http://www.sqlmag.com/SSMU/Seminars/Index.cfm?Event_ID=20

 

Indexing for Performance, Part IV: Index Maintenance http://www.sqlmag.com/SSMU/Seminars/Index.cfm?Event_ID=21

 

Q: I'm getting a lot of institutional resistance to views.  Can views negatively impact performance?

This is becoming a frequently asked question! I'm hearing a lot of comments like this these days, and I have a feeling it has to do with rumors FLYING all over the place that SQL Server creates temp tables for views when they are used within other queries… People also say that views add unnecessary overhead. Ugh! There are cases where views used improperly can cause unnecessary work, but generally a view does not add any cost versus executing the same query directly. Views are not usually the problem – it's how the views are used that can be problematic. In fact, I was hearing this so much that I ended up writing a FREE TSQLSolutions article series specifically addressing this. All you need to do is register to read the articles… I think they'll help convince you that views are not inherently evil. :) Part II speaks more to the performance issues.

 

Part 1 – The View from Here, December 2002, InstantDoc #27229

http://www.sqlmag.com/Articles/Index.cfm?ArticleID=27229&

 

Part 2 – Viewer Advisory, February 2003, InstantDoc #37660

http://www.sqlmag.com/Articles/Index.cfm?ArticleID=37660

And one other thing I'll add: OFTEN views can IMPROVE performance – by focusing users on JUST the data they need, by providing a well-written join to handle complex schemas (rather than having users struggle and/or execute poorly written code against the server), and by simplifying permissions. I'm a fan of views – as long as they're used properly!
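For example, a hypothetical view that gives users just the columns they need over a two-table join (the Orders table and its columns are illustrative):

```sql
-- Users query the view instead of writing (and possibly botching) the join;
-- permissions can be granted on the view alone
CREATE VIEW dbo.MemberOrders
AS
SELECT m.LastName, o.OrderDate, o.TotalDue
FROM dbo.Member AS m
JOIN dbo.Orders AS o
    ON o.MemberNo = m.MemberNo
```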

 

Q: Where is the best place to learn about the individual elements of the showplan, for instance Hash?

          AND

Q: Where can we learn EXACTLY what all the various icons mean in the Graphical ShowPlan?  They're explained, sort of, in BOL--it's helpful, but the info is shallow.  I want MORE.  Indeed, I want it ALL!

Another set of questions I can lump into one answer, BUT the bad news is that I'm not aware of any single reference that covers all of these. There are quite a few KB articles that do a great job of focusing on specific elements and explaining them really well; I'd certainly start with those and scan through them (begin with KB Q224587, "Troubleshooting Application Performance with SQL Server").

 

Books like Kalen Delaney's Inside SQL Server 2000 also cover quite a few of them, along with lots of internals and examples of what's going on, BUT there's no single reference. Would be a good idea for a book… Hmmm. If more general "algorithms" background is desired, you could search sites like acm.org for information about hashing algorithms – these are fairly standard. Kalen's book does have good information about the different types of hashing that occur within SQL Server.

 

Q: If I have a table with a few indices, but sysindexes reveals that there are, say, a half-dozen columns with auto-statically generated indices – does that mean I could benefit from more indices?


 

Q: How can we get last week's resource list?

From last week’s Q&A and/or last week’s PowerPoint slides. You should have received information via email.

 

Resources:

Whitepaper: Statistics Used by the Query Optimizer in Microsoft SQL Server 2000  http://msdn.microsoft.com/library/en-us/dnsql2k/html/statquery.asp?frame=true

 

Search SQL Server Magazine’s website http://www.sqlmag.com

 

Wow – lots of questions this week! Thanks for the great questions everyone!!

 

See you in Part III next week,

Kimberly

 

Kimberly L. Tripp

President, SYSolutions, Inc. www.SQLSkills.com

Principal Mentor, Solid Quality Learning www.SolidQualityLearning.com