SSMU Web Seminar: Indexing for Performance
Part II: Indexing for SARGs and Joins

Kimberly L. Tripp

 

Q & A

 

Q: If you need more columns out of the table than any of your indexes cover, so you have to do a bookmark lookup, is it still so bad to use SELECT * ?

This is an excellent question! In fact, only the day before this particular web seminar someone asked something similar and even said "don't I get these columns for free?" if a table scan or view access is required anyway… I responded: nothing's free; there's no free lunch. Whenever you ask for something in the select list (or anywhere, for that matter) you are telling SQL Server that it is REQUIRED. There is currently no designation for "return these only if they're not expensive," and regardless, you are potentially wasting bandwidth, so it's always better to limit your column list as much as possible. In fact, the optimizer can make some somewhat surprising choices, such as using multiple indexes together to cover a query – which becomes less likely when the query is unnecessarily wide.

 

What about using "*" in this case: <If Exists(Select * from Users where UserID = X> ?

Another very common question, and it also relates to some "folklore" that seems to be going around… An existence test does NOT return columns; because of this, the * (or a constant, for that matter) is irrelevant. In fact, the ANSI standard defines an existence test as being introduced with *, so not only does it not cause any performance problems, it is the preferred syntax.
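For example, a minimal sketch (assuming a hypothetical Users table):

            IF EXISTS (SELECT * FROM Users WHERE UserID = 42)
                PRINT 'User exists.'

SQL Server stops at the first qualifying row; no columns are ever returned, so the * costs nothing here.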

 

What about using * in COUNT(*)?

Ah – this used to be a big one… People would often put a column inside COUNT() and SQL Server would be forced to read that data, even when the column did not allow NULL values. In SQL Server 2000, however, the optimizer recognizes that COUNT(*) and COUNT(column) on a column that does not allow NULLs are both simply row counts. If you need a row count, the optimizer knows the cheapest place to find it: a leaf-level scan of the smallest nonclustered index. So yes, * is just fine here, and in fact it is preferred.
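For example, against the pubs sample database, both of these are recognized as pure row counts and can be satisfied from the smallest nonclustered index:

            SELECT COUNT(*) FROM titles
            SELECT COUNT(title_id) FROM titles     -- title_id does not allow NULLs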

 

OK – you guys made some good points… * is not ALWAYS evil. In these cases it is just fine. I would also add that it is appropriate when you are selecting from something which already limits the column list (i.e. a view, stored procedure, or table-valued function). I'm sure we could probably come up with a few more… just remember that less is more when it comes to optimization!

 

What if you had individual nonclustered indexes rather than composite indexes?

This is a great question! I've been to a few sites where the indexing strategy is to create a single-column index on every column – YIKES! Without getting into all of the gory details here, I want to stress that wider indexes tend to handle more, and more varied, requests. BUT I would not work so hard to cover queries if all of your queries are highly selective point queries. Put another way: I recommend that you create narrow indexes for highly selective SARGs and slightly wider covering indexes for the low-selectivity queries. If you end up with one index on ColumnA for point queries and another index on ColumnA, ColumnB, and ColumnC for a low-selectivity range query, then I would DEFINITELY drop the single-column index on ColumnA alone. Don't get me wrong – there are queries which could benefit from the narrower index on just ColumnA, BUT if those queries are all point queries then the width of the index will likely not impact their performance at all, and it is definitely not worth maintaining both indexes. So, look at it another way: you won't be adding a lot of new indexes; instead you will be slightly widening a few of the existing ones.
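To make that concrete, here is a hedged sketch (the table name MyTable is hypothetical; the column names come from the scenario above):

            -- The narrow index is redundant once the wider one exists:
            DROP INDEX MyTable.IX_ColumnA
            -- One slightly wider index serves both the point queries on ColumnA
            -- and the low-selectivity range queries:
            CREATE NONCLUSTERED INDEX IX_ColumnA_B_C ON MyTable (ColumnA, ColumnB, ColumnC)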

 

Let me see if I can summarize a bit here.

Let’s look at a few of the ways each of the following indexes can be used:

An index on one column. Remember that the index will always include the clustering key if the table is clustered (which is interesting, in that SQL Server inherently covers more queries than you might expect), but this is still a less useful index. An index on title, which includes title_id because that's the clustering key, is useful for these types of queries:

            SELECT title and/or title_id FROM titles

            SELECT title and/or title_id FROM titles WHERE title LIKE 'Gone with%'

            SELECT title and/or title_id FROM titles WHERE title_id > 100 – SQL Server will use this index when the query is not selective enough to benefit from using the clustered index, because the table is that much wider than this index…

            This index is also beneficial when ONLY these columns are requested in a join, or when a count is requested (although there might be a smaller non-clustered index to cover a count(*)).
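In DDL terms, the index described above is simply (using the pubs sample database):

            CREATE NONCLUSTERED INDEX titles_title ON titles (title)
            -- Because titles is clustered on title_id, the leaf level of this
            -- index carries title_id as well, covering all of the queries above.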

 

An index on multiple columns may make the index only a shade wider (for example, if the column added is reasonably narrow; remember that the data row is the widest structure) and yet cover a much larger number of queries! In fact, the index is useful when you submit ANY combination of these columns, even when you are not joining or performing a SARG on the high-order element (the first column) of the index.
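A quick hedged illustration (again using pubs-style names; the index name is made up):

            CREATE NONCLUSTERED INDEX titles_title_type ON titles (title, type)
            -- Covered even though the SARG is not on the high-order column;
            -- SQL Server can scan this narrow index instead of the whole table:
            SELECT title, type FROM titles WHERE type = 'business'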

 

Can you discuss the downsides of having too many indexes?

This question goes right along with the previous couple… I'm not recommending that you index every column or even cover every query. Instead, I recommend that you minimize the number of narrow indexes you keep (again, unless you have a lot of point queries). Too many indexes can cause blocking problems on volatile columns and add too much overhead, because SQL Server must maintain every index on every INSERT and DELETE, and on every UPDATE that directly affects an indexed column.

 

Say you have a table with columns id (int), startdate, and enddate. How would you define the clustered index if the SARG was SELECT * FROM table WHERE '1/1/2003' BETWEEN startdate AND enddate? You also might query SELECT * FROM table WHERE id = 7. The id column is unique.

This is a tough one! SQL Server MUST do a comparison between two columns, and in fact you're better off writing it as

            WHERE startdate <= value and value <= enddate

And if this query returns a LARGE set then the indexes which would be most beneficial would be

            (startdate, enddate, id)

or         (enddate, startdate, id)

As to which one SQL Server would choose? That depends on how selective the value really is… If more rows start before that date than end after it, then the second index will be more useful than the first. If more rows end after that date than start before it, then the first index will be best. BUT there will be a scan of the entire qualifying range before OR after the seek (depending on which index is chosen), because SQL Server still needs to check the enddate or the startdate, respectively.
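In DDL, the two candidates might look like this (DateRanges is a hypothetical table name):

            -- Best when relatively few rows have startdate <= the search value:
            CREATE NONCLUSTERED INDEX DateRanges_start ON DateRanges (startdate, enddate, id)
            -- Best when relatively few rows have enddate >= the search value:
            CREATE NONCLUSTERED INDEX DateRanges_end ON DateRanges (enddate, startdate, id)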

 

So – as for the real question… how would I define the clustering key? I don't really have enough information yet! More than anything, I'd need to know the general activity on this table as well as its width (meaning: are these the ONLY columns in the table, or are there others?). If the table is quite a bit wider, then I would strongly consider covering; however, that might change if you also tell me that the table is very volatile. If the data is very volatile, then I would tell you to run the low-selectivity queries against an analysis database, leave the data gathering to the OLTP server, and cluster on the ID (again, especially if the table is very wide and volatile). If performance is not significantly affected by these indexes, or the data is not volatile, then I might consider clustering on startdate, enddate, and id – this might be OK because the key is likely to be ever-increasing and may not cause a lot of fragmentation… but I'd also need to know a lot more about the table. I hope this gives you a few ideas!

 

Would you use the combination of multiple indexes to cover a query which has too many columns to be included in a single index?

This is probably the toughest one to try to answer… Covering a query with two indexes is not something I strive for on purpose; more often it falls out as a side benefit of covering a couple of the higher-priority queries. Having said that, if the query performs extremely poorly and it is a predominantly read environment, then an indexed view might be appropriate… a good lead-in for Part III!

 

I sometimes use a construct in an SP such as WHERE (Col1 = @Param1 OR @Param1 IS NULL) to provide for optional parameters to stored procs.  What are the performance ramifications of this?

Without seeing more of the code I'd have to suggest that you avoid this like the plague! OK, I'd need to see more code, BUT after a quick test I think I have a good code sample which shows why it's better to have multiple procedures and limit the flexibility a bit, OR use dynamic string execution… which unfortunately starts a whole separate thread on "SQL injection" attacks and security (the user executing the procedure will need permission to execute the underlying statement directly). Before I show the code, I want to stress that this particular procedure cannot be hacked through its character parameters, because QUOTENAME protects the values coming in.

OK – from this test it is VERY obvious that, of these two procedures, you ONLY want the dynamic string execution version; but I would still argue that more than one procedure is likely to produce better results (and not have the same security issues). Either way, one procedure with all-purpose parameters is definitely NOT a good choice! See the ProcedureParameters.sql script to test this within the credit database. You will need to modify some data to get rows to return from these queries.

 

CREATE PROC GetMemberInfoParam
            @Lastname     varchar(30) = NULL,
            @Firstname    varchar(30) = NULL,
            @member_no    int = NULL
AS
SELECT * FROM member
WHERE (lastname LIKE @lastname OR @lastname IS NULL)
            AND (member_no = @member_no OR @member_no IS NULL)
            AND (firstname LIKE @firstname OR @firstname IS NULL)
go

exec GetMemberInfoParam @Lastname = 'Tripp' with recompile
exec GetMemberInfoParam @Firstname = 'Katie' with recompile
exec GetMemberInfoParam @Member_no = 9912 with recompile
go

CREATE PROC GetMemberInfoParam2
            @Lastname     varchar(30) = NULL,
            @Firstname    varchar(30) = NULL,
            @member_no    int = NULL
AS
IF @LastName IS NULL AND @FirstName IS NULL AND @Member_no IS NULL
BEGIN
            RAISERROR ('You must supply at least one parameter.', 16, -1)
            RETURN    -- without this, execution would fall through to the dynamic string
END

DECLARE @ExecStr              varchar(1000),
            @LastNameStr      varchar(100),
            @FirstNameStr     varchar(100),
            @MemberNoStr      varchar(100)

SELECT @ExecStr = 'SELECT * FROM member WHERE '

-- QUOTENAME doubles any embedded quotes, protecting the character
-- parameters against SQL injection.
IF @LastName IS NOT NULL
            SELECT @LastNameStr = 'lastname LIKE ' + QUOTENAME(@lastname, '''')
IF @FirstName IS NOT NULL
            SELECT @FirstNameStr = 'firstname LIKE ' + QUOTENAME(@firstname, '''')
IF @Member_no IS NOT NULL
            SELECT @MemberNoStr = 'member_no = ' + convert(varchar(11), @member_no)

-- Build the WHERE clause from only the parameters supplied, adding
-- AND between clauses where two adjacent pieces are present.
SELECT @ExecStr = @ExecStr + ISNULL(@LastNameStr, ' ')
            + CASE
                        WHEN @LastNameStr IS NOT NULL AND @FirstNameStr IS NOT NULL
                                    THEN ' AND '
                        ELSE ' '
              END
            + ISNULL(@FirstNameStr, ' ')
            + CASE
                        WHEN (@LastNameStr IS NOT NULL OR @FirstNameStr IS NOT NULL)
                                                AND @MemberNoStr IS NOT NULL
                                    THEN ' AND '
                        ELSE ' '
              END
            + ISNULL(@MemberNoStr, ' ')

EXEC(@ExecStr)
go

exec GetMemberInfoParam2 @Lastname = 'test', @FirstName = 'Katie'
exec GetMemberInfoParam2 @Firstname = 'Katie'
exec GetMemberInfoParam2 @Firstname = 'Katie', @Member_no = 842
exec GetMemberInfoParam2 @Member_no = 9912
exec GetMemberInfoParam2 @Lastname = 'Florini', @Member_no = 9912
go

The first procedure generates a horrible plan (I knew it would be bad, but I'm somewhat stumped by its choice), while the second procedure generates a good plan for each and every execution (as expected). I also spent a few minutes trying to come up with a way for you to copy this code and try it yourself… let me know. I think this is a good example of what you can do!

 

In many cases, I have found in working with large tables it is easier to use temp tables to get the smaller result set before the joins.  Comments?

And you're not alone… this was an extremely common practice in 6.5, especially due to the 16-table join limit; however, it can do more harm than good in SQL Server 7.0 or 2000 (though NOT ALWAYS). So, since you're already doing this, do NOT rewrite your code… yet. Something I would try, which will help you make the final decision about whether to rewrite, is to create views using the same definitions that go into creating the temp tables, and then join the views. If that yields better performance, then stick with it! It's quite possible that SQL Server's join of the views will beat your approach (and the cost of creating all of those temp tables). If it's good news, let me know… ha – let me know either way!
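As a hedged sketch of the test (the table, column, and view names here are hypothetical):

            -- Instead of materializing an intermediate result:
            --     SELECT member_no, charge_amt INTO #BigCharges
            --     FROM charge WHERE charge_amt > 1000
            -- ...try the same definition as a view and join it directly:
            CREATE VIEW BigCharges
            AS
            SELECT member_no, charge_amt FROM charge WHERE charge_amt > 1000
            go
            SELECT m.lastname, bc.charge_amt
            FROM member m JOIN BigCharges bc ON m.member_no = bc.member_no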

 

What would be the index strategy for very large (millions of records), thin (2-3 columns) lookup tables where all/most columns are used in association with the EXISTS clause?

An existence test should be treated just like a join (because that's in fact what it is). Look at the columns on which you are testing existence. Depending on the clustering key and the selectivity of the join, you might consider an index on the column(s) referenced in the EXISTS clause. Since the table is narrow, this is a tough decision; you might be better off clustering for the join. It would be great if the join column were narrow, unique, and ever-increasing (like an identity), but if not, it may not be a problem. Because an existence test only checks for the presence of a row, the clustered index could be on this column (if there are two columns, put the more selective one first), but the table might become significantly fragmented. Use DBCC INDEXDEFRAG to clean up the fragmentation while keeping the table accessible… a good lead-in for Part IV on Index Maintenance!
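For example (a sketch with a hypothetical BadEmailList lookup table and a hypothetical email column on member):

            -- Cluster the narrow lookup table on the column(s) tested by EXISTS:
            CREATE CLUSTERED INDEX BadEmailList_email ON BadEmailList (email)
            go
            SELECT m.member_no, m.lastname
            FROM member m
            WHERE EXISTS (SELECT * FROM BadEmailList b WHERE b.email = m.email)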

 

Can you compare/contrast the indices that BOL says that SQL Server creates for constraints, including unique constraints?

There's quite a bit of confusion around indexes created by constraints (PRIMARY KEY and UNIQUE) vs. indexes created with the CREATE INDEX statement, but in fact they're nearly identical. The difference is that a constraint is a logical concept, enforced through an index; an index is the physical structure that can enforce uniqueness, aid in lookups, etc. In terms of performance, a UNIQUE constraint on a SocialSecurityNumber column will behave exactly the same as an index built with CREATE UNIQUE INDEX; however, I recommend you enforce unique columns with constraints, as SQL Server allows foreign keys to reference them – and that's really the only difference!
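For example, these two statements create physically identical unique indexes; only the constraint can be referenced by a foreign key (Employee is a hypothetical table):

            -- Logical: a constraint, enforced under the covers by a unique index
            ALTER TABLE Employee ADD CONSTRAINT EmployeeSSNUK UNIQUE (SocialSecurityNumber)
            -- Physical: the equivalent index created directly
            CREATE UNIQUE NONCLUSTERED INDEX EmployeeSSN ON Employee (SocialSecurityNumber)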

 

Resources:

Whitepaper: Index Tuning Wizard for Microsoft SQL Server 2000 http://msdn.microsoft.com/library/en-us/dnsql2k/html/itwforsql.asp?frame=true

 

Here’s the general link for all technical resources on msdn: http://msdn.microsoft.com/library/default.asp?url=/nhp/Default.asp?contentid=28000409

 

Whitepaper: Index Defragmentation Best Practices at http://www.microsoft.com/technet/treeview/default.asp?url=/technet/prodtechnol/sql/maintain/Optimize/SS2KIDBP.asp

 

Here’s the general link for all technical resources on TechNet: http://www.microsoft.com/technet/prodtechnol/sql/default.asp?frame=true

 

Webcast: SQL Server 2000 Profiler: What’s new and how to effectively use Profiler!

http://support.microsoft.com/default.aspx?scid=/servicedesks/webcasts/wc111400/wcblurb111400.asp

 

Here’s the general link for all webcasts from Microsoft:

http://support.microsoft.com/default.aspx?PR=pwebcst&FR=0&SD=MSDN&LN=EN-US&CT=SD&SE=NONA

 

Whitepaper: Statistics Used by the Query Optimizer in Microsoft SQL Server 2000  http://msdn.microsoft.com/library/en-us/dnsql2k/html/statquery.asp?frame=true

 

Search SQL Server Magazine’s website http://www.sqlmag.com

 

Thanks for the great questions everyone!

 

See you in Part III,

Kimberly

 

Kimberly L. Tripp

President, SYSolutions, Inc. www.SQLSkills.com

Principal Mentor, Solid Quality Learning www.SolidQualityLearning.com