I’ve been doing a lot of performance tuning work over the last couple of months and this weekend found something that’s very pervasive out in the wild. Kimberly was helping me optimize a gnarly query plan and spotted something in the code I hadn’t noticed that was causing an index scan instead of an index seek. Even though the query plan was using a covering nonclustered index, the way the code was written was causing the index to be scanned.
I’ll set up a test to show you what I mean.
First, create an example sales tracking table and populate it (took 3m45s on my laptop):
-- Create example table
CREATE TABLE [BigTableLotsOfColumns] (
    [SalesID]     BIGINT IDENTITY,
    [SalesDate]   DATETIME,
    [Descr]       VARCHAR (100),
    [CustomerID]  INT,
    [ProductID]   INT,
    [ModifyDate]  DATETIME DEFAULT NULL,
    [PayDate]     DATETIME DEFAULT NULL,
    [Quantity]    INT,
    [Price]       DECIMAL (6,2),
    [Discount]    INT DEFAULT NULL,
    [WarehouseID] INT,
    [PickerID]    INT,
    [ShipperID]   INT);
GO

-- Populate table
SET NOCOUNT ON;
DECLARE @Date DATETIME;
DECLARE @Loop INT;
SET @Loop = 1;
SET @Date = '01/01/2010';

-- Insert 1500 sales per day
WHILE @Loop < 547600
BEGIN
    INSERT INTO [BigTableLotsOfColumns] (
        [SalesDate], [Descr], [CustomerID], [ProductID], [Quantity],
        [Price], [WarehouseID], [PickerID], [ShipperID])
    VALUES (
        @Date,
        'A nice order from someone',
        CONVERT (INT, (RAND () * 1000)),
        CONVERT (INT, (RAND () * 1000)),
        CONVERT (INT, (RAND () * 10)),
        ROUND (RAND () * 1000, 2),
        1,
        CONVERT (INT, (RAND () * 4)),
        CONVERT (INT, (RAND () * 4)));

    IF @Loop % 1500 = 0 SET @Date = @Date + 1;
    SET @Loop = @Loop + 1;
END
GO

-- And a clustered index
CREATE CLUSTERED INDEX [IX_BigTableLotsOfColumns_Clustered]
ON [BigTableLotsOfColumns] ([SalesID]);
GO
Now I’ve got a simple stored procedure to find the total sales amount for any particular date.
CREATE PROCEDURE [TotalSalesForDate] (@TargetDate DATETIME)
AS
SELECT SUM ([Quantity] * [Price])
FROM [BigTableLotsOfColumns]
WHERE DATEDIFF (DAY, [SalesDate], @TargetDate) = 0
OPTION (MAXDOP 1);
GO
The idea is that the stored procedure will pick up all sales for a particular date, no matter what time of that day the sale occurred. I’ve got OPTION (MAXDOP 1) to simulate a system that’s been tuned for OLTP queries (and to make the query plans fit better in my blog post :-)
If I run the stored procedure for, say, December 26th 2010, the query plan it uses is as follows:
EXEC TotalSalesForDate '12/26/2010';
GO
And it uses the following CPU and I/O (you can get these by using SET STATISTICS IO ON and SET STATISTICS TIME ON, but be careful – they can generate lots of output for big procs):
Table 'BigTableLotsOfColumns'. Scan count 1, logical reads 7229, physical reads 0, ...

SQL Server Execution Times: CPU time = 125 ms, elapsed time = 186 ms.
I ran the proc twice; once to get it to compile, and the second time to factor out the compile time.
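In case you haven't used those measurement commands before, here's the pattern I'm using to capture the numbers above (remember to turn them off again when you're done):

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO

EXEC TotalSalesForDate '12/26/2010';
GO

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
GO
```

The I/O and timing figures show up in the Messages tab rather than the results grid.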
Pretty obvious that I can speed this up by creating a covering nonclustered index, and in fact you can see the query processor telling me that with the missing index suggestion. I create the covering nonclustered index:
CREATE NONCLUSTERED INDEX [IX_BigTableLotsOfColumns_Date]
ON [BigTableLotsOfColumns] ([SalesDate])
INCLUDE ([Quantity], [Price]);
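As an aside, if you miss the missing-index hint in the graphical plan, the same suggestions are exposed through the missing-index DMVs. A minimal sketch (column list trimmed down, and remember these DMVs only reflect activity since the last instance restart):

```sql
SELECT
    [mid].[statement] AS [TableName],
    [mid].[equality_columns],
    [mid].[inequality_columns],
    [mid].[included_columns],
    [migs].[user_seeks]
FROM sys.dm_db_missing_index_details AS [mid]
JOIN sys.dm_db_missing_index_groups AS [mig]
    ON [mig].[index_handle] = [mid].[index_handle]
JOIN sys.dm_db_missing_index_group_stats AS [migs]
    ON [migs].[group_handle] = [mig].[index_group_handle];
```

Treat these as suggestions only – the query processor doesn't consolidate overlapping recommendations, so don't blindly create everything it asks for.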
Now when I run the proc, I get a better query plan:
And the following CPU and IO:
Table 'BigTableLotsOfColumns'. Scan count 1, logical reads 2113, physical reads 0, ...

SQL Server Execution Times: CPU time = 78 ms, elapsed time = 115 ms.
Much better, as I’d expect. But look at how the nonclustered index is being used: it’s being scanned when there should be a seek. It’s still doing more than 2,000 logical reads because it’s reading the whole index.
The reason for this is that the DATEDIFF in the proc forces every SalesDate column value to be evaluated, hence the scan. If I know that the proc is only ever called with a date that has no time portion, I can do some rearranging of the logic.
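DATEDIFF isn’t the only construct that behaves this way – as a rule of thumb, wrapping the indexed column in any function or expression hides it from the index and prevents a seek. A couple of common offenders, with hypothetical predicates just for illustration:

```sql
-- Non-seekable: the function is applied to every [SalesDate] value
WHERE YEAR ([SalesDate]) = 2010

-- Seekable rewrite: a range predicate on the bare column
WHERE [SalesDate] >= '01/01/2010' AND [SalesDate] < '01/01/2011'

-- Non-seekable: arithmetic on the column
WHERE [Price] * 1.1 > 100

-- Seekable rewrite: move the computation to the other side
WHERE [Price] > 100 / 1.1
```

The general pattern: keep the indexed column alone on one side of the comparison and push all the computation onto the other side.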
The updated proc is below. Notice that there’s no reason to scan any more, as the WHERE clause doesn’t require every column value to be passed through the DATEDIFF:
CREATE PROCEDURE [TotalSalesForDate] (@TargetDate DATETIME)
AS
SELECT SUM ([Quantity] * [Price])
FROM [BigTableLotsOfColumns]
WHERE [SalesDate] >= @TargetDate
    AND [SalesDate] < DATEADD (DAY, 1, @TargetDate)
OPTION (MAXDOP 1);
GO
And when I run it I get the best plan:
And the following CPU and I/O:
Table 'BigTableLotsOfColumns'. Scan count 1, logical reads 10, physical reads 0, read-ahead reads 0, ...

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 64 ms.
Wow! 200x fewer logical I/Os and not even enough CPU time to register as 1ms.
Summary: although you may have nonclustered indexes that are being used, make sure that they’re being used for seeks as you expect. For small queries that are executed hundreds of times a second, the difference between seeking and scanning can be huge in overall performance.
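If you want to check how your existing indexes are actually being used, sys.dm_db_index_usage_stats breaks usage down into seeks, scans, and lookups. A quick sketch for the current database:

```sql
SELECT
    OBJECT_NAME ([ius].[object_id]) AS [TableName],
    [i].[name] AS [IndexName],
    [ius].[user_seeks],
    [ius].[user_scans],
    [ius].[user_lookups]
FROM sys.dm_db_index_usage_stats AS [ius]
JOIN sys.indexes AS [i]
    ON [i].[object_id] = [ius].[object_id]
    AND [i].[index_id] = [ius].[index_id]
WHERE [ius].[database_id] = DB_ID ()
ORDER BY [ius].[user_scans] DESC;
```

Indexes with high user_scans counts and few user_seeks are good candidates for the kind of investigation in this post. Bear in mind the counters are reset when the instance restarts.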