People always ask me how I keep up to date with SQL Server and how I could possibly know so many intricate details about SQL Server… Well, I test/play/figure it out until it’s obvious (painfully sometimes, and sometimes it’s not obvious). So, this blog entry is based on a specific question that made me want to see if I could come up with a test and positive proof of patterns and performance when working with large tables and range deletes. I generally recommend that a clustered index be created on a unique, narrow and static column – and best if it’s an identity column – yet there is an exception to this when the table has VERY specific administrative/operational/archival patterns, a.k.a. the “sliding window” scenario. The question starts by asking about my point that “deletes aren’t as big of a deal” in terms of fragmentation. While my point was solely referring to the internal v. external fragmentation issue, it led us to take some more time evaluating delete performance. In talking about delete performance, we arrive at some best practices in index creation as well as horizontal partitioning! (fyi – the scripts used to create/build this scenario are fairly cool in and of themselves – not a bad idea to spend some time reviewing them!)


 


Q: You mention that deletes are not as “big of a deal” since they leave gaps. I had a situation where I had a table with around 50 million rows, about a million rows per day. After about a month and a half, I delete some old ones. I can delete a day’s worth of data (1 million rows) in about 2 minutes with a delete statement, if there are no indexes. If I have a clustered index based on the date, it took about 22 minutes. Inserts were instantaneous, about two pages worth of rows every second, and even updates were quicker. Is there anything obvious that would cause this?


OK, so I need to clarify first. My comment that deletes aren’t as “big of an issue” is specifically regarding fragmentation (not really performance of the actual delete)… A delete (when it’s a range delete – as opposed to a lot of singleton deletes) typically de-allocates the entire page freeing it back to the table (remember, it’s still “reserved” by this table) or if the entire extent is freed then it goes back to the DB. However, remember, this is only when the range is contiguous or grouped, per se. So, looking solely at the result and impact to the base table – we would say that range deletes aren’t that big of a deal because they free entire pages and even entire extents – effectively, the table doesn’t have gaps. OK, so having said that, you make a good point: what about performance?
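To make sure we’re talking about the same thing, the kind of “range” delete I mean is a single statement that removes one contiguous block of rows – something like this (the cutoff date here is purely illustrative):

```sql
-- A range delete: one contiguous chunk of rows comes off the "old" end,
-- so whole pages (and often whole extents) are deallocated rather than
-- leaving little gaps scattered throughout the table.
DELETE Charge
WHERE charge_dt < '20050101'
```

Contrast this with a lot of singleton deletes scattered across the table, which is what actually leaves the “gaps” people worry about.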


 


First, I’m somewhat intrigued by the fact that you say a delete was faster with no indexes than with a clustered index, especially if the clustered index was on date, BUT I have a feeling that there might have been nonclustered indexes as well? (In the upcoming test matrix, you may see some patterns in timing that corroborate this idea.) Without knowing whether you had nonclustered indexes or not, let me at least speculate and test (and hopefully prove) a few points. First, I would ask what the exact definition of the clustered index was – was it just date? Remember, I really feel as though there are three types of tables – tables with the wrong clustered index, tables with NO clustered index (i.e. a HEAP) and tables with the right clustered index. However, the “right” clustered index CAN vary based on the usage patterns of the table. In a table that has range deletes based on dates – as well as an identity column – I often create a clustered index on the composite combination of these two columns. If the clustering key is on the date itself then this should not be that much slower than the heap – in fact, I would have expected it to be faster even though there will be some overhead in maintaining the index. I really think there’s something else there – probably nonclustered indexes, maybe a blocking scenario? So, to help – I think I can prove a few points (in terms of performance) with a script/test to see if I can give you some rough numbers (I’m really, really intrigued to see if I can’t prove my point as well). So, I decided to try an interesting combination – and really test the numbers. My test (albeit not entirely scientific) should show a pattern! First, I’m going to take the 1.6 million row charge table (in my “pumped up” version of the credit database) and I’m going to copy it 10 times into slightly different table structures. Details and the creation script for each table follow:


 


ChargeHeap – A table with NO clustered index and NO nonclustered indexes.

SELECT *
INTO ChargeHeap
FROM Credit.dbo.Charge


 


ChargeCLPKForDelete – Is really NOT that bad of a choice… I typically recommend this choice for everyday tables where there are lots of inserts/updates/deletes. However, with the specific “sliding window” scenario (where data comes in at the end AND is periodically archived off the other end) then I might make a different choice. But – still a good table to create and compare against. This will certainly improve inserts and minimize fragmentation BUT it might not be ideal for range deletes and archiving. This table has a clustered index on the Primary Key; this column is an ever-increasing identity value on charge_no.


 


SELECT *
INTO ChargeCLPKForDelete
FROM Credit.dbo.Charge
go

ALTER TABLE ChargeTesting.dbo.ChargeCLPKForDelete
ADD CONSTRAINT ChargeCLPKForDeletePK
    PRIMARY KEY CLUSTERED (Charge_NO)
go


 


Again, because this table will follow a “sliding window” scenario, I would change the clustering key to include the date and then the ID. I don’t typically choose date (as a single-column primary key) because dates alone are rarely, if ever, unique (for more details on datetime data check out the articles I’ve written on datetime data for SQL Magazine here). Because you already have an identity column in the table (which uniquely identifies every row), there’s no need to add a “uniquifier.” The process of “uniquification” happens internally if the clustering key is defined on a non-unique column (see last month’s webcast if you’re not familiar with this “uniquification” process). In the case where you already have a row identifier, it’s best to use it in the clustering key with the date – especially as it has useful data in it. So – this leads us to the next TWO tables…


 


ChargeCLDateForDelete – Is a clustered index on the date ALONE.


 


SELECT *
INTO ChargeCLDateForDelete
FROM Credit.dbo.Charge
go

CREATE CLUSTERED INDEX ChargeCLDateForDelete
ON ChargeTesting.dbo.ChargeCLDateForDelete (Charge_DT)
go


 


ChargeCLDateForDeleteWithCompPK – Has a composite clustered primary key on the date and charge_no together.


 


SELECT *
INTO ChargeCLDateForDeleteWithCompPK
FROM Credit.dbo.Charge
go

ALTER TABLE ChargeTesting.dbo.ChargeCLDateForDeleteWithCompPK
ADD CONSTRAINT ChargeCLDateForDeleteWithCompPK_PK
    PRIMARY KEY CLUSTERED (Charge_DT, Charge_NO)
go


 


The last base table will have a poor choice of clustering key (for these deletes), as the deletes will not have contiguous “ranges” with which to work.


 


ChargeCLReallyBadForDelete – This clustered index will be on something NOT used at all in this delete. The deletes will result in lots of internal fragmentation. In other words, the table will be larger and “internally” fragmented. The table will essentially look like “swiss cheese” meaning that the table will have lot of little gaps within the pages themselves.


 


SELECT *
INTO ChargeCLReallyBadForDelete
FROM Credit.dbo.Charge
go

CREATE CLUSTERED INDEX ChargeCLReallyBadForDelete
ON ChargeTesting.dbo.ChargeCLReallyBadForDelete (Charge_amt)
go
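If you want to see this “swiss cheese” effect for yourself after running the deletes against this table, a quick way to check internal fragmentation is DBCC SHOWCONTIG – Avg. Page Density is the number to watch:

```sql
-- After the range deletes, an Avg. Page Density (full) well below 100%
-- shows internal fragmentation: lots of little gaps left inside pages.
DBCC SHOWCONTIG ('ChargeCLReallyBadForDelete') WITH TABLERESULTS
```

Compare the density here against ChargeCLDateForDelete after the same delete – the date-clustered table should deallocate whole pages instead.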


 


And what about nonclustered indexes? Finally, we will add five more tables – each of the above – with two nonclustered indexes added, to see what impact they have on overall performance.


 


ChargeHeapWNCIndexes – Is a heap table with two non-clustered indexes.


 


SELECT *
INTO ChargeHeapWNCIndexes
FROM Credit.dbo.Charge
go

CREATE NONCLUSTERED INDEX ChargeHeapWNCIndexesNC2
ON ChargeHeapWNCIndexes (Statement_NO)
go

CREATE NONCLUSTERED INDEX ChargeHeapWNCIndexesNC3
ON ChargeHeapWNCIndexes (Member_no)
go


 


ChargeCLPKForDeleteWNCIndexes – Is clustered on the Primary Key and has two additional non-clustered indexes.


 


SELECT *
INTO ChargeCLPKForDeleteWNCIndexes
FROM Credit.dbo.Charge
go

ALTER TABLE ChargeTesting.dbo.ChargeCLPKForDeleteWNCIndexes
ADD CONSTRAINT ChargeCLPKForDeleteWNCIndexesPK
    PRIMARY KEY CLUSTERED (Charge_NO)
go

CREATE NONCLUSTERED INDEX ChargeCLPKForDeleteWNCIndexesNC2
ON ChargeCLPKForDeleteWNCIndexes (Statement_NO)
go

CREATE NONCLUSTERED INDEX ChargeCLPKForDeleteWNCIndexesNC3
ON ChargeCLPKForDeleteWNCIndexes (Member_no)
go


 


ChargeCLDateForDeleteWNCIndexes – Is clustered on the Date alone and has two additional non-clustered indexes.


 


SELECT *
INTO ChargeCLDateForDeleteWNCIndexes
FROM Credit.dbo.Charge
go

CREATE CLUSTERED INDEX ChargeCLDateForDeleteWNCIndexes
ON ChargeTesting.dbo.ChargeCLDateForDeleteWNCIndexes (Charge_DT)
go

CREATE NONCLUSTERED INDEX ChargeCLDateForDeleteWNCIndexesNC2
ON ChargeCLDateForDeleteWNCIndexes (Statement_NO)
go

CREATE NONCLUSTERED INDEX ChargeCLDateForDeleteWNCIndexesNC3
ON ChargeCLDateForDeleteWNCIndexes (Member_no)
go


 


ChargeCLDateForDeleteWithCompPKWNCIndexes – Is the composite primary key on the date and identity. This table also has the same two non-clustered indexes.


 


SELECT *
INTO ChargeCLDateForDeleteWithCompPKWNCIndexes
FROM Credit.dbo.Charge
go

ALTER TABLE ChargeTesting.dbo.ChargeCLDateForDeleteWithCompPKWNCIndexes
ADD CONSTRAINT ChargeCLDateForDeleteWithCompPKWNCIndexes_PK
    PRIMARY KEY CLUSTERED (Charge_DT, Charge_NO)
go

CREATE NONCLUSTERED INDEX ChargeCLDateForDeleteWithCompPKWNCIndexesNC2
ON ChargeCLDateForDeleteWithCompPKWNCIndexes (Statement_NO)
go

CREATE NONCLUSTERED INDEX ChargeCLDateForDeleteWithCompPKWNCIndexesNC3
ON ChargeCLDateForDeleteWithCompPKWNCIndexes (Member_no)
go


 


ChargeCLReallyBadForDeleteWNCIndexes – Is the poor choice (for this delete) of a clustered index on charge_amt. Again, this table will end up with a lot of internal fragmentation after these deletes (not to mention that inserts will probably cause a lot of fragmentation as well). Over time, this table would probably be one of the worst performers – for almost all DML.


 


SELECT *
INTO dbo.ChargeCLReallyBadForDeleteWNCIndexes
FROM Credit.dbo.Charge
go

CREATE CLUSTERED INDEX ChargeCLReallyBadForDeleteWNCIndexesCL
ON ChargeTesting.dbo.ChargeCLReallyBadForDeleteWNCIndexes (Charge_amt)
go

CREATE NONCLUSTERED INDEX ChargeCLReallyBadForDeleteWNCIndexesNC2
ON dbo.ChargeCLReallyBadForDeleteWNCIndexes (Statement_NO)
go

CREATE NONCLUSTERED INDEX ChargeCLReallyBadForDeleteWNCIndexesNC3
ON dbo.ChargeCLReallyBadForDeleteWNCIndexes (Member_no)
go


 


Well, now that you know the players, we can set up the test. The idea of the test is to take a large chunk of early data “off” the table – presumably to archive it, but I will not address the whole archiving process at this time. Once archived, the data must be removed (for many – just removed – as they can always rely on backups if they need the earlier data… just make sure you keep those backups!)


 


In the tests we will delete a varying percentage of the table based on a specific date for the charge rows. I have modified and increased the size of credit in many ways before getting to these scripts, so it might be hard to reproduce this exactly, BUT there are some great samples to work from here.
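To make the shape of the test concrete, here is a hedged sketch of what one per-table step inside a procedure like DeleteTestRun might look like (the actual proc in the download uses dynamic string execution; @TestRunID and the exact column names are my assumptions):

```sql
-- Time one range delete and log the elapsed milliseconds.
DECLARE @start datetime
SET @start = getdate()

DELETE ChargeHeap
WHERE charge_dt < @DateToDeleteTo

INSERT ChargeTestingResults.dbo.DeleteTestResults
    (TestRunID, TableName, DeleteTime)
VALUES
    (@TestRunID, 'ChargeHeap', DATEDIFF(ms, @start, getdate()))
```

The same pattern repeats for each of the ten table variations, all within the same test run.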


 


First, I increased the number of rows in credit.dbo.charge by just “doubling” the rows:


 


INSERT charge
SELECT c.[member_no] ,
    c.[provider_no] ,
    c.[category_no] ,
    c.[charge_dt] ,
    c.[charge_amt] ,
    c.[statement_no] ,
    c.[charge_code]
FROM charge AS c    -- to go to 200K rows
go
INSERT charge
SELECT c.[member_no] ,
    c.[provider_no] ,
    c.[category_no] ,
    c.[charge_dt] ,
    c.[charge_amt] ,
    c.[statement_no] ,
    c.[charge_code]
FROM charge AS c    -- to go to 400K rows
go


 


 


However, once I got the rows there, I realized that I had a bunch of data that didn’t really follow an increasing ID and increasing date – which isn’t realistic at all… And, of course, I realized this on a version of my database which already had the rows increased and which I always use for all sorts of samples/examples, SO – I’d rather fix the current data than re-write the INSERT/SELECTs. But (as always seems to happen) I found other problems – like a gap in the identity values. Anyway, I decided that the best way to get dates to be ever-increasing along with the identity is to FIRST clean up the identity: drop and re-add the identity column to reseed it, and then update the charge_dt column based on a starting time plus the charge_no. OK, so that’s not exactly what I did, BUT that would work to give you a relatively increasing value – there would be duplicates though, as SQL Server rounds ms values to the nearest timetick. AND – I’m not going to get into this here at all, but I did write a series of articles on datetime data for SQL Server Magazine, see here.
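In case you want to sketch that date fix-up yourself, the idea is simply to derive each charge_dt from a base date plus the identity (this is an approximation of the approach, not the exact script I ran – the base date and multiplier are arbitrary):

```sql
-- Regenerate charge_dt so dates increase along with the identity.
-- Because SQL Server rounds datetime ms values to the nearest
-- 1/300-second timetick, near-neighbor rows can still end up with
-- duplicate dates - which is fine (and realistic) for this test.
UPDATE charge
SET charge_dt = DATEADD(ms, charge_no * 100, '20040101')
```

With 1.6 million rows and 100 ms per row, that spreads the data over roughly two days; scale the multiplier to widen the range.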


 


OK, so crank up your data, make the ever-increasing identities correspond to ever-increasing dates, and then create the duplicate test tables based on the settings and patterns above. You can then create the testing patterns. The tests are going to be based on deleting a percentage of your data (you will need to determine your date ranges and estimate the date that covers 5-10% of your table). 5-10 percent is a reasonable percentage to delete based on an “archiving” principle/idea (in fact, this might be large in many situations). Since the date values are the same for all tables, this will make testing easier. Also, there are a few simple testing practices here – use a separate database in which you track test run numbers, do multiple test runs and take averages (consider removing the highest and lowest values to get rid of any strange anomalies as well). SO, here’s the script (TestingDeletePerformance SetupScript1 ChargeTestingResults.sql (2.49 KB)) that I used to track/store the test results. Note: you may need to change directory/path/instance names. Once the results database (ChargeTestingResults) has been created, you can create your testing database (based on your modified version of Credit.dbo.charge). The setup script to create this database (ChargeTesting), create the testing procedure (DeleteTestRun) and perform a backup (to restore for reproducible test runs) is here (TestingDeletePerformance SetupScript2 ChargeTesting.sql (9.41 KB)).
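If you’re building your own version rather than downloading the script, the tracking table can be as simple as something like this (the column names here are my guesses at what the download contains, based on the analysis query later in this post):

```sql
-- Separate database for results so test-run restores don't wipe them out.
CREATE DATABASE ChargeTestingResults
go
CREATE TABLE ChargeTestingResults.dbo.DeleteTestResults
(
    TestRunID   int      NOT NULL,                    -- one pass over all ten tables
    TableName   sysname  NOT NULL,                    -- which table variation was tested
    DeleteTime  int      NOT NULL,                    -- elapsed ms for the DELETE
    TestRunDate datetime NOT NULL DEFAULT getdate()   -- when the run happened
)
go
```

Keeping the results out of ChargeTesting is what makes the restore-between-runs trick work.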


 


Now, on to the testing…. You should run a series of test runs of slightly varying numbers of rows being deleted and then analyze your findings. I have a feeling that we’ll all get slightly different results on the extreme ends of a few tests BUT if you average the timings over multiple test runs I think you’ll find the same thing I’ve found!


 


So, you’ll need to set your @DateToDeleteTo (based on your dates) and make sure your restore path is correct, but here’s the general idea – this loop runs the test and inputs the timings into your ChargeTestingResults database:


 


DECLARE @DateToDeleteTo datetime,
    @NumberOfTestRuns tinyint,
    @CurrentTestRun tinyint

-----------------------------------------
SET @NumberOfTestRuns = 1
SET @DateToDeleteTo = 'yyyymmdd'
-----------------------------------------

SET @CurrentTestRun = 0

WHILE @CurrentTestRun < @NumberOfTestRuns
BEGIN
    USE ChargeTesting
    EXEC DeleteTestRun @DateToDeleteTo
    USE Master
    RESTORE DATABASE ChargeTesting
        FROM DISK = 'fix path'
    SET @CurrentTestRun = @CurrentTestRun + 1
END
go


 


OK, so now we’re ready to roll! I’ve run this 23 times (yes, thanks to my machine being idle for a few of my sleeping hours – it’s either a great time for SETI or automated tests…)


 


I’ve run the following query:


SELECT TableName, AVG(DeleteTime) AS [Average Time]
    , MIN(DeleteTime) AS [Min Time]
    , MAX(DeleteTime) AS [Max Time]
FROM ChargeTestingResults.dbo.DeleteTestResults
WHERE TestRunID NOT IN
    (SELECT TestRunID
     FROM ChargeTestingResults.dbo.DeleteTestResults AS DTR
     WHERE DeleteTime = (SELECT min(DeleteTime)
                         FROM ChargeTestingResults.dbo.DeleteTestResults AS DTRC -- correlated
                         WHERE DTR.TableName = DTRC.TableName)
        OR DeleteTime = (SELECT max(DeleteTime)
                         FROM ChargeTestingResults.dbo.DeleteTestResults AS DTRC2 -- correlated
                         WHERE DTR.TableName = DTRC2.TableName))
GROUP BY TableName
ORDER BY [Average Time]


 


This query removes the highest and lowest values from the test runs and then averages the final results into this data (times in ms):

TableName                                   Average Time   Min Time   Max Time
ChargeCLDateForDelete                               1566       1143       2903
ChargeCLDateForDeleteWithCompPK                     2359       1253       8080
ChargeHeap                                         10392       9603      11476
ChargeCLPKForDelete                                11377       9453      18476
ChargeCLReallyBadForDelete                         17491      12640      35440
ChargeCLDateForDeleteWithCompPKWNCIndexes          19594       8020      29414
ChargeCLDateForDeleteWNCIndexes                    22467       9243      63520
ChargeCLPKForDeleteWNCIndexes                      30132      17343      59366
ChargeCLReallyBadForDeleteWNCIndexes               44208      25946      62000
ChargeHeapWNCIndexes                               49407      19716      78383


 


Now, I’m sure you’ve completely forgotten what brought us to this result set but it all started when we were concerned about large scale range deletes… I really created this test set for multiple reasons:


1) To give you ideas for tracking test runs (separate database, testing tables, dates/times and interesting data)
2) To show you a few practices for reproducible tests (i.e. backup after setup and then restore before each test run)
3) To show a variety of examples of dynamic string execution
4) To show you an interesting query which removes the best-case and worst-case test runs and then averages the results (fyi – that query is easier to write in SQL Server 2005 with common table expressions!)
5) To prove my original point… (of course)
6) To lead you to wanting to learn more about the BEST point of all… consider partitioning.
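As an aside on that CTE comment: here is roughly what the trim-the-extremes query could look like in SQL Server 2005. This is an untested sketch, and note it trims each table’s single best and worst timing rather than excluding whole test runs as my original query does:

```sql
-- Rank each table's timings from both ends, then drop rank 1 on each end.
WITH Ranked AS
(
    SELECT TableName, DeleteTime,
           ROW_NUMBER() OVER (PARTITION BY TableName
                              ORDER BY DeleteTime ASC)  AS LowRank,
           ROW_NUMBER() OVER (PARTITION BY TableName
                              ORDER BY DeleteTime DESC) AS HighRank
    FROM ChargeTestingResults.dbo.DeleteTestResults
)
SELECT TableName, AVG(DeleteTime) AS [Average Time],
       MIN(DeleteTime) AS [Min Time], MAX(DeleteTime) AS [Max Time]
FROM Ranked
WHERE LowRank > 1 AND HighRank > 1   -- exclude each table's fastest and slowest run
GROUP BY TableName
ORDER BY [Average Time]
```

Much easier to read than nesting correlated subqueries inside a NOT IN!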


 


So, let’s talk about #5 for a second – the original point was that I *think* the reason the deletes were slower with the clustered index is that you also had nonclustered indexes. Hard to say now, but that would be my guess. The MORE interesting point is that the date alone in the clustering key is faster than the composite clustering key. Now, I’m not about to start doing insert testing or space testing here, but having a clustering key which is NOT unique has other problems. And the difference in times for these deletes is small in absolute terms (yes, it’s almost 50% here, but the numbers are VERY small comparatively speaking – OK, we can certainly debate this, BUT are you really ONLY focusing on performance for deletes, or do you need to look at the big picture? When you do, I would guess that INSERTs will be more important overall). But what you really see when you look over all of this is the ultimate problem – NONCLUSTERED INDEXES… Yes, if you have nonclustered indexes, they will kill your delete performance. So – the point here is going to be: how can you get GREAT archiving performance when deleting ranges AS WELL AS great performance for INSERTs AS WELL AS a more manageable environment… partitioning.
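To give you a taste of where this is headed: in SQL Server 2005, the “sliding window” becomes a metadata-only partition switch instead of a row-by-row DELETE. This is a deliberately simplified sketch – the object names are made up, and a real setup needs matching schemas, indexes, and an empty staging table on the same filegroup:

```sql
-- Partition the charge data by month on the date column...
CREATE PARTITION FUNCTION ChargeByMonthPF (datetime)
AS RANGE RIGHT FOR VALUES ('20050101', '20050201', '20050301')
go
CREATE PARTITION SCHEME ChargeByMonthPS
AS PARTITION ChargeByMonthPF ALL TO ([PRIMARY])
go
-- ...then "deleting" the oldest month is just switching its partition
-- out to a staging table: a metadata operation, effectively instant,
-- with no page-by-page deallocation and no nonclustered index churn.
ALTER TABLE ChargePartitioned
    SWITCH PARTITION 1 TO ChargeStaging
go
```

Once switched out, the staging table can be archived or simply dropped – and none of the delete-performance problems above apply.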


 


So – there was another reason for this whole entry… it’s partially to help convince a few of you (especially those of you that have tables in the double-digit or triple-digit millions of rows) that large tables are harder to manage and large archival operations can be EXTREMELY costly!