This is a case that happened on a client system last year: occasionally, a common query on a tiny table appeared to ‘hang’ and had to be killed and re-run. What was going on?
The table in question had only a few million rows of data, with a maximum row size of 60 bytes, and the query usually ran in a few seconds. Occasionally, though, the query would ‘hang’ and would either be killed or take tens of minutes to complete. Troubleshooting while the problem was happening showed no out-of-the-ordinary waits, no pressure on the server, and a query plan that was essentially the same as when the query ran quickly.
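As a quick illustration of that kind of live check, something along these lines shows what a running query is waiting on (the session_id here is just a placeholder for the session running the slow query):

```sql
-- What is the 'hung' query waiting on right now?
-- (session_id 55 is a placeholder for the session running the slow query)
SELECT
    r.session_id,
    r.status,
    r.command,
    r.wait_type,
    r.wait_time,
    r.last_wait_type,
    r.total_elapsed_time
FROM sys.dm_exec_requests AS r
WHERE r.session_id = 55;
```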
The only noticeable thing was that when the problem occurred, a column statistics update happened as part of query compilation, but with such a tiny table, how could that be the root cause of the issue? Based on the row size and row count, the table should only have taken up about 250MB of disk space, yet even with a statistics sample rate of only 4%, Extended Events showed an auto_stats event taking close to an hour!
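For reference, the effective sample rate and last update time of each statistic on a table can be checked with something like this (dbo.ProblemTable is an invented name standing in for the client's table):

```sql
-- For each statistic on the table: when it was last updated and what
-- fraction of the rows were sampled
SELECT
    s.name AS stats_name,
    sp.last_updated,
    sp.rows,
    sp.rows_sampled,
    CONVERT (decimal (5, 2),
        100.0 * sp.rows_sampled / NULLIF (sp.rows, 0)) AS sample_pct,
    sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties (s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID (N'dbo.ProblemTable');
```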
Further investigation showed that although the table only had a few hundred MB of data in it, it was taking up more than 25GB of disk space! Jonathan and I were stumped as to how that could be the case. True, there had been bugs in earlier versions of SQL Server, such as with LOB data types where only one page from each dedicated extent (of 8 pages) was actually used, but none were currently known, and even something like that wouldn’t account for a table taking up more than 100x the disk space of the data it contained.
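Seeing the data-size vs. disk-space discrepancy is simple enough; a sketch using the same placeholder table name:

```sql
-- Quick look at rows, reserved space, data size, and index size for the table
EXEC sys.sp_spaceused @objname = N'dbo.ProblemTable';
```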
Even more curious, all the space was in a nonclustered index. How could *that* be possible?
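Something like this per-index breakdown shows exactly where the reserved space is (placeholder table name again):

```sql
-- Reserved and used space per index on the table
SELECT
    i.index_id,
    i.name AS index_name,
    SUM (ps.row_count) AS total_rows,
    SUM (ps.reserved_page_count) * 8 / 1024 AS reserved_mb,
    SUM (ps.used_page_count) * 8 / 1024 AS used_mb
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id
    AND i.index_id = ps.index_id
WHERE ps.object_id = OBJECT_ID (N'dbo.ProblemTable')
GROUP BY i.index_id, i.name
ORDER BY reserved_mb DESC;
```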
I suggested looking at what kinds of pages were allocated to the table. Bingo! 25GB of IAM pages (the allocation bitmaps that each map roughly 4GB of a data file) were allocated to that single nonclustered index.
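One way to get that kind of page breakdown on SQL Server 2012 and later is the undocumented sys.dm_db_database_page_allocations DMF (DBCC IND gives similar information on older versions). A sketch, with placeholder names, and bearing in mind that DETAILED mode reads every page so it isn't quick on a large index:

```sql
-- Count the pages allocated to the table, by index and page type
SELECT
    pa.index_id,
    pa.page_type_desc,
    COUNT (*) AS page_count,
    COUNT (*) * 8 / 1024 AS size_mb
FROM sys.dm_db_database_page_allocations (
    DB_ID (), OBJECT_ID (N'dbo.ProblemTable'), NULL, NULL, 'DETAILED') AS pa
WHERE pa.is_allocated = 1
GROUP BY pa.index_id, pa.page_type_desc
ORDER BY page_count DESC;
```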
Further investigation showed that the nonclustered index had thousands of partitions, each with only a handful of index pages but around 1,500 IAM pages. The nonclustered index's leading key was a ROWVERSION column and the churn rate on the table was very high, so each partition's allocations ended up spread across a huge number of 4GB file intervals, each needing its own IAM page. Essentially, the schema and usage pattern were creating a long IAM chain for each partition, with hardly any data.
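A small variation of the same query shows the per-partition picture, filtering to the problem index (the index_id of 2 is a placeholder):

```sql
-- IAM pages vs. everything else, per partition of the nonclustered index
SELECT
    pa.partition_id,
    SUM (CASE WHEN pa.page_type_desc = N'IAM_PAGE' THEN 1 ELSE 0 END) AS iam_pages,
    SUM (CASE WHEN pa.page_type_desc <> N'IAM_PAGE' THEN 1 ELSE 0 END) AS other_pages
FROM sys.dm_db_database_page_allocations (
    DB_ID (), OBJECT_ID (N'dbo.ProblemTable'), 2, NULL, 'DETAILED') AS pa
WHERE pa.is_allocated = 1
GROUP BY pa.partition_id
ORDER BY iam_pages DESC;
```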
In a nutshell, this meant a statistics update on a column covered by the nonclustered index had to read and process 25GB of IAM pages, hunting for allocated extents from which to read the index records needed to build the statistic.
Solution? The initial super-quick fix was to drop and recreate the nonclustered index to remove the long IAM chains (rather than rebuilding the index, which would have had to perform the same laborious task of reading all the IAM pages!), and then to implement regular index maintenance to prevent the IAM chains from growing long again.
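As a sketch (all object names invented for illustration), the quick fix was along these lines; the key point is that the new index gets built by scanning the base table, not by reading the old index's bloated IAM chains:

```sql
-- Drop the bloated nonclustered index and recreate it from scratch,
-- keeping it on the existing partition scheme
DROP INDEX IX_ProblemTable_RowVer ON dbo.ProblemTable;
GO

CREATE NONCLUSTERED INDEX IX_ProblemTable_RowVer
ON dbo.ProblemTable (RowVer)
ON ps_ProblemTable (PartitionCol);
GO
```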