Last weekend there was a suggestion on the MVP distribution list about having temporary nonclustered indexes on AG readable secondaries, in the same way that you can have temporary statistics on them. I replied that in my opinion it would be extremely difficult to do that, and said I’d blog about why. Here’s my explanation. Note that this isn’t an exhaustive list of all the problems, just the major ones that I see.
Where to store them?
There are several options that spring to mind for how to store these temporary indexes, with varying degrees of difficulty:
- Automatically create a temporary filegroup on the readable secondary
- This option has the problem that the readable secondary is read-only, and adding a filegroup would mean adding entries in a bunch of system tables (including sys.sysfiles1, sys.sysdbfiles, sys.sysbrickfiles, sys.sysprufiles). Even if this problem could be surmounted, there’s still the problem of…
- The readable secondary is read-only, so where to store all the information about the indexes themselves? There are a large number of system tables that have information about an index (including sys.sysallocunits, sys.sysrowsets, sys.sysrscols, sys.sysidxstats, sys.sysiscols). Even if this problem could be surmounted, there’s still the problem of maintaining the index (see below).
- Create an empty filegroup on the primary replica so there’s an empty filegroup to use on the readable secondary
- This only solves 1.a above.
- Store them in tempdb, the same as the temporary statistics
- This solves 1.a and 1.b, but then has the added difficulty of…
- Keeping track of the fact that there are indexes in tempdb that are really for another database, which already happens for temporary statistics, so that’s doable but there’s the much more complex problem of…
- Making the Storage Engine (specifically the Access Methods) get data from the tempdb index instead of an index in the real database. That’s not a code change in the Storage Engine (because the Access Methods just creates and uses a data set over whatever the Query Processor asks it to), but the query plan will have to know that the index it’s referencing which purports to be on a table in the real database is actually in tempdb, so it asks the Storage Engine to read from the correct place. I can imagine this being quite a challenge for things like a key lookup based on a nonclustered index seek/scan.
- This is the only potentially practical solution in my mind for where to create the temporary indexes.
That’s the easier part taken care of.
How to maintain them?
Assuming they are stored as in #3 above, then there’s the really hard problem of how to maintain the indexes, to keep them up-to-date with the underlying table (i.e. when an INSERT, UPDATE, or DELETE happens, make sure that all nonclustered indexes are updated accordingly). Creating the index would be relatively trivial using an online index build-like method, so I won’t go into that.
Back in SQL Server 2000, it would be much easier (although there weren’t any AGs back then :-) because the Storage Engine was responsible for real-time maintenance of nonclustered indexes. Since SQL Server 2005, however, this has been the purview of the Query Processor, so it drives maintaining indexes and the Storage Engine just does what it’s asked (insert a record here, update this column, etc.). There is functionality available to the Storage Engine to ask the Query Processor to manipulate nonclustered indexes without an operation occurring on the table – DBCC CHECKDB uses it for fixing missing or extra records in nonclustered indexes during repair operations.
Here’s the major problem with temporary indexes: the only data coming from the primary replica are physical log records. How to translate those physical log records into nonclustered index maintenance operations?
The only log records that are interesting are those related to the table itself (i.e. changes to the heap or clustered index). But log records are replayed by the recovery portion of the Storage Engine, not the Access Methods, so in the recovery portion, there is no context at all about the table, it’s columns, indexes, and so on. To make this work, the following would be required:
- The recovery portion would have to know that the log records LOP_INSERT_ROWS, LOP_DELETE_ROWS, LOP_MODIFY_ROW, LOP_MODIFY_COLUMNS, plus anything else like truncating the table, changing from a heap to a clustered index and vice versa, for certain allocation units (i.e. those for any table with a temporary nonclustered index) need to be passed to a new piece of code to do the temporary index maintenance.
- The new piece of code would have to know about the new indexes and do the index maintenance. In other words, Storage Engine nonclustered index maintenance like in SQL Server 2000 would have to be built again. There’s a complication with table record modifications, as the LOP_MODIFY_ROW and LOP_MODIFY_COLUMNS log records don’t say which columns are changing – just the offset and length of the change (they’re physical log records remember). The log record could potentially be changed to have a bitmap of column IDs being changed, that’s only present when there’s a temporary index on a readable secondary. More conditional code.
- Another option is for the Query Processor on the primary replica to know that there’s a temporary index and generate special logical operation log records to aid in the maintenance of the index on the readable secondary (in the same way that transactional replication works). I think this would be easier than having to interpret regular log records and figure out what to do.
- All this new code to be executed would potentially slow down the replay of log records, with the slow down increasing with each temporary index that’s added.
To summarize, there are possible solutions to how to store the temporary indexes and how to maintain them, but it’s a lot of tricky work and involves a big change to the recovery code, which is always fraught with danger as it’s such a critical portion of the Storage Engine. And I haven’t touched on partitioning, rebuild/reorganize, and other things you can do with indexes.
IMHO, the easiest solution is to not use temporary indexes and just create the indexes on the primary replica that you want to be there on the readable secondary. This doesn’t require any code changes and is available today. Of course there’s the potential downside of extra space being taken up, extra logging, and extra time to send that log to the secondaries, but I think this is the way to go.
Hope you found this interesting!