SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.
One of the concepts I find people misunderstand frequently is the recovery interval, either for the server as a whole or the per-database setting that was introduced in SQL Server 2012 for indirect checkpoints.
There are two misconceptions here:
- The recovery interval equals how often a checkpoint will occur
- SQL Server guarantees the recovery interval (i.e. crash recovery for the database will only take the amount of time specified in the recovery interval)
A bit of background: crash recovery has two tasks to perform: replaying log records from committed transactions (called REDO) and removing the effect of log records from uncommitted transactions (called UNDO). REDO only has to occur if there have been committed transactions where the changed data pages have not been written to disk (which is done by periodic checkpoints or a lazy writer background thread if there’s memory pressure on the buffer pool).
The recovery interval specifies an ideal upper bound on how long the REDO portion of crash recovery will take. The length of time REDO takes depends on how many log records need to be replayed from committed transactions. To help REDO stay on track, the recovery interval setting forces a checkpoint to occur when a certain number of log records have been generated (a number calculated using the chosen recovery interval).
So although recovery interval *does* control how often a checkpoint occurs, the recovery interval time does not equal the time between checkpoints.
For instance, if the recovery interval is set to one minute (the default), and a checkpoint occurs *now*, when will the next checkpoint occur? If activity in the database is very infrequent, a checkpoint may not occur for a very long time because the calculated recovery time for that activity will be less than one minute. If activity in the database is extremely high, checkpoints may occur every few seconds. Checkpoint frequency entirely depends on how fast log records are being generated *and* the recovery interval setting.
And of course SQL Server cannot *guarantee* the recovery interval. Imagine the recovery interval is set to one minute, and I start a transaction that updates millions of records over several hours. If SQL Server crashes just before the transaction commits, how long do you think it will take to roll back the millions of updates? Certainly it will take much longer than one minute – the configured recovery interval.
So I can update my definition to be: the recovery interval is therefore the ideal upper bound on how long the REDO portion of crash recovery will take, assuming no long-running transactions.
Hope this helps clear up any misconceptions.