SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.
One of the concepts I find people misunderstand frequently is the recovery interval, either for the server as a whole or the per-database setting that was introduced in SQL Server 2012 for indirect checkpoints.
There are two misconceptions here:
- The recovery interval equals how often a checkpoint will occur
- SQL Server guarantees the recovery interval (i.e. crash recovery for the database will only take the amount of time specified in the recovery interval)
A bit of background: crash recovery has two tasks to perform: replaying log records from committed transactions (called REDO) and removing the effect of log records from uncommitted transactions (called UNDO). REDO only has to occur if there have been committed transactions where the changed data pages have not been written to disk (which is done by periodic checkpoints or a lazy writer background thread if there’s memory pressure on the buffer pool).
The recovery interval specifies an ideal upper bound on how long the REDO portion of crash recovery will take. The length of time REDO takes depends on how many log records need to be replayed from committed transactions. To help REDO stay on track, the recovery interval setting forces a checkpoint to occur when a certain number of log records have been generated (a number calculated using the chosen recovery interval).
So although recovery interval *does* control how often a checkpoint occurs, the recovery interval time does not equal the time between checkpoints.
For instance, if the recovery interval is set to one minute (the default), and a checkpoint occurs *now*, when will the next checkpoint occur? If activity in the database is very infrequent, a checkpoint may not occur for a very long time because the calculated recovery time for that activity will be less than one minute. If activity in the database is extremely high, checkpoints may occur every few seconds. Checkpoint frequency entirely depends on how fast log records are being generated *and* the recovery interval setting.
And of course SQL Server cannot *guarantee* the recovery interval. Imagine the recovery interval is set to one minute, and I start a transaction that updates millions of records over several hours. If SQL Server crashes just before the transaction commits, how long do you think it will take to roll back the millions of updates? Certainly it will take much longer than one minute – the configured recovery interval.
So I can update my definition to be: the recovery interval is therefore the ideal upper bound on how long the REDO portion of crash recovery will take, assuming no long-running transactions.
Hope this helps clear up any misconceptions.
4 thoughts on “SQLskills SQL101: Is the recovery interval guaranteed?”
this is how I always understud the SQL Server recovery process to be working. Yet, I’ve just stumbled upon a new Azure/SQL Server 2019 feature called Accelerated Database Recovery (ADR), where Microsoft compares the new with the current way of database recovery in their documentation. Here, they describe the redo-phase as having to start at the point of the oldest uncommitted transaction, which could mean, having to redo hours worth of transaction data. (or is this just how Azure does recovery?)
That does not make sense to me, as that’s what checkpoints are for, ensuring, that all those informations are already in the data files and that we dont have to redo them.
Can you shed some light on my confusion and explain, what’s what? :)
(I do get how they’re planing to improve the undo-phase, still waiting for some additional technical details here. Especially, how they new Persisted Version Store will be implemented, as it seems crucial to the recovery process. It’s just the description of the redo-phase that confuses me)
Here is a link to the documentation aswell as a link to an interview with one of the developers of ADR, explaining their new technology.
Yeah – well aware of that but I’m not blogging about it or the trade-offs (quite impactful potentially) until 2019 actually ships.
Please it’ll be helpful if you please also write something about Indirect checkpoint
It’s exactly the same argument as for old-style checkpoints – recovery interval is not guaranteed.