Earlier this year I was sent an interesting question about why the person was seeing lots of VLFs in the log with status = 2 (which means ‘active’) after clearing (also known as ‘truncating’) the log and log_reuse_wait_desc showed NOTHING.
I did some digging around and all I could find was an old blog post from 2013 that shows the behavior and mentions that this happens with mirroring and Availability Groups. I hadn’t heard of this behavior before but I guessed at the reason, and confirmed with the SQL Server team.
When an AG secondary is going to be added, at that point in time, the maximum LSN (the Log Sequence Number of the most recent log record) present in the restored copy of the database that will be the secondary must be part of the ‘active’ log on the AG primary (i.e. that LSN must be in a VLF on the primary that has status = 2). If that’s not the case, you need to restore another log backup on what will be the new secondary, and try the AG joining process again. Repeat until it works. You can see how for a very busy system, generating lots of log records and with frequent log backups (which clear the log on the primary), catching up the secondary enough to allow it to join the AG might be difficult, or necessitate temporarily stopping log backups on the primary (possibly opening up a window for increased data loss in a disaster).
To make this whole process easier, when a database is an AG primary, when log clearing occurs, the VLFs don’t go to status = 0; they remain ‘active’ with status = 2. So how does this help? Well, the fact that lots more VLFs are ‘active’ on the AG primary means that it’s more likely that the maximum LSN of a new secondary is still part of the ‘active’ log on the primary, and the AG-joining succeeds without having to repeat the restore-retry-the-join over and over.
(Note: the log manager knows that these VLFs are really ‘fake active’ and can reuse them as if they were ‘inactive’ if the log wraps around (see this post for an explanation) so there’s no interruption to regular operations on the AG primary.)
It’s a clever little mechanism that someone thought of to make a DBA’s life a bit easier and AGs less problematic to set up.
And now you know – log clearing won’t *always* set VLF statuses to zero.
14 thoughts on “Lazy log truncation or why VLFs might stay at status 2 after log clearing”
Your take on this would be nice
https://docs.microsoft.com/en-us/azure/sql-database/sql-database-accelerated-database-recovery
thanks
I’ve looked into it a lot but I won’t blog about it until it hits the box product – looks very useful but with trade-offs.
Paul,
Is there a way to tell if the VLF is ‘fake active’?
Thanks in advance,
Frank
You’d need to look at all open transaction begin LSNs, oldest unreplicated txn begin LSN, last log backup LSN, and so on. Not practical and no point as it has no performance impact whatsoever.
Paul,
What is the impact of all that on filestream? My understanding is this will impact deletion of files, as filestream garbage collector will not remove deleted files unless VLF with that operation is marked as inactive. Not clear how to address that…
Thank you,
I believe you’re correct.
Hi Paul,
While I appreciate the the reasons behind this technique, I’m in a position where all my VLFs are in status 2. I know most of them are “fake active” because the active VLF wraps around as you’d expect, and the log neither grows or complains about being full. However the issue is with all VLFs remaining as 2, I can never shrink the transaction log! Is there any known solution to this predicament?
Thanks,
Chris.
I don’t know the answer to that – great question. I will find out and reply, possibly another post too.
Jonathan said this:
“As long as the following conditions are met, the log will shrink:
All secondary replicas are synchronized
The log has been backed up
The replication log reader has processed the log records for CDC/replication if required
All Secondary replicas have processed redo – if a secondary is blocked during redo by a SCH_M lock being blocked by a SCH_S due to readonly workload it can’t clear the log records till processed.
Usually in an AG I do a shrink file to remove the inactive portion at the end, take a log backup to clear anything and get it to wrap around, then another shrink and it works just fine. Sometimes I have to do an extra log backup followed by a shrink to get it to fully reduce. I don’t ever use DBCC LOGINFO for that type of operation.”
This has proved invaluable. My general experience is that there is so much that goes on under the hood and which DBAs have to solve – and yet so often have to resort to discussion forums, blogs and support sites to figure it out. That is a real shame.
Regardless – to add: on SQL Server 2016 having an AG, performing a shrink operation on the log file (where >95% of VLFs showed status 2) led to almost all VLFs being set to zero – which then propagated through to the other replicas’ log files too.
Should select * from sys.dm_db_log_space_usage include fake 2s? I have used_log_space_used_in_percent just continuously clocking upward despite log backups and indications that the AG secondary (readable..) is up-to-date.
I do not know for certain but my expectation is that it will.
Previously I would have waited until there were 0’s at the end of DBCC LOGINFO to indicate it is worth trying to shrink log if there is a large amount of free space.
What is the correct mechanism to understand whether a shrink will be successful?
Just try it and see, over and over?
Yes – there are a bunch of it depends so if you really want to shrink the log and that space isn’t going to be grown again, just try it after making sure there isn’t something obvious holding the log (see log_reuse_wait_desc).