Lazy log truncation or why VLFs might stay at status 2 after log clearing

Earlier this year I was sent an interesting question: why was the questioner seeing lots of VLFs in the log with status = 2 (which means ‘active’) after clearing (also known as ‘truncating’) the log, when log_reuse_wait_desc showed NOTHING?

I did some digging around and all I could find was an old blog post from 2013 that shows the behavior and mentions that it happens with mirroring and Availability Groups. I hadn’t heard of this behavior before, but I guessed at the reason and confirmed it with the SQL Server team.

When an AG secondary is being added, the maximum LSN (the Log Sequence Number of the most recent log record) present in the restored copy of the database that will become the secondary must, at that point in time, be part of the ‘active’ log on the AG primary (i.e. that LSN must be in a VLF on the primary that has status = 2). If that’s not the case, you need to restore another log backup on what will be the new secondary, and try the AG joining process again. Repeat until it works. You can see how for a very busy system, generating lots of log records and with frequent log backups (which clear the log on the primary), catching up the secondary enough to allow it to join the AG might be difficult, or necessitate temporarily stopping log backups on the primary (possibly opening up a window for increased data loss in a disaster).

To make this whole process easier, when log clearing occurs on a database that is an AG primary, the VLFs don’t go to status = 0; they remain ‘active’ with status = 2. So how does this help? Well, with many more VLFs remaining ‘active’ on the AG primary, it’s more likely that the maximum LSN of a new secondary is still part of the ‘active’ log on the primary, and the AG join succeeds without having to repeat the restore-then-retry-the-join process over and over.
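You can see this for yourself on an AG primary by taking a log backup and then looking at the VLF statuses, which mostly stay at 2. Here’s a minimal sketch, assuming a placeholder database name of MyAGDatabase and SQL Server 2016 SP2 or later, where sys.dm_db_log_info is available (on older builds DBCC LOGINFO shows the same Status column):

-- Clear the log on the AG primary with a log backup
BACKUP LOG [MyAGDatabase] TO DISK = N'D:\Backups\MyAGDatabase_log.trn';
GO

-- Check the VLF statuses - on an AG primary most of them will still
-- show vlf_status = 2 even though the log has just been cleared
SELECT [vlf_sequence_number], [vlf_size_mb], [vlf_status]
FROM sys.dm_db_log_info (DB_ID (N'MyAGDatabase'));
GO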

(Note: the log manager knows that these VLFs are really ‘fake active’ and can reuse them as if they were ‘inactive’ if the log wraps around (see this post for an explanation) so there’s no interruption to regular operations on the AG primary.)

It’s a clever little mechanism that someone thought of to make a DBA’s life a bit easier and AGs less problematic to set up.

And now you know – log clearing won’t *always* set VLF statuses to zero.

New VLF status value

At least since I started working on the SQL Server team (just after 7.0 shipped), there have only been two VLF status codes:

  • 0 = the VLF is not active (i.e. it can be (re)activated and overwritten)
  • (1 = not used and no-one seems to remember what it used to mean)
  • 2 = the VLF is active because at least one log record in it is ‘required’ by SQL Server for some reason (e.g. hasn’t been backed up by a log backup or scanned by replication)

A few weeks ago I learned about a new VLF status code that was added back in SQL Server 2012 but hadn’t come to light until recently (at least I’d never encountered it in the wild). I went back and forth with a friend from Microsoft (Sean Gallardy, a PFE and MCM down in Tampa) who was able to dig around in the code to figure out when it’s used.

It can show up only on an Availability Group secondary replica (not on the primary), and it isn’t used in Database Mirroring:

  • 4 = the VLF exists on the primary replica but doesn’t really exist yet on this secondary replica

The main case where this status can happen is when a log file growth (or creation of an extra log file) has occurred on the primary replica but it hasn’t yet been replayed on the secondary replica. More log records are generated on the primary, written to a newly-created VLF, and sent across to the secondary replica to be hardened (written to the replica’s log file). If the secondary hasn’t yet replayed the log growth, the VLFs that should contain the just-received log records don’t exist yet, so they’re kind of temporarily created with a status of 4 so that the log records from the primary can be hardened successfully on the secondary. Eventually the log growth is replayed on the secondary and the temporary VLFs are fixed up correctly and change to a status of 2.
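If you’re curious whether any of these temporary VLFs exist at a given moment, a quick check on a readable secondary could look like this (a sketch, again assuming SQL Server 2016 SP2 or later for sys.dm_db_log_info and a placeholder database name):

-- Run on the secondary replica: list any VLFs that don't really exist yet
SELECT [vlf_sequence_number], [vlf_begin_offset], [vlf_size_mb], [vlf_status]
FROM sys.dm_db_log_info (DB_ID (N'MyAGDatabase'))
WHERE [vlf_status] = 4;
GO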

It makes perfect sense that a scheme like this exists, but I’d never really thought about this case or experimented to see what happens.

Anyway, now you know what status 4 means if you ever see it (and thanks, Sean!).

What is the FCB_REPLICA_SYNC spinlock?

A question came up on the Data Platform MVP email list last night asking what the FCB_REPLICA_SYNC spinlock is. I answered the question and then promised to do a quick blog post, as there’s no information online about it that I could find.

Explanation

In a nutshell, this spinlock is used to synchronize access to the list of pages that are present in a database snapshot, as follows:

  • If a page in a database with one or more database snapshots is being updated, check each snapshot’s list to see if the page is already in the snapshot. If yes, nothing to do. If no, copy the pre-change image of the page into the snapshot.
  • If a query is reading a page in the context of a database snapshot, check the list of pages to see whether to read from the snapshot or the source database.

This synchronization ensures that the correct copy of a page is read by a query using the snapshot, and that updated pages aren’t copied to the snapshot more than once.
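This only comes into play when the source database has at least one database snapshot. For reference, here’s a minimal sketch of creating one; the logical data file name SalesDBData and the snapshot file path are assumptions, so check sys.database_files in the source database for the real logical name:

-- Create a database snapshot of SalesDB (adjust the logical file name
-- and snapshot file path for your system)
CREATE DATABASE [SalesDB_Snapshot] ON
    (NAME = N'SalesDBData', FILENAME = N'D:\Snapshots\SalesDB_Snapshot.ss')
AS SNAPSHOT OF [SalesDB];
GO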

The original question came up because the person asking was seeing trillions of spins for the FCB_REPLICA_SYNC spinlock. That’s perfectly normal if there’s at least one database snapshot, a read workload on the snapshot, and a concurrent heavy update workload on the source database.
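If you want to check the cumulative numbers on your own system (the counters accumulate since the last server restart, or since they were last cleared), they’re exposed in sys.dm_os_spinlock_stats:

-- Cumulative spins, collisions, and backoffs for FCB_REPLICA_SYNC
SELECT [name], [collisions], [spins], [spins_per_collision], [backoffs]
FROM sys.dm_os_spinlock_stats
WHERE [name] = N'FCB_REPLICA_SYNC';
GO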

Example

Using our sample SalesDB database (zip file here), I created this query and set it running:

WHILE (1=1)
BEGIN
    UPDATE [SalesDB].[dbo].[Sales] SET [Quantity] = [Quantity] + 1;
END;
GO

Then I took my script to capture spinlock metrics over a period of time (see this post), changed it to capture over 20 seconds, and then ran a DBCC CHECKDB on the SalesDB database, which took 18 seconds.
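The idea behind such a capture script is simple: snapshot the cumulative counters in sys.dm_os_spinlock_stats, wait, and diff. Here’s a minimal sketch of that pattern over 20 seconds (a simplified stand-in for the script in the post linked above; the column names mirror the output below):

-- Capture a baseline of the cumulative spinlock counters
SELECT [name], [collisions], [spins], [sleep_time], [backoffs]
INTO #SpinlockStats
FROM sys.dm_os_spinlock_stats;

-- Let the workload run for 20 seconds
WAITFOR DELAY N'00:00:20';

-- Diff the current counters against the baseline
SELECT
    [end].[name] AS [Spinlock],
    [end].[collisions] - [start].[collisions] AS [DiffCollisions],
    [end].[spins] - [start].[spins] AS [DiffSpins],
    ([end].[spins] - [start].[spins])
        / NULLIF ([end].[collisions] - [start].[collisions], 0) AS [SpinsPerCollision],
    [end].[sleep_time] - [start].[sleep_time] AS [DiffSleepTime],
    [end].[backoffs] - [start].[backoffs] AS [DiffBackoffs]
FROM sys.dm_os_spinlock_stats AS [end]
INNER JOIN #SpinlockStats AS [start]
    ON [start].[name] = [end].[name]
WHERE [end].[spins] - [start].[spins] > 0
ORDER BY [DiffSpins] DESC;

DROP TABLE #SpinlockStats;
GO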

The spinlock metrics returned were:

Spinlock                  DiffCollisions DiffSpins  SpinsPerCollision DiffSleepTime DiffBackoffs
------------------------- -------------- ---------  ----------------- ------------- ------------
BUF_HASH                  2              500        250               0             0
DBTABLE                   5              1250       250               0             0
FCB_REPLICA_SYNC          5716270        1513329500 264               0             154380
LOCK_HASH                 12             3500       291               0             1
LOGCACHE_ACCESS           6              387        64                0             3
LOGFLUSHQ                 4              75840      18960             0             3
LOGPOOL_HASHBUCKET        15             3750       250               0             0
LOGPOOL_SHAREDCACHEBUFFER 32             8000       250               0             0
LSLIST                    8              2000       250               0             0
SOS_SCHEDULER             3              1114       371               0             0
SOS_TASK                  1              356        356               0             0

You can see that even for a 20-second test, a single DBCC CHECKDB produced 1.5 billion spins on the FCB_REPLICA_SYNC spinlock.

This is perfectly normal.

One of the dangers of looking at spinlock metrics is that the numbers involved can be so high that it’s easy to convince yourself that there’s some kind of problem, especially as there’s so little information available online about what the spinlocks actually mean. The vast majority of the time, there’s no problem, but it takes a lot of internals knowledge to know what’s going on.

About the only thing I’d be concerned about is if there are multiple concurrent snapshots on a database with heavy update workload, as that will cause synchronous writes to all the snapshots when a page in the source database is first updated, slowing down the workload.

Investigation

One thing you can always do if you’re interested in what a specific spinlock means is to investigate with Extended Events. There’s a whitepaper I helped review called Diagnosing and Resolving Spinlock Contention on SQL Server that you can download here. In it there’s an Extended Events session that I use to see where spinlock backoffs occur.

Here’s the session I used for FCB_REPLICA_SYNC (which maps to the type value of 136 in sys.dm_xe_map_values):

-- Drop the session if it exists. 
IF EXISTS (SELECT * FROM sys.server_event_sessions WHERE [name] = N'WatchSpinlocks')
    DROP EVENT SESSION [WatchSpinlocks] ON SERVER
GO

CREATE EVENT SESSION [WatchSpinlocks] ON SERVER
ADD EVENT [sqlos].[spinlock_backoff]
    (ACTION ([package0].[callstack])
    WHERE [type] = 136) -- FCB_REPLICA_SYNC only
ADD TARGET [package0].[asynchronous_bucketizer] (
    SET filtering_event_name = N'sqlos.spinlock_backoff',
    source_type = 1, -- source_type = 1 is an action
    source = N'package0.callstack') -- bucketize on the callstack
WITH (MAX_MEMORY = 50MB, MAX_DISPATCH_LATENCY = 5 seconds)
GO

-- Start the session
ALTER EVENT SESSION [WatchSpinlocks] ON SERVER STATE = START;
GO

-- TF to allow call stack resolution
DBCC TRACEON (3656, -1);
GO

-- Cause some spinlock backoffs

-- Get the callstacks from the bucketizer target
SELECT
    [event_session_address],
    [target_name],
    [execution_count],
    CAST ([target_data] AS XML)
FROM sys.dm_xe_session_targets [xst]
INNER JOIN sys.dm_xe_sessions [xs]
    ON ([xst].[event_session_address] = [xs].[address])
WHERE [xs].[name] = N'WatchSpinlocks';
GO

-- Stop the event session
ALTER EVENT SESSION [WatchSpinlocks] ON SERVER
STATE = STOP;
GO
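Two notes on the session. First, the hard-coded 136 is the map_key value for FCB_REPLICA_SYNC on the build I was using, and spinlock type values can change from build to build, so it’s worth looking the value up rather than trusting my number; a lookup sketch is below (assuming the map is still named spinlock_types). Second, on SQL Server 2012 and later the bucketizer targets were replaced by the [package0].[histogram] target, so you may need to adjust the ADD TARGET clause accordingly.

-- Find the spinlock type value for FCB_REPLICA_SYNC on this build
SELECT [map_key], [map_value]
FROM sys.dm_xe_map_values
WHERE [name] = N'spinlock_types'
    AND [map_value] = N'FCB_REPLICA_SYNC';
GO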

You’ll need to download the debug symbols for the build you’re using – see here for my instructions on how to do this.

I started the event session and re-ran the test. A sampling of the call stacks is below; the numbered explanations match the numbered call stacks.

Explanation:

  1. Pushing a page into a snapshot just before it gets modified in the source database
  2. Reading a page from a snapshot (in this case, from one of DBCC CHECKDB‘s parallel threads performing readahead)
  3. Pulling a page in to a snapshot while crash recovery is running on the new snapshot to make it a transactionally-consistent view of the source database (in this case, it’s a ‘transient’ database snapshot that DBCC CHECKDB has created)

There were plenty more similar call stacks; the three below are representative.

Call stacks:

  1. XeSosPkg::spinlock_backoff::Publish+0x138
    SpinlockBase::Sleep+0xc5
    SpinlockBase::Backoff+0x145
    Spinlock<136,4,1>::SpinToAcquireWithExponentialBackoff+0x169
    FCBReplicaSync::StartWrite+0x7f
    FCB::CopyPageToReplicas+0x212
    BUF::CopyOnWrite+0x60
    BPool::PrepareToDirty+0x180
    IndexPageRef::Modify+0x146
    BTreeRow::UpdateRecord+0x20ab
    IndexDataSetSession::SetDataInternal+0x9a03
    DatasetSession::SetData+0x16d
    RowsetNewSS::SetData+0x6a
    CValRow::SetDataX+0x63
    CEsExec::GeneralEval4+0xe7
    CQScanUpdateNew::GetRow+0x24b
    CQueryScan::GetRow+0x81
    CXStmtQuery::ErsqExecuteQuery+0x5be
    CXStmtDML::XretDMLExecute+0x31c
    CXStmtDML::XretExecute+0xad
    CMsqlExecContext::ExecuteStmts<0,1>+0x8bd
    CMsqlExecContext::FExecute+0xa68
    CSQLSource::Execute+0x86c
    CStmtPrepQuery::XretPrepQueryExecute+0x464
  2. XeSosPkg::spinlock_backoff::Publish+0x138
    SpinlockBase::Sleep+0xc5
    Spinlock<136,4,1>::SpinToAcquireWithExponentialBackoff+0x169
    FCBReplicaSync::StartRead+0x86
    FCB::ScatterRead+0x1b3
    RecoveryUnit::ScatterRead+0xa9
    BPool::GetFromDisk+0x719
    BPool::ReadAhead+0x7e
    MultiObjectScanner::GetNextPageAndReadAhead+0x38e
    MultiObjectScanner::GetNext+0x98
    MultiObjectScanner::GetNextPageAndBatch+0x2fc
    CheckTables::ProcessNextData+0x1bb
    CheckAggregateSingleInstance::GetNextFact+0x28e
    CTRowsetInstance::FGetNextRow+0x3c
    CUtRowset::GetNextRows+0xa0
    CQScanRmtScanNew::GetRowHelper+0x3b8
    CQScanXProducerNew::GetRowHelper+0x53
    CQScanXProducerNew::GetRow+0x15
    FnProducerOpen+0x57
    FnProducerThread+0x8c3
    SubprocEntrypoint+0xa7f
    SOS_Task::Param::Execute+0x21e
    SOS_Scheduler::RunTask+0xab
    SOS_Scheduler::ProcessTasks+0x279
  3. XeSosPkg::spinlock_backoff::Publish+0x138
    SpinlockBase::Sleep+0xc5
    Spinlock<136,4,1>::SpinToAcquireWithExponentialBackoff+0x169
    FCBReplicaSync::StartWrite+0x7f
    FCB::PullPageToReplica+0x35
    FCB::CopyPageToReplicas+0x12c
    BUF::CopyOnWrite+0x60
    BPool::PrepareToDirty+0x180
    PageRef::ModifyRow+0x24a
    IndexPageRef::Modify+0x19f2
    BTreeRow::UpdateRecord+0x20ab
    IndexDataSetSession::UndoSetData+0x4d9
    XdesRMReadWrite::IndexModify+0x61
    XdesRMReadWrite::UndoPageOperation+0x10da
    XdesRMReadWrite::RollbackToLsn+0x7d6
    RecoveryMgr::UndoRegularXacts+0xb09
    RecoveryMgr::RollbackRemaining+0x137
    RecoveryUnit::DoRollbackRecovery+0x19
    RecoveryUnit::CompleteRecovery+0x6b8
    RecoveryUnit::PhaseStart+0x87
    DBTABLE::StartupPostRecovery+0x4d
    DBTABLE::ReplicaCreateStartup+0x284
    DBMgr::SyncAndLinkReplicaRecoveryPhase+0x787
    DBMgr::CreatePhasedTransientReplica+0x717

Summary

So there you have it. The FCB_REPLICA_SYNC spinlock has to do with database snapshot reads and writes, and high numbers for it are expected when there are concurrent updates in the source database and reads from the snapshot.