What is the FCB_REPLICA_SYNC spinlock?

A question came up on the Data Platform MVP email list last night asking what the FCB_REPLICA_SYNC spinlock is. I answered the question and then promised to do a quick blog post, as there’s no information online about it that I could find.

Explanation

In a nutshell, this spinlock is used to synchronize access to the list of pages that are present in a database snapshot, as follows:

  • If a page in a database with one or more database snapshots is being updated, check each snapshot’s list to see if the page is already in the snapshot. If yes, nothing to do. If no, copy the pre-change image of the page into the snapshot.
  • If a query is reading a page in the context of a database snapshot, check the list of pages to see whether to read from the snapshot or the source database.

This synchronization ensures that the correct copy of a page is read by a query using the snapshot, and that updated pages aren’t copied to the snapshot more than once.

The original question was because the person was seeing trillions of spins for the FCB_REPLICA_SYNC spinlock. That’s perfectly normal if there’s at least one database snapshot, a read workload on the snapshot, and a concurrent heavy update workload on the source database.

Example

For example, using our sample SalesDB database (zip file here), I created this query and set it running:

WHILE (1=1)
BEGIN
    UPDATE [SalesDB].[dbo].[Sales] SET [Quantity] = [Quantity] + 1;
END;
GO

Then I took my script to capture spinlock metrics over a period of time (see this post), changed it to capture over 20 seconds, and then ran a DBCC CHECKDB on the SalesDB database, which took 18 seconds.

The spinlock metrics returned were:

Spinlock                  DiffCollisions DiffSpins  SpinsPerCollision DiffSleepTime DiffBackoffs
------------------------- -------------- ---------  ----------------- ------------- ------------
BUF_HASH                  2              500        250               0             0
DBTABLE                   5              1250       250               0             0
FCB_REPLICA_SYNC          5716270        1513329500 264               0             154380
LOCK_HASH                 12             3500       291               0             1
LOGCACHE_ACCESS           6              387        64                0             3
LOGFLUSHQ                 4              75840      18960             0             3
LOGPOOL_HASHBUCKET        15             3750       250               0             0
LOGPOOL_SHAREDCACHEBUFFER 32             8000       250               0             0
LSLIST                    8              2000       250               0             0
SOS_SCHEDULER             3              1114       371               0             0
SOS_TASK                  1              356        356               0             0

You can see that even for a 20-second test, a single DBCC CHECKDB produced 1.5 billion spins on the FCB_REPLICA_SYNC spinlock.

This is perfectly normal.

One of the dangers of looking at spinlock metrics is that the numbers involved can be so high that it’s easy to convince yourself that there’s some kind of problem, especially as there’s so little information available online about what the spinlocks actually mean. The vast majority of the time, there’s no problem, but it takes a lot of internals knowledge to know what’s going on.

About the only thing I’d be concerned about is if there are multiple concurrent snapshots on a database with heavy update workload, as that will cause synchronous writes to all the snapshots when a page in the source database is first updated, slowing down the workload.

Investigation

One thing you can always do if you’re interested in what a specific spinlock means is to investigate with Extended Events. There’s a whitepaper I helped review called Diagnosing and Resolving Spinlock Contention on SQL Server that you can download here. In it there’s an Extended Event session that I use to see where spinlocks backoffs occur.

Here’s the session I used for FCB_REPLICA_SYNC (which maps to the type value of 136 in sys.dm_xe_map_values):

-- Drop the session if it exists. 
IF EXISTS (SELECT * FROM sys.server_event_sessions WHERE [name] = N'WatchSpinlocks')
    DROP EVENT SESSION [WatchSpinlocks] ON SERVER
GO

CREATE EVENT SESSION [WatchSpinlocks] ON SERVER
ADD EVENT [sqlos].[spinlock_backoff]
    (ACTION ([package0].[callstack])
    WHERE [type] = 136) -- FCB_REPLICA_SYNC only
ADD TARGET [package0].[asynchronous_bucketizer] (
    SET filtering_event_name = N'sqlos.spinlock_backoff',
    source_type = 1, -- source_type = 1 is an action
    source = N'package0.callstack') -- bucketize on the callstack
WITH (MAX_MEMORY = 50MB, MAX_DISPATCH_LATENCY = 5 seconds)
GO

-- Start the session
ALTER EVENT SESSION [WatchSpinlocks] ON SERVER STATE = START;
GO

-- TF to allow call stack resolution
DBCC TRACEON (3656, -1);
GO

-- Cause some spinlock backoffs

-- Get the callstacks from the bucketizer target
SELECT
    [event_session_address],
    [target_name],
    [execution_count],
    CAST ([target_data] AS XML)
FROM sys.dm_xe_session_targets [xst]
INNER JOIN sys.dm_xe_sessions [xs]
    ON ([xst].[event_session_address] = [xs].[address])
WHERE [xs].[name] = N'WatchSpinlocks';
GO

-- Stop the event session
ALTER EVENT SESSION [WatchSpinlocks] ON SERVER
STATE = STOP;
GO

You’ll need to download the debug symbols for the build you’re using – see here for my instructions on how to do this.

I started the event session and re-ran the test. A sampling of the call stacks is below, with matching explanations.

Explanation:

  1. Pushing a page into a snapshot just before it gets modified in the source database
  2. Reading a page from a snapshot (in this case, from one of DBCC CHECKDB‘s parallel threads performing readahead)
  3. Pulling a page in to a snapshot while crash recovery is running on the new snapshot to make it a transactionally-consistent view of the source database (in this case, it’s a ‘transient’ database snapshot that DBCC CHECKDB has created)

And plenty more similar call stacks.

Call stack:

  1. XeSosPkg::spinlock_backoff::Publish+0x138
    SpinlockBase::Sleep+0xc5
    SpinlockBase::Backoff+0x145
    Spinlock<136,4,1>::SpinToAcquireWithExponentialBackoff+0x169
    FCBReplicaSync::StartWrite+0x7f
    FCB::CopyPageToReplicas+0x212
    BUF::CopyOnWrite+0x60
    BPool::PrepareToDirty+0x180
    IndexPageRef::Modify+0x146
    BTreeRow::UpdateRecord+0x20ab
    IndexDataSetSession::SetDataInternal+0x9a03
    DatasetSession::SetData+0x16d
    RowsetNewSS::SetData+0x6a
    CValRow::SetDataX+0x63
    CEsExec::GeneralEval4+0xe7
    CQScanUpdateNew::GetRow+0x24b
    CQueryScan::GetRow+0x81
    CXStmtQuery::ErsqExecuteQuery+0x5be
    CXStmtDML::XretDMLExecute+0x31c
    CXStmtDML::XretExecute+0xad
    CMsqlExecContext::ExecuteStmts<0,1>+0x8bd
    CMsqlExecContext::FExecute+0xa68
    CSQLSource::Execute+0x86c
    CStmtPrepQuery::XretPrepQueryExecute+0x464
  2. XeSosPkg::spinlock_backoff::Publish+0x138
    SpinlockBase::Sleep+0xc5
    Spinlock<136,4,1>::SpinToAcquireWithExponentialBackoff+0x169
    FCBReplicaSync::StartRead+0x86
    FCB::ScatterRead+0x1b3
    RecoveryUnit::ScatterRead+0xa9
    BPool::GetFromDisk+0x719
    BPool::ReadAhead+0x7e
    MultiObjectScanner::GetNextPageAndReadAhead+0x38e
    MultiObjectScanner::GetNext+0x98
    MultiObjectScanner::GetNextPageAndBatch+0x2fc
    CheckTables::ProcessNextData+0x1bb
    CheckAggregateSingleInstance::GetNextFact+0x28e
    CTRowsetInstance::FGetNextRow+0x3c
    CUtRowset::GetNextRows+0xa0
    CQScanRmtScanNew::GetRowHelper+0x3b8
    CQScanXProducerNew::GetRowHelper+0x53
    CQScanXProducerNew::GetRow+0x15
    FnProducerOpen+0x57
    FnProducerThread+0x8c3
    SubprocEntrypoint+0xa7f
    SOS_Task::Param::Execute+0x21e
    SOS_Scheduler::RunTask+0xab
    SOS_Scheduler::ProcessTasks+0x279
  3. XeSosPkg::spinlock_backoff::Publish+0x138
    SpinlockBase::Sleep+0xc5
    Spinlock<136,4,1>::SpinToAcquireWithExponentialBackoff+0x169
    FCBReplicaSync::StartWrite+0x7f
    FCB::PullPageToReplica+0x35
    FCB::CopyPageToReplicas+0x12c
    BUF::CopyOnWrite+0x60
    BPool::PrepareToDirty+0x180
    PageRef::ModifyRow+0x24a
    IndexPageRef::Modify+0x19f2
    BTreeRow::UpdateRecord+0x20ab
    IndexDataSetSession::UndoSetData+0x4d9
    XdesRMReadWrite::IndexModify+0x61
    XdesRMReadWrite::UndoPageOperation+0x10da
    XdesRMReadWrite::RollbackToLsn+0x7d6
    RecoveryMgr::UndoRegularXacts+0xb09
    RecoveryMgr::RollbackRemaining+0x137
    RecoveryUnit::DoRollbackRecovery+0x19
    RecoveryUnit::CompleteRecovery+0x6b8
    RecoveryUnit::PhaseStart+0x87
    DBTABLE::StartupPostRecovery+0x4d
    DBTABLE::ReplicaCreateStartup+0x284
    DBMgr::SyncAndLinkReplicaRecoveryPhase+0x787
    DBMgr::CreatePhasedTransientReplica+0x717

Summary

So there you have it. The FCB_REPLICA_SYNC spinlock is to do with database snapshot reads and writes, and high numbers around it are expected with concurrent updates in the source database and reads in the snapshot.

Investigating the proportional fill algorithm

This is something that came up recently on the Microsoft Certified Master DL, and is something I discuss in our IEPTO1 class because of the performance implications of it, so I thought it would make an interesting post.

Allocation Algorithms

The SQL Server Storage Engine (SE) uses two algorithms when allocating extents from files in a filegroup: round robin and proportional fill.

Round robin means that the SE will try to allocate from each file in a filegroup in succession. For instance, for a database with two files in the primary filegroup (with file IDs 1 and 3, as 2 is always the log file), the SE will try to allocate from file 1 then file 3 then file 1 then file 3, and so on.

The twist in this mechanism is that the SE also has to consider how much free space is in each of the files in the filegroup, and allocate more extents from the file(s) with more free space. In other words, the SE will allocate proportionally more frequently from files in a filegroup with more free space. This twist is called proportional fill.

Proportional fill works by assigning a number to each file in the filegroup, called a ‘skip target’. You can think of this as an inverse weighting, where the higher the value is above 1, the more times that file will be skipped when going round the round robin loop. During the round robin, the skip target for a file is examined, and if it’s equal to 1, an allocation takes place. If the skip target is higher than 1, it’s decremented by 1 (to a minimum value of 1), no allocation takes place, and consideration moves to the next file in the filegroup.

(Note that there’s a further twist to this: when the -E startup parameter is used, each file with a skip target of 1 will be used for 64 consecutive extent allocations before the round robin loop progresses. This is documented in Books Online here and is useful for increasing the contiguity of index leaf levels for very large scans – think data warehouses.)

The skip target for each file is the integer result of (number of free extents in file with most free space) / (number of free extents in this file). The files in the filegroup with the least amount of free space will therefore have the highest skip targets, and there has to be at least one file in the filegroup with a skip target of 1, guaranteeing that each time round the round robin loop, at least one extent allocation takes place.

The skip targets are recalculated whenever a file is added to or removed from a filegroup, or at least 8192 extent allocations take place in the filegroup.

Investigating the Skip Targets

There’s an undocumented trace flag, 1165, that lets us see the skip targets whenever they’re recalculated and I believe the trace flag was added in SQL Server 2008. It also requires trace flag 3605 to be enabled to allow the debugging info to be output.

Let’s try it out!

First I’ll turn on the trace flags, cycle the error log, creating a small database, and look in the error log for pertinent information:

DBCC TRACEON (1165, 3605);
GO

EXEC sp_cycle_errorlog;
GO

USE [master];
GO

IF DATABASEPROPERTYEX (N'Company', N'Version') > 0
BEGIN
	ALTER DATABASE [Company] SET SINGLE_USER
		WITH ROLLBACK IMMEDIATE;
	DROP DATABASE [Company];
END
GO

CREATE DATABASE [Company] ON PRIMARY (
    NAME = N'Company_data',
    FILENAME = N'D:\SQLskills\Company_data.mdf',
	SIZE = 5MB,
    FILEGROWTH = 1MB)
LOG ON (
    NAME = N'Company_log',
    FILENAME = N'D:\SQLskills\Company_log.ldf'
);

EXEC xp_readerrorlog;
GO
2016-10-04 11:38:33.830 spid56       Proportional Fill Recalculation Starting for DB Company with m_cAllocs -856331000.
2016-10-04 11:38:33.830 spid56       Proportional Fill Recalculation Completed for DB Company new m_cAllocs 8192, most free file is file 1.
2016-10-04 11:38:33.830 spid56       	File [Company_data] (1) has 44 free extents and skip target of 1. 

The m_cAllocs is the threshold at which the skip targets will be recalculated. In the first line of output, it has a random number as the database has just been created and the counter hasn’t been initialized yet. It’s the name of a class member of the C++ class inside the SE that implements filegroup management.

Now I’ll add another file with the same size:

ALTER DATABASE [Company] ADD FILE (
	NAME = N'SecondFile',
	FILENAME = N'D:\SQLskills\SecondFile.ndf',
	SIZE = 5MB,
    FILEGROWTH = 1MB);
GO

EXEC xp_readerrorlog;
GO
2016-10-04 11:41:27.880 spid56       Proportional Fill Recalculation Starting for DB Company with m_cAllocs 8192.
2016-10-04 11:41:27.880 spid56       Proportional Fill Recalculation Completed for DB Company new m_cAllocs 8192, most free file is file 3.
2016-10-04 11:41:27.880 spid56       	File [Company_data] (1) has 44 free extents and skip target of 1. 
2016-10-04 11:41:27.880 spid56       	File [SecondFile] (3) has 79 free extents and skip target of 1. 

Note that even though the two files have different numbers of extents, the integer result of 79 / 44 is 1, so the skip targets are both set to 1.

Now I’ll add a much larger file:

ALTER DATABASE [Company] ADD FILE (
	NAME = N'ThirdFile',
	FILENAME = N'D:\SQLskills\ThirdFile.ndf',
	SIZE = 250MB,
    FILEGROWTH = 1MB);
GO

EXEC xp_readerrorlog;
GO
2016-10-04 11:44:20.310 spid56       Proportional Fill Recalculation Starting for DB Company with m_cAllocs 8192.
2016-10-04 11:44:20.310 spid56       Proportional Fill Recalculation Completed for DB Company new m_cAllocs 8192, most free file is file 4.
2016-10-04 11:44:20.310 spid56       	File [Company_data] (1) has 44 free extents and skip target of 90. 
2016-10-04 11:44:20.310 spid56       	File [ThirdFile] (4) has 3995 free extents and skip target of 1. 
2016-10-04 11:44:20.310 spid56       	File [SecondFile] (3) has 79 free extents and skip target of 50. 

The file with the most free space is file ID 4, so the skip targets of the other files are set to (file 4’s free extents) / (free extents in the file). For example, the skip target for file 1 becomes the integer result of 3995 / 44 = 90.

Now I’ll create a table that can have only one row per page, and force more than 8192 extent allocations to take place (by inserting more than 8192 x 8 rows, forcing that many pages to be allocated). This will also mean the files will have autogrown and will have roughly equal numbers of free extents.

USE [Company];
GO

CREATE TABLE [BigRows] (
	[c1] INT IDENTITY,
	[c2] CHAR (8000) DEFAULT 'a');
GO

SET NOCOUNT ON;
GO

INSERT INTO [BigRows] DEFAULT VALUES;
GO 70000

EXEC xp_readerrorlog;
GO
2016-10-04 11:55:28.840 spid56       Proportional Fill Recalculation Starting for DB Company with m_cAllocs 8192.
2016-10-04 11:55:28.840 spid56       Proportional Fill Recalculation Completed for DB Company new m_cAllocs 8192, most free file is file 3.
2016-10-04 11:55:28.840 spid56       	File [Company_data] (1) has 0 free extents and skip target of 74. 
2016-10-04 11:55:28.840 spid56       	File [ThirdFile] (4) has 0 free extents and skip target of 74. 
2016-10-04 11:55:28.840 spid56       	File [SecondFile] (3) has 74 free extents and skip target of 1. 

We can see that all the files have filled up and auto grown, and randomly file ID 3 is now the one with the most free space.

Spinlock Contention

The skip targets for the files in a filegroup are protected by the FGCB_PRP_FILL spinlock, so this spinlock has to be acquired for each extent allocation, to determine which file to allocate from next. There’s an exception to this when all the files in a filegroup have roughly the same amount of free space (so they all have a skip target of 1). In that case, there’s no need to acquire the spinlock to check the skip targets.

This means that if you create a filegroup that has file sizes that are different, the odds are that they will auto grow at different times and the skip targets will not all be 1, meaning the spinlock has to be acquired for each extent allocation. Not a huge deal, but it’s still extra CPU cycles and the possibility of spinlock contention occurring (for a database with a lot of insert activity) that you could avoid by making all the files in the filegroup the same size initially.

If you want, you can watch the FGCB_PRP_FILL spinlock (and others) using the code from this blog post.

Performance Implications

So when do you need to care about proportional fill?

One example is when trying to alleviate tempdb allocation bitmap contention. If you have a single tempdb data file, and huge PAGELATCH_UP contention on the first PFS page in that file (from  a workload with many concurrent connections creating and dropping small temp tables), you might decide to add just one more data file to tempdb (which is not the correct solution). If that existing file is very full, and the new file isn’t, the skip target for the old file will be large and the skip target for the new file will be 1. This means that subsequent allocations in tempdb will be from the new file, moving all the PFS contention to the new file and not providing any contention relief at all! I discuss this case in my post on Correctly adding data file to tempdb.

The more common example is where a filegroup is full and someone adds another file to create space. In a similar way to the example above, subsequent allocations will come from the new file, meaning that when it’s time for a checkpoint operation, all the write activity will be on the new file (and it’s location on the I/O subsystem) rather than spread over multiple files (and multiple locations in the I/O subsystem). Depending on the characteristics of the I/O subsystem, this may or may not cause a degradation in performance.

Summary

Proportional fill is an algorithm that it’s worth knowing about, so you don’t inadvertently cause a performance issue, and so that you can recognize a performance issue caused by a misconfiguration of file sizes in a filegroup. I don’t expect you to be using trace flag 1165, but if you’re interested, it’s a way to dig into the internals of the allocation system.

Enjoy!

Capturing spinlock statistics for a period of time

This is a quick post to blog a script that allows spinlock statistics to be captured for a defined period of time (as I need to reference it in my next post). Enjoy!

IF EXISTS (SELECT * FROM [tempdb].[sys].[objects]
	WHERE [name] = N'##TempSpinlockStats1')
	DROP TABLE [##TempSpinlockStats1];

IF EXISTS (SELECT * FROM [tempdb].[sys].[objects]
	WHERE [name] = N'##TempSpinlockStats2')
	DROP TABLE [##TempSpinlockStats2];
GO

-- Baseline
SELECT * INTO [##TempSpinlockStats1]
FROM sys.dm_os_spinlock_stats
WHERE [collisions] > 0
ORDER BY [name];
GO

-- Now wait...
WAITFOR DELAY '00:00:05';
GO

-- Capture updated stats
SELECT * INTO [##TempSpinlockStats2]
FROM sys.dm_os_spinlock_stats
WHERE [collisions] > 0
ORDER BY [name];
GO

-- Diff them
SELECT
		'***' AS [New],
		[ts2].[name] AS [Spinlock],
		[ts2].[collisions] AS [DiffCollisions],
		[ts2].[spins] AS [DiffSpins],
		[ts2].[spins_per_collision] AS [SpinsPerCollision],
		[ts2].[sleep_time] AS [DiffSleepTime],
		[ts2].[backoffs] AS [DiffBackoffs]
	FROM [##TempSpinlockStats2] [ts2]
	LEFT OUTER JOIN [##TempSpinlockStats1] [ts1]
		ON [ts2].[name] = [ts1].[name]
	WHERE [ts1].[name] IS NULL
UNION
SELECT
		'' AS [New],
		[ts2].[name] AS [Spinlock],
		[ts2].[collisions] - [ts1].[collisions] AS [DiffCollisions],
		[ts2].[spins] - [ts1].[spins] AS [DiffSpins],
		CASE ([ts2].[spins] - [ts1].[spins]) WHEN 0 THEN 0
			ELSE ([ts2].[spins] - [ts1].[spins]) /
				([ts2].[collisions] - [ts1].[collisions]) END
				AS [SpinsPerCollision],
		[ts2].[sleep_time] - [ts1].[sleep_time] AS [DiffSleepTime],
		[ts2].[backoffs] - [ts1].[backoffs] AS [DiffBackoffs]
	FROM [##TempSpinlockStats2] [ts2]
	LEFT OUTER JOIN [##TempSpinlockStats1] [ts1]
		ON [ts2].[name] = [ts1].[name]
	WHERE [ts1].[name] IS NOT NULL
	AND [ts2].[collisions] - [ts1].[collisions] > 0
ORDER BY [New] DESC, [Spinlock] ASC;
GO

Example output (trimmed to fit here):

New  Spinlock                DiffCollisions   DiffSpins   SpinsPerCollision DiffSleepTime   DiffBackoffs
---- ----------------------  ---------------- ----------- ----------------- --------------- ------------
     ALLOC_CACHES_HASH       999              257750      258               0               5
     BLOCKER_ENUM            103              27250       264               0               2
     CMED_HASH_SET           286              71500       250               0               0
     COMPPLAN_SKELETON       148              37000       250               0               0
     DBTABLE                 14               3500        250               0               0
     FGCB_PRP_FILL           4                1000        250               0               0
     FREE_SPACE_CACHE_ENTRY  983              255250      259               0               7
     LOGCACHE_ACCESS         3353             314628      93                0               2241
     LOGFLUSHQ               797              206250      258               0               10
     OPT_IDX_STATS           147              36750       250               0               0
     RESQUEUE                23               5750        250               0               0
     SECURITY_CACHE          106              26500       250               0               0
     SOS_CACHESTORE          235              60750       258               0               3
     SOS_OBJECT_STORE        55               13750       250               0               0
     SOS_SCHEDULER           219              54750       250               0               0
     SOS_SUSPEND_QUEUE       72               18000       250               0               0
     SOS_TASK                69               18750       271               0               2
     SQL_MGR                 394              98500       250               0               0
     XDES                    52               13000       250               0               0
     XDESMGR                 840              341500      406               0               16
     XTS_MGR                 165              42750       259               0               2