This came up recently on the Microsoft Certified Master DL, and it’s something I discuss in our IEPTO1 class because of its performance implications, so I thought it would make an interesting post.
Allocation Algorithms
The SQL Server Storage Engine (SE) uses two algorithms when allocating extents from files in a filegroup: round robin and proportional fill.
Round robin means that the SE will try to allocate from each file in a filegroup in succession. For instance, for a database with two files in the primary filegroup (with file IDs 1 and 3, as 2 is always the log file), the SE will try to allocate from file 1 then file 3 then file 1 then file 3, and so on.
The twist in this mechanism is that the SE also has to consider how much free space is in each of the files in the filegroup, and allocate more extents from the file(s) with more free space. In other words, the SE will allocate proportionally more frequently from files in a filegroup with more free space. This twist is called proportional fill.
Proportional fill works by assigning a number to each file in the filegroup, called a ‘skip target’. You can think of this as an inverse weighting, where the higher the value is above 1, the more times that file will be skipped when going round the round robin loop. During the round robin, the skip target for a file is examined, and if it’s equal to 1, an allocation takes place. If the skip target is higher than 1, it’s decremented by 1 (to a minimum value of 1), no allocation takes place, and consideration moves to the next file in the filegroup.
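The loop described above can be sketched in a few lines of Python. This is a simplified model, not the Storage Engine's code: `simulate_round_robin` and its parameters are hypothetical names. It also makes one assumption that the description doesn't cover: once a file with a skip target above 1 has counted down to 1 and allocated, its countdown restarts from the original skip target.

```python
def simulate_round_robin(skip_targets, num_allocations):
    """Model the round robin loop and count extent allocations per file.

    skip_targets maps file ID -> skip target (an integer >= 1).
    Hypothetical model only; not the Storage Engine's implementation.
    """
    if not skip_targets:
        raise ValueError("at least one file is required")
    if any(target < 1 for target in skip_targets.values()):
        raise ValueError("skip targets must be >= 1")

    countdown = dict(skip_targets)        # working copy, decremented per pass
    allocations = {file_id: 0 for file_id in skip_targets}
    done = 0

    while done < num_allocations:
        for file_id in skip_targets:      # one pass of the round robin loop
            if done == num_allocations:
                break
            if countdown[file_id] == 1:
                allocations[file_id] += 1  # allocate one extent from this file
                done += 1
                # ASSUMPTION: the countdown restarts from the skip target.
                countdown[file_id] = skip_targets[file_id]
            else:
                countdown[file_id] -= 1    # skip this file on this pass
    return allocations


# File 1 has a skip target of 3, file 4 has a skip target of 1.
print(simulate_round_robin({1: 3, 4: 1}, 8))  # {1: 2, 4: 6}
```

Under that assumption, a file with a skip target of 3 gets one allocation for every three that go to a file with a skip target of 1.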
(Note that there’s a further twist to this: when the -E startup parameter is used, each file with a skip target of 1 will be used for 64 consecutive extent allocations before the round robin loop progresses. This is documented in Books Online and is useful for increasing the contiguity of index leaf levels for very large scans – think data warehouses.)
The skip target for each file is the integer result of (number of free extents in file with most free space) / (number of free extents in this file). The files in the filegroup with the least amount of free space will therefore have the highest skip targets, and there has to be at least one file in the filegroup with a skip target of 1, guaranteeing that each time round the round robin loop, at least one extent allocation takes place.
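As a minimal sketch of that calculation, the function below applies integer division to a set of free-extent counts. The function name, the file IDs, and the extent counts are made up for illustration. The handling of a file with zero free extents (treating it as having one) is an assumption made to avoid dividing by zero, not documented behavior.

```python
def skip_targets(free_extents):
    """Return {file ID: skip target} from {file ID: free extent count}."""
    most_free = max(free_extents.values())
    targets = {}
    for file_id, free in free_extents.items():
        # ASSUMPTION: an empty file is treated as having one free extent,
        # purely to avoid a division by zero in this sketch.
        targets[file_id] = max(1, most_free // max(free, 1))
    return targets


# Hypothetical filegroup: file 1 has the most free space.
print(skip_targets({1: 100, 3: 25, 4: 10}))  # {1: 1, 3: 4, 4: 10}
```

The file with the most free space always divides by itself, which is why at least one skip target is 1.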
The skip targets are recalculated whenever a file is added to or removed from a filegroup, or when at least 8192 extent allocations have taken place in the filegroup.
Investigating the Skip Targets
There’s an undocumented trace flag, 1165, that lets us see the skip targets whenever they’re recalculated. I believe the trace flag was added in SQL Server 2008. It also requires trace flag 3605 to be enabled to allow the debugging info to be output.
Let’s try it out!
First I’ll turn on the trace flags, cycle the error log, create a small database, and look in the error log for pertinent information:
DBCC TRACEON (1165, 3605);
GO

EXEC sp_cycle_errorlog;
GO

USE [master];
GO

IF DATABASEPROPERTYEX (N'Company', N'Version') > 0
BEGIN
    ALTER DATABASE [Company] SET SINGLE_USER
        WITH ROLLBACK IMMEDIATE;
    DROP DATABASE [Company];
END
GO

CREATE DATABASE [Company] ON PRIMARY (
    NAME = N'Company_data',
    FILENAME = N'D:\SQLskills\Company_data.mdf',
    SIZE = 5MB,
    FILEGROWTH = 1MB)
LOG ON (
    NAME = N'Company_log',
    FILENAME = N'D:\SQLskills\Company_log.ldf');

EXEC xp_readerrorlog;
GO
2016-10-04 11:38:33.830 spid56 Proportional Fill Recalculation Starting for DB Company with m_cAllocs -856331000.
2016-10-04 11:38:33.830 spid56 Proportional Fill Recalculation Completed for DB Company new m_cAllocs 8192, most free file is file 1.
2016-10-04 11:38:33.830 spid56 File [Company_data] (1) has 44 free extents and skip target of 1.
m_cAllocs is the name of a class member of the C++ class inside the SE that implements filegroup management, and its value is the threshold at which the skip targets will be recalculated. In the first line of output it has a random value because the database has just been created and the counter hasn’t been initialized yet.
Now I’ll add another file with the same size:
ALTER DATABASE [Company] ADD FILE (
    NAME = N'SecondFile',
    FILENAME = N'D:\SQLskills\SecondFile.ndf',
    SIZE = 5MB,
    FILEGROWTH = 1MB);
GO

EXEC xp_readerrorlog;
GO
2016-10-04 11:41:27.880 spid56 Proportional Fill Recalculation Starting for DB Company with m_cAllocs 8192.
2016-10-04 11:41:27.880 spid56 Proportional Fill Recalculation Completed for DB Company new m_cAllocs 8192, most free file is file 3.
2016-10-04 11:41:27.880 spid56 File [Company_data] (1) has 44 free extents and skip target of 1.
2016-10-04 11:41:27.880 spid56 File [SecondFile] (3) has 79 free extents and skip target of 1.
Note that even though the two files have different numbers of free extents, the integer result of 79 / 44 is 1, so the skip targets are both set to 1.
Now I’ll add a much larger file:
ALTER DATABASE [Company] ADD FILE (
    NAME = N'ThirdFile',
    FILENAME = N'D:\SQLskills\ThirdFile.ndf',
    SIZE = 250MB,
    FILEGROWTH = 1MB);
GO

EXEC xp_readerrorlog;
GO
2016-10-04 11:44:20.310 spid56 Proportional Fill Recalculation Starting for DB Company with m_cAllocs 8192.
2016-10-04 11:44:20.310 spid56 Proportional Fill Recalculation Completed for DB Company new m_cAllocs 8192, most free file is file 4.
2016-10-04 11:44:20.310 spid56 File [Company_data] (1) has 44 free extents and skip target of 90.
2016-10-04 11:44:20.310 spid56 File [ThirdFile] (4) has 3995 free extents and skip target of 1.
2016-10-04 11:44:20.310 spid56 File [SecondFile] (3) has 79 free extents and skip target of 50.
The file with the most free space is file ID 4, so the skip targets of the other files are set to (file 4’s free extents) / (free extents in the file). For example, the skip target for file 1 becomes the integer result of 3995 / 44 = 90.
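If you want to check output like this without doing the division by hand, something like the following Python sketch would work. It parses the three file lines from the output above and compares each logged skip target with the integer division. The regular expression is based only on the message format shown in this post, so treat it as an assumption about the format rather than a documented contract.

```python
import re

# The three per-file messages from the error log output above.
LOG_LINES = [
    "File [Company_data] (1) has 44 free extents and skip target of 90.",
    "File [ThirdFile] (4) has 3995 free extents and skip target of 1.",
    "File [SecondFile] (3) has 79 free extents and skip target of 50.",
]

# ASSUMPTION: the message format is exactly as shown in this post.
FILE_LINE = re.compile(
    r"File \[(?P<name>[^\]]+)\] \((?P<file_id>\d+)\) has (?P<free>\d+) "
    r"free extents and skip target of (?P<target>\d+)\."
)


def parse_file_lines(lines):
    """Extract name, file ID, free extents, and skip target from each line."""
    files = []
    for line in lines:
        match = FILE_LINE.search(line)
        if match:
            files.append({
                "name": match["name"],
                "file_id": int(match["file_id"]),
                "free": int(match["free"]),
                "target": int(match["target"]),
            })
    return files


files = parse_file_lines(LOG_LINES)
most_free = max(f["free"] for f in files)
for f in files:
    # None of these files is empty, so the division is safe here.
    expected = most_free // f["free"]
    print(f["name"], "logged:", f["target"], "computed:", expected)
```

For these three lines the logged and computed values agree: 90, 1, and 50.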
Now I’ll create a table that can have only one row per page, and force more than 8192 extent allocations to take place (by inserting more than 8192 x 8 rows, forcing that many pages to be allocated). This will also mean the files will have autogrown and will have roughly equal numbers of free extents.
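The arithmetic behind the row count is quick to check. This sketch assumes exactly one row per page and eight pages per extent. It ignores any pages that come from mixed extents and any allocation-map pages, so the real number of extent allocations will differ slightly.

```python
import math

PAGES_PER_EXTENT = 8       # an extent is eight 8KB pages
RECALC_THRESHOLD = 8192    # extent allocations that trigger a recalculation

rows_inserted = 70000      # the number of single-row inserts in the script
pages_needed = rows_inserted  # one row per page
extents_needed = math.ceil(pages_needed / PAGES_PER_EXTENT)

# Smallest row count that reaches the threshold under these assumptions.
min_rows = RECALC_THRESHOLD * PAGES_PER_EXTENT

print(extents_needed, min_rows)  # 8750 65536
```

So 70000 rows need roughly 8750 extents, comfortably more than the 8192 required to trigger a recalculation.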
USE [Company];
GO

CREATE TABLE [BigRows] (
    [c1] INT IDENTITY,
    [c2] CHAR (8000) DEFAULT 'a');
GO

SET NOCOUNT ON;
GO

INSERT INTO [BigRows] DEFAULT VALUES;
GO 70000

EXEC xp_readerrorlog;
GO
2016-10-04 11:55:28.840 spid56 Proportional Fill Recalculation Starting for DB Company with m_cAllocs 8192.
2016-10-04 11:55:28.840 spid56 Proportional Fill Recalculation Completed for DB Company new m_cAllocs 8192, most free file is file 3.
2016-10-04 11:55:28.840 spid56 File [Company_data] (1) has 0 free extents and skip target of 74.
2016-10-04 11:55:28.840 spid56 File [ThirdFile] (4) has 0 free extents and skip target of 74.
2016-10-04 11:55:28.840 spid56 File [SecondFile] (3) has 74 free extents and skip target of 1.
We can see that all the files have filled up and auto grown, and that file ID 3 just happens to be the one with the most free space at the time of this recalculation.
Spinlock Contention
The skip targets for the files in a filegroup are protected by the FGCB_PRP_FILL spinlock, so this spinlock has to be acquired for each extent allocation, to determine which file to allocate from next. There’s an exception to this when all the files in a filegroup have roughly the same amount of free space (so they all have a skip target of 1). In that case, there’s no need to acquire the spinlock to check the skip targets.
This means that if you create a filegroup that has file sizes that are different, the odds are that they will auto grow at different times and the skip targets will not all be 1, meaning the spinlock has to be acquired for each extent allocation. Not a huge deal, but it’s still extra CPU cycles and the possibility of spinlock contention occurring (for a database with a lot of insert activity) that you could avoid by making all the files in the filegroup the same size initially.
If you want, you can watch the FGCB_PRP_FILL spinlock (and others) using the code from this blog post.
Performance Implications
So when do you need to care about proportional fill?
One example is when trying to alleviate tempdb allocation bitmap contention. If you have a single tempdb data file, and huge PAGELATCH_UP contention on the first PFS page in that file (from a workload with many concurrent connections creating and dropping small temp tables), you might decide to add just one more data file to tempdb (which is not the correct solution). If that existing file is very full, and the new file isn’t, the skip target for the old file will be large and the skip target for the new file will be 1. This means that subsequent allocations in tempdb will be from the new file, moving all the PFS contention to the new file and not providing any contention relief at all! I discuss this case in my post on Correctly adding data files to tempdb.
The more common example is where a filegroup is full and someone adds another file to create space. In a similar way to the example above, subsequent allocations will come from the new file, meaning that when it’s time for a checkpoint operation, all the write activity will be on the new file (and its location on the I/O subsystem) rather than spread over multiple files (and multiple locations in the I/O subsystem). Depending on the characteristics of the I/O subsystem, this may or may not cause a degradation in performance.
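To get a feel for how lopsided this becomes, here is a rough Python estimate of each file's share of allocations. It assumes a file with skip target k gets one allocation every k passes of the round robin loop, and that a file with no free extents is treated as having one. The function name, file IDs, and extent counts are hypothetical.

```python
from fractions import Fraction


def allocation_shares(free_extents):
    """Estimate each file's share of extent allocations.

    ASSUMPTION: a file with skip target k allocates once every k passes.
    """
    most_free = max(free_extents.values())
    shares = {}
    for file_id, free in free_extents.items():
        # ASSUMPTION: an empty file is treated as having one free extent.
        target = max(1, most_free // max(free, 1))
        shares[file_id] = Fraction(1, target)   # allocations per pass
    total = sum(shares.values())
    return {file_id: rate / total for file_id, rate in shares.items()}


# Hypothetical filegroup: file 1 is nearly full, file 3 was just added.
shares = allocation_shares({1: 10, 3: 1000})
print({file_id: float(share) for file_id, share in shares.items()})
```

With these numbers the old file has a skip target of 100, so the new file receives 100 out of every 101 allocations, about 99% of them.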
Summary
Proportional fill is an algorithm that’s worth knowing about, so you don’t inadvertently cause a performance issue, and so that you can recognize a performance issue caused by a misconfiguration of file sizes in a filegroup. I don’t expect you to be using trace flag 1165, but if you’re interested, it’s a way to dig into the internals of the allocation system.
Enjoy!
13 thoughts on “Investigating the proportional fill algorithm”
Amazing article Paul…
That’s a great article.
What do you mean by “at least 8192 extent allocations take place in the filegroup”? Are you saying that it needs to grow by 64KB x 8192 = 512MB?
Thanks.
No, I mean that 8192 extent allocation requests occur, regardless of whether or not one or more files need to grow to allow that.
Hi, excellent article. I am working with a customer who does not believe in this. I am trying to create some demo scripts, but I am not sure how many extents are allocated in every loop for an individual file. Do you have that info? I cannot find it.
thanks
One extent per file per loop. Unless -E is used, in which case 64 extents (in 64 separate allocations) per file, to increase readahead performance of very large indexes.
Great post as usual, but I am trying to understand the tempdb initial size concept. When we perform a shrink file or shrink database operation, it will reset to a minimum as low as 8MB, no matter what we set as the initial file size. Why does it not shrink to the initial size that we set as part of Instant File Initialization (the Perform Volume Maintenance Tasks privilege)?
Greatly Appreciate any comments.
No – on server restart, the tempdb data files will reset to whatever size they were last specifically set to (either by database creation or subsequent ALTER DATABASE … MODIFY FILE statements).
There’s also the FGCB_add_remove latch for the same purpose, so now there are two things for the same purpose.
It would be helpful if you could please answer this.
What do you mean? FGCB_ADD_REMOVE latch isn’t for controlling proportional fill weightings. The FGCB_PRP_FILL spinlock controls access to them.
Please explain where in the file the skip targets are saved.
Regards
They’re not – it’s an in-memory data structure for the filegroup only.
I read on Books Online (https://docs.microsoft.com/en-us/sql/relational-databases/databases/database-files-and-filegroups?view=sql-server-ver15) that when autogrow is enabled and all space in all files in a filegroup is exhausted, the files get expanded in a round robin way and not all at (approximately) the same time. So for example, say you have 3 files A, B and C, all initially set to 1GB with autogrow set to 1GB. All 3 files will fill at approximately the same time, but initially only file A gets expanded to 2GB. Because of proportional fill, all writes will get sent to this file. This approach by Microsoft doesn’t make sense to me, as surely it will cause the contention you mentioned above?
No, because not *all* the allocations will come from the grown file, so the other files will quickly grow as well.