Investigating the proportional fill algorithm

This came up recently on the Microsoft Certified Master DL, and it's something I discuss in our IEPTO1 class because of its performance implications, so I thought it would make an interesting post.

Allocation Algorithms

The SQL Server Storage Engine (SE) uses two algorithms when allocating extents from files in a filegroup: round robin and proportional fill.

Round robin means that the SE will try to allocate from each file in a filegroup in succession. For instance, for a database with two files in the primary filegroup (with file IDs 1 and 3, as 2 is always the log file), the SE will try to allocate from file 1 then file 3 then file 1 then file 3, and so on.

The twist in this mechanism is that the SE also has to consider how much free space is in each of the files in the filegroup, and allocate more extents from the file(s) with more free space. In other words, the SE will allocate proportionally more frequently from files in a filegroup with more free space. This twist is called proportional fill.

Proportional fill works by assigning a number to each file in the filegroup, called a ‘skip target’. You can think of this as an inverse weighting, where the higher the value is above 1, the more times that file will be skipped when going round the round robin loop. During the round robin, the skip target for a file is examined, and if it’s equal to 1, an allocation takes place. If the skip target is higher than 1, it’s decremented by 1 (to a minimum value of 1), no allocation takes place, and consideration moves to the next file in the filegroup.

(Note that there’s a further twist to this: when the -E startup parameter is used, each file with a skip target of 1 will be used for 64 consecutive extent allocations before the round robin loop progresses. This is documented in Books Online here and is useful for increasing the contiguity of index leaf levels for very large scans – think data warehouses.)

The skip target for each file is the integer result of (number of free extents in file with most free space) / (number of free extents in this file). The files in the filegroup with the least amount of free space will therefore have the highest skip targets, and there has to be at least one file in the filegroup with a skip target of 1, guaranteeing that each time round the round robin loop, at least one extent allocation takes place.
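
If you want a rough idea of these numbers for your own database, here's a minimal sketch (my own approximation using sys.database_files and FILEPROPERTY, not the SE's internal bookkeeping) that estimates free extents and skip targets for the data files in the current database's primary filegroup:

-- Approximate free extents per file (8 pages per extent) and the resulting
-- skip targets. This is only an estimate of what the SE tracks internally,
-- and returns NULL for a file with no free extents.
SELECT
    [file_id],
    [name],
    ([size] - FILEPROPERTY ([name], 'SpaceUsed')) / 8 AS [FreeExtents],
    MAX (([size] - FILEPROPERTY ([name], 'SpaceUsed')) / 8) OVER ()
        / NULLIF (([size] - FILEPROPERTY ([name], 'SpaceUsed')) / 8, 0) AS [ApproxSkipTarget]
FROM sys.database_files
WHERE [data_space_id] = 1;
GO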

The skip targets are recalculated whenever a file is added to or removed from a filegroup, or after at least 8192 extent allocations have taken place in the filegroup.

Investigating the Skip Targets

There's an undocumented trace flag, 1165, that lets us see the skip targets whenever they're recalculated; I believe it was added in SQL Server 2008. It also requires trace flag 3605 to be enabled, which allows the debugging info to be output to the error log.

Let’s try it out!

First I'll turn on the trace flags, cycle the error log, create a small database, and look in the error log for pertinent information:

DBCC TRACEON (1165, 3605);
GO

EXEC sp_cycle_errorlog;
GO

USE [master];
GO

IF DATABASEPROPERTYEX (N'Company', N'Version') > 0
BEGIN
	ALTER DATABASE [Company] SET SINGLE_USER
		WITH ROLLBACK IMMEDIATE;
	DROP DATABASE [Company];
END
GO

CREATE DATABASE [Company] ON PRIMARY (
    NAME = N'Company_data',
    FILENAME = N'D:\SQLskills\Company_data.mdf',
    SIZE = 5MB,
    FILEGROWTH = 1MB)
LOG ON (
    NAME = N'Company_log',
    FILENAME = N'D:\SQLskills\Company_log.ldf'
);

EXEC xp_readerrorlog;
GO
2016-10-04 11:38:33.830 spid56       Proportional Fill Recalculation Starting for DB Company with m_cAllocs -856331000.
2016-10-04 11:38:33.830 spid56       Proportional Fill Recalculation Completed for DB Company new m_cAllocs 8192, most free file is file 1.
2016-10-04 11:38:33.830 spid56       	File [Company_data] (1) has 44 free extents and skip target of 1. 

m_cAllocs is the threshold at which the skip targets will be recalculated; the name is that of a member of the C++ class inside the SE that implements filegroup management. In the first line of output it has a garbage value, as the database has just been created and the counter hasn't been initialized yet.

Now I’ll add another file with the same size:

ALTER DATABASE [Company] ADD FILE (
    NAME = N'SecondFile',
    FILENAME = N'D:\SQLskills\SecondFile.ndf',
    SIZE = 5MB,
    FILEGROWTH = 1MB);
GO

EXEC xp_readerrorlog;
GO
2016-10-04 11:41:27.880 spid56       Proportional Fill Recalculation Starting for DB Company with m_cAllocs 8192.
2016-10-04 11:41:27.880 spid56       Proportional Fill Recalculation Completed for DB Company new m_cAllocs 8192, most free file is file 3.
2016-10-04 11:41:27.880 spid56       	File [Company_data] (1) has 44 free extents and skip target of 1. 
2016-10-04 11:41:27.880 spid56       	File [SecondFile] (3) has 79 free extents and skip target of 1. 

Note that even though the two files have different numbers of extents, the integer result of 79 / 44 is 1, so the skip targets are both set to 1.

Now I’ll add a much larger file:

ALTER DATABASE [Company] ADD FILE (
    NAME = N'ThirdFile',
    FILENAME = N'D:\SQLskills\ThirdFile.ndf',
    SIZE = 250MB,
    FILEGROWTH = 1MB);
GO

EXEC xp_readerrorlog;
GO
2016-10-04 11:44:20.310 spid56       Proportional Fill Recalculation Starting for DB Company with m_cAllocs 8192.
2016-10-04 11:44:20.310 spid56       Proportional Fill Recalculation Completed for DB Company new m_cAllocs 8192, most free file is file 4.
2016-10-04 11:44:20.310 spid56       	File [Company_data] (1) has 44 free extents and skip target of 90. 
2016-10-04 11:44:20.310 spid56       	File [ThirdFile] (4) has 3995 free extents and skip target of 1. 
2016-10-04 11:44:20.310 spid56       	File [SecondFile] (3) has 79 free extents and skip target of 50. 

The file with the most free space is file ID 4, so the skip targets of the other files are set to (file 4's free extents) / (that file's free extents). For example, the skip target for file 1 becomes the integer result of 3995 / 44 = 90, and for file 3 it becomes 3995 / 79 = 50.

Now I'll create a table that can have only one row per page, and force more than 8192 extent allocations to take place (by inserting more than 8192 x 8 = 65,536 rows, as each row fills a page and each extent holds 8 pages). This will also mean the files will have autogrown and will have roughly equal numbers of free extents.

USE [Company];
GO

CREATE TABLE [BigRows] (
	[c1] INT IDENTITY,
	[c2] CHAR (8000) DEFAULT 'a');
GO

SET NOCOUNT ON;
GO

INSERT INTO [BigRows] DEFAULT VALUES;
GO 70000

EXEC xp_readerrorlog;
GO
2016-10-04 11:55:28.840 spid56       Proportional Fill Recalculation Starting for DB Company with m_cAllocs 8192.
2016-10-04 11:55:28.840 spid56       Proportional Fill Recalculation Completed for DB Company new m_cAllocs 8192, most free file is file 3.
2016-10-04 11:55:28.840 spid56       	File [Company_data] (1) has 0 free extents and skip target of 74. 
2016-10-04 11:55:28.840 spid56       	File [ThirdFile] (4) has 0 free extents and skip target of 74. 
2016-10-04 11:55:28.840 spid56       	File [SecondFile] (3) has 74 free extents and skip target of 1. 

We can see that all the files have filled up and autogrown, and, as it happens, file ID 3 is now the one with the most free space.

Spinlock Contention

The skip targets for the files in a filegroup are protected by the FGCB_PRP_FILL spinlock, so this spinlock has to be acquired for each extent allocation, to determine which file to allocate from next. There’s an exception to this when all the files in a filegroup have roughly the same amount of free space (so they all have a skip target of 1). In that case, there’s no need to acquire the spinlock to check the skip targets.

This means that if you create a filegroup whose files have different sizes, the odds are that they'll autogrow at different times, the skip targets won't all be 1, and the spinlock has to be acquired for each extent allocation. Not a huge deal, but it's still extra CPU cycles, plus the possibility of spinlock contention (for a database with a lot of insert activity), that you could avoid by making all the files in the filegroup the same size initially.

If you want, you can watch the FGCB_PRP_FILL spinlock (and others) using the code from this blog post.
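
If you'd rather just take a quick look, a simple sketch is to sample the cumulative spinlock statistics from the sys.dm_os_spinlock_stats DMV before and after a period of heavy insert activity and compare the deltas:

-- Cumulative stats for the proportional fill spinlock since the last restart.
-- Sample before and after your workload and compare the differences.
SELECT [name], [collisions], [spins], [spins_per_collision], [backoffs]
FROM sys.dm_os_spinlock_stats
WHERE [name] = N'FGCB_PRP_FILL';
GO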

Performance Implications

So when do you need to care about proportional fill?

One example is when trying to alleviate tempdb allocation bitmap contention. If you have a single tempdb data file, and huge PAGELATCH_UP contention on the first PFS page in that file (from a workload with many concurrent connections creating and dropping small temp tables), you might decide to add just one more data file to tempdb (which is not the correct solution). If that existing file is very full, and the new file isn't, the skip target for the old file will be large and the skip target for the new file will be 1. This means that subsequent allocations in tempdb will be from the new file, moving all the PFS contention to the new file and not providing any contention relief at all! I discuss this case in my post on Correctly adding data files to tempdb.
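
The general idea from that post, sketched below with made-up file names, paths, and sizes (and not a substitute for the full discussion there), is to add several equally-sized data files rather than just one, so allocations, and hence the PFS contention, are spread across multiple files:

-- Hypothetical example only: file names, paths, and sizes are placeholders.
-- Add several equally-sized tempdb data files so the allocation load (and the
-- PFS contention) is spread across them, rather than landing on a single new file.
ALTER DATABASE [tempdb] ADD FILE (
    NAME = N'tempdev2',
    FILENAME = N'T:\TempDB\tempdev2.ndf',
    SIZE = 4GB,
    FILEGROWTH = 512MB);
ALTER DATABASE [tempdb] ADD FILE (
    NAME = N'tempdev3',
    FILENAME = N'T:\TempDB\tempdev3.ndf',
    SIZE = 4GB,
    FILEGROWTH = 512MB);
ALTER DATABASE [tempdb] ADD FILE (
    NAME = N'tempdev4',
    FILENAME = N'T:\TempDB\tempdev4.ndf',
    SIZE = 4GB,
    FILEGROWTH = 512MB);
GO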

The more common example is where a filegroup is full and someone adds another file to create space. In a similar way to the example above, subsequent allocations will come from the new file, meaning that when it's time for a checkpoint operation, all the write activity will be on the new file (and its location on the I/O subsystem) rather than spread over multiple files (and multiple locations in the I/O subsystem). Depending on the characteristics of the I/O subsystem, this may or may not cause a degradation in performance.
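
One way to avoid that skew (a sketch only, reusing the demo database's file names with made-up sizes) is to grow the existing files at the same time as adding the new one, so free space, and hence the skip targets, stay roughly even across the filegroup:

-- Hypothetical example: grow the existing (full) files as well as adding a new
-- one, so all files end up with similar free space and skip targets close to 1.
ALTER DATABASE [Company] MODIFY FILE (NAME = N'Company_data', SIZE = 10GB);
ALTER DATABASE [Company] MODIFY FILE (NAME = N'SecondFile', SIZE = 10GB);
ALTER DATABASE [Company] ADD FILE (
    NAME = N'FourthFile',
    FILENAME = N'D:\SQLskills\FourthFile.ndf',
    SIZE = 10GB,
    FILEGROWTH = 1GB);
GO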

Summary

Proportional fill is an algorithm that's worth knowing about, so you don't inadvertently cause a performance issue, and so you can recognize a performance issue caused by a misconfiguration of file sizes in a filegroup. I don't expect you to be using trace flag 1165, but if you're interested, it's a way to dig into the internals of the allocation system.

Enjoy!

DBCC CHECKDB scalability and performance benchmarking on SSDs

Back in February I ran a bunch of performance tests of DBCC CHECKDB on SSDs to evaluate the effects of degree of parallelism (DOP) and various options and trace flags, and now I'm finally getting around to presenting the results. Make sure to also read the recent post where I talk about the detrimental effect of computed-column indexes on DBCC CHECKDB performance.

[Edit: I’ve heard anecdotal evidence that reducing DOP for DBCC CHECKDB is even more important on 2012 and later versions. YMMV.]

The two variables I altered during my tests were:

  • DOP – 32, 16, 8, 4, 2, 1
  • Configuration – full DBCC CHECKDB, using WITH PHYSICAL_ONLY (PO), using PO and TF2549, using PO and TF2562, using PO and both trace flags

Trace flags 2549 and 2562 are described in KB 2634571, and can be summarized as:

  • 2549: treat each database file as if it’s on a separate physical drive for the purposes of driving readahead
  • 2562: perform the entire set of checks in a single ‘batch’ instead of multiple batches, and increase the efficiency of per-thread calls to get a new list of pages to read

These two trace flags really only apply to DBCC CHECKDB when the WITH PHYSICAL_ONLY option is used.
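
For reference, a run using both trace flags looks something like this minimal sketch (enabling the flags globally with DBCC TRACEON and then running the physical-only check against my test database):

-- Enable both trace flags globally, run the physical-only consistency check,
-- then turn the flags off again.
DBCC TRACEON (2549, 2562, -1);
GO

DBCC CHECKDB (N'AdventureWorks') WITH PHYSICAL_ONLY, NO_INFOMSGS;
GO

DBCC TRACEOFF (2549, 2562, -1);
GO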

My test system is as follows:

  • DELL R720 with two 8-core E5-2670 CPUs and hyper-threading enabled
  • 64GB of memory
  • The test database is an AdventureWorks database that Jonathan expanded to 500GB for me using his cool scripts. The database is split over 8 data files stored on two 320GB Fusion-io drives, with tempdb and its log placed on two more 320GB Fusion-io drives. I set things up this way to remove I/O waits from the test.
  • There was no additional load running concurrently with DBCC CHECKDB.

Complete Results

The complete results are shown below:

[Chart: complete DBCC CHECKDB test results for all DOPs and configurations]

There are some very interesting results from this graph, for the test I’m doing using SSDs:

  • For a full DBCC CHECKDB, there is a 70-80% performance gain from DOP 1 to DOP 2, from DOP 2 to DOP 4, and from DOP 4 to DOP 8.
  • For a full DBCC CHECKDB run, there is really no scalability gain in performance for a DOP greater than 8, and in fact a slight performance degradation for higher DOPs.
  • As you can see for the results of a full DBCC CHECKDB with DOP 8 and higher, DBCC CHECKDB on that server without I/O waits can process 0.5GB of the database per second. That’s pretty cool and is down to the fast SSDs (there were no PAGEIOLATCH_SH waits) and fast processors. Back in SQL Server 2000, our performance benchmark was about 1GB per minute. See here for some example SQL Server 2000 benchmark times that show around 3-5GB/minute, but not on SSDs.
  • The main boost in performance when running on SSDs comes from using the WITH PHYSICAL_ONLY option. This is, of course, to be expected as it cuts out all of the logical checks that make up the bulk of the CPU usage of DBCC CHECKDB.

For the scalability result running a full DBCC CHECKDB, some latch contention for structures inside DBCC CHECKDB starts to show up at DOP 8, but not enough to explain the lack of scalability at higher DOPs. That comes from much more time spent waiting on CXPACKET, roughly doubling from DOP 8 to DOP 16, and again from DOP 16 to DOP 32. I blame the query processor for that :-) In all seriousness, I didn't expect to see a scalability gain with DOP higher than 8, based on what I've seen in the past.

PHYSICAL_ONLY Results

The results without DOP 1 and DOP 2 are shown below, which allows us to focus on the PHYSICAL_ONLY results:

[Chart: PHYSICAL_ONLY test results, DOP 4 and higher]

Again, for the test I’m doing using SSDs, there are some interesting results:

  • DOP 4 seems to give the best performance overall, apart from the use of both trace flags with DOP 8.
  • Higher DOPs get progressively worse performance. Analysis of wait and latch statistics shows that this is caused by increasing amounts of latch contention at higher DOPs.
  • The two trace flags don’t have a pronounced effect on performance in my situation.

Summary

When running with a database on SSDs and very fast processors, full DBCC CHECKDB performance scales to DOP 8, and the main performance gain comes from using the WITH PHYSICAL_ONLY option. The two trace flags introduced in late 2011 don’t have a pronounced effect on PHYSICAL_ONLY performance. This is all to be expected, as the SSDs really remove the I/O component of DBCC CHECKDB performance and scalability from consideration.

Although these tests are not conclusive until I've run similar tests on a slower I/O subsystem, I'm tempted to recommend limiting DBCC CHECKDB's DOP to 8 if you can (maybe using Resource Governor) to reduce its resource drain on your system.
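
If you do want to go the Resource Governor route, here's a minimal sketch of one way to do it. The group name, classifier function, and the idea of routing a dedicated maintenance login are all placeholders of mine, not a prescription:

USE [master];
GO

-- Workload group that caps DOP at 8 for anything classified into it.
CREATE WORKLOAD GROUP [CheckDBGroup] WITH (MAX_DOP = 8);
GO

-- Hypothetical classifier: route a dedicated maintenance login into the group.
CREATE FUNCTION [dbo].[fn_CheckDBClassifier] ()
RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @GroupName sysname = N'default';
    IF SUSER_SNAME () = N'CheckDBLogin'
        SET @GroupName = N'CheckDBGroup';
    RETURN @GroupName;
END;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = [dbo].[fn_CheckDBClassifier]);
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO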

I obviously also have to recommend using SSDs to vastly reduce or effectively eliminate (in my case) I/O waits.

During April and May I’ll perform the same set of tests using 15k SCSI drives and also 7.2k SATA drives to see if and how the results differ.

Hope you found this interesting!

The pros and cons of trace flags

If I told you about an undocumented mod you could make to your car's brakes to make them more efficient, would you try it? What if a random mechanic told you? What if someone who's an expert in your make and model of car told you? What if the car manufacturer asked you to try it to solve a brake problem – and just drive around a bit to see what happens and see if it stops the car crashing again?

:-)

A month ago I kicked off a survey about your comfort level with using trace flags in production, and now it's time to look at the results.

[Survey results chart]

The Other values are:

  • 15 x "It depends!!!!"
  • 7 x "We use them where specifically needed for short amounts of time example – the deadlock trace flags (1204 and 1222) to dump the details to the error log. in some cases as needed we will use startup trace flags – example 845 on 2005 x64 standard to allow."
  • 6 x "If Paul says they're okay."
  • 4 x "We use Trace flags which well documented by MS *and* reputable sources and tested in a controlled manner to ensure continued stability."
  • 3 x "Deadlock full details – because deadlocks are rare, I need full details every time because lightening doesn't strike twice."
  • 3 x "I feel comfortable, each trace does different things, if we need it we'll use it."
  • 3 x "I rarely if ever use a trace flag in production but I'm comfortable using one if it's documented by a reputable source and it makes sense for us."
  • 2 x "Enable it if css asks, or it its documented in ms official documentation but test before implementing."
  • "Either from MS source or from a non-MS but reputable source."
  • "If I hear/read about it from a non-MS, but reputable source, to fix a bug or boost performance."
  • "If some doc is evident and MS knows what the TF actually does, instead of their telling you to turn it on without realizing it affects more than they thought (ex/ TF2330 stops all index usage stats gathering)."
  • "Only after testing in a dev/test environment first."
  • "Will use them on need basis – for example – deadlock trace 1222 to log the deadlock info in the error log."

Trace flags are interesting things. They change the behavior of SQL Server in a huge variety of sometimes useful, sometimes dangerous, and sometimes bizarre ways. There are many hundreds of trace flags, most of which are undocumented – and usually for the very good reasons that they're only supposed to be used for testing, for debugging problems with Product Support, or to enable specific bug fixes that may be detrimental to others.

I never advise turning on a trace flag (either when teaching a class or discussing with a client) unless I'm rock-solid 100% sure about what it does and what the effects are. You should do the same when considering trace flag use on a production system.

The example I like to use is trace flag 661, which disables the ghost cleanup background task. It's documented in KB 920093 and I've blogged about it here (and in a few other posts). It's a very useful trace flag that's safe to use, is documented by MS, is referenced by trustworthy sources (well, me :-), and can give a performance boost under some circumstances. However, it can have horrible side-effects if you don't understand the ramifications of turning it on: deleted data and index records stay around until the next index rebuild/reorganize, leading to page splits and data file growth.

Where trace flags are concerned, you need to test and validate before enabling in production – no matter *who* you hear about it from.

Be careful out there!