Last month I kicked off a survey asking you to run some code to send me data on the size, number of files, and number of filegroups in your databases. I got back more than 17000 results, and I’m presenting a couple of ways of interpreting the data here. Interestingly, I only had a handful of results for databases above 100GB and for databases with more than 100 files, so to keep the graphs readable, I’ve chosen to exclude those.
First off, for databases with only a primary filegroup, how many files were there in that filegroup?
- 24 files: 3 databases
- 20 files: 1 database
- 16 files: 3 databases
- 15 files: 1 database
- 12 files: 1 database
- 10 files: 1 database
- 8 files: 10 databases
- 7 files: 1 database
- 6 files: 50 databases
- 5 files: 6 databases
- 4 files: 56 databases
- 3 files: 27 databases
- 2 files: 67 databases
- 1 file: 16121 databases
Unsurprisingly, single file plus single filegroup is the most common physical layout. We see this over and over, regardless of the size of the database.
As your databases start to increase in size, you need to consider splitting them up (note I’m not using the overloaded term ‘partitioning’) to allow you to do:
- Targeted restores in the event of wholesale data loss, helping to reduce downtime. If you have multiple filegroups, you can potentially do a partial restore of only the data required to get the OLTP portion of your workload up and running, restoring the rest of the filegroups later.
- Targeted maintenance to reduce the time and resources necessary to manage fragmentation. If your indexes are split of multiple filegroups (using partitioning) you can rebuild or reorganize just the index portion that has fragmentation.
- Targeted performance management. If your workload uses/affects multiple portions of your database, it may be beneficial to place those different portions of the database on different sections of your I/O subsystem.
None of these things can be done with a single file plus single filegroup database.
Data Files vs. Filegroups
This is really interesting. The majority of databases that had more than one filegroup had the number of data files equal to the number of filegroups. I was prepared for this, with quite a few respondents pointing this out in their data and explaining that the database is from a third-party vendor application.
As a general rule of thumb, I recommend each filegroup having 2-4 data files, as this will give overall better I/O performance than a single data file. I’ve seen this over and over on client systems and you can see quite a few data points on the graph above reflecting that too. I also have some empirical evidence from various performance tests I’ve done (narrow scenarios, but definite proof-points):
Note that I didn’t include tempdb in this survey. Tempdb is a whole different kettle of fish (excellent British phrase!), where multiple data files can be required to alleviate in-memory contention for allocation bitmaps (classic PAGELATCH_UP/EX contention). For tempdb guidelines see: Tempdb configuration survey results and advice.
Data Files vs. Database Size
This is also really interesting. Without knowing what these databases are being used for, it seems that many of these databases have too many files for the size of the database (my gut feel, based on experience). My guess is that the file/filegroup layout was chosen based on rules that don’t equate to the normal reasons for having multiple file/filegroups, as I described above. What’s particularly surprising is the number of database less than 1-GB in size that have many, many data files.
So how many data files and filegroups should you have? There’s no right answer, and the sweet spot is going to vary for all of you.
Some general rules of thumb:
- As your databases get larger, it becomes more likely that you’re going to need multiple files and filegroups
- Multiple filegroups give you enhanced possibilities for targeted disaster recovery, easier manageability, and I/O subsystem placement
- Each filegroup should have 2-4 files at least, with tempdb being a special case
Whatever you do, don’t forget to test to figure out the optimal configuration for your workload.