Files and filegroups survey results

Last month I kicked off a survey asking you to run some code to send me data on the size, number of files, and number of filegroups in your databases. I got back more than 17000 results, and I’m presenting a couple of ways of interpreting the data here. Interestingly, I only had a handful of results for databases above 100GB and for databases with more than 100 files, so to keep the graphs readable, I’ve chosen to exclude those.
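(If you're curious what gathering that data involves, here's a rough sketch, not the actual survey code: data file counts and sizes per database can come from the instance-wide sys.master_files view, while filegroup counts have to come from sys.filegroups inside each database.)

-- Rough sketch only (not the actual survey code): data file count and size
-- per database; size is stored as a count of 8KB pages
SELECT
    DB_NAME (database_id) AS [Database],
    COUNT (*) AS [DataFileCount],
    SUM (size) * 8 / 1024 AS [DataSizeMB]
FROM sys.master_files
WHERE type_desc = N'ROWS'
GROUP BY database_id;
GO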

Single Filegroup

First off, for databases with only a primary filegroup, how many files were there in that filegroup?

  • 24 files: 3 databases
  • 20 files: 1 database
  • 16 files: 3 databases
  • 15 files: 1 database
  • 12 files: 1 database
  • 10 files: 1 database
  • 8 files: 10 databases
  • 7 files: 1 database
  • 6 files: 50 databases
  • 5 files: 6 databases
  • 4 files: 56 databases
  • 3 files: 27 databases
  • 2 files: 67 databases
  • 1 file: 16121 databases

Unsurprisingly, single file plus single filegroup is the most common physical layout. We see this over and over, regardless of the size of the database.

As your databases start to increase in size, you need to consider splitting them up (note I’m not using the overloaded term ‘partitioning’) to allow you to do:

  • Targeted restores in the event of wholesale data loss, helping to reduce downtime. If you have multiple filegroups, you can potentially do a partial restore of only the data required to get the OLTP portion of your workload up and running, restoring the rest of the filegroups later.
  • Targeted maintenance to reduce the time and resources necessary to manage fragmentation. If your indexes are split over multiple filegroups (using partitioning), you can rebuild or reorganize just the portions of an index that are fragmented.
  • Targeted performance management. If your workload uses/affects multiple portions of your database, it may be beneficial to place those different portions of the database on different sections of your I/O subsystem.

None of these things can be done with a single file plus single filegroup database.
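As a concrete sketch (with hypothetical database, file, and index names, and made-up paths and sizes), adding a new filegroup with a couple of data files and moving a large table onto it looks something like this:

-- Hypothetical names and paths: add a new filegroup with two data files
ALTER DATABASE SalesDB ADD FILEGROUP SalesData2;
GO

ALTER DATABASE SalesDB ADD FILE
    (NAME = N'SalesData2_1', FILENAME = N'D:\SQLData\SalesData2_1.ndf', SIZE = 10GB),
    (NAME = N'SalesData2_2', FILENAME = N'E:\SQLData\SalesData2_2.ndf', SIZE = 10GB)
TO FILEGROUP SalesData2;
GO

-- Move a table to the new filegroup by recreating its (hypothetical)
-- existing clustered index there
CREATE CLUSTERED INDEX CIX_BigTable ON dbo.BigTable (OrderDate)
    WITH (DROP_EXISTING = ON) ON SalesData2;
GO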

Data Files vs. Filegroups

[Chart: number of data files vs. number of filegroups]

This is really interesting. The majority of databases with more than one filegroup had as many data files as filegroups. I was prepared for this, as quite a few respondents pointed this out in their data and explained that the database came from a third-party vendor application.

As a general rule of thumb, I recommend each filegroup having 2-4 data files, as this will give overall better I/O performance than a single data file. I’ve seen this over and over on client systems and you can see quite a few data points on the graph above reflecting that too. I also have some empirical evidence from various performance tests I’ve done (narrow scenarios, but definite proof-points).

Just to be clear, based on comments, this is because of parallelism at the I/O subsystem level (not one thread per data file, as that’s a myth, but being able to write to multiple points on the array during checkpoints), and this does not apply if you only have a single physical drive on which to put your data (and I don’t mean what Windows sees as a single drive letter, I really mean just one actual drive). In fact, if you only have a single drive, you have other problems because that gives you no redundancy – you’re living on a knife-edge.
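For example (hypothetical names and drive letters), bringing a single-file primary filegroup up to that 2-4 file range is just a matter of adding files placed on different parts of the I/O subsystem:

-- Hypothetical names/paths: add two data files to the primary filegroup,
-- placed on different physical drives/LUNs
ALTER DATABASE SalesDB ADD FILE
    (NAME = N'SalesDB_Data2', FILENAME = N'E:\SQLData\SalesDB_Data2.ndf', SIZE = 10GB),
    (NAME = N'SalesDB_Data3', FILENAME = N'F:\SQLData\SalesDB_Data3.ndf', SIZE = 10GB)
TO FILEGROUP [PRIMARY];
GO

Note that adding files doesn’t rebalance existing data; the proportional-fill algorithm just favors the emptier files for new allocations, so you may need to rebuild indexes to spread the data evenly.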

Note that I didn’t include tempdb in this survey. Tempdb is a whole different kettle of fish (excellent British phrase!), where multiple data files can be required to alleviate in-memory contention for allocation bitmaps (classic PAGELATCH_UP/EX contention). For tempdb guidelines see: Tempdb configuration survey results and advice.

Data Files vs. Database Size

[Chart: number of data files vs. database size]

This is also really interesting. Without knowing what these databases are being used for, it seems that many of these databases have too many files for the size of the database (my gut feel, based on experience). My guess is that the file/filegroup layout was chosen based on rules that don’t equate to the normal reasons for having multiple files/filegroups, as I described above. What’s particularly surprising is the number of databases less than 1 GB in size that have many, many data files.

Summary

So how many data files and filegroups should you have? There’s no right answer, and the sweet spot is going to vary for all of you.

Some general rules of thumb:

  • As your databases get larger, it becomes more likely that you’re going to need multiple files and filegroups
  • Multiple filegroups give you enhanced possibilities for targeted disaster recovery, easier manageability, and I/O subsystem placement
  • Each filegroup should have at least 2-4 files, with tempdb being a special case

Whatever you do, don’t forget to test to figure out the optimal configuration for your workload.

The Accidental DBA (Day 11 of 30): Backups: Backup Storage and Retention

This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

There are two very common mistakes that DBAs (accidental or otherwise) make around backups:

  1. Storing the only copy of the backups on the same I/O subsystem as the databases
  2. Only keeping the most recent full backup, and differential/log backups since then

Backup Storage

The worst place in the world to store your backups is on the same portion of the I/O subsystem as the databases themselves.

If that I/O subsystem has a problem that causes corruption in the databases, the odds are that the same problem will affect the backups. Similarly, if someone makes a mistake and accidentally re-formats the I/O subsystem, it will destroy not only the databases but also the backups.

How can you recover if your backups have also been corrupted or destroyed? You can’t. Or at least you can’t restore from your backups. I’ve heard of many cases where companies have lost a lot of money, business, respect, and jobs because of backups being co-located with databases.

Backups need to at least be stored on a different I/O subsystem from the databases, on a different server. That way a localized failure will not damage the backups as well.

Even that’s not good enough though. What if the entire data center is damaged or shut down (think about catastrophic power loss without any backup power source)? If the only backups are stored onsite, recovery will not be possible. For this reason, it’s highly advisable to keep copies of your backups at an offsite location that’s accessible within a reasonable amount of time.

Note that I said ‘copies of your backups at an offsite location’. You shouldn’t have your backups stored only offsite, because most of the time a disaster is localized and the entire data center is not affected. You just need to cope with the case when it is. So you should have onsite and offsite copies of your backups, to allow for fast access to cope with localized problems as well as slower access when a more comprehensive disaster occurs.

A common backup storage strategy is to back up locally to disk first (either directly to a network share, or to local storage and then copied to a network share), and then copy the backups to an onsite archive system plus create copies to ship offsite.
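As a minimal sketch of the ‘local copy plus network copy’ part (hypothetical paths, and note this uses a mirrored backup media set, an Enterprise Edition feature, rather than a separate file-copy step):

-- Hypothetical paths: write the backup to local disk and to a network
-- share in one operation (MIRROR TO requires Enterprise Edition, and
-- FORMAT is required when creating a new mirrored media set)
BACKUP DATABASE SalesDB
TO DISK = N'G:\LocalBackups\SalesDB_Full.bak'
MIRROR TO DISK = N'\\BackupServer\SQLBackups\SalesDB_Full.bak'
WITH FORMAT, INIT, CHECKSUM;
GO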

Whatever your strategy becomes, just don’t have only a single, local copy of your backups, stored with the databases; that’s the worst possible way to do it.

Backup Retention

One of the issues you may have is that you only have limited onsite and offsite storage space for your backups – so how long do you keep your backups for?

Imagine a timeline of backups, where F = full, D = differential, and L = log backup: F1 L1 L2 D1 L3 L4 D2 L5 L6 D3 L7 L8 F2 L9 L10 D4 L11 L12 D5 L13 L14 D6 L15 L16

You may wonder why you need to keep any backup prior to the most recent full backup, F2. Think about this – what if the F2 full backup is corrupt?

If you haven’t kept any backups prior to F2, without the F2 backup to start your restore sequence, all the backups you have after F2 are completely useless.

If you can’t use the most recent full backup, the next best thing is the full backup you took before that, F1. To use that full backup, you’ll also need the most recent differential backup that was based on it, D3, plus log backups L7 and L8 to get you up to the point in time around F2, and then log backups L9 through L16 (because differential backups D4-D6 are based off F2, not F1, and so cannot be used).
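In T-SQL terms, that alternate restore sequence would look something like this (hypothetical database and backup file names):

-- Hypothetical names: older full backup, its most recent differential,
-- then every log backup through the most recent one
RESTORE DATABASE SalesDB FROM DISK = N'G:\Backups\SalesDB_F1.bak'
    WITH NORECOVERY, REPLACE, CHECKSUM;
RESTORE DATABASE SalesDB FROM DISK = N'G:\Backups\SalesDB_D3.bak'
    WITH NORECOVERY, CHECKSUM;
RESTORE LOG SalesDB FROM DISK = N'G:\Backups\SalesDB_L7.trn' WITH NORECOVERY, CHECKSUM;
RESTORE LOG SalesDB FROM DISK = N'G:\Backups\SalesDB_L8.trn' WITH NORECOVERY, CHECKSUM;
-- ...continue with log backups L9 through L15...
RESTORE LOG SalesDB FROM DISK = N'G:\Backups\SalesDB_L16.trn' WITH NORECOVERY, CHECKSUM;
RESTORE DATABASE SalesDB WITH RECOVERY;
GO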

As you can see, having a corrupt full backup also means having to restore a lot more backups, which will likely blow your downtime requirement. But that should be a rare case, and one for which an exception can usually be made. You just need to make sure you have the older backups.

So, again, how long do you keep your backups for?

If I had a backup strategy that was weekly full backups, daily differential backups, and log backups every 30 minutes, I would try to keep at least a month’s worth of backups at all times onsite, with three months of backups stored offsite. Do you think that sounds excessive? Unless you’re testing a multitude of restore sequences every week using all the older backups, which is rarely possible, you need to have ‘defense in depth’ – multiple possible alternate restore strategies if multiple older backups prove to be damaged or invalid.

Summary

This was the last post in our Accidental DBA series about SQL Server backups. Over the last six posts I’ve explained:

  • What the downtime (RTO) and data loss (RPO) requirements are, and why they’re critical to know for your databases
  • The three recovery models that can be used, and how they affect the log and log backups
  • The three main types of backups (full, differential, and log) and what they’re used for
  • How to design a restore strategy that takes into account the RPO and RTO requirements, which then lets you decide what backups you need to perform
  • Commonly used BACKUP options and why they’re useful
  • The importance of validating backups to make sure that when you need them, they’re usable
  • The importance of storing multiple copies of backups, both onsite and offsite, so a disaster doesn’t result in complete data loss
  • Finally, the importance of retaining older backups so your disaster recovery strategy can cope with some backups being invalid

I hope this has given you a good grounding in the importance of having a good, well-tested backup strategy. Protecting the data from loss and damage is really one of the most important things that a DBA does.

The Accidental DBA (Day 10 of 30): Backups: Backup Testing for Validation

This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

Kimberly and I have a saying: “You don’t have a backup until you’ve restored it!”. While this is not technically true, the sentiment is that you should be validating your backups, so that you have a good sense of confidence that when a disaster strikes and you go to use your backups, your disaster recovery will be successful.

What do I mean by ‘validating your backups’? I mean that at prescribed intervals you should emulate a disaster recovery response and perform a full restore sequence on a test server, involving your full, differential, and log backups. I usually advise doing this at least once a week, which also happens to be the most common interval between performing full backups. There’s really no frequency that’s too frequent for backup validation, but there are definitely those that are not frequent enough.

There are all kinds of excuses that people have for not regularly validating their backups, including:

  • I don’t have time
  • I don’t have the necessary disk space
  • That’s not my job
  • I don’t think it’s important

All of these excuses are invalid in my opinion. As a DBA, accidental or otherwise, it is part of your job to protect the integrity of the database for which you’re responsible. Simply performing backups is not enough. You’re also responsible for making sure that those backups are valid, and not just at the time they were taken, but as time passes as well, because an I/O subsystem can just as easily corrupt a backup on disk as it can corrupt data and log files on disk. It’s extremely important that, when a disaster strikes, you’re able to successfully use your backups – otherwise what’s the point of having them? You can’t ever get a guarantee (as I/O subsystem corruption can happen at any time) but you can get increased peace of mind.

I’ve lost count of the number of times I’ve heard of people with database corruption that also have corrupt backups, or the backups have been failing for months, or their restore sequence is broken because of a missing log backup, or some other variation of backups not working when they’re needed. You don’t want that to happen to you, and it’s really so easy to test your backups even just occasionally. If you think you don’t have enough disk space, you could even just restore to some external USB drives; anything will work as long as it lets you do the restore.

Now the question is, what RESTORE options do you want to use? At the very least, you want the restore system to verify the contents of the backup. Hopefully you’re using the CHECKSUM option for your backups (that I discussed on day 9), which validates page checksums as the pages are read from the data files and also creates a checksum of the entire backup file. If so, at the very least you should do something like:

RESTORE VERIFYONLY
FROM DISK = N'F:\SQLskills\20130609_SalesDB_Full.bck'
WITH CHECKSUM;
GO

That will re-validate all the page checksums of the pages in the database in the backup, and also recalculate the checksum of the entire backup and validate it against the checksum stored in the backup header. This doesn’t take any disk space at all, and will let you know if your backup is corrupt.

What’s even better is to actually perform the restore operations, using the CHECKSUM option at all times, and then running consistency checks (through DBCC CHECKDB that we’ll discuss later in the month) to make sure that all the backups you would use are valid, and the database itself is free of corruptions.
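Something like the following, with hypothetical logical file names and test-server paths (the backup file is the one from the example above):

-- Restore to a test database, validating checksums as the backup is read
RESTORE DATABASE SalesDB_Test
FROM DISK = N'F:\SQLskills\20130609_SalesDB_Full.bck'
WITH CHECKSUM,
    MOVE N'SalesDB_Data' TO N'E:\TestRestores\SalesDB_Test_Data.mdf',
    MOVE N'SalesDB_Log' TO N'E:\TestRestores\SalesDB_Test_Log.ldf',
    STATS = 10;
GO

-- Then check the restored copy for corruption
DBCC CHECKDB (N'SalesDB_Test') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO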

If you do find you’ve got corrupt backups, you know you’re going to have to take new backups or potentially go back to older backups that you’ve kept around for just this eventuality. Which brings up another issue: backup retention. How long do you keep backups around for, and where do you keep them? I’ll talk about that tomorrow.