I feel like I have had this discussion multiple times over the last few months with clients, so I decided to turn those discussions into a blog post. One of the things we often get asked about is how to plan a backup strategy, and my routine answer is “JUST DON’T.” You don’t need a backup strategy, you need a RESTORE strategy, and planning a BACKUP strategy without having a RESTORE strategy is fundamentally flawed. This post will explain why….
Step 1: Gather SLAs
If you don’t already have them, you need to get service level agreements (SLA’s) in place for Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO) for every database you are responsible for. These are the two key drivers in designing any solution for High Availability (HA) or Disaster Recover (DR) and without them, you won’t know if your solution will meet the business requirements. The RPO determines what the acceptable amount of data loss for the database is and drives how frequently you need backups to be occurring. The RTO determines how long you have to recover the data in the event of a total catastrophe.
It is very easy to design a backup strategy that should meet your RPO requirements. However, not every strategy that meets the RPO is also going to meet the RTO requirement. In fact, there are RTO’s out there that can’t be met with just backups alone, which is why we have HA/DR features in SQL Server like Failover Clustered Instances, Availability Groups, and Log Shipping, to reduce the time between failure to recovery to minutes (potentially seconds if you are lucky).
Step 2: Test RESTORE times
Do you know how long it takes to RESTORE each database from a FULL backup? Does your run book include critical configuration options like Instant File Initialization or backup compression that can impact how long it takes to restore a database? Does the database have an excessively large transaction log file that will require zero initialization upon creation if the existing database files are not there? Does the database use transparent database encryption (TDE) which will prevent it from being able to instant initialize the size of the data files for the database?
Until you test your RESTORE time for a FULL backup, you don’t know if you can meet your RTO utilizing backups or not, and you may need one of the previously mentioned HA/DR features in SQL Server. However, in an absolute worst case scenario where recovery can only occur from backups, it is still important to know how long each step of the process is going to take so that you can set expectations appropriately. Your minimum recovery time is going to at least be the time required to restore the most recent FULL backup.
Step 3: Determine Data Change Rate
How fast does data change for the database? This is going to determine how long subsequent restores can take to roll the database forward to a point-in-time minimizing data loss and is a key feeder into the RESTORE strategy. The most important things to consider here is:
The answers to those questions are not as simple as you might think.
One of the worst things I often see in terms of planning just a BACKUP strategy is running a FULL backup of the database, followed immediately by index maintenance. The FULL backup clears out the differential bitmaps in the database, and then index maintenance changes tons of pages and the next DIFFERENTIAL backup contains all of the pages that were modified since the last FULL backup. In one case for a client, the DIFFERENTIAL backup was actually larger than the FULL backup because the index maintenance task rebuilt every index in the database. This means that to restore they would essentially be restoring the entire database twice before they even started restoring transaction log backups. Generally the thinking is that you should have a FULL backup of the database before any maintenance happens “Just in case…” which I can understand on some levels but find absolutely ridiculous at other levels.
Step 4: Plan Out RESTORE Steps
How many files are you going to have to RESTORE in a worst case scenario? How many steps are involved and what are those steps? What happens if someone makes a mistake in the middle of those steps? It happens to the best of us, and I have made the mistake of running a FULL restore of a 8TB database WITH NORECOVERY and then missed scripting WITH NORECOVERY on the first LOG restore during a production down scenario and had to start over restoring the FULL backup again. Ask Erin!
Ideally you want to have the right backups in place to allow you to restore to a point-in-time, in the least number of steps possible to meet your RPO, but you are also planning out how to meet the RTO as well. RPO is easily met with a good FULL backup and subsequent LOG backups, minimizing risk of data loss. However, if you only perform FULL backups once a week, and you have 5 minute LOG backups being taken, you may have to restore over 2000+ log backups to roll the database forward fully if there is a failure at the end of the week requiring going to backups. This specifically is why you should NEVER plan a BACKUP strategy and you need to plan a RESTORE strategy first. Know the steps required and design around reducing those to the extent possible to get the fastest recovery time.
This is where DIFFERENTIAL backups come into the design since we can use the DIFFERENTIAL backup to minimize the number of steps that are required to roll the database forward to a point-in-time. If we had a weekly FULL backup and DIFFERNTIAL backups daily, the restore would only require:
- Restoring the latest FULL backup (WITH NORECOVERY)
- Restoring the latest DIFFERENTIAL backup (WITH NORECOVERY)
- Restoring the LOG backups since the differential – if at 5 minute intervals the worst case is a day worth of LOG backups is 288
We just reduced our operations from over 2000+ to 290 in a worst case by planning for how we are going to RESTORE the database. This might still be too much and you could introduce 12hr DIFFERENTIAL backups and cut it in half again, but it’s all about knowing what is going to be required to actually perform the recovery of the database.
Step 5: Design a BACKUP Strategy to Meet RTO for RESTORE
Now that you have done all the RESTORE planning, and hopefully testing, you can design the BACKUP strategy based on what you have learned/figured out. Some places do daily FULL backups, with 4 hour DIFFERENTIAL backups, and 15 minute LOG backups. Some places do weekly FULL with daily DIFFERENTIAL backups and hourly LOG backups. We even have customers that trigger the next LOG backup immediately when the running LOG backup completes, so that it is constantly backing up the transaction log. There is no one size fits all solution for SQL Server.
Step 6: Test and Verify
My good friend Buck Woody said “You don’t have a backup until you have restored it” in one of my absolute favorite sessions at SQLBits VII, “Creating a Business Continuity Plan.” It is so true… You can’t just deploy a backup strategy and expect that it is going to meet your recovery requirements permanently. You have to test it repeatedly and verify that you can continue to meet your RPO and RTO requirements. If you can’t then the strategy and design needs to change, or the business needs to be made aware of the problem to correctly set the expectations.