This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. If you know someone who would benefit from this class, refer them and earn a $50 Amazon gift card – see class pages for details. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

Now that I’ve explained the downtime (RTO) and data loss (RPO) requirements plus the various recovery models and backup types, I want to explain how to design a recovery strategy.

When you’re designing a recovery strategy, the big temptation is to start by designing a backup strategy.

Don’t do it!

You need to design a restore strategy, not a backup strategy.

It’s all to do with framing your thinking around being able to restore while meeting your downtime and data-loss requirements.

  • Your downtime requirement tells you how quickly you must be able to restore, while meeting your data loss requirement.
  • Your data loss requirement tells you the point in time to which you must be able to restore, while meeting your downtime requirement.

If you think about it from the backup perspective first, you might design a backup strategy that does NOT allow you to meet your downtime and data-loss requirements.

A Real-Life Example

Here’s a story I like to tell when I’m teaching the consistency checking and corruption recovery section of our IE3 Immersion Event on High Availability and Disaster Recovery. Back in April 2004, when I still worked on the SQL Server team at Microsoft, I was helping out in a Product Support case involving a bank that had suffered corruption as part of a crash. They had a single full backup from January 2004 and half-hourly log backups between January and April, over 5000 backups in total. All the backups restored perfectly, but their restore ended up taking three days. All bank operations were down while the restore was taking place, and this extended outage eventually led to the bank going out of business because of a lack of client confidence.

What were they thinking, you may ask? How could they have designed that backup strategy?

Well that was their problem. They did not have much SQL Server experience and they thought they had a great backup strategy:

  • They had a full backup
  • They were running the full recovery model
  • They were taking regular log backups

Their mistake was that they had not taken their downtime (8 hours maximum) and data loss (zero) requirements into account. They met their data loss requirement but not their downtime requirement, and they had never tested doing a restore so had no idea that their requirements wouldn’t be met. They should have designed a restore strategy.

Designing a Restore Strategy

It’s imperative that you work out how quickly you have to be able to restore, and to what point, and then use that information to figure out what backups you need to do so you can perform the correct restore sequence when disaster strikes.

As far as data loss is concerned, the amount of data loss that can be tolerated defines how frequently you need to perform log backups. If 15 minutes is the acceptable amount of data loss, then you must perform log backups at least every 15 minutes. If zero is the acceptable amount of data loss, relying on log backups is dangerous, as a disaster may destroy the data and log files, making a tail-of-the-log . In that case, you will need some kind of synchronous technology like database mirroring or AlwaysOn availability groups – which we’ll discuss on day 22 of this series.

Data loss is the easier requirement to satisfy, as it’s really just thinking about log backup frequency. The downtime requirement is a lot harder, in my opinion.

As far as downtime is concerned, the amount of downtime that can be tolerated defines how quickly the entire restore sequence has to complete, from the initial restore of the most recent full backup, through to restoring the final log backup that gets the database to a point in time within the data loss requirement.

This means that you’re going to have to test the restore sequence to make sure that in the worst case it’s possible to restore within the downtime requirement.

Example Scenario

For example, with a 30-minute data-loss requirement and a 2 hour downtime requirement, you might design a restore strategy that is:

  • Perform a tail-of-the-log backup
  • Restore from the most recent full backup
  • Restore all log backups since the most recent full backup

And then set the full backup frequency at every day at midnight and the log backup frequency at every half hour. The full backup takes 20 minutes to complete and each log backup takes 5 minutes to complete.

The worst case scenario is that a crash occurs at 23:59:59. This means the restore sequence is (with some estimated completion times):

  • Perform the tail-of-the-log backup to backup the log generated since the log backup at 23:30. This takes 5 minutes.
  • Restore the full backup from midnight. This takes 20 minutes.
  • Restore the 48 log backups from today, starting with the one at 00:00 and ending with the one at 23:30. Each one takes 5 minutes, for total time of 240 minutes.
  • Restore the tail-of-the-log backup. This takes 5 minutes.

In total, the restore sequence takes 270 minutes, or 4 hours 30 minutes. This is clearly longer than the downtime requirement.

The solution is to add differential backups during each day, say every 4 hours starting at midnight, with each one taking 10 minutes to complete.

The restore sequence then becomes:

  • Perform the tail-of-the-log backup to backup the log generated since the log backup at 23:30. This takes 5 minutes.
  • Restore the full backup from midnight. This takes 20 minutes.
  • Restore the most recent differential backup from 20:00. This takes 10 minutes.
  • Restore the 8 log backups from today, starting with the one at 20:00 and ending with the one at 23:30. Each one takes 5 minutes, for total time of 40 minutes.
  • Restore the tail-of-the-log backup. This takes 5 minutes.

Now the total restore sequence takes 80 minutes, or 1 hour 20 minutes. This is within the downtime requirement.

Summary

As you can see from my (admittedly contrived) example, putting together a recovery strategy involves:

  1. Defining downtime and data-loss requirements for the database
  2. Working out what restores are necessary to meet #1
  3. Working out what backups are necessary to meet #2
  4. Testing the restore sequence to ensure it meets #1
  5. Potentially going back to #3 if the restore sequence does not meet #1 (or push back on the downtime requirement or provision a high-availability solution that allows a fast failover to a redundant copy of the database)

If the bank from 2004 had even done #4 then they’d have discovered that their backup strategy was inadequate and added in more frequent full and differential backups.

Don’t make the mistake they did – you always want to meet your downtime and data-loss requirements when disaster strikes.