This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. If you know someone who would benefit from this class, refer them and earn a $50 Amazon gift card – see class pages for details. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

For the next six posts in the series I’ll be helping you understand one of the most important areas of responsibility for people looking after SQL Server: ensuring there are sufficient backups to allow recovery if a disaster occurs.

What do I mean by ‘disaster’? Anything that causes damage to your database files – generally a problem with the I/O subsystem (your disks/SAN/storage). If the database files are damaged, you need to have some way of being able to recover the damaged data, otherwise it could mean the company losing time, business, clients, money, and possibly endangering your job.

You might think that a disaster won’t happen to your company, but I/O subsystem problems happen very commonly. Even if you have some kind of redundant copy of the database using one of SQL Server’s high-availability technologies (which we’ll cover later in this blog post series), you still need to have backups in case the high-availability technology fails. It’s like the common mantra about I.T. security –  “defense in depth” – the more layers of defense against disaster your have, the better.

To start with I want to cover a few concepts that will help frame your thinking about what backups to perform. Today I’ll discuss the industry terms RTO and RPO.

RTO stands for Recovery Time Objective. In simple terms, you can think of RTO as a measure of how much downtime is acceptable, or how quickly must the data be made accessible again. RTO is often talked about in terms of the number of nines of desirable up time or accessibility for the data/database/system. For example, 5-nines means 99.999% up-time, and if that’s being measured 24 hours a day, 365 days a year, then that translates into just over 5 minutes of allowable downtime per year. That’s really hard to achieve. It’s easier to meet 4-nines (52.5 minutes per year of downtime) or 3-nines (8.75 hours per year of downtime).

When discussing the desired up time, you have to decide whether it’s being measured 24×365, or 9am-5pm during weekdays, or some other time period. You also need to decide whether the measured downtime includes scheduled downtime for maintenance/patching or not, as it’s much easier to meet a high number of nines if scheduled maintenance is permitted.

You can read a little bit more about downtime SLAs in a blog post survey I conducted back in 2009.

RPO stands for Recovery Point Objective. Again, in simple terms, you can think of RPO as a measure of how much data or work it’s acceptable to lose. It’s relatively easy to achieve very minimal or even zero data/work loss using backups, but depending on the amount of damage the database suffered when the disaster hit, recovering might take a lot of time. For instance, if an entire database is destroyed, depending on the architecture of the database and the backups that exist, it may take a significant amount of time to recover the database up to the point of the disaster. Most RPOs are defined as the the amount of time for which work may be lost. For example, an RPO might be that the database should be recoverable to a point within thirty minutes of the disaster occurring, which means that up to 30 minutes of work may be lost.

Both of these terms are called SLAs – Service Level Agreements. For each database for which you’re responsible, you need to find out what the RTO and RPO requirements are, as you’re responsible for meeting them. This means pressing the business managers/owners for business-driven decisions on what acceptable RTO and RPO values are – these are not numbers that can be decided without considering the business implications of downtime and data loss.

The business management will likely want to have as little downtime and data loss as possible when a disaster occurs and it’s part of your job to figure out what’s possible using backups (and maybe high-availability technologies as well, if the budget permits) and make sure that the business management agrees on what’s practical to achieve with the resources you have available.

Tomorrow I’ll talk about recovery models and backup types, and then on day 8 I’ll discuss how to create a backup strategy that helps to minimize downtime and data loss.