At the Spring SQLintersection conference, I presented my session Understanding SQL Backups and Restores. In this session I primarily talk about how to create a solid recovery strategy and how to restore SQL Server backups. I cover full, differential, transaction log, and piecemeal restores. During the session, I stress how important it is to practice restoring your backups. There are multiple reasons for doing this. One is so that you know how to restore properly. You don’t want the first time you perform a restore with a full, a differential, and 30 log files to be in production, with a boss looking over your shoulder. Another benefit is restore validation: the process of performing regular restores to make sure that your backups are valid.
Even though you may be an expert at restoring backups, you still need a regular process that restores your backups for validation. I always tell my audience that if you are in a heavily regulated industry, examiners and auditors love to see that you have a restore validation process. With one in place, you can easily prove that your restores are good and show how long they take.
A couple of weeks after the conference, an attendee emailed me to say that when they tried to restore their backups from an offsite location to their DR site, the restores failed. They received the error “The media family on device ‘backup_file.bak’ is incorrectly formed”. I wrote back that this message usually indicates a bad backup file, or an attempt to restore a backup from a newer version of SQL Server to an older one, for example a 2016 backup to a 2014 instance. I was assured that both servers were the same version and that all backup files were failing, but that when they restored the same file to the original server, it worked. At this point, I asked them to explain their entire process. How were they taking the backups, getting the files offsite, and then getting them to the DR site? It turns out that they FTP from production to offsite, and then FTP from offsite to the DR site. The FTP is where the process was failing. They were transmitting in binary mode from production to offsite, but in text mode from offsite to DR. Transmitting the file in text mode was slightly modifying it, and that was enough to make the backup unreadable. They re-transmitted the file in binary mode and were able to restore successfully.
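One way to catch this kind of transfer damage before attempting a restore is to compare a checksum of the backup file taken at the source with one taken at the destination. The following is only a minimal sketch in Python, not what the attendee used: the function names `sha256_of_file` and `transfer_is_intact` are hypothetical, and it assumes you can run a script on both ends and carry the source hash across alongside the file.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in binary mode.

    Reading in chunks keeps memory use flat even for very large
    backup files. (Hypothetical helper, not part of any SQL Server tool.)
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(source_hash, received_path):
    """Compare the hash recorded at the source with the received file's hash.

    Returns True only when the received file is byte-for-byte identical
    to the file that was hashed at the source.
    """
    return sha256_of_file(received_path) == source_hash
```

In practice you would record the hash on the production side when the backup is taken, send it with the file, and call `transfer_is_intact` at the offsite and DR locations. A text-mode FTP transfer that rewrites line-ending bytes changes the hash, so the mismatch shows up immediately. A matching hash only proves the copy is identical to the source; it does not replace an actual test restore.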
Had they not tested this and a true production issue had occurred, they would have lost valuable time, which could also have resulted in significant financial loss. I cannot stress enough how important it is to practice testing and validating your recovery strategy. When an issue arises, you should have 100% confidence in your plan and your ability to recover. If anything in your recovery plan changes, make sure you run a full regression test of your recovery strategy as well.
I hope you found this post helpful! And, if you want to find all of our SQLskills SQL101 blog posts – check out: SQLskills.com/help/SQL101