By W. Curtis Preston.
A system administrator who has actually been through a real disaster told me in a recent podcast interview that most of the challenges he faced were about basic infrastructure such as internet access, lodging, and food. In this final article in the series, I want to focus on the backup and DR system itself – without it, you're getting nowhere. Click the links to read part one and part two.
I'm sure you tested your backup and DR system during its initial implementation. The important thing, however, is that this testing should continue. Only then will you be able to reliably report your current recovery time actual (RTA) and recovery point actual (RPA) back to the business units, so that they know what they can expect in any sort of disaster.
It’s also important that you test the recovery of as much of your data center as you can. The cloud makes this possible, so there’s really no excuse anymore. You should be regularly restoring your entire mission-critical enterprise somewhere in the cloud.
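One way to make such full-scale tests repeatable is to script them and measure the RTA each time. The sketch below is a minimal illustration of that idea; the system names and the `restore_to_cloud()` helper are hypothetical stand-ins for whatever restore API your backup product actually provides.

```python
import time

# Hypothetical inventory of mission-critical systems (illustrative names).
MISSION_CRITICAL = ["erp-db-01", "web-frontend", "auth-server"]

def restore_to_cloud(system: str) -> bool:
    """Placeholder for a call to your backup product's restore API."""
    return True  # assume success for this sketch

def run_dr_test(systems):
    """Restore each system to the cloud and report the measured RTA."""
    start = time.monotonic()
    results = {name: restore_to_cloud(name) for name in systems}
    rta_seconds = time.monotonic() - start
    return results, rta_seconds

results, rta = run_dr_test(MISSION_CRITICAL)
print(f"RTA for this test: {rta:.1f} seconds; all restored: {all(results.values())}")
```

Running a script like this on a schedule gives you a measured RTA you can hand to the business, rather than an estimate.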
The reason this is important is due to the advent of ransomware. The chances that your company could be impacted by ransomware seem to go up every day. But if you’re able to reliably tell someone “I can have all of our mission-critical systems back online in two days,” you might just save your company millions of dollars in ransom money.
As we saw in the first two parts of this series, your hands will often be tied behind your back during an actual recovery. Therefore, you should replicate this as much as possible during your test recoveries. If you are the person who designed or administers the backup and DR system, you should not be the person doing the test recovery. It should be performed by someone who is technically capable but not operationally familiar with the system itself, following your documentation alone. If they can execute the recovery that way, you will be prepared for an actual disaster.
Another important thing that we found out when interviewing this person who has actually been through a disaster is the importance of having extra backup system capacity. This is important because during the actual recovery, every spare computing cycle and every spare I/O will be used to conduct the recovery. If the recovery takes more than a few hours, your company needs to continue to function and its data needs to continue to be protected during that time.
The recovery that our podcast guest experienced took two weeks. You cannot go two weeks without backing up important company data, so it was lucky for him that they had extra backup system capacity they could activate during the disaster recovery. In his case, it was the tape library they used to make copies of backups. They were able to change the backup configurations so that backups went directly to those tape drives. While it might not have been ideal from a performance standpoint, it allowed backups to continue while the recovery progressed uninterrupted.
Another important thing we heard during this interview was the importance of automatic backup inclusion. In this case, it meant that as systems were recovered, they were automatically added back to the new standby backup system. I can’t stress strongly enough that this concept is a crucial component for any good system.
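The idea of automatic inclusion can be sketched as a simple reconciliation loop: compare a live inventory of recovered systems against the backup policy and enroll anything missing. The inventory source and the `enroll()` call below are assumptions standing in for your backup product's real client-registration API.

```python
def enroll(host: str, policy: set) -> None:
    """Placeholder for the backup product's add-client API call."""
    policy.add(host)

def reconcile(inventory, policy):
    """Enroll every inventoried system the backup policy doesn't cover."""
    newly_enrolled = sorted(set(inventory) - policy)
    for host in newly_enrolled:
        enroll(host, policy)
    return newly_enrolled

# Illustrative data: one system already protected, two just recovered.
policy = {"erp-db-01"}
inventory = ["erp-db-01", "web-frontend", "auth-server"]
added = reconcile(inventory, policy)
# added == ["auth-server", "web-frontend"]
```

Run on a schedule, a loop like this ensures that systems brought back online mid-recovery start getting protected again without anyone having to remember to add them.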
Finally, we discussed the need to be flexible. There is the plan, and then there is the actual recovery. The key in an actual disaster recovery is flexibility, which means you need flexible people. I am once again reminded that your most important resource is the people who are going to make everything happen. So my final piece of DR advice is to check on your people. Make sure they know how important they are to you, because one day they just might be the thing standing between your company and oblivion.