
Lessons from a Real Disaster Recovery: You’ve Got to Recover the Network First


Click to learn more about author W. Curtis Preston.

Being forced to perform a complete disaster recovery under duress is incredibly difficult; it’s also a great way to learn. On my Restore it All podcast, I recently interviewed an IT professional who was the man on the ground after a hurricane took out the island where his company hosted two data centers. While the recovery was ultimately successful, the team learned a number of lessons along the way, and I thought these lessons might make a great series of articles. This first article in the series focuses on the network lessons learned in this real-life exercise.

The person spoke on the condition of anonymity, so I will not be using his real name or his company’s name. We will call him Ron. But he is a real person and this was a real disaster. You can hear the first part of his interview here.

When you assume…

It should come as no surprise to the reader that without a solid network connection, not much is going to happen. This is why so many companies spend so much money on highly redundant network connections. The amount of money spent on these connections also leads people to assume that they will always be up. That assumption was destroyed the day the hurricane took out the island. Nothing worked.

When Ron showed up at the island, he discovered they could not do even the most basic things, like logging into servers. This was because they used Active Directory and were relying on the Active Directory services on the mainland. The mainland was, of course, completely unavailable because the network connection was down.

It’s really hard to begin the recovery of your servers if you can’t log into them. There are many services, including backup and recovery services, that require users to log in as themselves. This means that while they may have had local administrative accounts, they might not have been able to run the backup system if they couldn’t log in as themselves, which required Active Directory.

You really want your backup system users logging in as themselves. That means they need a reliable connection. You may be thinking to yourself, “Of course they need a reliable connection. That’s why we have dual connections to the Internet! If either of them goes down, we can use the other.”

The 3-2-1 Rule

In backups, we often talk about the 3-2-1 rule, which specifies that you want three different versions of your data on two different media, one of which is off-site. I think we can easily adapt this concept to network infrastructure as well.

A best-case scenario would be three network connections running over at least two physically separate paths, one of which uses a completely different network type. This may seem like overkill, but if the company had done this prior to the disaster, their outage would not have lasted as long as it did.

What I mean is that while redundant physical connections are important, they are not going to be much use in the worst case if they both use the same type of equipment and ultimately the same provider. You need to look at some other way to provide a network connection.
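One way to make the adapted rule concrete is to write it down as a checklist over an inventory of links. The Python sketch below is only an illustration: the `Link` fields, the example providers, and the extra provider-diversity check are my assumptions, not anything Ron's company used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One network connection in a hypothetical site inventory."""
    name: str
    provider: str       # who ultimately carries the traffic
    physical_path: str  # conduit, cable route, or radio path
    medium: str         # e.g. "fiber", "microwave", "satellite"


def check_network_3_2_1(links):
    """Return a list of ways the inventory falls short of the adapted rule."""
    problems = []
    if len(links) < 3:
        problems.append("fewer than three connections")
    if len({link.physical_path for link in links}) < 2:
        problems.append("all connections share one physical path")
    if len({link.medium for link in links}) < 2:
        problems.append("no connection uses a different network type")
    if len({link.provider for link in links}) < 2:
        problems.append("all connections depend on one provider")
    return problems


# Hypothetical inventory: two fiber links that look redundant on paper.
site = [
    Link("primary", "ISP-A", "north conduit", "fiber"),
    Link("secondary", "ISP-A", "south conduit", "fiber"),
]
for problem in check_network_3_2_1(site):
    print(problem)
```

For this made-up inventory the check reports three gaps: too few connections, no alternative network type, and a single provider. Two separate conduits pass the physical-path test but do nothing about the other three.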

What About Satellite?

Satellite has historically not been the best choice for a network connection. Satellite links are expensive, typically don’t offer much bandwidth, and usually have horrible latency.

They did start using a satellite connection to provide basic network connectivity so that services like Active Directory would begin working. Things moved along swimmingly for a while, until the satellite connection started going down at some point each day.

The network administrators looked at the connection and wondered if it was weather. Are clouds obscuring their connection to the satellite? Is there some other local problem that is causing them to be unable to use this connection?

The problem turned out to be a daily network cap. Once they hit the cap, their network connection would throttle to such a slow speed that it almost appeared to be down.
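A quick back-of-the-envelope calculation shows how fast a daily cap can disappear during a recovery. The sketch below is a rough estimate only; the 10 GB cap and 5 Mbps rate are numbers I made up for illustration, not figures from Ron's story.

```python
def hours_until_cap(cap_gb, rate_mbps):
    """Hours of sustained traffic at rate_mbps before cap_gb is used up.

    Uses decimal units: 1 GB = 8,000 megabits.
    """
    if cap_gb < 0 or rate_mbps <= 0:
        raise ValueError("cap must be non-negative and rate must be positive")
    seconds = (cap_gb * 8000) / rate_mbps
    return seconds / 3600


# Hypothetical plan: 10 GB per day, with the link running at a steady 5 Mbps.
print(f"{hours_until_cap(10, 5):.1f} hours")
```

With those assumed numbers the cap is gone in about 4.4 hours, which would look a lot like a link that "goes down" at roughly the same point every day.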

This part of the story is why I think the new Starlink service from Elon Musk looks really encouraging. Using a large number of low-orbiting satellites, it is able to provide a network connection of 100-200 Mbps with a latency of 20-30 ms. The service is currently in beta and looks really promising in a number of ways. It could bring Internet to rural areas and islands, and it could also provide an inexpensive backup Internet connection for any company. This network connection would be completely independent of everything else your company does.

Watch Those Backhoes

As the recovery continued to progress, they started using microwave connections to a central network facility that gave them network connectivity to the mainland. That wasn’t exactly perfect, either.

Microwave transmissions require a line of sight between transmitter and receiver. This might also mean that you need multiple relay sites between you and the Internet connection you are trying to use. While this has obvious latency issues, it also means that each relay site is a single point of failure.
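To see why a chain of relay sites worries me, consider a simple model in which the path is up only when every relay site is up. The sketch below assumes each site fails independently and has the same availability; both are simplifications, and the 95% figure is hypothetical.

```python
def chain_availability(per_site, relay_sites):
    """Availability of a path that needs every relay site to be up.

    Assumes independent failures and identical per-site availability.
    """
    if not 0.0 <= per_site <= 1.0:
        raise ValueError("per_site must be between 0 and 1")
    if relay_sites < 0:
        raise ValueError("relay_sites must be non-negative")
    return per_site ** relay_sites


# Hypothetical: each relay site is up 95% of the time during the recovery.
for sites in (1, 3, 5):
    print(sites, f"{chain_availability(0.95, sites):.3f}")
```

Under those assumptions, three sites bring the path down to about 86% availability and five sites to about 77%. Every additional relay site multiplies in another chance to fail.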

Imagine having multiple single points of failure between you and your Internet connection, while simultaneously everyone around the sites is trying to piece their existence back together. That rebuilding requires all kinds of heavy equipment (e.g., backhoes) that can often hurt as much as it helps. It can take out power and destabilize buildings, causing them to fall and the microwave connection to become unstable. Ron said this dependency on the local physical infrastructure was often maddening.

I’m reminded of a talk I once saw from the CIO of Denver International Airport. From her office, she could see almost the entire airport at a glance. Every time she had some sort of network issue, she immediately grabbed a pair of binoculars and looked out her window for backhoes. She said that roughly half the time they had a network issue, it was caused by one of them.

Plan for the Worst

I’m a backup expert, not a network expert, and I don’t presume to tell you your job. All I wanted to do was to have you re-examine your thinking in light of this disaster. How would your redundant network connection work if you only had power to your building via generators, but no power to surrounding buildings that might host the routers you are using? (I’m not talking about your routers, mind you. I’m talking about your ISP’s routers.)

Have you ever considered a completely different type of network connection that might be more resilient in a true disaster? Have you looked into satellite or microwave networks? Perhaps now would be a good time to look into these alternatives so that you could put the infrastructure in place before something happens. Perhaps you can pay a basic fee for the connection, which you can then ramp up if you actually need to use it.

Take a few minutes and think about any assumptions you are making in your network design that might not be true if everything around you is destroyed. If something like a flood, hurricane, or terrorist act takes out all of the infrastructure around you, how would you communicate with the outside? If your answer is “I have no idea,” maybe you should look into that.
