Chicago problem underscores importance of foolproof disaster recovery techniques
Breaking News, dit dit dit dit dit dit dot…“Airport authorities report that there is a fire at the regional air traffic control center that controls air traffic for Chicago O’Hare and Chicago Midway airports. All departing air traffic is grounded pending the restoration of air traffic control service and all in-bound air traffic has either been placed on a ground-stop or is being re-routed!” Yup, this was my morning news as I was preparing for my meeting in Chicago last week. I immediately thought to myself, no problem. My flight does not leave until 5:15 PM and that is plenty of time for the Federal Aviation Administration and their controllers to use disaster recovery techniques to sort things out and restore flights. Obviously, I thought, some flights would be delayed and even cancelled, but surely all flights would not still be grounded 11 hours later.
As the day progressed, the story continued to unfold about one individual, a contractor, who entered the regional air traffic control facility, started a fire, caused the evacuation of the center, and then attempted to take himself “offline” with a knife.
We still do not know the inside story, but we have been told that new equipment was on the way. On the way? What? This does not sound good for the air traveler. What if air traffic controllers ran a bank?
It is clear that this event exposed the vulnerability of at least the regional air traffic control center, not only in regard to security (Holy White House, Batman!) but also in regard to the system itself.
How could one building and one person affect so many?
Security is one issue, but doesn’t the FAA have a disaster recovery plan?
That is, a viable Plan B that can be implemented immediately?
I understand that as a result of this event planes did not collide mid-air or crash, but think of the financial impact. The net result on Day-1 was over 3,000 flights cancelled, millions of dollars lost, and thousands of travelers inconvenienced. Even as I write this blog, which is five days after the event, service is just returning to normal levels.
The one conclusion that I can draw from this event is that the managers responsible for disaster recovery planning for the regional ATC center never imagined the scenario that just occurred as a plausible event.
Or maybe they just assumed, as part of the planning process, that such a scenario was highly unlikely and thus not worthy of a plan. “No plan” or a “bad plan” is no excuse when it comes to air traffic control or similar services.
This is more than an “our tax dollars not at work” thing.
Imagine the calls your bank would receive if it went dark for five days…
Five days dark, in the context of today’s internet customers and their 24/7 mobile connection to your institution, is just not acceptable.
It would cause a significant loss of customer confidence in your institution.
Review, revise, and test
What can be learned from the O’Hare incident?
Now would be a good time to take a hard look at your current DR plan and imagine the scenarios that you thought not possible. Write them down and determine what your true vulnerabilities are, revise your plan, and prepare accordingly.
From our perspective, a disaster means just that, and a disaster plan includes unlikely events too.
Challenge your IT team to develop scenarios. Imagine losing it all and more, and then ask your team what could come next!
Key thought … if your plan is predicated on a rebuild elsewhere, then you have the wrong plan. Plainly speaking, customers will not be understanding when you tell them you have to rebuild in order to restore basic services, let alone your internet bank.
You don’t have time to rebuild a server room or data center. And if you have a disaster recovery plan or business continuity plan that is dependent on the physical movement of electronic applications, files, people, or hardware as the first step to restoration, you have the wrong plan.
Any plan that requires days to execute to achieve restoration should be considered unacceptable.
The virtual customer
Today’s customer is virtual, utilizing a combination of sources to remain connected.
Your plan should be virtual as well!
Always up … and always updated.
Restoration should take minutes or milliseconds, not days and numerous man-hours. Time is money, and time is customers: your customers.