What is a Disaster Recovery Plan?
Updated after Hurricane Sandy in October 2012
We live in a world full of extreme events. Some are expected, such as hurricanes in the US southeast or earthquakes along the San Andreas Fault. Others are often referred to as Black Swans: unforeseen catastrophes that not only surprise, but also unravel seemingly well-operated organizations.
Black Swan examples might include the September 11, 2001 tragedy or a potential Category 5 hurricane hitting New York City or Boston. Another illustrative example is the massive, destructive tropical storm-induced flooding of August 2011 in mountainous, land-locked states such as Vermont. Hurricane Sandy ravaged the east coast of the US in October 2012, putting many disaster recovery plans to the test.
As Black Swans are unpredictable, preparing for one is difficult. One might know something could potentially happen, but there are no preliminary indicators or signs that provide warning.
The more resilient a system, the more likely it is to successfully weather a Black Swan disaster event. A brittle system, by contrast, generally fails.
For instance, the copper pair-based network of cables around the world known as the public switched telephone network (PSTN) is older than most people alive, yet during power outages it is more likely to remain operational. Meanwhile, cellular phone networks, regardless of the protocol used (CDMA or GSM), are often crippled by power loss and over-utilization. Cellular networks are brittle systems that frequently fail in disaster scenarios.
A plan to proactively prepare for a range of disasters is vital. While every disaster is particular, in most cases there are certain general precautions that make for a smoother restart when the normal day-to-day returns.
That is the key point: while the disasters themselves are unpredictable, the general precautions are not, and they can be planned in advance.
Documents that provide guidance and policy in such circumstances are called disaster recovery plans, or DR plans for short. An alternative name is a "business continuity plan." However, this piece is relevant beyond businesses: it applies to any coordinated operation that strives to survive during and after a disaster strikes.
Real Disaster Recovery Plans
Developing a comprehensive disaster recovery plan is a difficult task. However, organizations such as SANS provide an array of useful information on creating one. Start with their section on Information Security Policy Templates and the Reading Room, composed of papers submitted by students who excelled on their certification exams. More generally, searching for "disaster recovery plan template" will provide more insight.
Keep in mind that certain industries must maintain DR plans based upon legal or regulatory guidelines. It might be necessary to query resources specific to the context.
If an entity does not have staff qualified to develop and implement a disaster recovery plan, engaging external parties is recommended. But starting with the basics is within the reach of almost everyone.
What Makes a Good DR Plan?
Ideally, a comprehensive plan is developed. But even a plan that is less than exhaustive meets the fundamental requirement, so long as the staff understand it and can execute it.
The three vital elements or characteristics of a basic DR plan might be:
After September 11, 2001 in New York City, it became standard for large firms to maintain operational DR centers outside of Manhattan.
Of course, how geographically diverse infrastructure can be is shaped by cost. Many DR centers were located in nearby New Jersey, which during Hurricane Sandy experienced equal or worse damage than Manhattan. A single disaster hitting infrastructure in both New York City and, say, Singapore is far less likely, but the cost of building out and maintaining such a presence is significantly higher.
What is the "Cost" and Expectation?
The term "cost" is more than mere monetary denominations. It also reflects effort and additional stress on normal operations.
For instance, the monetary cost of maintaining offsite data backups might be $500 a month. But there is an additional cost of confirming the operations, testing restores of that data, and so on. That extra work and thought should be counted alongside the monetary cost.
In that general sense, cost matters. The monetary figure certainly matters, but so does the cost of testing a restore on a regular basis, of finding a solid provider remote from your current area or country of familiarity, or of simply confirming that the backups are operational at all.
That cost means that a routine must be kept, such as transporting or testing backups. Routines can be time-consuming and are easily shelved in the face of more immediate tasks. Yet without those necessary routines, a strong DR policy becomes a useless document.
Measure the expected costs in a DR plan. Those costs determine whether the plan can actually be implemented, and ultimately they can determine whether the plan succeeds after a particular crisis.
A Cheap and Dirty DR Plan
The following list may be useful in a number of scenarios for developing an executable DR plan. It is not comprehensive, and it is not particular to any geographic location or industry. If used as a starting point, it can form the basis of a stronger plan.
There are a number of areas to consider and address:
In the event that communications break down, how do the staff communicate among themselves? How are other locations contacted? Note that during power outages, POTS (plain old telephone service) is often operational due to its low power requirement. Voice-over-IP or digital phone service is often the first casualty of any disaster. And cell phone networks are frequently overloaded or simply inoperative. Older "2G" cellular networks require less power and have better range than more current 3G and 4G networks. And of course internet access, and with it email, is often a luxury.
Dedicated POTS lines, numeric or one-way pagers and two-way radios are more reliable communication mediums.
An entity might have third-party vendors, customers or collaborators remote from the affected location. How do they continue to conduct communications with your entity? How do they find out more information during a crisis? How are they aware of what services are operational, such as email?
There are two ways to approach such a problem: proactively or reactively.
First, proactively, there may be a pre-planned backup method of communications, such as alternate email services, staff addresses or phone numbers, utilizing facilities remote from the affected location.
Second, reactively, a remote web site might be maintained and activated immediately when disaster strikes, providing third parties the necessary information.
It is inexpensive to maintain a simple and remote web site. Cheap virtual accounts can cost a mere $10 a month. A one-way numeric pager is in the same price range.
Additionally, do not ask vendors if they have a plan in the moments before an expected event; by that point, it is already too late. Ask at the onset of a relationship. And if they do have an adequate plan, distracting them at a critical moment hardly assists them.
Are the backups operational? Are they automated, or do they rely on staff to physically take them to another location? Is the remote location secure?
Are all systems on uninterruptible power supplies (UPS)?
Can everyone in the office disconnect the power from all systems in the event of potential power outages, brown-outs, or power surges when the grid comes back online?
All systems should be designed and built with resilience in mind. For instance, a firm shouldn't rely on a single domain name service provider. DNS services are cheap. Using multiple providers or locations reduces the risk of an outage for web services, email and so on. Many online service vendors provide DNS, including registrars and data centers. It can cost little or nothing to use two or three different DNS services.
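The multiple-provider advice above can even be sanity-checked mechanically. Below is a minimal sketch, with hypothetical function names, that flags a nameserver set drawn entirely from a single provider; the two-label suffix heuristic is deliberately naive and will mis-group some country-code domains:

```python
def provider_domains(ns_records):
    """Reduce each NS hostname to its provider's base domain
    (naive heuristic: keep the last two labels)."""
    return {".".join(ns.rstrip(".").split(".")[-2:]) for ns in ns_records}

def has_redundant_dns(ns_records, minimum=2):
    """True if the NS records span at least `minimum` distinct providers."""
    return len(provider_domains(ns_records)) >= minimum

# Two different providers: redundant.
print(has_redundant_dns(["ns1.examplehost.com.", "ns1.otherdns.net."]))
# Both nameservers at one provider: a single point of failure.
print(has_redundant_dns(["ns1.examplehost.com.", "ns2.examplehost.com."]))
```

The nameserver lists themselves would come from the registrar or a lookup tool; the point is only that the redundancy check can be automated and run periodically.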
Is the entity reliant on a single power grid or internet connection? Is any infrastructure, such as web services, remote from the primary location?
Consider such scenarios with security and monitoring systems. Do they have adequate backup power?
Are power issues so frequent or disastrous that backup generators are necessary?
The increased reliance on remote, internet-based providers certainly has benefits. But lost in that migration is the development of close proximity networks, which gain critical importance during crisis scenarios.
The vast majority of computer hardware is ordered over the internet or from remote suppliers. Have you visited the local computer store recently? Can you buy a spool of Category 5E cable, RJ-45 heads and a crimper in an emergency? Frequenting local establishments and gaining familiarity is crucial when crises delay internet or postal deliveries. The loss of local familiarity is an enormous hindrance to building crisis reaction plans.
Despite the explosion of the internet over the past two decades, close-proximity human social relations remain of primary importance, particularly in crisis scenarios. Get to know other entities in the office building, stores in the neighborhood, and so on. Internet-prompted atomization is a huge hindrance to resilience during a crisis.
One hundred years ago, a local disaster remained local. The earthquake that destroyed San Francisco in 1906 had little or no impact on New York City. Today, such a disaster could mean the loss of email or web hosting, or of critical members of a development staff.
During the Cold War, the likelihood of nuclear holocaust was phrased as how many "minutes to midnight" remained. The time varied with the level of conflict and potential threats at a particular moment.
With a short window of opportunity before "midnight," what tasks are necessary to complete? Is there enough time to perform a short routine and review all the measures taken?
With reasonable warning, what are the measures taken before a potential event takes place?
Before catastrophic events, some degree of warning is often available. What needs to be done once the clock is ticking?
Consider the days before a major hurricane is expected. It might be catastrophic, it might not. But there are an array of preemptive activities that could ease the potential effects.
Is an additional one-off manual backup to a remote location or removable media possible?
Are there physical assets to pack and prepare for removal? Assets that require adequate lead-time to evacuate? Communications among staff and with external parties?
Determining what the last person leaving the facility should do, if time permits, can also be useful. Certain systems are designed to be disaster-resistant; others are more sensitive to even routine events.
Does everyone know what the last person leaving the facility should do or double-check? Are there physical assets that can plausibly be removed in the moment? Backup media to take? Storage facilities to lock?
The entire point of a DR plan is to bring operations back to their normal state quickly and efficiently once the crisis passes. Without this outcome, the DR plan is useless. Therefore, planning the necessary steps is vital.
Does data have to be imported from a remote source? Over the internet, or physically from backup tapes?
How do you communicate with third parties and clients about the return to normal operations?
Backing up data is essential. But without the ability to actually restore the data, backups are pointless.
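One inexpensive way to prove restores actually work is to routinely extract each backup into a scratch directory and compare checksums against the originals. A minimal sketch, with illustrative function names not tied to any particular backup tool:

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path

def sha256_of(path):
    """Checksum a file so an original and its restored copy can be compared."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def backup_and_verify(source_dir, archive_path):
    """Archive source_dir, restore it into a scratch directory, and
    confirm every file's checksum matches the original."""
    source_dir = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=source_dir.name)

    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(scratch)
        restored_root = Path(scratch) / source_dir.name
        for original in source_dir.rglob("*"):
            if original.is_file():
                restored = restored_root / original.relative_to(source_dir)
                if sha256_of(original) != sha256_of(restored):
                    return False
    return True
```

A real routine would restore from the offsite copy rather than a freshly created archive, but the principle is the same: a backup is only verified once it has been read back and compared.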
Is everyone comprehensively familiar with the plan of action? Has all staff read and discussed the plan? Any action plan that sits in an innocuous binder on a shelf unread is worthless. Conducting a meeting on the DR plan is useful for a number of reasons, beyond the increased ability for everyone to understand it.
Discussions are an important method of finding holes in any plan. For instance, consider a DR plan to migrate the web services DNS A record to a remote emergency web site. Who has access to make DNS server changes? Does the relevant staff know how to change the relevant information? Or to update the web site information?
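Scripting the failover step gives such a discussion something concrete to review. The sketch below is purely illustrative: the field names are a hypothetical, provider-agnostic structure, and the actual API call to the DNS provider is omitted because it varies by vendor:

```python
def build_failover_record(zone, name, emergency_ip, ttl=300):
    """Describe the emergency A record change (hypothetical structure;
    adapt the field names to your DNS provider's API)."""
    return {
        "zone": zone,              # e.g. "example.com"
        "name": name,              # e.g. "www"
        "type": "A",
        "content": emergency_ip,   # address of the remote emergency site
        "ttl": ttl,                # a short TTL helps the change propagate quickly
    }

record = build_failover_record("example.com", "www", "203.0.113.10")
```

Even a stub like this forces the questions the discussion should answer: who holds the API credentials, what the emergency address is, and whether the TTL is short enough for the change to take effect in time.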
Step-by-step exercises on a quarterly basis should be sufficient, but exercises should also be conducted whenever major systems change or new staff join the entity.
There should be nothing overwhelming about creating a basic yet useful DR plan for any entity.
Extensive templates available on the internet may make the task seem daunting. An initial plan may provide a useful bare-bones approach. Over time, and with regular attention, such a simple plan can blossom into a comprehensive DR plan that is practical and fits the entity's needs.
These articles may provide more insight on approaching disaster recovery: