Nobody gets disaster recovery planning right the first time. Even the best plans rest on assumptions that cannot be trusted in an actual disaster until they have been examined. Successful contingency planning therefore depends on testing and evaluating the plan regularly.
Testing your disaster recovery plan is essential. Modern businesses rely heavily on computing equipment and software to operate – often it’s all they rely on. Data processing operations are volatile in nature, resulting in frequent changes to equipment, programs, and documentation. Without regular testing, these changes are likely to alter, unnoticed, many of the variables on which your Disaster Recovery Plan’s procedures are based.
So how do you go about testing your disaster recovery plan?
The first step is to understand what kinds of questions your test will seek to answer. In my opinion, the best place to start is coming up with questions that put the entire premise of your Disaster Recovery Plan on the table: will this plan actually result in recovery of your business?
Testing: Are You Truly Recoverable?
- Are employees and service providers able to execute the plan?
- Are data backups accessible within the desired timeframes?
- Are contingencies in place to accommodate resources or employees that may be unable to participate in the recovery process because of the disaster itself?
- Is the recovery time objective (RTO) practical and attainable?
- Can the systems be restored with an acceptable degree of data loss?
- Is the business’s recovery point objective (RPO) practical and attainable?
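Once a test produces timestamps, the RTO and RPO questions above reduce to simple arithmetic. The following is a minimal sketch, not a prescribed method: the `RecoveryTestResult` record, the function names, and the 4-hour and 1-hour targets are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryTestResult:
    """Timestamps captured during one recovery test (hypothetical record)."""
    outage_start: datetime       # when the simulated disaster began
    service_restored: datetime   # when the system was usable again
    last_good_backup: datetime   # timestamp of the data actually restored


def meets_rto(result: RecoveryTestResult, rto: timedelta) -> bool:
    """True if downtime stayed within the recovery time objective."""
    return result.service_restored - result.outage_start <= rto


def meets_rpo(result: RecoveryTestResult, rpo: timedelta) -> bool:
    """True if the window of lost data stayed within the recovery point objective."""
    return result.outage_start - result.last_good_backup <= rpo


# Illustrative values only; substitute the targets from your own plan.
result = RecoveryTestResult(
    outage_start=datetime(2024, 1, 10, 9, 0),
    service_restored=datetime(2024, 1, 10, 12, 30),
    last_good_backup=datetime(2024, 1, 10, 8, 15),
)
print(meets_rto(result, rto=timedelta(hours=4)))  # downtime 3h30m vs. 4h target
print(meets_rpo(result, rpo=timedelta(hours=1)))  # data loss 45m vs. 1h target
```

Recording the three timestamps during every test, rather than only a pass/fail verdict, also shows how much margin remains before an objective is missed.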
Testing can take a number of forms, which tend to be progressive in nature. One version of testing uses the following steps:
DR Rehearsal: Team members verbally go through the specific steps as documented in the Plan to confirm effectiveness and identify gaps, bottlenecks, or other weaknesses. This test provides the opportunity to review a plan with a larger subset of people, allowing the Plan Coordinator/Manager to make appropriate changes to the Plan. Staff should be familiar with procedures, equipment, and recovery facilities.
Failover Testing: Under this scenario, servers and applications are brought online in an isolated environment. There’s no impact to existing operations or uptime. Systems administrators ensure that all operating systems come up cleanly. Application administrators validate that all applications perform as expected.
Live-Failover Testing: A live-failover test activates the total Plan. The test will disrupt normal operations, and therefore should be approached with caution.
If this is your preferred method of testing your disaster recovery plan, make sure you have completed several successful DR Rehearsals and Failover Tests before conducting any Live-Failover Testing. Additionally, communicate all expected disruptions well in advance of performing this test.
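The prerequisite for live-failover testing can be expressed as a simple gate over your test history. This is a sketch under stated assumptions: `DrTestKind`, `ready_for_live_failover`, and the default threshold of three successes are hypothetical, since the text only says "several".

```python
from enum import Enum


class DrTestKind(Enum):
    REHEARSAL = "rehearsal"
    FAILOVER = "failover"
    LIVE_FAILOVER = "live-failover"


def ready_for_live_failover(history, min_successes=3):
    """Return True once enough rehearsals AND failover tests have passed.

    history: list of (DrTestKind, passed) tuples.
    min_successes: assumed threshold; pick a value that suits your plan.
    """
    def successes(kind):
        return sum(1 for k, passed in history if k is kind and passed)

    return (successes(DrTestKind.REHEARSAL) >= min_successes
            and successes(DrTestKind.FAILOVER) >= min_successes)


history = [
    (DrTestKind.REHEARSAL, True),
    (DrTestKind.REHEARSAL, True),
    (DrTestKind.REHEARSAL, True),
    (DrTestKind.FAILOVER, True),
    (DrTestKind.FAILOVER, False),  # a failed run does not count
    (DrTestKind.FAILOVER, True),
]
print(ready_for_live_failover(history))  # only two failover successes so far
```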
Example Disaster Recovery Plan Test Protocol
In the example below, the test has two primary steps: 1) plan the test, and 2) conduct the test. You should always identify the questions your test is seeking to answer in advance, both to save time and expense (there is no point testing aspects of your plan that have nothing to do with the questions you’re asking) and to avoid any post-hoc rationalization of the results.
Moreover, the questions that you seek to answer must be answerable by the tests you are performing. This is an often-overlooked part of test preparation. Not only must you avoid questions that a proposed test cannot answer, you must also avoid tests whose answers lack the certainty or clarity you need in order to act.
Most importantly, you must be prepared to incorporate your test results into your Disaster Recovery Plan. Whether it’s as part of a regular review and audit process or a less formal update or amendment, your plan should always reflect what you and your team are able to do in response to a disaster, not just what you believe needs to be done.
Conducting a Recovery Test
|What aspects of the plan are being tested? (What is the test’s purpose?)|
|What are the test’s objectives?|
|For each objective, what constitutes a successful test?|
|Tests and objectives explained to management, approval and support provided.|
|Test and expected duration announced.|
|At end of each test period, collect results.|
|Was recovery (or objective of DRP component subjected to testing) successful? If not, were parts of the recovery (or objective) successful? To what extent?|
|Determine and document the implications of the test results. Does successful recovery in a simple case imply successful recovery for all critical jobs in the tolerable outage period?|
|Does the result (whether successful or not) have applications for other DRP components not tested?|
|Seek information from participants and stakeholders to improve any aspects of the DRP component subjected to testing and any other components for which test results have application.|
|Notify appropriate company representatives, including other business groups/departments/areas of the test results.|
|Follow pre-determined company guidelines for submission and approval of modifications of the DRP based on test results.|
|Change the DRP manual as necessary.|
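One way to keep the protocol above honest is to record each objective together with its success criterion before the test runs, then fill in results afterwards. The sketch below is illustrative only: the `Objective` and `RecoveryTest` classes, their fields, and the sample objectives are assumptions, not part of any standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Objective:
    description: str
    success_criterion: str          # agreed before the test runs
    passed: Optional[bool] = None   # None until results are collected


@dataclass
class RecoveryTest:
    purpose: str
    objectives: List[Objective] = field(default_factory=list)

    def summary(self) -> str:
        """Classify the test as a full, partial, or failed recovery."""
        results = [o.passed for o in self.objectives]
        if not results or any(r is None for r in results):
            return "incomplete"
        if all(results):
            return "successful"
        if any(results):
            return f"partially successful ({sum(results)}/{len(results)} objectives met)"
        return "unsuccessful"


# Hypothetical objectives for illustration.
test = RecoveryTest(
    purpose="Recover the payroll application from off-site backups",
    objectives=[
        Objective("Restore database", "Restored copy passes integrity check"),
        Objective("Bring application online", "Users can log in at recovery site"),
    ],
)
print(test.summary())               # results not yet collected

test.objectives[0].passed = True
test.objectives[1].passed = False
print(test.summary())               # one of two objectives met
```

Writing the criterion down first makes post-hoc rationalization harder, and the partial-success case answers the "to what extent?" question directly.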
Areas to be tested
|Recovery of individual application systems by using files and documentation stored in remote backup facilities or Disaster Recovery Sites (“Recovery Files”).|
|Reloading system backups and performing an initial program load (IPL) by using Recovery Files.|
|Ability to process designated business and system operations on a different computer.|
|Ability of management to determine priority of systems in disaster-like conditions and on limited operational systems.|
|Ability to recover and process successfully with the Disaster Recovery Team composed of people selected at random from the succession plan (or with several roles left open if no succession plan exists).|
|Ability to clearly understand responsibilities and chain of command based exclusively on the DRP.|
|Effectiveness of security measures and security bypass procedures during the recovery period.|
|Ability to successfully implement the Emergency Evacuation Plan.|
|Ability of personnel to cope with a temporary loss of real-time online information.|
|Ability of personnel to continue day-to-day operations without business-critical applications or tools.|
|Ability to effectively communicate with Disaster Recovery Team (DRT) personnel (including potential replacements per the succession plan).|
|Ability of personnel to provide inputs to critical systems using equipment and media available at Disaster Recovery Sites.|
|Availability of applications and equipment identified as “business critical” at designated Disaster Recovery Site(s).|
|Availability of peripheral equipment and processing, such as [non-business critical equipment that is used regularly, e.g. printers and scanners]|
|Availability of support equipment, such as [non-business critical equipment that is important to operating conditions, e.g. air conditioners and dehumidifiers]|
|Availability of support: supplies, transportation, communication.|
|Distribution of output produced at the recovery site.|
|Availability of important forms and paper stock.|
|Ability to adapt the plan to disasters with lower potential for catastrophic disruption.|