There are two critical pieces of information you need when preparing your business’s Disaster Recovery Plan. The first is the amount of time your business-critical systems can be offline before your business suffers irreparable damage, which sets the deadline for getting them back to full operation. The second is the amount of data your business can afford to lose in a disaster.
Those are your RTO and RPO. And while determining them might seem straightforward (and it may be), it’s absolutely critical that those numbers are accurate.
So, what do they mean and how do you determine them?
RTO and RPO
RTO stands for Recovery Time Objective; RPO stands for Recovery Point Objective. Together, these two parameters define how long your business can afford to be offline and how much data loss it can tolerate.
RTO refers to the maximum amount of time it should take to restore a particular application/system to normal operations following a disaster, regardless of whether the result is data loss, a full-scale halt of business operations, or any other potential result. Bear in mind that when an application/system is disrupted, multiple actions might be necessary to restore that resource to full operations, like replacing damaged hardware, reprogramming and testing, data restoration, quality controls, etc.
RPO is the maximum amount of data your business can tolerate losing, measured in the amount of time between your last full valid data backup and when that resource has been fully restored.
Calculating RTO and RPO
As with pretty much everything in disaster recovery, establishing this part of your plan relies on information you learned in the previous part of your plan – the establishment of your Business-Critical Systems.
How you determine RTO and RPO depends on how you want your Plan to be set up. In my opinion, there are two general ways of handling this step:
- Calculate the RTO and RPO for individual applications and processes
- Calculate the RTO and RPO for each Business-Critical System
Either method has distinct advantages and disadvantages. Calculating the RTO and RPO for individual applications and processes is necessary for both versions. However, the second option takes those calculations a step (or two) further by combining the calculations for all the applications and processes involved in each Business-Critical System. As such, the second option gives a more “results-oriented” look at restoring critical services, but is also more complex to calculate, and may not be necessary for smaller businesses.
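One way to picture the second option is sketched below. It rests on two assumptions that are mine, not a fixed rule: a Business-Critical System is only back when every one of its components is back (and the components are restored in parallel), and the system can tolerate only as much data loss as its most sensitive component. The `Component` class, the component names, and the hour values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    rto_hours: float  # hypothetical per-application RTO
    rpo_hours: float  # hypothetical per-application RPO


def system_objectives(components):
    """Roll per-application targets up to one system-level (RTO, RPO) pair.

    Assumes parallel restoration, so the slowest component sets the system RTO,
    and assumes the tightest component RPO governs the whole system.
    """
    rto = max(c.rto_hours for c in components)
    rpo = min(c.rpo_hours for c in components)
    return rto, rpo


order_processing = [
    Component("Web storefront", 4, 24),
    Component("Inventory database", 8, 12),
    Component("Payment gateway link", 2, 48),
]
system_rto, system_rpo = system_objectives(order_processing)
print(system_rto, system_rpo)
```

If your components have to be restored one after another rather than in parallel, the system-level recovery time would be closer to the sum of the component times than to the maximum.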
Regardless of how you choose to calculate RTO and RPO, you should also be aware that your goal is not to restore each application or service in a vacuum, but rather as part of an overall restoration of your business. That means that the RTO and RPO of each application or service also depend on the order in which those applications and services are restored.
While it is possible to create a single business-wide RTO and RPO if your operations are small enough, it is still likely that certain applications/systems are more business-critical than others. Establish which aspects of your business are most critical to continuing operations. (I know everything is critical, but this is about triage, as you’re about to see, so be honest and thoughtful when making these evaluations.)
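To see why restoration order matters, here is a minimal sketch that assumes one team restores services strictly one at a time. The function name, the service names, and the hours are hypothetical; the point is only that each service comes back after everything ahead of it in the queue.

```python
from itertools import accumulate


def completion_times(restore_order):
    """Map each service to the hours elapsed when it is back online.

    restore_order is a list of (name, hours_to_restore) pairs, restored
    sequentially, so each service waits for all the ones before it.
    """
    names = [name for name, _ in restore_order]
    running_totals = accumulate(hours for _, hours in restore_order)
    return dict(zip(names, running_totals))


plan = [("Point-of-Sale", 4), ("Customer database", 6), ("Email", 2)]
times = completion_times(plan)
print(times)
```

In this made-up plan, email takes only 2 hours of work but is not back until hour 12, so an RTO for email that ignores the queue ahead of it would be unrealistic.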
Establish Your Criticality of Services List
No matter how you elect to do your RTO and RPO calculations, the results are only accurate when you account for the order in which your systems will be restored. Determining that order requires you to create your Criticality of Services List.
Take each of your Business-Critical Systems (or a list of services, applications, and operations for smaller companies or those with only a few Business-Critical Systems) and rank it based on how critical it is to the day-to-day operation of your business.
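If you keep this list in a spreadsheet or script, the ranking step might look like the sketch below. The `Service` class, the 1-to-4 scoring scheme, and the example services are hypothetical; use whatever scale reflects your own assessment.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    criticality: int  # hypothetical scale: 1 = most critical to daily operations


services = [
    Service("Email", 3),
    Service("Point-of-Sale", 1),
    Service("Payroll", 4),
    Service("Customer database", 2),
]


def criticality_list(items):
    """Return the services ordered from most to least critical."""
    return sorted(items, key=lambda s: s.criticality)


ranked = [s.name for s in criticality_list(services)]
print(ranked)
```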
Calculating RTO and RPO
Before we begin, you will likely find that the parameters for these calculations are a bit circular. Although they calculate different goals, they rely on each other significantly, such that one makes very little sense without the other.
That said, for each item on the list, you are going to be making two calculations:
One is the system’s Recovery Time Objective (RTO): the maximum acceptable amount of time your team has to completely restore the operations of that system.
The other is the maximum amount of data generated by that system that you can afford to lose if that system is disabled by a disaster. That is the system’s Recovery Point Objective (RPO) – the amount of data lost (calculated in time) counting backwards from the time of its restoration, through the disaster event, back to its most recent complete data backup.
Although it’s listed second, you probably want to calculate your RPO first, particularly for services or assets whose primary function is the generation of data on which your business operates. For example, if one of your critical services is your company’s Point-of-Sale system, you will want to know how many hours of sales your company can afford to lose in a disaster.
The calculation of each item’s RPO will thus inform its RTO: if a system can tolerate only 48 hours of data loss, your goal for restoring that system needs to be less than 48 hours. (As the images show, the difference in time between RTO and RPO is the amount of time that passed between the last full backup you made of the system in question and the moment it went offline.)
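The 48-hour example can be worked through as simple arithmetic. This sketch uses the definition of RPO given in this article, where the clock runs from the last full backup through to restoration, and it assumes the worst case: the disaster strikes just before the next scheduled backup. The function name and the 24-hour backup interval are hypothetical.

```python
def max_rto_hours(rpo_hours, backup_interval_hours):
    """Longest restore time that still keeps total data loss within the RPO.

    Worst case assumed: the system fails just before the next backup, so a
    full backup interval of data is already unprotected when the outage starts.
    """
    if backup_interval_hours >= rpo_hours:
        raise ValueError("backups are too infrequent to meet this RPO")
    return rpo_hours - backup_interval_hours


# A 48-hour RPO with daily full backups leaves 24 hours to restore the system.
pos_rto = max_rto_hours(rpo_hours=48, backup_interval_hours=24)
print(pos_rto)
```

Backing up more often widens the restore window: with the same 48-hour RPO and backups every 6 hours, the sketch allows up to 42 hours for restoration.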
What Informs the Calculation of RTO and RPO
A lot of factors will eventually go into your final RTO and RPO calculations, including the expertise available among your current personnel, the state of your business’s operations and equipment, the resources you have available to dedicate to emergency IT recovery services, and the resources you put into your backup and recovery systems. In my experience, however, you should begin your calculations with only your business needs in mind and work back from there.
Determine each system’s RPO based on an honest assessment of its importance in your business. Practically speaking, what do you believe is the most data you can lose without it jeopardizing your business’s ability to survive? Once you’ve made that determination, you can then bring in other factors like cost.
You should find that the more business-critical an application or system is, the lower the RTO and RPO must be. It makes sense that your business can tolerate longer outages in systems that rank lower on your Criticality of Services List.
However, not all businesses will find that the applications most closely linked with driving revenue are the ones they can least afford to have offline for significant periods of time. Professionals such as doctors and lawyers must put a higher priority on communication with their patients/clients than other businesses do, due to ethical considerations. When determining RTO and RPO, make sure that you’re considering factors beyond merely revenue generation, if such factors are an essential part of your business’s operation.
Testing and Modifying Your RTO and RPO
Once you have established your ideal goals for RTO and RPO, you need to expose those goals to two aspects of reality that are likely to force you to modify your goals: realistic assessment of your ability to recover and the cost of meeting your goals.
As for the first, you absolutely need to test your RTO and RPO before you finalize them, let alone make them a part of your Disaster Recovery Plan.
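One simple way to record the outcome of such a test is to compare the recovery times you actually measured in a drill against the targets you set. The sketch below assumes you already have both sets of numbers in hours; the function name, service names, and figures are hypothetical.

```python
def missed_targets(rto_targets, drill_results):
    """Return {service: hours over target} for each service that missed its RTO."""
    return {
        name: drill_results[name] - target
        for name, target in rto_targets.items()
        if drill_results[name] > target
    }


rto_targets = {"Point-of-Sale": 4, "Email": 12}    # ideal goals, in hours
drill_results = {"Point-of-Sale": 6, "Email": 9}   # measured during a test recovery
overruns = missed_targets(rto_targets, drill_results)
print(overruns)
```

Any service that shows up in the result needs either a more capable recovery process or a less ambitious RTO before the number goes into your Plan.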
Additionally, for both RTO and RPO there is an inverse relationship between the time allowed for recovery and the cost of supporting that recovery: the shorter the target, the more it costs to meet. You should expect that your most critical systems will cost you more per minute to restore than your less-critical systems, factoring for relative differences in cost between different types of systems. This relationship may also inform your overall calculations of RTO and RPO, particularly if the cost of achieving your preferred RTO is prohibitive.