RPO and RTO – recovery point objective and recovery time objective – are vital metrics when developing disaster recovery (DR) plans and strategy.
They are fundamental to figuring out the details of how you will recover from an unplanned outage, whether technical or, increasingly, as a result of malicious intent, such as ransomware.
That is because the requirements of your organisation, in terms of how much data you can afford to lose and how long you can afford systems to be unavailable, dictate the storage and data protection techniques and products you must specify.
Define RTO and RPO
RTO is defined by the global ICT standard for disaster recovery, ISO/IEC 27031:2011, as: “The period of time within which minimum levels of services and/or products and the supporting systems, applications or functions must be recovered after a disruption has occurred.”
RPO, meanwhile, is defined as: “The point in time to which data must be recovered after a disruption has occurred.”
In plain English, RTO is the amount of time you can afford systems and data to be unavailable. It is measured in time and is the period within which you require systems to be restored after an outage.
RPO is the amount of data you can afford to lose. It, too, is measured in time: the maximum age of the most recent recoverable copy of your data at the moment an outage occurs. So, it will be governed by how long ago the last backup and/or snapshot was taken, or by how recent the data is at the site to which you fail over.
So, for example, an organisation may determine that it can work with an RTO of one hour and an RPO of two hours’ worth of data.
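As a rough illustration of how those two figures can be tested against a protection scheme, the sketch below assumes that worst-case data loss equals the interval between backups or snapshots, and that the restore time has been measured in a DR test. The function name and the example figures are hypothetical.

```python
from datetime import timedelta


def meets_objectives(rpo, rto, backup_interval, measured_restore_time):
    """Return (rpo_ok, rto_ok) for one system.

    Assumption: worst-case data loss equals the gap between two
    consecutive backups/snapshots, because an outage can strike just
    before the next copy is taken.
    """
    rpo_ok = backup_interval <= rpo
    rto_ok = measured_restore_time <= rto
    return rpo_ok, rto_ok


# Objectives from the example above: RTO of one hour, RPO of two hours.
rpo = timedelta(hours=2)
rto = timedelta(hours=1)

# Hypothetical scheme: snapshots every 4 hours, restore tested at 45 minutes.
result = meets_objectives(rpo, rto, timedelta(hours=4), timedelta(minutes=45))
print(result)
```

In this hypothetical case the RTO is met, but four-hourly snapshots could lose up to four hours of data, so the snapshot interval would need to come down to two hours or less.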
The picture is likely to be less simple than that, however.
Different RTOs and RPOs for different applications?
The application landscape within an organisation doesn’t lend itself to blanket metrics that cover everything.
The reality is that granular RTOs and RPOs are probably what is required to respond to real-world situations.
So, for example, if all your systems went down, the priority would be to restore those that are public-facing and revenue-producing, which may include highly time-critical transactional applications.
These would be of the highest priority, and occupy the opposite end of the continuum from, say, long-stored and unused archived data or unstructured data without business-critical time constraints.
When it comes to practical consequences, those differences will be reflected in the use of different classes of storage and data protection.
Calculating RTO and RPO
The place to start is to determine the level of risk to the organisation of systems being unavailable and for what length of time that can be tolerated.
It makes sense to categorise and prioritise systems and to ask questions such as:
- Is it a customer-facing system?
- Is it transactional or does it only provide information?
- Which systems will have the most impact on customer revenue and/or reputation?
- How much data would be lost – per minute or hour, for example – if it became unavailable?
- How much does lost data cost in revenue terms, by the minute or hour?
- What is the maximum amount of data loss that can be tolerated?
- What is the longest we can afford systems – categorised by criticality – to be unavailable?
- Are there systems with dependencies on others and what are their RPOs/RTOs?
- How many employees would be affected by the system being down?
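One way to turn the answers to these questions into a priority order is to estimate an hourly cost of downtime for each system. The sketch below is illustrative only: the system names, the figures and the simple cost model (lost revenue plus the cost of idle staff) are all assumptions, not a standard method.

```python
from dataclasses import dataclass


@dataclass
class SystemProfile:
    name: str
    customer_facing: bool
    revenue_per_hour: float      # revenue lost per hour of outage
    employees_affected: int
    staff_cost_per_hour: float   # assumed average hourly cost per employee

    def downtime_cost_per_hour(self) -> float:
        # Assumed model: lost revenue plus the cost of idle staff.
        return self.revenue_per_hour + self.employees_affected * self.staff_cost_per_hour


# Hypothetical systems and figures, for illustration only.
systems = [
    SystemProfile("hr-portal", False, 0.0, 300, 35.0),
    SystemProfile("archive", False, 0.0, 5, 35.0),
    SystemProfile("web-shop", True, 12_000.0, 40, 35.0),
]

# Most costly outage first; customer-facing systems win ties.
ranked = sorted(
    systems,
    key=lambda s: (s.downtime_cost_per_hour(), s.customer_facing),
    reverse=True,
)

for s in ranked:
    print(f"{s.name}: {s.downtime_cost_per_hour():,.0f} per hour")
```

The systems at the top of the resulting list are the candidates for the tightest RPOs and RTOs.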
RPO and RTO examples
Ideal RPOs and RTOs would be zero. But the closer you get to zero, the more it costs.
So, zero is not an option for most systems, but some will need to get close to it – for example, banking transactional systems that can stand almost no data loss or system unavailability.
More realistically, you would probably categorise applications and systems by tier. So, for example:
- Tier 1 would be mission-critical applications – such as retail, transactional and customer-facing systems – that would be given RPO and RTO of less than, say, 10 minutes.
- Tier 2 systems would be important to the business but of less criticality than Tier 1, so RPO and RTO of an hour to three or four hours would be appropriate.
- Tier 3 would be all those systems that can be brought back online over hours and days.
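The tiers above can be expressed as a simple classification rule. This is a minimal sketch using the example thresholds from the list. Classifying by the stricter of the two objectives, and placing anything between 10 minutes and four hours in Tier 2, are assumptions made for illustration.

```python
from datetime import timedelta

# Thresholds follow the example tiers above; adjust to your own policy.
TIER_1_LIMIT = timedelta(minutes=10)
TIER_2_LIMIT = timedelta(hours=4)


def assign_tier(rpo: timedelta, rto: timedelta) -> int:
    """Classify a system by its stricter (smaller) objective."""
    strictest = min(rpo, rto)
    if strictest < TIER_1_LIMIT:
        return 1
    if strictest <= TIER_2_LIMIT:
        return 2
    return 3


# Hypothetical systems: (RPO, RTO)
print(assign_tier(timedelta(minutes=5), timedelta(minutes=5)))   # payment system
print(assign_tier(timedelta(hours=2), timedelta(hours=3)))       # internal CRM
print(assign_tier(timedelta(hours=24), timedelta(days=2)))       # archive
```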
RTO/RPO tier and data protection method
The tier, in large part, determines the level of data protection. So, for example, Tier 1 systems need dual writes and/or frequent replication of data to protect against localised issues. They will probably also need the ability to fail over rapidly to a remote location in case of site-level threats.
Tier 2 systems would need less frequent copying of data, but would also need to be able to fail over to remote systems. All tiers would be underpinned by daily backups, as well as staging to cloud and/or tape archives as data ages.
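The relationship between tier and protection method could be recorded as a simple lookup, as in the sketch below. The structure and names are hypothetical, and the empty Tier 3 entry is an assumption: the text only says that all tiers are underpinned by the shared baseline.

```python
# Baseline applied to every tier, as described above.
BASELINE = ["daily backup", "staging to cloud and/or tape archive as data ages"]

# Tier-specific measures. The Tier 3 entry is an assumption: only the
# shared baseline is stated for it.
TIER_PROTECTION = {
    1: ["dual writes and/or frequent replication", "rapid failover to a remote site"],
    2: ["less frequent replication", "failover to remote systems"],
    3: [],
}


def protection_plan(tier: int) -> list[str]:
    """Return the tier-specific measures followed by the shared baseline."""
    if tier not in TIER_PROTECTION:
        raise ValueError(f"unknown tier: {tier}")
    return TIER_PROTECTION[tier] + BASELINE


for tier in (1, 2, 3):
    print(tier, protection_plan(tier))
```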
Your organisation’s ability to meet RTO and RPO service-level agreements (SLAs) will be determined by the scope of the outage, so the objectives need to be worked out in conjunction with a risk assessment and business impact analysis that looks at the likely range of potential causes of downtime and ranks them in terms of likelihood and effect. A disk failure is quite obviously less impactful than a flooded site, for example.
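Ranking causes of downtime by likelihood and effect can be sketched as a basic risk matrix. The scenarios, the 1-to-5 scores and the likelihood-times-effect formula below are assumptions for illustration; a real assessment would use your own scoring scheme.

```python
# Hypothetical outage scenarios, scored from 1 (low) to 5 (high).
scenarios = [
    {"cause": "disk failure", "likelihood": 4, "effect": 1},
    {"cause": "ransomware", "likelihood": 3, "effect": 5},
    {"cause": "flooded site", "likelihood": 1, "effect": 5},
]


def risk_score(scenario: dict) -> int:
    # Assumed scoring model: likelihood multiplied by effect.
    return scenario["likelihood"] * scenario["effect"]


# Highest combined risk first.
ranked_risks = sorted(scenarios, key=risk_score, reverse=True)

for s in ranked_risks:
    print(f'{s["cause"]}: {risk_score(s)}')
```

With these made-up scores, a frequent but contained event such as a disk failure ranks below rarer events that take out a whole site or dataset.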
Cloud storage and RTO/RPO
When you work out the types of storage and data protection suited to your organisation and the various systems it must operate and protect, you are increasingly likely to have to take cloud storage into account.
Use of the cloud alongside the datacentre is pretty much mainstream, with surveys finding almost half of customers using the cloud for disaster recovery in some way.
The difficulty that brings to RPO and RTO requirements and calculations is that you are now dependent on an external provider. So, be sure to discuss thoroughly your RPO and RTO requirements for different classes of data and tie the provider down as much as possible with clear SLAs. There can be other complexities working with a cloud provider and remote sites.
So, you may even decide that some systems and data cannot be left to the governance of SLAs with a third party, and that you want to bring them back in-house.