What does it mean to have a Disaster Recovery (DR) plan?
Businesses lose about $26.5 billion every year to IT downtime. For SMBs with 1,000 employees or fewer, downtime costs about $12,500 per day on average. When IT systems fail, the damage spreads like a domino effect: no revenue can be generated while data systems remain unavailable, you run compliance risks for as long as the outage lasts, offline systems hurt your reputation, and customers grow frustrated with degraded service, leading to dissatisfaction and possibly churn. And if you think smaller startups are less exposed, one report found that suffering a major outage with data loss raises the probability of a firm going under within a year by almost 70%.
People don’t like thinking about failures, but as an IT leader there’s no way around it. You can’t ignore failure and still expect to build a robust infrastructure that withstands unexpected scenarios, many of which are simply not in your control. What if your datacenter burns down? What if there’s a flood, a hurricane, or an earthquake? Natural disasters aside, cyberattacks have risen by almost 300% in the past five years, and a major data breach reportedly costs companies between $4 million and $10 million on average. We’re talking about losses that can cripple entire organizations, rendering some of them nearly inoperable. This is exactly what a strong disaster recovery (DR) plan and business continuity plan (BCP) are meant to prevent.
That said, one of the primary reasons most IT teams don’t prioritize disaster recovery is cost. DR strategies can run anywhere from a few thousand to millions of dollars annually. Most organizations already have their budgets stretched thin by the expense of simply keeping IT running, not to mention the roughly half of that budget that goes into maintenance alone. Disasters that may never happen end up low on the list of priorities. Even when DR does get attention, it’s usually a basic backup-and-restore setup where systems and data are copied to another location and, in case of a disaster, restored onto an existing or new system. That works as long as you’re dealing with small amounts of data and don’t mind a high Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
- Recovery Time Objective (RTO) is the maximum amount of time an organization can afford to have a system down, i.e. the target time for restoring business operations after an unplanned outage. Regardless of the size of your organization, you want a low RTO (anywhere from 15 minutes to 3-4 hours, depending on the amount of data and the extent of the outage).
- Recovery Point Objective (RPO) is the maximum amount of data an organization can afford to lose, measured as the window of time between the last usable backup and the moment of the disaster. You want RPO to be as low as possible as well (anywhere from a few minutes to a few hours, depending on the extent of the outage and the size of the business).
What’s an acceptable RTO and RPO for your organization? That’s something you need to decide by factoring in the losses that accumulate the longer it takes to recover data and restore business operations. A Business Impact Analysis (BIA) is the formal way of breaking this down. Different parts of your business will have different RTO and RPO priorities. For example, bringing the production system back has a higher priority than bringing back internal communications; not that communication isn’t important, but the losses from an entire production system being offline far surpass the losses from communications staying down a little longer. Let’s break this down into an illustrative criticality ranking table.
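The services, tiers, and targets below are purely illustrative; your own BIA will produce different services and numbers:

| Service | Criticality | Target RTO | Target RPO |
| --- | --- | --- | --- |
| Customer-facing portal | Tier 1 (critical) | Minutes | Near zero |
| Production / order processing | Tier 1 (critical) | Under 1 hour | Under 15 minutes |
| Email | Tier 2 (important) | A few hours | Near zero |
| Internal HR tools | Tier 3 (deferrable) | Up to a day | Up to 24 hours |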

What are your Disaster Recovery options?
As mentioned, DR strategies range from basic backup and restore, where RTO/RPO can stretch to several hours, to a multi-region active/active setup that can deliver near real-time restoration with essentially no data loss. The latter is far more expensive than the former, but that’s the cost of stronger business continuity. Let’s go over the disaster recovery options available today:
- Backup and restore: The most cost-effective strategy, suitable for smaller companies that can tolerate higher RPO and RTO. Data is backed up to a different location and, in case of a disaster, restored from there.
- Pilot light: A minimal copy of your production environment runs in another location, with only the critical components active. When a disaster occurs, those key components are quickly scaled up to full capacity. RTO and RPO are lower (a few minutes to a few hours), and the cost is correspondingly higher.
- Warm standby: Similar in spirit to pilot light, but a scaled-down yet fully functional copy of the production environment is kept running in the other location. Because the key services never sit idle, this offers the lowest RTO and RPO of the strategies so far (a few minutes). A sketch showing how to toggle between pilot light and warm standby follows this list.
- Multi-region active/active: Systems are built and kept running across multiple regions, with traffic distributed between them. When a disaster happens, traffic is redirected to healthy regions, which can give near real-time RTO and RPO depending on the scale of the event. Even full-region outages are survivable, and critical components are easier to bring back online. This is the most expensive strategy of all, but also the most effective, especially for larger organizations that can’t afford extended downtime.
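To make the trade-off concrete, here's a hedged Terraform sketch where a single variable decides how much capacity runs in the DR region: zero for plain backup and restore, a minimal footprint for pilot light, or near-production capacity for warm standby. The region, AMI, and instance counts are illustrative assumptions, not a prescription.

```hcl
# Illustrative only: toggle DR capacity in a secondary region with one variable.
variable "dr_mode" {
  description = "DR strategy: none, pilot_light, or warm_standby"
  type        = string
  default     = "pilot_light"
}

locals {
  # App servers to keep running in the DR region for each mode.
  dr_capacity = {
    none         = 0
    pilot_light  = 1 # just enough to scale up quickly after a disaster
    warm_standby = 3 # mirrors production capacity for near-instant failover
  }
  dr_instance_count = lookup(local.dr_capacity, var.dr_mode, 0)
}

provider "aws" {
  alias  = "dr"
  region = "us-west-2" # example DR region
}

resource "aws_instance" "dr_app" {
  provider      = aws.dr
  count         = local.dr_instance_count
  ami           = "ami-0123456789abcdef0" # placeholder golden image
  instance_type = "t3.medium"

  tags = {
    Name = "app-dr-${count.index}"
    Role = "disaster-recovery"
  }
}
```

Moving from pilot light to warm standby then becomes a one-line change followed by an apply, which is exactly the kind of lever you want when the business decides to pay for a lower RTO.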

Matching services to RTO, RPO, and uptime targets
Let’s take a few examples of how to define the RTO, RPO, and uptime you need for your services:
- The email server fails. This is critical, but you can afford an RTO of a few hours as long as no data is lost. You don’t necessarily need a full high-availability email cluster here; a cold standby instance with a replicated copy of the mail data will suffice.
- Internal HR services are slightly less critical: bringing the system back within a day is acceptable, and so is some data loss. Here too, a cold offsite instance with daily or even weekly backups is enough.
- Customer portals are highly critical and need to be available all the time. Plan portal replicas at both the main and offsite locations that fail over automatically when an outage happens (a DNS failover sketch follows below).
It’s a good idea to communicate openly with all departments and understand the criticality of every service in depth. Ask what can fail, how fast it needs to recover, how much data loss is acceptable, and how long the service can afford to be offline. Then plan highly available secondary and tertiary sites according to the criticality ranking.
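For the customer portal case above, the automatic failover itself can also be expressed in code. Here's a hedged sketch using Route 53 failover routing; the zone ID, domain, and IP addresses are placeholders, and your own setup may sit behind a load balancer or a different DNS provider entirely.

```hcl
# Illustrative DNS failover: send traffic to the offsite replica when the
# primary site's health check fails. Zone ID, domain, and IPs are placeholders.
resource "aws_route53_health_check" "portal_primary" {
  fqdn              = "portal-primary.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/healthz"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "portal_primary" {
  zone_id         = "Z0000000000EXAMPLE"
  name            = "portal.example.com"
  type            = "A"
  ttl             = 60
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.portal_primary.id
  records         = ["203.0.113.10"] # primary site

  failover_routing_policy {
    type = "PRIMARY"
  }
}

resource "aws_route53_record" "portal_secondary" {
  zone_id        = "Z0000000000EXAMPLE"
  name           = "portal.example.com"
  type           = "A"
  ttl            = 60
  set_identifier = "secondary"
  records        = ["198.51.100.20"] # offsite replica

  failover_routing_policy {
    type = "SECONDARY"
  }
}
```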
Infrastructure as code (IaC)
Infrastructure as Code represents a shift in infrastructure management: instead of static, manually configured systems, infrastructure becomes dynamic, programmable, and version-controlled. In simple terms, you no longer rely on manual methods to build infrastructure that’s static and hard to replace. If you need a new server, you define its specification in a configuration file and an IaC tool provisions it automatically.
With IaC, especially when combined with immutable infrastructure patterns, components are replaced rather than modified in place, so an entire stack can be swapped out for a new version. You get consistent, reproducible environments from a declarative, state-driven approach: you define the desired end state instead of scripting the steps to get there.
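As a minimal illustration of "declare the end state and let the tool converge on it" (the provider, region, AMI, and tags are placeholder assumptions):

```hcl
# Declarative end state: "one t3.micro web server with these tags should exist."
# Terraform compares this definition to real infrastructure and creates,
# updates, or replaces resources until the two match.
provider "aws" {
  region = "us-east-1" # example region
}

resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI
  instance_type = "t3.micro"

  tags = {
    Name        = "web-server"
    Environment = "production"
  }
}
```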
Why is this beneficial for Disaster Recovery planning?
First, infrastructure that is automated and version-controlled can be reconstructed faster and more reliably, which means faster recovery when infrastructure is lost to a disaster. Faster reconstruction translates directly into one of the most important DR outcomes: a low RTO. You can push RTO down further with techniques like granular restoration of individual components, parallel recovery that rebuilds independent systems simultaneously, and sequenced restoration that brings the most critical components back first.
Another essential aspect of IaC is multi-region restoration and failover. Because your infrastructure definitions aren’t tied to one location, quick provisioning can recreate components in the same region or a different one after a disaster. And if the infrastructure is already deployed across multiple regions, critical applications and data remain available even when one region goes down.
AWS CloudFormation and Terraform (often alongside Kubernetes manifests) are among the most prominent IaC tools used for disaster recovery. With Terraform, you can automate the entire infrastructure deployment and recovery process and keep configurations consistent across environments by defining the infrastructure once in code. You can also scale environments across multiple clouds, which makes it a solid foundation for disaster recovery planning, including testing.
Here’s a simple, illustrative Terraform configuration for a multi-region disaster recovery setup (the regions, AMI ID, and resource names are placeholders, not a prescribed implementation):
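```hcl
# Illustrative multi-region DR layout. Regions, the AMI, and names are placeholders.
provider "aws" {
  alias  = "primary"
  region = "us-east-1" # example primary region
}

provider "aws" {
  alias  = "dr"
  region = "us-west-2" # example DR region
}

# The production web server in the primary region.
resource "aws_instance" "web_primary" {
  provider      = aws.primary
  ami           = "ami-0123456789abcdef0" # placeholder golden image
  instance_type = "t3.medium"

  tags = {
    Name = "web-primary"
  }
}

# Keep a copy of the golden image in the DR region so replacement instances
# can be launched there quickly if the primary region is lost.
resource "aws_ami_copy" "web_dr_image" {
  provider          = aws.dr
  name              = "web-golden-image-dr-copy"
  source_ami_id     = aws_instance.web_primary.ami
  source_ami_region = "us-east-1"
}
```

Because the same definitions can be applied through the DR provider alias, the stack can be rebuilt in the secondary region on demand, and how much of it you keep running there is the pilot light versus warm standby decision from earlier.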

Next: A backup strategy
You now have an infrastructure strategy that provides version control, fast provisioning, and quick failover in case of disaster. Next, you need a backup strategy for the data itself, because recreating infrastructure doesn’t bring back the data that lived on it. A good backup plan extends beyond disaster recovery and supports business continuity across scenarios like natural disasters, ransomware attacks, and unplanned outages. A smart approach is to keep multiple backup copies spread across diverse storage locations. Just as important for quick recovery is regular testing to validate those backups. A solid backup plan should also be part of your Zero Trust strategy, able to isolate infected systems and restore backups with minimal loss.
Google Cloud, for example, offers a backup and disaster recovery solution for teams running cloud-based infrastructure. It provides immutable, indelible backups secured in a backup vault, which strengthens cyber resilience and helps satisfy compliance objectives, and it makes restoring backups after an outage much easier. Everything is centralized, so you can manage it all from a single dashboard. Better still, Google Cloud works well with Terraform and other IaC platforms, so backup operations can be integrated into your IaC workflow. It’s a simple three-step process:
- Create a backup vault, defining the region for the backup and the minimum retention period, i.e. how long backups must be kept before they can be deleted.
- Define a backup plan and set its location. Select which vault the backups should be stored in and define the specifications for the backup.
- Schedule automated backups so your data is backed up into the vault at the defined intervals. You can also choose which resources get backed up to which vaults, across different regions. (A small Terraform sketch of the underlying pattern follows this list.)
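The managed backup vault itself is usually created through the Backup and DR service (via the console, gcloud, or its own Terraform resources, depending on your provider version). As a hedged, code-only stand-in for the same pattern, here's a regional, versioned bucket with an enforced retention period; the project, bucket name, and retention value are placeholders.

```hcl
# Illustrative stand-in for a backup vault: a versioned bucket whose objects
# cannot be deleted or overwritten before the retention period expires.
# Project ID, bucket name, and retention values are placeholders.
provider "google" {
  project = "example-project"
  region  = "us-central1"
}

resource "google_storage_bucket" "backup_vault_standin" {
  name          = "example-dr-backup-vault" # bucket names must be globally unique
  location      = "US-CENTRAL1"
  storage_class = "NEARLINE"

  uniform_bucket_level_access = true

  versioning {
    enabled = true
  }

  retention_policy {
    retention_period = 2592000 # minimum retention: 30 days, in seconds
  }
}
```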
Running regular Disaster Recovery (DR) tests
Running DR with IaC makes testing easier, and testing is all the more critical because otherwise you’d never know whether the strategies you implemented actually work in real-world situations. An IaC environment is easier to test because infrastructure can be recreated consistently from automated code. Everything is version-controlled, so changes can be tracked and rolled back when something goes wrong, and the recovery process is effectively self-documented in the code itself. The point is: these environments should be tested rigorously, because untested DR plans fail easily, no matter how well designed they appear on paper.
Let’s look at a few exercises you can do to have a well-tested, foolproof DR environment.
Table-top exercises
Table-top testing is a simulated walk-through of disaster scenarios with key stakeholders discussing responses without actual technical implementation. This is more of a documentation phase where you need to define the various disaster scenarios you might need to respond to and build your DR strategy around that.
Implementation and outcome:
- Document natural disasters like flooding, earthquakes, and fires, along with security incidents like ransomware and phishing attacks and other internal outages.
- Also, document IaC-specific scenarios like corrupted state files, data leaks, etc.
- Speak to cross-functional teams like developers, operations, security, etc., and walk through recovery procedures step-by-step. Document gaps and improvement opportunities.
An example scenario could be: "The production Terraform state file has been corrupted. What steps would the team take to recover using backup state files?"
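One way to make that recovery possible in the first place is to keep Terraform state in a versioned remote backend rather than on a single machine. A minimal sketch, assuming an S3 backend with bucket versioning enabled (bucket, key, and table names are placeholders):

```hcl
# Remote state in an S3 bucket with versioning enabled, so earlier state
# versions can be restored if the current one is corrupted. Names are placeholders.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock" # prevents concurrent state writes
  }
}

# Table-top answer in practice: restore the previous object version of the
# state key in S3, then run `terraform plan` to confirm the restored state
# still matches the real infrastructure before making any changes.
```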
Infrastructure recovery tests
Infrastructure recovery tests help validate an organization’s ability to restore and resume operations after a disruptive event. They are practical tests that verify you can recreate your infrastructure from its templates in an isolated environment.
Implementation and outcome:
- Create a separate test environment and execute IaC scripts to build replica environments.
- Validate that all components work properly. If they don’t, you can identify gaps in your DR plan and fix them accordingly.
- Frequently testing these environments keeps the DR plan up to date while re-establishing the organization’s trust in the plan.
A simple way to run such a test is to create a test environment, validate that all services are functional, and then destroy the environment, leaving the main environment untouched, as in the sketch below.
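Here's a hedged sketch of what such a throwaway environment might look like in code; the module path, the environment variable, and the region are hypothetical and assume your production stack is already packaged as a reusable module.

```hcl
# Hypothetical layout: environments/dr-test/main.tf reuses the same module
# that defines production, pointed at an isolated region (or account).
provider "aws" {
  alias  = "dr_test"
  region = "eu-west-1" # example isolated test region
}

module "dr_test_stack" {
  source      = "../../modules/app" # hypothetical shared application module
  environment = "dr-test"           # hypothetical variable used for naming and tagging

  providers = {
    aws = aws.dr_test
  }
}

# Typical test cycle (run from this directory):
#   terraform init && terraform apply   -> build the replica environment
#   run smoke tests and validate services
#   terraform destroy                   -> tear it down; production is untouched
```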
Configuration drift tests
A configuration drift test aims to identify and remediate differences between the production environment and the recovery environment, ensuring the DR environment is actually ready for a failover. Configuration drift usually happens when changes are made without proper tracking or documentation. The faster inconsistencies are discovered, the quicker they can be resolved, which keeps downtime low.
Implementation and outcome:
- Run drift detection tools against the production environment and document any differences found.
- Set up an automated testing schedule so drift checks run continuously and flag discrepancies as they appear.
- Automate remediation as well, so the production environment is always brought back in line with its defined configuration, including the latest updates and patches.
- Tools like AWS Config, Pulumi's drift detection features, and Terraform's plan command (for example, terraform plan -detailed-exitcode) can be used to detect drift.
Data recovery tests
Data recovery tests validate your systems’ ability to restore critical data within the defined RTO and RPO. They ensure that critical services can be brought back before the organization starts incurring heavy losses.
Implementation and outcome:
- Create test environments using IaC and restore production backup data into them (see the sketch after this list).
- Once restored, validate the data's integrity and completeness, and verify that the recovery time falls within the RTO and RPO defined for the business.
- If the recovered data is incomplete or unusable, identify faults in the backup and remediate the process.
- If data restoration takes longer than the required RTO and RPO, identify what’s causing the delay and improve backup health to meet these targets.
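For example, if the critical data lives in a managed database, a hedged sketch of such a test might provision a throwaway instance from a production snapshot and time the restore; the identifier, snapshot name, engine, and instance class are placeholders.

```hcl
# Illustrative data recovery test: restore a production snapshot into a
# temporary database instance, validate the data, then destroy it.
# The snapshot identifier, engine, and names are placeholders.
resource "aws_db_instance" "dr_restore_test" {
  identifier          = "dr-restore-test"
  snapshot_identifier = "prod-db-snapshot-2024-01-01" # placeholder snapshot
  engine              = "postgres"                    # must match the snapshot's engine
  instance_class      = "db.t3.medium"
  skip_final_snapshot = true # this is a disposable test copy

  tags = {
    Purpose = "dr-data-recovery-test"
  }
}
```

Timing the apply plus your data validation checks against the RTO/RPO targets turns the test into a concrete pass or fail.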
Pipeline recovery tests
Pipeline recovery tests validate the recovery of the CI/CD pipelines that deploy your IaC. They focus on verifying that your disaster recovery plan can restore and reconfigure pipelines, preserving integrity and performance after a disruption, by simulating and rehearsing the recovery process.
Implementation and outcome:
- Create backup pipelines in separate environments and test pipeline recovery from source control.
- Validate secret and credential recovery processes (a secret replication sketch follows this list) and practice promoting backup pipelines to primary status.
- These tests help ensure that pipelines, which are sequences of build, processing, and deployment steps, can be effectively recovered and restored to a functional state after a disaster or outage.
- Successful pipeline recovery tests are crucial for maintaining business continuity and ensuring that CI/CD pipelines remain operational even after disruptions.
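For the secrets piece specifically, one hedged pattern is to replicate pipeline credentials into the DR region so a rebuilt pipeline can authenticate immediately; the secret name and region are placeholders, and your pipeline may use a different secret store.

```hcl
# Illustrative secret replication: the pipeline's deploy credentials are
# replicated into the DR region so a recovered pipeline can read them there.
# The secret name and region are placeholders.
resource "aws_secretsmanager_secret" "pipeline_deploy_key" {
  name = "pipeline/deploy-key"

  replica {
    region = "us-west-2" # example DR region
  }
}
```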
Some important things to keep in mind while testing
Disaster Recovery testing should be built into your CI/CD pipeline so it runs automatically, especially for new environments. Always test that new code can successfully build an environment from scratch, and after deploying new environments, verify that backup and recovery still work with them.
If you’re using immutable infrastructure patterns, validate your images to ensure they aren’t corrupted and their contents are intact. Also validate container orchestrator recovery processes and test deployments for when the primary container or artifact registries are unavailable.
Another important practice is measuring the effectiveness of each disaster recovery test by tracking the relevant metrics and keeping a record of every test:
Metrics to track:
- Recovery Time Objective (RTO): Time required to restore service
- Recovery Point Objective (RPO): Maximum acceptable data loss period
- Mean Time to Recovery (MTTR): Average time to restore service
- Test Success Rate: Percentage of DR tests that succeed on first attempt
- Coverage Percentage: Proportion of critical systems included in DR testing
Documentation requirements:
- Test Scenario: What disaster was simulated
- Test Procedure: Step-by-step actions taken
- Success Criteria: How success was determined
- Results: Actual outcomes including metrics
- Gaps Identified: Areas for improvement
- Action Items: Specific improvements with owners and timelines
The cost of Disaster Recovery and what readiness looks like
The biggest impediment to Disaster Recovery efforts is the expense tied to them. When all is said and done, DR is a small part of ITSM, and there are arguably more pressing areas that need your budget and attention before it becomes a priority. It gets even more expensive when you’re working with legacy hardware and software that take enormous effort to modernize. Many small companies don’t have a disaster recovery plan at all and operate on the principle of “we’ll deal with it when it happens”, an ad-hoc approach that can cost far more than you’d think.
What does DR readiness look like:
- Ad-hoc: Close to no backups or recovery plans. The idea is to deal with disasters when they happen. In some cases, the IT leader’s hands are tied because upper management doesn’t want to invest in DR and is likely unaware of the consequences.
- Reactive: Most likely, a backup and restore plan is in place where a static replica of the production environment has been created in a backup vault, but that’s as far as it goes. There’s no documentation or testing to account for different outage scenarios.
- Basic preparedness: A basic plan is in place, delivering passable RTO and RPO for the organization. Minor IT outages are easy to deal with at this stage, but the organization isn’t prepared for large-scale, multi-region failures caused by internal errors or attacks.
- Proactive: A strong pilot light or warm standby setup is deployed, delivering lower RTO and RPO with few business continuity concerns. Regular table-top drills are conducted and documented to test the readiness of the backup environment. This level of DR maturity is enough for most organizations.
- Resilient: Larger enterprises that can lose hundreds of thousands of dollars from an outage need strong failover strategies across multiple regions with frequent backups to keep everything current. RTO and RPO are minimal to none, and pipeline and data recovery tests are conducted regularly, backed by strong documentation.
It can be tempting not to focus on Disaster Recovery because it’s literally about planning for scenarios that may or may not happen, but it only takes one loose end for an entire infrastructure to come crumbling down. It might seem obvious, but your job as an IT leader is to prevent incidents, not just react to them; that’s what IT is about. If your C-suite doesn’t see this, it’s also your job to make them understand the repercussions of not having DR and backup plans. Don’t hold back from framing it in financial terms and the losses that potential outages would incur.
Disaster Recovery is a continuous process
The way to improve your Disaster Recovery plans and strategies is to get the C-suite and your bosses on board with why they’re necessary. Communicate your DR plans so there’s transparency about what you’re doing and why you’re doing it. The higher-ups don’t need to know every detail of your DR plan, but they should be aware of the highlights:
- Keep updated documentation on all DR tests that have been completed, what each test was for, and what further tests are planned.
- Have a criticality ranking and show how you’ve prioritized the company’s assets, backed by tested outcomes.
- Reassure them by showing that all important assets are backed up and ready for emergency failover.
- Give them a timeline of your DR plans, why they make sense, and what they will accomplish for the organization as a whole.
- Break down the risks of not implementing those DR plans and what could happen if certain outages aren’t recovered from quickly enough. Show them test data if necessary.
- Follow a framework so there’s a blueprint you can stick to and share it with the entire organization.
- Use the maturity model to show everyone where the company stands at the moment and what needs to be done to improve preparedness.

The 5 pillars of an effective Disaster Recovery strategy
As long as there’s transparency between you, your team, and your organization, simply focusing on implementing and gradually improving the DR strategy is the best way to go about it. Like everything else in IT, disaster recovery takes time. See what works best for your organization given its current state, what fits your budget, and what makes sense in the long run. Above all, DR relies heavily on testing: don’t deploy static backup and restore plans only to lose critical data when an outage actually happens. Progressive steps will bulletproof your organization against disasters, whether that’s something as small as a temporarily offline system or as large as a massive ransomware attack.