Take a Realistic Approach to Disaster Recovery Testing | Info-Tech Research Group

You have made significant investments in availability and disaster recovery – but your ability to recover hasn’t been tested in years. Testing will:

Improve your DR capabilities.
Identify required changes to planning documentation and procedures.
Validate DR capabilities for interested customers and auditors.

Our Advice

Critical Insight

If you treat testing as a pass/fail exercise, you aren’t meeting the end goal of improving organizational resilience.
Focus on identifying gaps and risks, and addressing them, before a real disaster hits.
Take a realistic, iterative approach to resilience testing that starts with small, low-risk tests and builds on lessons learned.

Impact and Result

Identify testing scenarios and scope that can deliver value to your organization.
Create practical test plans with Info-Tech’s template.
Demonstrate value from testing to gain buy-in for additional tests.

Take a Realistic Approach to Disaster Recovery Testing Research & Tools

1. Take a Realistic Approach to Disaster Recovery Testing Storyboard – A guide to establishing a right-sized approach to DR testing that delivers durable value to your organization.

Use this research to understand the different types of tests, prioritize and plan tests for your organization, review the results, and establish a cadence for testing.

2. Disaster Recovery Test Plan Template – A template to document your organization's DR test plan.

Use this template to document scope and goals, participants, key pre-test milestones, the test-day schedule, and your findings from the testing exercise.

3. Disaster Recovery Testing Program Summary – A template to outline your organization's DR testing program.

Identify the tests you will run over the next year and the expertise, governance, process, and funding required to support testing.

Take a Realistic Approach to Disaster Recovery Testing

Reduce costly downtime with a right-sized testing program that improves IT resilience.

Analyst Perspective

Most businesses make significant investments in disaster recovery and technology resilience. Redundant sites and systems, monitoring, intrusion prevention, backups, training, documentation: it all costs time and money.

But does this investment deliver expected value? Specifically, can you deliver service continuity in a way that meets business requirements?

You can’t know the answer without regularly testing recovery processes and systems. And more than just validation, testing helps you deliver service continuity by finding and addressing gaps in your plans and training your staff on recovery procedures.

Use the insights, tools, and templates in this research to create a streamlined and effective resilience testing program that helps validate recovery capabilities and enhance service reliability, availability, and continuity.

Andrew Sharp

Research Director, Infrastructure & Operations
Info-Tech Research Group

Executive Summary

Your Challenge

You have made significant investments in availability and disaster recovery (DR) – but your ability to recover hasn’t been tested in years. Testing will:

Improve your DR capabilities.
Identify required changes to planning documentation and procedures.
Validate DR capabilities for interested customers and auditors.

Common Obstacles

Despite the value testing can offer, running DR tests is difficult because:

Testing is often an IT-driven initiative, and it can be difficult to secure business buy-in to redirect resources away from other urgent projects or accept risks that come with testing.
Previous tests were overly complex and challenging to coordinate, and left a hangover so bad that no one wants to do them again.

Info-Tech's Approach

Take a realistic approach to resilience testing by starting with small, low-risk tests, then iterating with the lessons you’ve learned:

Identify testing scenarios and scope that can deliver value to your organization.
Create practical test plans with Info-Tech’s template.
Get buy-in for regular DR testing from key stakeholders with a testing program summary.

Info-Tech Insight

If you treat testing as a pass/fail exercise, you aren’t meeting the end goal of improving organizational resilience. Focus on identifying gaps and risks so you can address them before a real disaster hits.

Process and Outputs

This research is accompanied by templates to help you achieve your goals faster.

1 - Establish the business rationale for DR testing.
2 - Review a range of options for testing.
3 - Prioritize tests that are most valuable to your business.
4 - Create a disaster recovery test plan.
5 - Establish a Test Program to support a regular testing cycle.

Outputs:

DR Test Plan
DR Testing Program Summary

Orange activity slides provide directions to help you make key decisions.

Key Deliverable:

Disaster Recovery Test Plan Template

Build a plan for your first disaster recovery test.

This document provides a complete example you can use to quickly build your own plan, including goals, milestones, participants, the test-day schedule, and findings from the after-action review.

Why test?

Testing helps you avoid costly downtime

In a disaster scenario, speed matters. Immediately after an outage, the impact on the organization is small, but impact increases rapidly the longer the outage continues.
A quick and reliable response and recovery can protect the organization from significant losses.
A DRP testing and maintenance program helps ensure you’re ready to recover when you need to, rather than figuring it out as you go.

“Routine testing is vital to survive a disaster… that’s when muscle memory sets in. If you don’t test your DR plan it falls [in importance], and you never see how routine changes impact it.”

– Jennifer Goshorn
Chief Administrative Officer
Gunderson Dettmer LLP

Figure: Average estimated potential loss* in thousands of USD due to a 24-hour outage (N=41). Info-Tech members estimated that even one day of system downtime could lead to significant revenue losses, with core infrastructure showing the highest potential for lost revenue.

*Data aggregated from 41 business impact analyses (BIAs) conducted with Info-Tech advisory assistance. BIAs evaluate potential revenue loss due to a full day of system downtime, at the worst possible time.

Run tests to enhance disaster recovery plans

Testing improves organizational resilience

Identify and address gaps in your plans before a real disaster strikes.
Cross-train staff on systems recovery.
Go beyond testing technology to test recovery processes.
Establish a culture that centers resilience in everyday decision-making.

Testing keeps DR documentation ready for action

Update documentation ahead of tests to prepare for the testing exercise.
Update documentation after testing to incorporate any lessons learned.

Testing validates that investments in resilience deliver value

Confirm your organization can meet defined recovery time objectives (RTOs) and recovery point objectives (RPOs).
Provide proof of testing for auditors, prospective customers, and insurance applications.
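As a minimal sketch of what confirming RTOs and RPOs can look like after a test, the snippet below compares measured results against the agreed objectives and returns a list of gaps rather than a pass/fail verdict. The `RecoveryResult` fields, the `objective_gaps` helper, and the sample figures are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryResult:
    """Measured outcome of one recovery test for a single system (hypothetical structure)."""
    system: str
    rto: timedelta            # agreed recovery time objective
    rpo: timedelta            # agreed recovery point objective
    recovery_time: timedelta  # measured time from outage start to service restored
    data_loss: timedelta      # measured age of the most recent recoverable data


def objective_gaps(result: RecoveryResult) -> list:
    """Return gaps to address; an empty list means both objectives were met."""
    gaps = []
    if result.recovery_time > result.rto:
        gaps.append(
            f"{result.system}: recovery took {result.recovery_time}, RTO is {result.rto}"
        )
    if result.data_loss > result.rpo:
        gaps.append(
            f"{result.system}: data loss window was {result.data_loss}, RPO is {result.rpo}"
        )
    return gaps


# Illustrative figures only; real objectives come from your business impact analysis.
email = RecoveryResult(
    system="email",
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
    recovery_time=timedelta(hours=6),
    data_loss=timedelta(minutes=30),
)
for gap in objective_gaps(email):
    print(gap)
```

Returning gaps instead of a boolean keeps the focus on what to fix before a real disaster, and the same records can double as proof of testing.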

Overcome testing challenges

Despite the value of effective recovery testing, most IT organizations struggle to test recovery plans.

Common challenges

Key resources don’t have time for testing exercises.
You don’t have the technology to support live recovery testing.
Tests are done ad hoc and lessons learned are lost.
Business support for test exercises is lacking because the value isn’t understood.
Tests are always artificially simple because RTOs and RPOs must be met to satisfy customer or auditor inquiries.

Overcome challenges with a realistic approach:

Start small with tabletop and recovery tests for specific systems.
Include recovery tests in operational tasks (e.g. restore systems when you have a maintenance window).
Create testing plans for larger testing exercises.
Build on successful tests to streamline testing exercises in the future.
Don’t make testing a pass-fail exercise. Focus on identifying gaps and risks so you can address them before a real disaster hits.

Go beyond traditional testing

Different test techniques help validate recovery against different threats

There are many threats to service continuity, including ransomware, severe weather events, geopolitical conflict, legacy systems, staff turnover, and day-to-day outages caused by human error, software updates, hardware failures, or network outages.
At its core, disaster recovery planning is about recovery. A plan for service recovery will help you mitigate many threats at once. Different testing approaches, such as system failover tests, tabletop exercises, and ransomware recovery tests, will help you validate different aspects of that recovery process.
This research will provide an overview of these approaches and help you prioritize tests that are most valuable to your organization.

00 Identify a working group

30 minutes

Identify a group of participants who can fill the following roles and inform the discussions around testing in this research. A single person could fill multiple roles and some roles could be filled by multiple people. Many participants will be drawn from the larger DRP team.

Roles and expectations for Disaster Recovery Planning. DRP sponsor, Testing coordinator, System testers, business liaisons, executive team.

Input

Organizational context

Output

A list of key participants for test planning and execution

Participants

Typically, start by identifying the sponsor and coordinator and have them identify the other members of the working group.

Start by updating your disaster recovery plan (DRP)

Use Info-Tech’s Create a Right-Sized Disaster Recovery Plan research to identify recovery objectives based on business impact and outline recovery processes. Both are tremendously valuable inputs to your test plans.

Overall Business Continuity Plan

IT Disaster Recovery Plan

A plan to restore IT services (e.g. applications and infrastructure) following a disruption. A DRP:

Identifies critical applications and dependencies.
Defines appropriate recovery objectives based on a business impact analysis (BIA).
Creates a step-by-step incident response plan.

BCP for Each Business Unit

A set of plans to resume business processes for each business unit. A business continuity plan (BCP) is also sometimes called a continuity of operations plan (COOP).

BCPs are created and owned by each business unit, and creating a BCP requires deep involvement from the leadership of each business unit.

Info-Tech’s Develop a Business Continuity Plan blueprint provides a methodology for creating business unit BCPs as part of an overall BCP for the organization.

Crisis Management Plan

A plan to manage a wide range of crises, from health and safety incidents to business disruptions to reputational damage.

Info-Tech’s Implement Crisis Management Best Practices blueprint provides a framework for planning a response to any crisis, from health and safety incidents to reputational damage.

01 Confirm: why test at all?

15-30 minutes

Identify the value of recovery testing for your organization. Use language appropriate for a nontechnical audience. Start with the list below and add, modify, or delete bullet points to reflect your own organization.

Drivers for testing – Examples:

Improve service continuity.
Identify and address gaps in recovery plans before a real disaster strikes.
Cross-train staff on systems recovery to minimize single points of failure.
Identify how we coordinate across teams during a major systems outage.
Exercise both recovery processes and technology.
Support a culture that centers system resilience in everyday decision-making.
Keep recovery documentation up-to-date and ready for action.
Confirm that our stated recovery objectives can be met.
Provide proof of testing for auditors, prospective customers, and insurance applications.
We require proof of testing to pass audits and renew cybersecurity insurance.

Info-Tech Insight

Time-strapped technical staff will sometimes push back on planning and testing, objecting that the team will “figure it out” in a disaster. But the question isn’t whether recovery is possible – it’s whether the recovery aligns with business needs. If your plan is to “MacGyver” a solution on the fly, you can’t know if it’s the right solution for your organization.

Input

Business drivers and context for testing

Output

Specific goals that are driving testing

Participants

DR sponsor
Test coordinator

Think about what and how you test

Different layers of the stack to test: network, authentication, compute and storage, virtualization platforms, database services, middleware, app servers, web servers.

Find gaps and risks with tabletop testing

Tabletop planning had the greatest impact on meeting recovery objectives (RTOs/RPOs).

In a tabletop planning exercise, the team walks through a disaster scenario to outline the recovery workflow, and risks or gaps that could disrupt that workflow.

Tabletops are particularly effective because:

They let you play out a wider range of scenarios than technology-based testing (e.g. full-scale, parallel), which is constrained by cost and complexity.
They are non-intrusive, so they can be run more easily than other testing methods.
The exercise translates into recovery documentation: you create a workflow as you go.
A major site or service recovery scenario will review all aspects of the recovery process and create the backbone of your recovery plan.

02 Run a tabletop exercise

2 hours

Tabletop testing is part of our core DRP methodology, Create a Right-Sized Disaster Recovery Plan. This exercise can be run using cue cards, sticky notes, or on a whiteboard; many of our facilitators find building the workflow directly in flowchart software to be very effective.

Use our Recovery Workflow Template as a starting point.

Some tips for running your first tabletop exercise:

Do

Review the complete workflow from notification all the way to user acceptance testing.
Keep focused; stay on task and on time.
Revisit each step and record gaps and risks (and known solutions, but don’t dwell on this).
Revise and improve the plan with task owners.

Don't

Get weighed down by tools.
Try to find solutions to every gap/risk as you go. Save in-depth research/discussion for later.
Document the details right away – stick to the high-level plan for the first exercise.

Ahead of the exercise, decide on a scenario, identify participants, and book a meeting time.
- For your first walkthrough of a DR scenario, we often recommend a scenario that considers a site failure requiring failover to a DR site.
- For the first exercise, focus on technical aspects of recovery before bringing in members of the business. The technical team may need space to discuss the appropriate steps in the recovery process before you bring in business liaisons to discuss user acceptance testing (UAT).
- A complete failover considers all systems, the viability of your second site, and can help identify parts of the process that require additional exercises.
Review the scenario with participants. Then, discuss and document the recovery process, starting with initial notification of an event.
- Record steps in the process on white cards or boxes.
- On yellow and red cards, document gaps and risks in people, process, and technology requirements.
Once you’ve walked through the process, return to the start.
- Record the time required to complete each step. Consider identifying who is responsible for key steps. Identify any additional gaps and risks.
Clean up and record the results of the workflow. Save a copy with your DRP documentation.
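Once the workflow is captured, the recorded step times can be totaled and compared with the recovery time objective. The sketch below is one way to do that, assuming steps run one after another (parallel work would shorten the total). The `Step` structure, the `summarize` helper, and the sample workflow are hypothetical and not part of the Recovery Workflow Template.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step in the recovery workflow recorded during the tabletop (hypothetical structure)."""
    name: str
    owner: str
    minutes: int                               # estimated time to complete the step
    gaps: list = field(default_factory=list)   # notes from the yellow and red cards


def summarize(steps, rto_minutes):
    """Total the step estimates, assuming sequential execution, and collect gaps."""
    total = sum(step.minutes for step in steps)
    return {
        "total_minutes": total,
        "within_rto": total <= rto_minutes,
        "gaps": [(step.name, gap) for step in steps for gap in step.gaps],
    }


# Illustrative site-failover workflow; names, owners, and times are made up.
workflow = [
    Step("Notify on-call and assess impact", "Service desk", 30),
    Step("Declare disaster and assemble team", "DR sponsor", 45,
         gaps=["Call tree is out of date"]),
    Step("Fail over core infrastructure to DR site", "Infrastructure lead", 120,
         gaps=["Firewall rules at DR site not documented"]),
    Step("Restore application tier", "Application lead", 90),
    Step("User acceptance testing", "Business liaison", 60),
]

summary = summarize(workflow, rto_minutes=240)
print(summary["total_minutes"], summary["within_rto"])
for step_name, gap in summary["gaps"]:
    print(f"{step_name}: {gap}")
```

Keeping owners and gaps alongside the timings means the same record can feed the follow-up work with task owners.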

Input

Expert knowledge on systems recovery

Output

Recovery workflow, including gaps and risks

Participants

Test coordinator
Technical SMEs

Move from tabletop testing to functional exercises

See how your plans fare in the real world

In live exercises, some portion of your recovery plan is executed in a way that mimics a real recovery scenario. Some advantages of live testing:

See how standby systems behave. A tabletop exercise can miss small issues that can make or break the recovery process. For example, connectivity or integration issues on a new subnet might be difficult to predict prior to actually running services in that environment.
Hands-on practice: Familiarize the team with the steps, commands, and interfaces of your recovery toolset.
Manage the pressure of the DR scenario: Nothing’s quite like the real thing, but a live exercise may be the closest your team can get to a disaster situation without experiencing it firsthand.

Examples of live exercises

Boot and smoke test	Turn on a standby system and confirm it boots up correctly.
Restore and validate data	Restore data or servers from backup. Confirm data integrity.
Parallel testing	Send familiar transactions to production and standby systems. Confirm both systems produce the same result.
Failover systems	Shut down the production system and use the standby system in production.
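To make the parallel testing row concrete, here is a minimal sketch that sends the same transactions to two systems and reports any differences. The `parallel_test` helper is hypothetical, and the two functions standing in for production and standby are stubs; in practice each would call the real system's interface.

```python
def parallel_test(transactions, production, standby):
    """Run each transaction against both systems and collect any mismatched results."""
    mismatches = []
    for txn in transactions:
        expected = production(txn)
        actual = standby(txn)
        if expected != actual:
            mismatches.append(
                {"transaction": txn, "production": expected, "standby": actual}
            )
    return mismatches


# Stub systems for illustration only (amounts are in cents to avoid rounding issues).
def production_total(order):
    return order["quantity"] * order["unit_price_cents"]


def standby_total(order):
    # Simulates a standby system holding a stale price for one product.
    price = 1200 if order["sku"] == "B-200" else order["unit_price_cents"]
    return order["quantity"] * price


orders = [
    {"sku": "A-100", "quantity": 2, "unit_price_cents": 500},
    {"sku": "B-200", "quantity": 3, "unit_price_cents": 1500},
]

for mismatch in parallel_test(orders, production_total, standby_total):
    print(mismatch)
```

Using familiar transactions with known results makes a mismatch easy to interpret: here it would point to reference data on the standby system that has fallen out of sync.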

Run local tests ahead of releases

Think small

Most unacceptable downtime is caused by localized issues, such as hardware or software failures, rather than widespread destructive events. Regular local testing can help validate the recovery plan for local issues and improve overall service continuity.

Make local testing a standard step in maintenance work and new deployments to embed resilience considerations in day-to-day activities. Run the same tests in both your primary and your DR environment.

Some examples of localized tests:

Review backup logs and check for errors.
Restore files or whole systems from backup.
Run application-based tests as part of release management, including unit, regression, and performance tests.
- Ensure application tests are run for both the primary and DR environment.
- For a deep-dive on application testing, see Info-Tech’s research Automate Testing to Get More Done.
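A restore test is more useful when the restored files are checked, not just counted. The sketch below compares SHA-256 checksums between a reference directory and a restored copy using only the Python standard library. The `compare_restore` helper and the directory layout are assumptions; if the source data changes after the backup is taken, compare against checksums captured at backup time instead.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_restore(source_dir: Path, restored_dir: Path) -> dict:
    """Report files that are missing from the restore or whose contents differ."""
    missing, changed = [], []
    for source_file in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = source_file.relative_to(source_dir)
        restored_file = restored_dir / relative
        if not restored_file.is_file():
            missing.append(relative.as_posix())
        elif sha256_of(source_file) != sha256_of(restored_file):
            changed.append(relative.as_posix())
    return {"missing": missing, "changed": changed}
```

Calling `compare_restore(Path("/data/reports"), Path("/restore-test/reports"))` (both paths hypothetical) returns two lists; anything in either list is a gap to record with the test results.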

Info-Tech Insight

Local tests will vary between different services, and local test design is usually best left to the system SMEs. At the same time, centralize reporting to understand where tests are being done.

Investigate whether your IT Service Management or ticketing system can create recurring tasks or work orders to schedule, document, and track test exercises. Tasks can be pre-populated with checklists and documentation to support the test and provide a record of completed tests to support oversight and reporting.

Have the business validate recovery

If your business doesn’t think a system’s recovered, it’s not recovered.

User acceptance testing (UAT) after system recovery is a key step in the recovery process. Like any step in the process, there’s value in testing it before it actually needs to be done. Assign responsibility for building UATs to the person who will be responsible for executing them.

An acceptance test script might look something like the checklist below.

Does the application open?
Does the interface look right?
Do you see any unusual notifications or warnings?
Can you conduct a key transaction with dummy data?
Can you run key reports?
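If you want to track acceptance results in a structured way, the checklist above could be captured as data. This is only a sketch: the `Check` structure and helper names are hypothetical, and each check records whether the tester was satisfied rather than a literal yes or no, since a "yes" to the warnings question is a problem.

```python
from dataclasses import dataclass
from typing import Optional

UAT_QUESTIONS = [
    "Does the application open?",
    "Does the interface look right?",
    "Do you see any unusual notifications or warnings?",
    "Can you conduct a key transaction with dummy data?",
    "Can you run key reports?",
]


@dataclass
class Check:
    question: str
    passed: Optional[bool] = None  # None until the tester records a result
    notes: str = ""


def new_script(questions=UAT_QUESTIONS):
    """Create a fresh acceptance script with every check still open."""
    return [Check(question) for question in questions]


def open_items(script):
    """Checks that failed or were never run; the system is not accepted until this is empty."""
    return [check for check in script if check.passed is not True]


# Illustrative run: everything passes except the reports check.
script = new_script()
for check in script:
    check.passed = True
script[4].passed = False
script[4].notes = "Month-end report times out"

for check in open_items(script):
    print(f"OPEN: {check.question} {check.notes}")
```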

“I cannot stress how important it is to assign ownership of responsibilities in a test; this is the only way to truly mitigate against issues in a test.”

– Robert Nardella
IT Service Management
Certified z/OS Mainframe Professional

Info-Tech Insight

Build test scripts and test transactions ahead of time to minimize the amount of new work required during a recovery scenario.

Beyond the Basics: Full Failover Testing

A failover test – a full failover of your production environment to a secondary environment – is what many IT and businesspeople think about when they think of disaster recovery testing.
A full test can validate previous local or tabletop tests, identify additional gaps and risks, and provide hands-on training experience with recovery processes and technologies.
Setting a date for failover testing can also inject some urgency into otherwise low-priority (but high-importance) disaster recovery planning and documentation exercises, which need to be completed prior to the test.
Despite these benefits, full failover tests carry significant risk and require a great deal of effort and cost. Typically, only businesses that already have an active-active environment capable of supporting in-scope production systems are able to run a full environment failover.
This is especially true the first time you test. While in theory a DR plan should be ready to go at any time, there will be documents to update, gaps to address, and risks to mitigate before you go ahead with the test.

Full Failover Testing

What you get:

Provide hands-on experience with recovery processes and technology.
Confirm that site failover works in practice as you assumed in tabletop or local testing exercises.
Identify critical gaps you might have missed without a full failover test.

What you need:

An active-active secondary site, with sufficient standby equipment, data, and licensed standby software to support production.
A completed tabletop exercise and documented recovery workflow.
A documented test plan, backout plan, and formal sign-off.
An off-hours downtime window.
Time from technical SMEs and business resources, both for creating the plan and executing the test.

Beyond the Basics: Site Reliability Engineering

Site reliability engineering (SRE) is an application of skills and approaches from software engineering to improve system resilience.
SRE is focused on “availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning” across a set portfolio of services (Sloss, 2017).
In many organizations, SRE is implemented as a team that supports separate applications teams.
Applications must have defined and granular resilience requirements, translated into service objectives. The SRE team and applications teams will work together to meet these objectives.
Site reliability engineers (the people who do SRE, themselves often abbreviated as SREs) are expected to build solutions and processes that keep services stable and performant, not just respond when they fail. For example, Google allows its SREs to spend just half their time on incident response, with the rest of their time focused on development and automation tasks.
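One common way SRE teams translate a resilience requirement into a service objective is an availability target with a matching downtime allowance, often called an error budget (a term from SRE practice, not from this research). The sketch below shows the arithmetic; the function names, the 30-day period, and the 99.9% target are illustrative assumptions.

```python
def allowed_downtime_minutes(availability_target: float, period_days: int = 30) -> float:
    """Downtime permitted by an availability objective over the period."""
    if not 0 < availability_target <= 1:
        raise ValueError("availability_target must be a fraction in (0, 1], e.g. 0.999")
    return (1 - availability_target) * period_days * 24 * 60


def budget_remaining(availability_target: float, downtime_minutes: float,
                     period_days: int = 30) -> float:
    """Minutes of downtime still available before the objective is missed."""
    return allowed_downtime_minutes(availability_target, period_days) - downtime_minutes


# Illustrative: a 99.9% objective over 30 days, with 30 minutes of downtime so far.
budget = allowed_downtime_minutes(0.999)
remaining = budget_remaining(0.999, downtime_minutes=30)
print(f"Budget: {budget:.1f} min, remaining: {remaining:.1f} min")
```

A 99.9% target over 30 days allows roughly 43 minutes of downtime, which gives the SRE and application teams a shared, measurable objective to work toward.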

Site Reliability Testing

What you get:

Improved reliability and reduced frequency and impact of downtime.
Increased use of automation to address problems before they cause an incident.
Granular resilience objectives.

What you need:

Systems running on software-defined infrastructure.
Specialized skills in programming and infrastructure-as-code.
Business & product owners able to define and fund acceptable and appropriate resilience objectives.
Technical experts able to translate product requirements into technical design requirements.


Authors

Frank Trovato

Andrew Sharp

Contributors

Bernard A. Jones, Business Continuity & Disaster Recovery Expert
Robert Nardella, IT Service Management, Certified z/OS Mainframe Professional
Larry Liss, Chief Technology Officer, Blank Rome LLP
Jennifer Goshorn, Chief Administrative and Chief Compliance Officer, Gunderson Dettmer LLP
Paul Kirvan, FBCI, CISA, Independent IT Consultant/Auditor
Steve Tower, Principal Consultant, Prompta Consulting Group
Joe Starzyk, Senior Business Development Executive, IBM Global Services
Paul S. Randal, CEO & Owner, SQLskills.com
Tom Baumgartner, Business Continuity/Disaster Recovery Analyst, Catholic Health

Search Code: 76225
Last Revised: April 10, 2023
