- Senior leadership is asking difficult questions about the organization’s dependency on third-party cloud services and the risk that poses.
- IT leaders have limited control over third-party incidents, including cloud service outages, yet they are on the hot seat when cloud services go down.
- While vendors have swooped in to provide resilience options for the more common SaaS solutions, that is not the case for all cloud services.
Our Advice
Critical Insight
- No control over the software does not mean no recovery options. Solutions range from designing an IT workaround using alternate technologies to pre-defined third-party service continuity options (e.g. see options for O365) to business workarounds.
- Even where there is limited control, you can at least define an incident response plan to streamline notification, assessment, and implementation of workarounds. Leadership wants more options than simply waiting for the service to come back online.
- At a minimum, IT’s responsibility is to identify and communicate risk to senior leadership. That starts with a vendor review to identify SLA issues and overall resilience gaps.
Impact and Result
- Follow a structured process to assess cloud resilience risk.
- Identify opportunities to mitigate risk – at the very least, ensure critical data is protected.
- Summarize cloud services risk, mitigation options, and incident response for senior leadership.
Mitigate the Risk of Cloud Downtime and Data Loss
Resilience and disaster recovery in an increasingly Cloudy and SaaSy world.
Analyst Perspective
If you think cloud means you don’t need a response plan, then get your resume ready.
Most organizations now recognize that they can’t ignore the risk of a cloud outage or data loss. The challenge is “what can I do about it?” since there is limited control. If you still think “it’s in the cloud, so I don’t need to worry about it,” then get your resume ready. When O365 goes down, your executives are calling IT, not Microsoft, to ask what is being done and what they can do in the meantime to get the business up and running again.
The key is to recognize what you can control and what actions you can take to evaluate and mitigate risk. At a minimum, you can ensure senior leadership is aware of the risk and define a plan for how you will respond to an incident, even if that is limited to monitoring and communicating status. Often you can do more, including defining IT workarounds, backing up your SaaS data for additional protection, and using business process workarounds to bridge the gap, as illustrated in the case studies in this blueprint.
Frank Trovato, Info-Tech Research Group
Use this blueprint to expand your DRP and BCP to account for cloud services
As more applications are migrated to cloud-based services, disaster recovery (DR) and business continuity plans (BCP) must include an understanding of cloud risks and actions to mitigate those risks. This includes evaluating vendor and service reliability and resilience, security measures, data protection capabilities, and technology and business workarounds if there is a cloud outage or incident.
Use the risk assessments and cloud service incident response plans developed through this blueprint to supplement your DRP and BCP as well as further inform your crisis management plans (e.g. account for cloud risks in your crisis communication planning).
Overall Business Continuity Plan
IT Disaster Recovery Plan | BCP for Each Business Unit | Crisis Management Plan |
---|---|---|
A plan to restore IT application and infrastructure services following a disruption. Info-Tech’s Disaster Recovery Planning blueprint provides a methodology for creating the IT DRP. Leverage this blueprint to validate and provide inputs for your IT DRP. | A set of plans to resume business processes for each business unit. Info-Tech’s Develop a Business Continuity Plan blueprint provides a methodology for creating business unit BCPs as part of an overall BCP for the organization. | A plan to manage a wide range of crises, from health and safety incidents to business disruptions to reputational damage. Info-Tech’s Implement Crisis Management Best Practices blueprint provides a framework for planning a response to any crisis, from health and safety incidents to reputational damage. |
Executive Summary
Your Challenge | Common Obstacles | Info-Tech’s Approach |
---|---|---|
Info-Tech Insight
Asking vendors about their DRP, BCP, and overall resilience has become commonplace. Expect your vendors to provide answers so you can assess risk. Furthermore, your vendor may have additional offerings to increase resilience or recommendations for third parties who can further assist your goals of improving cloud service resilience.
Key deliverable
Cloud Services Resilience Summary
Provide leadership with a summary of cloud risk, downtime workarounds implemented, and additional data protection.
Additional tools and templates in this blueprint
Cloud Services Incident Risk and Mitigation Review Tool | SaaS Incident Response Workflows |
---|---|
Use this tool to gather vendor input, evaluate vendor SLAs and overall resilience, and track your own risk mitigation efforts. | Use the examples in this document as a model to develop your own incident response workflows for cloud outages or data loss. |
This blueprint will step you through the following actions to evaluate and mitigate cloud services risk:
- Assess your cloud risk: Review your cloud services to determine potential impact of downtime/data loss, vendor SLA gaps, and vendor’s current resilience.
- Identify options to mitigate risk: Explore your cloud vendor’s resilience offerings, third-party solutions, DIY recovery options, and business workarounds.
- Create an incident response plan: Document your cloud risk mitigation strategy and incident response plan, which might include a failover strategy, data protection, and/or business continuity.
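As an illustration of the first action, the sketch below records a few facts per cloud service and flags gaps between what the business can tolerate and what the vendor commits to. It is a minimal sketch, not the blueprint's review tool: the `CloudService` fields, the `sla_gaps` function, and the sample values are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CloudService:
    """One row of a hypothetical cloud risk register."""
    name: str
    max_tolerable_downtime_hours: float   # defined by the business
    vendor_rto_hours: Optional[float]     # vendor's recovery commitment; None if the SLA is silent
    independent_backup: bool              # is critical data backed up outside the vendor?


def sla_gaps(service: CloudService) -> List[str]:
    """Return a list of plain-language gaps found for one service."""
    gaps = []
    if service.vendor_rto_hours is None:
        gaps.append("SLA states no recovery time commitment")
    elif service.vendor_rto_hours > service.max_tolerable_downtime_hours:
        gaps.append("vendor recovery commitment exceeds tolerable downtime")
    if not service.independent_backup:
        gaps.append("no independent backup of critical data")
    return gaps


# Illustrative values only.
crm = CloudService("CRM", max_tolerable_downtime_hours=4,
                   vendor_rto_hours=24, independent_backup=False)
for gap in sla_gaps(crm):
    print(f"{crm.name}: {gap}")
```

A list of gaps per service is one possible input to the leadership summary; the real assessment would also weigh the business impact of each service.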
[Figure: Cloud Risk Mitigation. Labels: Assess risk; Identify options to mitigate risk; Create an incident response plan.]
Phase 1: Assess your cloud risk
Phase 1 | Phase 2 | Phase 3 |
---|---|---|
Assess your cloud risk | Identify options to mitigate risk | Create an incident response plan |
Cloud does not guarantee uptime
Public cloud services (e.g. Azure, GCP, AWS) and popular SaaS solutions experience downtime every year.
A few cloud outage examples:
- Microsoft Azure AD outage, March 15, 2022: Many users could not log in to O365, Dynamics, or the Azure Portal. Cause: software change.
- Three AWS outages in December 2021: December 7 (Netflix and others impacted), December 15 (Duo, Zoom, Slack, others), December 20 (Slack, Epic Games, others). Cause: network issues, power outage.
- Salesforce outage, May 12, 2022: Users could not access the Lightning platform. Cause: expired certificate.
Cloud availability
- Migrating to cloud services can improve availability, as they typically offer more resilience than most organizations can afford to implement themselves.
- However, having multiple data centers, zones, and regions doesn’t prevent all outages, as we see every year with even the largest cloud vendors.
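To put availability figures in perspective, the following sketch converts an availability percentage into the downtime it still permits per year. The percentages are illustrative assumptions, not figures from any specific vendor SLA, and `allowed_downtime_hours` is a hypothetical helper name.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted in the period at the given availability."""
    return period_hours * (1 - availability_pct / 100)


# Illustrative availability levels only.
for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% availability allows about "
          f"{allowed_downtime_hours(pct):.2f} hours of downtime per year")
```

Even at 99.9% availability, a service can be down for roughly 8.76 hours per year and still meet its commitment, which is why a response plan remains necessary.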
DR challenges for IaaS, PaaS, and cloud-native
While there are limits to what you control, a traditional “failover” DR strategy can often apply.
High-level challenges and resilience options:
- IaaS: No control over the hardware, but you can fail over to another region. This is fairly similar to traditional DR.
- PaaS: No control over the software platform (e.g. SQL Server as a service), but you can back up your data and explore vendor options to replicate your environment.
- Cloud-native applications: As with PaaS, you can back up your data and explore vendor options to replicate your environment.
Plan for resilience
- Include DR requirements when designing your cloud service implementation. For example, for IaaS solutions, identify what data would need to be replicated and what services may need to be “always on” (e.g. database services where high availability is required).
- Similarly, for PaaS and cloud-native solutions, consult your vendor about options to build in resilience (e.g. the ability to fail over to another environment).
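The failover idea above can be sketched as a simple selection rule: try environments in priority order and use the first one that passes a health check. This is a minimal sketch under stated assumptions; the region names and the `is_healthy` callable are hypothetical, and a real implementation would depend on your provider's health-check and traffic-routing features.

```python
from typing import Callable, Optional, Sequence


def pick_active_region(regions: Sequence[str],
                       is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy region in priority order, or None if all are down."""
    for region in regions:
        if is_healthy(region):
            return region
    return None


# Stubbed health status for illustration; in practice this would be a real probe.
status = {"primary-region": False, "secondary-region": True}
active = pick_active_region(["primary-region", "secondary-region"],
                            lambda region: status[region])
print(f"Serve traffic from: {active}")
```

Returning `None` when every region is unhealthy makes the "no failover target" case explicit, which is the point at which business workarounds would take over.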
DR challenges for SaaS solutions
SaaS is the biggest challenge because you have no control over any part of the base application stack.
High-level challenges and resilience options:
- No control over the hardware (or the facility, maintenance processes, and so on).
- No control over the base application (control is limited to configuration settings and add-on customizations or integrations).
- Options to back up your data will depend on the service.
Note: The rest of this blueprint is focused primarily on SaaS resilience due to the challenges listed here. For other cloud services, leverage traditional DR strategies and vendor management to mitigate risk (as summarized above).
Focus on what you can control
- For SaaS solutions in particular, you must toss out traditional DR. If Salesforce has an outage, you won’t be involved in recovering the system.
- Instead, DR for SaaS needs to focus on improving resilience where you do have control and implementing business workarounds to bridge the gap.
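One way to make a SaaS incident response plan concrete is to write it as a timed checklist covering notification, assessment, and workarounds. The sketch below is illustrative only: the `ResponseStep` structure, the time thresholds, and the action wording are hypothetical and should be replaced with values agreed with the business.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResponseStep:
    after_minutes: int   # how long into the outage this step becomes due
    action: str


# Hypothetical plan; thresholds are placeholders, not recommendations.
PLAN = [
    ResponseStep(0, "Confirm the outage with the vendor and notify stakeholders"),
    ResponseStep(30, "Assess business impact and expected duration"),
    ResponseStep(120, "Activate the documented business workaround"),
]


def actions_due(plan: List[ResponseStep], minutes_elapsed: int) -> List[str]:
    """Return the actions that should have started by this point in the outage."""
    ordered = sorted(plan, key=lambda step: step.after_minutes)
    return [step.action for step in ordered if step.after_minutes <= minutes_elapsed]


for action in actions_due(PLAN, minutes_elapsed=45):
    print(action)
```

Keeping the plan as data rather than prose makes it easy to review with business units and to adjust the thresholds per service.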