SLOs in Site Reliability Engineering
Thor, the Norse God of Thunder, tells Jane Foster, the woman he’s trying to impress, that on his home world of Asgard, the realm eternal, science and magic are two sides of the same coin. Had Jane been part of the operations team at Google (or another mature online service provider), she would have immediately realized we have a similar technology right here on good old Earth. We call the science site reliability engineering (SRE), and service level objectives (SLOs) are the magic behind it.
SRE is a powerful concept for organizations that are serious about keeping their customers happy. It is therefore important for them to develop well-thought-out SLOs and make certain that management is intellectually equipped to derive valuable business perspectives from them.
Old habits die hard. The things we take as the default in our lives never really leave, like those pesky dandelions in our backyards. You can attempt to pull them out by the roots, and for a bit you might even feel you were successful, but they are always lingering below the surface, itching to pop up again and coerce us back into business as usual.
The traditional rift between development and operations teams has a similar dynamic. It’s an old habit that both functions have bravely attempted to root out, but the lack of a shared vocabulary and shared assumptions is a challenge that is hard to overcome. To understand why this happens, one must embark on a shamanic out-of-body experience and look down upon the motivations that drive both functions. Development teams like to build new features, increase release velocity, and wow customers with bells and whistles. Operations, on the other hand, has a mandate to keep services stable, keep the lights on, and remove any risk that new features might introduce into a system. Individually, both have the right end goals, but the hurdles they put in front of each other stand in the way of those goals. This is exactly where the SRE framework can step in and, like an astute relationship counselor, remove the structural conflicts between the two.
SRE and the accompanying SLOs it inspires can align teams to focus on what matters. An SLO is a precise numerical target for a system’s reliability, conceived through a marriage of ideas between product and SRE teams. This way, teams have a personal stake in the success of the business, and the SLOs become the shared reliability goal for development, operations, and product teams.
SLOs: A deeper look
SLOs are numerical thresholds for system availability and reliability. They remove the ambiguity that surrounds the concept of reliability and give engineering and business teams a clear vision of what will keep customers happy and engaged. The data gathered to measure whether SLOs are being met forms the basic framework for important conversations about service design, architecture, testing, deployment, and feature releases.
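To make "numerical threshold" concrete, here is a minimal Python sketch that compares a measured availability figure against an SLO target. The request counts, the 99.5% target, and the function names are illustrative assumptions, not figures from any real service.

```python
def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully over the measurement window."""
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as fully available
    return good_requests / total_requests


def meets_slo(good_requests: int, total_requests: int, target: float) -> bool:
    """True when measured availability is at or above the SLO target."""
    return availability(good_requests, total_requests) >= target


# Hypothetical 30-day window: 1,000,000 requests, 4,200 of which failed.
measured = availability(995_800, 1_000_000)
print(f"measured availability: {measured:.2%}")
print("SLO met:", meets_slo(995_800, 1_000_000, target=0.995))
```

The point of the sketch is that both the measurement and the target are plain numbers, so whether the objective was met is a comparison rather than a debate.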
In most cases, SLOs must answer three important questions:
- We want to prioritize reliability vs. features. How do we do that without compromising one over the other?
- What can we do to keep releasing new features at the risk of breaking the system but not let that diminish user experience?
- We must manage operational work versus project/product work. What’s the most efficient way of achieving the right balance?
We want to prioritize reliability vs. features. How do we do that without compromising one over the other?
The precise numerical nature of SLOs removes any ambiguity about system reliability. Because SLOs are the common language between teams and reflect a shared understanding across the organization, the development, business, and operations teams all know what they are trying to achieve. Suddenly, system reliability becomes a feature and, like all features, is “deployed” every time features are put into production.
What can we do to keep releasing new features at the risk of breaking the system but not let that diminish user experience?
SLOs can surface the reliability cost of new features before they are released. If a new feature has an undesirable effect on reliability, teams can remove it from the release set and spend time making it production ready. The opposite outcome is also possible: a new feature may not affect reliability enough for deploying it to be considered risky.
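One common way to put a number on "reliability cost" is an error budget: the failures an SLO still permits in the current window. The term does not appear above, and the figures, function names, and ship rule in this Python sketch are assumptions for illustration only.

```python
def allowed_failures(target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates in the window."""
    return (1.0 - target) * total_requests


def budget_remaining(target: float, total_requests: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = allowed_failures(target, total_requests)
    if budget == 0:
        # A 100% target leaves no budget at all.
        return 0.0 if failed == 0 else float("-inf")
    return 1.0 - failed / budget


def can_ship(target: float, total_requests: int, failed: int,
             expected_extra_failures: int) -> bool:
    """Hypothetical rule: ship only if the feature's expected failures fit the budget."""
    return failed + expected_extra_failures <= allowed_failures(target, total_requests)


# Hypothetical window: 99.5% target, 1,000,000 requests, 4,200 failures so far.
print(f"budget remaining: {budget_remaining(0.995, 1_000_000, 4_200):.0%}")
print("ship feature A (about 500 extra failures):", can_ship(0.995, 1_000_000, 4_200, 500))
print("ship feature B (about 1,000 extra failures):", can_ship(0.995, 1_000_000, 4_200, 1_000))
```

Under these assumed numbers, feature A fits within the remaining budget and feature B does not, which mirrors the two outcomes described above.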
We must manage operational work versus project/product work. What’s the most efficient way of achieving the right balance?
With SLOs, teams can gather data on the items that may be causing operational overload (such as outage management and incident response). Having this data and measuring it against SLOs makes it easier for teams to develop a data-driven framework for choosing between operational and project/product work.
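A data-driven choice of this kind could be sketched as follows. The work log, the category names, and the 50% cap on operational work are all hypothetical values chosen for illustration; a real team would substitute its own tracking data and thresholds.

```python
# Hypothetical work log for one sprint: (category, hours spent).
work_log = [
    ("incident_response", 14.0),
    ("outage_management", 9.0),
    ("feature_work", 40.0),
    ("incident_response", 6.0),
    ("reliability_project", 11.0),
]

# Assumption: these two categories count as operational work.
OPERATIONAL = {"incident_response", "outage_management"}


def operational_share(log) -> float:
    """Fraction of logged hours spent on operational work."""
    total = sum(hours for _, hours in log)
    ops = sum(hours for category, hours in log if category in OPERATIONAL)
    return ops / total if total else 0.0


def next_focus(log, slo_met: bool, ops_cap: float = 0.5) -> str:
    """Favor operational work when the SLO is missed or operational load exceeds the cap."""
    if not slo_met or operational_share(log) > ops_cap:
        return "operational"
    return "project"


print(f"operational share: {operational_share(work_log):.1%}")
print("next focus:", next_focus(work_log, slo_met=True))
```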
How can I develop meaningful SLOs?
SLOs may differ from one organization to the next, even among similar businesses, but useful SLOs are:
- User-centric: Keep customers happy and engaged, and revenue should follow.
- Reasonably challenging: Setting 100% reliability targets is a silly idea. No system with any level of complexity, however well developed, can always be humming along without a glitch. Things will go awry occasionally, and it’s important to set thresholds that account for systemic errors that will inevitably occur. Yet, avoid setting the bar so low that even the worst kind of production snafu still comes out looking like a win.
- Simple and specific: There is no need to make SLOs complicated. “Achieve 97% uptime, every time” is a simple, specific, and powerful SLO. There is no ambiguity and therefore a reduced risk of misunderstanding between teams.
- Collaborative conceptualization: All teams who have skin in the game must have their say in the SLO. Period.
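To see what a target such as the 97% uptime example permits in practice, the short Python sketch below converts an uptime percentage into allowed downtime. The measurement windows (7, 30, and 365 days) are assumptions, since the example above does not state one.

```python
def allowed_downtime_hours(uptime_target: float, window_days: int) -> float:
    """Hours of downtime an uptime target tolerates over a window of the given length."""
    return (1.0 - uptime_target) * window_days * 24


# Assumed measurement windows; the 97% figure comes from the example above.
for days in (7, 30, 365):
    hours = allowed_downtime_hours(0.97, days)
    print(f"97% uptime over {days} days allows about {hours:.1f} hours of downtime")
```

Over an assumed 30-day window, for instance, 97% uptime leaves room for roughly 21.6 hours of downtime, a figure every team can check against the same arithmetic.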
Our Take
Site reliability engineering is going to take center stage in a world where a significant proportion of the human population works from home, orders lunch from one of the dozen or so food delivery services, buys its weekly groceries from Amazon and others like it, and streams its entertainment from Netflix or similar services.
The volume of online activity is going to break through the ceiling set in previous years, and online services will be bombarded by users. For these services to stay afloat and keep customers coming back, they NEED to be reliable. Thinking of these traditionally non-functional needs as features is table stakes for organizations that intend to be top of mind with customers. Ignore this eventuality at your own peril.
Want to know more?
Site Reliability Engineering: What Is It? Why Is It Important for Online Businesses?