Site reliability engineers (SREs) take proactive measures to improve app performance, decrease the number of defects found in production, and reduce the impact of production incidents. The role requires making trade-offs because improving operational performance often comes at exponentially increasing cost.
Devops organizations with SREs use two measurement tools to guide decisions: service-level objectives and error budgets. Service-level objectives (SLOs) benchmark application and business service performance and reliability. When apps and services miss these objectives, it taxes their error budgets and signals devops teams to shift their efforts from investing in features and business capabilities to addressing operational issues.
There are different types of SLOs, but they start by capturing error events and benchmarking them against an acceptable threshold. For example, a mobile app may capture application errors and interactions with poor response times and define an SLO targeting 99.9% error-free user events per rolling 24-hour period. When error events exceed that threshold, they count against the error budget, and devops teams typically prioritize the recommended remediations.
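To make the arithmetic concrete, here is a minimal sketch of how an error budget can be tracked against that kind of SLO. The numbers and function are illustrative and not tied to any particular monitoring product.

```python
# Rough sketch of the error-budget arithmetic for a 99.9% SLO over a
# rolling 24-hour window. Names and numbers are illustrative only.

SLO_TARGET = 0.999  # 99.9% of user events must be error-free

def error_budget_status(total_events: int, bad_events: int) -> dict:
    """Compare observed error events against the budget allowed by the SLO."""
    allowed_bad = total_events * (1 - SLO_TARGET)   # events the budget tolerates
    consumed = bad_events / allowed_bad if allowed_bad else float("inf")
    return {
        "allowed_bad_events": allowed_bad,
        "observed_bad_events": bad_events,
        "budget_consumed": consumed,          # 1.0 means the budget is exhausted
        "slo_breached": bad_events > allowed_bad,
    }

# Example: 2,000,000 user events in the last 24 hours, 1,500 with errors or
# poor response times. The budget allows roughly 2,000 bad events, so about
# 75% of the budget is consumed and the SLO still holds.
print(error_budget_status(2_000_000, 1_500))
```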
SLOs and error budgets are simple concepts, but measuring and managing to them require technology platforms and defined practices. Site reliability engineers need tools to capture and report on SLOs and manage error budgets, but they also need technologies that operate within the dev and ops life cycles to improve performance and reliability.
Here are some tools SREs should consider.
Use feature flags to isolate problems and reduce errors
“Houston, we have a problem,” and now the SRE’s challenge is to pinpoint the root cause. In some cases, they can remediate the issue, but when code changes are required, SREs need tools to circumvent the problem. A better option is to control the feature’s rollout so that problems can be identified faster and affect fewer users.
“I’m a big fan of feature flagging tools like LaunchDarkly and Optimizely, which allow companies to ship full-fledged features to fractional traffic,” says Marcus Merrell, vice president of technology strategy at Sauce Labs. “Feature flagging allows a limited subset of users to see the changes while the team can monitor for problems. Once it’s been in production and behaving well for a certain amount of time, you can roll the changes to the full audience.”
Feature flagging is a tool to minimize errors from defects that make it into production. Merrell says, “In the old days, you’d have to risk shutting down your whole software development life cycle if there was a problem, but with feature flagging, you code the safety net with the feature itself.”
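To illustrate the mechanics Merrell describes, here is a minimal sketch of a percentage-based rollout behind a flag. The FlagClient class is hypothetical; products such as LaunchDarkly and Optimizely provide their own SDKs, dashboards, and targeting rules.

```python
# Minimal sketch of how a feature flag gates a rollout. FlagClient is a
# hypothetical stand-in, not the API of any real feature-flagging product.
import hashlib

class FlagClient:
    def __init__(self, rollout_percent: float, kill_switch: bool = False):
        self.rollout_percent = rollout_percent  # fraction of users who see the feature
        self.kill_switch = kill_switch          # flip to True to pull the feature instantly

    def is_enabled(self, flag_key: str, user_id: str) -> bool:
        if self.kill_switch:
            return False
        # Hash the flag and user together so each user gets a stable decision.
        digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.rollout_percent

flags = FlagClient(rollout_percent=5)  # start by exposing 5% of traffic
if flags.is_enabled("new-checkout-flow", user_id="user-42"):
    print("serving the new checkout flow")
else:
    print("serving the existing checkout flow")
```

The kill switch is the safety net: if the new feature misbehaves in production, the flag is flipped and users fall back to the existing path without a code change or redeploy.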
Develop a strategy for observability, monitoring, and AIops
We know the saying, “If a tree falls in a forest and no one is around to hear it, does it make a sound?” If we apply this question to IT operations, it’s the network operations center’s (NOC) responsibility to hear the sound of an app going down or users experiencing poor performance. Are there monitoring systems to alert the NOC, and will its engineers have the knowledge and tools to fix the issue?
Unfortunately, outages are more like forest fires because dependencies between microservices, third-party software as a service, and applications can set off a barrage of alerts. On the other extreme, sometimes monitoring tools are like your web-connected doorbell that fires off alerts every time a bunny crosses the road.
Roni Avidov, R&D lead at Monday.com, says, “Like many fast-growing companies, we experienced alert fatigue and a growing number of false negatives, which impacted trust in our existing tools.”
Devops teams need a strategy to help connect alerts and relevant observability data into correlated and actionable incidents. This can be challenging for organizations developing microservices, running on multicloud architectures, and increasing the deployment frequency of mission-critical applications. At that scale, AIops platforms can help reduce incident resolution time and identify remediations to problem root causes.
Avidov shares Monday.com’s approach: “We use Sentry to support all the platforms in our stack, and it allows for easy correlation between alerts. We’ve reduced time to resolution by over 70%, client-side errors by 60%, and false alerts by 50%.”
Another example: Bungie, an American video game company owned by Sony Interactive Entertainment, used BigPanda to achieve a 99% compression ratio from 3,000 alerts to 35 correlated incidents.
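To show the idea behind that kind of compression, here is a toy sketch that groups raw alerts into incidents by service and time window. Real AIops platforms such as BigPanda correlate on far richer topology, change, and machine learning signals; this only demonstrates the underlying concept.

```python
# Toy illustration of alert correlation: collapse a flood of raw alerts into
# a handful of incidents by grouping on service and a correlation window.
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)

def correlate(alerts):
    """alerts: list of dicts with 'service' and 'timestamp' (datetime) keys."""
    incidents = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        key = None
        # Attach the alert to an open incident for the same service if one
        # fired within the window; otherwise open a new incident.
        for (service, started) in incidents:
            if service == alert["service"] and alert["timestamp"] - started <= WINDOW:
                key = (service, started)
                break
        if key is None:
            key = (alert["service"], alert["timestamp"])
        incidents[key].append(alert)
    return incidents

alerts = [
    {"service": "checkout", "timestamp": datetime(2023, 5, 1, 10, 0)},
    {"service": "checkout", "timestamp": datetime(2023, 5, 1, 10, 4)},
    {"service": "search",   "timestamp": datetime(2023, 5, 1, 10, 7)},
]
print(f"{len(alerts)} alerts -> {len(correlate(alerts))} incidents")
```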
Emily Arnott, community manager at Blameless, adds that capturing real-time data is critical to success. “SLOs and error budgets need to reflect the absolute latest incident data accurately,” she says. “If they don’t, they could be breached, and customers could be impacted before engineers notice. Automated tooling is the best way to keep your SLOs up to date consistently.”
Create SLO templates and dashboards to align business and devops
Site reliability engineers can use policies defined as SLOs, monitoring and AIops platforms, and error budgets to drive actions that improve service reliability and performance.
Zac Nickens, global reliability and observability engineering manager and “SLOgician” at OutSystems, recommends reviewing The SLO Development Lifecycle, an open source methodology that includes a handbook, worksheets, templates, and examples for adopting service-level objectives. “We use it for our team to run internal SLO discovery and design sessions using templates from the SLODLC website,” says Nickens.
Discovering and designing the SLOs is just the first step to forming a business and devops collaboration with site reliability. Nickens continues, “We publish these SLOs on our internal wiki and link to them from our SLO dashboard on Nobl9. The SLO design documents from SLODLC make it easy to share the business context on the why behind each metric and error budget we use to keep our platform running and reliable.”
Implement SLOs as code
Is there a better way to capture and leverage implementable SLOs? Bruno Kurtic, founding chief strategy officer of Sumo Logic, recommends reviewing OpenSLO, an open source project for defining SLOs as code. “OpenSLO consists of an API definition and a command-line tool (oslo) to validate and convert SLO definitions,” says Kurtic.
OpenSLO announced Version 1.0 of its specification earlier this year. Contributing companies include GitLab, Lightstep, Nobl9, Red Hat, Sumo Logic, and Tapico.io.
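As a rough sketch of what an SLO defined as code might look like, the snippet below writes out a YAML definition loosely modeled on the OpenSLO v1 specification. The field names approximate the published spec and should be verified against the project’s documentation before use; a complete definition would also include an indicator (SLI) section pointing at a metric source.

```python
# Sketch of an SLO-as-code definition, loosely modeled on the OpenSLO v1 spec.
# Field names are approximations; check the OpenSLO docs for the exact schema.
from pathlib import Path

slo_yaml = """\
apiVersion: openslo/v1
kind: SLO
metadata:
  name: checkout-availability
  displayName: Checkout availability
spec:
  description: 99.9% of checkout requests succeed over a rolling 28 days
  service: checkout
  budgetingMethod: Occurrences
  timeWindow:
    - duration: 28d
      isRolling: true
  objectives:
    - displayName: error-free-requests
      target: 0.999
"""

Path("checkout-availability.yaml").write_text(slo_yaml)
# Validate the file with the oslo command-line tool mentioned above; see the
# OpenSLO project's documentation for the exact validate command.
```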
It’s a strong sign that more companies are building open and interoperable tools to help site reliability engineers succeed at improving the performance and reliability of business services.