As the world has shifted online, the reliability of websites, cloud applications, and cloud infrastructure has become a critical business imperative—for everything from e-commerce operations to global banks to search engines.
The way we manage systems and their workloads has changed. Today, we seldom think in terms of precious, high-touch, high-performance servers, but instead rack upon rack of commodity servers pooled together through virtualization, with distributed software architecture preventing server outages from causing downtime. The focus has shifted from hardware to software-defined infrastructure and from inconsistent and error-prone manual processes to consistent, reliable, and repeatable automated tasks.
Site reliability engineering is the practice of maintaining that programmable infrastructure and maximizing the availability of the workloads that run on it. The site reliability engineer (SRE) job title originated in the halls of Google, which, at the turn of the millennium, wanted to redefine the relationship between software developers and operations staff – and help them work together to build sturdy, flexible systems, with constant improvement and automation as core principles.
What is an SRE?
At a base level, SREs bring software engineering principles to infrastructure and operations problems, with the north star goal of creating highly scalable and reliable systems.
“Fundamentally, it’s what happens when you ask a software engineer to design an operations function,” as Ben Treynor, VP of engineering at Google and the godfather of SRE, is oft-quoted as saying.
Chief among SRE responsibilities is establishing service-level thresholds, often expressed as service-level objectives (SLOs), which help inform whether or not a release gets greenlighted. The holy grail is the hallowed ‘five nines,’ or 99.999% uptime, which leaves room for only a little over five minutes of downtime a year. The better the uptime, the more rope developers get to launch cool new stuff and the more sleep SREs get, leading to a mutually beneficial relationship between the functions, a far cry from the old days of developer and operations antagonism.
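To make the arithmetic behind an uptime target concrete, here is a minimal Python sketch that converts an SLO into the downtime it permits over a given window. The function names and the release gate are hypothetical illustrations, not Google's tooling, and real SLOs are often defined over request success rates rather than wall-clock uptime.

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta) -> timedelta:
    """Return the downtime a service can accrue in `window` and still meet `slo`.

    `slo` is a fraction, e.g. 0.99999 for "five nines".
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return window * (1.0 - slo)


def release_allowed(slo: float, window: timedelta, downtime_so_far: timedelta) -> bool:
    """Hypothetical release gate: allow a launch only while downtime budget remains."""
    return downtime_so_far < allowed_downtime(slo, window)


if __name__ == "__main__":
    year = timedelta(days=365)
    for label, slo in [("three nines", 0.999), ("five nines", 0.99999)]:
        budget = allowed_downtime(slo, year)
        print(f"{label}: {budget.total_seconds() / 60:.2f} minutes of downtime per year")
```

Over a 365-day year, five nines works out to roughly 5.26 minutes of permitted downtime, against about 8.76 hours for three nines. That gap is why each extra nine is so much harder to earn.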
An SRE function will typically be measured across a set of key reliability areas, namely: system performance, availability, latency, efficiency, monitoring, capacity planning, and emergency response.
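Two of the items in that list, availability and latency, can be computed directly from request records. The sketch below is one illustrative way to do it; the `Request` record is a hypothetical shape, and the nearest-rank percentile method is an assumption, since monitoring tools differ in how they interpolate percentiles.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one served request."""
    latency_ms: float
    ok: bool


def availability(requests: list[Request]) -> float:
    """Fraction of requests that succeeded."""
    if not requests:
        raise ValueError("need at least one request")
    return sum(r.ok for r in requests) / len(requests)


def latency_percentile(requests: list[Request], pct: int) -> float:
    """Latency at the given percentile (1-100), using the nearest-rank method."""
    if not requests:
        raise ValueError("need at least one request")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


if __name__ == "__main__":
    # Ten synthetic requests with latencies of 10..100 ms and one failure.
    sample = [Request(latency_ms=10.0 * i, ok=(i != 7)) for i in range(1, 11)]
    print(f"availability: {availability(sample):.1%}")
    print(f"p50 latency: {latency_percentile(sample, 50)} ms")
    print(f"p99 latency: {latency_percentile(sample, 99)} ms")
```

Tail percentiles such as p99 are usually more informative than averages here, because a small share of slow requests can hide behind a healthy-looking mean.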
Key job responsibilities of an SRE
Any good SRE will be obsessed with one thing in particular: automation.
As Jason Qualman, an SRE at monitoring software vendor New Relic, states in a blog post: “A lot of this role is thinking about inefficient and time-consuming things people are doing and putting a stop to them as soon as possible. Instead of kicking a can down the road on manual work, you’re saying, ‘I’m going to take the time to automate this right now and stop anyone else from having to do this painful thing.’”
Another key element of the SRE role is something termed “release engineering,” which involves defining best practices to ensure software releases are consistent and repeatable.
“Release engineers have a solid (if not expert) understanding of source code management, compilers, build configuration languages, automated build tools, package managers, and installers. Their skill set includes deep knowledge of multiple domains: development, configuration management, test integration, system administration, and customer support,” wrote Dinah McNutt, technical program manager at Google, for the seminal book Site Reliability Engineering (published by O’Reilly in 2016 and edited by Googlers Jennifer Petoff, Niall Richard Murphy, Chris Jones, and Betsy Beyer).
Then there’s the response part of the role, which involves alerting, being on-call, and troubleshooting, along with emergency and incident response and postmortems.
Essentially, it’s important that SREs know how best to monitor systems and react when things go wrong, constantly writing and rewriting response playbooks to reduce the time it takes to fix any breakdown that may occur. At Google, this involves documenting an incident, understanding all contributing root causes, and implementing future preventive actions.
“Writing a postmortem is not punishment – it is a learning opportunity for the entire company,” write Googlers John Lunney and Sue Lueder in a contributed chapter of the Site Reliability Engineering book.
SREs vs. devops engineers
I know what you’re thinking. That all sounds a lot like devops, but when it comes to terminology, the SRE job title actually pre-dates devops engineer by about five years.
The two are grounded in similar principles, and the difference between them is subtle but important. Both ways of working involve breaking down the barriers between developers and operations staff, and both aim to increase the velocity of developer teams while maintaining core resiliency of those services.
The key difference is that devops engineers tend to focus on supporting continuous delivery and developer velocity, whereas SREs take responsibility for reliability and automation throughout the software lifecycle, with an emphasis on successfully deploying and monitoring releases and keeping software-defined infrastructure humming. The SRE has an integral function within the wider engineering team: ensuring there’s a specialist’s seat at the table focused on building stable systems.
As Jayne Groll at The Devops Institute puts it: “Devops focuses on engineering continuous delivery to the point of deployment; SRE focuses on engineering continuous operations at the point of customer consumption.”
The history of SRE at Google
Tracing SRE principles back to their origins at Google in the early 2000s provides a pivotal object lesson in the discipline.
“When I came to Google, I was fortunate enough to be part of a team that was partially composed of folks who were software engineers, and who were inclined to use software as a way of solving problems that had historically been solved by hand. So when it was time to create a formal team to do this operational work, it was natural to take the ‘everything can be treated as a software problem’ approach and run with it,” Ben Treynor stated in an interview on Google’s internal blog.
“So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor,” adds Treynor.
Google also thinks quite rigidly about how to put together an SRE team. All Google SREs must either be Google Software Engineers or “candidates who are very close to Google Software Engineering qualifications.” They must also have infrastructure management skills, most commonly “Unix system internals and networking (Layer 1 to Layer 3) expertise.”
SRE qualifications still tend to vary from company to company, but as far as basic principles go, the Google approach is a solid starting point. The details will depend on the business needs, established processes, and tech stack already adopted by the organization.
SRE job description and salary
SREs typically spend about 50 percent of their time performing traditional operations functions, such as being on call and jumping in to resolve issues. The other 50 percent is focused on developing software to make underlying systems more resilient, automated, and self-healing over time. That’s why the role requires a solid mix of software engineering chops and operations skills. A good SRE will be organized, cool under pressure, and a problem solver. SRE managers are responsible for team performance, strategy, and optimization.
But what about organizations where the SRE role doesn’t exist? In the O’Reilly report “What is SRE?” Kurt Andersen from LinkedIn and Craig Sebenik from Split (a release management software vendor) recommend taking a “grassroots” approach. They recommend finding “a development team that is motivated to change and implement a small SRE team (or individual) there. Over time, you can use that success as a positive example to other teams.”
The average annual salary for an SRE is roughly $130,000 in the U.S. and £76,000 in the U.K., according to job site Indeed.
SRE resources
Resources abound to build SRE skills, from certifications from the DevOps Institute to books and online resources from O’Reilly, Microsoft, and Google. The aforementioned 550-page behemoth Site Reliability Engineering, edited by Jennifer Petoff, Niall Richard Murphy, Chris Jones, and Betsy Beyer and published in 2016, is the go-to tome on the topic. The book is also available free online from Google.
Other more recent books on the topic include Training Site Reliability Engineers by Jennifer Petoff, JC van Winkel, and Preston Yoshioka; What Is SRE? by Kurt Andersen and Craig Sebenik; Seeking SRE by David N. Blank-Edelman; and The Site Reliability Workbook by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, and Stephen Thorne.
O’Reilly also has a comprehensive library of online assets, videos, and ebooks on the topic, handily curated in this SRE Essentials playlist by former Google site reliability engineer Liz Fong-Jones.
Online learning juggernaut Coursera offers several courses, including the popular Site Reliability Engineering: Measuring and Managing Reliability from Google Cloud Training. This course is also available from Pluralsight, as is the beginner course Site Reliability Engineering (SRE): The Big Picture by Elton Stoneman. The Linux Foundation offers a self-guided course titled DevOps and SRE Fundamentals: Implementing Continuous Delivery.
UK-based Jellyfish Training offers various two-day private training course options for SRE Foundation (SREF).