Back in the days before cloud applications, devops practices, test automation, and site reliability engineers, we had developers, testers, and system administrators developing and supporting Web and mobile applications. Developers followed agile methodologies, whereas system administrators often adopted ITIL’s incident management and other practices.
We had fewer tools to automate testing, deployment, and infrastructure in those days, so there was much toil going from code-complete to production-ready. Monitoring production infrastructure and applications and discovering root causes of production issues required both craft and skill because operational data, monitoring tools, and support workflows did not easily integrate.
In many ways, developing, testing, and supporting applications is somewhat easier today, but the terminology, role definitions, and practical responsibilities are much harder to decipher and apply. Is site reliability engineering part of devops or a complementary service? Who is responsible for implementing CI/CD (continuous integration/continuous delivery) pipelines and infrastructure as code? When there is a production incident, what’s the most efficient process to resolve the issue, discover the root cause, and implement the optimal remediations?
Google’s culture and practices may not work at your organization
The answers cannot be universally applied because of differences in company size, scale, and complexity. What works for a startup with a few dozen engineers does not work for geographically dispersed enterprises operating in regulated industries. Similarly, the culture, practices, and technologies that work well for large-scale technology companies such as Google, Netflix, or Microsoft are often not achievable in other industries or businesses with more legacy systems and technical debt.
To reach a common understanding, I suggest defining terminology around five dimensions:
- Mission, culture, collaboration, and mindset espoused by the organization
- Business goals, customer definition, value propositions, and success KPIs (key performance indicators)
- Technical work to develop, test, deploy, and monitor applications
- Workflow and practices used by teams to plan, deliver, and support applications
- Tools and automation needed to do this work successfully
This may look simple, but in practice, the way organizations operate is messier and the practice boundaries are less straightforward. For example, I define agile methodologies as a workflow practice, but how an organization defines its agile practice depends on the business objectives, technical architecture, software development lifecycle, and the devops automation that’s in place.
The same is true for the intersection of devops practices and culture with the work and responsibilities of site reliability engineers.
When it comes to devops and SRE (site reliability engineering), there are many, many opinions. Here is a sample:
- Google defines several devops objectives including reducing organizational silos, accepting failure as normal, implementing gradual change, using tools and automation, and measuring everything. Google then suggests several SRE operating models ranging from “everything SRE” where they have a comprehensive charter, to one where they serve as consultants to development teams and are less likely to make code modifications.
- The Association for Computing Machinery defines a more operations-centric charter for the SRE role and identifies monitoring, metrics, emergency response, capacity planning, service management, change management, and performance as SRE core functions.
- Another article tries to articulate the differences between devops and SRE, suggesting that “Devops generally focuses on the what whereas SRE focuses on the how,” which I agree with, but also advises that “SRE is well-suited for enterprises and organizations that want to manage large-scale applications.”
I would challenge the suggestion that SRE is mainly for large organizations. I believe that organizations of all sizes that build and support customized applications, data integrations, data science experimentation, and machine learning models need SREs to complement the development responsibilities. You can see this more clearly in the seven responsibilities of SREs that include managing by service-level objectives, working to minimize toil, and sharing ownership with developers. Who doesn’t need these responsibilities owned and addressed?
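To make "managing by service-level objectives" concrete, here is a minimal sketch of an error budget calculation, which is one common way teams of any size turn an SLO into a number they can track. The function name `error_budget`, its parameters, and the request counts are hypothetical and chosen only for illustration; they do not come from any particular tool or from the sources cited above.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize how much of an availability error budget has been used.

    slo_target is the fraction of requests that should succeed (0.999 = 99.9%).
    All names and inputs here are illustrative assumptions.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests < 0 or failed_requests < 0 or failed_requests > total_requests:
        raise ValueError("request counts are inconsistent")

    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    remaining_failures = allowed_failures - failed_requests
    consumed_fraction = (
        failed_requests / allowed_failures if allowed_failures > 0 else 0.0
    )

    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining_failures,
        "consumed_fraction": consumed_fraction,
        "within_budget": remaining_failures >= 0,
    }


if __name__ == "__main__":
    # Hypothetical 30-day window: 1,000,000 requests, 250 of them failed.
    print(error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250))
```

With a 99.9% target over one million requests, roughly 1,000 failures are tolerable, so 250 failures would consume about a quarter of the budget. A team might use the remaining budget to decide whether to keep shipping features or to shift effort toward reliability work.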
Shifting focus from system administration to site reliability
In my book, Driving Digital, I describe the SRE role as “agile operations.” I state an oversimplified, often unrealistic goal that “ideally, developers ought to be spending 70–80% of their time working with users on new capabilities and ideally only 20–30% on support.” Support is anything nonfunctional, from automating CI/CD pipelines to implementing infrastructure as code and automating application monitoring.
These functions are essential, but they are merely the scaffolding that makes development efficient, consistent, and reliable. The automation requires coding, but I’d rather have engineers skilled with devops platforms such as Jenkins, CircleCI, Azure DevOps, Puppet, Chef, Ansible, and many others implement it.
But ensuring reliable, high-performance, scalable, and secure applications in development, test, and production is also a critical responsibility. More of this work is shifting left so that these concerns get identified and addressed during development and testing phases. Ensuring reliability is a full team responsibility, but I’d prefer that others on the team learn the disciplines and tools, and take charge of the implementation.
Ten years ago, the industry called that responsibility systems administration or systems engineering. Today cloud infrastructure, tool provisioning, and automation have enabled more efficient infrastructure patterns and reduced systems work. Similarly, there’s a lot less administration because automation and controls such as change management have reduced some of the toil.
Developers and SREs must collaborate on application reliability
Today, the responsibility formerly named “systems administration” or “systems engineering” is better called “site reliability engineering.” The people taking on this role need to understand performance from the infrastructure up to the end-user experience. They are the front line of defense when incidents occur, and they need the leadership and technical skills to identify the root cause and remediate the issue.
Those are some of the reactive responsibilities of SRE, but the more critical skills and tasks are proactive functions. What those responsibilities and tools look like depends on the nature of the application. They might include automating penetration tests, performing code reviews, running performance tests, and refactoring code that has nonfunctional technical debt that impacts performance, reliability, or scalability.
Developer collaboration starts with understanding the SRE role and where development responsibilities end and SRE responsibilities begin. Even with this understanding, most organizations have far fewer SREs than developers, so collaboration is needed regardless of where the SREs sit, to whom they report, and whether they are part of the agile team or operate as external service providers. Although SREs should use the same tools and development practices as developers, what work gets assigned to them and what becomes a pattern or best practice implemented by developers depends on many factors.
Why is this critical? Because no one wants to be constantly fire-fighting incidents that impact end-users and customers.