The term “observability” started to gain serious momentum in software engineering circles around 2018, as a natural evolution of monitoring practices. By bringing together the raw outputs of metrics, events, logs, and traces, software developers could start to gain a real-time picture of how their software systems are performing and where issues might be occurring.
The concept itself, however, has deep roots in control theory, where observability is a measure of how well the internal state of a system can be inferred from its external outputs alone.
Now, with the broad shift towards distributed software systems through microservices and containers, the old adage of not being able to manage what you can’t measure has never been more relevant.
Observability vs. monitoring
For many people, observability will just sound like a convenient rebranding of application monitoring, and any skepticism around the latest industry buzzword is justified. However, as my colleague David Linthicum puts it, there is a basic difference: Monitoring “is something you do (a verb); observability is an attribute of a system (a noun),” he wrote.
Taking things one step further, engineering manager and technical blogger Ernest Mueller wrote back in 2018 that “observability is a property of a system. You can monitor a system using various instrumentation, but if the system doesn’t externalize its state well enough that you can figure out what’s actually going on in there, then you’re stuck.”
As developers have broken up their applications into smaller chunks—called microservices—hosted them in containers across distributed cloud servers, and deployed them continuously under the all-seeing eye of the devops team, the need for true observability has become increasingly critical.
“As systems become more distributed, methods for building and operating them are rapidly evolving—and that makes visibility into your services and infrastructure more important than ever,” software developer Cindy Sridharan wrote in her book Distributed Systems Observability.
“Observability is a superset of monitoring,” Sridharan wrote. “It provides not only high-level overviews of the system’s health but also highly granular insights into the implicit failure modes of the system. In addition, an observable system furnishes ample context about its inner workings, unlocking the ability to uncover deeper, systemic issues.”
The three pillars of observability
There are three commonly agreed-upon pillars of observability: metrics, traces, and logs.
Taken individually, these pillars represent a developer’s ability to instrument and monitor their systems. Once they are brought together and presented in as close to real time as possible, those systems start to become observable.
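To make the three signal types concrete, here is a minimal Python sketch of a single request emitting a metric, a log, and a trace span. The `Telemetry` class, the `handle_request` function, and the in-memory stores are hypothetical stand-ins invented for illustration, not any vendor's API; a real system would ship this data to dedicated backends.

```python
import json
import time
import uuid
from collections import Counter


class Telemetry:
    """Hypothetical in-memory sink standing in for real metric, log, and trace backends."""

    def __init__(self):
        self.metrics = Counter()  # aggregate counts keyed by (path, status)
        self.logs = []            # structured, timestamped events (JSON strings)
        self.spans = []           # timed units of work


def handle_request(telemetry, path):
    """Pretend to serve one request and emit all three signal types for it."""
    trace_id = uuid.uuid4().hex  # shared ID that lets the signals be correlated
    start = time.perf_counter()
    status = 200 if path == "/ok" else 500  # stand-in for real work
    duration_ms = (time.perf_counter() - start) * 1000.0

    # Metric: a cheap aggregate that answers "how many, how often?"
    telemetry.metrics[(path, status)] += 1

    # Log: a structured event that answers "what happened?"
    telemetry.logs.append(json.dumps({
        "ts": time.time(),
        "level": "info" if status == 200 else "error",
        "msg": "request handled",
        "path": path,
        "status": status,
        "trace_id": trace_id,
    }))

    # Trace span: a timed record that answers "where did the time go?"
    telemetry.spans.append({
        "trace_id": trace_id,
        "name": "handle_request",
        "duration_ms": duration_ms,
    })
    return trace_id
```

The shared `trace_id` is the design choice that matters here: it is what lets you move from a spike in an error count to the matching log line and span, rather than treating each pillar as a separate silo.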
That being said, the three pillars do not miraculously add up to observability. “It’s not about logs, metrics, or traces, but about being data-driven during debugging and using the feedback to iterate on and improve the product,” Sridharan wrote.
Greg Ouillon, the CTO for Europe, the Middle East, and Africa at monitoring vendor New Relic, sees observability as a confluence of the software engineering and monitoring trends that have shaped the cloud era.
“Observability addresses these challenges by rethinking monitoring and adapting to the new technology paradigm,” Ouillon said. “By providing you with a fully connected view of all software telemetry data in one place, real-time observability allows you to proactively master the performance of your digital architecture, accelerate innovation and software velocity, and reduce toil and operational costs.”
Observability tools and vendor landscape
The vendor landscape is fairly complex when it comes to observability, as makers of logging, monitoring, and application performance management (APM) software all stake claims to offering observability tools. “Observability a year ago was a useful term, but now is becoming a buzzword,” says Gartner analyst Josh Chessman.
Take log monitoring specialists like Splunk and Sumo Logic, both of which have moved further toward end-to-end observability by developing new features and making key acquisitions to round out their platforms. Splunk’s acquisitions include cloud network performance monitoring specialist Flowmill and user and application performance monitoring specialist Plumbr in 2020. Together with its $1 billion purchase of real-time monitoring company SignalFx in 2019, these deals make it clear that Splunk wants to be a one-stop shop for observability tools.
Vendors like Dynatrace, Datadog, New Relic, SolarWinds, Scalyr (recently acquired by security specialist SentinelOne), and newcomer Honeycomb also aim to provide off-the-shelf instrumentation and observability as a service for engineering teams.
On the open source side, Grafana Labs has built a massively popular open source monitoring and observability platform. Apache SkyWalking is another open source observability tool that allows system administrators to identify issues, receive key alerts, and monitor overall system health, with or without a service mesh.
The OpenTelemetry initiative is another open source project that has rapidly grown in popularity. The sandbox project—which came about as a merger between OpenCensus and OpenTracing—sits with the Cloud Native Computing Foundation (CNCF) and has gathered broad support as an emerging industry standard for observability.
For developers looking to build their own observability stack from scratch, open source tools like Prometheus for metrics, Logstash for logs, and Jaeger for tracing can provide the building blocks required to cover the three pillars of observability.
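As a rough illustration of what the metrics building block involves, Prometheus collects data by scraping a plain-text exposition format over HTTP. The Python sketch below renders a counter in that format using only the standard library. The function names and the sample metric are assumptions made for this example; in practice a team would normally use an official Prometheus client library, which also handles details this sketch skips, such as escaping in the help text.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes, and newlines inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render a counter as Prometheus-style exposition text.

    samples is a list of (labels_dict, value) pairs; an empty dict means no labels.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is stable from one scrape to the next.
            label_str = ",".join(
                f'{key}="{escape_label_value(str(val))}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def example_payload():
    """Build a sample payload for a hypothetical request counter."""
    return render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        [({"method": "get", "status": "200"}, 3)],
    )
```

Serving the string returned by `example_payload()` from an HTTP endpoint is, in essence, all a scrape target has to do; the storage, querying, and alerting live on the Prometheus side.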
The next phase of observability
The Holy Grail for users and vendors in the observability space, whether the toolkit is proprietary, open source, or even homegrown, is to automate away the fact-finding part of the process, so that issues are spotted automatically and can be fixed before they affect users. Better still would be software that fixes faults before the developers are even aware of the issue on their dashboard.
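A very small taste of what "automatically spotted" can mean is flagging a measurement that strays too far from its recent baseline. The Python sketch below does this for latency samples with a rolling mean and standard deviation. The `LatencyWatcher` class, the window size, and the threshold are all assumptions chosen for illustration; commercial tools use far more sophisticated techniques than this.

```python
from collections import deque
from statistics import mean, pstdev


class LatencyWatcher:
    """Hypothetical detector that flags samples far from a rolling baseline."""

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.samples = deque(maxlen=window)  # recent "normal" samples only
        self.threshold = threshold           # allowed distance, in standard deviations
        self.min_samples = min_samples       # warm-up period before judging anything

    def observe(self, latency_ms):
        """Return True if latency_ms looks anomalous against the recent window."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            baseline = mean(self.samples)
            spread = pstdev(self.samples)
            if spread > 0 and abs(latency_ms - baseline) > self.threshold * spread:
                anomalous = True
        if not anomalous:
            # Keep outliers out of the baseline so one spike does not hide the next.
            self.samples.append(latency_ms)
        return anomalous
```

Feeding this watcher steady readings around 100 ms followed by a 500 ms sample would flag only the spike. The harder problems, deciding what to do about it and doing so safely without a human in the loop, are where the real work in this space lies.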
There is also a growing community of startups and open source projects looking at the next crop of observability challenges, such as SigNoz, an open source observability platform for Kubernetes and microservices, or Jeli, a project founded by an ex-Netflix engineer that focuses on giving developer teams the tools to map where their code is failing against the structure of their organization.
Building a culture of observability
It’s important to remember that the three pillars alone do not instantly combine to achieve observability; people and process must also be aligned around a set of shared goals.
“The process of knowing what information to expose and how to examine the evidence (observations) at hand—to deduce likely answers behind a system’s idiosyncrasies in production—still requires a good understanding of the system and domain, as well as a good sense of intuition,” Cindy Sridharan wrote.
Observability should not be the goal in and of itself, but rather viewed as a means to build and operate more reliable software for customers. “The value of the observability of a system primarily stems from the business and organizational value derived from it,” Sridharan wrote. “Being able to debug and diagnose production issues quickly not only makes for a great end-user experience, but also paves the way toward the humane and sustainable operability of a service, including the on-call experience.”
Those dual incentives of better customer outcomes and a potentially easier life for software engineers should be enough to drive many organizations towards gaining better observability of their systems for years to come.