Sep 12, 2023 2:00 AM

What’s next for observability?

Today’s systems are exposing more of their underlying complexity to operators. These are the most exciting new developments along the journey of taming that complexity.

The concept of observability traces back to the 1960s and Rudolf E. Kalman’s canonical work on decomposing complex systems for human understanding. It was a heady time for new computing systems in aerospace and navigation. The advances in these systems outpaced humans’ ability to reason about them, and Kalman’s work is widely credited with laying the foundation of observability theory.

Observability as we know it today—the $9 billion category that is a staple of modern IT operations—is more commonly associated with Google’s site reliability engineering approaches to hyperscale services like Google Search, Google Ads, and YouTube.

According to Google’s Site Reliability Engineering book, it was in 2003—corresponding with the creation of Borg, the cluster operating system that would inspire Kubernetes—that Google created a novel monitoring system called Borgmon. Google recognized that, with the many moving parts of microservices operating across distributed infrastructure, a new model was required for understanding dynamic systems—one that worked in real time and didn’t swamp platform teams with noisy pages.

Borgmon “made the collection of time-series data a first-class role of the monitoring system and replaced check scripts with a rich language for manipulating time-series into charts and alerts.”

Enter Prometheus and Grafana

Borgmon became the inspiration for Prometheus, which was publicly released in 2014 and is today the most popular open source technology for metrics-based monitoring and alerting. At roughly the same time, Grafana was independently introduced for visualizing observability data. Together, these two open source technologies created communities of tens of millions of developers and a flywheel of innovation for an observability category centered on metrics, logs, and traces.
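
To make that metrics-centered model concrete, here is a minimal sketch of instrumenting a Go service with the Prometheus client library, prometheus/client_golang. The metric name, label, route, and port are illustrative assumptions for this example, not anything prescribed by the project.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// ordersProcessed is a monotonically increasing counter, labeled by outcome.
var ordersProcessed = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "orders_processed_total",
		Help: "Total number of orders processed, by outcome.",
	},
	[]string{"outcome"},
)

func handleOrder(w http.ResponseWriter, r *http.Request) {
	// ... business logic would go here ...
	ordersProcessed.WithLabelValues("success").Inc()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/order", handleOrder)
	// Expose all registered metrics at /metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```

Prometheus scrapes the /metrics endpoint on its own schedule, and a query such as rate(orders_processed_total[5m]) can then feed Grafana charts or alerting rules built on that time series.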

But the story certainly does not stop there. Today’s systems deliberately expose more and more of their underlying complexity. We’re working with more data, not less. Our systems are growing more disparate, and user expectations keep rising. While the engineering and business trade-offs make sense, they make it harder for human operators to understand what’s happening. Operators need tooling to regain control, to contain and distill the exposed complexity.

So let’s take a look at what’s around the corner for observability, and the most exciting new developments in this journey of human understanding of distributed systems.

Kernel-level observability with eBPF and Cilium

As distributed systems have evolved, so too have the abstractions at the networking layer. Two of the most exciting technologies in this area are eBPF and Cilium, which extract kernel-level intelligence wherever applications rely on Linux for file access, network access, and other operating system functionality.

These technologies, along with observability layers built on top of them such as Hubble, create a new connectivity fabric that requires no changes to applications, and they produce a treasure trove of fine-grained telemetry for observing events at the kernel level. As of today, the most successful users of eBPF and Cilium focus on network telemetry or on supporting large-scale analysis across large fleets of services.
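
To give a feel for how little the application is involved, below is a rough sketch using the cilium/ebpf Go library to hand-assemble a tiny kernel program that counts execve calls via a tracepoint and reads the counter back from user space. The tracepoint, map layout, and ten-second window are illustrative assumptions; real-world probes are usually written in C and compiled with clang or bpf2go, and running this requires root or CAP_BPF on a reasonably recent kernel.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/asm"
	"github.com/cilium/ebpf/link"
)

func main() {
	// A one-entry hash map shared between the kernel program and user space.
	counts, err := ebpf.NewMap(&ebpf.MapSpec{
		Type: ebpf.Hash, KeySize: 4, ValueSize: 8, MaxEntries: 1,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer counts.Close()
	// Seed key 0 so the in-kernel lookup always finds a value to increment.
	if err := counts.Put(uint32(0), uint64(0)); err != nil {
		log.Fatal(err)
	}

	// A hand-assembled tracepoint program: look up key 0 and atomically add 1.
	prog, err := ebpf.NewProgram(&ebpf.ProgramSpec{
		Type:    ebpf.TracePoint,
		License: "GPL",
		Instructions: asm.Instructions{
			asm.Mov.Imm(asm.R0, 0),
			asm.StoreMem(asm.RFP, -4, asm.R0, asm.Word), // key = 0 on the stack
			asm.LoadMapPtr(asm.R1, counts.FD()),
			asm.Mov.Reg(asm.R2, asm.RFP),
			asm.Add.Imm(asm.R2, -4),
			asm.FnMapLookupElem.Call(),
			asm.JEq.Imm(asm.R0, 0, "exit"), // lookup failed: skip the increment
			asm.Mov.Imm(asm.R1, 1),
			asm.StoreXAdd(asm.R0, asm.R1, asm.DWord),
			asm.Mov.Imm(asm.R0, 0).WithSymbol("exit"),
			asm.Return(),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer prog.Close()

	// Attach to the execve syscall tracepoint; no application is modified.
	tp, err := link.Tracepoint("syscalls", "sys_enter_execve", prog, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Close()

	time.Sleep(10 * time.Second)
	var count uint64
	if err := counts.Lookup(uint32(0), &count); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("execve calls observed in 10s: %d\n", count)
}
```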

Observability and software supply chain security

Security exploits like Log4Shell, the vulnerability in the ubiquitous Log4j logging library, illustrated the relative insecurity of software artifacts (the frameworks and libraries that developers use to build software) and the need to lock down the origins and integrity of these building blocks. Especially troubling with Log4j was the difficulty security teams had not just in patching the vulnerability, but in determining where, or even whether, it existed in their environment.

The massive catalog of artifacts that comprise the vast array of services in the typical enterprise, and the highly distributed nature of where services run, have created a morass for security teams that outstrips the powers of human reasoning. All signs point to observability needing to go hand in hand with software supply chain security, and I believe that we will see supply chain security technologies deeply integrated with the observability space.

Advances in observability ease of use

For all of the popularity of observability, it still requires too much subject matter expertise. Gains in automation will help to both improve and simplify the detection and selection of useful data. There is an opportunity for dashboards to be auto-created based on the type of data coming in. And auto-instrumentation will shorten time to value by creating a baseline of data extraction, albeit at the price of somewhat more data generation and thus cost.
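
As a taste of that direction, OpenTelemetry’s otelhttp middleware instruments every HTTP request of a Go service with a one-line wrapper rather than per-handler changes; fuller auto-instrumentation agents go further still. In the sketch below, the route, port, and operation name are placeholder assumptions.

```go
package main

import (
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/checkout", func(w http.ResponseWriter, r *http.Request) {
		// Existing business logic stays untouched.
		w.Write([]byte("ok"))
	})

	// otelhttp wraps the whole mux, so every route gets request spans and
	// metrics from the globally configured OpenTelemetry SDK.
	handler := otelhttp.NewHandler(mux, "http.server")

	log.Fatal(http.ListenAndServe(":8080", handler))
}
```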

Further, observability visualizations will evolve from a GUI-first generation to one where configuration as code is a first-class concept. Instead of configuring dashboards and alerts through web interfaces, developers will interact with observability data and configurations through APIs and Git integrations, the familiar tools and concepts they work with every day. Developers will be able to treat all of their observability data and visualizations as they would program code, even down to CI/CD, rollbacks, and all of the other mainstays of modern operations.
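
As a hedged sketch of that workflow, the snippet below takes a dashboard JSON model stored in a Git repository and applies it through Grafana’s HTTP API (POST /api/dashboards/db). The file path, environment variables, and dashboard contents are assumptions for illustration; a CI/CD job could run the same step on every merge, and a rollback is simply a revert of the commit.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	// The dashboard JSON model is version-controlled alongside application code.
	// "dashboards/checkout.json" is an assumed path in this repository.
	model, err := os.ReadFile("dashboards/checkout.json")
	if err != nil {
		log.Fatal(err)
	}

	var dashboard map[string]any
	if err := json.Unmarshal(model, &dashboard); err != nil {
		log.Fatal(err)
	}

	// Wrap the model in the payload Grafana expects; overwrite makes the
	// call idempotent so it can be re-applied on every pipeline run.
	payload, err := json.Marshal(map[string]any{
		"dashboard": dashboard,
		"overwrite": true,
	})
	if err != nil {
		log.Fatal(err)
	}

	req, err := http.NewRequest(http.MethodPost,
		os.Getenv("GRAFANA_URL")+"/api/dashboards/db", bytes.NewReader(payload))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+os.Getenv("GRAFANA_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("Grafana responded with", resp.Status)
}
```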

Application observability

Classic application performance monitoring, or APM, is severely limited in scope, looking only at the application while ignoring the underlying infrastructure. Classic app monitoring is blind to issues with the cloud provider, network, storage, databases, other services, the cluster scheduler, or anything else the application interacts with and depends on. With the new generation of observability tools, developers and operators can finally take a holistic view across the entire stack, seamlessly moving between metrics, logs, traces, and profiles, all while reducing mean time to recovery (MTTR) and increasing user satisfaction.
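
As a small sketch of that correlation in practice, the snippet below uses the OpenTelemetry Go SDK to wrap a unit of work in a trace span and emit the trace ID alongside a log line, letting an operator pivot from a log entry to the matching trace. The service and span names are placeholders, and the stdout exporter merely keeps the example self-contained; a real deployment would export to an OTLP-compatible backend.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// A stdout exporter keeps the example self-contained.
	exporter, err := stdouttrace.New()
	if err != nil {
		log.Fatal(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer tp.Shutdown(context.Background())
	otel.SetTracerProvider(tp)

	tracer := otel.Tracer("checkout-service")
	ctx, span := tracer.Start(context.Background(), "process-order")
	defer span.End()

	// Correlate logs with the trace by emitting the trace ID alongside them.
	log.Printf("processing order, trace_id=%s", span.SpanContext().TraceID())

	doWork(ctx)
}

func doWork(ctx context.Context) {
	// Child spans created from ctx are stitched into the same trace.
	_, child := otel.Tracer("checkout-service").Start(ctx, "charge-card")
	defer child.End()
}
```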

Concluding thoughts

The current wave of requirements for deeper observability is being driven in large part by the adoption of microservices and cloud-native computing, which have broken up old service boundaries. As organizations have implemented these new architectures to allow more work to be done in parallel by independent teams, they have also enjoyed the benefits of horizontal scalability, arguably at the cost of some vertical scalability.

This engineering trade-off makes business sense, but comes at a cost: More of the inherent complexities of these systems are directly exposed. As many teams have found, last-generation tooling is not capable of containing, distilling, and making sense of this complexity. Put differently, the lack of modern tooling has set up many teams for failure.

In the current macroeconomic environment, it’s vitally important for companies to be more reliable than their competitors while also using fewer resources. By enabling human subject matter experts to be more efficient and effective, we set up our teams and companies for success.

Channeling human understanding into greater automation has underpinned the industrial revolution, the silicon revolution, and today’s cloud revolution. Every revolution has winners and losers. This time, the winners will be the companies that use modern observability principles to understand what’s happening in their clouds.

Richard “RichiH” Hartmann is the director of community at Grafana Labs, a Prometheus team member, the founder of OpenMetrics, a member of OpenTelemetry, and a member of the CNCF governing board and various committees.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.