As enterprise technology grows more complex, the term “observability” is gaining traction among those tasked with managing the distributed infrastructure their companies increasingly depend on. The old adage that you can’t control what you can’t measure has never been more relevant for people in the software business, and the need for observability is becoming clear.
Back in March 2020, before the vast majority of the world knew what Reddit’s r/wallstreetbets was, or how much GameStop stock was trading for, the popular investing app Robinhood was struggling with regular service outages, blocking users from buying and selling shares of companies like Tesla, Apple, and Nike.
The outages, which cropped up a few times over the course of 2020, were caused by “stress on our infrastructure,” wrote Robinhood cofounders Baiju Bhatt and Vlad Tenev in a March 2020 blog post.
This is obviously bad for business, because Robinhood makes a small amount of money on every trade that flows through its systems, and it’s bad for Robinhood’s reputation as a company that is trying to democratize the buying and selling of company stocks over the internet. Such outages can even lead to lawsuits from disgruntled users who missed out on selling at the top or buying at the bottom of the market.
For all those reasons, being able to spot those infrastructure stresses before they affect customers, or at least to limit the blast radius of such incidents, can quickly become a board-level priority for companies like Robinhood.
The complexity of modern cloud-based software has allowed businesses to scale their digital services effectively, but that complexity also creates bottlenecks and dependencies that can be hard to foresee or fix on the fly.
“With thousands of microservices, hundreds of releases per day, and hundreds of thousands of containers, there’s no way that the human eye can cope with that level of complexity,” said Greg Ouillon, CTO for Europe, Middle East, and Africa at the monitoring vendor New Relic.
Observability promises to help harness today’s IT complexity
Observability has its roots in control theory, an engineering discipline in which it is a measure of how well the internal state of a system can be inferred from its external outputs alone. In software specifically, it is a natural evolution of monitoring, taking the raw outputs of metrics, events, logs, and traces to build up a real-time picture of how your systems are performing and where issues might be cropping up. It is the means by which developers can start to peel back the black box encasing their complex systems.
The problem for most organizations is the sheer volume of data generated by their large, distributed systems, and the resulting difficulty of finding a scalable way to spot and react to issues quickly enough to stop users being affected.
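For readers curious about the control-theory origin, the idea can be made concrete for a simple linear model. The sketch below assumes a system whose state evolves according to a matrix A and whose visible outputs are given by a matrix C, and applies the standard Kalman rank test: the system is observable if stacking C, CA, ..., CA^(n-1) yields a matrix of rank n. The helper names and the example matrices are illustrative assumptions, not something drawn from the vendors or practitioners quoted here.

```python
from fractions import Fraction


def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rank(matrix):
    """Matrix rank via Gaussian elimination, using exact fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    found = 0  # number of pivot rows located so far
    for col in range(len(rows[0])):
        pivot = next((r for r in range(found, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for r in range(found + 1, len(rows)):
            factor = rows[r][col] / rows[found][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[found])]
        found += 1
    return found


def is_observable(A, C):
    """Kalman rank test: stack C, CA, ..., CA^(n-1) and check for rank n."""
    n = len(A)
    block, stacked = C, []
    for _ in range(n):
        stacked.extend(block)
        block = mat_mul(block, A)
    return rank(stacked) == n


# Hypothetical two-state system: state = (position, velocity).
A = [[0, 1],
     [0, 0]]
MEASURE_POSITION = [[1, 0]]  # only position is visible from outside
MEASURE_VELOCITY = [[0, 1]]  # only velocity is visible from outside
```

With these example matrices, `is_observable(A, MEASURE_POSITION)` is true because velocity can be inferred from how position changes, while `is_observable(A, MEASURE_VELOCITY)` is false because no amount of velocity data reveals the starting position. The software analogy is loose, but the point carries over: what you choose to emit determines what you can later deduce.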
“Containers and microservices are so complex and the interactions are so vast, it is virtually impossible to make sense of it. As we add more instrumentation we get more data and no one can look at all that,” said Josh Chessman, a Gartner analyst specializing in network and application performance monitoring. “How do you find that needle in the haystack? That is what observability is about in the end—finding that and fixing it, because downtime costs money.”
How the pandemic pushed observability forward
The COVID-19 pandemic has pushed cloud spending up across the board, which means more and more companies need to be able to monitor and remediate the underlying complexity that comes with the cloud. “Being able to view the entire software stack is now a must-have within much more complex IT and development environments and during continued cloud migration and accelerated application modernization,” said New Relic’s Ouillon.
Spiros Xanthos is the cofounder of the distributed tracing startup Omnition, which was acquired by the monitoring vendor Splunk in 2019. Having spent years working with the tools required to effectively observe modern, distributed software systems, he is now VP of product management, observability, and IT operations at Splunk, where he has seen customer interest in observability as an idea grow quickly in the past year.
“In 2018, we saw many companies that are cloud-native and in the tech sector talking about observability,” he said. “Last year, we saw this becoming more mainstream, with large organizations adopting cloud-native technologies and becoming interested in observability.”
British bank TSB has had its own well-publicized issues with customer-impacting technology, following its disastrous core banking system migration in 2018. Since then, the bank has had to grapple with regular IT outages, making reliability and incident response board-level priorities. “We want to be architected for the cloud, where any failure is like the Netflix model, where there is no massive system outage and we limit anything to a handful of customers,” said Suresh Viswanathan, TSB’s chief operating officer.
TSB no longer owns and operates any data centers, so its call center system is in BT’s cloud, its CRM is Microsoft Dynamics 365, and its core banking system is managed by IBM, to name just a few key partners—all linked together by a complex web of microservices and APIs. That’s a good example of where observability is needed.
“In theory, we can replace any of those platforms, but as you roundtrip these transactions we don’t have the instrumentation to know what goes pop [fails],” Viswanathan said. So the bank is using the monitoring vendor Dynatrace to gain this instrumentation and visibility. Observability is “not just a tool but a cultural journey as a firm,” he said, “so we can track what is happening in the hands of our customers and roundtrip that. This is important to be one step ahead of any problems.”
Going beyond the three pillars of observability
Speaking at the first Dash conference in 2018, Datadog CEO Olivier Pomel outlined what are now commonly agreed upon as the three pillars of observability: metrics, traces, and logs. Taken individually, each of these pillars represents a developer’s ability to monitor their systems. Once they are brought together, you can start to get to observability.
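To make the three pillars less abstract, here is a minimal sketch of one request handler emitting all three signals using only the Python standard library. Everything in it is a hypothetical illustration: the service name, the `span` helper, the `handle_request` function, and the in-memory `metrics` and `spans` stores stand in for whatever agent or vendor SDK a real team would use.

```python
import logging
import time
import uuid
from collections import Counter
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("booking")  # hypothetical service name

metrics = Counter()  # pillar 1: metrics (numeric counts, cheap to aggregate)
spans = []           # pillar 2: traces (timed steps sharing one trace ID)


@contextmanager
def span(name, trace_id):
    """Record how long a named step took, tagged with its trace ID."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request(user):
    """Handle one hypothetical request, emitting a metric, spans, and a log."""
    trace_id = uuid.uuid4().hex
    metrics["requests_total"] += 1
    with span("handle_request", trace_id):
        with span("charge_card", trace_id):
            # pillar 3: logs (a discrete event carrying the trace ID)
            log.info("charging card user=%s trace_id=%s", user, trace_id)
    return trace_id
```

Calling `handle_request("alice")` increments the counter, writes one log line, and records two spans. The shared trace ID is what lets the three signals be joined after the fact, which is roughly the step from monitoring each pillar separately toward observability.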
“Developers have been doing those three things for a long time, so rebranding them is not particularly useful,” said Dan Taylor, head of engineering at the popular travel booking company Trainline. “For us, the crux of the issue is to go beyond those three technical pieces to looking at a system in a holistic way, rather than as individual components.”
Trainline is a typically complex modern application, made up of interconnected microservices and hundreds of APIs for external travel companies to plug into its booking platform. This creates a whole host of dependencies that can be hard to observe in a consistent way, especially when you want to give developer teams autonomy over how they manage their software.
“It’s not about prescriptively telling them how much to log or what metrics are important, but bringing them to the understanding of their impact on customers and the business as a whole,” Taylor said.
For most organizations, instrumentation is just the start. Being able to understand the value of that information and how it can help your customers and engineers is the more important part of the puzzle.
For example, at Porsche Informatik, an Austrian software company primarily serving the automotive sector, “customers expect round-the-clock availability, which requires an understanding of the root cause of a problem before the customer sees the issue. We needed integrated monitoring of every component across our full stack,” said Peter Friedwagner, head of infrastructure and cloud services at Porsche Informatik, during his session at this year’s Dynatrace Perform virtual conference.
The firm hosts a dealer management system used by 50,000 car dealers across Europe, where uptime is vital. It recently broke this monolithic on-premises application down into microservices hosted across containers using Red Hat OpenShift, both on-prem and in the Microsoft Azure public cloud. Understanding the communication patterns among those microservices as they cascade was, and still is, challenging for its developers. The hope is that observability tools will lead to that understanding.
Beware the ‘observability’ buzzword
“Observability a year ago was a useful term, but now is becoming a buzzword,” said Gartner analyst Chessman, with plenty of vendors proving more than happy to co-opt the observability moniker.
“As the need and demand for observability grows, some monitoring tool vendors are jumping on the bandwagon—about as fast as they did with devops a few years ago,” the vendor Splunk notes in its own Beginner’s Guide to Observability, with at least some degree of self-awareness.
As engineering manager and technical blogger Ernest Mueller wrote back in 2018, “No tool is going to give you observability and that’s the usual silver-bullet fallacy heard from someone who wants to sell you something.”
Instead, organizations have to work out their own path to better observability. “It’s like putting the cart before the horse to buy observability,” said George Bashi, vice president of engineering infrastructure at Yelp.
That’s why the popular reviews site—which is primarily a highly distributed Python application running on Kubernetes—believes in product ownership and empowering developers to be responsible for their own services. “When a developer team owns something, the classic trade-off is performance, reliability, and cost. We put the data into the hands of those teams so they have the tools to make those decisions,” Bashi said.
What’s next for observability
When you talk to anyone tasked with thinking about the observability of their systems, you get a common wish list, often topped with automated insights and remediation powered by machine learning.
TSB’s Viswanathan wants tools that can apply intelligence “to know the top issues and apply the home remedy kit, as it were, so the system is self-starting, without us noticing. That is where we want to go to.”
This is also where the vendors want to go. “We are finally close enough to machine-based intelligence for observability,” said Splunk’s Xanthos. “For the first time, we are able to apply machine intelligence to correlate effectively. If I can solve this once, we can move towards automated remediation.”
The machines might not be ready to take over just yet, though. In her foundational book Distributed Systems Observability, software developer Cindy Sridharan argues for engineer-led observability:
The process of knowing what information to expose and how to examine the evidence at hand—to deduce likely answers behind a system’s idiosyncrasies in production—still requires a good understanding of the system and domain, as well as a good sense of intuition.
The Holy Grail for anyone building out their observability capability is a system that can eventually spot and fix issues automatically, before engineers are even aware of them and regardless of the environment they are running in. To get there, “vendors will have to stand out on their ability to consolidate and make sense of those piles of data, with intelligence and automation capabilities layered on top of that instrumentation,” said Gartner’s Chessman.