by Ben Evans and Charles Humble

Why observability is the future of systems monitoring

Aug 12, 2020 (9 mins)

Combining metrics, events, logs, and traces is essential to understanding our increasingly complex software environments

While the shift to cloud continues to be a major trend within our industry, it remains the case that different organizations are performing that migration in vastly different ways. The firms that typically attract the headlines are those that have undergone a root-and-branch transformation. After all, the story of a complete overhaul and radical restructuring along cloud-native lines is a compelling one.

However, this is far from the only narrative in the marketplace. Not every business is on the same trajectory toward cloud adoption, and an extensive hinterland of applications and companies still have not moved to the cloud. In addition, there exists a major subset of companies that have migrated only partially, or in a way that closely resembles their historic technology practices — the “lift and shift” approach.

As an example, O’Reilly Radar conducted a 2020 Cloud Adoption survey of 1,283 engineers, architects, and IT leaders from companies across many industries. More than 88% of respondents use cloud in one form or another, and over 90% of respondent organizations expect to grow their usage over the next 12 months. However, only 17% of respondents from large organizations (over 10,000 employees) indicated that they have already moved 100% of their applications to the cloud. Clearly, most of the world has a way to go in its cloud migration journey.

What’s the holdup? One simple, inescapable conclusion is that software has never been more complex than it is today. We live in a world that is increasingly driven by cloud, but also has a large number of heterogeneous technology stacks. More than half of the O’Reilly survey respondents indicated that they are using multiple cloud services and have implemented microservices. Among cloud service and solutions providers, there are no clear winners that look ready to drive out the competition and dominate. If anything, we should expect the diversity of popular solutions to increase, rather than decrease.

From APM to observability

One aspect of this persistent diversity is the need for companies to make sense of the performance of their applications. Many software shops have long made use of application performance monitoring (APM) solutions, which collect application-level and machine-level metrics and display them in dashboards. The APM approach provides insights and allows engineers to find and fix problems, but it also leads to its own anti-patterns, such as the trap of trying to collect everything (what we might call “Pokemon Monitoring”). In reality, the vast majority of these collected metrics will never be looked at. Moreover, collecting the data is, relatively speaking, the easy part. The hard part is making sense of it. To be useful, monitoring data needs to be in context and actionable.

In response to these issues, the industry is increasingly turning from conventional monitoring tools to observability. The term isn’t clearly defined, and as such it might mean different things to different people. For some, observability is just a rebranding of monitoring. For others, observability is about logs, metrics, and traces. For the purposes of this article, we’re focusing on the latter, taking the definition derived from control theory. This represents an emergent practice that relies on a new view of what monitoring data is and how it should be used.

At a high level, the goal of observability is to be able to answer any arbitrary question at any point in time about what is happening inside a complex software system just by observing the outside of the system. An example question might be, “Is this issue impacting all iOS users, or just a subset?” Or “Show me all the page loads in the UK that take more than 10 seconds.”
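To make the idea of an ad hoc question concrete, here is a minimal Python sketch that answers both example questions over a handful of in-memory page-load records. The `PageLoad` fields, the sample data, and the helper names are all hypothetical and chosen only for illustration; a real system would run the equivalent query against its telemetry store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageLoad:
    # Hypothetical attributes attached to each page-load record.
    user_id: str
    country: str
    os: str
    os_version: str
    duration_s: float
    error: bool


# Made-up sample data, not taken from any real system.
EVENTS = [
    PageLoad("u1", "UK", "iOS", "13.6", 12.4, True),
    PageLoad("u2", "UK", "Android", "10", 1.1, False),
    PageLoad("u3", "US", "iOS", "13.6", 0.9, True),
    PageLoad("u4", "UK", "iOS", "12.4", 10.5, False),
    PageLoad("u5", "DE", "iOS", "12.4", 0.7, False),
]


def slow_page_loads(events, country, threshold_s):
    """'Show me all the page loads in <country> that take more than N seconds.'"""
    return [e for e in events if e.country == country and e.duration_s > threshold_s]


def error_rate_by(events, attribute):
    """Group records by any attribute and return the error rate per group."""
    totals, errors = {}, {}
    for e in events:
        key = getattr(e, attribute)
        totals[key] = totals.get(key, 0) + 1
        errors[key] = errors.get(key, 0) + int(e.error)
    return {key: errors[key] / totals[key] for key in totals}


slow_uk = slow_page_loads(EVENTS, "UK", 10)

# 'Is this issue impacting all iOS users, or just a subset?'
ios_events = [e for e in EVENTS if e.os == "iOS"]
ios_by_version = error_rate_by(ios_events, "os_version")
```

Neither question had to be anticipated when the records were written. What matters is that each record carries enough attributes to be sliced later.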

The ability to ask ad hoc questions is useful for both debugging and incident response, where you typically see engineers asking questions that they hadn’t thought of up front. This is also the key difference between monitoring and observability. Monitoring is set up in advance, which means teams need to know what to care about ahead of a system issue occurring. Observability allows you to discover what’s important by looking at how the system actually behaves in production over time. The ability to understand a system in this way is also one of the mechanisms that allow engineers to evolve it.
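The difference can be sketched in a few lines of Python. In this illustrative example, a check defined in advance watches the overall error rate, while a later breakdown by an attribute nobody alerted on reveals where the failures are concentrated. The request records, the 5% threshold, and the function names are assumptions made for the sketch.

```python
from collections import Counter

# Hypothetical request records as (region, succeeded) pairs.
REQUESTS = [("eu", True)] * 95 + [("ap", False)] * 4 + [("ap", True)] * 1

# Monitoring: a threshold chosen ahead of time (assumed value).
ERROR_RATE_ALERT = 0.05


def overall_error_rate(requests):
    failures = sum(1 for _, ok in requests if not ok)
    return failures / len(requests)


def alert_fires(requests):
    """The predefined check: only the aggregate error rate is watched."""
    return overall_error_rate(requests) > ERROR_RATE_ALERT


def error_rate_per_region(requests):
    """An after-the-fact breakdown by a dimension nobody set an alert on."""
    totals, failures = Counter(), Counter()
    for region, ok in requests:
        totals[region] += 1
        if not ok:
            failures[region] += 1
    return {region: failures[region] / totals[region] for region in totals}


fired = alert_fires(REQUESTS)
per_region = error_rate_per_region(REQUESTS)
```

With this sample data the aggregate error rate is 4%, so the alert stays quiet, yet the per-region breakdown shows one region failing on four of its five requests.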

Keys to observability 

To achieve observability for distributed systems, such as container-based microservices deployments, we typically aggregate telemetry data from four major categories. In summary, these data are:

  • Metrics: A numerical representation of data measured over a time interval. Examples might include queue depth, how much memory is being used, how many requests per second are being handled by a given service, the number of errors per second, and so on. Metrics are particularly useful for reporting the overall health of a system, and also naturally lend themselves to triggering alerts and visual representations such as gauges.
  • Events: An immutable, time-stamped record of events over time. These are typically emitted from the application in response to an event in the code.
  • Logs: In their most fundamental form, logs are essentially just lines of text that a system produces when certain code blocks get executed. They might be in plaintext, structured (for example, emitted in JSON), or binary (such as the MySQL binlogs used for replication and point-in-time recovery). Logs prove valuable when retroactively verifying and interrogating code execution. In fact, logs are incredibly valuable for troubleshooting databases, caches, load balancers, or older proprietary systems that aren’t friendly to in-process instrumentation, to name a few. Similar to events, log data is discrete and is typically more granular than events.
  • Traces: Traces show the activity for a single transaction or request as it “hops” through a system of microservices. A trace should show the path of the request through the system, the latency of the components along that path, and which component is causing a bottleneck or failure.
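As a rough illustration of how the four categories differ, the following Python sketch emits one of each while handling a request. It uses only the standard library and keeps everything in memory. The service names, field names, and the `handle_request` helper are hypothetical, and a production system would send this data to a telemetry backend rather than to local lists.

```python
import json
import time
import uuid
from collections import Counter

metrics = Counter()  # metrics: numbers aggregated over a time interval
events = []          # events: immutable, time-stamped records
log_lines = []       # logs: lines of text (structured as JSON here)
spans = []           # traces: one span per hop, linked by a shared trace_id


def handle_request(service, trace_id, parent_span_id=None):
    """Simulate one hop of a request and emit all four kinds of telemetry."""
    span_id = uuid.uuid4().hex[:16]
    start = time.time()

    # Metric: a counter that could later be reported as requests per second.
    metrics[service + ".requests"] += 1

    # Log: a structured line written when this code block executes.
    log_lines.append(json.dumps({
        "ts": start, "level": "INFO", "service": service,
        "msg": "request received", "trace_id": trace_id,
    }))

    # Event: a time-stamped record of something that happened in the code.
    events.append({"ts": start, "type": "RequestHandled", "service": service})

    # Trace: a span recording this hop, its parent, and its latency.
    spans.append({
        "trace_id": trace_id, "span_id": span_id,
        "parent_span_id": parent_span_id, "service": service,
        "duration_s": time.time() - start,
    })
    return span_id


trace_id = uuid.uuid4().hex
frontend_span = handle_request("frontend", trace_id)
handle_request("orders", trace_id, parent_span_id=frontend_span)
```

The shared `trace_id` and the `parent_span_id` link are what allow the two spans to be reassembled into a single path through the system.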

Of the four types of telemetry data, traces are generally considered the most difficult to apply retrospectively to an infrastructure. That’s because, for tracing to be truly effective, every component of the system needs to be modified to propagate tracing information. In a microservices architecture, the service mesh pattern can be helpful in this regard.

While a service mesh doesn’t eliminate the need for modifications to the individual services, the amount of work required is substantially reduced. Lyft famously got distributed tracing support for all of its services by adopting the service mesh pattern with Envoy, and the only change required at the client layer was to forward certain headers. Lyft also gained consistent logging and consistent statistics for every hop.
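A minimal sketch of what forwarding headers can look like in application code follows. The header names are the request ID and B3-style names commonly used with Envoy's Zipkin-compatible tracing; we are assuming them for illustration and are not claiming that this is the exact set Lyft forwarded. The `headers_to_forward` helper and the sample values are likewise hypothetical.

```python
# Assumed header set: request ID plus B3-style trace context headers.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
)


def headers_to_forward(incoming):
    """Pick the tracing headers from an inbound request, case-insensitively,
    so they can be copied onto any outbound calls made while handling it."""
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


# Made-up inbound request headers.
incoming = {
    "Host": "orders.internal",
    "X-Request-Id": "req-123",
    "X-B3-TraceId": "463ac35c9f6413ad",
    "X-B3-SpanId": "a2fb4a1d1a96d312",
    "X-B3-Sampled": "1",
}

outbound = headers_to_forward(incoming)
```

The service does no tracing work of its own here. It only passes the context along, and the sidecar proxy uses that context to stitch the hops together.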

Distributed tracing is also a major component of the widely supported OpenTelemetry initiative, currently a Sandbox project of the Cloud Native Computing Foundation (CNCF). The ultimate aim of OpenTelemetry is to ensure that support for distributed tracing and other observability-supporting telemetry is a built-in feature of cloud-native software.

Observability vs. monitoring

It is a mistake to think that the two approaches of observability and monitoring are mutually exclusive, as their goals are different. In addition, while the use of the term observability is comparatively new in software, the concepts behind it are not, as Cindy Sridharan has noted:

  • Observability isn’t a substitute for monitoring nor does it obviate the need for monitoring; the two are complementary. Observability might be a fancy new term on the horizon, but it isn’t a novel idea. Events, tracing, and exception tracking are all derivative of logs, and if one has been using any of these tools, one already has some form of observability. True, new tools and new vendors will have their own definition and understanding of the term, but in essence observability captures what monitoring doesn’t.
  • Monitoring is best suited to report the overall health of systems. Aiming to “monitor everything” can prove to be an anti-pattern. Monitoring, as such, is best limited to key business and systems metrics derived from time series based instrumentation, known failure modes, and black box tests. Observability, on the other hand, aims to provide highly granular insights into the behavior of systems along with rich context, perfect for debugging purposes. Because it’s not possible to predict every single failure mode a system could potentially run into, or to predict every possible way in which a system could misbehave, we should build systems that can be debugged armed with evidence and not conjecture.

Despite requiring teams to adopt more sophisticated approaches to overseeing their applications, observability brings improvements in visibility and issue resolution that are extremely valuable. It is a fundamentally better approach than monitoring metrics in a “Big Wall of Data.” Observability techniques become even more effective when we design new systems from the ground up to support them. In order for teams to be successful, we believe they need to be united by a single platform that allows everyone to see all telemetry data in one place. This enables software development teams to quickly get the context needed to derive meaning and take the right action.

Observability is simply a requirement for serious cloud-native businesses, which tend to use microservice architectures and have both higher scale and greater complexity as a result. However, the benefits of observability are also a huge boon for the entire industry, regardless of the level of sophistication or maturity of cloud transition.

Ben Evans is principal engineer and JVM technologies architect at New Relic. Charles Humble is a remote engineering team leader at New Relic.
