Observability: Traceability for Distributed Systems

Have you ever waited for that one expensive parcel that shows “shipped,” but you have no clue where it is? The tracking history stopped updating five days ago, and you have almost lost hope. But wait, 11 days later, you have it at your doorstep. You wished the traceability could have been better to relieve you from all the anxious waiting. This is where “observability” comes into play.

In a technical landscape, you want to keep this from happening to your software or data systems. So you adopt monitoring tools, which collect logs and metrics from your systems and inform you of their internal state. Monitoring works best when you want your systems to tell you what the error is and where and when it happened, but it doesn’t tell you how to solve the error.

More than a decade ago, monitoring tools lacked context and foresight about underlying system issues, and teams were restricted to debugging day-to-day operational errors. Today, we work and live in a distributed world of microservices and data pipelines, and even employing multiple monitoring tools won’t help you answer business questions like “Why is my application always slow?” or “At what stage did the issue occur, and how deep is it in the stack?” or “How can I improve the overall performance of the environment?” It becomes necessary to be proactive in making these decisions and to have overall visibility into your systems, applications, and data.

A blog post by Etsy, published a decade ago, makes this very point in its second paragraph:

“Application metrics are usually the hardest, yet most important, of the three. They’re very specific to your business, and they change as your applications change (and Etsy changes a lot).”

So, how do we measure everything and anything? We start with observability.

What Is Observability?

The term “observability” was coined by Rudolf Emil Kálmán in 1960 in an engineering paper describing mathematical control systems. He defined it as a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. But doesn’t that sound like monitoring? At first glance, yes, it does.

These days, observability has become quite a hot topic. According to several market surveys, it is a billion-dollar market. Many organizations have adopted the concept and employed it as a framework for end-to-end visibility of their distributed systems and pipelines. However, observability is often confused with monitoring. For now, I can say that monitoring is a subset of observability, which is one big umbrella term.

Observability enables distributed tracing by collecting and aggregating traces, logs, and metrics. Let’s see what these mean:

Traces: When a system receives a request, traces tell you how that request flows, throughout its lifecycle, from the source to the destination. Traces are represented by “spans.” A trace is a tree of spans, and a span is a single operation within a trace. They help you locate errors, latency, or bottlenecks in the system.
Logs: These are machine-generated, time-stamped events that tell you about the operations or changes that happened in the system. Logs are often queried to investigate errors or changes in the system.
Metrics: These provide quantitative insights on CPU, memory, disk usage, and how the system is performing over a time period.
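
To make the "tree of spans" idea concrete, here is a minimal sketch in Python. The field names (span_id, parent_id, start_ms, end_ms) and the service names are assumptions chosen for illustration and do not come from any particular tracing tool. The sketch links spans into a tree and then looks for the slowest leaf operation, which is one simple way a trace can point to a bottleneck.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One operation within a trace (all field names are illustrative)."""
    span_id: str
    name: str
    start_ms: int
    end_ms: int
    parent_id: Optional[str] = None
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def build_trace(spans: list[Span]) -> Span:
    """Link spans into a tree via parent_id and return the root span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def slowest_leaf(span: Span) -> Span:
    """Return the leaf span with the longest duration (a likely bottleneck)."""
    if not span.children:
        return span
    return max((slowest_leaf(c) for c in span.children),
               key=lambda s: s.duration_ms)


# Hypothetical spans for a single checkout request.
spans = [
    Span("a", "GET /checkout", 0, 480),
    Span("b", "auth-service", 10, 60, parent_id="a"),
    Span("c", "payment-service", 70, 470, parent_id="a"),
    Span("d", "db.query", 80, 120, parent_id="c"),
    Span("e", "card-gateway", 130, 460, parent_id="c"),
]
root = build_trace(spans)
bottleneck = slowest_leaf(root)
print(bottleneck.name, bottleneck.duration_ms)  # card-gateway 330
```

In this made-up request, most of the 480 ms is spent in the card-gateway call, which is the kind of answer a flat list of log lines makes much harder to see.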

These attributes enhance the monitoring framework with traceability. Traceability gives you the lens to follow a request that calls your system: how long it takes to travel from one component to another, which other services it invokes, whether it throws any errors, what logs it produces, what state it is in, when it started and ended, and how long it stayed in your system. When you collect, aggregate, and analyze these traces, you can make informed decisions based on details such as a customer’s timeline on an e-commerce website: how long it took them to search for a product, how long they viewed the product, whether the HTML page loaded complete details such as images or embedded videos, and how long the system took to authenticate and process the payment.
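
As a rough illustration of the customer timeline described above, the following Python sketch turns a handful of time-stamped events into the time spent in each stage. The event names and timestamps are hypothetical. A real system would read them from collected trace or log data.

```python
from datetime import datetime

# Hypothetical time-stamped events for one customer session.
events = [
    ("search_started", "2024-01-01T10:00:00"),
    ("product_viewed", "2024-01-01T10:02:30"),
    ("checkout_started", "2024-01-01T10:07:30"),
    ("payment_authorized", "2024-01-01T10:07:42"),
]


def stage_durations(events):
    """Return the seconds elapsed between each pair of consecutive events."""
    parsed = [(name, datetime.fromisoformat(ts)) for name, ts in events]
    return {
        f"{a_name} -> {b_name}": (b_ts - a_ts).total_seconds()
        for (a_name, a_ts), (b_name, b_ts) in zip(parsed, parsed[1:])
    }


durations = stage_durations(events)
for stage, seconds in durations.items():
    print(f"{stage}: {seconds:.0f}s")
```

With this sample data, the customer spent 150 seconds searching, 300 seconds viewing the product, and the payment step took 12 seconds.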

What Do We Achieve with Observability in a Distributed Environment?

The evolution of distributed systems began when organizations started to move away from centralized monolithic architectures toward distributed, decentralized microservice architectures. This is still a work in progress, with many organizations embracing microservices for their systems and applications, and much of it can be attributed to big data and the need to scale. Managing a distributed environment requires continuous learning, additional workforce, changes in frameworks and policies, IT management, and so on. It is indeed a big change.

Earlier, in the limited monolithic environment, the hardware, software, data, and databases all lived under a single roof. With the advent of big data in the 2000s, monitoring and scaling systems started to become a huge concern. Organizations often employed different monitoring tools to cater to the needs of their various applications. As a result, monitoring soon became an operational overhead with poor resilience, visibility, and reliability.

All these issues gave rise to the adoption of observability. Today, multiple observability tools exist for security, networks, applications, and data pipelines, providing distributed tracing in complex environments. They coexist with their cousins, the monitoring tools, and build on them by taking the information those tools collect and aggregating it with additional information from their own trace data.

There are a lot of moving components in all these systems, and their traces, when captured, can tell the story of when, where, why, what, and how. For example, you go to DATAVERSITY’s website at 1:43 p.m. to read some blog posts. When you hit dataversity.net, the HTTP request is logged by the system. You search for a blog post and open a Data Governance post, spend 17 minutes reading it, and then close your tab at 2:00 p.m.

Other calls will also be made to the network system for network packet capture. Observability tools collect all the spans and unify them into one or more traces, enabling you to see the path the request took during its lifecycle. If you have a problem like network latency or a system defect, it is now easier to dissect it (peel the onion) and find out which layer the error is in.
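
The "collect and unify" step described above can be sketched in a few lines of Python. The record layout (trace_id, service, op, start, end, error) is an assumption made for this example, and real tools use richer formats. The idea is that spans arrive from different services in no particular order, and grouping them by a shared trace ID and sorting by start time recovers the path of each request.

```python
from collections import defaultdict

# Hypothetical span records arriving from different services, out of order.
raw_spans = [
    {"trace_id": "t1", "service": "web", "op": "GET /blog/data-governance",
     "start": 3, "end": 9},
    {"trace_id": "t2", "service": "web", "op": "GET /", "start": 1, "end": 2},
    {"trace_id": "t1", "service": "web", "op": "GET /", "start": 0, "end": 1},
    {"trace_id": "t1", "service": "search", "op": "query",
     "start": 1, "end": 3, "error": True},
]


def unify(spans):
    """Group spans by trace_id and sort each group by start time."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    for group in traces.values():
        group.sort(key=lambda s: s["start"])
    return dict(traces)


def first_error(trace):
    """Return the earliest span flagged as an error, or None if there is none."""
    return next((s for s in trace if s.get("error")), None)


traces = unify(raw_spans)
path = [f'{s["service"]}:{s["op"]}' for s in traces["t1"]]
print(" -> ".join(path))

failing = first_error(traces["t1"])
print("first error in:", failing["service"] if failing else "none")
```

Once the spans are in order, finding the layer in which an error first appeared is a simple scan over the trace.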

Now, in a large distributed environment where your applications receive millions of requests, the volume of trace data grows enormously. Collecting and analyzing all of these traces is expensive in terms of storage and data transfer. So, to save costs, the trace data is sampled, because in most cases engineering teams need only some of the pieces to investigate what went wrong or what the error pattern is.
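
There are several ways to sample trace data, and the text above does not prescribe one. As one possible approach, the Python sketch below keeps a fixed fraction of traces by hashing the trace ID. The function name keep_trace and the 10% rate are assumptions for illustration. Hashing the ID, rather than drawing a random number per span, means that every service reaches the same decision for the same trace, so the traces that are kept stay complete.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Decide deterministically whether to keep a trace.

    The trace ID is hashed to a number in [0, 1), and the trace is kept
    when that number falls below sample_rate.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Hypothetical trace IDs, sampled at roughly 10%.
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = [t for t in trace_ids if keep_trace(t, 0.10)]
print(f"kept {len(kept)} of {len(trace_ids)} traces")
```

A fixed rate like this is cheap but can miss rare errors. Some teams instead decide after a trace completes, so that slow or failed requests are always kept, at the cost of buffering spans until the decision is made.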

With that small example, we can see that observability gives us much deeper insight into our systems. At a larger scale, engineering teams can capture and work on the sampled data to improve the current structure of the system, add new components or retire old ones, add another security layer, remove bottlenecks, and so on.

Should Organizations Choose Observability?

We all should understand that the end goals are better user experience and greater user satisfaction. And the path to achieving these goals can be made easier with an automated and proactive observability framework. Establishing a culture of continuous improvement and optimization is considered the optimal business and leadership approach.

In this age of digital transformation, observability has become a must-have for a business to succeed in its digital journey. By providing you with insightful traces, observability also steers you toward being data-informed rather than just data-driven.

Conclusion

Although we have used the terms monitoring and observability interchangeably, we have seen that while monitoring gives you information on the health of the system and the events happening on it, observability helps you make inferences based on evidence gathered from the deeper layers of an end-to-end environment.

Observability can also be perceived as a component of the Data Governance framework. In this generation, where ever-increasing data volumes reside on a network of commodity hardware, it is vital to keep architectures as simple as possible. Otherwise, managing the environment becomes an impossible task down the line. Implementing appropriate, automated governance policies and rules to keep your large mesh of systems, pipelines, and data decluttered therefore calls for action sooner rather than later.

Doyita Mitra
