Distributed Tracing — we’ve been doing it wrong
Many thanks to Ramon Nogueira and Sargun Dhillon for reading a draft of this post.
Distributed Tracing is often considered hard to deploy and its value proposition questionable at best. A variety of reasons are attributed to why tracing is “difficult”, an apocryphal concern being the difficulty in instrumenting every component of a distributed system to forward the appropriate headers along with every request. While this might be a valid concern, ultimately it’s not an insurmountable problem. This, incidentally, doesn’t explain why tracing isn’t typically used much by developers, even once it is deployed.
The hard part about distributed tracing isn’t collecting the traces, standardizing on trace propagation and exposition formats, or deciding when, where and how to sample. I’m by no means trivializing these “ingest problems”; indeed, there exist significant technical and (if we’re looking at truly open source standards and protocols) political challenges to overcome before these can be considered “solved problems”.
However, even if hypothetically all of these problems were to be solved, there’s a good likelihood that nothing much has significantly changed as far as the end user experience is concerned. Tracing might still remain something that, once deployed, doesn’t unlock enough value to be of any practical use in the most common debugging scenarios.
The Many Faces of Tracing
Distributed tracing comprises several disparate components:
- instrumentation of application and middleware
- distributed context propagation
- trace ingest
- trace storage
- trace retrieval and visualization
An awful lot of conversations about distributed tracing pivot around treating it as some sort of unary undertaking whose sole purpose is to aid in full system diagnostics, in no small part owing to how distributed tracing was historically marketed. The blog post that was written when Zipkin was open-sourced mentioned that it [Zipkin] makes Twitter faster. Initial commercial tracing offerings, likewise, were marketed as APM tools.
Traces contain incredibly valuable data with the potential to aid in efforts such as: testing in production, running disaster recovery tests, enabling fault injection testing and so forth. Some companies, in fact, already use trace data for such purposes. For a start, universal context propagation has several other uses than to simply ferry spans into a storage system. Uber is known to use tracing data to demarcate test traffic from production traffic. Facebook is known to use trace data for critical path analysis and to shift traffic during regular disaster recovery tests. Facebook is also known to use Jupyter notebooks to enable developers to run arbitrary queries over trace data. The LDFI folks are known to use distributed traces for fault injection testing. None of these strictly pertains to the debugging scenario, where an engineer is trying to troubleshoot an issue by looking at a trace.
When it does come to the debugging scenario, the primary interface remains what I call the traceview (though some people also refer to this as a Gantt chart or waterfall diagram). By traceview, I refer to all the spans and attendant metadata that together constitute a trace. Every open source tracing system as well as commercial tracing solution offers a traceview based UI to visualize, query and filter traces.
The problem with every last tracing system I’ve seen so far is that the ultimate visualization (the traceview) closely mirrors the core implementation detail of how a trace is generated. Even when alternative visualizations are offered (heatmap, service topology views, latency histograms) they ultimately lead to a traceview.
I’ve lamented in the past how most of the “innovation” I see happening in the tracing space on the UI/UX front appears to be limited to including additional metadata in a trace, embedding high-cardinality information in traces, or providing the ability to drill down into specific spans or enabling intertrace and intratrace queries, all while cleaving to the traceview as the primary visualization medium. So long as this remains the case, distributed tracing will, at best, come a poor fourth to metrics, logs and stacktraces for debugging purposes, and at worst end up proving to be a time and money sink.
The Problem with the Traceview
A traceview is meant to provide a bird’s eye view of the lifecycle of a single request across every component of a distributed system that the request traverses. Some of the more advanced tracing systems also offer the ability to drill down into individual spans and look at the breakdown of timing within a single process (when spans are emitted at function boundaries).
The foundational premise of a microservices architecture is that as the business requirements grow more complex, so will the organization structure. Proponents of microservices argue that decomposing different business functionality into standalone services will enable small, autonomous teams to own the end-to-end lifecycle of such services, unlocking the ability to build, test and deploy these services independently. However, since any such decomposition comes at the cost of a loss of visibility into how each service interacts with the others, distributed tracing is purported to be an indispensable tool for debugging the complex interactions between the services in unison.
If you truly have a mind-bogglingly complex distributed system, then no one single person can have a complete understanding of the system in their head at any given time. In fact, building tooling under the assumption that this is even possible or desirable seems a bit of an anti-pattern. What’s ideally required at the time of debugging is a tool that’ll help reduce the search space, so that engineers can zero in on a subset of dimensions (services/users/hosts etc.) that are of interest to the debugging scenario at hand. Requiring engineers to understand what happened across the entire service graph at the time of debugging an incident seems counter to the ethos of microservices architectures in the first place.
And yet, a traceview is precisely that. Admittedly, some tracing systems provide condensed traceviews when the number of spans in a trace is so large that they cannot be displayed in a single visualization. Yet, the amount of information encapsulated even in such pared down views still squarely puts the onus on the engineers to sift through all the data the traceview exposes and narrow down the set of culprit services. This is a task at which machines are faster, more repeatable and less error-prone than humans.
Another reason why I’ve grown to believe that the traceview is the wrong abstraction is because it lends itself poorly to a hypothesis-driven debugging approach. Debugging is fundamentally an iterative process which involves starting with a hypothesis followed by the inspection of various observations and facts reported by the system along different axes, making deductions and testing whether the hypothesis holds water.
Being able to quickly and cheaply test hypotheses and refine one’s mental model accordingly is the cornerstone of debugging. Any tool that aims to assist in the process of debugging needs to be an interactive tool that helps to either whittle down the search space or in the case of a red herring, help the user backtrack and refocus on a different area of the system. And the ideal tool would do this proactively, drawing the user’s attention to potentially problematic areas.
A traceview is fundamentally anything but an interactive interface. The best I can hope to do with a traceview is to see a source of increased latency and view any tags and logs associated with it. This doesn’t help one spot patterns in traffic such as modes of latency distribution, nor does it provide the ability to correlate across different dimensions. Aggregate analysis of traces can help get around some of these problems. Indeed, there have been reports of successful trace analysis using machine learning to spot anomalous spans and identify the subset of tags that might be contributing to the anomalous behavior. However, I have yet to see a convincing visualization of any such insights unearthed by applying machine learning or data mining on spans that’s significantly different from a traceview or a DAG.
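As a rough illustration of the kind of aggregate analysis mentioned above, the sketch below ranks tags by how much more often they appear on slow spans than on spans overall. It is not how any particular tracing product works; the span layout (`duration_ms`, `tags`), the function name and the sample data are all assumptions made up for the example.

```python
from collections import Counter

def tags_overrepresented_in_slow_spans(spans, slow_ms, min_lift=2.0):
    """Rank (key, value) tags by how over-represented they are among slow spans.

    `spans` is a hypothetical list of dicts with `duration_ms` and `tags`.
    Lift = share of slow spans carrying the tag / share of all spans carrying it.
    """
    overall = Counter()
    slow = Counter()
    n_slow = 0
    for span in spans:
        is_slow = span["duration_ms"] >= slow_ms
        n_slow += is_slow
        for item in span["tags"].items():
            overall[item] += 1
            if is_slow:
                slow[item] += 1
    if n_slow == 0:
        return []
    results = []
    for item, slow_count in slow.items():
        share_slow = slow_count / n_slow
        share_all = overall[item] / len(spans)
        lift = share_slow / share_all
        if lift >= min_lift:
            results.append((item, round(lift, 2)))
    return sorted(results, key=lambda r: r[1], reverse=True)

# Made-up spans: the two slow ones share a build tag but not a region.
spans = [
    {"duration_ms": 20, "tags": {"region": "us-east", "build": "v1"}},
    {"duration_ms": 25, "tags": {"region": "us-east", "build": "v1"}},
    {"duration_ms": 22, "tags": {"region": "eu-west", "build": "v1"}},
    {"duration_ms": 900, "tags": {"region": "eu-west", "build": "v2"}},
    {"duration_ms": 850, "tags": {"region": "us-east", "build": "v2"}},
    {"duration_ms": 30, "tags": {"region": "us-east", "build": "v1"}},
]
suspects = tags_overrepresented_in_slow_spans(spans, slow_ms=500)
print(suspects)
```

With this sample data only the `build=v2` tag clears the lift threshold, which is the sort of answer a user would otherwise have to reach by opening traces one at a time.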
Spans are too low level
The fundamental problem with the traceview is that a span is too low-level a primitive for both latency and “root cause” analysis. It’s akin to looking at individual CPU instructions to debug an exception when a much higher level entity like a backtrace would benefit day-to-day engineers the most.
Furthermore, I’d argue that what is ideally required isn’t the entire picture of what happened during the lifecycle of a request that modern day traces depict. What is instead required is some form of higher level abstraction of what went wrong (analogous to the backtrace) along with some context. Instead of seeing an entire trace, what I really want to be seeing is a portion of the trace where something interesting or unusual is happening. Currently, this process is entirely manual: given a trace, an engineer is required to find relevant spans to spot anything interesting. Humans eyeballing spans in individual traces in the hopes of finding suspicious behavior simply isn’t scalable, especially when they have to deal with the cognitive overhead of making sense of all the metadata encoded in all the various spans like the span ID, the RPC method name, the duration of the span, logs, tags and so forth.
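One way to picture a backtrace-like abstraction over a trace is sketched below: find the span with the largest self time and report only the chain of ancestors leading to it, innermost first. This is a simplification under stated assumptions: the span fields are hypothetical, children are assumed to run sequentially (so self time is duration minus the sum of child durations), and the numbers are invented.

```python
def slowest_path(spans):
    """Return the ancestor chain of the span with the largest self time.

    Assumes child spans do not overlap in time, so
    self time = own duration - sum of direct children's durations.
    The result is ordered innermost first, like a stack trace.
    """
    by_id = {s["id"]: s for s in spans}
    child_time = {s["id"]: 0 for s in spans}
    for s in spans:
        if s["parent_id"] is not None:
            child_time[s["parent_id"]] += s["duration_ms"]
    self_time = {sid: by_id[sid]["duration_ms"] - child_time[sid] for sid in by_id}

    culprit = max(self_time, key=self_time.get)
    path = []
    current = culprit
    while current is not None:
        span = by_id[current]
        path.append((span["service"], span["operation"], self_time[current]))
        current = span["parent_id"]
    return path

# A made-up trace of one page load.
trace = [
    {"id": 1, "parent_id": None, "service": "frontpage", "operation": "GET /", "duration_ms": 480},
    {"id": 2, "parent_id": 1, "service": "redis", "operation": "GET", "duration_ms": 5},
    {"id": 3, "parent_id": 1, "service": "video", "operation": "fetch_video", "duration_ms": 420},
    {"id": 4, "parent_id": 1, "service": "ads", "operation": "fetch_ads", "duration_ms": 30},
    {"id": 5, "parent_id": 3, "service": "s3", "operation": "GetObject", "duration_ms": 390},
    {"id": 6, "parent_id": 3, "service": "dynamodb", "operation": "GetItem", "duration_ms": 12},
]
for service, operation, self_ms in slowest_path(trace):
    print(f"{service:10} {operation:12} self={self_ms}ms")
```

Three lines of output replace six spans here; on a trace with thousands of spans the reduction is what matters.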
Alternatives to the traceview
Trace data is most useful when there exist incisive visualizations to surface vital insights about what’s happening in interconnected parts of a system. Until this is the case, the process of debugging remains very much reactive, hinging on a user’s ability to make the right correlations and inspect the right parts of the system or slice and dice across the right dimensions, as opposed to the tool guiding the user into formulating these hypotheses.
With the caveat that I’m not a visual designer or a UX researcher, in the following section, I want to bandy about a couple of ideas of what such visualizations might look like.
With the industry consolidating around the ideas of SLOs and SLIs, it seems reasonable that individual teams should be primarily responsible for ensuring their services meet these goals. It then follows that for such teams, the best suited visualization is a service-centric view.
Traces, especially when unsampled, have a treasure trove of information about every single component of a distributed system. This data can be mined by a sophisticated trace processing engine to tease out service-centric insights to users. Examples of service-centric insights that can be surfaced upfront without requiring a user to look at traces are:
- latency distribution graphs of only the outlier requests
- latency distribution graphs when the service’s SLOs aren’t met
- most “common”, “interesting” or “weird” tags in requests that are being most frequently retried
- latency breakdowns when the service’s dependencies aren’t meeting their SLO
- latency breakdown by different downstream services
Some of these are questions pre-aggregated metrics don’t have a prayer of answering and requiring users to pore over spans to deduce these answers results in an awfully hostile user experience.
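To make one of the listed insights concrete, here is a minimal sketch of a latency breakdown by downstream service, restricted to requests that missed the SLO. It assumes a trace processing engine has already flattened each request into a hypothetical record with `total_ms` and a `downstream_ms` mapping; those names and the sample numbers are illustrative only.

```python
from collections import defaultdict
from statistics import mean

def downstream_breakdown_when_slo_missed(requests, slo_ms):
    """Mean time spent in each downstream service, over SLO-breaching requests only.

    Returns (breakdown, number_of_breaching_requests).
    """
    breaching = [r for r in requests if r["total_ms"] > slo_ms]
    per_dependency = defaultdict(list)
    for request in breaching:
        for dependency, ms in request["downstream_ms"].items():
            per_dependency[dependency].append(ms)
    breakdown = {dep: mean(values) for dep, values in per_dependency.items()}
    return breakdown, len(breaching)

# Made-up requests handled by a single service with two dependencies.
requests = [
    {"total_ms": 80, "downstream_ms": {"s3": 40, "dynamodb": 10}},
    {"total_ms": 90, "downstream_ms": {"s3": 50, "dynamodb": 12}},
    {"total_ms": 400, "downstream_ms": {"s3": 350, "dynamodb": 11}},
    {"total_ms": 520, "downstream_ms": {"s3": 470, "dynamodb": 13}},
]
breakdown, breaching_count = downstream_breakdown_when_slo_missed(requests, slo_ms=200)
print(breaching_count, breakdown)
```

A pre-aggregated metric cannot answer this question because the grouping (only the requests that breached the SLO) is decided per request, after the fact.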
This raises the question: what about complex interactions between various services owned by different teams, the sort that a traceview is considered best-suited to shine a light on?
Mobile developers, stateless service owners, owners of managed stateful services like databases and platform owners might be interested in a different view of the distributed system; a traceview is a one-size-fits-all solution to these disparate needs. Even in a very complex microservices architecture, service owners don’t need very deep knowledge of more than two or three upstream and downstream services. As a matter of fact, in most scenarios, it should be sufficient for users to answer questions pertaining to a limited number of services.
This is not dissimilar to placing a magnifying glass on a small subgraph of services to examine them with a fine-tooth comb. This will enable the user to ask more pressing questions regarding the complex interaction between these services and their immediate dependencies. This is analogous to the backtrace in the services world, where one knows what is wrong and has some context about the surrounding services to help deduce the why.
The approach I’m championing here is the antithesis of a top-down traceview based approach, where one starts with an entire trace and then progressively drills down into individual spans. On the contrary, something of a bottom-up approach involves starting an investigation with one’s attention laser-focused on a small search space closer to a potential cause of an incident and then expanding the search space if needed (which might then involve enlisting the help of other teams to analyze a larger set of services). The latter approach lends itself better to the quick and dirty validation of initial hypotheses, until one has the smoking gun one needs to embark on a more focused and detailed investigation.
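The bottom-up idea can be sketched as a search that starts at one suspect service and widens one hop at a time. The code below is only an illustration: it assumes the call graph has already been derived from traces as (caller, callee) pairs, and the service names are hypothetical.

```python
from collections import deque

def neighborhood(calls, start, hops):
    """Services within `hops` calls of `start`, following calls in either direction."""
    adjacent = {}
    for caller, callee in calls:
        adjacent.setdefault(caller, set()).add(callee)
        adjacent.setdefault(callee, set()).add(caller)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == hops:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacent.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)

# Hypothetical call graph.
calls = [
    ("gateway", "checkout"),
    ("checkout", "payments"),
    ("checkout", "inventory"),
    ("payments", "ledger"),
    ("ledger", "postgres"),
]
print(sorted(neighborhood(calls, "payments", hops=1)))  # start narrow
print(sorted(neighborhood(calls, "payments", hops=2)))  # widen only if needed
```

Starting at one hop keeps the search space to the services a team already knows; increasing `hops` is the point at which other teams would be pulled in.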
Service Topology Graphs
Service-centric views can be incredibly helpful once the user knows which service or group of services is contributing to increased latency or errors. However, in a complex system, identifying the offending service can be a non-trivial problem at the time of an incident, especially if the individual services haven’t triggered alerts.
Service topology views can be very helpful in pinpointing which service is showing a spike in error rate or increased latency leading to a user visible degradation in service quality. By a service topology view, I’m not referring to a service map, which depicts every single service in the system, famously used to depict “death-star” architectures. That’s no better than a DAG based traceview. Instead, what I’d love to see upfront are dynamically generated service topology views based on specific attributes like error rate or response time or even any user defined attribute, which will help the user to frame questions around specific suspect services.
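A dynamically generated topology of this sort could, in its simplest form, be the set of service-to-service edges whose error rate crosses a threshold. The sketch below assumes each call observed in trace data has been reduced to a hypothetical record with `caller`, `callee` and `error` fields; the threshold and the sample calls are invented.

```python
from collections import defaultdict

def suspect_edges(calls, min_error_rate):
    """Keep only the caller -> callee edges whose error rate is at least `min_error_rate`."""
    counts = defaultdict(lambda: [0, 0])  # edge -> [errors, total calls]
    for call in calls:
        edge = (call["caller"], call["callee"])
        counts[edge][0] += call["error"]  # True counts as 1
        counts[edge][1] += 1
    rates = {edge: errors / total for edge, (errors, total) in counts.items()}
    return {edge: rate for edge, rate in rates.items() if rate >= min_error_rate}

# Made-up calls: most edges are healthy, one is clearly not.
calls = (
    [{"caller": "frontpage", "callee": "ads", "error": False}] * 4
    + [{"caller": "frontpage", "callee": "video", "error": e} for e in (True, False, False, False)]
    + [{"caller": "video", "callee": "s3", "error": e} for e in (True, True, True, False)]
)
for (caller, callee), rate in suspect_edges(calls, min_error_rate=0.2).items():
    print(f"{caller} -> {callee}: {rate:.0%} errors")
```

The same shape works for response time or a user-defined attribute by swapping what is accumulated per edge; the healthy `frontpage -> ads` edge never reaches the screen.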
To provide a more concrete example, the following is a service graph of a hypothetical newspaper webpage as implemented by a number of commercial tracing products. A frontpage service talks to Redis, a recommendations service, an ads service and a video service. The video service fetches videos from S3 and metadata from DynamoDB. The recommendations service fetches metadata from DynamoDB, data from Redis and MySQL, and writes events to Kafka. The ads service fetches data from MySQL and writes events to Kafka.
The service graph described below depicts this topology. This can be useful if one is trying to understand service dependencies. However, in the debugging scenario where a specific service (say the video service) is experiencing increased response times, such a topology view isn’t particularly useful.
An improvement is the diagram depicted below, which places the problem service (the video service) right at the front and center. This draws a user’s attention to this service off the bat. From this visualization, it becomes obvious that the video service is acting anomalously due to a slowdown in S3 response times, which is impacting the page load time of a portion of the frontpage.
Dynamically generated service topology graphs can be more powerful than static service maps, especially in elastic, auto-scaled infrastructures. Being able to compare and contrast such service topologies immediately paves the way for the user to start asking more germane questions. Asking better questions of the system increases the odds of the user formulating a better mental model of how the system is behaving.
Another useful visualization would be a comparison view. Traces currently lend themselves none too well to being juxtaposed on top of each other, since then, for all intents and purposes, one is essentially comparing spans. And the very mainspring of this article is to hammer home the point that spans are too low-level to meaningfully be able to unearth the most valuable insights from trace data.
Comparing two traces doesn’t require groundbreaking new visualizations. As a matter of fact, something like a stacked bar chart showing the same data that a traceview depicts can be surprisingly more insightful than having to look at two individual traces separately. Even more powerful would be the ability to visualize the comparison of traces in aggregate. It’d be enormously useful to see how a newly deployed GC configuration change of a database impacts the response time of a downstream service over the course of a few hours. If what I’m describing here sounds like A/B testing the impact of infrastructural changes across multiple services with the help of trace data, then that wouldn’t be too far off the mark.
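An aggregate comparison of this kind might start from something as simple as per-service medians for two cohorts of traces, for example those recorded before and after a change. The sketch below assumes each trace has already been reduced to a hypothetical mapping of service name to time spent in milliseconds; the service names and numbers are invented.

```python
from statistics import median

def compare_cohorts(baseline, candidate):
    """Per-service (median before, median after, difference) for two groups of traces."""
    def medians(traces):
        per_service = {}
        for trace in traces:
            for service, ms in trace.items():
                per_service.setdefault(service, []).append(ms)
        return {service: median(values) for service, values in per_service.items()}

    before, after = medians(baseline), medians(candidate)
    # Only services seen in both cohorts can be compared.
    return {
        service: (before[service], after[service], after[service] - before[service])
        for service in before.keys() & after.keys()
    }

# Made-up traces before and after a database configuration change.
baseline = [
    {"database": 10, "orders-api": 30},
    {"database": 12, "orders-api": 34},
    {"database": 11, "orders-api": 32},
]
candidate = [
    {"database": 20, "orders-api": 41},
    {"database": 22, "orders-api": 45},
    {"database": 21, "orders-api": 43},
]
comparison = compare_cohorts(baseline, candidate)
for service in sorted(comparison):
    before_ms, after_ms, delta = comparison[service]
    print(f"{service:12} {before_ms}ms -> {after_ms}ms ({delta:+}ms)")
```

A median is used here only to keep the example short; a real comparison would more likely look at full distributions or tail percentiles, and would need enough traces per cohort for the difference to mean anything.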
I’m not questioning the utility of tracing data itself. I truly believe there’s no other observability signal that is pregnant with data more rich, causal and contextual than that carried by a trace. However, I also believe that this data is woefully underutilized by every last tracing solution out there. As long as tracing products cleave to the traceview in any way, shape or form, they will be limited in their ability to truly leverage the invaluable information that can be extracted from trace data. Moreover, this runs the risk of propping up a very user-unfriendly and unintuitive visual interface, effectively stymieing the user’s ability to debug more effectively.
Debugging complex systems, even with the state-of-the-art tooling, is incredibly hard. Tools need to strive to assist a developer in the process of forming and validating a hypothesis by proactively surfacing relevant information, outliers and latency distribution characteristics. For tracing to become the tool developers and operators reach for immediately when debugging a production incident or customer issue that spans multiple services, what is required is novel user interfaces and visualizations that correspond more closely to the mental model of developers building and operating these services.
It will take an enormous amount of careful thought to design a system that will layer different signals available in trace data in a manner optimized for easy exploration and deduction. There needs to be thought put into how the system’s topology at the time of debugging can be abstracted in a way that helps a user in getting over their blind spots without the user ever having to look at an individual trace or a span.
What we need is good abstractions and layering, especially in the UI, the sort that can lend itself well toward hypothesis driven debugging, where one can iteratively ask questions and validate conjectures. While this isn’t going to automatically solve every last observability problem, it can greatly help users hone their sense of intuition and enable them to ask better questions. What I’m calling for is more thought and innovation to happen on the visualization front. There’s a real opportunity to push the envelope.