Distributed Tracing — we’ve been doing it wrong

The Many Faces of Tracing

Distributed tracing comprises several disparate components:

The Problem with the Traceview

A traceview is meant to provide a bird’s-eye view of the lifecycle of a single request as it traverses every component of a distributed system. Some of the more advanced tracing systems also offer the ability to drill down into individual spans and examine the breakdown of timing within a single process (when spans are emitted at function boundaries).
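To make the discussion concrete, here is a minimal sketch of the data model a traceview renders — an OpenTelemetry-flavored span with a causal link to its parent, where a trace is simply the tree of spans sharing a trace ID. All names and timings are illustrative, not any particular vendor's schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    """A hypothetical span: one timed operation within a trace."""
    span_id: str
    trace_id: str
    service: str
    operation: str
    start_ms: float
    end_ms: float
    parent_id: Optional[str] = None  # None marks the root span

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

def build_trace_tree(spans):
    """Group spans by parent_id so a traceview can render the hierarchy."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)
    return children

spans = [
    Span("a", "t1", "frontend", "GET /home", 0, 120),
    Span("b", "t1", "auth", "VerifyToken", 5, 25, parent_id="a"),
    Span("c", "t1", "feed", "FetchFeed", 30, 110, parent_id="a"),
]
tree = build_trace_tree(spans)
root = tree[None][0]  # the traceview walks the tree from the root span
```

The traceview is essentially a Gantt-chart rendering of this tree — which is exactly why, as argued below, it bottoms out at the span as its unit of analysis.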

Spans are too low level

The fundamental problem with the traceview is that a span is too low-level a primitive for both latency and “root cause” analysis. It’s akin to looking at individual CPU instructions to debug an exception when a much higher level entity like a backtrace would benefit day-to-day engineers the most.

Alternatives to the traceview

Trace data is most useful when there exist incisive visualizations to surface vital insights about what’s happening in interconnected parts of a system. Until such visualizations exist, debugging remains very much reactive, hinging on a user’s ability to make the right correlations, inspect the right parts of the system, or slice and dice across the right dimensions, as opposed to the tool guiding the user toward formulating these hypotheses.

Service-Centric Views

With the industry consolidating around the ideas of SLOs and SLIs, it seems reasonable that individual teams should be primarily responsible for ensuring their services meet these goals. It then follows that for such teams, the best-suited visualization is a service-centric view. Such a view might surface, among other things:

  1. latency distribution graphs when the service’s SLOs aren’t met
  2. most “common”, “interesting” or “weird” tags in requests that are being most frequently retried
  3. latency breakdowns when the service’s dependencies aren’t meeting their SLO
  4. latency breakdown by different downstream services
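Item 4 above can be sketched from raw trace data: aggregate a service's outbound spans into a latency breakdown per downstream dependency. The call records below are hypothetical (service, downstream, duration in ms) tuples extracted from spans.

```python
from collections import defaultdict

# Illustrative call records mined from trace data; names are made up.
calls = [
    ("checkout", "payments", 40.0),
    ("checkout", "payments", 55.0),
    ("checkout", "inventory", 12.0),
    ("checkout", "inventory", 9.0),
    ("checkout", "payments", 48.0),
]

def latency_breakdown(calls, service):
    """Mean latency per downstream dependency of `service`.

    A real service-centric view would show full distributions
    (p50/p95/p99), not a single summary number.
    """
    by_dep = defaultdict(list)
    for svc, dep, ms in calls:
        if svc == service:
            by_dep[dep].append(ms)
    return {dep: sum(v) / len(v) for dep, v in by_dep.items()}

breakdown = latency_breakdown(calls, "checkout")
```

The point is that this rollup is computed *from* spans but presented at the granularity a service owner actually reasons at: "which dependency is costing me latency?"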

Service Topology Graphs

Service-centric views can be incredibly helpful once the user knows which service or group of services is contributing to increased latency or errors. However, in a complex system, identifying the offending service can be a non-trivial problem at the time of an incident, especially if the individual services haven’t triggered alerts.

[Figure: a hypothetical service view graph of a newspaper front page.]
[Figure: a dynamic service topology graph, showing me only the “interesting” services.]
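One way such a dynamic topology view could work, sketched under purely illustrative assumptions: build the service graph from span parent/child edges, then keep only the "interesting" nodes — here crudely defined as services whose error rate exceeds a threshold — along with their immediate neighbors for context.

```python
# Hypothetical service graph derived from span parent/child relationships.
edges = [("frontend", "feed"), ("frontend", "auth"), ("feed", "storage")]
# Hypothetical per-service error rates over some recent window.
error_rate = {"frontend": 0.001, "feed": 0.07, "auth": 0.0, "storage": 0.12}

def interesting_subgraph(edges, error_rate, threshold=0.05):
    """Return only edges touching a misbehaving service.

    Keeping an edge if *either* endpoint is over threshold shows the
    offending services in the context of their neighbors, instead of
    rendering the entire topology at once.
    """
    hot = {svc for svc, rate in error_rate.items() if rate >= threshold}
    return [(a, b) for a, b in edges if a in hot or b in hot]

sub = interesting_subgraph(edges, error_rate)
```

"Interesting" could just as well mean SLO burn rate, retry amplification, or latency regression — the threshold on error rate here is only a stand-in for whatever signal the tool surfaces.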

Comparison Views

Another useful visualization would be a comparison view. Traces currently do not lend themselves well to being juxtaposed, since comparing two traces, for all intents and purposes, amounts to comparing individual spans. And the very mainspring of this article is to hammer home the point that spans are too low-level to meaningfully unearth the most valuable insights from trace data.
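A comparison view that sidesteps span-by-span alignment might instead juxtapose two *populations* of traces at the service level — say, a baseline window versus an incident window — and flag which services regressed. The inputs below are hypothetical maps of per-service latency samples (in ms) pulled from trace data.

```python
from statistics import median

# Illustrative latency samples per service for two trace populations.
baseline = {"frontend": [100, 110, 105], "feed": [40, 42, 41]}
incident = {"frontend": [102, 108, 107], "feed": [90, 95, 100]}

def regressions(baseline, incident, factor=1.5):
    """Flag services whose median latency grew by more than `factor`.

    Returns {service: (baseline_median, incident_median)} for regressed
    services only, so the view surfaces the anomaly rather than asking
    the user to eyeball two span waterfalls.
    """
    out = {}
    for svc in baseline:
        b, i = median(baseline[svc]), median(incident[svc])
        if i > b * factor:
            out[svc] = (b, i)
    return out

regressed = regressions(baseline, incident)
```

A real tool would compare full distributions and tag breakdowns rather than medians, but the shape of the idea is the same: compare aggregates of traces, not pairs of spans.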

Conclusion

I’m not questioning the utility of tracing data itself. I truly believe there’s no other observability signal that is pregnant with data more rich, causal and contextual than that carried by a trace. However, I also believe that this data is woefully underutilized by every last tracing solution out there. As long as tracing products cleave to the traceview in any way, shape or form, they will be limited in their ability to truly leverage the invaluable information that can be extracted from trace data. Moreover, this runs the risk of propping up a very user-unfriendly and unintuitive visual interface, effectively stymieing the user’s ability to debug effectively.

Cindy Sridharan