Monitoring in the time of Cloud Native

Decision Making in the Time of Cloud Native

A plethora of tools at our disposal to adopt or buy, however, presents an entirely different problem — one of decision making.

Outline

— What even is observability and how is it different from Monitoring?
— An overview of the “three pillars of modern observability”: logging, metrics collection, and request tracing
— The pros and cons of each in terms of resource utilization, ease of use, ease of operation, and cost effectiveness
— An honest look at the challenges involved in scaling all three when they are used in conjunction
— What to monitor and how in a modern cloud native environment; what is better suited to being aggregated as metrics versus being logged; and how and when to use the data from all three sources to derive actionable alerts and insightful analysis
— When it makes sense to augment the three aforementioned tools with additional tools

What to “monitor” and how in a modern cloud native environment?

This post is titled Monitoring in the time of Cloud Native. I’ve been asked why I chose to call it monitoring and not observability. I was expecting more snark about the buzzword that’s actually in the title — Cloud Native — than the one conspicuous by its absence. I chose not to call it observability for this very same reason — two buzzwords was one too many for my liking.

Virtuous cycle of better observability

Monitoring

When I type “monitoring” into a search engine, the first results that come up are dictionary definitions: to observe and check the progress or quality of something over a period of time, and to maintain regular surveillance over it.

Observability

Observability, in my opinion, is really about being able to understand how a system is behaving in production. If “monitoring” is best suited to reporting the overall health of systems, “observability” aims to provide highly granular insight into the behavior of systems along with rich context, making it perfect for surfacing implicit failure modes and for generating, on the fly, the information required for debugging. Monitoring means being on the lookout for failures, which in turn requires us to be able to predict those failures proactively. An observable system, by contrast, is one that exposes enough data about itself that generating information (finding answers to questions yet to be formulated) and accessing that information become simple.

An interlude — Blackbox Monitoring

For the uninitiated, blackbox monitoring refers to the category of monitoring in which the system is treated as a blackbox and examined from the outside. While some believe that, with more sophisticated tooling at our disposal, blackbox monitoring is a thing of the past, I’d argue that it still has its place, especially now that large parts of core business and infrastructural components are outsourced to third-party vendors. The amount of control we have over a vendor’s performance may be limited, but having visibility into how the services we own are affected by the vagaries of these outsourced components is crucial, insofar as they shape our system’s performance as a whole.

Blackbox monitoring
Whitebox vs Blackbox monitoring

Whitebox Monitoring versus Observability

“Whitebox monitoring” refers to a category of monitoring based on information derived from the internals of systems. Whitebox monitoring isn’t really a revolutionary idea anymore; time series, logs and traces are all more in vogue than ever these days and have been for a few years.

Data and Information

The difference between whitebox monitoring and observability is really the difference between data and information. Information is commonly defined as data that has been processed, organized and placed in context so that it becomes meaningful.

Monitoring vs Observability

Observability isn’t just about data collection

Having access to data is a prerequisite for deriving information from it, but observability isn’t about data collection alone. Once we have the data, it becomes important to be able to extract answers, and therefore information, from it easily.

The Three Pillars of Observability

A more concrete example will help us understand logs, metrics and traces better. Let us assume our system or sub-system consists of a handful of services that cooperate, some synchronously and some asynchronously, to serve each request.

Logs

A log is an immutable record of discrete events that happened over time. Some people take the view that events are distinct from logs, but I’d argue that for all intents and purposes the two can be used interchangeably.
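
What such a discrete, context-rich event looks like in practice is easiest to show with a little code. Here is a minimal sketch using Go’s standard-library log/slog package; the event and field names (request_id, user_id and so on) are purely hypothetical.

```go
package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	// Emit logs as structured JSON events rather than free-form strings.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// A single immutable record of a discrete event, carrying rich local context.
	logger.Info("payment processed",
		slog.String("request_id", "9f86d081"), // hypothetical identifiers
		slog.String("user_id", "user-42"),
		slog.Duration("elapsed", 87*time.Millisecond),
		slog.Int("amount_cents", 1299),
	)
}
```

Because each line is a self-describing event, it can be searched, filtered and joined on its fields later, which is what makes logs so useful for drill-down analysis.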

Traces

A trace is a representation of a series of causally related distributed events that encode the end-to-end request flow through a distributed system. A single trace provides visibility into both the path traversed by a request and the structure of the request. The path allows us to understand which services were involved in servicing the request, and the structure helps one understand the junctures and effects of asynchrony in its execution.
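
To make the idea of causally related spans concrete, here is a minimal sketch using the opentracing-go API; the operation names, the tag and the handler are illustrative assumptions, and with no tracer registered the calls fall back to a no-op, so the sketch runs as-is.

```go
package main

import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
)

// handleCheckout is a hypothetical request handler that opens a root span
// and passes its context down so that child spans are causally linked to it.
func handleCheckout(ctx context.Context) {
	span, ctx := opentracing.StartSpanFromContext(ctx, "checkout")
	defer span.Finish()
	span.SetTag("user.tier", "premium") // illustrative tag

	chargeCard(ctx)
}

// chargeCard represents a downstream call; its span records the structure
// (the parent/child relationship) and the timing of this leg of the request.
func chargeCard(ctx context.Context) {
	span, _ := opentracing.StartSpanFromContext(ctx, "charge-card")
	defer span.Finish()
	// ... call the payment service and annotate the span with the outcome ...
}

func main() {
	// A real deployment would register a concrete tracer, for example via
	// opentracing.SetGlobalTracer, backed by whichever tracing system is in use.
	handleCheckout(context.Background())
}
```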

Metrics

Metrics are a numerical representation of data measured over intervals of time.

The anatomy of a Prometheus metric
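
To make that anatomy concrete, here is a minimal sketch using the official Prometheus Go client. A metric has a name, a help string and a set of labels, and each unique label combination yields one time series whose current value is read at scrape time. The metric and label names below are assumptions for illustration.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter vector: one time series per unique (handler, code) label pair.
var httpRequestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total", // metric name
		Help: "Total number of HTTP requests served.",
	},
	[]string{"handler", "code"}, // label names
)

func main() {
	prometheus.MustRegister(httpRequestsTotal)

	// Incrementing a counter has a constant, tiny cost no matter how often
	// it happens; a scrape of /metrics exposes one line per label
	// combination, along the lines of:
	//   http_requests_total{code="200",handler="/api/users"} 1027
	httpRequestsTotal.WithLabelValues("/api/users", "200").Inc()

	// Prometheus pulls the current values from this endpoint.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```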

The pros and cons of each in terms of resource utilization, ease of use, ease of operation, and cost effectiveness

Let’s evaluate each of the three pillars against the criteria above before we see how we can leverage the strengths of each to craft a great observability experience:

Logs

Logs are, by far, the easiest to generate, since no initial processing is involved. The fact that a log is just a string or a blob of JSON makes it incredibly easy to represent any data we want to emit in the form of a log line. Most languages, application frameworks and libraries come with built-in support for logging, and logs are easy to instrument, since adding a log line is about as trivial as adding a print statement. Logs also perform really well at surfacing highly granular information pregnant with rich local context, which is great for drill-down analysis, so long as our search space is localized to a single service.

Metrics

By and large, the biggest advantage of metrics-based monitoring over logs is that, unlike log generation and storage, metrics transfer and storage have a constant overhead. Unlike logs, the cost of metrics doesn’t increase in lockstep with user traffic or any other system activity that could result in a sharp uptick in data.

Metrics as blackbox frontends

Best practices

Given the aforementioned characteristics of logs, any talk about best practices for logging inherently embodies a tradeoff. There are a couple of approaches that I think can help ease the burden of log generation, processing, storage and analysis.

Quotas

We either log everything that might be of interest and pay a processing and storage penalty, or we log selectively, knowing that we are sacrificing fidelity but making it possible to still have access to important data. Most talk around logging revolves around log levels, but rarely have I seen quotas imposed on the amount of log data a service can generate. While Logstash and friends do have plugins for throttling log ingestion, most of these filters are based on keys and certain thresholds, with throttling happening after the event has been generated.
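
One way to enforce such a quota at the source, before an event is ever generated, is to gate log emission behind a token bucket inside the service itself. The sketch below is an assumption about how that might look in Go, using golang.org/x/time/rate; the per-second budget and burst size are made-up numbers.

```go
package main

import (
	"log/slog"
	"os"

	"golang.org/x/time/rate"
)

// quotaLogger drops events once the service exceeds its logging budget,
// instead of leaving a downstream pipeline to throttle events that have
// already been generated, shipped and partially processed.
type quotaLogger struct {
	logger  *slog.Logger
	limiter *rate.Limiter
}

func newQuotaLogger(eventsPerSecond float64, burst int) *quotaLogger {
	return &quotaLogger{
		logger:  slog.New(slog.NewJSONHandler(os.Stdout, nil)),
		limiter: rate.NewLimiter(rate.Limit(eventsPerSecond), burst),
	}
}

func (q *quotaLogger) Info(msg string, args ...any) {
	// Allow reports whether the event fits within the quota right now;
	// anything over budget is silently dropped at the source.
	if q.limiter.Allow() {
		q.logger.Info(msg, args...)
	}
}

func main() {
	logger := newQuotaLogger(100, 200) // hypothetical quota: 100 events/s, burst of 200
	for i := 0; i < 1000; i++ {
		logger.Info("cache miss", slog.Int("attempt", i))
	}
}
```

A production version would also want to surface how much is being dropped (itself a metric), but the point stands: the quota is enforced before the event ever leaves the process.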

Dynamic Sampling

With or without quotas, it becomes important to be able to dynamically sample logs, so that the rate of log generation can be adjusted on the fly to ease the burden on the log forwarding, processing and storage systems. (Recall the acquaintance I mentioned earlier who saw a 50% boost simply by turning off logging on EC2; that is the scale of overhead we are talking about.)
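
A minimal sketch of what dynamic sampling might look like inside a service follows, assuming the sample rate is pushed to the process at runtime (via a flag service, a config watch or an admin endpoint, none of which are shown): the rate lives in an atomically updated variable that is consulted on every prospective log event.

```go
package main

import (
	"log/slog"
	"math/rand"
	"os"
	"sync/atomic"
)

// sampleRate holds the probability (0..1, scaled by 1e6) that any given
// low-fidelity event is actually emitted. It can be changed on the fly.
var sampleRate atomic.Int64

func setSampleRate(p float64) { sampleRate.Store(int64(p * 1e6)) }

func sampled() bool {
	return rand.Int63n(1_000_000) < sampleRate.Load()
}

var logger = slog.New(slog.NewJSONHandler(os.Stdout, nil))

func logSampled(msg string, args ...any) {
	if sampled() {
		logger.Info(msg, args...)
	}
}

func main() {
	setSampleRate(1.0) // normal operation: keep every event

	// When the log pipeline is under pressure, an operator (or an automated
	// control loop) dials the rate down without redeploying the service.
	setSampleRate(0.01) // keep roughly 1 in 100 events

	for i := 0; i < 10_000; i++ {
		logSampled("request served", slog.Int("request", i))
	}
}
```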

Logging is a Stream Processing Problem

Log data isn’t used solely for application performance and debugging; it also forms the source of all analytics data. This data is often of tremendous utility from a business intelligence perspective, and businesses are usually willing to pay for both the technology and the personnel required to make sense of it in order to make better product decisions.

A new hope for the future

The fact that logging still remains an unsolved problem makes me wish for an OpenLogging spec, in the vein of OpenTracing, which serves as a shining example of, and a testament to, the power of community-driven development. A spec designed from the ground up for the cloud-native era, introducing a universal exposition format as well as a propagation format. A spec that enshrines that logs must be structured events and codifies rules around dynamic sampling for high-volume, low-fidelity events. A spec that can be implemented as libraries in all major languages and supported by all major application frameworks and middleware. A spec that allows us to make the most of advances in stream processing. A spec that becomes the lingua franca logging format of all CNCF projects, especially Kubernetes.

Metrics

Traces

While tracing has historically been difficult to implement, the rise of service meshes makes integrating tracing functionality almost effortless. Lyft famously got tracing support for all of its applications without changing a single line of code by adopting the service mesh pattern. Service meshes help with the DRYing of observability by implementing tracing and stats collection at the mesh level, which allows one to treat individual services as blackboxes and yet get incredible observability into the mesh as a whole. Even with the caveat that the applications forming the mesh need to forward headers to the next hop in the mesh, this pattern is incredibly useful for retrofitting tracing onto existing infrastructure with the least amount of code change.
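
That header-forwarding caveat is worth making concrete. A sidecar proxy can start and finish spans on a service’s behalf, but the application still has to copy the tracing headers from the inbound request onto any outbound requests it makes, or the hops cannot be stitched into a single trace. Below is a rough sketch of what that might look like in a Go HTTP handler; the downstream URL is hypothetical, and the exact header names depend on the mesh and tracer in use (the x-b3-* set shown is the Zipkin-style convention that Envoy-based meshes commonly propagate).

```go
package main

import "net/http"

// traceHeaders is the Zipkin/B3-style set commonly propagated by Envoy-based
// meshes; the exact list depends on the mesh and tracer configuration.
var traceHeaders = []string{
	"x-request-id",
	"x-b3-traceid",
	"x-b3-spanid",
	"x-b3-parentspanid",
	"x-b3-sampled",
	"x-b3-flags",
}

// forwardTraceHeaders copies tracing headers from the inbound request onto an
// outbound request, so the mesh can join the two hops into one trace.
func forwardTraceHeaders(in, out *http.Request) {
	for _, h := range traceHeaders {
		if v := in.Header.Get(h); v != "" {
			out.Header.Set(h, v)
		}
	}
}

func handler(w http.ResponseWriter, r *http.Request) {
	// Hypothetical downstream call made while servicing this request.
	req, err := http.NewRequest("GET", "http://inventory.internal/items", nil)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	forwardTraceHeaders(r, req)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/checkout", handler)
	http.ListenAndServe(":8080", nil)
}
```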

When it makes sense to augment the three aforementioned tools with additional tools

Exception trackers (I think of these as logs++) have come a long way in the last few years and provide a far better interface than a plaintext file or blobs of JSON for inspecting exceptions. Exception trackers also provide full tracebacks, local variables, inputs at every subroutine or method invocation, the frequency of occurrence of the error or exception, and other metadata invaluable for debugging. Exception trackers aim to do one thing: track exceptions and application crashes. They tend to do this really well. While they don’t eliminate the need for logs, exception trackers augment logs, if you’ll pardon the pun, exceptionally well.
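
As a sketch of how little code wiring one in takes, here is roughly what reporting an error to an exception tracker might look like using Sentry’s Go SDK (one tracker among many; the DSN is a placeholder and the failing operation is fabricated for illustration).

```go
package main

import (
	"errors"
	"log"
	"time"

	"github.com/getsentry/sentry-go"
)

func main() {
	// The DSN below is a placeholder; a real one identifies your project
	// in whichever exception tracker you use.
	if err := sentry.Init(sentry.ClientOptions{
		Dsn: "https://public@example.ingest.sentry.io/1",
	}); err != nil {
		log.Fatalf("sentry.Init: %v", err)
	}
	// Make sure buffered events are delivered before the process exits.
	defer sentry.Flush(2 * time.Second)

	if err := doWork(); err != nil {
		// The tracker captures the traceback, frequency and surrounding
		// metadata for this error, which is far richer than a bare log line.
		sentry.CaptureException(err)
	}
}

// doWork is a hypothetical operation that fails.
func doWork() error {
	return errors.New("could not reserve inventory")
}
```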

Conclusion

Observability isn’t quite the same as monitoring. Observability connotes something more holistic and encompasses “monitoring”, application code instrumentation, proactive instrumentation for just-in-time debugging and a culture of more thorough understanding of various components of the system.

Pre-Production Testing

Testing in Production

Monitoring

Monitoring isn’t dead. Monitoring, in fact, is so important that I’d argue it occupies pride of place in your observability spectrum.

Exploration

Dynamic Exploration

Unknowables

Choose your own Observability Adventure

Observability — in and of itself, and like most other things — isn’t particularly useful. The value derived from the observability of a system directly stems from the business value derived from that system.

Cindy Sridharan

@copyconstruct on Twitter. Views expressed on this blog are solely mine, not those of present or past employers.