Monitoring in the time of Cloud Native

Decision Making in the Time of Cloud Native

Outline

What to “monitor” and how in a modern cloud native environment?

Virtuous cycle of better observability

Monitoring

Observability

An interlude — Blackbox Monitoring

Blackbox monitoring

Whitebox vs Blackbox monitoring

Whitebox Monitoring versus Observability

Data and Information

  • We could use the data we’re gathering to watch for explicit failure modes with high-severity consequences (an imminent outage, in other words) that we’re trying to stave off or firefight, in which case we’re using the data to alert based on symptoms (see the sketch after this list).
  • We could use this data to gauge the overall health of a service, in which case we’re thinking in terms of overviews.
  • We could use this data to debug rare or implicit failure modes that we couldn’t have predicted beforehand, in which case we’re using the data to debug our systems.
  • We could also use the data for purposes like profiling, to better understand the behavior of our system in production even during its normal, steady state, in which case we’re using the data to understand our system as it exists today.
  • We might also want to understand how our service currently depends on other services, so we can tell whether our service is being impacted by another service or, worse, whether we are contributing to the poor performance of another service, in which case we’re using this data for dependency analysis.
  • We could also be more ambitious and aim not just to keep our system functional right now, but to gather the data we need to understand its behavior so that we can evolve and maintain it over its entire lifecycle. While solving tomorrow’s problems shouldn’t be today’s goal, it’s still important to be cognizant of them; there’s nothing worse than being blindsided by a problem only to realize we could’ve done better had we had visibility into it sooner. We can anticipate today’s known, hard failure modes and “monitor” for them, but tomorrow’s known, hard failure modes most often don’t exhibit themselves explicitly today. They need to be teased out of subtle behaviors our system exhibits only in certain anomalous situations or under certain traffic patterns that might be rare, not a cause for concern, or not immediately actionable today.
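
As a concrete illustration of the first use case above, here is a minimal, hypothetical sketch using the Python prometheus_client library (the metric names, thresholds, and simulated traffic are my own illustrative choices, not prescriptions from this post). The same instrumentation can feed symptom-based alerts, service overviews, and ad-hoc debugging queries:

    from prometheus_client import Counter, Histogram, start_http_server
    import random
    import time

    # Metric names are illustrative; they follow common Prometheus conventions.
    REQUESTS = Counter(
        "http_requests_total",
        "Total HTTP requests handled",
        ["method", "status"],          # labels let us slice by dimension later
    )
    LATENCY = Histogram(
        "http_request_duration_seconds",
        "Request latency in seconds",
    )

    def handle_request(method="GET"):
        start = time.time()
        status = "500" if random.random() < 0.01 else "200"   # simulated outcome
        REQUESTS.labels(method=method, status=status).inc()
        LATENCY.observe(time.time() - start)

    if __name__ == "__main__":
        start_http_server(8000)        # expose /metrics for Prometheus to scrape
        while True:
            handle_request()
            time.sleep(0.1)

A symptom-based alert would then fire on what users actually experience, say the fraction of 5xx responses or a high-percentile latency derived from these series, rather than on a cause such as CPU utilization.
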
Monitoring vs Observability

Observability isn’t just about data collection

The Three Pillars of Observability

Logs

Traces

Metrics

The anatomy of a Prometheus metric
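
To make the anatomy concrete, here is a small sketch of my own (again using the Python prometheus_client library; the names are assumptions for illustration) that prints the text exposition format, where each sample consists of a metric name, a set of label key-value pairs, and a numeric value, optionally accompanied by a timestamp:

    from prometheus_client import CollectorRegistry, Counter, generate_latest

    registry = CollectorRegistry()
    requests_total = Counter(
        "http_requests_total",                  # metric name
        "Total HTTP requests",                  # help text
        ["method", "status"],                   # label names
        registry=registry,
    )
    requests_total.labels(method="GET", status="200").inc(3)

    # Prints lines of the form: metric_name{label="value",...} sample_value
    print(generate_latest(registry).decode())

Running it prints the sample as something like http_requests_total{method="GET",status="200"} 3.0, preceded by # HELP and # TYPE metadata lines (the exact output varies slightly across client versions).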

The pros and cons of each in terms of resource utilization, ease of use, ease of operation, and cost effectiveness

Logs

Metrics

Metrics as blackbox frontends

Best practices

Quotas

Dynamic Sampling
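
One way to think about dynamic sampling is sketched below under my own assumptions (the event shape and thresholds are hypothetical): always keep the rare, high-signal events such as errors, while sampling high-volume, healthy traffic down to a target rate.

    import random

    def should_keep(event: dict, recent_success_rate: float,
                    target_per_sec: float = 100.0) -> bool:
        """Decide whether to retain an event (log line, span, etc.) for storage."""
        if event.get("status", 200) >= 500:
            return True        # errors are rare and valuable: keep them all
        if recent_success_rate <= target_per_sec:
            return True        # low traffic: keep everything
        # High traffic: keep roughly target_per_sec successful events per second on average.
        return random.random() < target_per_sec / recent_success_rate

In practice, systems that sample this way typically also store the sampling probability alongside each kept event so that aggregates can be re-weighted at query time.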

Logging is a Stream Processing Problem

A new hope for the future

Metrics

Traces

When it makes sense to augment the three aforementioned tools with additional tools

Conclusion

Pre-Production Testing

Testing in Production

Monitoring

Exploration

Dynamic Exploration

Unknowables

Choose your own Observability Adventure
