Health Checks and Graceful Degradation in Distributed Systems

The Two Types of Health Checks

Health checks, even in many modern systems, typically fall into two categories: host-level health checks and service-level health checks.
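
To make the distinction concrete, here is a minimal sketch of the two kinds of check. It assumes that a host-level check only asks whether something is accepting TCP connections on a given port, while a service-level check asks whether the application can complete a small unit of real work. The names `CheckResult`, `host_level_check`, `service_level_check` and the `probe` callback are hypothetical and exist only for this illustration.

```python
import socket
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    healthy: bool
    detail: str = ""


def host_level_check(host: str, port: int, timeout: float = 1.0) -> CheckResult:
    """Assumed host-level check: is anything accepting TCP connections here?"""
    try:
        # create_connection raises OSError on refusal, timeout or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return CheckResult("host", True)
    except OSError as exc:
        return CheckResult("host", False, str(exc))


def service_level_check(probe: Callable[[], bool]) -> CheckResult:
    """Assumed service-level check: can the service do a unit of real work?

    `probe` is a hypothetical callback, e.g. a cheap query against a
    dependency the service cannot function without.
    """
    try:
        ok = probe()
        return CheckResult("service", ok, "" if ok else "probe returned False")
    except Exception as exc:
        # A probe that blows up counts as unhealthy, not as a crashed checker.
        return CheckResult("service", False, repr(exc))
```

A process can pass the first check and still fail the second, for example when the port is open but a required dependency is unreachable.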

Health is a Spectrum, not a Binary Taxonomy

The Need for Feedback Loops when applying Backpressure

Matt Ranney has a phenomenal blog post about unbounded concurrency and the need for backpressure in Node.js. The entire post is well worth a read, but the biggest takeaway (at least for me) was the need for feedback loops between a process and its downstream (usually a load balancer, but sometimes this could also be another service).
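
As a rough sketch of what such a feedback loop could look like, the snippet below bounds the number of in-flight requests and answers with a 503 once the bound is hit, so that whatever sits in front of the process gets an explicit signal that it is saturated. This is not taken from the post referenced above; `InFlightLimiter`, `handle` and the choice of 503 as the signal are assumptions made for illustration.

```python
import threading
from typing import Callable, Tuple


class InFlightLimiter:
    """Caps concurrent requests instead of accepting unbounded work."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Reserve a slot; return False if the process is saturated."""
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1

    def utilization(self) -> float:
        """Fraction of slots in use; could be exported as a health signal."""
        with self._lock:
            return self._in_flight / self.max_in_flight


def handle(limiter: InFlightLimiter, work: Callable[[], str]) -> Tuple[int, str]:
    """Run `work` if there is capacity, otherwise push back on the caller."""
    if not limiter.try_acquire():
        # The 503 is the feedback: the caller learns it should back off.
        return 503, "overloaded, retry later"
    try:
        return 200, work()
    finally:
        limiter.release()
```

The important property is that rejection is cheap and explicit: the process sheds load it cannot serve rather than queueing it without bound.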

Image from my presentation on the Prometheus monitoring system at Google NYC in November 2016
Alert taken from my presentation on the Prometheus monitoring system at OSCON in May 2017
Myriad forms of rate limiting and load shedding techniques:
  1. Queueing Theory in Practice: Performance Modeling for the Working Engineer, Eben Freeman from LISA 2017
  2. Stop Rate Limiting — Capacity Planning Done Right, Jon Moore from Strangeloop 2017
  3. Predictive Load Balancing: Unfair but Faster and More Robust, Steve Gury from Strangeloop 2017
  4. The chapters on Handling Overload and Addressing Cascading Failures from the SRE Book


Control loops and backpressure are already a solved problem in protocols like TCP/IP (where congestion control algorithms infer load rather than being told about it), IP ECN (an explicit mechanism for signalling load, or near-load), and Ethernet (through mechanisms such as PAUSE frames).
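
To illustrate the load-inference idea, here is a minimal sketch of an additive-increase, multiplicative-decrease (AIMD) control loop, the general shape used by classic TCP congestion control, applied to a concurrency limit instead of a congestion window. It is a simplification and not an implementation of TCP; the `AimdLimit` class, its parameters and their default values are all assumptions chosen for the example.

```python
class AimdLimit:
    """Adjusts a limit from observed outcomes rather than explicit signals."""

    def __init__(
        self,
        initial: float = 10.0,
        minimum: float = 1.0,
        maximum: float = 1000.0,
        increase: float = 1.0,
        decrease_factor: float = 0.5,
    ) -> None:
        if not 0.0 < decrease_factor < 1.0:
            raise ValueError("decrease_factor must be between 0 and 1")
        if not minimum <= initial <= maximum:
            raise ValueError("initial must lie within [minimum, maximum]")
        self.limit = initial
        self.minimum = minimum
        self.maximum = maximum
        self.increase = increase
        self.decrease_factor = decrease_factor

    def on_success(self) -> float:
        """Additive increase: probe gently for more capacity."""
        self.limit = min(self.maximum, self.limit + self.increase)
        return self.limit

    def on_overload(self) -> float:
        """Multiplicative decrease: back off quickly on a timeout or rejection."""
        self.limit = max(self.minimum, self.limit * self.decrease_factor)
        return self.limit
```

A caller would invoke `on_success` after each request that completes normally and `on_overload` after a timeout or an explicit rejection, then admit at most `int(limit)` concurrent requests.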



Cindy Sridharan


@copyconstruct on Twitter. Views expressed on this blog are solely mine, not those of present or past employers.