Testing in Production: the hard parts

Blast Radius

In early July 2019, Cloudflare had a 30 minute global outage caused by a deploy of code that was meant to be “dark launched”.

Prevention and Mitigation of Test Mishaps

AWS famously has a “Correction of Errors” template in their postmortem document, where engineers involved in the incident are required to answer the question “how could you cut the blast radius for a similar event in half?” In my experience, complex systems can and often do fail in unanticipated ways. Improving the resilience of such a service becomes an undertaking with ever-shifting goalposts.

Safe and Staged Deploys

Easily one of the most impactful areas of investment would be divorcing deploys from releases. These two posts [1] [2] explain the difference between a deploy and a release and why it becomes so important to delineate the two. In a nutshell: a deploy gets a new version of the code installed and running on production infrastructure without exposing it to any user traffic, whereas a release is the act of steering production traffic onto that newly deployed version.
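
As a concrete illustration, a release can be gated behind a deterministic cohort check so that already-deployed code receives only a small, adjustable slice of traffic. The following is a minimal sketch, not taken from the posts above; the rollout percentage, hashing scheme and handler names are illustrative assumptions:

    import hashlib

    # Hypothetical release gate: the new version is already deployed and running,
    # but it receives traffic only for the cohort selected here. Raising the
    # percentage is the "release"; the "deploy" happened earlier and changed
    # nothing user-visible.
    ROLLOUT_PERCENT = 1  # start by releasing to 1% of users

    def in_release_cohort(user_id: str, percent: int = ROLLOUT_PERCENT) -> bool:
        """Deterministically bucket a user into [0, 100) so the same user
        consistently sees the same version during a staged rollout."""
        digest = hashlib.sha256(user_id.encode()).digest()
        bucket = int.from_bytes(digest[:2], "big") % 100
        return bucket < percent

    def serve_new_version(user_id: str) -> str:
        return f"v2 response for {user_id}"   # the newly released code path

    def serve_current_version(user_id: str) -> str:
        return f"v1 response for {user_id}"   # the stable, fully released path

    def handle_request(user_id: str) -> str:
        if in_release_cohort(user_id):
            return serve_new_version(user_id)
        return serve_current_version(user_id)

    if __name__ == "__main__":
        released = sum(in_release_cohort(f"user-{i}") for i in range(10_000))
        print(f"{released / 100:.1f}% of simulated users routed to the new version")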

Quick Service Restoration

When practicing staged rollouts, it’s imperative to be able to mitigate the impact should something go awry, before it causes further failures upstream or triggers a cascading failure. Service unavailability or degraded performance due to a test run in production contributes to the “error budget” of a service (or whatever other heuristic is used to track SLOs over time). Too many mishaps when testing in production can burn through the error budget, leaving little leeway for scheduled maintenance and other operational contingencies that might require taking a service offline.
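
To make the budget arithmetic concrete, here is a small sketch of how minutes of unavailability, whatever their cause, draw down a single shared budget; the 99.9% target and the incident durations are made-up numbers for illustration:

    # Sketch: how test mishaps and everything else draw down one shared error budget.
    # The SLO target and incident durations below are illustrative assumptions.
    SLO_TARGET = 0.999              # 99.9% availability over a 30-day window
    WINDOW_MINUTES = 30 * 24 * 60   # 43,200 minutes in the window

    error_budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES  # ~43.2 minutes

    incidents_minutes = {
        "bad canary during a production test": 12,
        "scheduled maintenance": 10,
        "dependency outage": 15,
    }

    consumed = sum(incidents_minutes.values())
    remaining = error_budget_minutes - consumed

    print(f"budget={error_budget_minutes:.1f} min "
          f"consumed={consumed} min remaining={remaining:.1f} min")
    # A couple of test mishaps of this size leave almost no headroom for planned
    # maintenance or genuine surprises later in the window.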

To Crash or Not To Crash

Serving degraded responses or operating in a degraded mode is often a core design feature of distributed systems. Examples of such degraded modes of operation include: falling back to serving canned or cached responses when a database is unavailable, operating when components used for performance optimization (such as caches) are unavailable, servicing reads but not writes (a newspaper website temporarily disabling comments but continuing to serve static content), and so forth. A twenty-year-old paper that proposes the notions of harvest and yield as a way of thinking about system availability provides a blueprint for reasoning about these tradeoffs (strictly speaking, the paper proposes the harvest and yield model as an alternative to the CAP principle; I personally see it not so much as an alternative as a corollary).
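
As a sketch of one such degraded mode, the read path below prefers the database but falls back to a possibly stale cached copy when the database is unreachable, trading harvest (the completeness and freshness of the answer) for yield (the probability of answering at all). The in-process cache and database stand-ins are assumptions for illustration:

    import time

    # In-process stand-ins for a cache and a database; in a real system these
    # would be external dependencies.
    _cache: dict[str, tuple[float, str]] = {}   # key -> (stored_at, value)

    class DatabaseUnavailable(Exception):
        pass

    def read_from_database(key: str) -> str:
        raise DatabaseUnavailable("primary is down")   # simulate an outage

    def get_article(key: str) -> tuple[str, bool]:
        """Return (value, degraded). Prefer fresh data; fall back to a stale
        cached copy rather than failing the request outright."""
        try:
            value = read_from_database(key)
            _cache[key] = (time.time(), value)
            return value, False
        except DatabaseUnavailable:
            if key in _cache:
                _, stale_value = _cache[key]
                return stale_value, True    # degraded: possibly stale answer
            raise                           # nothing cached; surface the error

    if __name__ == "__main__":
        _cache["front-page"] = (time.time() - 60, "cached front page")
        value, degraded = get_article("front-page")
        print(value, "(degraded)" if degraded else "(fresh)")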

Change One Thing At A Time

This is generally good advice with universal applicability. Testing in production introduces a change to the production environment; the people operating the service don’t have much of an inkling as to whether the test will succeed or fail. It becomes important not to couple this change with another, be it a change to a configuration option, a shift in the traffic patterns to the service, or another test run simultaneously.

Multi-tiered Isolation

An important action item of the GCP outage from 2018 linked above concerned isolation, and isolation isn’t limited to the service being tested; it extends to:

  • the tooling that is used to operate the service (orchestrators that schedule or deschedule a service, agents that monitor a service, tools used for debugging a binary in production)
  • any ancillary work a system needs to perform (such as submitting stats or logs to an agent, which strictly isn’t in the path of the request but is a salient part of operating and understanding a service)
  • last but not least, human operators

Divorce the Control Plane from the Data Plane

This post by Marc Brooker offers a good explanation of how to reason about the design of control planes and data planes. The control plane and the data plane typically have very different request patterns, operational characteristics and availability requirements. Generally speaking, data planes need to be highly available, even in the face of control plane unavailability, whereas control planes need to favor correctness and security above all. Not having a clean separation between the data plane and the control plane often leads to fragile systems that are painful to debug, hard to scale and, perhaps most importantly, hard to test.
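
One practical consequence of this separation is sketched below: the data plane serves requests from an in-memory copy of its routing configuration and treats control plane refreshes as best-effort, so a control plane outage stalls configuration updates without stopping request serving. The configuration shape and the fetch function are illustrative assumptions:

    import threading

    class ControlPlaneUnavailable(Exception):
        pass

    def fetch_config_from_control_plane() -> dict:
        raise ControlPlaneUnavailable("control plane unreachable")   # simulate an outage

    class DataPlane:
        def __init__(self, initial_config: dict):
            self._config = initial_config     # last known good configuration
            self._lock = threading.Lock()

        def handle_request(self, path: str) -> str:
            # Request serving never waits on the control plane.
            with self._lock:
                backend = self._config["backend"]
            return f"routed {path} to {backend}"

        def refresh_config(self) -> None:
            # Best-effort: on failure, keep serving with the last known good config.
            try:
                new_config = fetch_config_from_control_plane()
                with self._lock:
                    self._config = new_config
            except ControlPlaneUnavailable:
                pass

    if __name__ == "__main__":
        dp = DataPlane({"backend": "pool-a"})
        dp.refresh_config()                    # fails; config is left untouched
        print(dp.handle_request("/index"))     # still served despite the outage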

Control planes all the way down.

Eschew Global Synchronized State

Shared mutable state in code is often considered to be the root of all evil. The same can be said of systems: global mutation of state, especially control plane data, is something to be actively eschewed, for the simple reason that the blast radius of such a system encompasses every service that depends on this state. To cite this excellent Netflix blogpost about why distributed mutable state is a last resort in the context of load balancers (a small sketch of the snapshot-and-delta point follows the list):

  • ideally entirely stateless, with the minimal number of dependencies
  • if stateful, doesn’t require strong consistency guarantees or write quorums
  • persists snapshots of a known good state, so if it’s required to rebuild its state, it can synchronize just the delta from the source of truth.
  • ideally self-healing
  • well monitored
  • must crash “safely”
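
The snapshot-and-delta point above can be sketched as follows: the component restores its last persisted good state and then pulls only the changes it has missed from the source of truth. The snapshot path, version numbers and change feed are illustrative assumptions, not the implementation described in the Netflix post:

    import json
    import pathlib

    SNAPSHOT = pathlib.Path("/tmp/lb-hosts-snapshot.json")

    def load_snapshot() -> tuple[int, dict]:
        """Return (version, hosts) from the last persisted known good state."""
        if SNAPSHOT.exists():
            data = json.loads(SNAPSHOT.read_text())
            return data["version"], data["hosts"]
        return 0, {}

    def persist_snapshot(version: int, hosts: dict) -> None:
        SNAPSHOT.write_text(json.dumps({"version": version, "hosts": hosts}))

    def fetch_changes_since(version: int) -> tuple[int, dict]:
        """Stand-in for the source of truth: return only changes newer than
        `version`, not the full data set."""
        changes = {3: {"10.0.0.3": "healthy"}, 4: {"10.0.0.1": "draining"}}
        delta, latest = {}, version
        for v, change in sorted(changes.items()):
            if v > version:
                delta.update(change)
                latest = v
        return latest, delta

    def rebuild_state() -> dict:
        version, hosts = load_snapshot()                # start from known good state
        version, delta = fetch_changes_since(version)   # sync just the delta
        hosts.update(delta)
        persist_snapshot(version, hosts)                # persist the new known good state
        return hosts

    if __name__ == "__main__":
        persist_snapshot(2, {"10.0.0.1": "healthy", "10.0.0.2": "healthy"})
        print(rebuild_state())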

Other Considerations

There are certain additional considerations that don’t strictly impact blast radius but are nonetheless salient when it comes to testing in production.

Client Resilience and Client-Side Metrics

Testing server-side software in production requires some degree of cooperation from the client, where a “client” might be the browser, a mobile app, the “edge” (CDNs) or a downstream server. If the test in question is verifying that the latency of the test request is within a certain threshold, measuring it client-side makes the most sense. If the test is verifying that utilization or throughput is within acceptable bounds, then measuring it server-side makes the most sense.
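
A minimal sketch of the client-side case: measure wall-clock latency where the request originates, and tag the request so the server can segregate test traffic from organic traffic. The header name and URL are illustrative assumptions:

    import time
    import urllib.request

    TEST_HEADERS = {"X-Test-Request": "true"}   # lets the server segregate test traffic

    def timed_request(url: str, headers: dict) -> tuple[int, float]:
        """Issue the request and measure latency as the client sees it, which
        includes network time the server can never observe on its own."""
        request = urllib.request.Request(url, headers=headers)
        start = time.monotonic()
        with urllib.request.urlopen(request, timeout=5) as response:
            status = response.status
            response.read()
        latency_ms = (time.monotonic() - start) * 1000
        return status, latency_ms

    if __name__ == "__main__":
        status, latency_ms = timed_request("https://example.com/", TEST_HEADERS)
        # A real client would report this to a metrics pipeline rather than print it.
        print(f"status={status} client_observed_latency_ms={latency_ms:.1f}")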

Invest in Observability

It’s impossible to test effectively in production without a rock-solid observability pipeline. Observability signals help surface outliers, visualize the impact and scope of a change, and alert when a test run in production produces an unexpected output.
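
One concrete prerequisite is being able to slice every signal emitted during a test run by a test-versus-baseline dimension, so that the impact of the test is visible, and alertable, on its own. A small sketch using structured log events; the field names and cohort labels are assumptions:

    import json
    import logging
    import random
    import time

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log = logging.getLogger("requests")

    def record_request(route: str, cohort: str, latency_ms: float, status: int) -> None:
        # Every event carries the cohort, so dashboards and alerts can compare
        # "test" traffic against "baseline" traffic on the same charts.
        log.info(json.dumps({
            "ts": time.time(),
            "route": route,
            "cohort": cohort,            # "test" or "baseline"
            "latency_ms": round(latency_ms, 1),
            "status": status,
        }))

    if __name__ == "__main__":
        for _ in range(3):
            record_request("/checkout", random.choice(["test", "baseline"]),
                           random.uniform(20, 80), 200)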

Conclusion

I’ve grown to believe that “testing in production” is a bit of a misnomer. “Testing production” or “verifying production” seems more apt a description. Users are interacting with the production environment at all times, so in a manner of speaking, production is already constantly being exercised.
