Testing in Production, the safe way

Why test in production when one can test in staging?

The Art of Testing in Production

Testing in production has historically carried with it a certain stigma and negative connotations linked to cowboy programming, insufficient or absent unit and integration testing, as well as a certain recklessness or lack of care for the end user experience.

From my post “Testing Microservices, the sane way”

The Three Phases of “Production”

For the most part, discussions about “production” are only ever framed in the context of “deploying” code to production or monitoring or firefighting production when something goes awry.

Phase 1 — Deploy

With testing (even in production!) only ever being a best effort verification, the accuracy of testing (or any form of verification for that matter) is contingent on the tests being run in a manner that resembles the way in which the service might be exercised in production in the closest possible way.

Phase 2 — Release

The blog post Deploy != Release goes on to define “release” as:

Phase 3 — Post-Release

If the release of the service went without a hitch and the newly released service is serving production traffic without any immediate issues, we can consider the release to be successful. What follows a successful release is what I call the “post-release” phase.

Testing in Production during the Deploy Phase

Assuming we can divorce the “deploy” phase from the “release” phase as outlined above, let’s explore some of the forms of testing that can be performed once the code is “deployed” to production.

Integration Testing

Traditionally, integration testing is performed by a CI server in an isolated “testing” environment over every git branch. A copy of the entire service topology (including databases, queues, caches, proxies and so forth) is spun up for the test suites of all the services to be run against each other.

Integration test for a deployed version of service B run against a released version of service C where the writes never make it to the database

Shadowing (also known as Dark Traffic Testing or Mirroring)

I find shadowing (also known as a dark launch according to a blog post from Google or in istio parlance called mirroring) to be more beneficial than integration testing in many cases.

Tap Compare

I first heard the word “tap-compare” during a discussion with Matt Knox at the Testing in Production meetup this January. The only reference to the term I’ve been able to find has been in Twitter’s blog post on the topic of productionizing a high SLA service.

Load Testing

For those unfamiliar with what load testing is, this article should be a great starting point. There’s no dearth of open source load testing tools and frameworks, the most popular ones being Apache Bench, Gatling, wrk2, Erlang based Tsung, Siege, Scala based Twitter’s Iago (which scrubs the logs of an HTTP server or a proxy or a network sniffer and replays it against a test instance). I’ve also heard from an acquaintance that mzbench really is the best in class for generating load, and supports a wide variety of protocols such as MySQL, Postgres, Cassandra, MongoDB, TCP etc. Netflix’s NDBench is another open source tool for load testing data stores and comes with support for most of the usual suspects.

From the whitepaper on Kraken

Config Tests

Testing in Production — Release

Once a service has been tested after it’s deployed, it needs to be released.


Canarying refers to a partial release of a service to production. A subset of production now consists of the canaries which are then sent a small percentage of actual production traffic after they pass a basic health check. The canary population is monitored as it serves traffic, the metrics being monitored for are compared with the non-canary metrics (the baseline), and if the canary metrics aren’t within the acceptable threshold, a rollback ensues. While canarying is a practice generally discussed in the context of releasing server-side software, canarying client-side software is increasingly common as well.


Monitoring is a must-have at every phase of the production rollout, but is especially important during the release phase. “Monitoring” is best suited to report the overall health of systems. Aiming to “monitor everything” can prove to be an anti-pattern. For monitoring to be effective, it becomes salient to be able to identify a small set of hard failure modes of a system or a core set of metrics. Examples of such failure modes are:

Exception Tracking

I put exception tracking under the release phase though to be fair, I think it has equal utility during the deploy phase and during the post-release phase as well. Exception trackers often don’t get the same scrutiny or hype as some of the other Observability tooling, but in my experience I’ve found exception trackers enormously useful.

Traffic Shaping

Traffic shaping or traffic shifting really isn’t so much a standalone form of testing than something to assist with canarying and the gradual release of the new code. In essence, traffic shifting is achieved by updating the configuration of the load balancer to gradually route more traffic to the newly released version.

Testing in Production — Post-Release

Post-release testing takes the form of verification that happens once we’ve satisfactorily released code. At this stage, we’re confident that the code is largely correct, was successfully released to production and is handling production traffic as expected. The code deployed is in real use either directly or indirectly, serving real customers or doing some form of work that has some meaningful impact for the business.

Feature Flagging or Dark Launch

The oldest blog post I can find about a company successfully using feature flags dates back to almost a decade ago. There’s a website featureflags.io to serve as a comprehensive guide to all things feature flagging.

A/B Testing

A/B testing often falls under the experimental analysis and isn’t necessarily seen as a form of testing in production. As such it’s not only widely (sometimes controversially) used but also extensively researched and written about (including what makes good metrics for online experiments). What’s possibly less common is A/B testing being used for testing different hardware or VM configurations, but these are often referred to as “tuning” (for example, JVM tuning) as opposed to being seen as an A/B test (even if in principle, tuning can very much be seen as a form of A/B test to be undertaken with the same level of rigor when it comes to measurement).

Logs/Events, Metrics and Tracing

Known as the “three pillars of Observability”, logs, metrics and distributed traces are something I’ve written about extensively in the past.


Profiling applications in production is sometimes required to diagnose performance related problems. Depending on the language and runtime support, profiling can be as easy as adding a single line of code to an application (import _ "net/http/pprof " in the case of Go) or might require significant instrumentation to applications or by means of blackbox inspection of a running process and inspecting the output with a tool like flamegraphs.


A lot of people perceive teeing as similar to shadowing, since in both cases production traffic is replayed against a non-production cluster or process. The difference, the way I see it, comes down to the fact that replaying production traffic for testing purposes is slightly different from replaying traffic for debugging purposes.

Chaos Engineering


The goal of testing in production isn’t to completely eliminate all manner of system failures. To quote John Allspaw:



@copyconstruct on Twitter. views expressed on this blog are solely mine, not those of present or past employers.

Love podcasts or audiobooks? Learn on the go with our new app.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Cindy Sridharan

Cindy Sridharan


@copyconstruct on Twitter. views expressed on this blog are solely mine, not those of present or past employers.