On-call doesn’t have to suck

Cindy Sridharan
Feb 10, 2018

I can’t quite imagine anyone enjoying being on-call. I’ve never particularly enjoyed it myself. I still remember the very first time I went on-call several years ago. I was on-call for an entire week and the overarching feeling throughout the week was one of… foreboding, even when things weren’t on fire.

At the end of the week, I felt pretty drained and had absolutely no intention to protract the process any longer. Thus at the end of my on-call shift, I felt happy to be back on familiar territory of writing code uninterrupted. It’s perhaps worth mentioning here that I probably felt that way in no small part because I’m a developer and have never in my career been an SRE or an Operations Engineer.

I’ve since then invariably shouldered on-call responsibilities at every company I’ve worked at. I don’t consider this to be a badge of honor; I’ve now grown accustomed to seeing it as something that’s simply part of my job.

I’ve worked at companies in the past where I’ve shared on-call responsibilities with an entire team of people as well as at companies where I was the sole person on-call for multiple services for years on end, without quite having a secondary to escalate pages to (something that was only possible since I got paged very infrequently). None of these different flavors of on-call was fun or enjoyable. So far be it from me to be the person that romanticizes on-call.

That said, I also believe on-call doesn’t have to suck or be something people dread. And that if it does, it’s probably symptomatic of a team’s engineering prowess as well as priorities/culture.

This tweet definitely got enough people riled up, which I found incredibly surprising. To be honest, I didn’t set out with the intention to start this conversation about being on-call. It just so happened that yesterday, I read a blog post about how a startup functioned just fine without having a dedicated Operations team and where the developers were empowered to troubleshoot any and every production issue.

This approach is something I’m seeing gain increased adoption at smaller companies (primarily San Francisco based startups). Having worked at a startup where I’d been on-call for all of the services I authored, it made sense to me that this model was sufficient for a small company or a small (fewer than 5) to mid-sized (fewer than 12) team within a larger company.

In particular, recent strides in the cloud infrastructure space, the advent of extremely sophisticated tooling, and broadly easy-to-use APIs vastly lower the bar for software engineers to shoulder all of the Operational responsibilities for their services.

SWE + “on-call” != SWE

The reason I started speaking about on-call was owing to some of the responses I’d been getting to the previous tweet. One especially stood out to me.

To which I’d responded:

As previously stated, my intention isn’t to romanticize being on-call. It’s not anyone’s idea of fun, be it SREs or software engineers (SWEs).

To be fair, there are several things that aren’t certain types of engineers’ idea of fun. For some, writing documentation isn’t any fun. For others, refactoring isn’t fun. And to reiterate, being on-call is no one’s idea of fun.

However, being on-call is uniquely encroaching upon an engineer’s personal time in ways writing documentation or refactoring code or improving test coverage are not. While the latter can be done during work hours, on-call is something that is all-consuming, effectively handcuffing one to their laptop and phone for the entire duration of the on-call. While people working at startups or people used to working long hours might be slightly better prepared for embarking on an on-call shift, I can understand why it might seem especially horrendous to people who consider their evenings, nights and weekends off-limits for work related activities.

In other words, it becomes a different category of job, one not traditionally associated with what being a “software engineer” entails. There exist vast swathes of people to whom this, in and of itself, might be reason enough to consider on-call to “suck”.

It simply wasn’t the job they thought they were signing up for.

Being a software engineer is an extremely privileged position to be in and the world (or at least the Bay Area) is one’s oyster. I think it’s eminently reasonable to suggest that people who are so averse to the very idea of being on-call aren’t going to be served best by being on a team or company that has the expectation that developers be on-call for their own services, especially when there exist a large number of extremely interesting fields of software engineering that simply don’t require operating a highly available service and which are more in line with people’s traditional ideas of what a “software development job” looks like.

Folks who’re going to be turned off by the very idea of on-call per se are going to be turned off by any position involving on-call anyway (no matter how sustainable or humane the on-call culture and practices are), so are those folks really the ones worth expending time on trying to recruit into a services team?

More importantly, it raises the question of how the presence of these folks might affect the overall dynamics of the team in general.

Having worked at companies where a subset of engineers on a small team simply refused to concern themselves with anything Operations related about the very code they were authoring (including not so much as knowing how it was packaged or deployed or monitored or debugged), I can attest to the fact that this attitude runs the risk of turning into a source of latent resentment among team members only to fester to the surface when the going gets rough.

And that is one thing I can wholeheartedly confirm truly sucks.

On-call and diversity in tech

It’s very much true that diversity in tech leaves a lot to be desired. This is even more acutely felt in certain fields of software engineering: women, for instance, are very poorly represented in SRE/Operations roles and general systems programming roles as opposed to, say, data science. A (fairly valid) concern was raised that if being on-call came with the territory of being a “software engineer”, then tech would become even less accessible to people who’re already dismally represented in the industry.

I believe this is a bit of a false dichotomy in addition to (inadvertently) reinforcing gender and other stereotypes (like for instance the idea that women, minorities, folks with kids or folks of a certain age necessarily have more obligations outside of work, which can then in turn implicitly bias recruiting and hiring decisions). If on-call at a company were so bad that women were turned off by roles involving on-call, it’s odds-on that male engineers would be turned off by it too (and indeed, this appears to be truly the case, if my Twitter mentions from men are anything to go by).

Simply put, building a humane on-call culture and long-term sustainable practices can, in fact, help attract more candidates to the company, as opposed to turning off folks. And the groundwork for this needs to come from companies, instead of renaming Systems Administrator to SRE and creating a silo that then becomes this bastion underrepresented minorities would be excluded from.

On-call is a direct reflection of engineering skills

Yes, this did prove to be rather unpopular indeed.

And looking back, it’s obvious to me that when I first tweeted this, I hadn’t spent enough time thinking about the people who find the very idea of being on-call abhorrent per se (I’ve covered this above). I’d been thinking about some of the people whom I know personally who’ve become bitter or plainly burnt out owing to terrible on-call scars that were the result of poor (or non-existent) management, culture and engineering “prowess” (a word that, again very surprisingly, elicited some brickbats).

These aren’t people who think on-call “sucks” because the very possibility that they might be very rarely paged at 3:00am and might be required to follow a runbook, fix the problem and go back to bed in five minutes sounds unpalatable to them. Au contraire, these are the folks who think on-call “sucks” because on-call, for them, well and truly sucked.

On-call can “suck” for a plethora of reasons (and when these repeatedly occur during multiple on-call rotations without being prioritized to be resolved, burnout is inevitable):

— noisy alerts
— alerts that aren’t actionable
— alerts missing crucial bits of information
— outages caused by bad deploys owing to bad release engineering practices and risk management
— lack of visibility into the behavior of services during the time of an outage or alert that makes debugging difficult
— outages that could’ve been prevented with better monitoring
— high profile outages that are due to architectural flaws or single points of failure
— outdated runbooks or non-existent runbooks
— not having a culture of performing blameless postmortems
— lack of accountability
— not following up with the action items of the postmortem
— not being transparent enough internally or externally
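
The first few items on this list (noisy alerts, alerts that aren’t actionable, alerts missing information, missing runbooks) are the ones that lend themselves most readily to a periodic review. What follows is only a minimal sketch of what such a review could look like, not a description of any particular tool: the `AlertStats` record, its fields, the thresholds and the example URL are all hypothetical, and a real team would pull these numbers from whatever paging system it uses.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AlertStats:
    """Hypothetical record of one alert rule over a single on-call rotation."""
    name: str
    summary: str                 # text the on-call engineer sees in the page
    runbook_url: Optional[str]   # None when no runbook exists
    times_fired: int
    times_acted_on: int          # pages that led to a real intervention


def audit_alert(alert: AlertStats, max_fires: int = 10,
                min_action_ratio: float = 0.5) -> List[str]:
    """Return the pain points this alert exhibits (thresholds are arbitrary assumptions)."""
    problems = []
    if alert.times_fired > max_fires:
        problems.append("noisy")
    if alert.times_fired > 0 and alert.times_acted_on / alert.times_fired < min_action_ratio:
        problems.append("not actionable")
    if not alert.summary.strip():
        problems.append("missing information")
    if not alert.runbook_url:
        problems.append("no runbook")
    return problems


def audit_rotation(alerts: List[AlertStats]) -> Dict[str, List[str]]:
    """Map each problematic alert name to the list of problems found for it."""
    report = {}
    for alert in alerts:
        problems = audit_alert(alert)
        if problems:
            report[alert.name] = problems
    return report


# Made-up data for one rotation; the runbook URL is a placeholder.
rotation = [
    AlertStats("disk_almost_full", "Disk usage above 90% on db-1",
               "https://wiki.example.com/runbooks/disk", 2, 2),
    AlertStats("cpu_spike", "", None, 40, 1),
]
report = audit_rotation(rotation)
print(report)
```

With the made-up data above, only `cpu_spike` ends up in the report, flagged on all four counts. The point isn’t the specific thresholds but that such a report gives the team something concrete to prioritize after each rotation.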

On-call, like the operability of a service, isn’t a silo. It’s in many ways a microcosm of the engineering skills of an organization (resilience of the systems being built as well as the quality of monitoring, alerting and automation) which in turn is a reflection of the quality of management and prioritization (engineering culture).

A healthy on-call culture involves building resilient systems in the first place, in addition to having a small set of hard failure modes that can be “monitored” (so there aren’t a bazillion spammy alerts), having good contingency measures (which enable good runbooks), good testing and release engineering practices (so deploys aren’t a frequent source of outages), iterative refactoring of the system design and architecture in lockstep with iteration of testing, monitoring and alerting as well (so as to avoid large scale re-architectures down the road) to name a few, all of which involves engineering, yes, “prowess”.

Which in turn is contingent on the “engineering culture” — which needs to be in alignment with prioritizing all of the aforementioned, in addition to acknowledging that checking all of the above boxes is still only ever a best effort attempt at building a healthy service which still isn’t immune to unexpected failures.

There’s no such thing as a “perfect on-call”, just as there’s no such thing as a “perfect system”. Systems will fail. Full stop. The best we can do is build systems that are “good enough”, so that on-call doesn’t become chaotic and miserable.

Or to quote Liz, a Staff SRE at Google:

I’ve in the past argued that not everyone is an Operations engineer. I still believe this is true, even in a world where developers are on-call.

Service developers need to be responsible for not just writing code but also for managing the entire life-cycle of the service, ensuring its health, maintainability, observability, ease of debugging and its ultimate graceful demise. This includes being responsible for deployments, rollbacks, monitoring and debugging, in addition to bug fixes and new feature development.

We’re looking at a future where responsibilities are shared. Operations, per se, is not everyone’s job. What, however, is everyone’s job is ensuring a holistic software lifecycle, achieved when SWEs and SREs work together. And the first step toward this future is to actually have a healthy respect for Operations and on-call (even if developers personally don’t find it “fun”), and statements along the lines of “I’m a brilliant developer but I hate on-call” aren’t doing the cause any favors.

Investment in skills

A healthy on-call rotation is only possible if the engineers developing the system are incentivized and encouraged to proactively think about its operability at the time of system design as well as development. It also requires the engineer who is going to be on-call to have a deep understanding of the system’s design, architecture, trade-offs and shortcomings, in addition to possessing at the very least a baseline set of Operational know-how.

Healthy on-call rotations are a reflection of an engineering organization where the engineers developing the system and the engineers on-call (be it developers or managers or Operations engineers) are set up for success, which in turn is only possible if an organization prioritizes and invests in the sort of skills required to enable developers to build the kind of services that would then be easy to operate and easy to debug.

I’ve never heard of an on-call onboarding program for developers, but if the expectation is that developers must be on-call for their own services, then they need to receive a certain level of training before their very first on-call rotation (something I’ve lacked at every job where I’ve ever been put on-call), in addition to continued training and support as they ramp up.

Putting developers on-call won’t automatically fix service reliability issues or improve the on-call experience any more than hiring an SRE team will prove to be the silver bullet. Like code, the on-call experience is something that needs to be subject to constant review, assessment and refactoring, else there isn’t a hope in hell of it being successful.

Toward a more sustainable and humane on-call

The best we can aim for when it comes to an on-call shift is to hope it is mostly uneventful.

The first step toward building a sustainable on-call is to commit to building a culture where on-call, by and large, isn’t life-impacting, while still acknowledging that by dint of its very nature, being on-call does have several ramifications on one’s personal life and is an encroachment into one’s personal time.

Being on-call means voluntarily trading a 40 hour week for a 168 hour week. Compensating people (be it SREs or SWEs) for being on-call isn’t something the industry has standardized on, but when you think about the additional hours and the level of anxiety or foreboding that comes with being on-call (even during an uneventful week), it does make sense for the incentive structure to be better formalized.

Furthermore, on-call becomes far less stressful when working with a supportive team, ideally one where everyone on the team understands the system equally well, can debug any issue with ease armed with best-in-class tools and shares a good enough rapport with the others so as to be able to pick up the slack for team members when required or swap on-call rotations if the need ever arises. Having certain members on such a team be actively hostile to the idea of on-call or sharing responsibilities kills any hope for such camaraderie.

The current state of on-call is far from healthy in most organizations. I doubt it will ever be “perfect”. The best we can aim for is to get to a point where on-call is something that’s sustainable and leaves the systems in a good enough state and the team in a state that’s not dysfunctional or adversarial.

Developers must strive to aim higher. One of the arguments I heard repeatedly over the last two days is that laying the blame at the feet of developers is unfair since they aren’t in any position whatsoever to effect any meaningful change before a sweeping organizational change is underway or without the explicit blessing of management to improve the reliability of a service.

If developers are sitting on their hands waiting for a wholesale organizational change to first take effect before they can work toward making service reliability a priority, I’d wager they are going to be waiting for a very long time. In the meantime, there usually exist several small and strategic wins that can greatly help improve the stability of a system with very little work. Identifying such wins and prioritizing them seldom, if ever, requires the explicit approval of upper management and is something a competent engineering manager or an IC should be able to achieve with ease.

Furthermore, such grassroots initiatives can in their own way help shape the organization’s culture. Brian puts it way better than I ever could:

Organizational behavior is the combined output of the actions of individuals within the organization, and is a feedback loop. The only part of an organization you have control over is yourself. Your actions are the only lever you have to catalyze organizational change. If you say “I cannot do the right thing until an organization changes”, you are helping to continue the current behavior of the organization, and making it less likely that it will ever change. It is quite possible that you don’t have enough leverage to change an organization, and that even making better choices from your position won’t change it enough to suit your needs. Your decisions are still the only thing you have control over. Incentives in the wrong places make it hard to make good decisions, but saying you can’t make good decisions because of incentives is an argument against self awareness that has ramifications far outside of just engineering decisions. Also note: “trying to change an organization” can often result in you being attacked and driven out of an organization. I’m not implying that it is an individuals fault if an organization doesn’t change, only that the only thing we have control over is our own decisions. Be aware that the decision “I don’t like how this organization behaves, but can live with it because I benefit more than I suffer” is often a decision you make at the expense of others.

An organizational change isn’t going to be brought about by the sheer will and blessing of the upper echelons of management. A more sustainable on-call rotation requires a bottom-up cultural change without the change being handed down to individual contributors on a platter.

A sustainable on-call is only possible if the engineers building the system place primacy on designing reliability into a system. Reliability isn’t birthed in an on-call shift.

Cindy Sridharan

@copyconstruct on Twitter. views expressed on this blog are solely mine, not those of present or past employers.