Cluster Schedulers

Cindy Sridharan
Jul 14, 2017

This post aims to understand:

— the purpose of schedulers the way they were originally envisaged and developed at Google
— how well (or not) they translate to solve the problems of the rest of us
— why they come in handy even when not running “at scale”
— the challenges of retrofitting schedulers into existing infrastructures
— running hybrid deployment artifacts with schedulers
— why at imgix we chose Nomad over Kubernetes
— the problems yet to be solved
— the new problems these new tools introduce
— what the future holds for us

1. Introduction

Cluster schedulers were initially popularized by the white paper on Borg from Google. Google famously has been running everything in “containers” orchestrated by Borg for over a decade now. Containerization outside large organizations was made possible by the runaway success of Docker, which in turn led to the birth of Kubernetes. Together, these tools have succeeded in both capturing the zeitgeist and enjoying a steady uptick in adoption across the board.

I’ve invariably been sceptical about the necessity of schedulers, especially at smaller companies, even if I’ve always found schedulers themselves fascinating to learn about out of curiosity. This is a sentiment I’ve expressed on Twitter in the past, where I asked:

It’s strange now to look back at this tweet, since exactly a year later, here I am, having rolled out a scheduler at the company where I work. When I posted this in July 2016, I had a fairly good idea about what schedulers did, what problem they solved and even how they solved it but I just couldn’t see the need for it at work.

Surprisingly, Brendan Burns, one of the creators of Kubernetes, had responded to this question back then, when he’d said:

In response, I’d said:

Where I work, this is just as true in 2017 as it was in 2016 when I’d originally made this comment — we’re a small company with an even smaller engineering team. A scheduler is certainly a huge investment for us.

I couldn’t see this point last year, but now I do. Sometimes, you only truly understand the power of a piece of technology when you actually use it.

I recently started using Nomad at work around early June. I’ve been to enough talks by the Hashicorp folks in the last ~2 years that I was familiar with what Nomad was and what it aimed to do, but working with it directly helped me gain a better appreciation for schedulers.

2. What even is a scheduler?

Required reading for anyone interested in learning about large scale computing and the purpose schedulers serve is the white paper on the Borg scheduler from Google followed by the 2009 book The Datacenter as a Computer — An Introduction to the Design of Warehouse-Scale Machines by Google’s Luiz André Barroso and Urs Hölzle.

The Borg paper begins by stating that Borg —

… achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior.

The paper further goes on to claim that:

Borg provides three main benefits:

1. hides the details of resource management and failure handling so its users can focus on application development instead
2. operates with very high reliability and availability, and supports applications that do the same
3. lets us run workloads across tens of thousands of machines effectively.

3. Cluster Schedulers in the Wild

All of the aforementioned points make perfect sense if one is Google (or anywhere near Google scale). With the exception of a precious few, the majority of organizations are nowhere near even a fraction of that scale, nor do they have the sort of constraints that warrant the operation of a system as complex as Borg.

A strain of thought that has become increasingly popular in recent years is that the Google way of doing things is not just right for Google but also the best way for the rest of us. This is corroborated by the birth of a number of companies and open source projects heavily inspired by Google’s internal infrastructure and tooling. A recent post on the pitfalls of this style of thinking resonated incredibly well with many, for it clearly laid out many of the fallacies of this school of thought. This scepticism is also shared by some of my acquaintances, many of whom have worked at really large scale organizations. In the words of one such acquaintance:

… buying machines up to a point is pretty easy; run one thing per machine, move on with your life, until you get to a point where the inefficiency is a problem that’s going to kill you.

To be honest, most companies never get to this point.

So then, what’s the need for a scheduler at these companies?

4. The need for a scheduler

I generally find that “need” is a word best left to marketeers. People have been deploying applications without “containers” let alone schedulers for much longer than even Google has been deploying its constellation of services in containers. We probably also don’t need things like test coverage, continuous integration, good observability tools and a number of other things we’ve come to consider basic requirements for building and operating software in this day and age.

One particular mistake I see a lot of us tend to make (including yours truly) about certain tools and technology is centering our conversation/evaluation around the stock problems they aim to solve and the features they provide, as opposed to the problem we actually have at hand. As often as not, the opportunity cost doesn’t get anywhere near as much thought devoted to it as the promised golden future. Part of this is inevitable. We don’t live in a vacuum but instead are influenced by the unholy trinity of hype, developer advocacy and a genuine fear of missing out.

It’s tempting, especially when enamored by a new piece of technology that promises the moon, to retrofit our problem space with the solution space of said technology, however minimal the intersection. Evaluating cluster schedulers on the basis of the problems they were invented to solve at Google is a fool’s errand. It makes more sense to first fully understand the current shortcomings of one’s infrastructure, before deciding whether a scheduler is the tool best suited to solve these problems.

Personally, I see three primary unsolved problems — the packaging problem, the deployment problem and the life-cycle problem. I want to delve deeper into each of these problems, before exploring how schedulers can help tackle these problems.

The Packaging Problem

The requirement for every company to be “data driven” and especially for consumer facing companies to be “mobile first” has made polyglot architectures an inevitability for even companies that are just getting started.

Even after all these years there doesn’t exist a POSIX-like standard for packaging across different languages, frameworks and runtimes. The tool that comes closest to solving this problem is, in fact, Docker, and it’s highly unlikely that schedulers would be in the limelight if it weren’t for the unbridled success of Docker. In fact, an ACM Queue article from Google states that:

It (Kubernetes) was conceived of and developed in a world where external developers were becoming interested in Linux containers, and Google had developed a growing business selling public-cloud infrastructure.

The introduction of Docker in 2013 ushered in a paradigm shift in the infrastructure space. The astonishing velocity at which new tools and workflows are being built around Docker makes me feel justified in labeling the current era a veritable gold rush. The momentum behind containerization has contributed toward both its ubiquity (every major cloud provider and PaaS offering comes with some form of support for containerization) and its increasingly widespread adoption. This metamorphosis isn’t yet complete, and I suspect that by the time it is, the way most organizations operate applications will be streets ahead of how they did things previously.

Irrespective of what one might think about the technical merits (or lack thereof) of Docker, it’s hard to deny that it has been absolutely revolutionary, even if primarily as a packaging format. The biggest draw of Docker was (and still remains) the fact that it helped attain parity between different environments, thereby accelerating the speed of development and testing. That internally Docker leverages control groups, namespaces, capabilities and other primitives built into the Linux kernel is purely an implementation detail, something I doubt had much to do with its popularity as opposed to the promise of “build, ship, run, any app, anywhere”. Docker lends itself towards solving a problem every bit as topical as it is unrelated to the core primitives upon which it is built. Docker’s genius and the main reason behind its preposterous popularity is its usability. To quote Bryan Cantrill:

Bryan Cantrill at Velocity 2016

Container primitives weren’t developed at Google to solve the problem Docker ended up solving for the rest of us. The Borg paper states that:

The vast majority of the Borg workload does not run inside virtual machines (VMs), because we don’t want to pay the cost of virtualization. Also, the system was designed at a time when we had a considerable investment in processors with no virtualization support in hardware.

Google did not invent cgroups so that its developers can “build, ship, run, any app, anywhere”. Google also famously deploys static binaries. The Borg paper goes so far as to state that:

Borg programs are statically linked to reduce dependencies on their runtime environment, and structured as packages of binaries and data files, whose installation is orchestrated by Borg.

Go, Rust and other languages in the C family compile into static binaries. Dynamic languages like Python, Ruby, JavaScript and PHP have a more complicated packaging story. All of them, however, can be built into a Docker image. In fact, the follow-up ACM Queue article goes on to state why the Docker image is, in fact, closer to the “hermetic ideal” than even static binaries:

There are many ways to achieve these hermetic images. In Borg, program binaries are statically linked at build time to known-good library versions hosted in the companywide repository. Even so, the Borg container image is not quite as airtight as it could have been: applications share a so-called base image that is installed once on the machine rather than being packaged in each container. This base image contains utilities such as tar and the libc library, so upgrades to the base image can affect running applications and have occasionally been a significant source of trouble.

More modern container image formats such as Docker and ACI harden this abstraction further and get closer to the hermetic ideal by eliminating implicit host OS dependencies and requiring an explicit user command to share image data between containers.

The primitives upon which Docker is built might’ve not been invented to solve the problem Docker solves, but standardization of packaging is something Docker got right.

Docker solved the packaging problem in a way no other tool had successfully done before, bearing testament to the fact that technology built to solve a specific problem for Google can translate exceedingly well as the solution to an entirely different problem outside the ivory towers of Google. That the problem being solved is far divorced from the original context in which the solution was developed is incidental and irrelevant, debunking the notion that you need to be Google scale to use the technology pioneered at Google.

The Deployment Problem

Packaging code into a deployment artifact is but the first step in operationalizing an application. Assuming we have an artifact that is ready to be deployed across a fleet of servers, the next immediate problem we face is that of deployments, graceful restarts and rollbacks.

The deployment space is extremely fractured with different organizations and ecosystems adopting varying practices to deploy code safely and reliably. Bigger companies have entire engineering teams dedicated to working on building internal tooling to automate the build and release process of software.

This fragmentation notwithstanding, the fundamental way in which deployments are done remains somewhat similar across the spectrum. You bring up a bunch of new processes and kill the old ones, ideally without dropping any connections. Different applications handle graceful restarts differently — mostly by handling specific signals. The most common approach to “gracefully restart” an application is to allow the old processes to finish handling all requests still in progress and then exit once all requests have been drained, making sure the dying process doesn’t accept any new connections in the meantime.

I’ve seen this done in a panoply of ways, each with its own set of failure modes and shortcomings — dedicated libraries, guard rails in application code, process supervisor config, socket activation, specialized socket managers, configuration management, custom or off-the-shelf tooling like Fabric or Capistrano, proxy configs or just plain old shell scripts with a hefty dollop of tribal knowledge, to name a few. Lather, rinse and repeat for not just deploys but also for rollbacks for every single service one runs, and suddenly this doesn’t look like such a trivial problem anymore.

The aforementioned approach is as unsophisticated as it is brittle and can prove to be crippling as the number of services scales up, especially in a “microservices” environment. Having to wrangle the Ruby ecosystem (Thin, Mongrel, Unicorn, god, monit and what have you) when there’s some Python in the stack as well (gunicorn, uWSGI, mod_wsgi, Twisted, supervisor) can be a horrifying proposition in its own right, but it is exacerbated greatly when a JVM gets thrown into the mix, as is wont to happen at any company doing anything at all with “Big Data”.

Deployments, in short, continue to be a massive pain point for everything except the most trivial of cases, and as with packaging, there’s no one standard Unix-like way to deploy heterogeneous application workloads.

The Life Cycle Problem

For the sake of argument, let us assume that we have solutions to the packaging and deployment conundrums. We’re now faced with the responsibility of ensuring the healthy operation of our service for the duration of its lifetime. Even eschewing fancy terminology like SLOs, SLAs and SLIs, at the very least our service needs to be up and functional.

Applications fail for a number of reasons. Some of these are application logic centric (incorrect handling of unexpected input or edge cases); applications crashing due to application errors can be detected and mitigated by process managers restarting the crashed application. At other times, applications fail due to the network (upstreams being undiscoverable or requests to upstreams being rate-limited or timing out), in which case one would hope the application code (or the proxy fronting the application) is robust enough to handle these failures.

And sometimes, machines themselves fail, requiring a relocation of applications onto other hosts. Process managers cannot help here since, by design, they are host centric. The way many teams historically monitored the “distributed health” of applications was by scripting Nagios checks to alert a human operator. There are workarounds to this, as explored a little later in this post, but most of the tools in the traditional Operations toolkit aren’t specifically designed for this purpose.

More importantly, the problem with most distributed systems is that of partial failure where components of the application fail, or the application itself functions in a degraded state. Partial failure can manifest in myriad forms, including increased timeouts or response latency. Degraded application performance can be a symptom of several causes — noisy neighbors, unbounded queues, lack of sufficient backpressure, poor cache hit ratios, database lock contention, poor query patterns, partial outage of upstreams or third party services to name a few. Having good observability into applications helps uncover these issues, but most observability tooling isn’t built to auto-heal the application; instead, it’s built to help in the diagnosis and debugging of systems via dashboarding, auditing, exploratory analysis, and in the worst case scenario, alerting a human operator.

In short, we’re all building distributed systems these days. If one has more than one server or container, one is operating a distributed system. Yet, the tools we use to build, run and monitor these systems largely aren’t up to the mark. Managing the lifecycle of applications is at best a semi-automated process, with most tooling of yore being either host-centric or human-centric.

5. Problems we actually had

We’ve identified three problems that continue to plague us, but it makes a lot more sense to look at a more concrete example of what problems we actually had where I work before we decided to switch (parts of our stack) to a scheduler.

The Packaging Problem

I work for imgix, an image processing company, where we process tens of thousands of image transformations per second (which in the worst case would require us to freshly fetch that many images in real time from S3 or GCS or any HTTP backend) and serve them back in real time. We also need to support real time user driven configuration updates to our entire fleet of machines, in addition to load balancing requests to hosts based on locality and various other factors, as well as purging varying image formats from all of our caches, both internal and external.

All of our client-facing Javascript applications as well as sales and marketing tools run on Heroku or GCP. These platforms usually have standard, if disparate, packaging and deployment mechanisms for different languages and there’s little anyone can do other than adhere to their guidelines to get something up and running. The various components of our core stack are written in Lua, C, Objective-C, Python and Go. These are the applications that run inside our datacenter and where deployment isn’t quite as straightforward as git push heroku master.

Python is a bit of an odd man out among our main languages. We maintain our fair share of old Python code, but no new code that’s meant to be run within the datacenter is authored in Python. While all of our C, Objective-C, Lua and Go applications compile into convenient static binaries, our Python deployment story has historically been something of an albatross. Our earliest Python applications were deployed in the pex format while the one (thankfully monolithic) application that is being actively developed and maintained is deployed as a Docker container.

Our initiative to Dockerize this single Python application was half-hearted and understandably, the results somewhat mixed. For the most part, Dockerizing the Python code yielded the result we were hoping for, viz., a hermetically sealed artifact capturing all the pure Python as well as native dependencies (shared libraries, header files and compiled C modules). Except for the odd issue with OpenSSL (often fixed by pulling in a specific version of the cryptography library), deploying Python in the form of a Docker container has worked well enough.

Outside of Python, we don’t see a need for Docker for the rest of our applications, which are deployed as binaries and have a fairly good artifact caching, versioning and delivery system. We also use Macs for our core image rendering layer, which requires we re-image every server and recompile our image processing software every time there’s an OS X upgrade. The details of how this is done are outside the scope of this post, but suffice it to say that Docker isn’t the solution to any of these problems. Our goal, as such, is to limit Docker usage to just the single Python application and not write any future core service in Python.

The Deployment Problem

Circus is a process and socket manager from Mozilla. It’s especially good for deploying Python web applications without requiring the application to be fronted by a WSGI server. Initially we used Circus to run all of our applications, but about a year ago switched to s6.

s6, a descendant of daemontools, has been pretty reliable so far and is what is used to run most of our core services. s6 is a toolkit of small programs that allows one to compose flexible process pipelines with little effort and without leaving behind deep process trees. The switch from Circus to s6 was made owing to Circus’s general unreliability, problems with its socket client (circusctl) often timing out its conversation to the Circus daemon, the Circus daemon itself oftentimes being unresponsive for long periods, or worse, pegging CPU on certain boxes (HUP-ing Circus usually fixed this, but it was still something that needed monitoring).

Deployments were fully automated but under the hood involved getting the latest artifact to the right hosts and having Circus restart the application. Applications ran as child processes of Circus. Circus managed the listening socket, so restarting entailed Circus sending the old application process a (configurable) signal, upon the receipt of which the application would manually deregister itself from Consul (which we use for service discovery) ensuring no further requests were routed to it, drain all connections and finally exit. New processes that were being brought up would first register themselves with Consul with the same service name as the dying process and then start accept-ing new connections. This worked well enough, except every application codebase carried some boilerplate code that did the Consul registration upon starting and deregistration upon shutdown.
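A minimal sketch of what that boilerplate amounted to, using the hashicorp/consul/api Go client (the service name, port and drain timeout below are hypothetical, not our actual code): register with Consul on startup, and on the shutdown signal deregister first, then drain in-flight requests, then exit.

```go
// Hypothetical sketch of the per-application Consul boilerplate under the
// Circus-based setup. Service name, ID, port and timings are illustrative.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register this instance with Consul on startup so the routing layer
	// starts sending it traffic.
	const serviceID = "my-service-1"
	err = client.Agent().ServiceRegister(&consul.AgentServiceRegistration{
		ID:   serviceID,
		Name: "my-service",
		Port: 8080,
	})
	if err != nil {
		log.Fatal(err)
	}

	srv := &http.Server{Addr: ":8080"}
	go func() { _ = srv.ListenAndServe() }()

	// On the (configurable) shutdown signal from the process manager,
	// deregister first so no new requests are routed here, then drain.
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
	<-sigCh

	if err := client.Agent().ServiceDeregister(serviceID); err != nil {
		log.Printf("consul deregister failed: %v", err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil { // finish in-flight requests
		log.Printf("shutdown: %v", err)
	}
}
```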

As stated above, process managers’ scope is limited to a single host. We almost never run just a single instance of an application (unless it’s something deployed to a staging environment or an experiment). Outside of our core image rendering pipeline, the applications we operate include everything from HTTP servers to internally authored routers and proxies, asynchronous workers, CPU and memory intensive batch jobs and a number of what we called “oddjobs” described in our internal wiki thusly:

Oddjobs were a miscellaneous bunch, some of which were singletons, some of which lacked internet access, some of which were long-running. The way they were run could only be described as … interesting, if imperfect.

One of the features Consul provides is distributed locking. consul lock -n <n> foo <cmd> allows one to acquire up to n holds on the prefix foo in the Consul key value store and then have Consul on that host actually run cmd as a child process. If the lock was lost or communication disrupted (owing to physical node failure, for instance), the child process would be terminated.

If we needed one instance of an oddjob running at any given time, running this job on, say, two physical hosts under a consul lock would mean only one would actually acquire the lock at any given time. If the process that held the lock crashed, the process running on the other host could acquire the lock relinquished by the crashed process, guaranteeing we’d have exactly one process running at any given time (unless, of course, all of the hosts crashed, in which case we’d have a far bigger problem than oddjobs not running). This works equally well when the job in question isn’t a singleton.
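For the singleton case, the pattern can be sketched with the lock helper in the Consul Go API, which is built on the same sessions-and-KV primitive the consul lock CLI wraps (the KV prefix and the "work" below are hypothetical):

```go
// Sketch of running a singleton "oddjob" under a Consul lock using the
// hashicorp/consul/api lock helper. The KV prefix and the work being done
// are hypothetical; the consul lock CLI is built on the same primitive.
package main

import (
	"log"
	"time"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Contend for the lock; only one process across all hosts holds it.
	lock, err := client.LockKey("service/oddjobs/foo/.lock")
	if err != nil {
		log.Fatal(err)
	}
	lostCh, err := lock.Lock(nil) // blocks until the lock is acquired
	if err != nil {
		log.Fatal(err)
	}
	defer lock.Unlock()

	// Do the job's work, but stop if the lock is lost (e.g. the session
	// is invalidated because the node or agent fails).
	for {
		select {
		case <-lostCh:
			log.Println("lost the Consul lock, shutting down")
			return
		case <-time.After(10 * time.Second):
			log.Println("still the singleton, doing work") // hypothetical work
		}
	}
}
```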

This met all our requirements and soon enough, we had quite a few critical internal services running as oddjobs, including the Prometheus Alertmanager.

The Life Cycle problem

When Consul is everything but the kitchen sink, it’s tempting to use it to solve every problem. Tools like Zookeeper, Consul and their ilk are meant more as building blocks for distributed systems; using distributed locking directly as a means to reliably run jobs isn’t the most effective use of the tool.

Soon enough, it became evident that this approach wasn’t as foolproof as we’d have liked. Sometimes services wouldn’t deregister from Consul correctly when they exited, once leaving almost 6K dead entries behind. This, per se, wouldn’t have been a problem had there been an index in the Consul store on the service state and health check status, but the absence of an index meant leader thrashing and Raft timeouts for some of our other services that polled Consul. At other times, the child oddjob process would escape the parent process and get reparented to init. This happened because Consul launched the child process in a shell, which didn’t relay signals like SIGTERM to the child when it exited.

On Linux, having a child process set PR_SET_PDEATHSIG on itself via prctl ensures that if the parent process dies for some reason, the child gets a signal, even if the parent is kill -9-ed. This meant that once a Consul agent acquired the lock and forked the child oddjob process it was meant to run, having the child set PR_SET_PDEATHSIG on itself before exec-ing would mean signals would be delivered to the child correctly.

Except Consul is written in Go, which doesn’t make any code execution between a fork and an exec straightforward. The solution was to wrap the actual binary needed to run under the Consul lock with a small C program called deathsig. Consul would fork and exec deathsig, which in turn would set PR_SET_PDEATHSIG on itself before exec-ing again to run the actual binary that needed to be run under the Consul lock.
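The real deathsig is a small C program, but the sequence it performs (set PR_SET_PDEATHSIG on yourself, then exec the real command so the setting carries over) can be sketched as follows; this Go version is purely illustrative, not the actual wrapper:

```go
// Illustration of what a deathsig-style wrapper does: set PR_SET_PDEATHSIG
// on itself, then exec the real command. The real wrapper described above
// is a small C program; this Go sketch is purely illustrative.
package main

import (
	"log"
	"os"
	"os/exec"
	"runtime"

	"golang.org/x/sys/unix"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: deathsig <command> [args...]")
	}

	// Keep prctl and exec on the same OS thread.
	runtime.LockOSThread()

	// Ask the kernel to deliver SIGTERM to this process if its parent
	// (the process holding the Consul lock) dies, even if the parent is
	// kill -9'ed. The setting survives a plain exec.
	if err := unix.Prctl(unix.PR_SET_PDEATHSIG, uintptr(unix.SIGTERM), 0, 0, 0); err != nil {
		log.Fatal(err)
	}

	// Guard against the race where the parent died before prctl ran.
	if os.Getppid() == 1 {
		log.Fatal("parent already exited")
	}

	// Replace this process with the real command.
	path, err := exec.LookPath(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	if err := unix.Exec(path, os.Args[1:], os.Environ()); err != nil {
		log.Fatal(err)
	}
}
```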

This worked, but was never really meant to be the “final solution”. Even when oddjobs were first conceived and implemented in this fashion two years ago, the goal was to eventually move to a scheduler built for this purpose.

Which, in turn, raised the question: which scheduler?

6. Why not Kubernetes?

Kubernetes is an incredibly sophisticated and opinionated framework for orchestrating containers with a flourishing and fantastic community. I’m pretty convinced at this point that Kubernetes is what most mid-sized to large organizations with a well-staffed Operations team would benefit from. However, Kubernetes wasn’t the right choice for us for a number of reasons.

Docker

One of the biggest arguments against Kubernetes is the fact that it would’ve required us to change how we package all of our existing applications. Had it only worked with hybrid deployment artifacts, instead of prescribing that the artifact be a Docker or rkt container, Kubernetes is something we would have seriously considered. For the reasons stated above, Docker usage is something we hope to limit, not proliferate.

To me personally, Docker is a perplexing piece of software. On the one hand, it makes encoding any and every application into a standard format ridiculously easy. Running Docker containers in production with the standard Docker tooling, though, is an entirely different kettle of fish. I’ve had conversations with people working at large companies with an entire team of kernel experts to support their Docker based container infrastructure, one of whom mentioned:

Docker was and still remains a nightmare. We still run into hundreds of kernel panics and single-digit Docker lockups a day.

Even when taken with a grain of salt, if this is the state of Docker at companies with an entire team dedicated to work on their container based infrastructure, adopting a scheduler that would lock us into Docker didn’t feel awfully reassuring, particularly when, with the sole exception of one application, the benefits that Docker would’ve offered would not have outweighed the overhead, let alone the cons.

Networking

The other issue is that of Kubernetes networking.

Often dismissed by the fanboys as FUD, criticism of Kubernetes networking persists because the networking continues to be complex, often papered over with “workflows” and further layers of abstraction. The complexity, as a matter of fact, is claimed to be entirely warranted. The ACM article goes on to explain why.

All containers running on a Borg machine share the host’s IP address, so Borg assigns the containers unique port numbers as part of the scheduling process. A container will get a new port number when it moves to a new machine and (sometimes) when it is restarted on the same machine. This means that traditional networking services such as the DNS (Domain Name System) have to be replaced by home-brew versions; service clients do not know the port number assigned to the service a priori and have to be told; port numbers cannot be embedded in URLs, requiring name-based redirection mechanisms; and tools that rely on simple IP addresses need to be rewritten to handle IP:port pairs.

We run tens of services, not hundreds let alone thousands. Most of our applications (with certain exceptions) do a bind 0 and advertise this port to Consul. For the single Docker application we run, we use the host network. We don’t see this changing in the foreseeable future.
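A sketch of that bind-0-and-advertise pattern (the service name here is hypothetical): listen on port 0, discover which ephemeral port the kernel picked, and register that port with the local Consul agent so downstreams can find the service by name rather than by a well-known port.

```go
// Sketch of the "bind 0 and advertise the port to Consul" pattern most of
// our applications follow. The service name is hypothetical.
package main

import (
	"log"
	"net"
	"net/http"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	// Bind to port 0 and let the kernel pick a free ephemeral port.
	ln, err := net.Listen("tcp", ":0")
	if err != nil {
		log.Fatal(err)
	}
	port := ln.Addr().(*net.TCPAddr).Port

	// Advertise whatever port we ended up with to the local Consul agent;
	// downstreams discover us by service name, not by a fixed port.
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	err = client.Agent().ServiceRegister(&consul.AgentServiceRegistration{
		Name: "my-service",
		Port: port,
	})
	if err != nil {
		log.Fatal(err)
	}

	log.Printf("serving on port %d", port)
	log.Fatal(http.Serve(ln, nil))
}
```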

Learning from our experiences with Borg, we decided that Kubernetes would allocate an IP address per pod, thus aligning network identity (IP address) with application identity. This makes it much easier to run off-the-shelf software on Kubernetes: applications are free to use static well-known ports (e.g., 80 for HTTP traffic), and existing, familiar tools can be used for things like network segmentation, bandwidth throttling, and management. All of the popular cloud platforms provide networking underlays that enable IP-per-pod; on bare metal, one can use an SDN (Software Defined Network) overlay or configure L3 routing to handle multiple IPs per machine.

Again, while setting up an SDN or defining a new /23 is definitely something we would’ve done if absolutely called for, we just didn’t see the need for this overhead.

Incremental Refactor of Infrastructure

Introducing a scheduler is a significant infrastructural change. While “never rewrite” appears to be a golden rule that many developers have internalized to good effect, many infrastructure engineers seem to have no qualms about completely throwing this caution to the winds. When I meet people working at startups who tell me smugly how they spent 6 months Dockerizing all of their applications to run on their own Kubernetes cluster, it’s hard not to wonder if that time could’ve been put to better use. Bad code can be rolled back; ill-advised infrastructural choices usually end up being more risky.

Sandi Metz and Katrina Owen, in their book 99 Bottles of OOP, propose a blueprint for refactoring, where they claim:

The good news is that you don’t have to be able to see the abstraction in advance. You can find it by iteratively applying a small set of simple rules. These rules are known as “Flocking Rules”, and are as follows:

1. Select the things that are most alike.
2. Find the smallest difference between them.
3. Make the simplest change that will remove that difference.

Making small changes means you get very precise error messages when something goes wrong, so it’s useful to know how to work at this level of granularity. As you gain experience, you’ll begin to take larger steps, but if you take a big step and encounter an error, you should revert the change and make a smaller one. As you’re following the flocking rules:

1. For now, change only one line at a time.
2. Run the tests after every change.
3. If the tests fail, undo and make a better change.

The book introduces this concept in the context of refactoring Ruby code. I feel this translates equally well to infrastructural refactors — make the simplest infrastructural change to things that are most alike to remove a pain point.

A good rule of thumb when it comes to adopting new tooling — especially Operations tooling — is to pick one that will integrate seamlessly into an existing infrastructure and require one to make the least number of changes to what’s already working. Those who have the luxury (time, resources and dedicated personnel) to greenfield their infrastructure or those beginning with a clean slate might be able to experiment with other options, but for us this wasn’t a possibility. We’ve gone down the path of golden rewrites a couple of times and it’s never gone as well as we’d hoped. We’re firmly committed to incremental refactors.

We’ve been using Consul for well over 2.5 years now and have developed many best practices as well as architectural patterns around its use. Consul runs the gamut of coordination tasks across our stack, from load balancing separate categories of real time traffic to different physical machines on separate racks to driving our customer facing dashboard’s maintenance page updates (freeing our frontend engineers from having to be on hand during a backend maintenance). The importance and the versatility of Consul cannot be overstated.

Having operated Consul for years, we are well aware of its failure modes, consistency guarantees as well as edge cases, and have developed sound abstractions and architectural patterns to work around some of these. Replacing such an integral component of our stack, along with the hard-won operational knowledge, is something a company our size could ill afford. Any gains a scheduler might bring by making deployment and lifecycle management a bit easier and more standardized weren’t worth the effort if they came at the expense of replacing Consul.

In addition to Consul, we also use Vault for secret management. If we were going to use a scheduler, it needed to be one that would require us to make the simplest change, yet integrate well with our existing stack and gain us all the benefits we hoped to reap.

That scheduler was Nomad.

7. Why Nomad?

Different cluster schedulers vary in the features they provide, heuristics they use to make scheduling decisions, tradeoffs they make and constraints they impose. The more flexible ones can be used as anything from a simple distributed cron replacement to a tool that’s capable of scheduling up to millions of containers across tens of thousands of geographically distributed nodes. The more feature rich ones offer more fine-grained control but at the expense of operational complexity.

Nomad’s key advantage over other schedulers in the space is its ease of use and ease of operation, both of which are of paramount importance to any system, but especially so when it comes to Operations tooling. While there is a certain overlap between the functionality provided by the two, Kubernetes is primarily a platform for distributed containerized applications that happens to have a scheduler built in, whereas Nomad is primarily a flexible scheduler that can be used to build a platform comprising hybrid deployment artifacts, from Docker containers run by the standard Docker tooling to LXCs to Java JARs to binaries running as root without any isolation whatsoever (not recommended), via different drivers.

8. Immediate Wins

Switching to Nomad yielded some immediate wins over the way we used to deploy and run applications. Many of these wins aren’t specific to Nomad, and the points marked with an asterisk are the benefits other schedulers would’ve provided as well.

Minimal change required to our existing stack

Adopting Nomad did not require us to change our packaging format — we could continue to package Python in Docker and build binaries for the rest of our applications. It also required us to make very little change to application code (just some boilerplate code deleted). Nor did it require us to change the way we did service discovery. It integrated well with the tools we were already using.

* Deployments as Code

The ability to describe, via a declarative job file, complex deployment policies that were hitherto hard to express is something I wish I’d had all those years ago when I started out, and it’s hard to see how I can go back to deploying applications any other way.

Every application deployed with Nomad has a job file that lives in the same repo as the application codebase and is edited by the developers responsible for building the application. This job file lets me specify, among other things, the following in the form of config:

1. The number of instances of the application I want running
2. What applications need to be colocated on the same physical hosts
3. Resource limitations one might want to set for the application process (CPU, RAM, Network bandwidth)
4. The port I want to bind my application to. I can also choose to let Nomad pick a port and have my application bind to that.

Having a common syntax that allows developers to specify the operational semantics of the applications they build puts the onus of the operation of a service on those best suited to make decisions given the performance characteristics and business requirements of the application.

Fantastic Consul Integration

Nomad allows me to express the service name under which I want to register my application with Consul, as well as any tags I want to associate with the service, as config in the job file. What this means for me as an application developer is that my applications aren’t responsible for registering themselves with Consul, registering subsequent health checks, or deregistering themselves when shutting down. The less code there is to write, the fewer the bugs and failure modes of the application. But where Nomad’s fantastic integration with Consul really shines is in how greatly it simplifies graceful restarts.

* Simplified Graceful Restarts

Nomad allows every task to specify a kill_timeout, which:

Specifies the duration to wait for an application to gracefully quit before force-killing. Nomad sends an SIGINT. If the task does not exit before the configured timeout, SIGKILL is sent to the task.

Applications, when they start up, query Consul for all their upstream services and store this information in an in-memory data structure. The application then fires a background coroutine/goroutine that long polls Consul for changes to these upstreams and updates the in-memory data structure whenever this information changes.
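A sketch of that background goroutine, using Consul's blocking queries (the upstream name and the in-memory structure here are hypothetical): each call blocks until the upstream's healthy instances change or the wait time elapses, and the cached view is swapped out whenever a new index comes back.

```go
// Sketch of the background goroutine that long polls Consul for changes to
// an upstream's healthy instances. The upstream name and cache shape are
// hypothetical; the blocking-query mechanics are Consul's.
package main

import (
	"fmt"
	"log"
	"sync"
	"time"

	consul "github.com/hashicorp/consul/api"
)

type upstreams struct {
	mu    sync.RWMutex
	addrs []string
}

func (u *upstreams) set(addrs []string) {
	u.mu.Lock()
	defer u.mu.Unlock()
	u.addrs = addrs
}

func watch(client *consul.Client, service string, u *upstreams) {
	var lastIndex uint64
	for {
		// Blocking query: returns when the service's healthy entries
		// change, or after WaitTime, whichever comes first.
		entries, meta, err := client.Health().Service(service, "", true, &consul.QueryOptions{
			WaitIndex: lastIndex,
			WaitTime:  5 * time.Minute,
		})
		if err != nil {
			time.Sleep(time.Second) // back off briefly on errors
			continue
		}
		lastIndex = meta.LastIndex

		addrs := make([]string, 0, len(entries))
		for _, e := range entries {
			addr := e.Service.Address
			if addr == "" {
				addr = e.Node.Address // fall back to the node address
			}
			addrs = append(addrs, fmt.Sprintf("%s:%d", addr, e.Service.Port))
		}
		u.set(addrs) // swap in the fresh view of the upstream
	}
}

func main() {
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	u := &upstreams{}
	go watch(client, "my-upstream", u) // hypothetical upstream name
	select {}                          // the rest of the application would run here
}
```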

Let’s assume I have process A running and I want to deploy a new version of the code. Let’s also assume A has two downstreams — X and Y — and one upstream Z. At the time of deployment, A has three open connections from X and Y and one connection held open to Z. This might look something like the following:

Obviously, this is an extremely contrived example. In a real system, there might be tens or hundreds of upstreams and downstreams, and thousands of bidirectional RPC calls in progress at the time of a deployment. Furthermore, it’s also possible that all the four services A, X, Y and Z might be in the middle of being deployed at the same time. For now, let’s work with this example.

During a deploy, the old process A is deregistered from Consul and a new process B is registered.

X and Y will learn about B from Consul, after which future requests will be routed to process B and not process A. Since Consul is a CP system, it might take a while (though usually not longer than a few seconds) for this change to propagate. Until then, X and Y would continue routing requests to process A, so we need to make sure that A stays up long enough for X and Y to start routing new requests to process B.

Once X and Y have got the update, they will start routing all new requests to process B. But some of the old connections they have open with process A might still be in progress, which means A can exit only after it’s finished responding to all requests in progress. This might look something like the following:

Once A has completed servicing the requests still in progress, it can exit, and X and Y will continue routing all future requests to B.

Nomad makes this extremely easy. It deregisters A from Consul automatically when it brings up B. It will also send A a SIGINT at this point. A can handle the signal and exit of its own accord; if it doesn’t, it will get SIGKILLed after the kill_timeout. The default kill_timeout is 30s but can be configured in the job file (and is much higher for some batch jobs). Depending on whether the application being shut down needs to do any cleanup (such as releasing any resources, file descriptors or locks it might have acquired) before exiting, the application can either choose to handle the signal or ignore it.

If A chooses not to handle the SIGINT and exit of its own accord, it’ll stick around for some time, even when X and Y stop routing requests to A.

30s later (or however long the timeout is configured to be), A will get a SIGKILL, leaving just process B behind. The 30s default is probably way too high and this value could be tweaked based on application characteristics — like the p95 or p99 response time of specific endpoints or the agreed upon SLA of the application.
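Under Nomad, the application-side half of this shrinks to plain signal handling, with no Consul calls at all. A minimal sketch, assuming an HTTP service and the default 30s kill_timeout:

```go
// Minimal sketch of graceful shutdown under Nomad: Nomad handles the Consul
// deregistration and sends SIGINT; the application just drains and exits
// before the kill_timeout elapses (or ignores the signal and gets SIGKILLed).
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"} // in practice the port comes from Nomad's environment
	go func() { _ = srv.ListenAndServe() }()

	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, os.Interrupt) // SIGINT from Nomad on deploy/stop
	<-sigCh

	// Stop accepting new connections and wait for in-flight requests,
	// but give up comfortably before the 30s kill_timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown: %v", err)
	}
}
```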

At the edge, we use HAProxy, whose config gets regenerated by Consul-Template. Zero downtime HAProxy reloads are slightly more complicated than what’s described above and as such are beyond the scope of this post.

* Flexibility

Scaling up the application count or resource constraints becomes as easy as editing the job file and redeploying the application, while leaving a paper trail behind. If a bad deploy requires a hotfix or a rollback, all I need to do is edit the job file to revert the application to a known good version and deploy again.

* Democratization of Operation

The job file looks similar for all applications run under Nomad, making it extremely easy for me to look at the job file of an unfamiliar application and instantly be able to tell how it’s deployed.

Since deployments are uniform for all applications, should push come to shove, I could even deploy services I’m not the primary owner of with a reasonable amount of confidence. Having every engineer (and not just Operations Engineers) be familiar with deployments greatly democratizes the operation of all services across the stack.

Simplicity

It would be remiss not to include the ease of use and ease of operation of Nomad as a win. Switching over some of our services to run under Nomad was never an all-consuming project that took months of dedicated effort to set up. Probably about 15–20% of our time over the last month was spent learning Nomad and experimenting with it till we gained enough confidence to use it in production.

The simplicity is made possible by Nomad’s fairly straightforward architecture. A group of Nomad servers forms a Raft cluster, with a leader responsible for replicating the Raft log to the followers. Every other node runs the Nomad client. Nomad schedules tasks to nodes — these tasks could be Docker containers, Java JARs, binaries or some long running batch job. The Nomad clients communicate the health of the node as well as information pertaining to the node’s resource utilization to the Nomad servers, which use this information to make scheduling decisions and communicate those back to the clients. This is just a bird’s eye view, but it covers the basic architecture.

9. Pain Points

Nomad is still a very young project and we only use it for some of our auxiliary services; not a single one of our core services has been switched over to Nomad and currently there aren’t any plans to do so in the immediate future either. It’s still early days for us and while the results so far have been promising, there are some features Nomad lacks that we would greatly benefit from.

ACL

Nomad currently lacks any form of access control support. The workaround a lot of people resort to is to run Nomad behind another proxy — in our case it’s HAProxy — that allows one to specify policies and access control.

Overcommit

Some schedulers allow for oversubscription. The Borg paper states that:

Rather than waste allocated resources that are not currently being consumed, we estimate how many resources a task will use and reclaim the rest for work that can tolerate lower-quality resources, such as batch jobs. This whole process is called resource reclamation. The estimate is called the task’s reservation, and is computed by the Borgmaster every few seconds, using fine-grained usage (resource consumption) information captured by the Borglet. The initial reservation is set equal to the resource request (the limit); after 300s, to allow for startup transients, it decays slowly towards the actual usage plus a safety margin. The reservation is rapidly increased if the usage exceeds it.

Nomad, in the vein of Borg, colocates CPU and memory intensive batch jobs on the same physical hosts as latency sensitive services. But the lack of quotas, priorities and oversubscription support makes resource reclamation impossible. While this hasn’t been a problem until now, it’s something that will be in the back of our minds the next time we think about switching some of our more critical services over.

10. Conclusion

Most of us focus on scale when we think about schedulers. The problem which schedulers were born to solve (large scale computing) has become the pivot around which most public discourse revolves. “Google Infrastructure For Everyone Else” only makes sense if the technology can be adapted to solve real, immediate problems faced by organizations.

The applications we develop have become vastly more complex than those we were building a decade ago. Even when the core business logic is simple, the need for features such as additional integrations, robust data pipelines, high reliability and availability, quality of service guarantees, customer satisfaction, rapid innovation, continuous deployment, quick feedback loops and constant iteration has thrown into clear relief the importance of standard and reliable tooling.

These are requirements even pre series-A startups have these days. What with startups in the AI and IoT realm being the flavor of the month, the need to collect and process massive amounts of data is becoming an increasingly common problem across the spectrum, requiring standardized, sophisticated tooling and infrastructure.

In many ways, the word “scheduler” is very much a misnomer, since most schedulers do far more than make scheduling decisions such as which physical hosts to run specific applications on. Schedulers manage the entire lifecycle of applications, from their birth to their redeployment to their monitoring to their eventual decommissioning.

Schedulers might initially seem like a scary thing, way above the pay grade of most engineering organizations. In reality, however, schedulers can be a game changer and are an enormous step up from the traditional way of managing the software life cycle. The flexibility they afford and the immediate benefits they provide cannot be overstated.

Going back to the conversation this post began with:

A year ago, I would’ve disagreed with this tweet. Today, I disagree with it for a different reason. It’s not solely the job of Operations engineers to ensure the reliable deployment of applications. It is a responsibility they share with developers.

Schedulers are not superior because they replace traditional Operations tooling with abstractions underpinned by complex distributed systems, but because, ironically, by dint of this complexity, they make it vastly simpler for the end users to reason about the operational semantics of the applications they build. Schedulers make it possible for developers to think about the operation of their service in the form of code, bringing us one step closer to the DevOps ideal of shared ownership of the holistic software lifecycle.

Which, truly, is the right way — for not just Google but for all of us — to think about software.


Cindy Sridharan

@copyconstruct on Twitter. Views expressed on this blog are solely mine, not those of present or past employers.