Every year, I publish a list of the best technical talks of the previous year. Usually, these posts go out in the first couple of months of the year. 2020 and 2021, to say the least, haven’t been “normal” years, which explains why I’ve been dilatory in publishing this post.
Usually my lists are heavy on systems talks, with a smattering of other interesting domains as well. This year, I’m afraid, the list is, with one notable exception, exclusively systems-focused. This list skews more towards talks from research conferences than industry talks (simply because that’s where the most exciting ideas and technologies are to be found these days), but I’m not the biggest fan of purely academic talks. Most of the research talks that made this list are from companies that have productionized this research.
I was fortunate enough to host the “distributed systems for developers” track for the virtual edition of QCon Plus 2020. The first three talks on this list are from my track.
- Greenwater, Washington: an Availability Story
This was a great talk by Marc Brooker on how to reason about the availability of systems beyond simply looking at the number of 9s. The talk discusses some of the pitfalls of strategies often pursued for improving availability (primarily redundancy), since they rarely address issues around correlated failures. The talk outlines two criteria to optimize for when designing for availability: keeping the blast radius small and localized, and identifying and mitigating the likely areas of correlated failure.
1a. Physalia: Millions of Tiny Databases
This was a fantastic NSDI 2020 paper from AWS. The paper builds upon concepts discussed in the aforementioned talk (or rather, the aforementioned talk highlights some of the core tenets of this paper, without going into the details of the implementation), viz., minimizing blast radius and correlated failures that adversely impact availability. Physalia is a transactional KV store that powers the control plane of cloud systems (Amazon EBS in this case). Instead of building a monolithic, centralized store, the paper describes an architecture of having millions of tiny databases, each “extremely available” for a subset of keys required by a subset of clients.
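The core idea of splitting one big store into many tiny ones to bound blast radius can be sketched in a few lines. This is an illustrative toy (the `Cell` and `CellRouter` names are mine, not Physalia’s): each key hashes to one of many small, independent cells, so a cell failure only affects the small subset of keys it owns. The real system additionally replicates each cell with Paxos and places cells close to their clients.

```python
import hashlib

class Cell:
    """A tiny, independent store for a small subset of keys.
    A failure here affects only the clients of this cell."""
    def __init__(self, cell_id):
        self.cell_id = cell_id
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

class CellRouter:
    """Routes each key to one of many tiny cells instead of one
    monolithic store, bounding the blast radius of any single
    cell's failure. (Sketch only: Physalia also replicates each
    cell and places it topologically near its clients.)"""
    def __init__(self, num_cells):
        self.cells = [Cell(i) for i in range(num_cells)]

    def cell_for(self, key):
        # Stable hash so a key always lands in the same cell.
        h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        return self.cells[h % len(self.cells)]

    def put(self, key, value):
        self.cell_for(key).put(key, value)

    def get(self, key):
        return self.cell_for(key).get(key)
```

The design win is that the failure domain is now one cell rather than the whole store: losing one cell leaves the vast majority of keys fully available.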
2. Essential Complexity in Systems Architecture
This was a fun talk from Laura Nolan on how to assess systems complexity along various axes. Using Google’s BigTable and Amazon’s Dynamo as examples, the talk touches upon how a similar problem was solved at two different companies using vastly different architectures, and the pros and cons of both approaches.
3. Change Data Capture for Distributed Databases at Netflix
This was an interesting, if slightly niche, talk on change data capture (CDC) for Cassandra, discussing the Flink ecosystem and the use of RocksDB.
4. Monitoring production services at Amazon
This was a great talk on Amazon’s monitoring philosophy — namely, measuring “above the fold” metrics like latency, how to center customer focus by measuring metrics like per-client error rate (and how to scale this practice when there are millions of different clients), how to deal with noisy metrics (noisy latency, unpredictable traffic, “distributed noise”), and more. I definitely came away with many insights, which is rare for me for a talk on the topic of monitoring.
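The per-client error rate idea is worth making concrete. Here is a deliberately naive sketch (names are mine, not from the talk): track request and error counts keyed by client, and report the ratio. Note that a plain in-memory map like this is exactly what does not scale to millions of clients — the talk is partly about how to get past this naive version.

```python
from collections import defaultdict

class PerClientErrors:
    """Naive per-client error-rate tracker. Illustrative only:
    at millions of clients, an unbounded map per metric becomes
    the scaling problem the talk discusses."""
    def __init__(self):
        self.totals = defaultdict(int)
        self.errors = defaultdict(int)

    def record(self, client_id, is_error):
        self.totals[client_id] += 1
        if is_error:
            self.errors[client_id] += 1

    def error_rate(self, client_id):
        total = self.totals[client_id]
        return self.errors[client_id] / total if total else 0.0
```

The value of the per-client view is that a fleet-wide error rate of 0.1% can hide a single customer failing 100% of the time; slicing by client surfaces exactly that.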
5. Zero Downtime Release: Disruption-free Load Balancing of a Multi-Billion User Website
This was one of the best papers I read in 2020. It’s a comprehensive look at how Facebook achieves zero downtime releases for all components of an extremely heterogeneous infrastructure, with a particular focus on how a load balancer can help paper over several deployment-related hiccups.
5a. The Evolution of Traffic Routing in a Streaming World
Another really fascinating talk on the topic of load balancing from Facebook, with more of a focus on solving the traffic routing problem in a distributed load balancing context.
6. Virtual Consensus in Delos
This was a sensational paper from OSDI on building a simple, fast, fault tolerant and zero-dependency consensus system that powers many components of Facebook’s control plane where more traditional consensus systems like Zookeeper don’t quite meet the requirements.
The key insight here is that of decoupling different components at different layers. At the highest level is a metastore running the original single Paxos algorithm which provides fault tolerance and simplicity. At the lower level is a shared-log architecture for implementing the consensus, first using Zookeeper as the shared log, and then removing Zookeeper by virtualizing the shared log which writes to an underlying physical loglet.
7. Twine: A Unified Cluster Management System for Shared Infrastructure
This was a cracking OSDI paper on Facebook’s cluster scheduler and the design decisions underpinning the architecture: it enables fleet-wide optimizations, exposes an API that allows applications to configure their own lifecycle management, and, most important of all, prefers smaller machines (64GB) to large machines with high memory/CPU, which necessitated the concomitant sharding of services.
8. Firecracker: Lightweight Virtualization for Serverless Applications
AWS’s Firecracker was open-sourced amidst much fanfare in 2018. It was good to finally have a white paper with an in-depth comparison between various isolation models (containers, language VMs like V8, userspace kernel implementations like gVisor, unikernels, virtual machines, and “micro” VMs). The paper (and the talk) goes into much further detail about what I believe is one of the most seminal pieces of software to have been open sourced in recent years.
9. Monarch: Google’s Planet-Scale In-Memory Time Series Database
Monarch has been talked about a lot by (current and former) Google engineers and is widely considered something of a gold standard of monitoring systems, so it was great that Google eventually (albeit a decade belatedly) published a white paper on the design of Monarch. The design choices I found most fascinating were storing the time series in memory, a relational query language, more descriptive target schemas, support for exemplars (commonly found these days in many commercial monitoring tools), and optimizing for availability over consistency (something of a departure from the norm for Google).
10. Inversion of scale: Outnumbered but in control
This is the talk version of the article Avoiding overload in distributed systems by putting the smaller service in control.
The key insight here is that when there is a huge mismatch (100x) between the size and scale of the control and data planes, putting the smaller of the two in control of the rate of work (fetching configurations, synchronizing operational state and more) done in the system is a far more reliable paradigm.
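A minimal sketch of the pattern (class and parameter names are mine, not from the talk): rather than letting a huge data-plane fleet each poll the small control plane — so total load scales with fleet size — the small control plane walks the fleet at a rate it chooses, pushing a bounded batch of updates per cycle.

```python
import itertools

class Host:
    """A data-plane host that passively receives configuration."""
    def __init__(self, name):
        self.name = name
        self.config = None

    def apply(self, config):
        self.config = config

class SmallControlPlane:
    """The smaller fleet drives the work: total pushes per cycle
    are fixed by the control plane, not by the (100x larger)
    data-plane fleet size. Illustrative sketch only."""
    def __init__(self, data_plane_hosts, pushes_per_cycle):
        self.hosts = data_plane_hosts
        self.pushes_per_cycle = pushes_per_cycle
        self._cursor = itertools.cycle(range(len(data_plane_hosts)))

    def push_cycle(self, config):
        # Walk the fleet round-robin, a bounded batch at a time.
        batch = min(self.pushes_per_cycle, len(self.hosts))
        for _ in range(batch):
            self.hosts[next(self._cursor)].apply(config)
        return batch
```

The inversion matters because in the poll-based design a fleet-wide restart produces a synchronized thundering herd against the small service; here, the worst case is bounded by a rate the small service itself controls.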
11. A Sticky Situation: How Netflix Gains Confidence in Changes
I first learned about “sticky canaries” a couple of years ago at an event hosted by Netflix.
Broadly speaking, a cohort of clients (the canaries) is always routed to the same backend canary cluster, as opposed to a traditional canary approach, where a client retry request might be routed to a production cluster. At the time, I’d misunderstood the main use case for this sort of testing to be client-side changes, but it really is a way to test server-side changes, when immediate server-side metrics might not surface problems that clients at several removes might observe.
This talk by Haley Tucker does a terrific job explaining how “sticky canaries” work in practice when integrated with the chaos engineering framework used at Netflix, and the pros and cons of this approach.
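The stickiness itself is easy to sketch. This is my reconstruction of the routing idea, not Netflix’s actual implementation: hash the client ID to a bucket so the same client (including its retries) always lands on the same cluster, letting client-observed metrics be cleanly attributed to the canary.

```python
import hashlib

def assign_cluster(client_id, canary_fraction=0.01):
    """Deterministically assign a client to 'canary' or
    'production'. Same client_id always gets the same answer,
    which is what makes the canary 'sticky'. Sketch only."""
    h = int(hashlib.sha256(client_id.encode()).hexdigest(), 16)
    bucket = (h % 10_000) / 10_000  # stable value in [0, 1)
    return "canary" if bucket < canary_fraction else "production"
```

Contrast this with per-request random assignment: there, a client whose canary request fails may silently succeed on retry against production, hiding exactly the class of problems sticky canaries are designed to surface.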
12. Deep dive on AWS Nitro Enclaves for applications running on Amazon EC2
This was a really fascinating talk on AWS’s implementation of secure enclaves, on top of Nitro. The talk goes into leveraging the hardware to enforce isolation between enclaves, as well as the mechanics and the lifecycle of the enclave.
13. Confessions of a Systems Engineer: Learning from My 20+ Years of Failure
This was a delightful talk that everyone running a system should watch. It lays out a number of home truths that must be painfully familiar to anyone operating services. Think of it as the ultimate production readiness checklist (and it covers a ton of important topics that are often not talked about much). It’s probably one of my favorite talks of all time.
14. Language Models are Few-Shot Learners
This isn’t a talk, but it is arguably the most scintillating research to have come out, not just in 2020 but probably in decades. It’s a nearly 40-page paper, and while I can’t pretend to have fully comprehended every aspect of it, it was still incredibly fascinating to get some idea of the technology underpinning GPT-3.