Given it took me until October 2021 to post the list for 2020, thought I’d get a move on this year and post the list before the end of the year.
A short list this year, inasmuch as I barely spent any time on “extracurricular” tech stuff. The first four talks are incredibly accessible, the rest not so much.
There were a number of very interesting papers that came out in 2021 that made it to my reading list, but I just didn’t manage to read. Here’s hoping 2022 proves to be a more fruitful and productive year.
1. How we used serverless to speed up our servers, Jessica Kerr
A lot of the hype (and hate) around serverless has surely died down in the past two years, finally separating the wheat from the chaff. This was a stellar talk from Strangeloop (it’s a talk by Jessica Kerr, who has to be one of the very best conference speakers ever) on how Honeycomb used AWS Lambda to “serverless their database”.
2. Gigabytes in milliseconds: Bringing container support to AWS Lambda without adding latency: Marc Brooker
This might be the shortest talk on this list clocking in at just over 16 minutes, but chock full of very fascinating ideas on how container support was implemented in AWS lambda. Lambda supports container image upto 10GB in size, still scaling up in milliseconds. Some very interesting insights and techniques to leverage these insights are discussed in the talk.
First, deterministially flattening and chunking the layers of container image into a sparse filesystem, and lazily loading only the chunks needed to service the reads made by Firecracker (the VMM used by Lambda). Second, multiple levels of caching of the chunks, starting with a cache within the sparse filesystem agent, a machine-local cache, and a AZ local cache. Third, convergent encryption of the chunks to deduplicate data and storing the deduplicated data in a content addressable store.
And finally (and in my opinion, the most interesting insight): erasure coding of the chunks, which involves breaking the chunk down into a number of smaller chunks (let’s call this N), of which only K (K < N) are needed to put back the original chunk. These N smaller chunks can be spread across multiple nodes, meaning individual node failure will not impact the availability of the AZ local cache, since the chunk can be recovered with fewer than N sub-chunks. Really fascinating stuff, especially how it helps improve tail latency.
3. What’s the Cost of a Millisecond? Avishai Ish-Shalom
This was a fun talk from SRECon on several contributing factors to latency and the impact of it. A number of “best practices” for high availability when naively implemented tend to add latency — retries, timeouts and queues in particular. High utilization typically often goes hand-in-hand with non-linear latency amplification. Same is true with high variance. Architectures with large fan-outs and worse, microservices architectures with deep critical paths are particularly vulnerable to latency amplification. The talk proposes solutions to help alleviate this problem: tuning timeouts (including having timeout budgets and propagating these upstream on a per-request basis), request racing and speculative execution, smarter retries, backpressure, load shedding, circuit breakers and the other usual suspects. Techniques to reduce variance are also discussed in length here.
4. The official ten-year retrospective of NewSQL databases: Andy Pavlo
This was a great and accessible talk from one of the foremost database experts that takes one on a whirlwind tour on the evolution and ultimate demise of “NewSQL”. The talk covers almost every main trend seen in the OLTP database market in the past two decades, from the rise and initial promise of NoSQL, to the rise of “NewSQL”, to the evolution of NoSQL to support all of the features that it had initially marketed itself as an antidote to, and how this impacted the NewSQL vendors, and what the future holds.
5. RAMP-TAO: Layering Atomic Transactions on Facebook’s Online TAO Data Store
Continuing in the vein of the need for transactional guarantees (that the evolution of NoSQL has proved beyond a shadow of doubt), this was a great paper on retrofitting the RAMP protocol with TAO (the eventually consistent graph datastore used at Facebook).
The RAMP (Read Atomic Multi-Partition Transaction) protocol was first described in the 2014 SIGMOD paper, which introduced a new isolation model (Read Atomic Isolation) that guaranteed cross-shard atomic visibility. The RAMP-TAO paper describes the need for transactions in TAO and how the RAMP protocol was modified to guarantee negligible impact on performance and storage overheads.
6. Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3: Bornholt et al.
This was a great paper on the practical applicability of formal methods. It describes the architecture of Shardstore, an LSM tree based storage engine powering S3, and how various components of it was validated using methods varying from property-based testing to model checking. What I found fascinating as how these techniques were integrated into mocks in normal unit tests.
7. Metastable Failures in Distributed Systems, Nathan Bronson et al.
This was a fantastic paper from HotOS 2021 on metastability, a concept that must be familiar to anyone building and running systems. As the paper describes:
metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.
The paper identifies many common “reliability” strategies as contributors to metastable failures, and then goes on to propose some potential solutions, the most interesting of which is the need for more sophisticated failure detectors, how they can better help finesse the feedback loop (which the paper casts as the “root cause” of such incidents), and how to leverage this to be able to better model and test with greater fidelity. Definitely worth a read if you happen to be building or running large scale systems.
8. Cores that don’t count: Hochschild et al.
Another fascinating paper from HotOS was this one from Google on “mercurial cores”. We’re increasingly moving to a world where the end of Moore’s Law means companies are developing custom silicon (“specialized instruction-silicon pairings”) for various custom use cases (mainly machine learning and AI), which lead to ephemeral computational errors not detected during manufacturing tests. The paper considers approaches like better detection and isolation mechanisms to address this problem, and then goes further and explores software techniques for tolerating silent data corruption.
Facebook, incidentally, published a similar paper titled Silent Data Corruptions at Scale that talks about problems in cores and goes on to issue another call-to-action for better collaboration and research.
data corruptions propagate across the stack and manifest as application level problems. These types of errors can result in data loss and can require months of debug engineering time. Common defect types observed in CPUs, using a real-world example of silent data corruption within a data center application, leading to missing rows in the database. We map out a high-level debug flow to determine root cause and triage the missing data. We determine that reducing silent data corruption requires not only hardware resiliency and production detection mechanisms, but also robust fault-tolerant software architectures.
9. TraceSplitter: A New Paradigm for Downscaling Traces: Sajal et al.
This was a cracking Eurosys paper on many of the pitfalls of downscaling traces (which is often omitted in most of the “discourse” you’d read on social media). The paper then proposes a novel approach: downscale traces very similar to how you’d load balance traffic. The general idea being that load balancing strategy is going be tailored to traffic patterns and what you wish to optimize (average and tail latency, for example). It then follows that downscaling traces based on the same criteria helps retain fidelity with the actual traffic.
10. Shard Manager: A Generic Shard Management Framework for Geo-distributed Applications
Sharding is a very commonly used technique, yet common sharding frameworks or libraries that enjoy widespread use are few and far between. The paper argues that existing sharding frameworks aren’t aware of the application lifecycle and treat all “unavailability” evens uniformly, even if regular deployments are far more frequent than power loss. Second, most sharding frameworks don’t support geo-distributed applications. Third, placement and load balancing heuristics used by existing frameworks are rigid and inflexible. Fourth, existing sharding frameworks prescribe a one-size-fits-all solution, while smaller applications require simplicity and more complex applications require customizability. The paper then describes their solution to these problems — a framework that works in lockstep with the underlying cluster manager to ensure zero downtime shard migrations, and uses a constraint solver for shard placement and load balancing. Along the way, there are also several great insights on shard configuration most preferred by users, load balancing and more.