From Interns, With Love

The world’s most powerful computer (1954) at Columbia University’s Watson Lab.

Spoke with our interns this year to understand what they were working on. Some version of this will post eventually find its way to the company’s engineering blog. To keep up with new writing, sign up for my newsletter.

While not exactly envious of our current crop of interns (cause, you know, the whole work from home thing), I’ll admit I find myself reminiscing back to when I was one myself. I’m still surprised they let me anywhere near the stuff they did. When I first interned four years ago, we had just declared a code yellow to focus our energy towards stabilizing CRDB. Having joined the newly-formed distributed query execution¹ team, but now with its attention directed elsewhere, what that meant for me was free rein to flesh out a few nifty things: distributed hash and merge joins², aggregation primitives (think SUM, COUNT, DISTINCT, etc.), and various sorting algorithms.

That was more than enough to rope me back in for a second internship. This time I brought my dear friend Bilal along, who similarly went on to intern twice. I even managed to sneak my brother in (a strictly worse engineer), also as a two-time intern.

All of which is to say that I think internships here can be pretty great. CRDB is a mostly-cool system to be working on, and we’re still at the point where we’re happy to let junior engineers take on work that, I think, would otherwise only be accessible to someone further along career-wise. This was true for me back when, and I’d say the same applied for our most recent cohort.

We hosted several interns over the year across various engineering teams, all working on projects deserving of full-length blog posts. Today however we’ll highlight two projects from our most recent batch and give a briefer treatment for the remaining.

1. Read-based compaction heuristics

Aaditya Sondhi interned on our Storage team to work on Pebble, a storage engine based on log-structured merge trees³⁴ (abbrev. LSMs). Aaditya worked on introducing read-based compactions to Pebble, but before diving into what that means, we’ll first need to understand what read-amplification and compactions are.

1.1 Compactions and read-amplification in LSMs

In LSMs, keys and values are stored as sorted strings in immutable blobs called SSTs (sorted string tables). SSTs are stacked across multiple levels (L1, L2, …), don’t overlap within a level, and when searching for a key that overlaps with multiple SSTs (necessarily across multiple levels), the one found at the higher level is considered authoritative. This brings us to read-amplification: the amount of physical work done (bytes read, number of disk seeks, blocks decompressed, etc.) per logical operation. When reading a key k from a two-level LSM, we may have to trawl through both if it isn’t found in the first.

That in turn brings us to compactions⁵. As data flows into higher level SSTs, LSMs maintain a healthy structure by compacting them into (fewer but larger) lower level SSTs. At one level (sorry) this lets LSMs reclaim storage (range deletion tombstones and newer revisions mask out older values), but also helps bound the read IOPS required to sustain a fixed workload. Like all things, this is counter-balanced⁶⁷⁸⁹ with the need to maintain sane write/space-amplification, which the rate of compactions directly play into.

An SST compaction; the L1 SST overlaps with two L2 SSTs and is compacted into it.

Figure 1. An SST compaction; the L1 SST overlaps with two L2 SSTs and is compacted into it.

(Aside: there’s something to be said about how storage engines are characterized in terms of resource utilization¹⁰ as opposed to unqualified throughput or latency. System-wide measures like $/tpmC¹¹ are another example of this. These feel comparatively easier to reason about, more useful for capacity planning, and easily verifiable.)

1.2 Optimizing compactions for read-amplification

Compacting LSMs based on reads isn’t a novel idea. It was originally implemented in google/leveldb, and later dropped in facebook/rocksdb. As for the Go re-implementation of it (golang/leveldb, incidentally where we forked Pebble from), it hasn’t ported over the heuristic yet. Part of the motivation for using a purpose-built storage engine was to let us pull on threads exactly like this.

We hypothesized that by scheduling compactions for oft-read key ranges, we could lower read amplification for subsequent reads, thus lowering resource utilization and improving read performance. In implementing it, we borrowed from the ideas present in google/leveldb. For every positioning operation that returned a user key (think Next, Prev, Seek, etc.), we sampled the key range (mediated by tunable knobs). The sampling process checked for overlapping SSTs across the various levels in the LSM. If an oft-read SST was found to overlap with ones from lower levels, it was scored higher to prioritize its compaction.

Benchmarks showing the effect of read-based compactions on throughput, read-amplification and write-amplification.

$ benchstat baseline-1024.txt read-compac-1024.txt
                    old ops/sec  new ops/sec  delta
ycsb/C/values=1024    605k ± 8%   1415k ± 5%  +133.93%  (p=0.008 n=5+5)

                    old r-amp    new r-amp    delta
ycsb/C/values=1024    4.28 ± 1%    1.24 ± 0%   -71.00%  (p=0.016 n=5+4)

                    old w-amp    new w-amp    delta
ycsb/C/values=1024    0.00         0.00           ~     (all equal)


$ benchstat baseline-64.txt read-compac-64.txt
                  old ops/sec  new ops/sec  delta
ycsb/B/values=64    981k ±11%   1178k ± 2%   +20.14%  (p=0.016 n=5+4)

                  old r-amp    new r-amp    delta
ycsb/B/values=64    4.18 ± 0%    3.53 ± 1%   -15.61%  (p=0.008 n=5+5)

                  old w-amp    new w-amp    delta
ycsb/B/values=64    4.29 ± 1%   14.86 ± 3%  +246.80%  (p=0.008 n=5+5)

Figure 2. Benchmarks showing the effect of read-based compactions on throughput, read-amplification and write-amplification.

As expected, we found that read-based compactions led to significant improvement in read heavy workloads. Our benchmarks running YCSB-C 100% reads) using 1KB writes saw read amplification reduced by ~71% and throughput increased by ~133%. With YCSB-B (95% reads) using small value reads/writes (64 bytes), we reduced read-amplification by ~15% which led to a throughput increase of ~20%. These benchmarks targeted Pebble directly, and there’s still a bit of legwork to be done around parameter tuning (we’re necessarily trading off some write-amplification in this process), but the results are encouraging.

2. Query denylists (and our RFC process)

Angela Wen interned on our SQL Experience team, which owns the frontier where SQL clients meet the database. During her internship Angela worked on introducing a mechanism to gate certain classes of queries from being run against the database. This was motivated by our cloud SREs running large CRDB installations, and wanting the ability to deny queries(-of-death¹²) when emergent situations call for it (think circuit breakers¹³).

Angela’s experience captures the kind of broad leeway accorded to interns that I’m arguing we do a bit better than elsewhere. A general purpose query denylist is a very open-ended problem, with many personas you could design it for, and one took deliberate effort to build consensus on. The process we use to structure these conversations are RFCs, and we ended up authoring one here as well.

The RFC and the ensuing discussions clarified who the intended users were, the must haves/nice-to-haves, catalogued the various classes of deniable queries, and most importantly, outlined the actual mechanics of the denial itself. For all my gripes with RFCs, I find the process of actually writing one edifying. It can foster real agency over a component’s design and works decently well as a pedagogical tool (also I imagine it’s cool to have public design documents to share with friends similarly into query denylists).

We ended up eschewing our original proposal to implement file-mounted regex-based denylists (the contentions here being around usability, deployment, etc.) in favor of cluster settings of the form:

SET CLUSTER SETTING feature.changefeed.enabled = FALSE;
SET CLUSTER SETTING feature.schema_change.enabled = TRUE;

Configuration changes were made to disseminate cluster-wide by means of gossip¹⁴. Individual nodes listen in on these updates use the deltas to keep an in-memory block-cache (sorry) up-to-date. This is later checked against during query execution to determine whether or it’s an allowable operation.

Like mentioned earlier, we scrapped lots of alternate designs during this process, and were better off for it. We re-sized our scope to focus instead on certain classes of queries as opposed to more granularly matching specific ones. This came after observing that a vast majority of problematic queries during prior incidents were well understood, and could be structurally grouped/gated wholesale. That said, we modularized our work to make it simple to introduce new categories as needed.

3. Observability, design tokens, data-loss repair, and more

We hosted a few other interns this semester, and there’s much to be said about their individual contributions. We typically structure our programs to have folks work on one or two major projects, building up to them with starter ones. Here we’ll briefly touch what these were.

3.1 Query runtime statistics

The query execution plan for a full table scan followed by an AVG.

Figure 3. The query execution plan for a full table scan followed by an AVG.

Cathy Wang interned on our SQL Execution team and worked on improving observability for running queries. We have some existing infrastructure in place to surface various execution statistics. Cathy built upon this to include details about network latencies (useful for debugging queries run within geo-distributed clusters), structured our traces to break down how much time is spent across various layers in the system, and tacked on memory utilization to our traces to surface exactly how much memory is in-use during any point mid-execution. This last bit is worth elaborating on: Go’s garbage collector doesn’t give us fine-grained control over allocations, and to that end a result we’ve had to design our own memory accounting/monitoring infrastructure to closely track usage during a query’s lifecycle. By exposing these internal statistics, we expect developers to better understand the memory footprint of individual queries and to tune them accordingly.

3.2 Design tokens

Pooja Maniar interned within the Cloud organization, specifically on the Console team. One of the projects she worked on was consolidating and standardizing our design tokens. Think of these as abstractions over visual properties, variables to replace hardcoded color palettes, fonts, box shadows on pressed buttons, etc. The motivation here was to limit the number of design decisions developers had to make, whether it be choosing between specific hexcodes, UI components, etc. We wanted to create and hoist guidelines into a centralized, shared repo and then integrate it into our several console pages (accessible both through the database itself and through the cloud offering). We were also partway through a brand-refresh at the time, and Pooja’s grand unification helped ensure brand consistency throughout.

3.3 Quorum recovery

Sam Huang interned on the KV team (they let me mentor this fellow), and one of the projects we worked on was to introduce a quorum recovery mechanism within CRDB. Because CRDB is built atop raft-replicated key-ranges, when a cluster permanently loses quorum for a given set of keys (think persistent node/disk failures), it’s unable to recover from it. This necessarily entails data-loss, but we still want the ability to paper over such keys and provide tooling for manual repair. Sam worked on introducing an out-of-band mechanism to reset the quorum for a given key-range, and somewhat cleanly, we were able to leverage existing Raft machinery to do so. This came from the observation that if we were to construct a synthetic snapshot (seeded using data from extant replicas, if any), and configured it to specify a new set of participants, we would essentially trick the underlying replication sub-system into recovering quorum for this key-range. Our synthetic snapshot incremented the relevant counters to come after the existing data, which also in-turn purged older replicas from the system.

3.4 Metamorphic schema changes

Jayant Shrivastava interned on our SQL Schemas team, and spent his time here ruggedizing our schemas infrastructure. CRDB makes use of several advanced testing strategies to ensure correctness and stability, including use of fuzzers, metamorphic and chaos testing, Jepsen¹⁵, and much more. Having observed some latent fragility in this area recently, Jayant fleshed out an equivalent test harness but focusing instead on schema changes. We constructed a workload generator to execute randomly generated DDL statements, executing within the confines of individual transactions. These statements generate and drop tables on the fly, do the same for columns with randomized types, and are executed concurrently with statements issued against those very tables/columns. We leveraged metamorphic methods here by asserting against the invariants of the system rather than specific outputs (things like transactions that have read from a certain column should expect to always find it in subsequent reads). Put together we were able to cover a large space of possible interleavings and uncovered several critical bugs in the process.

3.5 Import compatibility

Monica Xu took a brief hiatus from her aspiring music career to intern on our Bulk IO team. Her team’s broadly responsible for getting data in and out of CRDB as fast as possible (specifically import/export and backup/restore). Monica made several contributions in this area, including enabling progress tracking for dump files, supporting dry run imports, and improving pg_dump¹⁶ compatibility. There were kinks to be work out with the latter seeing as how CRDB only supports a subset of Postgres syntax, which can be problematic when processing pg_dump files as is. The particular set of questions Monica helped address was what reasonable behavior is when chewing through potentially destructive import directives. Think DROP TABLE [IF EXISTS], or CREATE VIEW, which is particularly tricky given it stores the results of the query it was constructed using, results subject to change during the import process. Monica engaged with our product teams when forming these judgements (we now simply defer to the user with instructive messaging), and helped significantly ease the onboarding experience for developers migrating off of their existing installations.

4. Parting thoughts

If you’re still here and interested, hit us up. And don’t let the database-speak throw you off, most of us didn’t know any of it coming in.

Radu Berinde, Andrei Matei. 2016. Distributing SQL Queries in CockroachDB. ⤻
Raphael Poss. 2017. On the Way to Better SQL Joins in CockroachDB ⤻
Arjun Narayan, 2018. A Brief History of Log Structured Merge Trees. ⤻
Arjun Narayan, Peter Mattis. 2019. Why we built CockroachDB on top of RocksDB. ⤻
Siying Dong, [n.d.]. Leveled Compactions in RocksDB. ⤻
Mark Callaghan, 2018. Read, Write & Space Amplification – Pick Two. ⤻
Mark Callaghan, 2018. Describing Tiered and Leveled Compactions. ⤻
Mark Callaghan, 2018. Name that Compaction Algorithm. ⤻
Mark Callaghan, 2018. Tiered or Leveled Compactions, Why Not Both?. ⤻
Nelson Elhage, 2020. Performance as Hardware Utilization. ⤻
TPC-C, [n.d.]. What is TPC-C. ⤻
Mike Ulrich, 2017. Site Reliability Engineering, Addressing Cascading Failures. ⤻
Martin Fowler, 2014. Circuit Breakers. ⤻
Abhinandan Das, Indranil Gupta, et. al. 2002. SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol. ⤻
Kyle Kingsbury, 2016. Jepsen Testing CockroachDB. ⤻
PostgreSQL 9.6.20 Documentation, [n.d.]. pg_dump. ⤻

To get posted on new writing, sign up for my (entirely inactive) newsletter.