LSM trees: write fast now, pay later

You are ingesting an event stream into a B-tree-backed table and the throughput graph is flat at 5,000 rows per second. The SSD underneath is rated for ten times that on sequential writes and its utilization sits at twenty percent. Nothing is saturated, yet the writer will not go faster. The bottleneck is not bandwidth — it is that every insert lands on a random page, and random writes are the one access pattern that turns a fast disk into a slow one.

This is the problem log-structured merge trees were built to solve, and the way they solve it is a trade you should understand before you take it. Writes get cheap because they stop touching random locations. Reads get more expensive, and they stay that way until a background job catches up. The whole design is a bet that you can defer the read cost and pay it off the critical path, and the failure modes are all about what happens when the background falls behind.

The naive approach

Update in place. A B-tree keeps keys sorted across fixed-size pages, and a write finds the leaf page the key belongs on, modifies it in the buffer pool, and marks it dirty so it gets written back to its home location on disk. Reads are excellent — a point lookup is a logarithmic walk down the tree, one page per level, and a real B-tree is shallow, so a lookup is three or four page reads. For a read-heavy workload this is hard to beat.

The write path is where it strains. The leaf a key belongs on is determined by the key’s value, and an event stream’s keys arrive in no particular page order. So consecutive inserts dirty pages scattered across the file, and each dirtied page is eventually a write to wherever that page sits on disk.

Why it breaks

A random write costs far more than the bytes it moves. The storage device wants large sequential transfers; a 4 KB write to a random offset forces a read-modify-write of the surrounding erase block on an SSD, or a seek on spinning media. You change one row and the device rewrites a whole page or block — that is write amplification, and on a B-tree it rides on top of every insert.

It gets worse under concurrency. Page splits, when a leaf fills and has to divide, cascade upward and serialize against other writers fighting for the same pages. Throughput floors not because the disk is full but because the access pattern is the worst case the disk has. The SSD is bored because you are feeding it the one thing it is bad at.

An append-only structure buys you out of this. If every write goes to the end of a file instead of to a key-determined page, writes become sequential, batchable, and free of in-place page rewrites. The catch is obvious: append-only means the data is in arrival order, not key order, and a database that cannot find a key by value is useless. The LSM tree’s entire mechanism is the machinery for keeping reads possible on top of an append-only write path.

The real mechanism

An LSM tree keeps data in three places and lets writes touch only the cheapest one. The memtable is an in-memory sorted map; every write lands here first — a put inserts a key/value pair, a delete inserts a tombstone marking the key dead. A write is an insert into a sorted in-memory structure. No disk, no seek, no page split.

When the memtable fills, it is flushed: its already-sorted contents are written out in one sequential pass as an immutable SSTable at level 0. That is the trick — because the memtable was sorted in memory, the flush is a single sequential write, the access pattern the disk loves. L0 SSTables are each internally sorted, but their key ranges may overlap freely, because each is a dump of whatever the memtable held at flush time.

A background compaction job later merges the L0 files together with the L1 files, drops superseded versions and tombstones, and writes fresh L1 files whose key ranges do not overlap. A point read walks the structures newest-first: memtable, then each L0 file (every one, since ranges overlap and any could hold the key), then the single L1 file whose range covers the key. Read amplification is the number of structures a read had to touch.

FIG. 01 — LSM TREE

MEMTABLEL0 (overlapping)L1 (sorted, non-overlapping)

memtable 0/8·L0: 0 files (0 entries)·L1: 0 files (0 entries)

compaction merges ALL L0 with overlapping L1 here; real engines pick overlapping files only (event log reports genuine overlap counts)

readAmp last —

readAmp avg —

writeAmp 1.0

spaceAmp 1.0

tombstones 0

press Read to trace a path

●value hit◆tombstone hit✕miss▬memtable / L0▬L1▪table holds tombstones▣compaction pressure

The four experiments below, in order: fill and flush the memtable; trace a read; pile up L0 with auto-compact off and watch readAmp redline; then Compact and re-read. Each one isolates a different half of the write-fast/read-later trade.

Start with the write path. Press Write about ten times and watch the memtable bar at the top fill one entry at a time. Turn on auto-flush and keep going: at eight entries the memtable drains to a new L0 SSTable and clears, and an L0 row appears spanning the key range that file covers. That flush was the entire cost of making those eight writes durable on disk — one sequential append, not eight random page writes. This is the side of the bargain that is pure profit.

Now trace a read. Press Read and follow the read-path panel: it lists every structure the probe touched, in order, and marks where the search stopped. With one L0 file and an empty L1, a found key costs one or two probes. Count them — that count is the readAmp last number in the stats row, and right now it is small because there is almost nothing to search.

Here is where the debt accrues. Turn on auto-write and auto-flush but leave auto-compact off. Writes stream in, the memtable flushes repeatedly, and L0 files pile up. Once L0 holds four files the compaction-pressure badge lights — four is the L0 compaction threshold, chosen to match a real RocksDB default we will get to. Keep pressing Read. Because L0 ranges overlap, a read for a key not in the upper files has to probe every L0 file before it reaches L1, so the readAmp last counter climbs past six and turns red. The write path cost nothing; reads are now paying interest on every file the background never merged.

Now settle the debt. Press Compact once. Every L0 file merges with L1, duplicates and tombstones drop out, and L0 collapses to empty while L1 holds a handful of non-overlapping files. Press Read again and compare the trace length to what it was a moment ago: a read that touched six structures now touches one or two, because there is at most one L1 file per key range and nothing overlapping above it. That collapse — from a long probe list back to a short one — is the entire reason compaction exists.

Failure modes and costs

The three amplifications are the standing costs of the design, and the figure meters all three.

Read amplification is the one you just watched redline. It tracks L0 file count almost directly, because every overlapping L0 file is another structure a missed key must probe. It is bounded only by how far compaction is allowed to fall behind, which is why “pile up L0 with compaction off” is a real production incident, not a toy: p99 read latency degrades over hours while the write path looks perfectly healthy, because the writes are still cheap and only the reads are paying.

Write amplification is the cost of the cleanup itself. Watch the writeAmp stat jump the moment you press Compact: the same logical bytes are now written three times over — into the memtable, out to L0 on flush, and rewritten again into L1 during compaction. Compaction is not free; it trades the write savings back, a little at a time, to keep reads fast. The numbers seem upside down at first — you adopted an LSM tree to reduce write cost, and compaction adds write cost — but the writes it adds are sequential and off the critical path, which is the whole point.

Space amplification is the obsolete versions and dead tombstones sitting on disk waiting to be collected. Every overwrite leaves the old value live in an older file until compaction reconciles them; every delete leaves a tombstone. Watch spaceAmp drift above one as L0 accumulates duplicate keys, then fall back toward one after a Compact reclaims them.

Tombstones deserve their own look, because their lifecycle is a trap. A delete does not erase anything — it writes a marker that shadows the key for any read that reaches it. Try it: press Delete a few times, then Read until the trace lands on a deleted key. The read stops at a HIT tombstone (deleted) probe and reports “not found”; the marker did its job. But the tombstone itself is now data that has to be carried in every file and probed on every read until it reaches the bottommost level and can finally be dropped. Press Compact: the tombstones drop at L1 because L1 is the bottom here, the tombstone count falls, and subsequent reads miss cleanly without a marker to stop at. A workload that deletes heavily can accumulate enough tombstones to inflate both read and space amplification on its own; the marker that makes deletes cheap is itself a cost deferred to compaction.

That is the “pay later” framing made literal. The compaction queue is debt, and like any debt it is fine until you stop servicing it. Let it grow — auto-compact off, writes streaming in — and read amplification, write amplification, and space amplification all climb together, because they are three views of the same unpaid balance.

How RocksDB actually does this

The sim is the shape of the mechanism; RocksDB is the mechanism shipped to production, descended from Google’s LevelDB and sharing the lineage that also produced Cassandra and ScyllaDB. The defaults below are checkable against the RocksDB source and wiki linked at the end.

The memtable is sized by write_buffer_size, default 64 MB — far larger than the sim’s eight entries, which exists only so a flush is visible in a handful of clicks. A column family keeps up to max_write_buffer_number memtables, default 2, so writes continue into a fresh memtable while a full one flushes in the background.

L0 compaction fires at level0_file_num_compaction_trigger, default 4 — the exact threshold the sim’s compaction-pressure badge uses, because four overlapping L0 files is where read amplification starts to bite in the real system too. The levels below grow geometrically: max_bytes_for_level_base sets L1’s target total size at 256 MB by default, and max_bytes_for_level_multiplier, default 10, makes each level roughly ten times the one above it, so a real tree stacks L0 through L6 with compaction cascading down — not the single L0-to-L1 step the sim models.

The single biggest reason the sim’s read amplification is pessimistic is that it has no bloom filters. RocksDB attaches a bloom filter to each SSTable — by default 10 bits per key, for roughly a 1% false-positive rate — and a point lookup checks the filter before touching the file. If the filter says the key is definitely absent, the file is skipped without a read. So in production a point read probes only the handful of files that might actually hold the key, and read amplification sits far below the file count, even with many overlapping L0 files. The sim deliberately omits blooms so that file pile-up is visible as rising read amplification; that omission is labeled in the SIMPLIFICATIONS list.

The figure below is the bloom filter mechanism in miniature. Add a few keys to set bits in the 64-bit array, then probe members and non-members to see the three-position check in action. Query non-members repeatedly until you land a false positive — all three probed bits happen to be set by earlier keys, so the filter cannot rule the key out and the SSTable would be read for nothing. Members never miss; the counter stays at zero no matter how many you query.

FIG. 02 — BLOOM FILTER

BIT ARRAY — 64 BITS — k=3 HASH FUNCTIONS

add keys, then query members and non-members

keys added 0

queries 0

tables skipped 0

false positives 0

set bitclear bitprobed clear (absent)probed set

Coach: query non-members until you land a false positive — and notice members never miss. At ~12 keys the false-positive rate sits around 12%; reset and the bit array clears back to zero.

One architectural fork worth a sentence: the default here is leveled compaction, which keeps each level a single non-overlapping sorted run and pays more write amplification to hold read and space amplification down. RocksDB also offers universal (size-tiered) compaction, which merges similarly-sized runs and cuts write amplification at the cost of higher read and space amplification — the same three knobs, turned the other way. Cassandra and ScyllaDB default to tiered strategies for the same write-amplification reason.

A few more deliberate simplifications, all in the SIMPLIFICATIONS list and labeled where they matter: there is no write-ahead log here, so an unflushed memtable would not survive a crash — a real engine logs every memtable mutation first, which is the durability half of the story told in explainer #1. The tree is only L0 and L1, not L0 through L6. The thresholds are tiny so every key and file is visible. And compaction here merges all L0 files with the whole of L1 and re-partitions by even key-range splitting, rather than RocksDB’s pick of only the overlapping subset by target file size — the event log still reports how many L1 files genuinely overlapped.

When to reach for which

Reach for an LSM tree when the workload is write-heavy and the storage rewards sequential I/O: ingest pipelines, time-series, event logs, write-amplification-sensitive flash. You are trading read latency and background CPU for a write path that does not thrash the disk, and bloom filters claw most of the read latency back. Reach for a B-tree when reads dominate and writes are modest, especially latency-critical point lookups where a predictable three-page descent beats a probe across memtable, L0, and L1 plus a bloom check — an LSM read is cheap on average but has a longer tail when compaction is behind.

Whichever you run, the two numbers to put on a dashboard are pending compaction bytes and L0 file count. They are the live readout of the debt: when they climb and stay up, the background has fallen behind the foreground, and your read latencies are about to follow. The question this leaves open — how to choose a compaction strategy that balances the write, read, and space costs for a given workload, and what universal compaction, partitioned indexes, and blob storage change about that balance — is its own piece; the parking-lot notes sketch it.

Sources

RocksDB wiki, Leveled Compaction — L0 files overlap while L1+ are non-overlapping sorted runs, and an L0→L1 compaction normally picks up all L0 files because they overlap.
RocksDB source, include/rocksdb/options.h and include/rocksdb/advanced_options.h — write_buffer_size 64 MB, level0_file_num_compaction_trigger 4, max_bytes_for_level_base 256 MB, max_bytes_for_level_multiplier 10, max_write_buffer_number 2.
RocksDB wiki, RocksDB Bloom Filter — ~10 bits per key for roughly a 1% false-positive rate, letting a point lookup skip SSTables that provably lack the key.
RocksDB wiki, Universal Compaction — tiered compaction lowers write amplification at the cost of higher read and space amplification versus leveled.
Martin Kleppmann, Designing Data-Intensive Applications, ch. 3 — SSTables, LSM trees, and the LSM-versus-B-tree comparison.
Alex Petrov, Database Internals, Part I — LSM storage, memtable/SSTable flush, compaction, and read/write/space amplification.
LevelDB documentation — the memtable/log/SSTable structure and level compaction the RocksDB lineage descends from.