One size does not fit all; it’s important to have tailored data systems. NoSQL is a new paradigm that has taken a large market share.

In relational systems, tables have rows and columns with a well-defined schema. The data model is designed to fit the data, rather than the application’s functionality.

In NoSQL systems, unstructured documents or files are stored without a schema. The data is stored in an application-friendly way.

The Log-Structured Merge-Tree (LSM-tree) was invented in 1996, but wasn’t widely used for almost 10 years. Now, it’s one of the most popular data structures for storage engines. It offers fast ingestion and competitive reads, at some cost in memory. This is great for today’s workloads, which need high ingestion rates. Most importantly, the LSM-tree is tunable.

An LSM-tree stores key-value pairs in levels, where each level is larger than the previous. These levels store files of pages on storage. Before level 1 sits an in-memory buffer.

There are three main design principles, illustrated in the sketch below:

  1. Buffering ingestion
  2. Immutable files on storage
  3. Periodic compaction
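
To make these principles concrete, here is a minimal sketch (Python; names like LSMTree and SIZE_RATIO are made up, and real systems keep runs as immutable files on disk with fence pointers and filters rather than in-memory lists):

    from bisect import bisect_left

    BUFFER_CAPACITY = 4   # entries buffered in memory before a flush
    SIZE_RATIO = 3        # T: growth factor / runs allowed per level

    class LSMTree:
        def __init__(self):
            self.buffer = {}      # in-memory and mutable
            self.levels = []      # levels[i] = list of immutable sorted runs

        def put(self, key, value):
            self.buffer[key] = value
            if len(self.buffer) >= BUFFER_CAPACITY:
                self._flush()

        def get(self, key):
            if key in self.buffer:
                return self.buffer[key]
            for level in self.levels:              # smaller levels hold newer data
                for run in reversed(level):        # newest run in the level first
                    keys = [k for k, _ in run]
                    i = bisect_left(keys, key)
                    if i < len(run) and run[i][0] == key:
                        return run[i][1]
            return None

        def _flush(self):
            run = sorted(self.buffer.items())      # becomes an immutable sorted run
            self.buffer = {}
            self._add_run(0, run)

        def _add_run(self, level_no, run):
            while len(self.levels) <= level_no:
                self.levels.append([])
            self.levels[level_no].append(run)
            # Periodic compaction: once a level holds SIZE_RATIO runs, merge them
            # into one run and push it to the next level (a tiering-style policy).
            if len(self.levels[level_no]) >= SIZE_RATIO:
                merged = {}
                for r in self.levels[level_no]:    # oldest to newest
                    merged.update(dict(r))         # newer versions overwrite older
                self.levels[level_no] = []
                self._add_run(level_no + 1, sorted(merged.items()))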

Performance Analysis

The buffer holds $B \cdot P$ entries, where $P$ is the number of pages in the buffer and $B$ is the number of entries per page. Each subsequent level holds $T$ times more entries than the previous, where $T$ is the size ratio.

If there are $N$ entries in an LSM-tree, there are $L = \log_T\left(\frac{N}{B \cdot P} \cdot \frac{T-1}{T}\right)$ levels (this is the result of a geometric series). Since $\frac{T-1}{T}$ behaves similar to a constant, we can simplify to $L \approx \log_T\left(\frac{N}{B \cdot P}\right)$.

In big-O notation, we get $O\left(\log_T \frac{N}{B \cdot P}\right)$ levels.
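
As a quick sanity check, here is that arithmetic with made-up parameter values (Python; any realistic numbers work):

    N = 1_000_000_000   # total entries (hypothetical workload)
    B = 100             # entries per page
    P = 1_000           # pages in the buffer
    T = 10              # size ratio

    levels, level_capacity = 0, B * P        # buffer holds B * P = 100,000 entries
    while level_capacity < N:
        level_capacity *= T                  # each level is T times larger
        levels += 1
    print(levels)   # 4 -- capacities 1M, 10M, 100M, 1B; matches log_T(N / (B * P))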

Let’s look at ingestion. $B$ entries are flushed in one I/O, so a flush costs $O\left(\frac{1}{B}\right)$ per entry. We also have to consider leveling operations; in the lifetime of an entry, it will be rewritten every time a merge occurs, up to $T$ times per level. This happens on $\log_T \frac{N}{B \cdot P}$ levels, so the total amortized cost would be $O\left(\frac{T}{B} \log_T \frac{N}{B \cdot P}\right)$ I/Os per entry. In a tiered LSM, the data is never rewritten within a level, so the cost drops to $O\left(\frac{1}{B} \log_T \frac{N}{B \cdot P}\right)$.
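
Plugging in the same hypothetical numbers shows how cheap ingestion is per entry for either layout (illustrative constants only):

    B, T = 100, 10
    L = 4                     # number of levels from the previous example

    leveled = (T / B) * L     # each entry rewritten up to T times on every level
    tiered = (1 / B) * L      # each entry written once per level
    print(leveled, tiered)    # 0.4 vs 0.04 amortized I/Os per entry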

When we query point lookups, fence pointers can limit the number of I/Os per sorted run to 1. For leveled trees, this costs $O\left(\log_T \frac{N}{B \cdot P}\right)$, since there is one run per level. For tiered trees, this costs $O\left(T \cdot \log_T \frac{N}{B \cdot P}\right)$, since there are up to $T$ runs per level. We can use filters to improve this cost. A filter probabilistically reports on membership; it has no false negatives and has false positives with tunable probability $\epsilon$.

Some filters only support point queries, like bloom, cuckoo, XOR, and ribbon filters. In this class, we’ll focus on bloom filters. A bloom filter is a bitmap of fixed size. It hashes each key with $k$ hash functions and sets the corresponding bits. The false positive rate $\epsilon$ depends on the memory available for the filter (bits per entry, $\frac{M}{N}$) and the number of hash functions. It turns out the optimal number of hash functions is $k = \frac{M}{N} \ln 2$. The optimal $\epsilon \approx 0.6185^{M/N}$. With $\frac{M}{N} = 10$ bits per entry, $\epsilon \approx 1\%$!
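
Here is a minimal bloom filter sketch (Python; the hash construction and class name are made up for illustration) that also checks the $0.6185^{M/N}$ rule of thumb empirically:

    import hashlib, math

    class BloomFilter:
        def __init__(self, n_entries, bits_per_entry):
            self.m = n_entries * bits_per_entry                    # total bits M
            self.k = max(1, round(bits_per_entry * math.log(2)))   # optimal k = (M/N) ln 2
            self.bits = bytearray((self.m + 7) // 8)

        def _positions(self, key):
            for i in range(self.k):
                h = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, key):
            # No false negatives; false positives with probability ~0.6185^(M/N).
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

    bf = BloomFilter(n_entries=10_000, bits_per_entry=10)
    for i in range(10_000):
        bf.add(f"key{i}")
    false_positives = sum(bf.might_contain(f"absent{i}") for i in range(10_000))
    print(false_positives / 10_000, 0.6185 ** 10)   # both come out around 0.008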

Monkey introduces a way to optimize $\epsilon$ for each filter according to the amount of memory allocated to the filters. This is better than allocating an even number of bits per entry to each filter (like most commercial systems do), since bits get you farther on the smaller levels. In practice, this is an order of magnitude faster.
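
The intuition can be checked with a small back-of-the-envelope calculation (illustrative numbers, not Monkey’s actual closed-form allocation): under the same total memory budget, skewing bits per entry toward the small levels lowers the sum of false positive rates, which is what an empty point lookup pays for.

    # Expected false-positive I/Os for an empty lookup: uniform vs skewed filters.
    T = 10
    level_sizes = [10**5 * T**i for i in range(1, 5)]   # 1M, 10M, 100M, 1B entries

    def fpr(bits_per_entry):                 # bloom filter rule of thumb
        return 0.6185 ** bits_per_entry

    total_bits = sum(level_sizes) * 10       # overall budget: 10 bits per entry

    uniform = sum(fpr(10) for _ in level_sizes)   # every level gets 10 bits/entry

    # Skewed, Monkey-like allocation: generous on the small levels, slightly
    # stingier on the huge last level, within the same overall memory budget.
    skewed_bits = [25, 20, 14, 9.4]
    assert sum(b * s for b, s in zip(skewed_bits, level_sizes)) <= total_bits
    skewed = sum(fpr(b) for b in skewed_bits)

    print(uniform, skewed)   # ~0.033 vs ~0.012 expected wasted I/Os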

The cost of a non-empty point lookup would be $O\left(1 + e^{-M/N} \cdot \log_T \frac{N}{B \cdot P}\right)$ and the cost of an empty point lookup is $O\left(e^{-M/N} \cdot \log_T \frac{N}{B \cdot P}\right)$.

Range lookups are a bit more complicated… In a leveled tree, this comes out to $O\left(\frac{s}{B} + \log_T \frac{N}{B \cdot P}\right)$, where $s$ is the selectivity of the query (the number of qualifying entries). A tiered tree works about the same, although there are up to $T$ times more sorted runs to read, giving $O\left(T \cdot \left(\frac{s}{B} + \log_T \frac{N}{B \cdot P}\right)\right)$.

For the other data structures, we also assume there’s a buffer to make the comparison more fair.

                   | Ingestion cost                              | Point lookup cost*                                         | Long range lookup cost   | Short range lookup cost
Leveled LSM-tree** | $O(\frac{T}{B} \log_T \frac{N}{B \cdot P})$ | $O(1 + e^{-M/N} \cdot \log_T \frac{N}{B \cdot P})$         | $O(\frac{s}{B})$         | $O(\log_T \frac{N}{B \cdot P})$
Tiered LSM-tree**  | $O(\frac{1}{B} \log_T \frac{N}{B \cdot P})$ | $O(1 + T \cdot e^{-M/N} \cdot \log_T \frac{N}{B \cdot P})$ | $O(T \cdot \frac{s}{B})$ | $O(T \cdot \log_T \frac{N}{B \cdot P})$
B-tree             | $1$                                         | $1$                                                        | $O(\frac{s}{B})$***      | $1$
Sorted array       | $O(\frac{N}{B})$                            | $1$                                                        | $O(\frac{s}{B})$         | $1$
Log                | $O(\frac{1}{B})$                            | $O(\frac{N}{B})$                                           | $O(\frac{N}{B})$         | $O(\frac{N}{B})$

* With fence pointers and bloom filters; remove the $1$ to get the cost of empty lookups
** Monkey shaves off another factor of $\log_T \frac{N}{B \cdot P}$
*** 1.5x costlier than LSM-trees

From best to worst:

  • Ingestion: log, tiered LSM-tree, leveled LSM-tree, B-tree, sorted array
  • Point lookups: B-tree / sorted array, leveled LSM-tree, tiered LSM-tree (with filters, the LSMs are close to 1 I/O), log
  • Long range lookups: leveled LSM-tree / tiered LSM-tree / sorted array, B-tree, log
  • Short range: B-tree / sorted array, leveled LSM-tree, tiered LSM-tree, log

Note that the LSM-trees are never the worst in anything. A B-tree is quite good but suffers relatively in ingestion cost. Because of the massive amount of data being consumed, LSM-trees have become more popular.

Memory Buffer

The memory buffer is a key part of LSM-trees. There are a few common implementations: vector, skip-list, hash hybrids (e.g., a hash skip-list), and trie.

The skip-list is often the default. It has pretty good performance across operations, but slower inserts ($O(\log n)$ per insert). A vector has constant-time inserts and low space overhead, but expensive point queries, since an unsorted vector has to be scanned. The optimal buffer depends on the workload composition.
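
For reference, here is a bare-bones skip-list sketch (Python; no deletes, no concurrency), where the randomized towers give the $O(\log n)$ inserts and lookups while keeping keys sorted, so a flush can just walk the bottom level:

    import random

    MAX_HEIGHT = 16
    PROMOTE_P = 0.5   # probability a node grows one level taller

    class Node:
        def __init__(self, key, value, height):
            self.key, self.value = key, value
            self.next = [None] * height            # one forward pointer per level

    class SkipList:
        def __init__(self):
            self.head = Node(None, None, MAX_HEIGHT)

        def _random_height(self):
            h = 1
            while h < MAX_HEIGHT and random.random() < PROMOTE_P:
                h += 1
            return h

        def insert(self, key, value):
            update = [self.head] * MAX_HEIGHT
            node = self.head
            for lvl in range(MAX_HEIGHT - 1, -1, -1):        # search top-down
                while node.next[lvl] and node.next[lvl].key < key:
                    node = node.next[lvl]
                update[lvl] = node                           # predecessor at this level
            found = node.next[0]
            if found and found.key == key:                   # overwrite in place
                found.value = value
                return
            new = Node(key, value, self._random_height())
            for lvl in range(len(new.next)):                 # splice into each level
                new.next[lvl] = update[lvl].next[lvl]
                update[lvl].next[lvl] = new

        def get(self, key):
            node = self.head
            for lvl in range(MAX_HEIGHT - 1, -1, -1):
                while node.next[lvl] and node.next[lvl].key < key:
                    node = node.next[lvl]
            node = node.next[0]
            return node.value if node and node.key == key else None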

Some implementations need further tuning. Hash hybrids require tuning the prefix length and the bucket count. The prefix length decides which characters of a key to hash on. We can implement buckets as a linked-list, a skip-list, or just a vector. Hash hybrids are great for mixed workloads but also require more space to store the extra structures.

Tries are space optimized and good for point and range queries, especially since random accesses are reduced compared to similar tree structures.

Compactions

The design questions that a compaction algorithm must answer are:

  1. Data layout: How to organize the data on the device?
  2. Granularity: How much data to move at a time?
  3. Movement policy: Which block of data should be moved?
  4. Compaction trigger: When to re-organize the data layout?

The classical layout design is leveling, where we merge immediately so that there’s only ever one sorted run per level.
Tiering, on the other hand, delays merging: runs accumulate within a level and are only merged (and pushed to the next level) once the level fills up with $T$ runs.
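
A tiny sketch of the two policies’ per-level behavior (Python; runs are just sorted lists of key-value pairs, and real leveling would also spill a run to the next level once it outgrows the level’s capacity):

    T = 4   # size ratio / max runs in a tiered level

    def merge(runs):
        out = {}
        for run in runs:            # oldest to newest
            out.update(dict(run))   # newer versions overwrite older ones
        return sorted(out.items())

    def leveling_add(level_runs, incoming):
        """Merge immediately: the level always holds exactly one run."""
        return [merge(level_runs + [incoming])], None

    def tiering_add(level_runs, incoming):
        """Accumulate runs; merge and push down only when the level is full."""
        runs = level_runs + [incoming]
        if len(runs) < T:
            return runs, None                  # nothing rewritten
        return [], merge(runs)                 # level emptied; merged run moves down

    lvl_runs, tier_runs = [], []
    for flush in ([("a", 1)], [("b", 2)], [("a", 3)]):
        lvl_runs, _ = leveling_add(lvl_runs, flush)
        tier_runs, _ = tiering_add(tier_runs, flush)
    print(len(lvl_runs), len(tier_runs))   # 1 run (rewritten every flush) vs 3 runs (untouched)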

One of the downsides of tiering is more space amplification, which we’ll define as the ratio of the logically invalid data size to the total data size. With leveling, the worst case is when everything in the smaller levels is an update to an entry in the last level, which comes out to approximately $\frac{1}{T}$ (about 10% of the data is invalid for $T = 10$). With tiering, the worst case is when all but the newest run of the last level is logically invalid, which approaches $\frac{T-1}{T}$ (about 90% for $T = 10$).

As $T$ increases, a leveled design behaves more like a sorted array, while a tiered design behaves more like a log.

Hybrid layouts apply tiering to only some of the levels (lazy leveling, for example, keeps every level but the last tiered). This can lower the number of write stalls and increase block cache hits (because merging data requires invalidating its cache entries). These hybrid layouts are a great way to navigate the space between leveling and tiering.

In full-level compaction (which the classical leveling design uses), we merge the entire level at a time. This is nice and simple but can lead to cascading compactions, which can create massive write stalls. We want to avoid worst-case scenarios like this.

In partial compaction, we logically divide runs into files. When a compaction is triggered (which is its own design space), we select some number of files to merge with the overlapping files in the next level. This raises the number of compactions, since flushing the buffer will almost always trigger one, but it also makes performance far more predictable.

Partial compaction opens a whole new space to explore. How do we decide which files to compact? Some policies are round-robin, minimum overlap with parent, file with most tombstones, and coldest file. Round-robin performs quite well, even though it sounds like the most naive solution.
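
As an illustration of one of these policies, here is a hypothetical minimum-overlap picker (Python; files are modeled as just a key range plus a size):

    # Pick the file in level i whose key range overlaps the least data in level i+1.
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    def pick_min_overlap(level_files, next_level_files):
        def overlap_cost(f):
            return sum(g[2] for g in next_level_files if overlaps(f, g))
        return min(level_files, key=overlap_cost)

    level_1 = [("a", "f", 64), ("g", "m", 64), ("n", "z", 64)]
    level_2 = [("a", "c", 64), ("d", "e", 64), ("f", "k", 64), ("l", "q", 64), ("r", "z", 64)]
    print(pick_min_overlap(level_1, level_2))   # ('g', 'm', 64): only 2 overlapping files below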

Some compaction triggers are level saturation, number of sorted runs, space amplification, and age of a file.

There are other variables that can be changed. We can also have a variable number of runs per level, instead of setting them all to $T$. We can even make the size ratio variable!

Some systems use background compactions to offset write stalls. Another strategy is assigning priorities to compactions, so that they only happen when the write pressure is low enough. Instead of compacting right away, they add additional runs to the first level (which temporarily puts the tree out of shape).

Deletes

Deletes create a hidden cost in LSM-trees. They lead to space amplification, since the tombstones lie around for a while. They also add write amplification, since these tombstones need to be read and rewritten in each compaction until they reach the last level and can finally be dropped.

The delete persistence latency is unbounded, depending on the file picking policy, tree shape, and ingestion rate.

FADE triggers compactions to keep the delete persistence latency within a threshold. Once the age of a tombstone (tracked per file) exceeds the threshold, the file holding it is prioritized for compaction so the delete gets persisted in time.
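
A hypothetical sketch of that trigger’s shape (Python; this shows the idea of aging files that contain tombstones, not the actual FADE algorithm):

    import time

    DELETE_PERSISTENCE_THRESHOLD = 24 * 3600   # e.g. deletes must persist within a day

    def files_needing_compaction(files, now=None):
        """Each file tracks the insertion time of its oldest tombstone (or None)."""
        now = now or time.time()
        return [
            f for f in files
            if f["oldest_tombstone_ts"] is not None
            and now - f["oldest_tombstone_ts"] > DELETE_PERSISTENCE_THRESHOLD
        ]

    files = [
        {"name": "L1-07.sst", "oldest_tombstone_ts": time.time() - 2 * 24 * 3600},
        {"name": "L2-13.sst", "oldest_tombstone_ts": None},
    ]
    print([f["name"] for f in files_needing_compaction(files)])   # ['L1-07.sst']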

Now what if we want to support queries like “delete all data older than…”? This can’t be done efficiently in the standard setup if the sort key isn’t the timestamp. Many companies have requirements for these kinds of deletes every day, which would otherwise require scanning the entire database…

KiWi (key weaving storage layout) is meant to help with this problem. A new hierarchy is added, where multiple pages make up a delete tile. Within a file, delete tiles are partitioned on the sort key. Within a delete tile, pages are partitioned on the delete key. Within a page, entries are sorted on the sort key as usual.
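
A rough sketch of that nesting (Python; names and page sizes are made up), showing why a range delete on the delete key can drop whole pages without touching the rest of the file:

    # Entries are (sort_key, delete_key, value).
    def build_delete_tile(entries, page_size=2):
        # Within a tile, partition entries into pages by delete key...
        by_delete_key = sorted(entries, key=lambda e: e[1])
        pages = [by_delete_key[i:i + page_size] for i in range(0, len(by_delete_key), page_size)]
        # ...and within a page, keep entries sorted on the sort key as usual.
        return [sorted(page, key=lambda e: e[0]) for page in pages]

    def range_delete(tile, cutoff):
        """Drop every page whose delete keys all fall below the cutoff; pages that
        straddle the cutoff still need per-entry tombstones."""
        return [page for page in tile if not all(e[1] < cutoff for e in page)]

    entries = [("k1", 100, "a"), ("k2", 50, "b"), ("k3", 10, "c"), ("k4", 20, "d")]
    tile = build_delete_tile(entries)
    print(range_delete(tile, cutoff=60))   # the page holding delete keys 10 and 20 is dropped whole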