NoSQL

NoSQL means Not only SQL

Unstructured documents or files
Schema-less
Data is stored in an application-friendly way
Possible duplication

There are a few classifications of NoSQL systems,

Key-value stores
Document stores
Columns stores
Graph stores

Key-value pairs rely heavily on the Log-Structured Merge-Tree (LSM-tree)
The LSM-tree aims to improve inserting over the $B^{+}$ tree, which requires $3 + 1$ I/Os to insert a single entry

LSM-tree uses very fast insertion and doesn’t typically delete entries; instead newer insertions are always read before older insertions, so updated or blank versions of data are inserted to update or delete an entry
Reads are also competitive, since the LSM-tree keeps track of lots of data in order to increase speed
Additionally, LSM-trees are tunable and can tradeoff read and write performance depending on application needs
Finally, LSM-trees offer great space utilization compared to $B^{+}$ trees which are 66% full on average

The size ratio $T$ defines how much each lower level increases

Principles,

Buffering ingestion (inserts are buffered and done in batches)
Immutable files on storage (data is never changed once it’s moved to disk)
Out-of-place updates & deletes
Periodic data layout reorganization

Once the buffer is filled, it’s sorted and flushed into the next level (handling updates/deletes). Reads on entries still in buffer can be fetched without I/Os to disk

When leveling (eager merging), we merge right away when able,

Good read performance
Good space amplification
But high write amplification (37GB to 1GB potentially?)

When tiering (lazy merging), we don’t merge right away until the level has reached it’s capacity,

Poor read performance
Poor space amplification
Good ingestion performance

There are many hybrid designs that mix tiering and leveling

Without a filter or index, searching for something in the LSM tree is quite slow! Binary search on each level requires random accesses

Fence pointers are stored in memory and locate the minimum and maximum values in each page. This means there will only be potentially $L$ I/Os

Bloom filters tells you if an entry is in a page with high certainty (with some false positives, around 1%)

Binyamin's Notes

Explorer