Late Materialization

Column-Stores vs. Row-Stores: How Different Are They Really?

What are the paper’s main contributions? What is “late materialization”?

The paper explores three ideas.

  1. Current row-store databases cannot effectively emulate column-store techniques. Whether this is a fundamental difference or just a matter of implementation is not so clear. In my personal opinion, trying to “coerce” an already optimized system to fit a column-store’s execution plan cannot be a fair comparison. The system simply was not designed with column-store methods in mind. The authors do acknowledge this to an extent.
  2. A new technique called invisible joins substantially improves join execution, to the point that denormalization offers little additional benefit. The idea is to rewrite joins on the fact table's foreign keys into predicates, which enables late materialization and minimizes out-of-order reads (see the sketch after this list).
  3. Column-store designs enable some extremely effective optimization techniques. These include late materialization, compression, block iteration, and invisible joins (along with between-predicate-rewriting).

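To make the predicate-rewrite idea in point 2 concrete, here is a minimal Python sketch of an invisible-join-style plan over a toy star schema. The column names and data are made up for illustration; this is my own sketch, not the paper's implementation.

```python
# Toy sketch of the invisible-join idea: a join between a fact table and a
# dimension table is rewritten as a predicate on the fact table's
# foreign-key column, so we only ever scan columns.

# Hypothetical star-schema columns.
dim_id     = [10, 11, 12, 13]            # dimension primary keys
dim_region = ["EU", "US", "EU", "ASIA"]
fact_dimfk = [11, 10, 13, 10, 12, 11]    # fact foreign keys into dim_id
fact_sales = [5,  7,  2,  9,  4,  6]

# Phase 1: apply the predicate (region == "EU") to the dimension table and
# build a hash set of qualifying keys.
eu_keys = {k for k, r in zip(dim_id, dim_region) if r == "EU"}

# Phase 2: treat the join as a predicate on the fact table's foreign-key
# column, producing a position list instead of materialized tuples.
positions = [i for i, fk in enumerate(fact_dimfk) if fk in eu_keys]

# Phase 3: only now fetch the needed values at those positions (late
# materialization), in position order, which keeps reads sequential.
result = sum(fact_sales[i] for i in positions)
print(positions, result)   # [1, 3, 4] 20
```
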
The efficacy of late materialization is one of the most important takeaways from this paper. Processing columns is fundamentally more efficient in many ways. For example, the potential for compression is much higher, and block-wise operations can be optimized much better on modern CPUs. Late materialization involves delaying tuple reconstruction as much as possible, in order to maximize the benefits of operating on columns.
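
As a rough illustration of the difference, the following Python sketch (my own toy example, not from the paper) contrasts early materialization with a late-materialization plan that evaluates each predicate on its own column, works on position lists, and only stitches tuples together at the end.

```python
# Minimal sketch contrasting early vs. late materialization when two
# predicates on different columns are combined.

price    = [3, 15, 8, 22, 7, 19]
quantity = [9,  2, 5,  1, 4,  3]
product  = ["a", "b", "c", "d", "e", "f"]

# Early materialization: stitch full tuples first, then filter them.
tuples = list(zip(price, quantity, product))
early  = [t for t in tuples if t[0] > 5 and t[1] > 2]

# Late materialization: evaluate each predicate on its own column, producing
# position lists; intersect the positions, and only then fetch other columns.
pos_price = {i for i, p in enumerate(price) if p > 5}
pos_qty   = {i for i, q in enumerate(quantity) if q > 2}
positions = sorted(pos_price & pos_qty)
late      = [(price[i], quantity[i], product[i]) for i in positions]

assert early == late   # same answer, but the column-at-a-time plan runs
                       # tight per-column loops over (possibly compressed) data
```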

Monkey

Monkey: Optimal Navigable Key-Value Store

When does a tiered LSM-tree behave similarly to a leveled LSM-tree? What is the key contribution of the paper?

Navigability is the key to the LSM-tree. By adjusting the merge policy (tiering vs. leveling) and the size ratio, an LSM-tree can be tuned from a read-optimized sorted array to a write-optimized log. Specifically, a leveled tree provides more efficient reads at the cost of higher update cost, while a tiered tree provides faster writes at the cost of higher read latency. When the size ratio T = 2, a tiered policy behaves the same as a leveled one by definition (a level is merged as soon as it holds T = 2 runs, which is exactly when leveling would merge). As the ratio increases, the two policies diverge further.

The key contribution of Monkey is a cost model and algorithm for navigating the parameter space of LSM-trees, given knowledge of the workload and storage medium. The most novel part is how memory is allocated across the Bloom filters. The baseline assigns the same bits-per-entry to every run, which amounts to allocating filter memory in proportion to the number of items in each run. Monkey instead minimizes the sum of false positive rates: since a large run needs disproportionately more memory to reach the same false positive rate as a small one, it is better to spend the bits on the smaller runs and tolerate higher false positive rates at the largest levels. Depending on the storage medium's latency, they also recommend devoting more or less of the overall memory budget to the filters. The other key axis is the read/write tradeoff, which is tuned with the merge policy and size ratio. They propose a quick divide-and-conquer algorithm for setting these parameters according to the expected workload, driven by a cost model they provide for each query type.

The authors run many tests to show that these simple improvements allow for much more effective tuning of LSM-trees. In particular, the proposed filter scheme seems to be an impressive idea with no tradeoffs.
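
To see why the filter allocation works, here is a back-of-the-envelope Python sketch of the underlying optimization, using the standard Bloom filter approximation. The run sizes and memory budget are made up, and Monkey's actual derivation also handles cases this ignores (e.g., capping false positive rates at 1 and tiering), so treat it as an illustration only.

```python
import math

LN2SQ = math.log(2) ** 2

def total_fpr_uniform(run_sizes, total_bits):
    """Baseline: the same bits-per-entry for every run."""
    bpe = total_bits / sum(run_sizes)
    return len(run_sizes) * math.exp(-bpe * LN2SQ)

def total_fpr_monkey(run_sizes, total_bits):
    """Monkey-style: choose per-run FPRs p_i minimizing sum(p_i) under the
    memory constraint sum(-n_i * ln(p_i)) / ln(2)^2 = total_bits.
    The Lagrangian condition gives p_i proportional to n_i."""
    n_total = sum(run_sizes)
    ln_c = -(total_bits * LN2SQ + sum(n * math.log(n) for n in run_sizes)) / n_total
    fprs = [min(1.0, math.exp(ln_c) * n) for n in run_sizes]
    return sum(fprs)

# Hypothetical leveled tree with size ratio 10 and 10 bits/entry overall.
runs = [10_000, 100_000, 1_000_000, 10_000_000]
bits = 10 * sum(runs)
print(total_fpr_uniform(runs, bits))  # ~0.033
print(total_fpr_monkey(runs, bits))   # ~0.012, a smaller total FPR for the same memory
```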

LSM Memory Buffer

Anatomy of the LSM Memory Buffer

How does the memory allocation strategy affect the query latency for a pre-allocated vector vs. a dynamically allocated one?

Dynamically allocating vectors might optimize memory usage better; however, it requires reallocating memory and copying data to keep the vector contiguous. This leads to higher variability in write-query latency. Pre-allocating the vector naturally reduces that variability.
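
A toy Python sketch of the effect (my own, not the paper's benchmark): per-insert latency for a dynamically grown list vs. a pre-sized one. The occasional reallocation-and-copy on growth shows up as latency spikes, i.e., higher variance; the effect is far more pronounced for a real memory buffer that must copy large contiguous regions.

```python
import time

N = 1_000_000

def worst_insert_latency(preallocate: bool) -> float:
    buf = [None] * N if preallocate else []
    worst = 0.0
    for i in range(N):
        t0 = time.perf_counter()
        if preallocate:
            buf[i] = i             # write into already-reserved slot
        else:
            buf.append(i)          # may trigger reallocation + copy
        worst = max(worst, time.perf_counter() - t0)
    return worst

print("dynamic   worst insert:", worst_insert_latency(False))
print("pre-sized worst insert:", worst_insert_latency(True))
```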

Given a workload, how would you decide the prefix length for a hash-hybrid buffer implementation?

A larger prefix length can improve point queries, since the hash table narrows the search space more. However, it can use much more memory, since the buckets will be underutilized if there are too many of them. The paper does not give an exact heuristic, but demonstrates that the choice is highly dependent on the workload and environment.
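
For concreteness, here is a rough Python sketch of my own (not the paper's implementation) of a hash-hybrid buffer: keys are hashed on a fixed-length prefix into buckets, and each bucket keeps its entries sorted.

```python
from bisect import insort, bisect_left
from collections import defaultdict

class PrefixHashBuffer:
    def __init__(self, prefix_len: int):
        self.prefix_len = prefix_len
        self.buckets = defaultdict(list)   # prefix -> sorted list of (key, value)

    def put(self, key: str, value):
        insort(self.buckets[key[:self.prefix_len]], (key, value))

    def get(self, key: str):
        bucket = self.buckets.get(key[:self.prefix_len], [])
        i = bisect_left(bucket, (key,))
        if i < len(bucket) and bucket[i][0] == key:
            return bucket[i][1]
        return None

# A longer prefix_len means more, smaller buckets: point lookups touch fewer
# entries, but many buckets may stay nearly empty, wasting memory on skewed
# or narrow key distributions.
buf = PrefixHashBuffer(prefix_len=2)
buf.put("user:42", 1); buf.put("user:43", 2); buf.put("item:7", 3)
print(buf.get("user:43"))   # 2
```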

Memory Pressure

LSM-Trees Under (Memory) Pressure

SHaMBa achieves better lookup performance when large Bloom filters do not fit in memory. Can we do something similar if we have large index blocks? Give a concrete example to explain why it could work or not.

My takeaway is that SHaMBa implements two core concepts.

  1. Modular filters, which are effectively more space efficient (even though they take the same total space) because individual modules can be cached more effectively
  2. A utility measure for filter modules, to help with a more optimized replacement policy

I see two analogous techniques that could be implemented on index blocks.

  1. Tiered fence pointers, which store min and max information for an entire page of fence pointers (a sketch follows this list). This would require some thought about how to physically organize the index blocks so that fence pointers for adjacent pages of a sorted run are stored together. Under more extreme memory pressure, keeping these higher-level fence pointers in memory as high-priority pages could save fetches to the stored index blocks.
  2. Under a more skewed access pattern, a utility measure could also be applied to index-block pages. If a page needs to be swapped out of memory, we'd rather evict an index-block page that covers cold data.
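
A small Python sketch of the tiered fence pointer idea from point 1 (my own illustration, with a made-up page size):

```python
from bisect import bisect_right

FENCES_PER_PAGE = 4   # hypothetical: fence pointers stored 4 per index page

def build_tiers(fence_min_keys):
    """fence_min_keys[i] = smallest key on data page i (the usual fence pointers)."""
    pages = [fence_min_keys[i:i + FENCES_PER_PAGE]
             for i in range(0, len(fence_min_keys), FENCES_PER_PAGE)]
    top = [p[0] for p in pages]       # resident tier: one key per fence page
    return top, pages                 # `pages` would normally live on disk

def lookup(key, top, pages):
    fp_page = max(bisect_right(top, key) - 1, 0)   # which fence page to fetch
    fences = pages[fp_page]                        # simulated disk fetch
    data_page = max(bisect_right(fences, key) - 1, 0)
    return fp_page, fp_page * FENCES_PER_PAGE + data_page

top, pages = build_tiers([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
print(lookup(55, top, pages))   # fence page 1, data page 4 (keys 50..59)
```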

Lethe

Lethe: A Tunable Delete-Aware LSM Engine

To ensure bounded delete persistence latency, Lethe aggressively performs compactions. How do these eager compactions affect the write amplification of the overall system?

These eager compactions increase the write amplification of the overall system, especially at the start of the experiments. Over time, however, deleting items sooner results in smaller compactions, so this cost gets amortized. In addition, the aggressive compactions decrease space and read amplification.

ACE

ACEing the Bufferpool Management Paradigm for Modern Storage Devices

Consider you have three SSDs with the following asymmetry values: (i) , (ii) , (iii) . Your workload contains 50% writes with a predictive pattern. Which device will have the highest speedup after implementing ACE bufferpool and why? Also, will you use any prefetcher? Justify your answer.

The device with the highest asymmetry will see the highest speedup, since it benefits the most from ACE's concurrent write-backs. If the workload has a predictive pattern, then prefetching should be effective. Even if there are no sequential access patterns, the history-based prefetcher should be able to capture workload patterns and exploit read concurrency.
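
A back-of-the-envelope illustration with made-up numbers (the question's actual asymmetry values are not shown above), just to show why the absolute stall saved by concurrent write-backs grows with asymmetry:

```python
read_us = 100          # hypothetical read latency in microseconds
concurrency = 8        # hypothetical number of dirty pages written back together

for alpha in (1, 4, 16):                     # asymmetry = write / read latency
    write_us = alpha * read_us
    serial_evict = concurrency * write_us    # one write-back at a time
    ace_evict = write_us                     # the writes overlap on the device
    print(f"asymmetry {alpha:2d}: eviction stall {serial_evict} -> {ace_evict} us")
```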

FASTER

FASTER: A Concurrent Key-Value Store with In-Place Updates

How does FASTER manage data between the “hot” in-memory section and the “cold” disk-resident section? What criteria determine when data is moved between these sections?

FASTER’s HybridLog uses a mutable in-memory section, a read-only in-memory section, and a stable disk section. The scheme uses a circular buffer that essentially behaves like a FIFO. New records enter at the tail of the mutable section, and records fetched from disk for an update are copied there as well. Data in the mutable section is updated in place, which is very efficient. As the tail advances with new records, the older records logically shift from the mutable section into the read-only section. Data in the read-only section is updated with an RCU (Read-Copy-Update) strategy: an updated copy is appended to the mutable section and the hash index is pointed at it. It’s important to understand that FASTER is log-structured, so it can tolerate multiple versions of a key. The read-only section effectively acts as a second chance for records to stay in memory (while also allowing for correctness guarantees); otherwise, a record is flushed to disk.

In practice, FASTER also uses the concept of a fuzzy in-between region in order to deal with concurrency in a safe manner.
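
A simplified Python sketch of the region logic described above (my own illustration; it ignores the fuzzy region, epoch protection, and asynchronous disk reads):

```python
from dataclasses import dataclass

@dataclass
class HybridLog:
    head_offset: int       # addresses below this are on disk
    read_only_offset: int  # addresses in [head, read_only) are immutable in memory
    tail_offset: int       # new records are appended here (mutable region)

    def update(self, record_addr: int, apply_update) -> int:
        if record_addr >= self.read_only_offset:
            # Mutable region: update in place, address unchanged.
            apply_update(record_addr)
            return record_addr
        # Read-only region or on disk: copy the latest version to the tail
        # (RCU); a disk-resident record would first require an async read.
        new_addr = self.tail_offset
        self.tail_offset += 1
        apply_update(new_addr)
        return new_addr       # the hash index now points at the new copy

log = HybridLog(head_offset=100, read_only_offset=180, tail_offset=200)
print(log.update(190, lambda a: None))   # 190: in-place (mutable region)
print(log.update(150, lambda a: None))   # 200: copied to tail (read-only region)
```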

The Data Calculator

The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models

What factors should the Data Calculator consider when synthesizing an optimal data structure for this workload? How would the framework decide between a B-tree and a log-structured merge (LSM) tree?

For any given workload, the Data Calculator considers the proportions of operations in the workload, the data distribution, and the learned hardware coefficients in order to search a space of data structures. In the absence of a specialized search algorithm (I don’t think the paper described anything like Cosine), a user would need to supply a smaller domain of data structures for the Data Calculator to compare.

The framework would likely decide between a B-tree and a log-structured merge tree by looking at the proportion of reads and writes in the workload. A B-tree is optimized for point queries, while an LSM-tree is optimized for ingestion.
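
As a toy illustration of that decision (my own rough cost models with made-up constants, not the Data Calculator's learned models), one can compare approximate per-operation I/O costs for the two structures across read/write mixes:

```python
import math

def btree_cost(read_frac, N, B=256):
    point_read = math.log(N, B)          # root-to-leaf traversal
    write = math.log(N, B) + 1           # traverse, then update the leaf in place
    return read_frac * point_read + (1 - read_frac) * write

def lsm_cost(read_frac, N, B=256, T=10):
    levels = math.log(N / B, T)
    point_read = levels                  # probe one run per level (no filters)
    write = T * levels / B               # amortized merge cost per insert
    return read_frac * point_read + (1 - read_frac) * write

for read_frac in (0.95, 0.5, 0.05):
    b, l = btree_cost(read_frac, N=1e9), lsm_cost(read_frac, N=1e9)
    print(f"{read_frac:.2f} reads: B-tree {b:.2f} vs LSM {l:.2f}")
```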