Indexing for Near-Sorted Data Review

What is the motivation?

The authors are motivated by the idea of sortedness as a resource. The idea is that since creating an index means structuring data, already partially structured data ought to take less effort to index. This is not just a theoretical idea about constructing ideal data structures. Many real-life applications involve partially data sorted to various degrees, like a previously sorted relation with few updates, data sorted with a correlated attribute, and timestamps in an incoming data stream with delayed packets.

How and why have past approaches fallen short?

Past approaches simply do not take advantage of partially sorted data. They expend the same amount of effort to ingest data no matter how it’s structured. Sortedness-aware structures have not been a subject of past research, so data structures like B $^{+}$ -tree and B $^{ϵ}$ -tree do not take advantage of partial sortedness.

What is the key contribution?

The SWARE paradigm, which uses buffering with partial bulk loading to better take advantage of incoming sortedness. Their implementation uses many additional techniques like query-driven sorting and merging, zone maps, and hierarchical bloom filters to keep performance competitive. SWARE dramatically better ingestion performance on even only slightly sorted data, while only hurting point query performance by a small amount.

In many ways, this data structure seems inspired by LSM-trees, even though the purpose is quite different (which the authors explicitly address).

What is missing?

I thought the paper did a great job of motivating its decisions and supporting each claim. I don’t have any real critiques, especially considering how unique of a topic the paper explores. My only thought is that it was a bit hard to keep track of the main points because of how many different techniques were explored in SWARE.

How does the paper support its claims?

The paper does an excellent job at providing empirical evidence for each of its design decisions. They use the Benchmark on Data Sortedness for testing the indexes on different levels of sorted data. They measure the speedup on each operation across varying workloads, amounts of sortedness (K% and L%), buffer size, and data size. They compare their modified B $^{+}$ -tree and B $^{ϵ}$ -tree with the originals in terms of lookup and write performance, scalability, responsiveness to sortedness, and on-disk performance. In addition, they run experiments to justify the usage of global and per-page bloom filters, query-based sorting (inspired by cracking), and their choice of the buffer flush threshold and buffer size.

What are some next steps?

I’d like to see the idea of “sortedness as a resource” explored more in relation to other data structures. I thought it was interesting to hear that B $^{ϵ}$ -tree already exploits sortedness to some degree, so I’m curious where other data structures fall in this design space.

While SWARE gets great performance, I can’t help but wonder if there’s a way to exploit sortedness without degrading point query performance whatsoever.

Binyamin's Notes

Explorer