Adaptive Data Layouts

Data placement affects the performance of a system

In row store, we store all fields of an entry together. This means that reading one column from the schema results in high read amplification. Column store improves on these kinds of queries. But point queries especially suffer in this setup.

HTAP stands for hybrid transactional and analytical processing, and is a setup that aims to combine row and column store. One way of implementing this is literally to implement both, and to transfer from row store to column store when there is less system activity.

If you know a specific set of your database does not need analytic queries then you’d keep it in row store.

Example Question: Say we have a schema $R (A, B, C, D)$ , what layout would optimize the query “select max(B) from R where A>5 and C<10”?

To minimize page reads (at the cost of more complexity), store A and C row-wise. We select from those entire columns to find the indexes of entries that satisfy A>5 and C<10. We use those indexes to read in the necessary pages from B, and compute the maximum.

One of the most interesting benefits of column-stores is compression. When you store things by column, you have the ability to use some powerful compression techniques. Dictionary compression along with bit-packing can save massive amounts of space. Even when there is no fixed number of unique entries, delta compression can still capture the most important variation in data. This can either mean storing deltas from the first entry, or between every single entry. Another important technique is run-length encoding.

The cool thing is that we can potentially operate on compressed data directly.

Binyamin's Notes

Explorer