CliffGuard: A Principled Framework for Finding Robust Database Designs
What is the motivation?
LSM-trees have become a key component in large-scale storage systems. One of their most attractive characteristics is tunable performance through many different parameters. Unfortunately, it is not always clear how to properly configure an LSM-engine, and the task is often done manually. This is an important area of research, because tuning an LSM-tree for a given workload can make a big difference in performance. Many works have explored how to optimize different components of the LSM-tree for a workload, such as memory allocation, the compaction scheme, and the size ratio.
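To make the read/write tension behind these knobs concrete, here is a toy cost sketch. This is not any paper's actual model; the formulas are a heavily simplified take on the standard leveled-LSM analysis, and all constants are made up for illustration:

```python
import math

def lsm_costs(size_ratio, n_entries=5e7, buffer_entries=1e6):
    """Toy leveled-LSM cost model: a larger size ratio means fewer levels
    (cheaper lookups) but more merging work per entry (costlier writes)."""
    levels = max(1, math.ceil(math.log(n_entries / buffer_entries, size_ratio)))
    read_cost = levels                # ~one probe per level (no filters modeled)
    write_cost = levels * size_ratio  # each entry rewritten ~size_ratio times per level
    return read_cost, write_cost

def workload_cost(size_ratio, read_frac):
    """Expected per-operation cost for a given read/write mix."""
    r, w = lsm_costs(size_ratio)
    return read_frac * r + (1 - read_frac) * w

# Even in this toy model, a read-heavy workload favors a different
# size ratio than a write-heavy one, which is why tuning matters.
best_for_reads = min(range(2, 21), key=lambda t: workload_cost(t, 0.95))
best_for_writes = min(range(2, 21), key=lambda t: workload_cost(t, 0.05))
```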
How and why have past approaches fallen short?
Past approaches have focused on optimizing for a single, known workload. In practice, there is always some level of unpredictability in a database's workload, and application needs shift over time. As a result, most tuning algorithms overfit the LSM-tree to the expected workload, producing brittle configurations.
What is the key contribution?
This work does not contribute any new cost models for analyzing LSM-trees, nor does it break down their components in a novel way. Instead, the authors improve the general method of LSM-tree optimization by developing an algorithm that produces more robust tunings.
The idea is simple: instead of optimizing for a single workload, optimize for the worst case over a neighborhood of workloads. Specifically, they represent workloads as vectors of operation frequencies and use KL-divergence to measure how far one workload is from another. This exposes a new parameter that defines the size of the uncertainty region around the expected workload. LSM-trees configured this way ought to be more robust, since moderate shifts in the workload cannot cause dramatically worse performance. The idea is closely analogous to guarding against overfitting in ML.
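The minimax idea can be sketched in a few lines. This is not the paper's actual formulation; the per-configuration costs are hypothetical, and the uncertainty region is approximated by a simple grid over read/write mixes:

```python
import math

def kl(p, q):
    """KL-divergence between two workload distributions (here, read/write mixes)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical (read cost, write cost) for three candidate configurations:
# A is read-optimized, B is balanced, C is write-optimized.
CONFIGS = {"A": (1.0, 9.0), "B": (3.0, 3.0), "C": (8.0, 1.2)}

def expected_cost(cfg, workload):
    r, w = CONFIGS[cfg]
    read_frac, write_frac = workload
    return read_frac * r + write_frac * w

def neighborhood(expected, rho, step=0.01):
    """All grid workloads within KL-divergence rho of the expected workload."""
    for f in (i * step for i in range(1, 100)):
        w = (f, 1 - f)
        if kl(w, expected) <= rho:
            yield w

def robust_tune(expected, rho):
    """Pick the config minimizing worst-case cost over the uncertainty region."""
    return min(CONFIGS,
               key=lambda c: max(expected_cost(c, w)
                                 for w in neighborhood(expected, rho)))

# Nominal tuning picks the read-optimized config for a 90%-read workload;
# robust tuning hedges against the workload drifting toward writes.
nominal = min(CONFIGS, key=lambda c: expected_cost(c, (0.9, 0.1)))
robust = robust_tune((0.9, 0.1), rho=0.5)
```

Note how the size of the uncertainty region (`rho` here) drives the outcome: with `rho = 0`, robust tuning degenerates to nominal tuning.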
What is missing?
The authors could have made the impact of the uncertainty parameter much clearer. They title one short section “Robust Outperforms Nominal for Properly Selected [the uncertainty parameter]” but do not actually elaborate on how to properly select it. Some rules of thumb or a summary of their observations would be helpful here.
How does the paper support its claims?
The paper does a great job of supporting its claims. The authors implement their algorithm in RocksDB and test it extensively across different workloads. As expected, robust tuning is not always better; however, they report that it wins in over 80% of their comparisons.
What are some next steps?
I’d be interested to see how this can be applied to self-adjusting LSM-trees. Some parameters can be changed on the fly relatively easily, so this framework could be helpful there. Perhaps more complex decisions (like the compaction scheme) could also be integrated into the model. Additionally, it could be interesting to explore tuning the uncertainty parameter itself according to application needs.