Data Systems 101

A data system is responsible for storing data and providing access to the data through efficient data movement. To enable efficient data movement, the system must be hardware aware.

Data movement accounts for the majority of workload execution cost.

Storage hierarchy, fastest to slowest (typical capacity, approximate access latency):

CPU
registers (8-64 bits, 1 ns)
on-chip caches (32-512 KB, 4 ns)
on-board cache (1-8 MB, ~10 ns)
DRAM / memory (1-512 GB, ~100 ns)
flash storage (SSD/NVM) (1-16 TB, ~16000 ns)
magnetic storage (HDD/tape) (1-64 TB, ~2 ms for HDD, ~1 s for tape)

Memory and the levels above it are volatile: their contents are lost when power is lost. Also, be cautious about memory fetches, and extremely cautious about fetching from flash storage and below.

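Different devices have different access granularity. Slower storage, lower in the hierarchy, generally has coarser granularity: caches and memory are accessed in bytes or cache lines, while flash and disk are accessed in whole pages or blocks.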

Basically, avoid disk accesses whenever possible, since they become the bottleneck for every other operation.

Data on disk is stored in files, which are collections of pages (the minimum granularity). Within a page, there are data entries.
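To make the page granularity concrete, here is a minimal sketch of how a fixed-size entry maps to a page. The page size, entry size, and the helper names `locate` and `pages_touched` are assumptions chosen for illustration, not values from the notes.

```python
PAGE_SIZE = 4096    # bytes per page (assumed; a common choice)
ENTRY_SIZE = 64     # bytes per fixed-size data entry (assumed)
ENTRIES_PER_PAGE = PAGE_SIZE // ENTRY_SIZE


def locate(entry_id):
    """Return (page number, byte offset within that page) for an entry."""
    page_no, slot = divmod(entry_id, ENTRIES_PER_PAGE)
    return page_no, slot * ENTRY_SIZE


def pages_touched(first_id, last_id):
    """Number of pages read to fetch entries first_id..last_id (inclusive)."""
    return locate(last_id)[0] - locate(first_id)[0] + 1
```

Under these assumptions, `pages_touched(5, 5)` is 1: fetching a single 64-byte entry still costs a full 4096-byte page read, which is why the number of pages touched, rather than the number of entries, is the cost to minimize.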

A zone map stores the minimum and maximum values for each page, so a query can skip any page whose range cannot contain a matching value. This is a very lightweight structure to implement.
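A zone map can be sketched in a few lines. The example below models pages as in-memory lists of integers and uses the hypothetical helpers `build_zone_map` and `range_scan`; a real system would keep the zone map in memory and read pages from disk.

```python
def build_zone_map(pages):
    """One (min, max) pair per page."""
    return [(min(page), max(page)) for page in pages]


def range_scan(pages, zone_map, lo, hi):
    """Return values in [lo, hi] and the number of pages actually read."""
    matches, pages_read = [], 0
    for page, (page_min, page_max) in zip(pages, zone_map):
        # Skip the page if its value range cannot overlap [lo, hi].
        if page_max < lo or page_min > hi:
            continue
        pages_read += 1
        matches.extend(v for v in page if lo <= v <= hi)
    return matches, pages_read


# Four small pages; the last one holds widely spread values.
pages = [[1, 3, 7], [10, 12, 19], [20, 25, 31], [8, 40, 2]]
zone_map = build_zone_map(pages)
result, pages_read = range_scan(pages, zone_map, 11, 22)
```

Here the first page is skipped because its maximum (7) is below the query range. The last page must still be read even though it contains no matches, because its range (2 to 40) overlaps the query. Zone maps are most effective when the data is sorted or clustered so that each page covers a narrow range.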

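RUM conjecture: of the three overheads, read (R), update (U), and memory (M), a design can optimize for at most two at a time, and doing so comes at the expense of the third. An important thing to note when reasoning about read cost is the difference between range queries and point queries.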

The ultimate problem in system design is that there are hundreds of parameters that must be tuned as application requirements change. Self-adapting data systems are key!

Binyamin's Notes
