Sorting

Databases can be big and need not fit in memory
Data also needs to be stored in persistent, non-volatile storage
Therefore, the DBMS must deal with moving data from non-volatile storage to volatile storage

Sorting is super important but is also quite slow
It is a classic problem in computer science but also a database-specific problem, since many relational operations require sorting

ORDER BY
DISTINCT
GROUP BY
Bulk loading
Sort-merge join

If all the data fits in memory, sorting is easy: any standard in-memory algorithm works. But say that's not the case. If one wanted to sort 1 GB of data with only 1 MB of memory, sorting becomes quite tricky.

In general, the goal is to minimize disk access when under memory constraints
Streaming data through RAM

Read page from disk to input buffer
Operate on data; write to output buffer
If output buffer is full, write to disk
If input buffer is consumed, read another page
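
The four steps above can be sketched as a small Python generator. This is only an illustration, not how any particular DBMS implements it: a "page" is modeled as a list of records, "disk" as an iterable of pages, and the names `stream`, `op`, and `page_size` are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List

Page = List[int]  # assumption: a page is modeled as a small list of records


def stream(pages: Iterable[Page],
           op: Callable[[int], Iterable[int]],
           page_size: int) -> Iterator[Page]:
    """Stream pages through one input buffer and one output buffer."""
    output_buffer: Page = []
    for input_buffer in pages:            # read the next page into the input buffer
        for record in input_buffer:       # consume the input buffer
            for result in op(record):     # operate on the record
                output_buffer.append(result)
                if len(output_buffer) == page_size:
                    yield output_buffer   # output buffer is full: "write to disk"
                    output_buffer = []
    if output_buffer:                     # flush the final partial page
        yield output_buffer


disk = [[1, 2], [3, 4], [5]]
# An operation that writes more than it reads (each record is emitted twice).
doubled = list(stream(disk, lambda r: [r, r], page_size=2))
# An operation that writes less than it reads (keeps only odd records).
odds = list(stream(disk, lambda r: [r] if r % 2 else [], page_size=2))
```

Only two pages are held in memory at any moment, no matter how many pages the input contains.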

Some operations write more or less than they read. A cross product, for example, generates far more data than it reads, and an update might read only a little data yet modify many entries. Compression writes less than it reads.

External merge sort is a divide-and-conquer algorithm that splits data into runs

In the sort phase, each run is sorted individually in memory and written back to disk
In the merge phase, sorted runs are repeatedly merged into longer runs until a single sorted run remains

In 2-way external merge sort, each merge pass combines pairs of sorted runs, so the runs double in size each pass and the number of runs halves. This requires only 3 buffer pages: 2 input buffers and 1 output buffer.
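
A minimal in-memory simulation of 2-way external merge sort might look like the following. It assumes each page is a Python list and that every initial run is a single page. The function names `merge_two` and `two_way_external_sort` are made up for this sketch, and the real disk reads and writes are left out.

```python
from typing import List, Tuple


def merge_two(left: List[int], right: List[int]) -> List[int]:
    """Merge two sorted runs (conceptually: 2 input buffers, 1 output buffer)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])   # at most one of these two is non-empty
    out.extend(right[j:])
    return out


def two_way_external_sort(pages: List[List[int]]) -> Tuple[List[int], int]:
    """Return (sorted records, number of passes over the data)."""
    runs = [sorted(page) for page in pages]  # pass 0: sort each page on its own
    passes = 1
    while len(runs) > 1:                     # each merge pass halves the run count
        merged = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                merged.append(merge_two(runs[k], runs[k + 1]))
            else:
                merged.append(runs[k])       # odd run out is carried forward
        runs = merged
        passes += 1
    return (runs[0] if runs else []), passes


pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
result, passes = two_way_external_sort(pages)
```

With 7 pages the run count goes 7, 4, 2, 1, so there are three merge passes after the sort pass.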

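With $P$ the number of pages in the file, the total number of passes is $1 + \lceil \log_{2} P \rceil$: one pass for sorting and $\lceil \log_{2} P \rceil$ for merging. Each pass reads and writes every page once, so the total I/O cost is $2P \cdot (1 + \lceil \log_{2} P \rceil)$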
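
As a quick check of the formula, take a hypothetical file of $P = 8$ pages (the number is chosen only for illustration):

$$
1 + \lceil \log_{2} 8 \rceil = 1 + 3 = 4 \text{ passes}, \qquad 2 \cdot 8 \cdot 4 = 64 \text{ page I/Os}
$$

Pass 0 leaves 8 one-page runs, and the three merge passes leave 4, 2, and then 1 run.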

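In the generalized external merge sort, we use $B$ total buffers and produce $\lceil P / B \rceil$ sorted runs of $B$ pages each at pass 0. Each later pass merges $B - 1$ runs at a time ($B - 1$ input buffers and 1 output buffer). That means the total I/O cost is $2P \cdot (1 + \lceil \log_{B - 1} \lceil P / B \rceil \rceil)$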
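
The cost formula can be evaluated with a short helper. This is a sketch: the names `num_passes` and `io_cost` are hypothetical, and the values $P = 108$, $B = 5$ are assumed purely as an example. It counts merge passes with integer arithmetic rather than floating-point logarithms, which avoids rounding problems near exact powers.

```python
def num_passes(P: int, B: int) -> int:
    """Passes needed to sort P pages with B buffers: 1 + ceil(log_{B-1}(ceil(P/B)))."""
    if B < 3:
        raise ValueError("need at least 3 buffers (2 input, 1 output)")
    if P <= 0:
        return 0
    runs = -(-P // B)                 # pass 0 produces ceil(P / B) sorted runs
    passes = 1
    while runs > 1:                   # each merge pass combines B - 1 runs at a time
        runs = -(-runs // (B - 1))    # ceil(runs / (B - 1))
        passes += 1
    return passes


def io_cost(P: int, B: int) -> int:
    """Each pass reads and writes every page once: 2 * P I/Os per pass."""
    return 2 * P * num_passes(P, B)


example_passes = num_passes(108, 5)   # 22 runs -> 6 -> 2 -> 1
example_cost = io_cost(108, 5)
```

For this example, pass 0 produces 22 runs, and three merge passes reduce them to 6, 2, and 1, giving 4 passes and 864 page I/Os.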

Quick sort is a great choice in Pass 0 to generate the initial runs

In-memory heapsort is another method for Pass 0, and it can generate larger, variable-length runs. On average it generates runs of about $2 \cdot B$ pages (quick sort would give $B$). When the data is reverse sorted, runs are only $B - 2$ pages. When the data is already fully sorted, it produces a single run of $P$ pages, the length of the whole file
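
The heap-based run generation described above is commonly called replacement selection. A rough sketch using Python's `heapq` follows. It measures memory in records instead of pages: the hypothetical parameter `capacity` stands for the number of records the heap can hold, and the function name `replacement_selection` is made up for this example.

```python
import heapq
from typing import Iterable, Iterator, List

_DONE = object()  # sentinel marking the end of the input


def replacement_selection(records: Iterable[int], capacity: int) -> Iterator[List[int]]:
    """Yield sorted runs using a heap that holds at most `capacity` records."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    it = iter(records)
    heap: List[int] = []
    for rec in it:                        # fill the heap from the start of the input
        heap.append(rec)
        if len(heap) == capacity:
            break
    heapq.heapify(heap)
    run: List[int] = []
    pending: List[int] = []               # records that must wait for the next run
    while heap:
        smallest = heapq.heappop(heap)
        run.append(smallest)              # emit the smallest record to the current run
        nxt = next(it, _DONE)
        if nxt is not _DONE:
            if nxt >= smallest:
                heapq.heappush(heap, nxt)  # can still join the current run
            else:
                pending.append(nxt)        # too small: belongs to the next run
        if not heap:                      # current run is finished
            yield run
            run = []
            heap, pending = pending, []
            heapq.heapify(heap)


sorted_runs = list(replacement_selection(range(10), capacity=3))
reverse_runs = list(replacement_selection(range(9, -1, -1), capacity=3))
```

On already sorted input, the sketch emits one run covering the whole input. On reverse-sorted input, every run is only as long as the heap, matching the two extreme cases in the notes.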

Binyamin's Notes
