Databases can be big and need not fit in memory
Data also needs to be stored in persistent, non-volatile storage
Therefore, the DBMS must deal with moving data from non-volatile storage to volatile storage

Sorting is super important but is also quite slow
It is a classic problem in computer science but also a database-specific problem, since many relational operations require sorting

  1. ORDER BY
  2. DISTINCT
  3. GROUP BY
  4. Bulk loading
  5. Sort-merge join

If one can put all the data in memory then sorting is an easy solution. But say that’s not the case. If one wanted to store 1 GB of data with 1 MB memory then sorting is quite tricky.

In general, the goal is to minimize disk access when under memory constraints
Streaming data through RAM

  1. Read page from disk to input buffer
  2. Operate on data; write to output buffer
  3. If output buffer is full, write to disk
  4. If input buffer is consumed, read another page

Some operations write more or less than they read. Like cross-product generates a lot of data, or even update might only take a little data to update many entries. Compression writes less than what it takes in.

External merge sort is a divide-and-conquer algorithm that splits data into runs

  1. In the sort phase, runs are all sorted individually in memory
  2. At this point, merging continues

In 2-way external merge sort, 2 sorted runs are merged during each pass, so the runs double in size each pass. This requires 2 input buffers and 1 output buffer.

The total number of passes is , one pass for sorting and for merging. Therefore, the total I/O cost is

In the generalized external merge sort, we use total buffers and produce sorted runs at pass 0. That means the total I/O cost is

Quick sort is a great choice in Pass 0 to generate the initial runs

In-memory heapsort is another method that can generate larger variable length runs. On average can generate runs that are pages (quick sort would be ). When data is reversely sorted this would give . When data is fully sorted this would be , the length of the whole file