There are a few ways to classify indexes

  • Data placement
  • Index design
  • Indexing techniques

Indexes are awesome: they help read data more efficiently, facilitating both point queries and range queries. Common indexing techniques include

  • B+-tree
  • Hash index
  • LSM-tree
  • Bitmap

Which attributes should we index on?

  1. Primary keys and foreign keys (most commonly used in joins, generally done by default)
  2. Attributes used in WHERE clause (enables efficient filtering)
  3. Columns used commonly in GROUP BY or JOIN

Indexes and data can be tightly coupled, or decoupled. The trees are made up of inner and leaf nodes.

  • When the data is coupled, inner nodes will include keys, values, and pointers to children.
  • Otherwise, the values are stored only further down, in the leaf reached by following the same key. However, in these decoupled trees, the leaf nodes are connected with express pathways, i.e. sibling pointers (since everything is in the leaf nodes), which makes leaf-level scans fast.

To clarify the differences,

  1. Coupled trees can terminate point queries early, while decoupled trees always go down to the leaf nodes
  2. Coupled trees have a high range-query cost, since a range scan must keep traversing the tree instead of following leaf-level sibling pointers
  3. Coupled trees have data redundancy when there are multiple indexes, whereas decoupled trees can easily support multiple indexes on the same base data

Indexes can be clustered or unclustered,

  • Clustered indexes have the data physically sorted on the index attribute. This makes for very efficient point and range queries but only one clustered index can be made per relation
  • On an unclustered index, the point queries are still very efficient, but range queries are quite slow because they involve many random accesses

Indexes can be primary or secondary,

  • A primary index is on the primary key and never contains duplicates. There is one per relation
  • A secondary index can have duplicates

Indexes can be dense or sparse,

  • In a dense index, each key has at least one data entry
  • In a sparse index, only one key per page is indexed, which requires the data file itself to be kept sorted. This keeps the index small and range queries fast, even if point queries are a bit slower (the target page still has to be scanned); see the lookup sketch below
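As a rough illustration, a sparse-index lookup might look like the following sketch, assuming the data file is a list of sorted pages and the index keeps only the first key of each page (all names and data here are illustrative):

```python
from bisect import bisect_right

# Hypothetical sorted data file: each "page" is a sorted list of (key, row) pairs.
pages = [
    [(1, "a"), (3, "b"), (5, "c")],
    [(7, "d"), (9, "e"), (11, "f")],
    [(13, "g"), (15, "h")],
]

# Sparse index: one entry per page, holding the first key on that page.
sparse_index = [page[0][0] for page in pages]

def sparse_lookup(key):
    # Binary search the index for the only page that could hold the key,
    # then scan that page (the extra scan is why point queries are a bit slower).
    page_no = bisect_right(sparse_index, key) - 1
    if page_no < 0:
        return None
    for k, row in pages[page_no]:
        if k == key:
            return row
    return None

print(sparse_lookup(9))   # 'e'
print(sparse_lookup(4))   # None
```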

Indexes can be single-key or composite-key (built over multiple attributes)

Data structure types,

  1. Tree-based
  2. Hash-based

Tree-Based Indexing

The B-tree family comprises many types of balanced n-ary trees. The B+-tree is the most common (the plain B-tree is sometimes used too).

The B+-tree is an n-ary, self-balancing, ordered tree structure

  • Balanced performance on inserts, updates, deletes, and point queries
  • Optimized for range queries
  • The key property is “n-ary”: n is made as large as fits in a single page, which keeps the tree shallow

It restructures after inserts and deletes in order to be perfectly balanced
There are index nodes and leaf nodes

The order n is the maximum number of children per node
Every node must have between ⌈n/2⌉ − 1 and n − 1 keys (apart from the root, which has between 1 and n − 1)

Nodes are made up of keys plus pointers or values (see the sketch below).

  • Index nodes store keys and pointers to their children
  • Leaf nodes store keys, values, and sibling pointers
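A minimal sketch of the two node types, assuming a B+-tree where index nodes hold separator keys plus child pointers and leaves hold keys, values, and a next-leaf pointer (class and field names are my own):

```python
class IndexNode:
    def __init__(self):
        self.keys = []      # separator keys: keys[i] separates children[i] and children[i+1]
        self.children = []  # child pointers; len(children) == len(keys) + 1

class LeafNode:
    def __init__(self):
        self.keys = []      # search keys, kept sorted
        self.values = []    # values[i] belongs to keys[i] (or points to the record)
        self.next = None    # sibling pointer to the next leaf, used by range scans
```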

Point queries start from the root node and always end at a leaf node. Asymptotically this takes O(log_F N) random I/Os, where F is the fanout and N the number of pages

  • Even if a key is stored in an index node, it might not be in any leaf node, and is therefore not in the index; index-node keys are only separators, so presence is decided at the leaf level

Range queries are also simple: find the correct starting leaf node (O(log_F N) random I/Os), find the first matching key, and then scan along the leaf level via sibling pointers (sequential I/Os)
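Using the node classes sketched above, point and range queries could look roughly like this (an in-memory simplification; a real engine performs one page I/O per node visited):

```python
from bisect import bisect_left, bisect_right

def find_leaf(node, key):
    # Walk from the root down to the leaf that could contain the key
    # (one random I/O per level in a disk-based tree).
    while isinstance(node, IndexNode):
        node = node.children[bisect_right(node.keys, key)]
    return node

def point_query(root, key):
    leaf = find_leaf(root, key)
    i = bisect_left(leaf.keys, key)
    if i < len(leaf.keys) and leaf.keys[i] == key:
        return leaf.values[i]
    return None   # the key may appear as a separator above, but only the leaf decides

def range_query(root, lo, hi):
    # Find the starting leaf, then follow sibling pointers (sequential I/Os).
    leaf = find_leaf(root, lo)
    out = []
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > hi:
                return out
            if k >= lo:
                out.append((k, v))
        leaf = leaf.next
    return out
```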

To insert (sketched in code after this list),

  1. Start from the root, find the correct leaf
  2. If the leaf L has the required space, insert the entry into L at the correct location
  3. Otherwise
    1. Split L into two nodes, L1 and L2
    2. Redistribute entries evenly and copy the middle key of the split to the parent node
    3. Adjust the pointers
    4. Repeat the process if required
      Splits grow the tree, root splits increase the height
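A simplified sketch of that insert procedure, again using the node classes above and an illustrative capacity of MAX_KEYS entries per node (a real implementation sizes nodes to fill one page):

```python
from bisect import bisect_right

MAX_KEYS = 4   # illustrative capacity; in practice chosen so a node fills one page

def tree_insert(root, key, value):
    split = _insert(root, key, value)
    if split is None:
        return root
    # The root itself was split: grow the tree by one level.
    sep_key, right = split
    new_root = IndexNode()
    new_root.keys = [sep_key]
    new_root.children = [root, right]
    return new_root

def _insert(node, key, value):
    if isinstance(node, LeafNode):
        i = bisect_right(node.keys, key)
        node.keys.insert(i, key)
        node.values.insert(i, value)
        if len(node.keys) <= MAX_KEYS:
            return None
        # Split the leaf: redistribute entries evenly, copy the middle key up.
        mid = len(node.keys) // 2
        right = LeafNode()
        right.keys, node.keys = node.keys[mid:], node.keys[:mid]
        right.values, node.values = node.values[mid:], node.values[:mid]
        right.next, node.next = node.next, right
        return right.keys[0], right            # separator copied to the parent
    # Index node: recurse into the correct child, then handle a possible child split.
    i = bisect_right(node.keys, key)
    split = _insert(node.children[i], key, value)
    if split is None:
        return None
    sep_key, right = split
    node.keys.insert(i, sep_key)
    node.children.insert(i + 1, right)
    if len(node.keys) <= MAX_KEYS:
        return None
    # Split the index node: the middle key moves up rather than being copied.
    mid = len(node.keys) // 2
    up_key = node.keys[mid]
    right_ix = IndexNode()
    right_ix.keys, right_ix.children = node.keys[mid + 1:], node.children[mid + 1:]
    node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
    return up_key, right_ix
```

The root split in tree_insert is exactly the case where the tree height increases.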

To delete,

  1. Start from the root, find the correct leaf
  2. Remove the entry from the leaf,
  3. If the leaf L is not at least half full
    1. Try redistributing by borrowing entries from a sibling (and update the separator key in the parent)
    2. Otherwise merge with a sibling
    3. Update parent if necessary (merging with the root is also possible here)
      In this merge process, a parent can end up merged with one of its children: if the root is left with a single child, the root is removed and the tree height decreases

On average, the nodes in a B+-tree are 67% full
This means the average fanout is roughly two-thirds of the maximum (e.g. around 133 if a full node fits 200 pointers)
Typical tree height is 3 or 4 (because fanout^height is a massive number of records; see the arithmetic below)
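A quick back-of-the-envelope check, assuming a maximum fanout of 200 and 67% occupancy (so an average fanout of roughly 133):

```python
avg_fanout = 133
for height in (2, 3, 4):
    # Leaf-level entries reachable with this many levels of fanout.
    print(height, f"{avg_fanout ** height:,}")
# 2 -> 17,689    3 -> 2,352,637    4 -> 312,900,721
```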

But can we do better than wasting 33% of the space? Yes, bulk loading enables this if all the data is already there. The process essentially involves sorting the data, filling leaf pages sequentially, and adding each new leaf to its parent one by one, splitting when necessary.
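A sketch of bulk loading under those assumptions: sorted (key, value) pairs are packed into leaves left to right, and each upper level is then built from the level below (the capacity parameter and helper names are illustrative):

```python
def bulk_load(sorted_items, capacity=4):
    # Build the leaf level from already-sorted (key, value) pairs.
    level = []   # list of (smallest key in the subtree, node)
    for start in range(0, len(sorted_items), capacity):
        chunk = sorted_items[start:start + capacity]
        leaf = LeafNode()
        leaf.keys = [k for k, _ in chunk]
        leaf.values = [v for _, v in chunk]
        if level:
            level[-1][1].next = leaf          # chain leaves with sibling pointers
        level.append((leaf.keys[0], leaf))
    if not level:
        return LeafNode()                     # empty tree

    # Build index levels bottom-up until a single root remains.
    while len(level) > 1:
        parents = []
        for start in range(0, len(level), capacity + 1):
            group = level[start:start + capacity + 1]
            node = IndexNode()
            node.children = [child for _, child in group]
            node.keys = [min_key for min_key, _ in group[1:]]   # separators
            parents.append((group[0][0], node))
        level = parents
    return level[0][1]
```

Packing leaves to a chosen fill factor (leaving some slack for future inserts) is the knob that trades space savings against fewer early splits.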

Hash-Based Indexing

A hash table is an associative-array data structure that maps keys to values; it is typically unordered
The hash function gives us the array index given a key

Space complexity: O(n)
Time complexity (for inserts and lookups): O(1) on average, O(n) in the worst case

A hash function should be fast and efficient, with a low collision rate. We don’t need an advanced cryptographic hash function.

In static hashing, there are a fixed number of buckets, and collisions can be dealt with using chaining or open addressing
In dynamic hashing, the bucket count adjusts to match the data and the hash function dynamically updates

Ideally, we allocate a large array of size N and find the index by computing hash(key) mod N. If there are no collisions, this is an easy scheme for lookups. Unfortunately, life is not ideal.
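The ideal scheme as a sketch, assuming N slots, Python's built-in hash, and no collisions at all:

```python
N = 16
slots = [None] * N            # one slot per possible array index

def put(key, value):
    slots[hash(key) % N] = (key, value)    # assumes no two keys ever collide

def get(key):
    entry = slots[hash(key) % N]
    if entry is not None and entry[0] == key:
        return entry[1]
    return None
```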

Linear probing fixes the collision problem by scanning forward to the next free slot, but deletes get complicated (see the sketch after this list).

  1. Re-hashing is too expensive to do on each delete
  2. Tombstones (a special value representing a deleted item) work, but the worst-case time is still not great
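A sketch of linear probing with tombstones, assuming a fixed-size table (names are illustrative); the tombstone marker keeps later probes from stopping too early:

```python
N = 8
EMPTY, TOMBSTONE = object(), object()
table = [EMPTY] * N

def _probe(key):
    # Yield slot indices in probe order: the hash slot, then the slots after it.
    start = hash(key) % N
    for step in range(N):
        yield (start + step) % N

def insert(key, value):
    for i in _probe(key):
        # Reuse the first empty or tombstoned slot (or overwrite the same key).
        if table[i] is EMPTY or table[i] is TOMBSTONE or table[i][0] == key:
            table[i] = (key, value)
            return
    raise RuntimeError("table full")

def lookup(key):
    for i in _probe(key):
        if table[i] is EMPTY:                # a truly empty slot ends the probe
            return None
        if table[i] is not TOMBSTONE and table[i][0] == key:
            return table[i][1]
    return None

def delete(key):
    for i in _probe(key):
        if table[i] is EMPTY:
            return
        if table[i] is not TOMBSTONE and table[i][0] == key:
            table[i] = TOMBSTONE             # keep a marker instead of re-hashing
            return
```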

Chaining resolves collisions with overflow pages, connected to the main pages as linked lists. This is pretty good on average but the worst-case is not great
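A sketch of chaining, with an in-memory list standing in for each primary page and its chain of overflow pages:

```python
N = 8
buckets = [[] for _ in range(N)]   # each list models a primary page plus its overflow chain

def insert(key, value):
    chain = buckets[hash(key) % N]
    for i, (k, _) in enumerate(chain):
        if k == key:
            chain[i] = (key, value)    # overwrite an existing entry
            return
    chain.append((key, value))         # otherwise append (an overflow page in a real system)

def lookup(key):
    for k, v in buckets[hash(key) % N]:
        if k == key:
            return v                   # O(1) on average, O(n) if everything chains together
    return None
```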

Extendible hashing is a dynamic hashing approach that splits buckets when they overflow. One layer is a directory that points to the buckets (index pages). In case of overflow, the directory size may be doubled, but the number of buckets is not: only the overflowing bucket is split.

Each directory entry is associated with a number of bits given by the global depth. The entry references the bucket that stores the data whose hash ends in those bits. However, the individual index pages store their own local depths. When an index page overflows, it is split and its local depth increases. If necessary, the global depth also increases, which doubles the size of the directory. If the local depth of a page is smaller than the global depth, then multiple directory entries will point to it.

The I/O cost for a lookup is generally 2, but only 1 if the directory fits into memory. The limitations of this are the costs of maintaining a directory and how skewed data can cause more splits.
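A sketch of extendible hashing, assuming the directory is indexed by the low-order global-depth bits of the hash and each bucket tracks its own local depth (capacities and names are illustrative):

```python
BUCKET_CAPACITY = 2

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = {}                       # key -> value

global_depth = 1
directory = [Bucket(1), Bucket(1)]            # 2**global_depth entries

def _dir_index(key):
    return hash(key) & ((1 << global_depth) - 1)   # low-order global_depth bits

def lookup(key):
    return directory[_dir_index(key)].items.get(key)

def insert(key, value):
    global global_depth, directory
    bucket = directory[_dir_index(key)]
    bucket.items[key] = value
    while len(bucket.items) > BUCKET_CAPACITY:
        if bucket.local_depth == global_depth:
            directory = directory + directory      # double the directory only
            global_depth += 1
        # Split the overflowing bucket: one extra bit of local depth decides the side.
        bucket.local_depth += 1
        bit = 1 << (bucket.local_depth - 1)
        new_bucket = Bucket(bucket.local_depth)
        stay, move = {}, {}
        for k, v in bucket.items.items():
            (move if hash(k) & bit else stay)[k] = v
        bucket.items, new_bucket.items = stay, move
        # Point the directory entries whose new bit is set at the new bucket.
        for i in range(len(directory)):
            if directory[i] is bucket and (i & bit):
                directory[i] = new_bucket
        bucket = directory[_dir_index(key)]        # may still overflow; loop again
```

Note how doubling only copies directory pointers; the existing buckets are reused, and only the overflowing one is actually split.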

Linear hashing has a system for splitting buckets evenly and uses multiple hash functions. Essentially, overflow pages are used when an overflow happens, and buckets are split in round-robin fashion. When a bucket is split, it gets a new hash function (if the initial hash function was h(k) = k mod N, the new secondary one is h'(k) = k mod 2N). Then, if a key maps to a bucket that has already been split, the secondary hash function is used instead to decide which of the two resulting buckets it belongs to.

Linear hashing has the advantage of not using a directory.
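A sketch of linear hashing, assuming h_i(k) = hash(k) mod (N · 2^i): buckets are split in round-robin order tracked by a split pointer, and keys that land in front of the pointer are re-addressed with the next hash function (the capacity threshold and names are illustrative):

```python
N = 4                  # initial number of buckets
CAPACITY = 2           # primary page size; extra list entries model overflow pages
level, next_split = 0, 0
buckets = [[] for _ in range(N)]

def _address(key):
    b = hash(key) % (N * 2 ** level)               # h_level
    if b < next_split:                             # bucket already split this round
        b = hash(key) % (N * 2 ** (level + 1))     # use h_{level+1} instead
    return b

def lookup(key):
    for k, v in buckets[_address(key)]:
        if k == key:
            return v
    return None

def insert(key, value):
    global level, next_split
    buckets[_address(key)].append((key, value))
    if len(buckets[_address(key)]) <= CAPACITY:
        return
    # An overflow happened somewhere: split the bucket at the round-robin pointer
    # (not necessarily the bucket that overflowed).
    buckets.append([])                             # new image of the bucket being split
    old_items = buckets[next_split]
    buckets[next_split] = []
    next_split += 1
    if next_split == N * 2 ** level:               # finished a full round of splits
        level += 1
        next_split = 0
    for k, v in old_items:                         # re-hash the split bucket's entries
        buckets[_address(k)].append((k, v))
```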