Chain Replication

Chain replication for supporting high throughput and availability

Chain replication tries to support high throughput, availability, and strong consistency

The state of an object consists of a history of updates that have been performed and a set of pending requests

Chain replication uses a fail-stop approach, where servers halt in response to failure rather than making erroneous transitions

We use a primary/backup approach, so $t - 1$ servers can fail without compromising an object. These servers are linearly ordered to form a chain.

The reply for every request is generated and sent by the tail
Query requests are directed to the tail
Update requests are processed in the head and then passed down to the tail

Since update requests are processed by a single replica, they can be non-deterministic

The master service detects server failure and informs each server in the chain of its position and the current head and tail. In their prototype, this is implemented as a replicated service coordinated with Paxos

Fault Tolerance

First note that servers earlier in the chain have a more recent history than servers later in the chain

If the master detects failure of the head, we simply notify the chain of the update and lose any of the pending updates that had not been propagated.

If the master detects failure of the tail, it notifies the chain of the update. At this point, the set of completed requests has likely increased,

Failure of other servers is more complex. Each server maintains a list of sent requests that have been forwarded but not processed by the tail. When the tail processes a request, it forwards and acknowledgement up the chain. The master finds the last request that $S^{+}$ has received and tells $S^{-}$ to reforward the missing bits.

To add a new server, we append to the tail. The existing tail is responsible for forwarding the history until it is caught up, at which point requests in the sent list are forwarded to the new tail.

Performance

Chain replication is an instance of primary/backup. The head and tail share responsibility for ordering operations, which in theory is less overhead. However, serially distributing data also makes for higher latency.

In their experiments, they find that chain replication is consistently faster than a regular primary/backup protocol.

Binyamin's Notes

Explorer

Fault Tolerance

Performance

Table of Contents