Practical Byzantine Fault Tolerance

The idea of Byzantine fault tolerance was only theoretical for a while. The alternative is essentially fail-stop, killing any misbehaving server.

This a replicated state machine approach

We assume an attacker essentially has full control of (out of ) machines and can delay a message for an arbitrary but limited amount of time

An attacker cannot fail a client or break our cryptographic schemes

Let’s iterate on a design,

  • Clients and servers have public/private key pairs and sign every message. We wait for all servers to reply with our answer. But this breaks liveness!
  • Let’s include servers. We wait for a majority () matching replies. Our problem is that our majority may include bad nodes.
  • nodes works. If we wait for replicas, we know a majority of nodes must be correct

A client sends pre-prepare messages to every replica
When a replica receives a pre-prepare, it multicasts a prepare message
Once a replica receives prepare messages, it may execute the attached operation and forward the result to the client

What do we do when the primary appears faulty? We must have a view-change protocol