ZooKeeper

ZooKeeper provides a simple, high-performance kernel for building more complex distributed applications. It provides wait-free operations, FIFO ordering of each client's requests, and linearizable writes.

Guaranteeing FIFO order for each client's requests allows a client to submit many operations asynchronously, without waiting for each one to complete, which is desirable for performance.
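
A rough sketch of this pipelining with the official Java client (the connection string and the /app/config znode are made-up examples, and /app/config is assumed to already exist): the client issues several asynchronous writes back to back, and the callbacks fire in issue order.

```java
import org.apache.zookeeper.ZooKeeper;

public class PipelinedWrites {
    public static void main(String[] args) throws Exception {
        // 5 second session timeout; the watcher here ignores connection events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        for (int i = 0; i < 3; i++) {
            byte[] data = ("config-v" + i).getBytes();
            // Asynchronous setData: returns immediately; ZooKeeper applies the
            // writes, and runs the callbacks, in the order they were submitted.
            // Version -1 means "apply regardless of the znode's current version".
            zk.setData("/app/config", data, -1,
                (rc, path, ctx, stat) ->
                    System.out.println("write " + ctx + " completed, rc=" + rc),
                i);   // ctx identifies which write this callback belongs to
        }

        Thread.sleep(1000);   // crude wait for the callbacks, just for this sketch
        zk.close();
    }
}
```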

To enable clients to cache data, ZooKeeper uses a watch mechanism that notifies clients of changes (that a change occurred, not what its content is). This is much more performant than having the servers manage client caches directly.
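
A sketch of the resulting cache-invalidation pattern, assuming the official Java client and a made-up /app/config znode (watches are one-time triggers, so the watch is re-registered on every re-read):

```java
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class CachedConfig {
    private final ZooKeeper zk;
    private volatile byte[] cached;   // local copy of the znode's data

    CachedConfig(ZooKeeper zk) { this.zk = zk; }

    // Read /app/config and leave a watch; the server will notify us once about
    // the next change (it says *that* the data changed, not what it changed to).
    void refresh() throws Exception {
        cached = zk.getData("/app/config", event -> {
            if (event.getType() == EventType.NodeDataChanged) {
                try {
                    refresh();        // invalidate: re-read and re-register the watch
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }, new Stat());
    }

    byte[] read() {
        return cached;                // served from the local copy, no server round trip
    }
}
```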

ZooKeeper organizes znodes in a hierarchical namespace, similar to a file system:

  • Regular znodes must be deleted explicitly
  • Ephemeral znodes are cleaned up at the end of a session automatically
  • Znodes can be created as sequential, where a counter is appended to its name automatically (the sequence value is no smaller than the value of any other sibling)
  • Znodes are not intended for general data storage; rather, they map to coordination meta-data used by client applications
  • Znodes have associated time stamps and version counters
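
A minimal sketch of creating each kind of znode with the official Java client (the paths and the open ACL are made-up example choices):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeKinds {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Regular znode: stays until someone deletes it explicitly.
        zk.create("/app", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral znode: removed automatically when this session ends.
        zk.create("/app/worker-1", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Sequential znode: the server appends a monotonically increasing counter,
        // so the returned path looks like /app/task-0000000007.
        String path = zk.create("/app/task-", new byte[0], Ids.OPEN_ACL_UNSAFE,
                                CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("created " + path);

        zk.close();
    }
}
```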

A client connects to ZooKeeper through a session, which has an associated timeout; the session ends when the client closes it or when ZooKeeper deems the client faulty.

The API is simple and self-explanatory, with both synchronous and asynchronous versions of each operation:

  • create(path, data, flags)
  • delete(path, version)
  • exists(path, watch)
  • getData(path, watch)
  • setData(path, data, version)
  • getChildren(path, watch)
  • sync(path): waits for all updates pending at the start of the operation to propagate to the server the client is connected to (used to guarantee up-to-date reads)
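
A sketch of two of these calls that are easy to misread, assuming the official Java client and a made-up /app/config znode: sync() (used here in its asynchronous form) followed by a read gives an up-to-date read, and setData() with the version from a previous read acts as a conditional update.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SyncAndConditionalUpdate {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // sync() asks the connected server to catch up on pending updates;
        // wait for its callback before reading when an up-to-date read is needed.
        CountDownLatch synced = new CountDownLatch(1);
        zk.sync("/app/config", (rc, path, ctx) -> synced.countDown(), null);
        synced.await();

        Stat stat = new Stat();
        byte[] data = zk.getData("/app/config", false, stat);

        // Conditional update: succeeds only if the znode's version is still the
        // one we read; otherwise ZooKeeper rejects it with BadVersionException.
        try {
            zk.setData("/app/config", "new-value".getBytes(), stat.getVersion());
        } catch (KeeperException.BadVersionException e) {
            System.out.println("someone else updated it first; re-read and retry");
        }

        zk.close();
    }
}
```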

Znodes do not have handles; they are referred to by their full path.

A key ordering property is that a client receives a watch notification BEFORE it sees any new state of the system resulting from the change it was watching.

Refer to the paper for the primitives built from this API (configuration management, rendezvous, group membership, locks, double barriers), which show the power of this system.

Consistency and Fault Tolerance

ZooKeeper replicates the full data tree on every server, keeping it in memory and logging changes to disk.

Read requests simply read the state of the local database

All write requests are forwarded to a leader, which orders them, transforms them into transactions, and sends the resulting state changes to the followers. These changes are broadcast through “Zab”, an atomic broadcast protocol that uses simple majority quorums to decide on a proposal. Zab guarantees that all changes from previous leaders are delivered to a newly established leader before it begins to broadcast its own changes.
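
This is internal to the servers, but as a toy illustration of the quorum rule (not Zab's actual code): the leader commits a proposal once a strict majority of the ensemble has acknowledged it, which ensures that any two quorums overlap in at least one server.

```java
// Toy sketch only, not Zab's real implementation.
public class QuorumSketch {
    // A proposal commits once more than half of the servers have acked it.
    static boolean committed(int acks, int ensembleSize) {
        return acks > ensembleSize / 2;
    }

    public static void main(String[] args) {
        System.out.println(committed(2, 5));   // false: 2 of 5 is not a majority
        System.out.println(committed(3, 5));   // true:  3 of 5 is
    }
}
```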

To simplify the implementation, ZooKeeper uses the Zab leader as its own leader and the Zab proposal log as its write-ahead log, so that transactions are not written to disk twice. Zab may redeliver messages during recovery, but this is fine because transactions are idempotent.

Replicas periodically make fuzzy snapshots of the in-memory tree (without locking it) and recover by loading the latest snapshot and reapplying all transactions logged since the snapshot began; replaying transactions the snapshot may already reflect is safe because they are idempotent.
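
A toy sketch of why idempotency makes both redelivery and fuzzy snapshots safe (the Znode and SetDataTxn records are made up, not ZooKeeper's real data structures): the leader resolves a request into a transaction whose effect depends only on the transaction itself, so applying it a second time, whether because Zab redelivered it or because the fuzzy snapshot already contained it, leaves the tree unchanged.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IdempotentReplay {
    record Znode(byte[] data, int version) {}
    // The leader resolves a conditional request (e.g. "setData if version == 2")
    // into an unconditional transaction carrying the resulting data and version.
    record SetDataTxn(String path, byte[] data, int newVersion) {}

    // Applying a transaction overwrites the znode with values taken only from
    // the transaction, so applying it twice produces the same state.
    static void apply(Map<String, Znode> tree, SetDataTxn txn) {
        tree.put(txn.path(), new Znode(txn.data(), txn.newVersion()));
    }

    // Recovery: load the last fuzzy snapshot, then replay every transaction
    // logged since the snapshot began; some may already be reflected in it.
    static Map<String, Znode> recover(Map<String, Znode> snapshot,
                                      List<SetDataTxn> logSinceSnapshotStart) {
        Map<String, Znode> tree = new HashMap<>(snapshot);
        for (SetDataTxn txn : logSinceSnapshotStart) {
            apply(tree, txn);
        }
        return tree;
    }

    public static void main(String[] args) {
        Map<String, Znode> tree = new HashMap<>();
        SetDataTxn txn = new SetDataTxn("/app/config", "v3".getBytes(), 3);
        apply(tree, txn);
        apply(tree, txn);   // duplicate delivery: same final state
        System.out.println(tree.get("/app/config").version());   // prints 3
    }
}
```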

Watch notifications are handled locally; only the server a client is connected to tracks notifications

Every response from a replica includes a zxid identifying the last transaction that replica has processed. If the client connects to a new server, that server must be at least as up to date as the last zxid the client has seen; otherwise the client waits for it to catch up, or finds a more up-to-date replica.
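
A toy sketch of that client-side rule (not the real client library): the client remembers the highest zxid it has seen and only treats a server as usable once the server's state is at least that recent.

```java
// Toy sketch only, not the real ZooKeeper client library.
public class ReconnectRule {
    // The client never connects "backwards in time": the candidate server must
    // have processed at least every transaction the client has already observed.
    static boolean usable(long serverLastZxid, long clientLastSeenZxid) {
        return serverLastZxid >= clientLastSeenZxid;
    }

    public static void main(String[] args) {
        System.out.println(usable(100, 120));   // false: server is behind the client
        System.out.println(usable(130, 120));   // true
    }
}
```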

Session failures are detected with simple heartbeats

Question

One use of ZooKeeper is as a fault-tolerant lock service (see the section “Simple Locks” on page 6). Why isn’t it possible for two clients to acquire the same lock? In particular, how does ZooKeeper decide that a client has failed so that it can give the client’s locks to other clients?

The lock is represented by a single ephemeral znode at a known path. To acquire the lock, a client attempts to create that znode. If another client already holds the lock, the znode exists and the create fails; the client then sets a watch to be notified when the znode is deleted. The holder releases the lock by deleting the znode, which notifies all watching clients. Because the lock znode is ephemeral, ZooKeeper deletes it automatically when the session that created it ends. So if a client fails, ZooKeeper notices the absence of its heartbeats, ends the session once the timeout passes, and removes the ephemeral znode, allowing another client to acquire the lock.
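
A sketch of this lock with the official Java client (the lock path and helper class are made up; this is the simple, herd-effect version from the paper, where every waiter watches the same znode):

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SimpleLock {
    private final ZooKeeper zk;
    private final String path;   // e.g. "/locks/my-lock"

    SimpleLock(ZooKeeper zk, String path) {
        this.zk = zk;
        this.path = path;
    }

    // Block until we hold the lock.
    void acquire() throws Exception {
        while (true) {
            try {
                // Ephemeral: the server deletes it if our session ends,
                // so a crashed holder cannot keep the lock forever.
                zk.create(path, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return;   // we created the znode, so we hold the lock
            } catch (KeeperException.NodeExistsException e) {
                // Someone else holds it: watch the znode and wait for its deletion.
                CountDownLatch released = new CountDownLatch(1);
                Stat stat = zk.exists(path, event -> {
                    if (event.getType() == EventType.NodeDeleted) {
                        released.countDown();
                    }
                });
                if (stat == null) {
                    continue;   // released between create and exists; retry immediately
                }
                released.await();
            }
        }
    }

    // Release the lock; deleting the znode notifies every watching client.
    void release() throws Exception {
        zk.delete(path, -1);   // -1: delete regardless of version
    }
}
```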