A Spanner deployment is called a universe and is organized as a set of zones, which are the unit of physical isolation. A zone has one zonemaster and hundreds to thousands of spanservers; zones also have location proxies that clients use to locate the spanservers serving their data.
The universe master (a status console that displays information about all the zones) and the placement driver (which handles automated movement of data across zones) are singletons.
Spanservers are responsible for hundreds to thousands of tablets, each of which is a bag of (key, timestamp) → value mappings. A tablet's state is stored in a set of B-tree-like files and a write-ahead log (WAL). Each tablet comes with a Paxos state machine, with long-lived leaders appointed through time-based leases (10 seconds by default).
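As a rough illustration of that data model, here is a minimal Go sketch of a tablet as a bag of (key, timestamp) → value entries; the names (tablet, readAt) are hypothetical, and the B-tree files, WAL, and Paxos replication are all omitted:

```go
package main

import "fmt"

// version is one timestamped value for a key; a tablet's state is
// logically a bag of (key, timestamp, value) entries.
type version struct {
	ts  int64 // commit timestamp
	val string
}

// tablet maps each key to its versions in increasing timestamp order.
type tablet struct {
	data map[string][]version
}

// write appends a new version (durability via the WAL/Paxos is omitted).
func (t *tablet) write(key string, ts int64, val string) {
	t.data[key] = append(t.data[key], version{ts, val})
}

// readAt returns the newest version with timestamp <= ts, if one exists.
func (t *tablet) readAt(key string, ts int64) (string, bool) {
	val, ok := "", false
	for _, v := range t.data[key] {
		if v.ts <= ts {
			val, ok = v.val, true
		}
	}
	return val, ok
}

func main() {
	tab := &tablet{data: map[string][]version{}}
	tab.write("x", 10, "a")
	tab.write("x", 20, "b")
	fmt.Println(tab.readAt("x", 15)) // prints: a true
}
```

Keeping every timestamped version around is what later lets reads at a chosen timestamp proceed without locks.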
Leaders implement a lock table for concurrency control, which holds the state needed for two-phase locking. Leaders also implement a transaction manager for running distributed transactions, acting as the participant leader (while the other replicas in the group act as participant slaves).
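To make the lock table concrete, here is a hedged Go sketch of the idea (hypothetical names, exclusive locks only; Spanner's real table covers key ranges and supports more lock modes):

```go
package main

import "sync"

// lockTable is a simplified per-leader lock table: it maps keys to the
// transaction currently holding an exclusive lock.
type lockTable struct {
	mu     sync.Mutex
	holder map[string]int64 // key -> transaction id
}

// tryLock acquires the lock for txn if the key is free or already held by txn.
// A real implementation would block or queue the requester on conflict.
func (lt *lockTable) tryLock(key string, txn int64) bool {
	lt.mu.Lock()
	defer lt.mu.Unlock()
	if h, held := lt.holder[key]; held && h != txn {
		return false // conflict: another transaction holds the lock
	}
	lt.holder[key] = txn
	return true
}

// releaseAll drops every lock held by txn, as two-phase locking does at
// commit or abort time.
func (lt *lockTable) releaseAll(txn int64) {
	lt.mu.Lock()
	defer lt.mu.Unlock()
	for k, h := range lt.holder {
		if h == txn {
			delete(lt.holder, k)
		}
	}
}

func main() {
	lt := &lockTable{holder: map[string]int64{}}
	_ = lt.tryLock("x", 1) // txn 1 locks x
	_ = lt.tryLock("x", 2) // txn 2 conflicts and fails
	lt.releaseAll(1)       // commit point: release all of txn 1's locks
}
```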
If a transaction involves multiple Paxos groups, one leader is selected as the coordinator leader, whose transaction manager state is replicated in its Paxos group.
Operations that require synchronization acquire locks in the lock table; other operations bypass it.
Writes must initiate Paxos at the leader, but reads can access state at any replica in the Paxos group that is sufficiently up-to-date.
Data is organized into directories, which Spanner can migrate between Paxos groups and which are also the smallest unit whose replication properties can be configured. Technically, large directories are sharded into fragments, which are the actual smallest unit that stays together.
Spanner supports semi-relational tables with a query language and general-purpose transactions
Concurrency Control
Spanner uses the TrueTime API to guarantee correctness; TT.now() returns an interval [earliest, latest] that is guaranteed to contain the absolute time at which the call was made.
Spanner only allows a leader to abdicate once TT.after(s_max) is true, where s_max is the maximum timestamp that leader has used, so that timestamps on Paxos writes stay in monotonically increasing order even across leaders.
To enforce external consistency, the leader for a write assigns it a commit timestamp s no less than TT.now().latest (the end of the interval) and then waits until TT.after(s) is true before releasing locks (the commit wait).
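A sketch of those two rules, assuming a fake TrueTime with a fixed error bound eps; ttNow, ttAfter, and commit are illustrative names, not Spanner's API:

```go
package main

import "time"

// ttInterval models TT.now(): [earliest, latest] is guaranteed to
// contain the absolute time at which the call executed.
type ttInterval struct {
	earliest, latest time.Time
}

// ttNow fakes TrueTime with a fixed error bound eps around the local clock.
func ttNow(eps time.Duration) ttInterval {
	now := time.Now()
	return ttInterval{earliest: now.Add(-eps), latest: now.Add(eps)}
}

// ttAfter reports whether timestamp t is definitely in the past.
func ttAfter(t time.Time, eps time.Duration) bool {
	return ttNow(eps).earliest.After(t)
}

// commit assigns s = TT.now().latest and then performs the commit wait:
// it blocks until TT.after(s) holds before making the commit visible
// (i.e., before releasing locks), which is what gives external consistency.
func commit(eps time.Duration) time.Time {
	s := ttNow(eps).latest
	for !ttAfter(s, eps) {
		time.Sleep(time.Millisecond)
	}
	return s
}

func main() {
	_ = commit(5 * time.Millisecond) // with eps = 5ms, the wait is roughly 2*eps = 10ms
}
```

Note that the wait lasts about twice the error bound: eps until absolute time reaches s, and another eps until the interval's earliest edge passes s.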
Each replica tracks a safe time t_safe, the maximum timestamp at which it is up-to-date; it is the minimum of the Paxos safe time (the timestamp of the last applied Paxos write) and the transaction manager's safe time (which is more complicated: it is bounded just below the prepare timestamps of prepared-but-uncommitted transactions).
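In Go, the safe-time rule is just a minimum; the names and the int64 timestamps here are assumptions for illustration:

```go
package main

import "fmt"

// safeTime returns the highest timestamp at which this replica may serve
// reads: the minimum of the Paxos safe time (timestamp of the last applied
// Paxos write) and the transaction manager safe time (just below the
// prepare timestamp of the oldest prepared-but-uncommitted transaction).
func safeTime(paxosSafe, tmSafe int64) int64 {
	if paxosSafe < tmSafe {
		return paxosSafe
	}
	return tmSafe
}

func main() {
	// A replica that has applied Paxos writes through t=100 but has a
	// transaction prepared at t=80 can only serve reads at t <= 79.
	fmt.Println(safeTime(100, 79)) // prints 79
}
```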
Class Notes
Abstractly, we have shared data replicated across different servers. When we'd like to do a read-write transaction, we go to the central server (whatever is managing all the metadata) to find the relevant Paxos leaders; one of those leaders coordinates the transaction by running a two-phase commit protocol over the Paxos leaders involved in it. Each leader keeps track of locks on the data items being accessed, so that we have exclusivity, and each group runs a standard Paxos scheme to apply the transaction's effects.
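A highly simplified sketch of that flow, where each participant stands in for an entire Paxos group and all replication, logging, locking, and failure handling is elided (the names are hypothetical):

```go
package main

import "fmt"

// participant stands in for a Paxos group's leader in two-phase commit.
type participant struct {
	name     string
	prepared bool
}

// prepare would acquire locks and replicate a prepare record via Paxos;
// here it simply votes yes.
func (p *participant) prepare() bool {
	p.prepared = true
	return true
}

// finish applies or discards the transaction and releases its locks (elided).
func (p *participant) finish(commit bool) {
	p.prepared = false
}

// twoPhaseCommit is run by the coordinator leader: phase 1 collects votes,
// phase 2 tells every participant the outcome.
func twoPhaseCommit(ps []*participant) bool {
	commit := true
	for _, p := range ps {
		if !p.prepare() {
			commit = false
			break
		}
	}
	for _, p := range ps {
		p.finish(commit)
	}
	return commit
}

func main() {
	groups := []*participant{{name: "g1"}, {name: "g2"}}
	fmt.Println("committed:", twoPhaseCommit(groups))
}
```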
This is great, but we'd like to do better for reads. If we read locally, we have neither locks nor a two-phase commit scheme to rely on, yet we still want external consistency (each transaction appears to take effect at a single instant in real time) and serializability.
We use snapshot isolation, which versions each value. This allows us to query as of a given time point (transaction ID). If it weren't for Paxos replication lag, this scheme would give us serializability, though still not external consistency. But because of Paxos, a given replica may have up-to-date info for one key but not for another, so we still lack serializability.
The key innovation of Spanner is the use of the TrueTime API to enable both serializability and external consistency of local reads. We use real timestamps to serialize transactions. How do we synchronize time? With difficulty, and the TrueTime API still only returns an interval of uncertainty containing the true time.
When we commit a transaction, we first set its timestamp to TT.now().latest (the end of the interval) and then wait out the uncertainty before releasing locks. The rule for read-only transactions is that their timestamp is also set to TT.now().latest; we must then wait until the replica's Paxos state machine has advanced past that timestamp to be sure no writes with earlier timestamps are still missing.
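A sketch of that read-only path, with the replica's safe time faked so the example terminates on its own; all names here are illustrative, not Spanner's:

```go
package main

import (
	"fmt"
	"time"
)

// replica abstracts what a lock-free read needs: how far the applied state
// has advanced (its safe time) and a versioned read at a timestamp.
type replica struct {
	safe time.Time
	data map[string]string // latest values; real Spanner keeps versions
}

// snapshotRead picks s = TT.now().latest as the read timestamp (here faked
// as local time plus the error bound eps), then waits until the replica's
// safe time has passed s, so no write with timestamp <= s can still be
// missing, before reading without locks.
func snapshotRead(r *replica, key string, eps time.Duration) string {
	s := time.Now().Add(eps) // stand-in for TT.now().latest
	for r.safe.Before(s) {
		time.Sleep(time.Millisecond) // let the Paxos state machine catch up
		r.safe = time.Now()          // fake: pretend applies keep arriving
	}
	return r.data[key]
}

func main() {
	r := &replica{safe: time.Now(), data: map[string]string{"x": "a"}}
	fmt.Println(snapshotRead(r, "x", 5*time.Millisecond))
}
```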
Question
Suppose a Spanner server's TT.now() returns correct information, but the uncertainty is large. For example, suppose the absolute time is 10:15:30, and TT.now() returns the interval [10:15:20, 10:15:40]. That interval is correct in that it contains the absolute time, but the error bound is 10 seconds. See Section 3 for an explanation of TT.now(). What bad effect will a large error bound have on Spanner's operation? Give a specific example.
A large TrueTime error bound forces Spanner to delay commits significantly. When a transaction is ready to commit at absolute time 10:15:30 and TT.now() returns [10:15:20, 10:15:40], Spanner assigns the commit timestamp 10:15:40 and must then perform the commit wait until that timestamp is definitely in the past, roughly 20 seconds with a 10-second error bound, before releasing locks. During that wait, conflicting transactions are blocked, so write latency and overall throughput suffer, especially in write-heavy workloads. The large uncertainty interval forces these delays in order to maintain external consistency across the distributed database.