GFS is built for analytics-style workloads on large numbers of inexpensive commodity machines,
- Component failures are common and must be tolerated
- Stores a modest number of large files
- Workloads consist mostly of large sequential reads and small random reads, with writes that are mostly large sequential appends
- The system needs well-defined semantics when multiple clients append to a file concurrently
- Bandwidth is more important than latency
GFS uses a file system-like interface, with create, delete, open, close, read, and write, along with snapshot (low cost copies) and record append (atomic append)
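A minimal sketch of what that client-facing interface could look like; the class and method signatures below (GFSClient, FileHandle, the return types) are illustrative, not the actual GFS client library:

```python
from abc import ABC, abstractmethod


class FileHandle:
    """Opaque handle returned by open(); just wraps a path in this sketch."""
    def __init__(self, path: str):
        self.path = path


class GFSClient(ABC):
    @abstractmethod
    def create(self, path: str) -> None: ...

    @abstractmethod
    def delete(self, path: str) -> None: ...

    @abstractmethod
    def open(self, path: str) -> FileHandle: ...

    @abstractmethod
    def close(self, handle: FileHandle) -> None: ...

    @abstractmethod
    def read(self, handle: FileHandle, offset: int, length: int) -> bytes: ...

    @abstractmethod
    def write(self, handle: FileHandle, offset: int, data: bytes) -> None: ...

    @abstractmethod
    def snapshot(self, source: str, destination: str) -> None:
        """Low-cost copy of a file or directory tree."""

    @abstractmethod
    def record_append(self, handle: FileHandle, data: bytes) -> int:
        """Atomically append data at an offset GFS chooses; return that offset."""
```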
GFS architecture is pretty straightforward,
- Files are divided into fixed-size chunks, identifiable with immutable and unique chunk handles
- Chunkservers store chunks (replicated) on local disk as Linux files
- The master contains all file system metadata, managing access control information, mapping files to chunks (with locations), and also handling heartbeat messages, lease management, garbage collection, and chunk migration
- Clients interact with the master for metadata but exchange data directly with chunkservers
GFS does not cache file data: clients cache only metadata (chunk handles and locations), and chunkservers rely on the Linux buffer cache rather than any GFS-level cache
Steps for a simple read (a code sketch follows the list),
- The client translates the file name and byte offset into a chunk index
- The client asks the master for a chunk handle (or multiple) and replica locations
- The client makes a request with the handle and byte range to one of the replicas, likely the closest
- The client caches the chunk information until it expires or the file is closed and reopened
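A sketch of that client-side read path, assuming 64 MB chunks, an in-process master stub with a `lookup` method, chunkserver stubs with a `read` method (both names are mine), and a requested range that fits inside a single chunk:

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks


class ReadClient:
    def __init__(self, master, chunkservers):
        self.master = master              # stub: lookup(file, chunk_index) -> (handle, locations)
        self.chunkservers = chunkservers  # map: location name -> chunkserver stub with read()
        self.cache = {}                   # (file, chunk_index) -> (handle, locations), until expiry/reopen

    def read(self, filename, offset, length):
        # 1. Translate (filename, byte offset) into a chunk index + offset within the chunk.
        chunk_index = offset // CHUNK_SIZE
        chunk_offset = offset % CHUNK_SIZE

        # 2. Ask the master for the chunk handle and replica locations, unless cached.
        key = (filename, chunk_index)
        if key not in self.cache:
            self.cache[key] = self.master.lookup(filename, chunk_index)
        handle, locations = self.cache[key]

        # 3. Send the handle and byte range to one replica; a real client would
        #    prefer the topologically closest one, here we just pick at random.
        replica = self.chunkservers[random.choice(locations)]
        return replica.read(handle, chunk_offset, length)
```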
The master keeps all metadata in memory, which includes the file and chunk namespaces, a map from files to chunks, and the locations of each chunk’s replicas
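One way to picture that in-memory state, as a sketch; the class and field names (ChunkInfo, FileInfo, MasterState) are mine, not the paper's:

```python
from dataclasses import dataclass, field


@dataclass
class ChunkInfo:
    handle: int                  # immutable, globally unique chunk handle
    version: int = 1             # bumped on each new lease; used to detect stale replicas
    locations: list = field(default_factory=list)  # chunkservers holding a replica;
                                 # not persisted, rebuilt by polling chunkservers at startup


@dataclass
class FileInfo:
    chunks: list = field(default_factory=list)  # chunk handles, in file order


@dataclass
class MasterState:
    namespace: dict = field(default_factory=dict)    # full pathname -> FileInfo
    chunk_table: dict = field(default_factory=dict)  # chunk handle -> ChunkInfo

    def lookup(self, path, chunk_index):
        """Return (handle, replica locations) for one chunk of a file."""
        handle = self.namespace[path].chunks[chunk_index]
        return handle, list(self.chunk_table[handle].locations)
```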
GFS uses a lease mechanism to keep the order of mutations consistent across replicas. The master must designate one replica as the primary before any mutations can take place. The primary picks a serial order for incoming mutations, which all replicas follow.
Leases time out after 60 seconds, though the primary can request extensions (piggybacked on heartbeat messages). If the primary becomes unresponsive without the lease being revoked, the master must wait until the lease expires on its own before granting a new one.
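A sketch of the master-side lease bookkeeping under those rules; the class and method names are mine, and extension requests are modeled as a plain method call rather than heartbeat piggybacking:

```python
import time

LEASE_DURATION = 60.0  # seconds


class LeaseManager:
    """Tracks, per chunk handle, which replica currently holds the primary lease."""

    def __init__(self):
        self.leases = {}  # chunk handle -> (primary location, expiry time)

    def get_or_grant(self, handle, replica_locations):
        primary, expiry = self.leases.get(handle, (None, 0.0))
        if primary is not None and time.monotonic() < expiry:
            return primary  # an unexpired lease already exists
        # No valid lease: choose a replica as primary and grant a fresh 60 s lease.
        # If the previous primary merely went quiet, we only reach this branch
        # after its lease has expired on its own.
        primary = replica_locations[0]
        self.leases[handle] = (primary, time.monotonic() + LEASE_DURATION)
        return primary

    def extend(self, handle, primary):
        """The current primary may ask for extensions (sent with heartbeats in GFS)."""
        holder, _ = self.leases.get(handle, (None, 0.0))
        if holder == primary:
            self.leases[handle] = (primary, time.monotonic() + LEASE_DURATION)
```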
Steps for a simple write (a code sketch follows the list),
- The client asks the master for the current primary
- The master replies with the identity of the existing primary or creates a new lease if necessary
- The client pushes the data to all replicas
- Once the data is acknowledged by all replicas, the client sends a write request to the primary, which assigns it a serial number, applies it locally, and forwards the request to the other replicas
- Once all replicas have committed, the primary replies to the client
- If the operation fails, the client retries
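A sketch of that control flow from the client's side; the master and chunkserver objects are stand-ins, and method names like `get_primary_and_secondaries`, `push_data`, and `write` are assumptions, not real APIs:

```python
import uuid


def write(master, chunkservers, handle, offset, data, max_retries=3):
    """Write data at a fixed offset within one chunk, following the steps above."""
    for _ in range(max_retries):
        # Steps 1-2: ask the master who holds the lease; it grants one if needed.
        primary_loc, secondary_locs = master.get_primary_and_secondaries(handle)

        # Step 3: push the data to every replica's buffer, identified by a data id.
        data_id = uuid.uuid4().hex
        pushed = all(chunkservers[loc].push_data(data_id, data)
                     for loc in [primary_loc] + secondary_locs)
        if not pushed:
            continue  # a push failed; retry from the top

        # Steps 4-5: send the write request to the primary, which assigns it a
        # serial number, applies it locally, and forwards the request to the
        # secondaries; it reports success only once they have all applied it.
        if chunkservers[primary_loc].write(handle, offset, data_id, secondary_locs):
            return True

        # Step 6: some replica failed; loop and retry.
    return False
```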
GFS client code breaks down large writes into multiple operations, which may be interleaved with concurrent operations (consistent yet undefined)
To distribute network usage better, data is pushed linearly through a chain of chunkservers, rather than all from the client
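A sketch of that linear push with in-process stubs; the chain is assumed to be pre-ordered so each hop goes to the nearest server that does not yet have the data, and real GFS pipelines the forwarding rather than doing it store-and-forward:

```python
class ChunkserverStub:
    def __init__(self, name):
        self.name = name
        self.buffer = {}  # data_id -> bytes, held until the primary's write request arrives


def push_along_chain(chain, data_id, data):
    """Client sends to the first server only; each server forwards to the next,
    so every machine's outbound link is used exactly once."""
    if not chain:
        return
    head, rest = chain[0], chain[1:]
    head.buffer[data_id] = data
    push_along_chain(rest, data_id, data)


# Usage: the client pushes once, and the data ends up buffered on every replica.
servers = [ChunkserverStub(n) for n in ("s1", "s2", "s3")]
push_along_chain(servers, "blob-1", b"some record")
assert all("blob-1" in s.buffer for s in servers)
```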
GFS’s master supports many concurrent operations with granular locking. The namespace has no real per-directory data structures; it is a lookup table of full pathnames (stored with prefix compression), each with an associated read-write lock.
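A sketch of per-path locking in that spirit: take locks on every ancestor prefix, then on the full path being mutated. The lock ordering follows the paper's description, but the class below is my own simplification (plain mutexes instead of reader-writer locks):

```python
import threading
from collections import defaultdict


class NamespaceLocks:
    """Locks keyed by full pathname; there is no per-directory data structure."""

    def __init__(self):
        # One lock per path. GFS uses reader-writer locks (read on ancestors,
        # write on the leaf); plain mutexes keep this sketch short.
        self.locks = defaultdict(threading.Lock)

    def _prefixes(self, path):
        parts = path.strip("/").split("/")
        return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]

    def lock_for_mutation(self, path):
        """Lock every ancestor prefix, then the full path being mutated.

        Acquiring in sorted (lexicographic) order keeps the order consistent
        across operations, which avoids deadlock."""
        acquired = sorted(self._prefixes(path)) + [path]
        for p in acquired:
            self.locks[p].acquire()
        return acquired

    def unlock(self, acquired):
        for p in reversed(acquired):
            self.locks[p].release()
```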
Consistency and Fault Tolerance
Metadata is stored persistently via an operation log, except for the locations of replicas, which the master instead asks each chunkserver for at startup and whenever a chunkserver joins. This log is replicated on multiple remote machines.
Keeping replica locations persistent would be difficult, because chunkservers can join, leave, fail, restart, change names, and so on. Ultimately, each chunkserver has the final say on which chunks it stores.
Namespace mutations are atomic and handled by the master with locks. The operations log defines a global order on these operations.
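A sketch of logging a metadata mutation before applying it, with the remote log replicas stubbed out as extra local files and an illustrative JSON record format:

```python
import json
import os


class OperationLog:
    """Append-only log of metadata mutations, flushed locally and to replicas
    before a mutation is applied; the sequence numbers fix a global order."""

    def __init__(self, local_path, replica_paths):
        self.logs = [open(p, "ab") for p in [local_path, *replica_paths]]
        self.next_seq = 0

    def commit(self, op):
        """Assign the next sequence number and durably record the op everywhere;
        only then is the mutation considered committed and applied in memory."""
        seq = self.next_seq
        self.next_seq += 1
        record = (json.dumps({"seq": seq, **op}) + "\n").encode()
        for log in self.logs:
            log.write(record)
            log.flush()
            os.fsync(log.fileno())
        return seq


# Usage (illustrative): log a creation before touching the in-memory namespace.
oplog = OperationLog("/tmp/gfs-oplog", ["/tmp/gfs-oplog-replica"])
oplog.commit({"type": "create", "path": "/logs/web-00001"})
```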
Region consistency,
- A file region is consistent if all clients will always see the same data, regardless of which replica they read from
- Defined if it is consistent and clients will see what the mutation writes in its entirety (without interference from concurrent writers)
- Concurrent successful mutations leave a region undefined but consistent
- A failed mutation makes a region inconsistent
A record append atomically appends data at least once at an offset of GFS’s choosing; the region where the record lands is defined, though it may be interspersed with inconsistent regions (padding or duplicates from retries)
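A sketch of the primary's decision during a record append, assuming 64 MB chunks and the paper's rule that a record which would overflow the current chunk causes padding plus a retry on the next chunk:

```python
CHUNK_SIZE = 64 * 1024 * 1024


class ChunkTail:
    """Just enough replica state for the primary to pick an append offset."""
    def __init__(self):
        self.length = 0  # bytes currently used in the chunk


def record_append(primary_chunk, record):
    """Return ("appended", offset) or ("retry_next_chunk", None).

    On success the chosen offset is forwarded to the secondaries so every
    replica writes the record at the same place; on failure the client retries,
    which is why a record can appear more than once (at-least-once semantics)."""
    if primary_chunk.length + len(record) > CHUNK_SIZE:
        # Pad out this chunk (on all replicas) and make the client retry against
        # a fresh chunk; the padded region is not defined application data.
        primary_chunk.length = CHUNK_SIZE
        return "retry_next_chunk", None
    offset = primary_chunk.length
    primary_chunk.length += len(record)
    return "appended", offset
```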
Since clients cache chunk locations, they may read from a stale replica for a window of time, bounded by the cache entry’s timeout and the next open of the file. This is a form of eventual consistency.
GFS identifies failed chunkservers through handshakes and detects data corruption by checksumming
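A sketch of block-level checksumming on a chunkserver, using the paper's 64 KB block granularity but `zlib.crc32` as a stand-in for the actual checksum:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # each chunk is checksummed in 64 KB blocks


class ChecksummedChunk:
    def __init__(self):
        self.data = bytearray()
        self.checksums = []  # one CRC per 64 KB block

    def append(self, payload: bytes):
        self.data.extend(payload)
        # Recompute the checksum of the last (possibly partial) block plus any new blocks.
        first = len(self.checksums) - 1 if self.checksums else 0
        n_blocks = (len(self.data) + BLOCK_SIZE - 1) // BLOCK_SIZE
        self.checksums = self.checksums[:first]
        for i in range(first, n_blocks):
            block = self.data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
            self.checksums.append(zlib.crc32(block))

    def read(self, offset: int, length: int) -> bytes:
        # Verify every block overlapping the requested range before returning data,
        # so corruption is never propagated to clients or other chunkservers.
        for i in range(offset // BLOCK_SIZE, (offset + length - 1) // BLOCK_SIZE + 1):
            block = self.data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
            if zlib.crc32(block) != self.checksums[i]:
                raise IOError(f"checksum mismatch in block {i}; report to master")
        return bytes(self.data[offset:offset + length])
```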
The master keeps track of version numbers for each chunk which it uses to detect stale replicas
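A small sketch of that staleness check, assuming the master bumps a chunk's version whenever it grants a new lease and compares it against the version each chunkserver reports when it restarts or heartbeats:

```python
def grant_lease(master_versions, live_replica_versions, handle):
    """Bump the master's version and push it to all currently reachable replicas."""
    master_versions[handle] += 1
    for replica in live_replica_versions:  # replicas that are down miss this update...
        replica[handle] = master_versions[handle]
    return master_versions[handle]


def is_stale(master_versions, reported_version, handle):
    """...so when they come back and report their version, the master spots them."""
    return reported_version < master_versions[handle]


# Usage: a replica that was down during a lease grant reports an old version.
master = {42: 3}
live = [{42: 3}, {42: 3}]  # two reachable replicas; a third is down at version 3
grant_lease(master, live, 42)
assert is_stale(master, reported_version=3, handle=42)       # returning replica is stale
assert not is_stale(master, reported_version=4, handle=42)   # up-to-date replica is fine
```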
Master state is replicated for reliability on multiple machines, and mutations are only considered committed after they’ve been flushed to disk on all master replicas. Monitoring outside of GFS starts a new master process with the replicated logs if it detects a failed master.
Shadow masters provide read-only access to the file system even when the primary master is down
Performance Considerations
Chunk sizes are set to 64 MB, much larger than typical file system block sizes. This dramatically reduces client interaction with the master, lets a client keep persistent TCP connections to chunkservers for many operations on the same chunk, and shrinks the metadata the master must keep in memory.
A potential issue with large chunk sizes is that hot spots can develop more easily, since a file's data is spread across fewer chunkservers (a small file may be a single chunk on a handful of replicas). In practice this is rarely an issue, and it can be mitigated with a higher replication factor when needed.
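Some quick arithmetic on why 64 MB chunks keep master metadata manageable, using the paper's bound of less than 64 bytes of metadata per chunk (the file sizes below are made-up examples):

```python
CHUNK_SIZE = 64 * 1024 * 1024
METADATA_PER_CHUNK = 64  # bytes; upper bound quoted in the paper


def master_metadata_bytes(total_file_bytes):
    chunks = -(-total_file_bytes // CHUNK_SIZE)  # ceiling division
    return chunks * METADATA_PER_CHUNK


# A 1 TB file is only ~15K chunks -> roughly 1 MB of chunk metadata on the master,
# so even petabytes of data fit comfortably in the master's memory.
print(master_metadata_bytes(10**12))  # ~1 MB
print(master_metadata_bytes(10**15))  # ~1 GB
```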
Chunk replica placement is done to maximize data reliability and availability, while also maximizing network bandwidth utilization. Chunk replicas should be spread across racks so that chunks survive outages.
For simplicity, GFS uses garbage collection to delete files and reclaim unreferenced chunks. This amortizes the cost of deletion into a regular background activity and simplifies many edge cases that would otherwise be troublesome. The delay it introduces acts as a safety net against accidental deletion, but it also makes it harder for users to fine-tune their storage usage when space is tight; GFS lets users expedite reclamation when they explicitly ask for it.
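A sketch of lazy deletion along those lines: deleting a file just renames it to a hidden name stamped with the deletion time, and a background scan reclaims anything past a grace period. The three-day window comes from the paper; the naming scheme and method names are illustrative:

```python
import time

GRACE_PERIOD = 3 * 24 * 3600  # seconds; hidden files linger this long before real removal


class LazyNamespace:
    def __init__(self):
        self.files = {}   # path -> list of chunk handles
        self.hidden = {}  # hidden name -> (deletion timestamp, chunk handles)

    def delete(self, path):
        """'Delete' is just a logged rename to a hidden name; still recoverable."""
        chunks = self.files.pop(path)
        self.hidden[f".deleted{path}.{int(time.time())}"] = (time.time(), chunks)

    def gc_scan(self, now=None, expedite=()):
        """Background pass: drop hidden files past the grace period (or explicitly
        expedited ones) and return their chunk handles as unreferenced, so the
        master can tell chunkservers to erase those replicas via heartbeats."""
        now = time.time() if now is None else now
        unreferenced = []
        for name in list(self.hidden):
            deleted_at, chunks = self.hidden[name]
            if now - deleted_at > GRACE_PERIOD or name in expedite:
                unreferenced.extend(chunks)
                del self.hidden[name]
        return unreferenced
```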
Question
Describe a sequence of events that would result in a client reading stale data from the Google File System.
Client A asks the master for the chunks containing file F. The master gives the locations of chunks c1, c2, and c3. Client A caches these locations. Then one of the chunkservers holding a replica of c3 crashes. Now client B modifies c3; the master designates another replica of c3 as the primary, assigning a new lease and incrementing the chunk version number on the replicas that are still up (the crashed chunkserver is still down). At this point in time, the crashed chunkserver restarts. Client A has been running some other operations and now wants to reread F. Its cache has not yet expired, so it reads c3 from the restarted chunkserver like before (say, because it is the closest replica). The master has not yet detected or garbage collected the stale replica of c3, so since client A’s cached metadata has not been updated, it reads stale data.