MapReduce: Simplified Data Processing on Large Clusters
MapReduce, introduced by Google, lets programmers run functional-style map and reduce computations in parallel across large clusters without writing any parallelization code themselves.
The paper introduces a simple interface for parallelizing map-reduce operations on clusters of commodity PCs and describes an implementation built on GFS. Re-execution of tasks is the primary mechanism for fault tolerance.
The Map function takes an input key/value pair and produces a set of intermediate key/value pairs. The Reduce function merges all intermediate values associated with the same intermediate key into a (typically smaller) set of values.
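A minimal Python sketch of the two user-supplied functions for the canonical word-count example; the emit() callback here stands in for the library's output interface (the real implementation is a C++ API):

```python
def map_fn(key, value, emit):
    """key: document name; value: document contents."""
    for word in value.split():
        emit(word, 1)  # one intermediate pair per word occurrence

def reduce_fn(key, values, emit):
    """key: a word; values: all intermediate counts for that word."""
    emit(key, sum(values))  # merge the counts into a single total
```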
In the Google implementation, thousands of commodity Linux machines (2-4 GB of memory each) are connected with switched Ethernet, with storage provided by Google's distributed file system (GFS) running on inexpensive disks attached directly to the machines. Users submit jobs to a scheduling system.
Steps in the procedure,
- The library splits the input files into M pieces (16-64 MB each) and starts up many copies of the program on the cluster.
- One copy is the master, which assigns map and reduce tasks to workers. The master stores the state of each task (idle/in-progress/completed, plus the assigned worker) and the locations and sizes of intermediate file regions (which are forwarded to reduce tasks).
- A worker assigned a map task reads the corresponding input split and buffers the intermediate pairs it produces in memory.
- Periodically, the buffered pairs are flushed to local disk, partitioned into R regions; the locations of these regions are passed back to the master.
- The master forwards the intermediate pair locations to the reduce workers, which use remote procedure calls to read the data from the map workers' local disks.
- Reduce workers sort the intermediate data by key and apply the reduce function to each key and its grouped values; the output is appended to a final output file for that reduce partition.
- When all tasks are done, the master wakes up the user program and hands back the output files (which can be combined, if necessary, with another map-reduce call). A toy single-process sketch of this whole pipeline follows this list.
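The sketch below simulates the map → partition → shuffle → sort → reduce pipeline in one process, reusing the word-count map_fn/reduce_fn from the earlier sketch. It is only an illustration of the steps above, not the paper's implementation (no parallelism, no master, no disk); Python's built-in hash() stands in for the default partition function:

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn, R=3):
    # "Map phase": buffer intermediate pairs, partitioned into R regions
    # (in the real system each map task writes its regions to local disk).
    regions = [defaultdict(list) for _ in range(R)]
    emit_intermediate = lambda k, v: regions[hash(k) % R][k].append(v)
    for key, value in inputs:
        map_fn(key, value, emit_intermediate)

    # "Reduce phase": each of the R reduce tasks sorts its region by key
    # and applies the reduce function to every key and its grouped values.
    outputs = []
    for r in range(R):
        for k in sorted(regions[r]):
            reduce_fn(k, iter(regions[r][k]),
                      lambda k2, v2: outputs.append((k2, v2)))
    return outputs

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog and the fox")]
print(run_mapreduce(docs, map_fn, reduce_fn))
```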
Faults
A cluster with thousands of machines must have a good way of handling failures. The master pings workers periodically and marks a worker as failed if it receives no response. Map tasks on a failed worker are reassigned whether completed or not (their output lives on that machine's local disk), while reduce tasks are restarted only if they were in-progress (completed reduce output is already in the global file system).
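A hedged sketch of that reassignment rule, assuming task records shaped like the master state described earlier (the Task type and its field names are invented for illustration):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str                    # "idle" | "in-progress" | "completed"
    worker: Optional[str] = None  # id of the assigned worker, if any

def handle_worker_failure(tasks: list[Task], failed_worker: str) -> None:
    for task in (t for t in tasks if t.worker == failed_worker):
        if task.kind == "map":
            # Completed map output sits on the failed machine's local
            # disk, so even completed map tasks must be redone.
            task.state, task.worker = "idle", None
        elif task.state == "in-progress":
            # Completed reduce output is already in the global file
            # system, so only in-progress reduce tasks are reset.
            task.state, task.worker = "idle", None
```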
Master failure could be handled by writing periodic checkpoints of the master's data structures. In practice, failure of the single master is unlikely, so the implementation in the paper simply aborts the computation instead.
When the supplied functions are deterministic (pure functions), faults have no effect on the output: the result is the same as that of a sequential, fault-free execution. When the functions are non-deterministic, the output of each reduce task is still equivalent to some sequential execution, but different reduce tasks may reflect different executions. Atomic file system operations (in particular, atomic renames of task output files) are relied on to commit inputs and outputs safely.
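The write-to-a-private-temp-file-then-rename pattern behind those atomic commits can be sketched as follows (the file naming is invented; the paper relies on the underlying file system's atomic rename guarantee):

```python
import os

def commit_reduce_output(partition: int, data: str) -> None:
    # Each (possibly duplicate) execution writes to its own temp file...
    tmp = f"out-{partition}.tmp-{os.getpid()}"
    with open(tmp, "w") as f:
        f.write(data)
    # ...then atomically renames it to the final name. If two duplicate
    # executions of the same reduce task both commit, the final file
    # contains the complete output of exactly one of them.
    os.replace(tmp, f"out-{partition}")
```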
Performance Considerations
Google’s implementation runs on GFS and therefore tries to schedule map tasks on machines holding a replica of the input data (or on a machine nearby, e.g. on the same network switch), conserving network bandwidth.
Finer task granularity means better load balancing and faster recovery, but there are practical limits: the master makes O(M + R) scheduling decisions and keeps O(M × R) state, and users typically do not want an excessive number of output files, which keeps R small. The paper cites typical values of M = 200,000 and R = 5,000 on 2,000 worker machines.
MapReduce schedules backup executions of the remaining in-progress tasks once the overall operation is close to completion, to deal with stragglers. This increases CPU usage slightly but can decrease latency dramatically.
Extra features,
- Users can supply a custom partition function, either for performance or for convenience (e.g., grouping related keys into the same output file)
- Map tasks may use a combiner function to perform a partial reduce locally and avoid sending redundant data over the network. Certain operations (like counting words) can be much faster this way; see the sketch after this list
- MapReduce supports different input (reader) formats, each of which knows how to split its input into ranges for map tasks
- MapReduce provides a mechanism for detecting and skipping records that deterministically cause crashes
- It includes a local sequential execution mode for debugging and testing
- MapReduce lets users maintain counters across map and reduce tasks; their values are aggregated at the master and returned to the user program
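For word count, the combiner mentioned above is essentially the reducer run locally inside each map task before any data crosses the network. A minimal, word-count-specific sketch (the function name is hypothetical):

```python
from collections import defaultdict

def combine(intermediate_pairs):
    # Collapse thousands of local ("the", 1) pairs into one ("the", n)
    # before the shuffle; for word count this is the same logic as the
    # reduce function, applied to one map task's output only.
    counts = defaultdict(int)
    for word, n in intermediate_pairs:
        counts[word] += n
    return list(counts.items())
```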
The rate of processing generally ramps up quickly as workers are assigned tasks, hits a peak, and then falls off as the job finishes. Some overhead occurs at the start as the program is propagated to the worker machines and they interact with GFS to open the input files.
Question
MapReduce improves performance by exploiting parallelism, but the benefits of parallelism depend on good load balancing. How does MapReduce achieve load balancing?
MapReduce designates a master responsible for assigning map and reduce tasks to all the workers. A user will typically set the number of map tasks significantly higher than the number of workers, while the number of reduce tasks (each of which produces a single output file) will be a small multiple of the number of workers. This way, for most of the execution, MapReduce achieves good load balancing simply by assigning any incomplete task to any idle worker. To split the intermediate keys among the reduce tasks, MapReduce uses a partition function, which is user-definable and typically a simple hash (sketched below).
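A sketch of the default partition function, plus a user-supplied variant of the kind the paper mentions (hashing only the hostname of a URL key so that all pages from one host land in the same output file):

```python
from urllib.parse import urlparse

def default_partition(key: str, R: int) -> int:
    return hash(key) % R  # spread keys roughly uniformly over R reduce tasks

def by_host_partition(url_key: str, R: int) -> int:
    return hash(urlparse(url_key).hostname) % R  # group URLs by host
```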
To handle stragglers, MapReduce uses backup tasks: once a certain threshold of completion is reached, the master schedules additional executions of the remaining in-progress tasks on other machines. This costs little extra computation while greatly reducing total waiting time. MapReduce handles the redundant results easily, since its design already gracefully absorbs re-executed tasks after machine failures.
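A hedged sketch of that backup-task policy, reusing the Task records from the fault-handling sketch above (the 95% threshold and the assign call are invented; the paper only says backups are scheduled when the operation is close to completion):

```python
def maybe_schedule_backups(tasks, idle_workers, threshold=0.95):
    done = sum(t.state == "completed" for t in tasks)
    if done < threshold * len(tasks):
        return  # not yet close enough to completion
    stragglers = [t for t in tasks if t.state == "in-progress"]
    for task, worker in zip(stragglers, idle_workers):
        # Duplicate execution of the same task; whichever copy finishes
        # first wins, and the master ignores later completions (the same
        # machinery that absorbs re-executed tasks after failures).
        worker.assign(task)
```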
The main constraint of MapReduce is that the user-supplied functions should be pure (deterministic and free of side-effects); otherwise the output guarantees weaken.
One bottleneck is that the input to any given reducer is potentially distributed across all the mappers, so the shuffle phase can consume significant network bandwidth.