[summary] MapReduce: Simplified Data Processing on Large Clusters
A summary of "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat, Google, Inc.
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication.
This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
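The programming model above can be sketched with word count, the canonical example from the paper. This is a minimal, sequential sketch: the `run_mapreduce` driver below stands in for the distributed runtime, and all function names are illustrative.

```python
# Word count in the MapReduce programming model (sequential sketch).
from collections import defaultdict

def map_fn(doc_name, contents):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in contents.split():
        yield (word, 1)

def reduce_fn(key, values):
    """Reduce: merge all values associated with the same key."""
    return sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy driver standing in for the distributed runtime."""
    intermediate = defaultdict(list)
    for name, contents in inputs:               # "map phase"
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)
    return {k: reduce_fn(k, vs)                 # "reduce phase"
            for k, vs in sorted(intermediate.items())}

result = run_mapreduce([("doc1", "to be or not to be")], map_fn, reduce_fn)
# result == {"be": 2, "not": 1, "or": 1, "to": 2}
```

In the real system the two phases run on different machines; this sketch only shows the contract the user's two functions must satisfy.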
Fault tolerance: use re-execution as the primary mechanism for fault tolerance.
1. The MapReduce library in the user program first splits the input files into M pieces, typically 16 to 64 megabytes (MB) per piece.
2. One of the copies of the program is special – the master. The rest are workers that are assigned work by the master. The master picks idle workers and assigns each one a map task or a reduce task. (Load Balancing)
3. The intermediate key/value pairs produced by the Map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.
6. The output of the Reduce function is appended to a final output file for this reduce partition.
After successful completion, the output of the map-reduce execution is available in the R output files. Typically, users do not need to combine these R output files into one file – they often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.
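The six steps above can be compressed into a sequential sketch. This is a hedged illustration, not the paper's implementation: map output is partitioned into R regions (steps 3–4), and each of the R reduce tasks sorts its region by key and writes one output "file" (steps 5–6).

```python
# Sequential sketch of the MapReduce execution flow.
from collections import defaultdict

def execute(splits, map_fn, reduce_fn, R=2):
    # Steps 3-4: each map task's pairs are partitioned into R regions.
    regions = [defaultdict(list) for _ in range(R)]
    for split in splits:                       # one map task per split
        for k, v in map_fn(split):
            regions[hash(k) % R][k].append(v)  # partition intermediate pairs
    # Steps 5-6: each reduce task sorts its region's keys and reduces.
    outputs = []
    for r in range(R):                         # one reduce task per region
        outputs.append({k: reduce_fn(k, regions[r][k])
                        for k in sorted(regions[r])})
    return outputs                             # the R output "files"
```

The return value models the R output files mentioned above: each partition is left separate rather than merged.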
Master Data Structures
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed) and the identity of the worker machine (for non-idle tasks).
For each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task.
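A minimal sketch of this bookkeeping; the field and class names are illustrative, not from the paper's implementation.

```python
# Sketch of the master's per-task data structures.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TaskState:
    state: str = "idle"            # idle, in-progress, or completed
    worker: Optional[str] = None   # worker machine (non-idle tasks only)

@dataclass
class MapTaskState(TaskState):
    # For a completed map task: (location, size) of each of the R
    # intermediate file regions it produced.
    regions: list = field(default_factory=list)
```

The region list is what the master forwards to reduce workers in step 4 of the execution overview.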
Fault Tolerance
Worker Failure
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.
Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.
MapReduce is resilient to large-scale worker failures.
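The failure-handling rule above can be sketched directly: on a failed worker, in-progress tasks of either kind reset to idle, and completed map tasks also reset (their output lived on the failed machine's local disk), while completed reduce tasks keep their globally stored output. This is an illustrative sketch, not the paper's code.

```python
# Sketch of the master's reaction to a failed worker.
def handle_worker_failure(tasks, failed_worker):
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["state"] == "in-progress":
            task["state"], task["worker"] = "idle", None
        elif task["state"] == "completed" and task["kind"] == "map":
            # Map output is on the dead machine's local disk: redo it.
            task["state"], task["worker"] = "idle", None
        # Completed reduce tasks keep their output in the global FS.

tasks = [
    {"kind": "map",    "state": "completed",   "worker": "w1"},
    {"kind": "reduce", "state": "completed",   "worker": "w1"},
    {"kind": "reduce", "state": "in-progress", "worker": "w1"},
]
handle_worker_failure(tasks, "w1")
# Only the completed reduce task survives unchanged.
```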
Master Failure
It is easy to make the master write periodic checkpoints of the master data structures. If the master task dies, a new copy can be started from the last checkpointed state.
Locality
Network bandwidth is a relatively scarce resource in our computing environment. We conserve network bandwidth by taking advantage of the fact that the input data is stored on the local disks of the machines that make up our cluster.
The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task’s input data.
When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.
Backup Tasks
MapReduce has a general mechanism to alleviate the problem of stragglers. When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. The task is marked as completed whenever either the primary or the backup execution completes.
Partitioning Function
A default partitioning function is provided that uses hashing (e.g. “hash(key) mod R”). This tends to result in fairly well-balanced partitions. Users of the MapReduce library can provide a special partitioning function. For example, using “hash(Hostname(urlkey)) mod R” as the partitioning function causes all URLs from the same host to end up in the same output file.
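Both partitioning functions can be sketched in a few lines. Here md5 stands in for "hash" (an illustrative choice of stable hash, not specified in the paper):

```python
# Sketch of the default and the host-based partitioning functions.
import hashlib
from urllib.parse import urlparse

def stable_hash(s):
    """A deterministic hash for strings (md5, illustrative choice)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def default_partition(key, R):
    return stable_hash(key) % R                       # hash(key) mod R

def host_partition(url, R):
    return stable_hash(urlparse(url).hostname) % R    # hash(Hostname(urlkey)) mod R

# URLs on the same host land in the same reduce partition / output file.
same = (host_partition("http://example.com/a", 4)
        == host_partition("http://example.com/b", 4))
```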
Combiner
Partial merging. The only difference between a reduce function and a combiner function is how the MapReduce library handles the output of the function. The output of a reduce function is written to the final output file. The output of a combiner function is written to an intermediate file that will be sent to a reduce task. Partial combining significantly speeds up certain classes of MapReduce operations.
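For word count, the combiner is the same logic as the reduce function, applied to a single map task's output before it crosses the network. A minimal sketch (names illustrative):

```python
# Sketch of partial combining on one map worker's output.
from collections import defaultdict

def combiner(pairs):
    """Collapse repeated (word, 1) pairs into one (word, n) pair each."""
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

map_output = [("the", 1), ("quick", 1), ("the", 1), ("the", 1)]
combined = combiner(map_output)
# Four pairs shrink to two before being written for the reduce tasks.
```

This is why word count benefits so much: word frequencies follow a Zipf-like distribution, so each map task emits many pairs for the same common word.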
Machine Failures
An experiment that intentionally killed 200 out of 1746 workers shows that the underlying cluster scheduler immediately restarted new worker processes on these machines (only the processes were killed; the machines were still functioning properly). The entire computation finished with only a 5% increase over the normal execution time.
Lessons Learnt
1. Restricting the programming model makes it easy to parallelize and distribute computations and to make such computations fault-tolerant.
2. Network bandwidth is a scarce resource. A number of optimizations in our system are therefore targeted at reducing the amount of data sent across the network: the locality optimization allows us to read data from local disks.
3. Redundant execution can be used to reduce the impact of slow machines, and to handle machine failures and data loss.