Map-Reduce Shuffle and Sort

Shuffle and Sort: 

MapReduce guarantees that the input to every reducer is sorted by key.

The process by which the system performs the sort and transfers the map outputs to the reducers as input is called the shuffle.
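The guarantee above can be illustrated in miniature: each map output record is routed to a reducer by a partitioner (here, a hash of the key modulo the number of reducers, in the style of the default hash partitioner), and each reducer's input ends up sorted by key. A minimal Python sketch, not Hadoop code:

```python
# Toy shuffle: route each (key, value) pair to a partition by hashed key,
# then sort every partition by key -- the property MapReduce guarantees.

def shuffle(map_outputs, num_reducers):
    """Group map output pairs by partition, then sort each partition by key."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in map_outputs:
        partitions[hash(key) % num_reducers].append((key, value))
    return [sorted(p) for p in partitions]

pairs = [("b", 2), ("a", 1), ("c", 3), ("a", 4)]
for reducer_input in shuffle(pairs, 2):
    assert reducer_input == sorted(reducer_input)  # sorted-by-key guarantee
```

Because the partition is a pure function of the key, all values for a given key land in the same reducer, where they arrive in sorted order.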

Map Side:

  • When a map task starts producing output, it does not simply write it to disk. Instead, it takes advantage of an in-memory buffer and performs some pre-sorting for efficiency.
  • Every map task has a circular memory buffer associated with it. Its size is set by mapreduce.task.io.sort.mb (default 100 MB), and its spill threshold by mapreduce.map.sort.spill.percent (default 80%).
  • When the buffer reaches its threshold, a background thread starts spilling its contents to disk.
  • If the buffer fills up while the spill is in progress, the map task blocks until the spill completes. Spills are written in round-robin fashion to the directories specified by the mapreduce.cluster.local.dir property.
  • Before writing to disk, the thread divides the data into partitions (one per reducer); within each partition, the background thread performs an in-memory sort by key.
  • If a combiner function is defined, it runs on the output of the sort. The combiner makes the map output more compact, so there is less data to write to local disk and to transfer to the reducers.
  • Each time the memory buffer reaches its spill threshold, a new spill file is created, so a task can produce multiple spill files. Before the task finishes, the spill files are merged into a single partitioned and sorted output file.
  • If there are at least three spill files, the combiner is run again before the output file is written.
  • The map output can be compressed before it is written to disk. Doing so makes writing faster, saves disk space, and reduces the amount of data transferred to the reducers. Enable it with mapreduce.map.output.compress=true (default false) and choose the compression library with mapreduce.map.output.compress.codec.
  • The output files are made available to the reducers over HTTP.
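The map-side steps above can be sketched as a small simulation, assuming a record-count threshold in place of the byte-based io.sort.mb limit, a list standing in for the local disk, and a word-count-style combiner. This is an illustration of the mechanism, not Hadoop's actual implementation:

```python
import heapq
from collections import defaultdict

SPILL_THRESHOLD = 4   # stands in for io.sort.mb * spill.percent (records, not bytes)
NUM_PARTITIONS = 2    # one partition per reducer

def combiner(key, values):
    return sum(values)  # local aggregation, as in word count

def spill(buffer, spills):
    """Partition the buffered records, sort each partition by key,
    run the combiner, and append the result as one 'spill file'."""
    by_partition = defaultdict(lambda: defaultdict(list))
    for key, value in buffer:
        by_partition[hash(key) % NUM_PARTITIONS][key].append(value)
    spills.append({
        p: sorted((k, combiner(k, vs)) for k, vs in kvs.items())
        for p, kvs in by_partition.items()
    })

def run_map_side(pairs):
    buffer, spills = [], []
    for pair in pairs:
        buffer.append(pair)
        if len(buffer) >= SPILL_THRESHOLD:  # threshold reached: spill to "disk"
            spill(buffer, spills)
            buffer = []
    if buffer:
        spill(buffer, spills)               # final spill when the task finishes
    # merge all spill files into one sorted run per partition
    return {p: list(heapq.merge(*(s.get(p, []) for s in spills)))
            for p in range(NUM_PARTITIONS)}
```

Note that a key spilled in two different files still appears twice in the merged run (once per spill's combiner output); that is why the combiner must be an optimization only, and the reducer still performs the final aggregation.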

Reduce Side:

  • Map tasks may finish at different times, so reduce tasks start copying their outputs as soon as each map task completes. This is known as the copy phase of the reduce task.
  • A reduce task has a number of copier threads that fetch map outputs in parallel; the number is configurable via mapreduce.reduce.shuffle.parallelcopies (default 5).
  • Map outputs are copied into the reduce task JVM's memory if they are small enough; otherwise they are copied to disk. The in-memory buffer size is controlled by mapreduce.reduce.shuffle.input.buffer.percent (default 70% of the heap). A spill is triggered either by the buffer threshold mapreduce.reduce.shuffle.merge.percent (default 66%) or by the threshold number of map outputs, mapreduce.reduce.merge.inmem.threshold (default 1000).
  • When either threshold is reached, the buffer contents are merged and spilled to disk.
  • As the copies accumulate on disk, background threads merge them into large, sorted files.
  • If the map outputs were compressed by the map tasks, they have to be decompressed in memory before the merge operation.
  • When all the map outputs have been copied, the reduce task moves into the sort phase (more accurately a merge phase, since the sorting was carried out on the map side), which merges the map outputs while maintaining their sort order.
  • During the reduce phase, the reduce function is invoked once for each key in the sorted output.
  • The output of this phase is written directly to the output filesystem, typically HDFS.
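The reduce-side merge and reduce phases can be sketched the same way, assuming the copied map outputs arrive as already-sorted runs (as the map side guarantees). Merging the runs preserves the sort order, so the reduce function can be invoked once per key with a single pass. A Python illustration, not Hadoop code:

```python
import heapq
from itertools import groupby
from operator import itemgetter

# The copied map outputs are already sorted runs, so the reducer only
# needs to merge them (no full re-sort) and then call the reduce
# function once per key, in key order.

def reduce_side(sorted_runs, reduce_fn):
    merged = heapq.merge(*sorted_runs)   # merge phase: keeps the sort order
    return [(key, reduce_fn(key, [v for _, v in group]))
            for key, group in groupby(merged, key=itemgetter(0))]

runs = [[("a", 1), ("c", 2)], [("a", 3), ("b", 4)]]
result = reduce_side(runs, lambda k, vs: sum(vs))
# result: [("a", 4), ("b", 4), ("c", 2)]
```

Because each run is sorted, heapq.merge streams the records in key order without materializing a full sort, which mirrors why this phase is a merge rather than a sort.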
