Map-Reduce Shuffle and Sort

Shuffle and Sort: 

MapReduce guarantees that the input to every reducer is sorted by key.

The process by which the system performs the sort and transfers the map outputs to the reducers as input is called the shuffle.
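The guarantee above can be illustrated in miniature: each map output record is routed to a reducer by a partitioner (here, a hash of the key modulo the number of reducers, in the style of the default hash partitioner), and each reducer's input ends up sorted by key. A minimal Python sketch, not Hadoop code:

```python
# Toy shuffle: route each (key, value) pair to a partition by hashed key,
# then sort every partition by key -- the property MapReduce guarantees.

def shuffle(map_outputs, num_reducers):
    """Group map output pairs by partition, then sort each partition by key."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in map_outputs:
        partitions[hash(key) % num_reducers].append((key, value))
    return [sorted(p) for p in partitions]

pairs = [("b", 2), ("a", 1), ("c", 3), ("a", 4)]
for reducer_input in shuffle(pairs, 2):
    assert reducer_input == sorted(reducer_input)  # sorted-by-key guarantee
```

Because the partition is a pure function of the key, all values for a given key land in the same reducer, where they arrive in sorted order.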

Map Side:

  • When a map task starts producing output, it does not simply write it to disk. Instead, it takes advantage of an in-memory buffer and performs some pre-sorting for efficiency.
  • Every map task has a circular memory buffer associated with it. Its size is set by mapreduce.task.io.sort.mb (default 100 MB), and its spill threshold by mapreduce.map.sort.spill.percent (default 80%).
  • When the buffer reaches its threshold, a background thread starts spilling its contents to disk.
  • If the buffer fills up while the spill is in progress, the map task blocks until the spill completes. Spills are written in round-robin fashion to the directories specified by the mapreduce.cluster.local.dir property.
  • Before writing to disk, the thread divides the data into partitions (one per reducer); within each partition, the background thread performs an in-memory sort by key.
  • If a combiner function is defined, it runs on the output of the sort. The combiner makes the map output more compact, so there is less data to write to local disk and to transfer to the reducers.
  • Each time the memory buffer reaches its spill threshold, a new spill file is created, so a task can produce multiple spill files. Before the task finishes, the spill files are merged into a single partitioned and sorted output file.
  • If there are at least three spill files, the combiner is run again before the output file is written.
  • The map output can be compressed before it is written to disk. Doing so makes writing faster, saves disk space, and reduces the amount of data transferred to the reducers. Enable it with mapreduce.map.output.compress=true (default false) and choose the compression library with mapreduce.map.output.compress.codec.
  • The output files are made available to the reducers over HTTP.
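The map-side steps above can be sketched as a small simulation, assuming a record-count threshold in place of the byte-based io.sort.mb limit, a list standing in for the local disk, and a word-count-style combiner. This is an illustration of the mechanism, not Hadoop's actual implementation:

```python
import heapq
from collections import defaultdict

SPILL_THRESHOLD = 4   # stands in for io.sort.mb * spill.percent (records, not bytes)
NUM_PARTITIONS = 2    # one partition per reducer

def combiner(key, values):
    return sum(values)  # local aggregation, as in word count

def spill(buffer, spills):
    """Partition the buffered records, sort each partition by key,
    run the combiner, and append the result as one 'spill file'."""
    by_partition = defaultdict(lambda: defaultdict(list))
    for key, value in buffer:
        by_partition[hash(key) % NUM_PARTITIONS][key].append(value)
    spills.append({
        p: sorted((k, combiner(k, vs)) for k, vs in kvs.items())
        for p, kvs in by_partition.items()
    })

def run_map_side(pairs):
    buffer, spills = [], []
    for pair in pairs:
        buffer.append(pair)
        if len(buffer) >= SPILL_THRESHOLD:  # threshold reached: spill to "disk"
            spill(buffer, spills)
            buffer = []
    if buffer:
        spill(buffer, spills)               # final spill when the task finishes
    # merge all spill files into one sorted run per partition
    return {p: list(heapq.merge(*(s.get(p, []) for s in spills)))
            for p in range(NUM_PARTITIONS)}
```

Note that a key spilled in two different files still appears twice in the merged run (once per spill's combiner output); that is why the combiner must be an optimization only, and the reducer still performs the final aggregation.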

Reduce Side:

  • Map tasks may finish at different times, so reduce tasks start copying their outputs as soon as each map task completes. This is known as the copy phase of the reduce task.
  • A reduce task has a number of copier threads that fetch map outputs in parallel; the number is configurable via mapreduce.reduce.shuffle.parallelcopies (default 5).
  • Map outputs are copied into the reduce task JVM's memory if they are small enough; otherwise they are copied to disk. The in-memory buffer size is controlled by mapreduce.reduce.shuffle.input.buffer.percent (default 70% of the heap). A spill is triggered either by the buffer threshold mapreduce.reduce.shuffle.merge.percent (default 66%) or by the threshold number of map outputs, mapreduce.reduce.merge.inmem.threshold (default 1000).
  • When either threshold is reached, the buffer contents are merged and spilled to disk.
  • As the copies accumulate on disk, background threads merge them into large, sorted files.
  • If the map outputs were compressed by the map tasks, they have to be decompressed in memory before the merge operation.
  • When all the map outputs have been copied, the reduce task moves into the sort phase (more accurately a merge phase, since the sorting was carried out on the map side), which merges the map outputs while maintaining their sort order.
  • During the reduce phase, the reduce function is invoked once for each key in the sorted output.
  • The output of this phase is written directly to the output filesystem, typically HDFS.
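The reduce-side merge and reduce phases can be sketched the same way, assuming the copied map outputs arrive as already-sorted runs (as the map side guarantees). Merging the runs preserves the sort order, so the reduce function can be invoked once per key with a single pass. A Python illustration, not Hadoop code:

```python
import heapq
from itertools import groupby
from operator import itemgetter

# The copied map outputs are already sorted runs, so the reducer only
# needs to merge them (no full re-sort) and then call the reduce
# function once per key, in key order.

def reduce_side(sorted_runs, reduce_fn):
    merged = heapq.merge(*sorted_runs)   # merge phase: keeps the sort order
    return [(key, reduce_fn(key, [v for _, v in group]))
            for key, group in groupby(merged, key=itemgetter(0))]

runs = [[("a", 1), ("c", 2)], [("a", 3), ("b", 4)]]
result = reduce_side(runs, lambda k, vs: sum(vs))
# result: [("a", 4), ("b", 4), ("c", 2)]
```

Because each run is sorted, heapq.merge streams the records in key order without materializing a full sort, which mirrors why this phase is a merge rather than a sort.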
