Problems with Hadoop Map Reduce

Let's understand this with an example. Take the de-facto "Hello World" of the Hadoop world, the word count problem: say we have a 250 MB file, and we want to count the occurrences of each word in it using Hadoop Map Reduce.
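For reference, here is a minimal sketch of what that word count looks like as a Hadoop Streaming job in Python. The script names (mapper.py, reducer.py) are just illustrative placeholders:

#!/usr/bin/env python3
# mapper.py -- reads input lines from stdin, emits "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- Hadoop sorts mapper output by key, so identical words
# arrive grouped together; we keep a running count per word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

These would be submitted with the hadoop-streaming jar, roughly: hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input <in> -output <out> (paths here are placeholders).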

If you are already familiar with Hadoop, you know the input file is split into key-value pairs according to the input file format, and the resulting splits are then distributed across Data Nodes. There is a whole long story behind this, and if you want a nice read, please go through this link:

https://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow

Leaving aside the science of how data is managed across the nodes, let's go a bit lower-level to understand where exactly the problem lies with Map Reduce. In Hadoop jargon, the data to be processed resides on machines called Data Nodes, and in the computer world "resides" means stored on a physical disk. Our file is 250 MB, which is greater than the default Hadoop block size of 128 MB, so the file is split into two parts (the exact boundaries depend on the input split and block size). Say part1 and part2 reside on DataNode1 and DataNode2 respectively: part1 has to be written to the physical disk on DataNode1, and part2 to the disk on DataNode2.
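As a quick sanity check on those numbers, here is the split arithmetic (the 128 MB block size is the Hadoop 2.x default; variable names are just for illustration):

# Back-of-envelope split calculation for our 250 MB file.
import math

file_size_mb = 250     # size of the input file
block_size_mb = 128    # default HDFS block size in Hadoop 2.x+

num_splits = math.ceil(file_size_mb / block_size_mb)  # -> 2 splits
part1 = min(block_size_mb, file_size_mb)              # -> 128 MB on DataNode1
part2 = file_size_mb - part1                          # -> 122 MB on DataNode2
print(num_splits, part1, part2)                       # 2 128 122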

We all know memory is where processing happens. But since the data is stored on disk, it has to be copied from disk to memory before any processing can take place. Any operation that involves the physical disk, like reading data from it or writing data to it, is known as an I/O operation (Input/Output operation). Disk operations are always slow because they involve seek time, the I/O scheduler and much more, and when you go deeper into the philosophy of Hadoop you will find these operations at every step. Since the actual processing happens in memory, consider the I/O involved just in the mapper phase:

  1. First, the input split has to be copied to disk. (Write)
  2. Then the data is read from disk into RAM. (Read)
  3. After processing, the processed data is written back to disk. (Write)
  4. So a minimum of 2 writes and 1 read is involved in the mapper phase alone.

Now let's extend this understanding to the other steps involved in the Map Reduce approach. After the mapper phase is done, sorting and shuffling take place. During this step data moves over the network, which adds network latency on top of everything else, and then the data is again copied to memory (read) and spilled back to disk (write), adding further disk operations. And don't forget we haven't even executed the reducer step yet, which repeats the same read-process-write cycle. So much disk I/O for a simple word count: no wonder Map Reduce programs are slow by nature. If we try to summarise, the gist is:

Total latency = disk I/O (mapper) + network latency (sort/shuffle) + disk I/O (reducer)
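To make that formula concrete, here is a rough back-of-envelope model of one Map Reduce pass over our 250 MB file. All throughput numbers below are illustrative assumptions, not measurements:

# Rough latency model; every figure here is an assumed, order-of-magnitude
# number chosen only to show the shape of the formula above.
data_mb = 250
disk_mb_per_s = 150        # assumed sequential disk throughput
network_mb_per_s = 100     # assumed network throughput during shuffle

# Mapper phase: write the split, read it back, write map output (2W + 1R).
mapper_io_s = 3 * data_mb / disk_mb_per_s

# Shuffle/sort: map output crosses the network, then is read and written again.
shuffle_s = data_mb / network_mb_per_s + 2 * data_mb / disk_mb_per_s

# Reducer phase: read the shuffled data, write the final output (1R + 1W).
reducer_io_s = 2 * data_mb / disk_mb_per_s

total_s = mapper_io_s + shuffle_s + reducer_io_s
print(f"~{total_s:.1f} s spent purely on I/O, before any actual computation")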

Now, how can we get rid of this problem? By now you might have guessed it: reduce these disk I/O operations and try to do everything in memory. This is the area where Spark scores over Map Reduce, by introducing the concept of in-memory data processing using RDDs. Let's explore how Spark solves this problem in the next reading.
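As a small preview of that next reading, here is a minimal PySpark sketch of the same word count kept in memory (the input path is a placeholder, and this is just an illustrative sketch, not the full story):

from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")

# Load the file and count words entirely through RDD transformations.
counts = (sc.textFile("hdfs:///data/input.txt")   # placeholder path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

# cache() keeps the RDD in memory, so later actions reuse it without
# re-reading from disk -- the key difference from Map Reduce.
counts.cache()
print(counts.take(10))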
