Problems with Hadoop Map Reduce

Let's understand this with an example. Take the de-facto "Hello World" of the Hadoop world, the word count problem: say we have a 250 MB file, and we want to count the occurrences of each word in it using Hadoop Map Reduce.
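For reference, here is a minimal sketch of what that word count looks like as a Hadoop Streaming job in Python. The script names (mapper.py, reducer.py) are just illustrative placeholders:

#!/usr/bin/env python3
# mapper.py -- reads input lines from stdin, emits "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- Hadoop sorts mapper output by key, so identical words
# arrive grouped together; we keep a running count per word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

These would be submitted with the hadoop-streaming jar, roughly: hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input <in> -output <out> (paths here are placeholders).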

If you are already familiar with Hadoop, you know the input file is split into key-value pairs according to the input file format, and the resulting splits are then distributed across Data Nodes. There is a whole long story behind this, and if you want a nice read, please go through this link:

https://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow

Leaving aside the science of how data is managed across the nodes, let's go a bit lower-level to understand where exactly the problem lies with Map Reduce. In Hadoop jargon, the data to be processed resides on machines called Data Nodes, and in the computer world "resides" means stored on a physical disk. Our file is 250 MB, which is greater than the default Hadoop block size of 128 MB, so the file is split into two parts (the exact boundaries depend on the input split and block size). Say part1 and part2 reside on DataNode1 and DataNode2 respectively: part1 has to be written to the physical disk on DataNode1, and part2 to the disk on DataNode2.
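As a quick sanity check on those numbers, here is the split arithmetic (the 128 MB block size is the Hadoop 2.x default; variable names are just for illustration):

# Back-of-envelope split calculation for our 250 MB file.
import math

file_size_mb = 250     # size of the input file
block_size_mb = 128    # default HDFS block size in Hadoop 2.x+

num_splits = math.ceil(file_size_mb / block_size_mb)  # -> 2 splits
part1 = min(block_size_mb, file_size_mb)              # -> 128 MB on DataNode1
part2 = file_size_mb - part1                          # -> 122 MB on DataNode2
print(num_splits, part1, part2)                       # 2 128 122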

We all know memory is where processing happens. But since the data is stored on disk, it has to be copied from disk to memory before any processing can take place. Any operation that involves the physical disk, like reading data from it or writing data to it, is known as an I/O operation (Input/Output operation). Disk operations are always slow because they involve seek time, the I/O scheduler and much more, and when you go deeper into the philosophy of Hadoop you will find these operations at every step. Since the actual processing happens in memory, consider the I/O involved just in the mapper phase:

  1. First, the input split has to be copied to disk. (Write)
  2. Then the data is read from disk into RAM. (Read)
  3. After processing, the processed data is written back to disk. (Write)
  4. So a minimum of 2 writes and 1 read is involved in the mapper phase alone.

Now let's extend this understanding to the other steps involved in the Map Reduce approach. After the mapper phase is done, sorting and shuffling take place. During this step data moves over the network, which adds network latency on top of everything else, and then the data is again copied to memory (read) and spilled back to disk (write), adding further disk operations. And don't forget we haven't even executed the reducer step yet, which repeats the same read-process-write cycle. So much disk I/O for a simple word count: no wonder Map Reduce programs are slow by nature. If we try to summarise, the gist is:

Total latency = disk I/O (mapper) + network latency (sort/shuffle) + disk I/O (reducer)
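To make that formula concrete, here is a rough back-of-envelope model of one Map Reduce pass over our 250 MB file. All throughput numbers below are illustrative assumptions, not measurements:

# Rough latency model; every figure here is an assumed, order-of-magnitude
# number chosen only to show the shape of the formula above.
data_mb = 250
disk_mb_per_s = 150        # assumed sequential disk throughput
network_mb_per_s = 100     # assumed network throughput during shuffle

# Mapper phase: write the split, read it back, write map output (2W + 1R).
mapper_io_s = 3 * data_mb / disk_mb_per_s

# Shuffle/sort: map output crosses the network, then is read and written again.
shuffle_s = data_mb / network_mb_per_s + 2 * data_mb / disk_mb_per_s

# Reducer phase: read the shuffled data, write the final output (1R + 1W).
reducer_io_s = 2 * data_mb / disk_mb_per_s

total_s = mapper_io_s + shuffle_s + reducer_io_s
print(f"~{total_s:.1f} s spent purely on I/O, before any actual computation")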

Now, how can we get rid of this problem? By now you might have guessed it: reduce these disk I/O operations and try to do everything in memory. This is the area where Spark scores over Map Reduce, by introducing the concept of in-memory data processing using RDDs. Let's explore how Spark solves this problem in the next reading.
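As a small preview of that next reading, here is a minimal PySpark sketch of the same word count kept in memory (the input path is a placeholder, and this is just an illustrative sketch, not the full story):

from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")

# Load the file and count words entirely through RDD transformations.
counts = (sc.textFile("hdfs:///data/input.txt")   # placeholder path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

# cache() keeps the RDD in memory, so later actions reuse it without
# re-reading from disk -- the key difference from Map Reduce.
counts.cache()
print(counts.take(10))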
