Hadoop Architecture (Distributed Processing)
Let's understand distributed processing in Hadoop.
Hadoop uses MapReduce as its distributed processing engine to process the data distributed across the cluster.
It has 2 phases: Map and Reduce.
Process Flow:
The Record Reader reads a block and converts the data into key-value pairs (K, V), which are fed to the mapper. Mappers process the data in parallel on the different data nodes of the cluster. The output of each mapper is again a set of (key, value) pairs, which is sent to a reducer node (one of the nodes in the cluster); this movement of data to the reducer node is called shuffling. Once all the data has been shuffled to the reducer node, sorting groups the data by key so that aggregation can take place to produce the final output.
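The flow above can be sketched as a small in-memory word-count simulation. This is a toy stand-in, not real Hadoop code (real jobs are usually written in Java or run via Hadoop Streaming), and the sample lines are made up:

```python
from collections import defaultdict

# Record reader output: (byte offset, line) pairs from a block (sample data).
lines = [(0, "big data big cluster"), (21, "data node")]

# Map phase: each record is turned into (word, 1) key-value pairs.
mapped = []
for offset, line in lines:
    for word in line.split():
        mapped.append((word, 1))

# Shuffle + sort: group all values by key (done on the reducer node in Hadoop).
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate the grouped values per key.
result = {key: sum(values) for key, values in grouped.items()}
print(result)  # {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```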
Distributed Processing Architecture
Description and usage of each component in Hadoop distributed processing
Mapper : In the map phase, the mapper gives you parallelism: data distributed across the cluster is processed in parallel on the same data node where it is stored.
Because the mappers on the different data nodes of the cluster all run in parallel, execution time drops and the operations are well optimised.
We should try to do as much of the processing as possible in the mapper phase to exploit this parallelism to the maximum extent, which is the ultimate goal of having a distributed cluster and results in optimised performance.
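A mapper can be sketched as a small Python function in the style of a Hadoop Streaming mapper (a sketch only; a real Streaming mapper would read `sys.stdin`, and the sample lines here are made up):

```python
def mapper(stream):
    """Hadoop Streaming-style mapper sketch: reads raw text lines and emits
    tab-separated 'word<TAB>1' pairs, one per token. In a real Streaming job
    this would read sys.stdin; here a list of lines stands in for the records
    of a local block, since each data node runs its own copy of the mapper."""
    for line in stream:
        for word in line.strip().split():
            yield f"{word}\t1"

sample_block = ["big data big cluster", "data node"]
print(list(mapper(sample_block)))
```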
Reducer : In the reduce phase, the reducer is used to compute the aggregated values.
The need for more than 1 reducer arises to avoid the bottleneck of processing all the mapper output on a single reducer node. As soon as we use more than 1 reducer, partitions come into the picture.
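A reducer can be sketched the same way. Because the shuffle/sort phase delivers the pairs sorted by key, the reducer can aggregate each key's values in a single pass (again a toy sketch with made-up data, not Hadoop's actual API):

```python
from itertools import groupby

def reducer(sorted_pairs):
    """Streaming-style reducer sketch: input (key, value) pairs arrive
    already sorted by key after the shuffle/sort phase, so consecutive
    pairs with the same key can be summed in one pass."""
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield key, sum(v for _, v in group)

pairs = sorted([("data", 1), ("big", 1), ("data", 1)])
print(list(reducer(pairs)))  # [('big', 1), ('data', 2)]
```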
Because data is processed on the node where it is stored, in both the mapper and reducer phases, we say that Hadoop works on the principle of data locality.
Example : Suppose a 1 GB file is processed on a cluster of 4 data nodes.
1. Number of mappers = number of blocks (a block is a part of a file, explained in the Part 2 article)
1 GB file / 128 MB (block size) = 8 blocks
So we have 8 mappers for processing the above file.
2. Number of mappers running in parallel = number of data nodes in the cluster that have blocks stored on them.
With 4 data nodes and 2 blocks stored on each node,
the number of mappers running in parallel at any one time will be 4, as we have a 4-node cluster.
3. Number of partitions = Number of Reducers
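The counts in the example above can be recomputed in a few lines (assuming, as in the example, that the 8 blocks are spread evenly across the 4 nodes):

```python
import math

# Recomputing the counts from the example: 1 GB file, 128 MB blocks, 4 nodes.
file_size_mb = 1024
block_size_mb = 128
data_nodes = 4

num_mappers = math.ceil(file_size_mb / block_size_mb)  # one mapper per block
blocks_per_node = num_mappers // data_nodes            # even spread assumed
parallel_mappers = data_nodes                          # one mapper wave per node
print(num_mappers, blocks_per_node, parallel_mappers)  # 8 2 4
```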
Some other keywords that come into the picture during the MapReduce phases
Record Reader : It reads the data from a block and converts it into key-value pairs (k, v), which are then sent to the mapper for processing.
Example: data = 1, 101, 'Ajay', 'Sahu' --> (0, [1, 101, 'Ajay', 'Sahu'])
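A toy record reader can be sketched like this, emitting (byte offset, line) pairs the way Hadoop's default text input handling does (a simplified sketch; the second sample record is made up):

```python
def record_reader(block):
    """Toy record reader: splits a text block into lines and emits
    (byte offset, line) key-value pairs, with the offset of each line's
    first byte within the block serving as the key."""
    offset = 0
    for line in block.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line)

block = "1,101,Ajay,Sahu\n2,102,Riya,Jain\n"
print(list(record_reader(block)))
# [(0, '1,101,Ajay,Sahu'), (16, '2,102,Riya,Jain')]
```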
Partition : Partitions are present on the mapper machines. They are used to segregate the mapper output, based on a default hash function, into the different partitions created as per the requirement.
The default hash function can be replaced with a custom one to distribute the data more effectively across the partitions. When the output of one particular partition is sent to one particular reducer only, the partitioning is said to be consistent, which is the required behaviour. The output is again key-value pairs.
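The hash-based routing can be sketched as follows. `crc32` stands in here for Hadoop's default hash partitioner; the point is only that the function is deterministic, so every mapper routes a given key to the same partition:

```python
from zlib import crc32

def partition(key, num_reducers):
    """Hash-partitioner sketch: every mapper applies the same deterministic
    function, so a given key always lands in the same partition and is
    therefore always routed to the same reducer (a 'consistent' partition)."""
    return crc32(key.encode()) % num_reducers

# Each mapper segregates its output keys into num_reducers partitions.
for key in ["big", "data", "cluster", "node"]:
    print(key, "-> partition", partition(key, 3))
```

Swapping in a custom function here is how skewed keys can be spread more evenly across reducers.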
Shuffle : Shuffling is the process of sending the mapper output from the mapper machine to the reducer machine for further processing.
Sort : In this process all the mapper outputs are sorted by key and grouped together. This takes place on the reducer machine.
Combiner : To reduce the load on the reducer machine, some of the work can be done on the mapper machine itself, such as initial data aggregation, which is also called local aggregation.
After the mapper processing, data is aggregated locally on the mapper machines, and this reduced data is then sent to the reducer machine for further processing.
A combiner helps in two ways:
1. It increases optimisation, as we are using parallelism for the aggregation.
2. Less data is transferred to the reducer for final processing, which also makes the reducer faster.
While using a combiner we should be cautious enough to make sure that the final result does not change compared to the output without the combiner, as aggregation is taking place on multiple nodes in this process.
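This caveat can be made concrete with a toy example (made-up numbers): summing is safe to combine locally, but averaging the local averages changes the final result.

```python
# Mapper output values on two different data nodes (sample data).
node_a = [10, 20, 30]
node_b = [40]

# Sum: combining locally and then reducing gives the same answer.
combined_sum = sum([sum(node_a), sum(node_b)])
assert combined_sum == sum(node_a + node_b)  # 100 == 100, combiner is safe

def avg(xs):
    return sum(xs) / len(xs)

# Average: averaging the per-node averages is wrong.
wrong = avg([avg(node_a), avg(node_b)])  # (20 + 40) / 2 = 30.0
right = avg(node_a + node_b)             # 100 / 4      = 25.0
print(wrong, right)
```

A common fix is to have the combiner emit (sum, count) pairs so the reducer can compute the true average.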
Process Flow Diagram