HDFS Architecture (Distributed Processing)

Let's Understand Distributed Processing in Hadoop

Hadoop uses MapReduce as its distributed processing engine to process the data distributed across the cluster.

It has two phases:

  1. Map
  2. Reduce


Process Flow:

The Record Reader reads a block and converts the data into key-value pairs (K, V), and these key-value pairs are fed to the Mapper. The Mapper processes the data in parallel on the different Data Nodes of the cluster. The output of the Mappers is again (key, value) pairs, which are then sent to the Reducer node, one of the nodes in the cluster; this sending of data to the Reducer node is called shuffling. Once all the data has been shuffled to the Reducer node, sorting takes place, which groups all the data by key so that aggregation of the data can produce the final output.
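To make this flow concrete, here is a toy, single-process simulation of the same steps in plain Java. The class name MiniMapReduce and the word-count task are invented for this sketch; a real job runs distributed across a Hadoop cluster, not inside one JVM.

import java.util.*;

// Toy simulation of the MapReduce flow described above:
// read records, map, shuffle/sort (group by key), reduce.
public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> records = Arrays.asList("big data", "big cluster");

        // Map: emit a (word, 1) pair for every word in every record.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String record : records) {
            for (String word : record.split(" ")) {
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Shuffle + sort: group all values by key (TreeMap keeps keys sorted).
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }

        // Reduce: aggregate the grouped values into the final output.
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = e.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(e.getKey() + "\t" + sum); // prints: big 2, cluster 1, data 1
        }
    }
}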


Distributed Processing Architecture


Description and usage of each component in Hadoop distributed processing

Mapper : In the Map phase, the Mapper gives you the parallelism. That means data distributed across the cluster is processed in parallel on the same Data Node where the data is stored.

As all the Mappers on the different Data Nodes of the cluster run in parallel, we reduce the execution time and the operation is well optimized.

We should try to do as much of the processing as possible in the Mapper phase, to exploit the parallelism to the maximum extent; that is the ultimate goal of having a distributed cluster and what yields optimized performance. A minimal Mapper sketch follows below.
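As a minimal sketch, a word-count Mapper written against Hadoop's Java MapReduce API could look like the following. The class name WordCountMapper and the word-count task itself are illustrative choices, not taken from this article.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line (supplied by the Record Reader).
// Input value: the line of text. Output: one (word, 1) pair per word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }
}

One copy of this class runs for each input split, on the node that holds the corresponding block.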


Reducer : In the Reduce phase, the Reducer is used to compute the aggregated values. The Reducer takes the Mapper output as its input, performs the aggregation, and provides the final output. By default 1 Reducer runs, but this can be changed to 0 or more than 1 depending on the processing requirements.

The need for more than 1 Reducer arises to avoid the bottleneck of processing all the Mapper output on a single Reducer node. If we increase the Reducer count beyond 1, partitions come into the picture. A matching Reducer sketch follows below.
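A matching word-count Reducer sketch, again illustrative rather than taken from the article; each call receives one key together with all of its shuffled and sorted values:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get(); // aggregate all counts for this word
        }
        context.write(key, new IntWritable(sum)); // final (word, total) output
    }
}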

As data is processed on the node where it is stored, whether in the Mapper or the Reducer phase, we say that Hadoop works on the principle of data locality: the data is local (on the machine) and the code is run on that particular machine to process the data.

Example : Suppose a 1 GB file is processed on a cluster of 4 Data Nodes.

1. Number of Mappers = Number of Blocks (parts of a file, explained in the Part 2 article)

1 GB file / 128 MB (block size) = 8 blocks

So we have 8 Mappers for processing the file above, as the quick check below confirms.
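A back-of-the-envelope check of that calculation in plain Java (the class name is invented for this sketch); the ceiling division matters because a partial last block would still get its own Mapper:

public class MapperCount {
    public static void main(String[] args) {
        long fileSize  = 1024L * 1024 * 1024; // 1 GB input file
        long blockSize = 128L  * 1024 * 1024; // 128 MB default block size
        // Ceiling division: round up so a partial last block is counted.
        long numBlocks = (fileSize + blockSize - 1) / blockSize;
        System.out.println("Blocks = Mappers = " + numBlocks); // prints 8
    }
}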


2. Number of Mappers running in parallel = Number of Data Nodes in the cluster that have blocks stored on them.

4 Data Nodes, with 2 blocks stored on each Data Node.

So at any one time the number of Mappers running in parallel is 4, as we have a 4-node cluster.


3. Number of Partitions = Number of Reducers
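The Reducer count, and with it the partition count, is set in the job driver. A minimal driver sketch, assuming the illustrative WordCountMapper and WordCountReducer classes above and an example value of 4 Reducers:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(4); // 4 Reducers -> Mapper output split into 4 partitions
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}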


Some other keywords that come into the picture during the MapReduce phases

Record Reader : It reads the data from a block and converts it into key-value pairs (K, V), which are then sent to the Mapper for processing. For a text file, the key is typically the byte offset of the record within the file.

Example: data = 1, 101, 'Ajay', 'Sahu' --> (0, [1, 101, 'Ajay', 'Sahu'])

Partition : Partitions are created on the Mapper machines. They are used to segregate the Mapper output, based on a default hash function, into the different partitions created as per the requirement.

The hash function can also be replaced with a custom one to distribute the data across partitions more effectively. When the output of one particular partition is sent to one particular Reducer only, the partitioning is said to be consistent, which is the required behavior. The output is again key-value pairs; see the sketch below.
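A custom Partitioner sketch (the class name is invented for this example) that mirrors the behavior of Hadoop's default HashPartitioner; it is consistent because the same key always hashes to the same partition, and therefore to the same Reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered in the driver with job.setPartitionerClass(WordPartitioner.class).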

Shuffle : Shuffling is the process of sending the Mapper output from the Mapper machines to the Reducer machines for further processing.

Sort : In this process, all the Mapper outputs are sorted by key and records with the same key are grouped together. This takes place on the Reducer machine.

Combiner : To reduce the load on the Reducer machine, some of the work can be done on the Mapper machine itself, such as an initial data aggregation, which is also called local aggregation.

After the Mapper finishes processing, the data is aggregated locally on the Mapper machines, and this reduced data is then sent to the Reducer machine for further processing.

The Combiner helps in two ways:

1. It increases optimization, as we use parallelism for the aggregation itself.

2. Less data is transferred to the Reducer for final processing, which also makes the final Reducer step faster.

While using a Combiner we should be cautious enough to make sure that the final result does not change compared to the output without the Combiner, since aggregation takes place on multiple nodes in this process.
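For word count, summation is associative and commutative, so the Reducer class itself can safely act as the Combiner. In the illustrative driver sketch above, that is one extra line:

job.setCombinerClass(WordCountReducer.class); // local aggregation on each Mapper node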

Process Flow Diagram

