Hadoop Architecture (Distributed Processing)
Let's understand distributed processing in Hadoop.
Hadoop uses MapReduce as its distributed processing engine to process the data distributed across the cluster.
It has 2 phases: Map and Reduce.
Process Flow:
The Record Reader reads a block and converts the data into key-value pairs (K, V), which are fed to the mapper. Mappers process the data in parallel on the different data nodes of the cluster. The output of each mapper is again a set of (key, value) pairs, which is sent to a reducer node (one of the nodes in the cluster); this movement of data to the reducer node is called shuffling. Once all the data has been shuffled to the reducer node, sorting groups the data by key so that aggregation can take place to produce the final output.
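The flow above can be sketched as a small in-memory word-count simulation. This is a toy stand-in, not real Hadoop code (real jobs are usually written in Java or run via Hadoop Streaming), and the sample lines are made up:

```python
from collections import defaultdict

# Record reader output: (byte offset, line) pairs from a block (sample data).
lines = [(0, "big data big cluster"), (21, "data node")]

# Map phase: each record is turned into (word, 1) key-value pairs.
mapped = []
for offset, line in lines:
    for word in line.split():
        mapped.append((word, 1))

# Shuffle + sort: group all values by key (done on the reducer node in Hadoop).
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate the grouped values per key.
result = {key: sum(values) for key, values in grouped.items()}
print(result)  # {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```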
Distributed Processing Architecture
Description and usage of each component in Hadoop distributed processing
Mapper : In the map phase, the mapper gives you parallelism: data distributed across the cluster is processed in parallel on the same data node where it is stored.
Because the mappers on the different data nodes of the cluster all run in parallel, execution time drops and the operations are well optimised.
We should try to do as much of the processing as possible in the mapper phase to exploit this parallelism to the maximum extent, which is the ultimate goal of having a distributed cluster and results in optimised performance.
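A mapper can be sketched as a small Python function in the style of a Hadoop Streaming mapper (a sketch only; a real Streaming mapper would read `sys.stdin`, and the sample lines here are made up):

```python
def mapper(stream):
    """Hadoop Streaming-style mapper sketch: reads raw text lines and emits
    tab-separated 'word<TAB>1' pairs, one per token. In a real Streaming job
    this would read sys.stdin; here a list of lines stands in for the records
    of a local block, since each data node runs its own copy of the mapper."""
    for line in stream:
        for word in line.strip().split():
            yield f"{word}\t1"

sample_block = ["big data big cluster", "data node"]
print(list(mapper(sample_block)))
```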
Reducer : In the reduce phase, the reducer is used to compute the aggregated values.
The need for more than 1 reducer arises to avoid the bottleneck of processing all the mapper output on a single reducer node. As soon as we use more than 1 reducer, partitions come into the picture.
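A reducer can be sketched the same way. Because the shuffle/sort phase delivers the pairs sorted by key, the reducer can aggregate each key's values in a single pass (again a toy sketch with made-up data, not Hadoop's actual API):

```python
from itertools import groupby

def reducer(sorted_pairs):
    """Streaming-style reducer sketch: input (key, value) pairs arrive
    already sorted by key after the shuffle/sort phase, so consecutive
    pairs with the same key can be summed in one pass."""
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield key, sum(v for _, v in group)

pairs = sorted([("data", 1), ("big", 1), ("data", 1)])
print(list(reducer(pairs)))  # [('big', 1), ('data', 2)]
```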
Because data is processed on the node where it is stored, in both the mapper and reducer phases, we say that Hadoop works on the principle of data locality.
Example : Suppose a 1 GB file is processed on a cluster of 4 data nodes.
1. Number of mappers = number of blocks (a block is a part of a file, explained in the Part 2 article)
1 GB file / 128 MB (block size) = 8 blocks
So we have 8 mappers for processing the above file.
2. Number of mappers running in parallel = number of data nodes in the cluster that have blocks stored on them.
With 4 data nodes and 2 blocks stored on each node,
the number of mappers running in parallel at any one time will be 4, as we have a 4-node cluster.
3. Number of partitions = Number of Reducers
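The counts in the example above can be recomputed in a few lines (assuming, as in the example, that the 8 blocks are spread evenly across the 4 nodes):

```python
import math

# Recomputing the counts from the example: 1 GB file, 128 MB blocks, 4 nodes.
file_size_mb = 1024
block_size_mb = 128
data_nodes = 4

num_mappers = math.ceil(file_size_mb / block_size_mb)  # one mapper per block
blocks_per_node = num_mappers // data_nodes            # even spread assumed
parallel_mappers = data_nodes                          # one mapper wave per node
print(num_mappers, blocks_per_node, parallel_mappers)  # 8 2 4
```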
Some other keywords that come into the picture during the MapReduce phases
Record Reader : It reads the data from a block and converts it into key-value pairs (k, v), which are then sent to the mapper for processing.
Example: data = 1, 101, 'Ajay', 'Sahu' --> (0, [1, 101, 'Ajay', 'Sahu'])
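A toy record reader can be sketched like this, emitting (byte offset, line) pairs the way Hadoop's default text input handling does (a simplified sketch; the second sample record is made up):

```python
def record_reader(block):
    """Toy record reader: splits a text block into lines and emits
    (byte offset, line) key-value pairs, with the offset of each line's
    first byte within the block serving as the key."""
    offset = 0
    for line in block.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line)

block = "1,101,Ajay,Sahu\n2,102,Riya,Jain\n"
print(list(record_reader(block)))
# [(0, '1,101,Ajay,Sahu'), (16, '2,102,Riya,Jain')]
```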
Partition : Partitions are present on the mapper machines. They are used to segregate the mapper output, based on a default hash function, into the different partitions created as per the requirement.
The default hash function can be replaced with a custom one to distribute the data more effectively across the partitions. When the output of one particular partition is sent to one particular reducer only, the partitioning is said to be consistent, which is the required behaviour. The output is again key-value pairs.
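The hash-based routing can be sketched as follows. `crc32` stands in here for Hadoop's default hash partitioner; the point is only that the function is deterministic, so every mapper routes a given key to the same partition:

```python
from zlib import crc32

def partition(key, num_reducers):
    """Hash-partitioner sketch: every mapper applies the same deterministic
    function, so a given key always lands in the same partition and is
    therefore always routed to the same reducer (a 'consistent' partition)."""
    return crc32(key.encode()) % num_reducers

# Each mapper segregates its output keys into num_reducers partitions.
for key in ["big", "data", "cluster", "node"]:
    print(key, "-> partition", partition(key, 3))
```

Swapping in a custom function here is how skewed keys can be spread more evenly across reducers.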
Shuffle : Shuffling is the process of sending the mapper output from the mapper machine to the reducer machine for further processing.
Sort : In this process all the mapper outputs are sorted by key and grouped together. This takes place on the reducer machine.
Combiner : To reduce the load on the reducer machine, some of the work can be done on the mapper machine itself, such as initial data aggregation, which is also called local aggregation.
After the mapper processing, data is aggregated locally on the mapper machines, and this reduced data is then sent to the reducer machine for further processing.
A combiner helps in two ways:
1. It increases optimisation, as we are using parallelism for the aggregation.
2. Less data is transferred to the reducer for final processing, which also makes the reducer faster.
While using a combiner we should be cautious enough to make sure that the final result does not change compared to the output without the combiner, as aggregation is taking place on multiple nodes in this process.
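This caveat can be made concrete with a toy example (made-up numbers): summing is safe to combine locally, but averaging the local averages changes the final result.

```python
# Mapper output values on two different data nodes (sample data).
node_a = [10, 20, 30]
node_b = [40]

# Sum: combining locally and then reducing gives the same answer.
combined_sum = sum([sum(node_a), sum(node_b)])
assert combined_sum == sum(node_a + node_b)  # 100 == 100, combiner is safe

def avg(xs):
    return sum(xs) / len(xs)

# Average: averaging the per-node averages is wrong.
wrong = avg([avg(node_a), avg(node_b)])  # (20 + 40) / 2 = 30.0
right = avg(node_a + node_b)             # 100 / 4      = 25.0
print(wrong, right)
```

A common fix is to have the combiner emit (sum, count) pairs so the reducer can compute the true average.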
Process Flow Diagram