Evolution of Spark

Apache Spark

Challenges with MR

1. Performance: lots of disk I/O.

2. MR code is hard to write.

3. MapReduce supports only batch processing.

We have two types of processing:

- Batch

- Real-time processing / Streaming

4. MapReduce is not flexible or ideal for every use case.

Whenever we use Spark, we have to arrange storage and a resource manager ourselves, because Spark is just a plug-and-play compute engine. It can plug into any storage, such as AWS S3, HDFS, ADLS Gen2, GCS, or the local file system, and into any resource manager, such as YARN, Mesos, or Kubernetes.
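
To make the plug-and-play idea concrete, here is a minimal PySpark sketch. The bucket and path names are hypothetical placeholders, and reading from S3 assumes the right connector (e.g. hadoop-aws) is available; the point is that the same code plugs into different storage layers and resource managers.

from pyspark.sql import SparkSession

# The processing code stays the same; only the path scheme decides
# which storage layer Spark plugs into.
spark = SparkSession.builder.appName("plug-and-play-demo").getOrCreate()

df_local = spark.read.csv("file:///tmp/orders.csv", header=True)      # local disk
df_hdfs  = spark.read.csv("hdfs:///data/orders.csv", header=True)     # HDFS
df_s3    = spark.read.csv("s3a://my-bucket/orders.csv", header=True)  # AWS S3

# The resource manager is chosen at submit time, not in the code:
#   spark-submit --master yarn                 app.py
#   spark-submit --master k8s://<api-server>   app.py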

Officially, Apache Spark is defined as a multi-language engine for executing data engineering, data science, and machine learning workloads on single-node machines or clusters.

A very small and precise definition of Spark is:

  • General Purpose
  • In Memory
  • Compute Engine

MapReduce Job Challenges:

Consider five MR jobs, MR1 through MR5, chained together with HDFS underneath. How does MR1 work? It first reads its input data from HDFS, i.e. from disk. MR1 then does its processing, i.e. the Mapper and Reducer activities, and writes its final output back to HDFS, i.e. to disk (a write operation). Every subsequent job works the same way: read from disk, process, write back to disk. From this we can see that a lot of disk I/O is involved when we have a series, or chain, of MapReduce jobs. Because of all this disk I/O, MapReduce lags behind in performance.
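
The pattern behind this can be sketched in a few lines of plain Python (the file names are illustrative, and each stage stands in for a full MR job): every stage must finish writing its output to disk before the next stage can start reading.

# Illustrative sketch: each chained MR job reads its input from disk
# and materializes its full output back to disk (HDFS) for the next job.
def mr_stage(input_path, output_path, transform):
    with open(input_path) as f:            # disk read
        records = f.readlines()
    results = transform(records)           # the Mapper/Reducer work
    with open(output_path, "w") as f:      # disk write
        f.writelines(results)

# MR1 -> MR2 -> MR3: three reads + three writes = six disk I/Os.
mr_stage("hdfs_input.txt", "stage1_out.txt", lambda recs: recs)
mr_stage("stage1_out.txt", "stage2_out.txt", lambda recs: recs)
mr_stage("stage2_out.txt", "stage3_out.txt", lambda recs: recs)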

Here comes the Spark evolution, and this is how Spark improves on traditional MapReduce. Spark reads from disk only once, at the start, and writes to disk only once, at the end, for the final output. So only two disk I/Os are involved while processing data with Spark. Spark has the capability to process data in memory, which makes processing faster and lets each step hand its data directly to the next operation.
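
A hedged PySpark sketch of the same chain (the column and path names are made up for illustration) shows one read at the start, one write at the end, and everything in between staying in memory:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("in-memory-chain").getOrCreate()

# One disk read at the start...
orders = spark.read.csv("hdfs:///data/orders.csv", header=True)

# ...then the whole chain of transformations runs in memory
# (each step playing the role of one MR job in the earlier chain)...
result = (orders
          .filter(F.col("status") == "CLOSED")   # like MR1
          .groupBy("customer_id").count()        # like MR2
          .orderBy(F.col("count").desc()))       # like MR3

# ...and one disk write at the very end: two disk I/Os in total.
result.write.csv("hdfs:///output/top_customers", header=True)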

That's why we say Spark is an In-Memory Compute Engine.


Stay tuned for more Spark understanding - a deep dive into Apache Spark in the coming series....


By Sunil Ghate

#KeepLearning #KeepUpskilling
