Evolution of Spark

Apache Spark

Challenges with MR

1. Performance: lots of disk I/O.

2. MR code is hard to write.

3. MapReduce supports only batch processing.

We have two types of processing:

- Batch

- Real-time processing / Streaming

4. MapReduce is not flexible or ideal for every use case.

Whenever we use Spark, we have to arrange storage and a resource manager ourselves, because Spark is just a plug-and-play compute engine. It can plug into any storage, such as AWS S3, HDFS, ADLS Gen2, GCS, or the local file system, and into any resource manager, such as YARN, Mesos, or Kubernetes.
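
To make the plug-and-play idea concrete, here is a minimal PySpark sketch. The bucket and path names are hypothetical placeholders, and reading from S3 assumes the right connector (e.g. hadoop-aws) is available; the point is that the same code plugs into different storage layers and resource managers.

from pyspark.sql import SparkSession

# The processing code stays the same; only the path scheme decides
# which storage layer Spark plugs into.
spark = SparkSession.builder.appName("plug-and-play-demo").getOrCreate()

df_local = spark.read.csv("file:///tmp/orders.csv", header=True)      # local disk
df_hdfs  = spark.read.csv("hdfs:///data/orders.csv", header=True)     # HDFS
df_s3    = spark.read.csv("s3a://my-bucket/orders.csv", header=True)  # AWS S3

# The resource manager is chosen at submit time, not in the code:
#   spark-submit --master yarn                 app.py
#   spark-submit --master k8s://<api-server>   app.py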

Officially, Apache Spark is defined as a multi-language engine for executing data engineering, data science, and machine learning workloads on single-node machines or clusters.

A very small and precise definition of Spark is:

  • General Purpose
  • In Memory
  • Compute Engine

MapReduce Job Challenges:

Consider five MR jobs, MR1 through MR5, chained together with HDFS underneath. How does MR1 work? It first reads its input data from HDFS, i.e. from disk. MR1 then does its processing, i.e. the Mapper and Reducer activities, and writes its final output back to HDFS, i.e. to disk (a write operation). Every subsequent job works the same way: read from disk, process, write back to disk. From this we can see that a lot of disk I/O is involved when we have a series, or chain, of MapReduce jobs. Because of all this disk I/O, MapReduce lags behind in performance.
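
The pattern behind this can be sketched in a few lines of plain Python (the file names are illustrative, and each stage stands in for a full MR job): every stage must finish writing its output to disk before the next stage can start reading.

# Illustrative sketch: each chained MR job reads its input from disk
# and materializes its full output back to disk (HDFS) for the next job.
def mr_stage(input_path, output_path, transform):
    with open(input_path) as f:            # disk read
        records = f.readlines()
    results = transform(records)           # the Mapper/Reducer work
    with open(output_path, "w") as f:      # disk write
        f.writelines(results)

# MR1 -> MR2 -> MR3: three reads + three writes = six disk I/Os.
mr_stage("hdfs_input.txt", "stage1_out.txt", lambda recs: recs)
mr_stage("stage1_out.txt", "stage2_out.txt", lambda recs: recs)
mr_stage("stage2_out.txt", "stage3_out.txt", lambda recs: recs)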

Here comes the Spark evolution, and this is how Spark improves on traditional MapReduce. Spark reads from disk only once, at the start, and writes to disk only once, at the end, for the final output. So only two disk I/Os are involved while processing data with Spark. Spark has the capability to process data in memory, which makes processing faster and lets each step hand its data directly to the next operation.
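
A hedged PySpark sketch of the same chain (the column and path names are made up for illustration) shows one read at the start, one write at the end, and everything in between staying in memory:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("in-memory-chain").getOrCreate()

# One disk read at the start...
orders = spark.read.csv("hdfs:///data/orders.csv", header=True)

# ...then the whole chain of transformations runs in memory
# (each step playing the role of one MR job in the earlier chain)...
result = (orders
          .filter(F.col("status") == "CLOSED")   # like MR1
          .groupBy("customer_id").count()        # like MR2
          .orderBy(F.col("count").desc()))       # like MR3

# ...and one disk write at the very end: two disk I/Os in total.
result.write.csv("hdfs:///output/top_customers", header=True)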

That's why we say Spark is an In-Memory Compute Engine.


Stay tuned for more Spark understanding - a deep dive into Apache Spark in the coming series....


By Sunil Ghate

#KeepLearning #KeepUpskilling
