Introduction to Apache Spark
Spark is an open-source, in-memory, massively parallel execution environment for running analytics applications. The other common framework for running analytics applications and processing big data is MapReduce. MapReduce follows a split-apply-combine strategy for data analysis and stores the split intermediate data on the disks of a cluster. Spark, by contrast, keeps data in memory, which makes it much faster for this kind of parallel distributed processing. Spark can be thought of as an in-memory layer sitting on top of a huge data store, from which data is loaded into memory and processed in parallel across the cluster. Like MapReduce, Spark distributes the data across the cluster and performs parallel processing and analysis, but as noted above the data resides in memory rather than on disk, which gives it its speed.
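As a minimal sketch of this in-memory, parallel model, the following PySpark snippet distributes a small collection across partitions and aggregates it in parallel (the input numbers, app name, and local master setting are illustrative placeholders, not part of any real deployment):

```python
from pyspark.sql import SparkSession

# Start a local Spark session; on a real cluster the master would point
# at YARN, Mesos, or a standalone Spark cluster manager instead.
spark = SparkSession.builder.master("local[*]").appName("intro-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a collection across the cluster as in-memory partitions.
numbers = sc.parallelize(range(1, 1001), numSlices=4)

# Each partition is transformed in parallel; the result is combined on the driver.
total = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)

spark.stop()
```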
Features
- Speed → Due to in-memory processing
- Caching → Spark has a caching layer to keep data in memory, which makes repeated processing even faster (see the sketch after this list)
- Deployment → Can be deployed on a Hadoop cluster or on its own standalone Spark cluster
- Polyglot → Code can be written in Python, Java, Scala, and R
- Real-time → It was developed primarily with real-time processing of data in mind, and hence supports it
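To make the caching point concrete, here is a hedged sketch: calling cache() (or persist()) keeps an RDD in memory after the first action computes it, so later actions reuse it instead of recomputing from the source (the file path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical input file; any text file on a supported store would do.
lines = sc.textFile("data/events.log")
errors = lines.filter(lambda line: "ERROR" in line)

# Keep the filtered RDD in memory once the first action computes it.
errors.cache()

print(errors.count())                                    # first action: reads the file and fills the cache
print(errors.filter(lambda l: "timeout" in l).count())   # served from the cached RDD
```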
Some key differences between Apache Spark and MapReduce, the two most common big data processing frameworks:
- Spark processes data in memory, while MapReduce stores intermediate results on disk, which makes Spark significantly faster.
- Spark supports real-time stream processing in addition to batch processing; MapReduce is batch-oriented.
- Spark provides a caching layer so intermediate data can be reused; MapReduce rereads data from disk between stages.
Spark Ecosystem
- Engine - Spark Core, the basic core component of the Spark ecosystem on top of which the entire ecosystem is built. It performs scheduling, monitoring, and basic I/O.
- Management - A Spark cluster can be managed by Hadoop YARN, Mesos, or the standalone Spark cluster manager.
- Libraries - The ecosystem comprises Spark SQL (for running SQL-like queries on RDDs or data from external sources; see the sketch after this list), Spark MLlib (for machine learning), Spark GraphX (for building and analysing graphs), and Spark Streaming (for batch processing and streaming of data in the same application).
- Programming - Code can be written in Python, Java, Scala, and R.
- Storage - Data can be stored in HDFS, S3, or local storage, and both SQL and NoSQL databases are supported.
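As a hedged sketch of the Spark SQL library mentioned above, the following builds a small DataFrame and queries it with plain SQL (the table name, column names, and rows are made up purely for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

# Hypothetical in-memory rows; in practice this could come from HDFS, S3, etc.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    ["name", "age"],
)

# Register the DataFrame as a temporary view and query it with SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```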
RDDs
The main abstraction Spark provides is the resilient distributed dataset (RDD): a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist or cache an RDD in memory, allowing it to be reused efficiently across parallel operations.

RDDs automatically recover from node failures, and their immutability plays a great role in this self-recovery. Because an RDD is immutable, the sequence of transformations used to generate it can be stored as a lineage in the form of a DAG, and the data sets are replicated on multiple nodes. So if a node processing a partition of the data set fails, the cluster manager can assign the processing to some other node, which uses the lineage of operations (in the DAG) to recover the data it was working on.

RDDs are schema-less data structures that can handle both structured and unstructured data. All operations are performed on RDDs, transforming one RDD into another and ultimately writing results to persistent storage. An RDD is an immutable distributed collection of objects, where the objects can even belong to user-defined classes. RDDs support lazy evaluation: transformations produce new RDDs, but nothing is computed until an action is performed; actions produce the results.
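The following hedged sketch shows the transformation/action split and lazy evaluation described above; the lineage is only executed when the action collect() is called (the input words are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "rdd", "spark", "lazy", "rdd", "spark"])

# Transformations only build up the lineage (DAG); no work happens yet.
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# The action triggers evaluation of the whole lineage and returns results.
print(counts.collect())   # e.g. [('spark', 3), ('rdd', 2), ('lazy', 1)]

spark.stop()
```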
Spark works primarily in three modes:
1. Batch mode - A job is scheduled, and a queue is used to run batches of jobs without manual intervention.
2. Stream mode - The program runs continuously and processes data as the stream arrives (see the sketch after this list).
3. Interactive mode - The user executes commands on a shell, primarily for development purposes.
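As a hedged illustration of stream mode, the classic word-count example from the Spark documentation reads a text stream from a network socket with Spark Streaming and counts words in each micro-batch (the host and port are placeholders):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "stream-demo")
ssc = StreamingContext(sc, batchDuration=1)  # micro-batches of one second

# Placeholder source: a text stream served on localhost:9999 (e.g. via netcat).
lines = ssc.socketTextStream("localhost", 9999)

# Count words in each micro-batch as the data arrives.
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()             # begin processing the stream
ssc.awaitTermination()  # run until stopped
```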
Spark Architecture
- Spark works in a master-slave architecture.
- The driver node contains the driver program, which holds the Spark context.
- The driver program implicitly converts the user-submitted program into a DAG (directed acyclic graph).
- The Spark context, with the help of the cluster manager, allocates the tasks (the executable parts of the job) to the worker nodes.
- The cluster manager allocates the resources the driver needs to execute the job.
- Worker nodes contain executors, which are where the tasks are executed; the results are returned back to the context (see the sketch after this list).
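A hedged sketch of how this looks from the driver's side: building the session creates the Spark context, which asks the cluster manager for executors to run the tasks on (the master URL and resource numbers below are hypothetical placeholders):

```python
from pyspark.sql import SparkSession

# The driver program: the session holds the Spark context, which asks the
# cluster manager (here YARN, as an example) for executor resources.
spark = (SparkSession.builder
         .appName("architecture-demo")
         .master("yarn")                           # placeholder cluster manager
         .config("spark.executor.instances", "4")  # hypothetical resource ask
         .config("spark.executor.cores", "2")
         .config("spark.executor.memory", "2g")
         .getOrCreate())

# An action is broken into a DAG of stages and tasks; the tasks run on the
# executors, and the result comes back to this driver.
print(spark.sparkContext.parallelize(range(100)).sum())

spark.stop()
```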
Summary
Here we talked about:
- Features of Spark
- Spark and MapReduce differences
- RDDs
- Spark ecosystem
- Spark architecture