Apache Spark 101

Why Spark?

The MapReduce framework developed by Google was primarily designed for distributed sorting, pattern-based searching, web access log statistics, web indexing, and similar workloads. It served those well, but it falls short on a variety of other applications due to the following limitations:

  1. Suitable only for batch jobs, not interactive ones
  2. Writes to and reads from disk at every stage
  3. Not good for iterative processing, due to the huge disk space consumed by each job
  4. Not good for complex algorithms such as machine learning
  5. High latency: applications that require low latency or random access to a large data set are not feasible

Impact of MapReduce limitations:

To address these limitations, many specialized systems came into play: Impala for interactive and random access, Mahout for machine learning, Giraph for iterative graph processing, Storm for real-time data processing, and so on. The idea behind Spark is to address those limitations of MapReduce in a single engine, while also providing wider language support.

What is Spark?

Spark is a general-purpose compute engine that allows you to run batch, interactive, and streaming jobs on the cluster using the same unified framework.

Key Distinctions between Spark and MapReduce:

  • Handles batch, interactive, and real-time workloads within a single framework
  • In-memory storage
  • Rich APIs in Java, Scala, and Python enable programming at a higher level of abstraction, including map/reduce
  • Easier to implement complex algorithms like machine learning, as it supports general programming constructs, not just map/reduce
  • Speed: up to 100 times faster than MapReduce for in-memory workloads
  • Ease of use and flexibility over MapReduce: because Spark is very API centric, code volume can be reduced by a factor of 5 or more, as the word count sketch below shows
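
To make the code-volume point concrete, here is a minimal word count sketch in Scala (sc is the SparkContext provided by the Spark shell; the input and output paths are placeholders):

        val counts = sc.textFile("input.txt")   // read the file into an RDD of lines
          .flatMap(line => line.split(" "))     // split each line into words
          .map(word => (word, 1))               // pair each word with a count of 1
          .reduceByKey(_ + _)                   // sum the counts per word
        counts.saveAsTextFile("wordcounts")     // write the results out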

Key Concepts

Life cycle Of A Spark Program

  • Create input RDDs (RDDs are like tables in a database) from external data
  • Lazily transform them to define new RDDs (transformed RDDs) using transformations like filter() or map()
  • Cache any intermediate RDDs that will need to be reused
  • Launch actions such as count() and collect() to kick off a parallel computation
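
Put together, the four steps look like the following minimal sketch (the log file name is a placeholder):

        val lines  = sc.textFile("app.log")               // 1. create an input RDD
        val errors = lines.filter(_.startsWith("ERROR"))  // 2. lazily define a transformed RDD
        errors.cache()                                    // 3. cache the intermediate RDD for reuse
        val total  = errors.count()                       // 4. an action kicks off the computation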

Spark Program - Resilient Distributed Datasets (RDD)

  • The Resilient Distributed Dataset, or RDD, is the core concept in the Spark framework
  • Think of an RDD as a table in a database, except that it can hold any type of data, not just structured data
  • Spark stores the data in an RDD across different partitions
  • RDDs are fault tolerant: Spark knows how to re-create and re-compute them
  • RDDs are immutable. You can modify an RDD with a transformation, but the transformation returns a new RDD while the original remains the same

Example (Creating a base RDD using Scala)

        val lines = sc.textFile("README.md")
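
And a transformation such as map() returns a new RDD while leaving lines untouched, which is immutability in action (a small sketch):

        val upper = lines.map(_.toUpperCase)  // a new RDD; lines itself is unchanged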

Transformations:

  • Transformations create a new dataset (a transformed RDD) from an existing one
  • All transformations in Spark are lazy, i.e. they do not compute their results right away; instead, they remember the transformations applied to some base dataset
  • This lineage is used to achieve fault tolerance: if any partition of an RDD is lost, it is automatically recomputed using the transformations that originally created it
  • For reusability, RDDs can be persisted either in memory or on disk using persist(), with storage levels such as MEMORY_ONLY (the default), MEMORY_AND_DISK, and DISK_ONLY
  • map(func), filter(func), union(), distinct(), groupByKey(), and reduceByKey() are some examples of transformations

Example (Creating a transformed RDD using Scala)

        val errors = lines.filter(_.startsWith("ERROR"))
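
Building on this example, the transformed RDD can be persisted with an explicit storage level via persist() (a minimal sketch; MEMORY_AND_DISK is chosen only for illustration):

        import org.apache.spark.storage.StorageLevel

        errors.persist(StorageLevel.MEMORY_AND_DISK)  // keep in memory, spill to disk if needed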

Actions:

  • Actions kick off the actual computation
  • reduce(), collect(), count(), first(), take(), takeSample(), and saveAsTextFile(path) are some examples of actions

Example

        errors.count() // count the number of lines starting with ERROR
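
A few more actions applied to the same errors RDD (a sketch; the output path is a placeholder):

        val firstError = errors.first()      // fetch the first matching line
        val someErrors = errors.take(5)      // fetch the first five matching lines
        errors.saveAsTextFile("errors-out")  // write all matching lines to a directory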

Difference between Hadoop MapReduce and Apache Spark

Spark stores data in memory whereas Hadoop stores data on disk. Hadoop uses replication to achieve fault tolerance, whereas Spark uses a different data storage model, resilient distributed datasets (RDDs), whose lineage-based recovery guarantees fault tolerance while minimizing network I/O.

Is Apache Spark going to replace Hadoop?

Hadoop is a parallel data processing framework that has traditionally been used to run map/reduce jobs: long-running jobs that take minutes or hours to complete. Spark is designed to run on top of Hadoop as an alternative to the traditional batch map/reduce model, one that can also be used for real-time stream processing and for fast interactive queries that finish within seconds. So Hadoop supports both the traditional map/reduce model and Spark.

When and Where to use Spark:

The best use cases for Spark are interactive and iterative data processing, as well as ad hoc analysis of moderately sized data (as big as the cluster's RAM). In that sense, here are some examples:

  1. Iterative Algorithms in Machine Learning
  2. Interactive Data Mining and Data Processing
  3. Stream processing: log processing and fraud detection in live streams for alerts, aggregates, and analysis (see the sketch below)
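
For the stream-processing case, here is a minimal Spark Streaming sketch that counts ERROR lines in each 10-second batch; the app name, master setting, and socket source (host and port) are all assumptions for illustration:

        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        val conf = new SparkConf().setAppName("LogAlerts").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(10))  // 10-second batches
        val logs = ssc.socketTextStream("localhost", 9999)  // lines arriving on a socket
        logs.filter(_.contains("ERROR")).count().print()    // per-batch ERROR count
        ssc.start()                                         // start receiving data
        ssc.awaitTermination()                              // run until stopped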
