Introduction to Apache Spark

Apache Spark is a distributed computing framework.

Before going into Apache Spark, let us understand the challenges with MapReduce and why Apache Spark came into existence.

Challenges of MapReduce

  • It is very hard to write code in MapReduce.
  • It supports only batch processing; to do streaming, we need other integrations.
  • There are a lot of disk I/Os, which degrades performance.
  • We have to learn many other frameworks, such as Hive, Pig, Sqoop, etc.
  • It offers only two operations: Map and Reduce.
  • There is no interactive mode.

What is Apache Spark?

Apache Spark is a plug-and-play compute engine: it does not ship with its own storage layer or resource manager.

  • We can plug in the storage of our choice, such as HDFS, Amazon S3, ADLS Gen2, Google Cloud Storage, local storage, etc.
  • We can plug in the resource manager of our choice, such as YARN, Mesos, Kubernetes, etc.
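The pluggability above shows up directly on the command line. A rough sketch (the application name `app.py`, bucket and account names are hypothetical placeholders, not from the original article):

```shell
# Same application, different resource managers - only --master changes.
spark-submit --master yarn app.py                            # YARN
spark-submit --master k8s://https://<api-server>:6443 app.py # Kubernetes
spark-submit --master "local[*]" app.py                      # local machine

# Storage is equally pluggable: the input path's scheme picks the backend.
#   hdfs:///data/input      (HDFS)
#   s3a://bucket/input      (Amazon S3)
#   abfss://container@account.dfs.core.windows.net/input  (ADLS Gen2)
```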

A common misconception is that Spark is an alternative to Hadoop. It is not: Apache Spark is an alternative to MapReduce within the Hadoop ecosystem.

It is a general-purpose, in-memory compute engine.

General Purpose

  • There is no need to write separate code for batch processing and streaming. We can use SQL style, DataFrame style, etc., and do querying, streaming and cleaning without other tools.

In-Memory

  • In MapReduce, a chain of 10 jobs causes 20 disk I/Os: MapReduce1 reads its input from HDFS and writes its output to HDFS, MapReduce2 reads that output from HDFS as its input, and so on.
  • In Spark, the same chain can need as few as 2 disk I/Os: computation happens in memory, and only the final result is written to disk.

Compute Engine

  • It is mainly used for the computation or processing of distributed data.

Apache Spark vs Databricks

Databricks is a company, and its product is also called Databricks.

Internally it runs Spark, but it adds extra features such as:

  • Provides Apache Spark on the cloud (AWS, GCP, Azure).
  • Provides an optimized Spark environment.
  • Provides cluster management.
  • Supports the Delta Lake architecture.
  • Allows collaboration on notebooks.
  • Provides built-in security.

Apache Spark provides mainly two types of APIs:

  • Spark Core APIs - we work at the RDD level.
  • Higher-level APIs - we can write code in DataFrame, Spark SQL and Spark Table style. They also support Structured Streaming, MLlib and GraphX.

RDDs

  • RDD stands for Resilient Distributed Dataset.
  • It is the basic unit that holds data in Apache Spark.
  • RDDs are resilient to failures: a lost RDD can be quickly regenerated from its parent RDD.
  • RDDs are immutable. We cannot modify an existing RDD; every transformation creates a new RDD.

Drawbacks of RDD

  • It has no schema; it is just raw data distributed across partitions.
  • It is not persistent: it exists only within a session. Once the session is closed, it is gone.

DataFrames

  • A distributed collection of data grouped into named columns.
  • A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession.

Spark SQL

  • Spark SQL is a Spark module for structured data processing.
  • Spark SQL provides Spark with more information about the structure of both the data and the computation being performed.

Spark Table

  • A Spark table is persistent.
  • It is accessible across other sessions.

Transformations, Actions and Utility Functions

Transformations - These are the operations which are applied on RDD to create a new RDD.

There are two types of Transformations:

  • Narrow Transformations - the computed data lives on a single partition, so no shuffling or data movement across partitions is needed to execute them. e.g. map(), filter()
  • Wide Transformations - the computed data lives on multiple partitions, so data must be shuffled (moved across partitions) to execute them. e.g. groupByKey(), reduceByKey()

Actions - These are the operations applied on an RDD that instruct Apache Spark to perform the computation and send the result back to the driver.

eg: reduce(), count()

Utility Functions - These are built-in functions available for common operations.

eg: cache(), printSchema()

Why are Transformations Lazy?

Consider a file file1 in HDFS with more than 10 billion records, on which we perform these operations:

  • rdd1 = load file1 from hdfs
  • print first line from the above rdd1

If transformations were not lazy, all 10 billion records would be loaded into memory just to display a single record. Because they are lazy, Spark waits until an action is called and then reads only as much data as it needs.

SparkSession is the entry point for any Spark program. Before Spark 2, we had:

  • SparkContext
  • HiveContext
  • SQLContext

But from Spark 2 onwards, these have been bundled under one umbrella: SparkSession.


Credit: Sumit Mittal sir
