Introduction to Apache Spark

Apache Spark is a distributed computing framework.

Before going into Apache Spark, let us understand the challenges with MapReduce and why Apache Spark came into existence.

Challenges of MapReduce

  • It is very hard to write code in MapReduce.
  • It supports only batch processing; to do streaming, we need other integrations.
  • There are a lot of disk I/Os, which degrades performance.
  • We have to learn many other frameworks, such as Hive, Pig, Sqoop, etc.
  • It offers only two operations: Map and Reduce.
  • There is no interactive mode.

What is Apache Spark?

Apache Spark is a plug-and-play compute engine: it does not ship with its own storage layer or resource manager.

  • We can plug in the storage of our choice, such as HDFS, Amazon S3, ADLS Gen2, Google Cloud Storage, local storage, etc.
  • We can plug in the resource manager of our choice, such as YARN, Mesos, Kubernetes, etc.
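The pluggability above shows up directly on the command line. A rough sketch (the application name `app.py`, bucket and account names are hypothetical placeholders, not from the original article):

```shell
# Same application, different resource managers - only --master changes.
spark-submit --master yarn app.py                            # YARN
spark-submit --master k8s://https://<api-server>:6443 app.py # Kubernetes
spark-submit --master "local[*]" app.py                      # local machine

# Storage is equally pluggable: the input path's scheme picks the backend.
#   hdfs:///data/input      (HDFS)
#   s3a://bucket/input      (Amazon S3)
#   abfss://container@account.dfs.core.windows.net/input  (ADLS Gen2)
```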

A common misconception is that Spark is an alternative to Hadoop. It is not: Apache Spark is an alternative to MapReduce within the Hadoop ecosystem.

It is a general-purpose, in-memory compute engine.

General Purpose

  • There is no need to write separate code for batch processing and streaming. We can use SQL style, DataFrame style, etc., and do querying, streaming and cleaning without other tools.

In-Memory

  • In MapReduce, a chain of 10 jobs causes 20 disk I/Os: MapReduce1 reads its input from HDFS and writes its output to HDFS, MapReduce2 reads that output from HDFS as its input, and so on.
  • In Spark, the same chain can need as few as 2 disk I/Os: computation happens in memory, and only the final result is written to disk.

Compute Engine

  • It is mainly used for the computation or processing of distributed data.

Apache Spark vs Databricks

Databricks is a company, and its product is also called Databricks.

Internally it runs Spark, but it adds extra features such as:

  • Provides Apache Spark on the cloud (AWS, GCP, Azure).
  • Provides an optimized Spark environment.
  • Provides cluster management.
  • Supports the Delta Lake architecture.
  • Allows collaboration on notebooks.
  • Provides built-in security.

Apache Spark provides mainly two types of APIs:

  • Spark Core APIs - we work at the RDD level.
  • Higher-level APIs - we can write code in DataFrame, Spark SQL and Spark Table style. They also support Structured Streaming, MLlib and GraphX.

RDDs

  • RDD stands for Resilient Distributed Dataset.
  • It is the basic unit that holds data in Apache Spark.
  • RDDs are resilient to failures: a lost RDD can be quickly regenerated from its parent RDD.
  • RDDs are immutable. We cannot modify an existing RDD; every transformation creates a new RDD.

Drawbacks of RDD

  • It has no schema; it is just raw data distributed across partitions.
  • It is not persistent: it exists only within a session. Once the session is closed, it is gone.

DataFrames

  • A distributed collection of data grouped into named columns.
  • A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession.

Spark SQL

  • Spark SQL is a Spark module for structured data processing.
  • Spark SQL provides Spark with more information about the structure of both the data and the computation being performed.

Spark Table

  • A Spark table is persistent.
  • It is accessible across other sessions.

Transformations, Actions and Utility Functions

Transformations - These are the operations which are applied on RDD to create a new RDD.

There are two types of Transformations:

  • Narrow Transformations - the computed data lives on a single partition, so no shuffling or data movement across partitions is needed to execute them. e.g. map(), filter()
  • Wide Transformations - the computed data lives on multiple partitions, so data must be shuffled (moved across partitions) to execute them. e.g. groupByKey(), reduceByKey()

Actions - These are the operations applied on an RDD that instruct Apache Spark to perform the computation and send the result back to the driver.

eg: reduce(), count()

Utility Functions - These are built-in functions available for common operations.

eg: cache(), printSchema()

Why are Transformations Lazy?

Consider a file file1 in HDFS with more than 10 billion records, on which we perform these operations:

  • rdd1 = load file1 from hdfs
  • print first line from the above rdd1

If transformations were not lazy, all 10 billion records would be loaded into memory just to display a single record. Because they are lazy, Spark waits until an action is called and then reads only as much data as it needs.

SparkSession is the entry point for any Spark program. Before Spark 2, we had:

  • SparkContext
  • HiveContext
  • SQLContext

But from Spark 2 onwards, these have been bundled under one umbrella: SparkSession.


Credit: Sumit Mittal sir
