How to Write Spark Applications in Python
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Google presented this model in a paper in 2004, and in the world of big data it made processing petabytes of data seem easy. Later, a 2010 paper by Matei Zaharia introduced Spark, which became a top-level Apache project in 2014.
What is Apache Spark?
Spark is a distributed, in-memory computational framework.
It aims to provide a single platform that can be used for real-time applications (streaming), complex analysis (machine learning), interactive queries (shell), and batch processing (Hadoop integration).
- It has specialized modules that run on top of Spark – SQL, Streaming, Graph Processing (GraphX), and Machine Learning (MLlib).
- Spark introduces an abstract common data format used for efficient data sharing across parallel computations – the RDD.
- Spark supports the Map/Reduce programming model. (Note: not the same as Hadoop MapReduce.)
Spark Application Building Blocks
Spark Context
The first thing a Spark program does is create a SparkContext object, which tells Spark how to access a cluster.
Spark Master
- Connects to a cluster manager which allocates resources across applications.
- Acquires executors on cluster nodes – worker processes to run computations and store data.
- Sends app code to the executors.
- Sends tasks for the executors to run.
RDD
Resilient Distributed Datasets (RDD) are the primary abstraction in Spark – a fault-tolerant collection of elements that can be operated on in parallel.
There are currently two types:
- Parallelized collections – take an existing collection in the driver program (e.g., a Scala or Python collection) and run functions on it in parallel.
- Hadoop datasets – run functions on each record of a file in the Hadoop Distributed File System (HDFS) or any other storage system supported by Hadoop.
Two types of operations can be performed on RDDs – transformations and actions.
- Transformations are lazy (not computed immediately).
- The transformed RDD gets recomputed when an action is run on it.
- However, an RDD can be persisted in memory or on disk to avoid recomputation.
Further Reading