How to Write Spark Applications in Python
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Google presented this model in a paper in 2004, and in the world of big data it made processing petabytes of data seem easy. Later, a 2010 paper by Matei Zaharia introduced Spark, which became a top-level Apache project in 2014.
What is Apache Spark?
Spark is a distributed, in-memory computational framework.
It aims to provide a single platform that can be used for real-time applications (streaming), complex analysis (machine learning), interactive queries (shell), and batch processing (Hadoop integration).
- It has specialized modules that run on top of Spark – SQL, Streaming, Graph Processing (GraphX), and Machine Learning (MLlib).
- Spark introduces an abstract common data format used for efficient data sharing across parallel computations – the RDD.
- Spark supports the Map/Reduce programming model. (Note: not the same as Hadoop MapReduce.)
Spark Application Building Blocks
Spark Context
The first thing a Spark program does is create a SparkContext object, which tells Spark how to access a cluster.
Spark Master
- Connects to a cluster manager which allocates resources across applications.
- Acquires executors on cluster nodes – worker processes to run computations and store data.
- Sends app code to the executors.
- Sends tasks for the executors to run.
RDD
Resilient Distributed Datasets (RDD) are the primary abstraction in Spark – a fault-tolerant collection of elements that can be operated on in parallel.
There are currently two types:
- Parallelized collections – take an existing collection in the driver program (e.g., a Scala or Python collection) and run functions on it in parallel.
- Hadoop datasets – run functions on each record of a file in the Hadoop Distributed File System (HDFS) or any other storage system supported by Hadoop.
Two types of operations can be performed on RDDs – transformations and actions.
- Transformations are lazy (not computed immediately).
- The transformed RDD gets recomputed when an action is run on it.
- However, an RDD can be persisted in memory or on disk to avoid recomputation.
Further Reading