Introduction to Apache Spark
Spark is an open-source, in-memory, massively parallel execution environment for running analytics applications. The other common framework for running analytics applications and processing big data is MapReduce. MapReduce follows a split-apply-combine strategy for data analysis and stores the split intermediate data on the disks of a cluster. Spark, by contrast, keeps data in memory, which makes it much faster for this kind of parallel distributed processing. Spark can be thought of as an in-memory layer sitting on top of a huge data store, from which data is loaded into memory and processed in parallel across the cluster. Like MapReduce, Spark distributes the data across the cluster and performs parallel processing and analysis, but as noted above the data resides in memory rather than on disk, which gives it its speed.
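As a minimal sketch of this in-memory, parallel model, the following PySpark snippet distributes a small collection across partitions and aggregates it in parallel (the input numbers, app name, and local master setting are illustrative placeholders, not part of any real deployment):

```python
from pyspark.sql import SparkSession

# Start a local Spark session; on a real cluster the master would point
# at YARN, Mesos, or a standalone Spark cluster manager instead.
spark = SparkSession.builder.master("local[*]").appName("intro-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a collection across the cluster as in-memory partitions.
numbers = sc.parallelize(range(1, 1001), numSlices=4)

# Each partition is transformed in parallel; the result is combined on the driver.
total = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)

spark.stop()
```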
Features
- Speed → Due to in-memory processing
- Caching → Spark has a caching layer to keep data in memory, which makes repeated processing even faster (see the sketch after this list)
- Deployment → Can be deployed on a Hadoop cluster or on its own standalone Spark cluster
- Polyglot → Code can be written in Python, Java, Scala, and R
- Real-time → It was developed primarily with real-time processing of data in mind, and hence supports it
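To make the caching point concrete, here is a hedged sketch: calling cache() (or persist()) keeps an RDD in memory after the first action computes it, so later actions reuse it instead of recomputing from the source (the file path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical input file; any text file on a supported store would do.
lines = sc.textFile("data/events.log")
errors = lines.filter(lambda line: "ERROR" in line)

# Keep the filtered RDD in memory once the first action computes it.
errors.cache()

print(errors.count())                                    # first action: reads the file and fills the cache
print(errors.filter(lambda l: "timeout" in l).count())   # served from the cached RDD
```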
Some key differences between Apache Spark and MapReduce, the two most common big data processing frameworks:
- Spark processes data in memory, while MapReduce stores intermediate results on disk, which makes Spark significantly faster.
- Spark supports real-time stream processing in addition to batch processing; MapReduce is batch-oriented.
- Spark provides a caching layer so intermediate data can be reused; MapReduce rereads data from disk between stages.
Spark Ecosystem
- Engine - Spark Core, the basic core component of the Spark ecosystem on top of which the entire ecosystem is built. It performs scheduling, monitoring, and basic I/O.
- Management - A Spark cluster can be managed by Hadoop YARN, Mesos, or the standalone Spark cluster manager.
- Libraries - The ecosystem comprises Spark SQL (for running SQL-like queries on RDDs or data from external sources; see the sketch after this list), Spark MLlib (for machine learning), Spark GraphX (for building and analysing graphs), and Spark Streaming (for batch processing and streaming of data in the same application).
- Programming - Code can be written in Python, Java, Scala, and R.
- Storage - Data can be stored in HDFS, S3, or local storage, and both SQL and NoSQL databases are supported.
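As a hedged sketch of the Spark SQL library mentioned above, the following builds a small DataFrame and queries it with plain SQL (the table name, column names, and rows are made up purely for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

# Hypothetical in-memory rows; in practice this could come from HDFS, S3, etc.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    ["name", "age"],
)

# Register the DataFrame as a temporary view and query it with SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```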
RDDs
The main abstraction Spark provides is the resilient distributed dataset (RDD): a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist or cache an RDD in memory, allowing it to be reused efficiently across parallel operations.

RDDs automatically recover from node failures, and their immutability plays a great role in this self-recovery. Because an RDD is immutable, the sequence of transformations used to generate it can be stored as a lineage in the form of a DAG, and the data sets are replicated on multiple nodes. So if a node processing a partition of the data set fails, the cluster manager can assign the processing to some other node, which uses the lineage of operations (in the DAG) to recover the data it was working on.

RDDs are schema-less data structures that can handle both structured and unstructured data. All operations are performed on RDDs, transforming one RDD into another and ultimately writing results to persistent storage. An RDD is an immutable distributed collection of objects, where the objects can even belong to user-defined classes. RDDs support lazy evaluation: transformations produce new RDDs, but nothing is computed until an action is performed; actions produce the results.
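The following hedged sketch shows the transformation/action split and lazy evaluation described above; the lineage is only executed when the action collect() is called (the input words are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "rdd", "spark", "lazy", "rdd", "spark"])

# Transformations only build up the lineage (DAG); no work happens yet.
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# The action triggers evaluation of the whole lineage and returns results.
print(counts.collect())   # e.g. [('spark', 3), ('rdd', 2), ('lazy', 1)]

spark.stop()
```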
Spark works primarily in three modes:
1. Batch mode - A job is scheduled, and a queue is used to run batches of jobs without manual intervention.
2. Stream mode - The program runs continuously and processes data as the stream arrives (see the sketch after this list).
3. Interactive mode - The user executes commands on a shell, primarily for development purposes.
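As a hedged illustration of stream mode, the classic word-count example from the Spark documentation reads a text stream from a network socket with Spark Streaming and counts words in each micro-batch (the host and port are placeholders):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "stream-demo")
ssc = StreamingContext(sc, batchDuration=1)  # micro-batches of one second

# Placeholder source: a text stream served on localhost:9999 (e.g. via netcat).
lines = ssc.socketTextStream("localhost", 9999)

# Count words in each micro-batch as the data arrives.
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()             # begin processing the stream
ssc.awaitTermination()  # run until stopped
```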
Spark Architecture
- Spark works in a master-slave architecture.
- The driver node contains the driver program, which holds the Spark context.
- The driver program implicitly converts the user-submitted program into a DAG (directed acyclic graph).
- The Spark context, with the help of the cluster manager, allocates the tasks (the executable parts of the job) to the worker nodes.
- The cluster manager allocates the resources the driver needs to execute the job.
- Worker nodes contain executors, which are where the tasks are executed; the results are returned back to the context (see the sketch after this list).
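A hedged sketch of how this looks from the driver's side: building the session creates the Spark context, which asks the cluster manager for executors to run the tasks on (the master URL and resource numbers below are hypothetical placeholders):

```python
from pyspark.sql import SparkSession

# The driver program: the session holds the Spark context, which asks the
# cluster manager (here YARN, as an example) for executor resources.
spark = (SparkSession.builder
         .appName("architecture-demo")
         .master("yarn")                           # placeholder cluster manager
         .config("spark.executor.instances", "4")  # hypothetical resource ask
         .config("spark.executor.cores", "2")
         .config("spark.executor.memory", "2g")
         .getOrCreate())

# An action is broken into a DAG of stages and tasks; the tasks run on the
# executors, and the result comes back to this driver.
print(spark.sparkContext.parallelize(range(100)).sum())

spark.stop()
```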
Summary
Here we talked about:
- Features of Spark
- Spark and MapReduce differences
- RDDs
- Spark ecosystem
- Spark architecture