Apache Spark 101

Why Spark?

The MapReduce framework developed by Google was primarily designed for distributed sorting, pattern-based searching, web access log statistics, web indexing, and similar workloads. It served those well, but it falls short on a variety of other applications due to the following limitations:

  1. Suitable only for batch jobs, not interactive ones
  2. Writes to and reads from disk at every stage
  3. Not good for iterative processing, due to the huge disk space consumed by each job
  4. Not good for complex algorithms such as machine learning
  5. High latency: applications that require low latency or random access to a large data set are not feasible

Impact of MapReduce limitations:

To address these limitations, many specialized systems came into play: Impala for interactive and random access, Mahout for machine learning, Giraph for iterative graph processing, Storm for real-time data processing, and so on. The idea behind Spark is to address those limitations of MapReduce in a single engine, while also providing wider language support.

What is Spark?

Spark is a general-purpose compute engine that allows you to run batch, interactive, and streaming jobs on the cluster using the same unified framework.

Key Distinctions between Spark and MapReduce:

  • Handles batch, interactive, and real-time workloads within a single framework
  • In-memory storage
  • Rich APIs in Java, Scala, and Python enable programming at a higher level of abstraction, including map/reduce
  • Easier to implement complex algorithms like machine learning, as it supports general programming constructs, not just map/reduce
  • Speed: up to 100 times faster than MapReduce for in-memory workloads
  • Ease of use and flexibility over MapReduce: because Spark is very API centric, code volume can be reduced by a factor of 5 or more, as the word count sketch below shows
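
To make the code-volume point concrete, here is a minimal word count sketch in Scala (sc is the SparkContext provided by the Spark shell; the input and output paths are placeholders):

        val counts = sc.textFile("input.txt")   // read the file into an RDD of lines
          .flatMap(line => line.split(" "))     // split each line into words
          .map(word => (word, 1))               // pair each word with a count of 1
          .reduceByKey(_ + _)                   // sum the counts per word
        counts.saveAsTextFile("wordcounts")     // write the results out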

Key Concepts

Life cycle Of A Spark Program

  • Create input RDDs (RDDs are like tables in a database) from external data
  • Lazily transform them to define new RDDs (transformed RDDs) using transformations like filter() or map()
  • Cache any intermediate RDDs that will need to be reused
  • Launch actions such as count() and collect() to kick off a parallel computation
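
Put together, the four steps look like the following minimal sketch (the log file name is a placeholder):

        val lines  = sc.textFile("app.log")               // 1. create an input RDD
        val errors = lines.filter(_.startsWith("ERROR"))  // 2. lazily define a transformed RDD
        errors.cache()                                    // 3. cache the intermediate RDD for reuse
        val total  = errors.count()                       // 4. an action kicks off the computation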

Spark Program - Resilient Distributed Datasets (RDD)

  • The Resilient Distributed Dataset, or RDD, is the core concept in the Spark framework
  • Think of an RDD as a table in a database, except that it can hold any type of data, not just structured data
  • Spark stores the data in an RDD across different partitions
  • RDDs are fault tolerant: Spark knows how to re-create and re-compute them
  • RDDs are immutable. You can modify an RDD with a transformation, but the transformation returns a new RDD while the original remains the same

Example (Creating a base RDD using Scala)

        val lines = sc.textFile("README.md")
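
And a transformation such as map() returns a new RDD while leaving lines untouched, which is immutability in action (a small sketch):

        val upper = lines.map(_.toUpperCase)  // a new RDD; lines itself is unchanged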

Transformations:

  • Transformations create a new dataset (a transformed RDD) from an existing one
  • All transformations in Spark are lazy, i.e. they do not compute their results right away; instead, they remember the transformations applied to some base dataset
  • This lineage is used to achieve fault tolerance: if any partition of an RDD is lost, it is automatically recomputed using the transformations that originally created it
  • For reusability, RDDs can be persisted either in memory or on disk using persist(), with storage levels such as MEMORY_ONLY (the default), MEMORY_AND_DISK, and DISK_ONLY
  • map(func), filter(func), union(), distinct(), groupByKey(), and reduceByKey() are some examples of transformations

Example (Creating a transformed RDD using Scala)

        val errors = lines.filter(_.startsWith("ERROR"))
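
Building on this example, the transformed RDD can be persisted with an explicit storage level via persist() (a minimal sketch; MEMORY_AND_DISK is chosen only for illustration):

        import org.apache.spark.storage.StorageLevel

        errors.persist(StorageLevel.MEMORY_AND_DISK)  // keep in memory, spill to disk if needed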

Actions:

  • Actions kick off the actual computation
  • reduce(), collect(), count(), first(), take(), takeSample(), and saveAsTextFile(path) are some examples of actions

Example

        errors.count() // count the number of lines starting with ERROR
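
A few more actions applied to the same errors RDD (a sketch; the output path is a placeholder):

        val firstError = errors.first()      // fetch the first matching line
        val someErrors = errors.take(5)      // fetch the first five matching lines
        errors.saveAsTextFile("errors-out")  // write all matching lines to a directory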

Difference between Hadoop MapReduce and Apache Spark

Spark stores data in memory whereas Hadoop stores data on disk. Hadoop uses replication to achieve fault tolerance, whereas Spark uses a different data storage model, resilient distributed datasets (RDDs), whose lineage-based recovery guarantees fault tolerance while minimizing network I/O.

Is Apache Spark going to replace Hadoop?

Hadoop is a parallel data processing framework that has traditionally been used to run map/reduce jobs: long-running jobs that take minutes or hours to complete. Spark is designed to run on top of Hadoop as an alternative to the traditional batch map/reduce model, one that can also be used for real-time stream processing and for fast interactive queries that finish within seconds. So Hadoop supports both the traditional map/reduce model and Spark.

When and Where to use Spark:

The best use cases for Spark are interactive and iterative data processing, as well as ad hoc analysis of moderately sized data (as big as the cluster's RAM). In that sense, here are some examples:

  1. Iterative Algorithms in Machine Learning
  2. Interactive Data Mining and Data Processing
  3. Stream processing: log processing and fraud detection in live streams for alerts, aggregates, and analysis (see the sketch below)
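
For the stream-processing case, here is a minimal Spark Streaming sketch that counts ERROR lines in each 10-second batch; the app name, master setting, and socket source (host and port) are all assumptions for illustration:

        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        val conf = new SparkConf().setAppName("LogAlerts").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(10))  // 10-second batches
        val logs = ssc.socketTextStream("localhost", 9999)  // lines arriving on a socket
        logs.filter(_.contains("ERROR")).count().print()    // per-batch ERROR count
        ssc.start()                                         // start receiving data
        ssc.awaitTermination()                              // run until stopped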
