Directed Acyclic Graph (DAG) in Apache Spark

What is a DAG in Apache Spark?

A DAG (Directed Acyclic Graph) in Apache Spark is a set of vertices and edges, where the vertices represent RDDs and the edges represent the operations to be applied to those RDDs. In a Spark DAG, every edge is directed from earlier to later in the sequence. When an action is called, the DAG that has been built up is submitted to the DAG Scheduler, which splits the graph into stages of tasks.
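As a minimal sketch of this (the app name, master URL, and sample data below are illustrative, not from the original article), each transformation adds a vertex and an edge to the graph, and only the final action triggers submission to the DAG Scheduler:

    // A minimal sketch; app name, master URL, and data are illustrative.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("dag-demo").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Each transformation adds a vertex (an RDD) and an edge (an operation)
    // to the graph; nothing runs yet, because transformations are lazy.
    val lines  = sc.parallelize(Seq("spark builds a dag", "a dag has no cycles"))
    val words  = lines.flatMap(_.split(" "))  // edge: flatMap
    val pairs  = words.map(w => (w, 1))       // edge: map
    val counts = pairs.reduceByKey(_ + _)     // edge: reduceByKey (shuffle)

    // Only this action submits the accumulated DAG to the DAG Scheduler.
    counts.collect().foreach(println)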

A DAG is a finite directed graph with no directed cycles: there are finitely many vertices and edges, each edge directed from one vertex to another, and the vertices can be ordered so that every edge points from earlier to later in the sequence. The DAG model is a strict generalization of the MapReduce model. Because Spark sees the whole graph of operations, it can perform global optimizations that systems like MapReduce, which see only one step at a time, cannot. The value of the DAG becomes clearer in more complex jobs.

Apache Spark's DAG visualization allows the user to dive into any stage and expand its details. In the stage view, the details of all RDDs belonging to that stage are shown. The scheduler splits the Spark job into stages based on the various transformations applied (you can refer to this link to learn about RDD transformations and actions in detail). Each stage is comprised of tasks, one per partition of the RDD, which perform the same computation in parallel. "Graph" here refers to the navigation of the computation, while "directed" and "acyclic" refer to how it is carried out.
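One way to see these stage boundaries, as a rough sketch (assuming a spark-shell session, where sc is predefined, and a made-up input.txt path), is to print an RDD's lineage with toDebugString:

    // spark-shell predefines sc; "input.txt" is a hypothetical path.
    val counts = sc.textFile("input.txt")
      .flatMap(_.split(" "))   // narrow transformation: same stage
      .map(word => (word, 1))  // narrow: pipelined with the flatMap
      .reduceByKey(_ + _)      // wide: shuffle, so a new stage begins here

    // Each indentation level in the output marks a stage boundary.
    println(counts.toDebugString)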

Need for a Directed Acyclic Graph in Spark

The limitations of Hadoop MapReduce were a key reason for introducing the DAG in Spark. Computation in MapReduce is carried out in three steps:

The data is read from HDFS.

Map and Reduce operations are applied.

The computed result is written back to HDFS.

Each MapReduce job is independent of the others, and Hadoop has no knowledge of which MapReduce job will come next. For iterative workloads, it is often unnecessary to write the intermediate result to stable storage and read it back between two MapReduce jobs. In such cases, the storage in HDFS and the disk I/O are simply wasted.

In a multi-step pipeline, each job is blocked from starting until the previous job has completed. As a result, a complex computation can take a long time even on a small data volume.
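To make this cost concrete, here is a small Spark sketch of an iterative loop (the data and the update rule are invented for illustration; sc is assumed from a spark-shell). Expressed as chained MapReduce jobs, every pass would block on an HDFS write and re-read; in Spark the whole loop folds into one lineage and intermediates stay in executor memory:

    var ranks = sc.parallelize(Seq(("a", 1.0), ("b", 1.0)))
    for (_ <- 1 to 10) {                            // 10 passes, chosen arbitrarily
      ranks = ranks.mapValues(v => v * 0.85 + 0.15) // placeholder update rule
    }
    println(ranks.collect().mkString(", "))         // one action drives all 10 passes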

In Spark, by contrast, a DAG (Directed Acyclic Graph) of consecutive computation stages is formed. This allows the execution plan to be optimized, e.g. to minimize the shuffling of data. In MapReduce, such optimization must be done manually by tuning each MapReduce step.
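For instance, in this toy sketch (the data is made up), reduceByKey lets Spark pre-aggregate values within each partition before the shuffle, while the equivalent groupByKey formulation ships every record across the network; seeing the whole DAG is what lets Spark exploit the cheaper plan:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // groupByKey shuffles every (key, value) record across the network...
    val viaGroup  = pairs.groupByKey().mapValues(_.sum)

    // ...while reduceByKey combines values map-side first, so far less
    // data moves through the shuffle. Both yield the same result.
    val viaReduce = pairs.reduceByKey(_ + _)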

How does the DAG work in Spark?

The interpreter is the first layer: Spark uses a Scala interpreter to interpret your code, with some modifications.

Spark creates an operator graph as you enter your code in the Spark console.

When an action is called on a Spark RDD, Spark submits the operator graph to the DAG Scheduler.

Operators are divided into stages of tasks in the DAG Scheduler. A stage contains tasks based on the partitions of the input data. The DAG Scheduler pipelines operators together; for example, several map operators can be scheduled in a single stage (see the sketch after these steps).

The stages are passed on to the Task Scheduler, which launches the tasks through the cluster manager. The Task Scheduler knows nothing about the dependencies between stages.

The workers execute the tasks on the slave nodes.
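Putting the steps together, here is a small sketch (spark-shell assumed, with sc predefined; the numbers are arbitrary) whose final action triggers the full flow above: operator graph, stage split in the DAG Scheduler, and per-partition tasks launched by the Task Scheduler on the workers:

    val result = sc.parallelize(1 to 100, 4) // 4 partitions => 4 tasks per stage
      .map(n => (n % 10, n))                 // narrow: pipelined into one stage
      .filter { case (_, v) => v > 5 }       // narrow: same stage as the map
      .reduceByKey(_ + _)                    // wide: shuffle => a second stage

    result.count()                           // action: the DAG is submitted here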

See Also:

How does Apache Spark work?

Apache Spark vs. Hadoop MapReduce


