Apache YARN - part 3

Sumit V.

Published Mar 26, 2017

If you haven't read my previous two articles on this topic, I would encourage you to read part 1 and part 2. In this article, I want to try to highlight significance of YARNs most important component, the Capacity Scheduler. Two main sources for my learning has been Apache YARN site and Apache Hadoop YARN: Moving beyond Mapreduce.. book. There are some good pointers from Cloudera and Hortonworks resources too.

Capacity Scheduler

Recall the driving factors for evolution of YARN? If not, give a quick round of reading to part 1 of this series. So, once the deployment architecture in an organization evolves, centralized data repositories and shared compute clusters become a need. A successful model for this is for multiple teams, suborganizations, or business units within a single parent organization to come together to pool compute resources and share resources for efficiency. Apache Hadoop started supporting such shared clusters beginning with Hadoop version 0.20. A simple first-in, first-out (FIFO) job scheduler that allowed scheduling for shared clusters but was insufficient to address various emerging use-cases. This led to implementation of Capacity scheduler and Apache Hadoop 2 inherits most of the functionality.

The Capacity scheduler supports sharing of resources between individuals, teams, and suborganizations in an elastic fashion by its queue capacities, minimum user percentages, and limits. It is designed to enable sharing of a single YARN cluster while simultaneously giving each organization guarantees on the allocated capacities. ApplicationMaster knows the nodes where the data blocks are and based on this it negotiates resources and capacity scheduler knows the availability of resources on nodes and these two together make the eco-system dynamic, resource aware and data locality aware.

Hierarchical Queues

The fundamental unit of scheduling in YARN is a queue. A queue is either a logical collection of applications submitted by various users or a composition of more queues. In the Capacity scheduler, each queue typically represents an organization, while the capacity of the queue represents the capacity (of the cluster) that the organization is entitled to use. See the following sample cluster capacity allocation.

Queues are of two types: parent queues and leaf queues. They are represented in a tree structure. Parent queues enable the management of resources across organizations and suborganizations. They can contain more parent queues or leaf queues, they do not themselves accept any application submissions directly. Top most parent is called root, which represents cluster itself. Leaf queues denote the queues that can accept applications and have no more child. Using parent and leaf queues, administrators can do capacity allocations to various organizations and suborganizations. See above picture captured from one of the Hortonworks presentations.

Root Queue, distributes resources among all the parent queues and orchestrates the application submission to child queues. Sorted list of child queues based on current capacity in use is maintained by each parent queue and leaf queues hold list of active applications from users and schedule resources in FIFO manner, while respecting resource allocation limits at user/group level. Capacity scheduler reaches any queue from top down by following the entire path starting from root. So, from the above figure we can deduce the queue path definition that will be used by capacity scheduler in following fashion:

and access control is controlled by following sample configuration on Mrkting queue:

Using similar configurations (yarn.scheduler.capacity.root.Mrkting.capacity = 30) capacity allocation is achieved in the scheduler. No capacity can be more than 100 and leafs that are most under-served are given priority while allocating resources and inside a leaf, resource are allocated to applications FIFO. http://host:8080/cluster/scheduler can be viewed for the current state of scheduler.

In my view, YARN is to be accepted as the most complex and least understood component of Hadoop eco-system. I hope this series of articles on YARN helps us get more familiar with it.

To view or add a comment, sign in

Apache YARN - part 3

Sumit V.

Capacity Scheduler

Hierarchical Queues

More articles by Sumit V.

Others also viewed

Microsoft Azure and Hadoop Ecosystem

Arth - Task3

Sharing Limited Storage of a Slave Node to the Master Node in an HDFS Cluster

🔷Task 4.1 :- In a Hadoop cluster, find how to contribute limited/specific amount of storage as slave to the cluster?

Apache Phoenix - OLTP and operational analytics for Apache Hadoop

CONFIGURE HADOOP AND START CLUSTER SERVICES USING ANSIBLE PLAYBOOK:-

CONFIGURE HADOOP AND START CLUSTER SERVICES USING ANSIBLE PLAYBOOK:-

What is Apache Tez?

YARN & MapR, YARN Requirements and YARN Frameworks

Hadoop 3 is here! What's new?

Explore content categories

Capacity Scheduler

Hierarchical Queues

More articles by Sumit V.

Kafka - Core Components in 10 mins

Apache YARN - part 2

Apache YARN – Part 1

Flume architecture and concepts

Partitioning Hive tables

Hive on Spark vs SparkSQL

Others also viewed

Microsoft Azure and Hadoop Ecosystem

Arth - Task3

Sharing Limited Storage of a Slave Node to the Master Node in an HDFS Cluster

🔷Task 4.1 :- In a Hadoop cluster, find how to contribute limited/specific amount of storage as slave to the cluster?

Apache Phoenix - OLTP and operational analytics for Apache Hadoop

CONFIGURE HADOOP AND START CLUSTER SERVICES USING ANSIBLE PLAYBOOK:-

CONFIGURE HADOOP AND START CLUSTER SERVICES USING ANSIBLE PLAYBOOK:-

What is Apache Tez?

YARN & MapR, YARN Requirements and YARN Frameworks

Hadoop 3 is here! What's new?

Explore content categories