Creating a Standalone Spark Cluster on a Single Windows Machine

One often feels the need to spin up a cluster-like environment quickly, as and when required. In many organizations, only a single shared environment (QA/Test/Prod) is available where one can go and try things out, and you certainly do not want to disturb the production environment just to test development code. Working with an existing cluster is rarely easy, either because of access restrictions imposed by the organization or because the cluster is shared, be it production or test. You may have come across situations where your job, perhaps just a couple of lines of code written to test some operation, gets stuck or fails because other jobs have already taken up all of the cluster's resources.

OK, so to set the context straight: here we are not concerned about data volume, but about the correctness of the code written for a big data solution.

So, the following are some options to address this problem:

  1. Creating a cluster using multiple VMs
  2. Creating a cluster using multiple Docker containers
  3. Creating multiple workers within the same machine

Most of us are quite familiar with the first two options; they are well-known solutions. However, setting up a cluster with them takes a fair amount of effort, and there are a couple of challenges as well:

  • It takes a lot of time for installation and configuration
  • It requires establishing connectivity between the instances by opening the required ports and setting up password-less SSH between them
  • It requires DevOps expertise

Those are certainly the more prominent ways to set up a cluster; however, to test a sample application or a couple of lines of Spark code, we need something much quicker.

So, in this article I am going to talk about the third option, which is not so well known in the developer community but is a handy, quick-and-dirty way to set up a cluster on a single machine and test things.

To make this possible, Spark ships with a launcher script called “spark-class” as part of its installation. It is located under the %SPARK_HOME%\bin folder.

How it works: using this utility, you can start the master and worker instances separately. When you start the master, you don’t have to provide any resource information; however, when starting a worker you should specify its resource limits.

Note: the following assumes that you have a reasonable amount of resources available on your PC (say, RAM >= 8 GB and 4 or more cores), though that’s not an absolute necessity.

Starting Master

spark-class org.apache.spark.deploy.master.Master

This is as straightforward as it looks. If needed, you can specify the host and port by passing the -h (or --host) and -p (or --port) parameters along with it.
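For example (the host name and port values below are purely illustrative, adjust them to your machine):

spark-class org.apache.spark.deploy.master.Master --host localhost --port 7077 --webui-port 8080

On startup, the master logs a URL of the form spark://[HostName]:7077; this is the URL the workers and the client application will connect to.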

Starting Worker

1st Instance - spark-class org.apache.spark.deploy.worker.Worker spark://[HostName]:7077 -c 1 -m 500mb

2nd Instance - spark-class org.apache.spark.deploy.worker.Worker spark://[HostName]:7077 -c 1 -m 500mb

This starts two worker instances with the same configuration: 1 core and 500mb of executor memory each. The important point is that your client application must not request more than these per-worker limits; if it does, Spark may not be able to launch executor processes inside any of these instances.
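Once both workers are up, you can sanity-check the setup from the master web UI, which listens on port 8080 by default (http://localhost:8080). Both workers should be listed there as ALIVE, each showing 1 core and 500 MB of memory.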

Client Application

spark-shell --master spark://[HostName]:7077 --executor-memory 500mb --executor-cores 1

Notice that in the above command you need to specify the executor requirements for the client application; if you don’t, spark-shell may start without actually using the worker instances. So keep the configuration of the already running workers in mind before starting the client application.
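Once the shell is up, a quick sanity check (any small job will do; the snippet below is just a minimal example) confirms that work is actually being scheduled on the workers you started:

// run these lines inside the spark-shell started above
sc.master                                // should print spark://[HostName]:7077
val data = sc.parallelize(1 to 1000, 2)  // two partitions, one per worker core
data.sum()                               // should return 500500.0

You can also refresh the master web UI while the job runs to see the executors appear under the running application.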

The following configuration options can be passed to the master and worker:
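(As listed in the Spark standalone mode documentation; the exact set may vary slightly between Spark versions.)

  • -h HOST, --host HOST: hostname to listen on
  • -p PORT, --port PORT: port to listen on (default: 7077 for the master, random for the worker)
  • --webui-port PORT: port for the web UI (default: 8080 for the master, 8081 for the worker)
  • -c CORES, --cores CORES: total CPU cores the worker may use (worker only)
  • -m MEM, --memory MEM: total memory the worker may use, e.g. 1000M or 2G (worker only)
  • -d DIR, --work-dir DIR: directory for scratch space and job output logs (worker only)
  • --properties-file FILE: path to a custom Spark properties file (default: conf/spark-defaults.conf)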

Problem Seen

One of the problems I faced while doing the above was the following exception, which kept appearing even after I changed the resource limits:

“java.lang.IllegalArgumentException: System memory <x> must be at least <y>. Please increase heap size using the --driver-memory option or spark.driver.memory in Spark configuration.”

Solution

Just add the following line to %SPARK_HOME%\conf\spark-defaults.conf:

spark.testing.memory                          512000000
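If you prefer not to touch the conf file, the same property can also be passed per invocation (an alternative sketch, not what I used in my own testing):

spark-shell --master spark://[HostName]:7077 --conf spark.testing.memory=512000000 --executor-memory 500mb --executor-cores 1

spark.testing.memory overrides the amount of memory Spark believes is available when it performs its minimum-memory check, which is why it silences the exception above.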


In this article, I haven’t changed the resource limit values from what I used for testing; you can play around with those values if you want. Once you’re ready, navigate to the Spark UI to see your job running along with the master and worker instances you just started, and get a feel for a cluster-like environment on a single machine.
