Spark Performance Tuning

Getting started with Spark to develop big data applications is easy, and Spark provides many ways to accomplish the same task. Performance tuning, however, is a crucial aspect of big data work: even minor improvements, applied at huge scale, can save a lot of time and resources. In this blog post, we will discuss the small things we can do to get the maximum out of our Spark cluster.


Before jumping into the specifics, let us try to understand how Spark works.


Actors: Spark has two major components:

1. Driver

2. Executor


The driver is the master process that controls everything. Spark runs multiple executors, which are the worker processes that do the actual execution of tasks. We should also understand that a job is just a set of tasks: the driver hands out the work, and the executors handle the tasks. One executor can run multiple tasks at a time, depending on how many resources it has.

Once a job is submitted to Spark, Spark creates an execution plan for it. This plan divides the whole job into a set of stages, and each stage contains multiple tasks. The order in which stages run is decided by their dependencies: at first, only those stages run whose input data is already available and which do not depend on any other stage, and this is repeated as stages complete. Stage boundaries are determined by the shuffling of data: within a stage no shuffle is required, and a new stage begins wherever data must be shuffled.


One more important thing to consider here is that the more shuffles there are, the slower things run. A shuffle is normally required by wide transformations. For those who are not aware, Spark has two kinds of transformations: narrow and wide. A narrow transformation generates its output using the data of only one partition, whereas a wide transformation may require data from multiple partitions. For example, map is a narrow transformation and groupByKey is a wide transformation. I will discuss this in detail in another post.
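To make the distinction concrete, here is a minimal plain-Python sketch (a toy model, not Spark code): partitions are modeled as lists of lists, a map-style narrow transformation touches each partition independently, while a groupByKey-style wide transformation must pull matching keys from all partitions.

```python
# Model an RDD as a list of partitions (toy model, not the Spark API).
partitions = [
    [("a", 1), ("b", 2)],   # partition 0
    [("a", 3), ("c", 4)],   # partition 1
]

# Narrow transformation: like map, each output partition depends on
# exactly one input partition -- no data movement between partitions.
mapped = [[(k, v * 10) for k, v in part] for part in partitions]

# Wide transformation: like groupByKey, values for one key may live in
# many partitions, so every record must be shuffled and regrouped by key.
grouped = {}
for part in partitions:          # the "shuffle": read every partition
    for k, v in part:
        grouped.setdefault(k, []).append(v)

print(mapped[0])
print(sorted(grouped.items()))
```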


Now that we understand some of the basics of Spark, let us explore where we can add more power.


1. Define the right number of executors and executor cores - You can set the number of executors while submitting a job or starting a shell using the --num-executors property. That alone is not enough: you also need to define the number of executor cores. Executor cores define the number of concurrent tasks each executor can run. If we set executor cores very high, Spark will run a very high number of tasks on each executor; these tasks will compete with each other for resources, and that will reduce data I/O throughput.


Eg. spark-submit --class com.test.Example --master yarn --num-executors 15 --executor-cores 5 test.jar
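As a rough sanity check, total concurrent task slots are executors × cores per executor. A quick back-of-the-envelope calculation (plain Python, with made-up cluster numbers) helps verify the values in a submit command before running it:

```python
# Hypothetical cluster: 3 worker nodes with 16 cores each.
nodes, cores_per_node = 3, 16

# Common rule of thumb: around 5 cores per executor keeps I/O throughput
# healthy; leave one core per node for the OS and Hadoop daemons.
executor_cores = 5
executors_per_node = (cores_per_node - 1) // executor_cores   # 3 per node
num_executors = nodes * executors_per_node

# 9 executors * 5 cores = 45 tasks can run concurrently.
print(num_executors, executor_cores, num_executors * executor_cores)
```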


2. Changing the number of partitions - The number of tasks created in a stage is equal to the number of partitions of the data, so having a very large number of partitions can reduce your cluster's performance. This is a very common problem when you run Spark code locally: if your data has a huge number of partitions, you will see your code running very slowly. You can decrease the number of partitions with the coalesce method, which avoids a full shuffle; use the repartition method when you need to increase the number of partitions or redistribute the data evenly. Both methods are also available on Spark DataFrames.
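The difference between coalesce and repartition can be sketched in plain Python (a toy model, not the Spark implementation): coalesce only merges neighbouring partitions locally, while repartition reshuffles every record into fresh partitions.

```python
import itertools

partitions = [[1], [2, 3], [4], [5, 6], [7], [8]]   # 6 small partitions

def coalesce(parts, n):
    """Merge contiguous groups of partitions into n buckets -- no shuffle:
    records never leave the group of partitions they started in."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        merged[i * n // len(parts)].extend(part)
    return merged

def repartition(parts, n):
    """Full shuffle: every record is redistributed across n partitions."""
    flat = list(itertools.chain.from_iterable(parts))
    return [flat[i::n] for i in range(n)]

print(coalesce(partitions, 2))
print(repartition(partitions, 3))
```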


3. Use broadcast variables - This is like the distributed cache in Hadoop: it lets us share data across worker nodes in a read-only way, shipped once per executor instead of once per task. Used intelligently, broadcast variables can noticeably optimize our code.
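A typical use is a map-side join against a small lookup table. Here is a plain-Python sketch of the idea (illustrative toy data, not the PySpark broadcast API): the small table is shared read-only by every partition, so the large dataset is enriched in place without any shuffle.

```python
# Hypothetical small dimension table, shared read-only with every worker.
countries = {"IN": "India", "US": "United States"}

# Large "fact" data, split across partitions on different workers.
orders = [[("IN", 42), ("US", 7)], [("IN", 5)]]

# Map-side join: each partition is enriched locally using the shared
# read-only table -- no shuffle of the large dataset is needed.
joined = [[(countries[code], amount) for code, amount in part]
          for part in orders]
print(joined)
```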


4. Caching your data - If you are doing a lot of exploration on a single DataFrame, caching it will speed up every subsequent execution you try on that DataFrame, because Spark will not recompute the full lineage each time.
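Why caching helps can be seen with a small memoization sketch in plain Python (illustrative only, not Spark's persistence machinery): without caching, every action recomputes the whole lineage; with caching, the expensive computation runs once and every later action reuses the materialized result.

```python
compute_count = 0

def expensive_lineage():
    """Stands in for re-reading and re-transforming the source data."""
    global compute_count
    compute_count += 1
    return [x * x for x in range(5)]

# Uncached: each "action" recomputes the lineage from scratch.
for _ in range(3):
    expensive_lineage()
print("uncached recomputations:", compute_count)   # 3

# Cached: materialize once, then reuse for every subsequent action.
compute_count = 0
cached = expensive_lineage()
for _ in range(3):
    _ = len(cached)          # actions hit the cached data instead
print("cached recomputations:", compute_count)     # 1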


5. Prefer reduceByKey over groupByKey - In Hadoop terms, reduceByKey is a reduce operation with a combiner, while groupByKey is a reduce without one. Both are wide transformations that trigger a shuffle, but groupByKey sends every record across the network, whereas reduceByKey combines values locally on each partition first. groupByKey therefore transfers much more data and is slower than reduceByKey.
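The difference in shuffle volume can be counted in a plain-Python sketch (toy numbers, not Spark internals): reduceByKey pre-aggregates per partition before the shuffle, so at most one record per key leaves each partition, while groupByKey ships every record.

```python
partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

# groupByKey-style: every (key, value) record crosses the network.
shuffled_group = sum(len(p) for p in partitions)

# reduceByKey-style: values are combined locally per partition first
# (the "combiner"), so at most one record per key leaves each partition.
shuffled_reduce = 0
for part in partitions:
    local = {}
    for k, v in part:
        local[k] = local.get(k, 0) + v
    shuffled_reduce += len(local)

# 6 records shuffled without a combiner vs 4 with one.
print(shuffled_group, shuffled_reduce)
```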


6. flatMap-join-groupBy vs cogroup - Use cogroup wherever possible, because it performs better than the flatMap-join-groupBy pattern: it avoids the extra overhead of packing and unpacking the joined data.
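A cogroup can be sketched in plain Python (a conceptual model, not the Spark implementation): both datasets are grouped by key into one structure in a single pass, instead of materializing an intermediate joined dataset and regrouping it afterwards.

```python
def cogroup(left, right):
    """Group two (key, value) datasets by key in a single pass:
    one (left_values, right_values) pair of lists per key."""
    result = {}
    for k, v in left:
        result.setdefault(k, ([], []))[0].append(v)
    for k, v in right:
        result.setdefault(k, ([], []))[1].append(v)
    return result

# Hypothetical example datasets.
orders = [("u1", "book"), ("u2", "pen"), ("u1", "lamp")]
payments = [("u1", 30), ("u2", 5)]

# One grouped structure per key -- no intermediate joined pairs
# to pack, shuffle, and unpack again.
print(sorted(cogroup(orders, payments).items()))
```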


These were some of my thoughts on improving the performance of any work we do in Spark. Please feel free to share your thoughts.

