Spark Performance Tuning

Getting started with Spark to develop big data applications is easy, and Spark provides many ways to accomplish the same task. Performance tuning, however, is a crucial aspect of big data work: even minor improvements, applied at huge scale, can save a lot of time and resources. In this blog post, we will discuss the small things we can do to get the maximum out of our Spark cluster.


Before jumping into the specifics, let us try to understand how Spark works.


Actors: Spark has two major components:

1. Driver

2. Executor


The driver is the master process that controls everything. Spark runs multiple executors, which are the worker processes that do the actual execution of tasks. We should also understand that a job is just a set of tasks: the driver hands out the work, and the executors handle the tasks. One executor can run multiple tasks at a time, depending on how many resources it has.

Once a job is submitted to Spark, Spark creates an execution plan for it. This plan divides the whole job into a set of stages, and each stage contains multiple tasks. The order in which stages run is decided by their dependencies: at first, only those stages run whose input data is already available and which do not depend on any other stage, and this is repeated as stages complete. Stage boundaries are determined by the shuffling of data: within a stage no shuffle is required, and a new stage begins wherever data must be shuffled.


One more important thing to consider here is that the more shuffles there are, the slower things run. A shuffle is normally required by wide transformations. For those who are not aware, Spark has two kinds of transformations: narrow and wide. A narrow transformation generates its output using the data of only one partition, whereas a wide transformation may require data from multiple partitions. For example, map is a narrow transformation and groupByKey is a wide transformation. I will discuss this in detail in another post.
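To make the distinction concrete, here is a minimal plain-Python sketch (a toy model, not Spark code): partitions are modeled as lists of lists, a map-style narrow transformation touches each partition independently, while a groupByKey-style wide transformation must pull matching keys from all partitions.

```python
# Model an RDD as a list of partitions (toy model, not the Spark API).
partitions = [
    [("a", 1), ("b", 2)],   # partition 0
    [("a", 3), ("c", 4)],   # partition 1
]

# Narrow transformation: like map, each output partition depends on
# exactly one input partition -- no data movement between partitions.
mapped = [[(k, v * 10) for k, v in part] for part in partitions]

# Wide transformation: like groupByKey, values for one key may live in
# many partitions, so every record must be shuffled and regrouped by key.
grouped = {}
for part in partitions:          # the "shuffle": read every partition
    for k, v in part:
        grouped.setdefault(k, []).append(v)

print(mapped[0])
print(sorted(grouped.items()))
```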


Now that we understand some of the basics of Spark, let us explore where we can add more power.


1. Define the right number of executors and executor cores - You can set the number of executors while submitting a job or starting a shell using the --num-executors property. That alone is not enough: you also need to define the number of executor cores. Executor cores define the number of concurrent tasks each executor can run. If we set executor cores very high, Spark will run a very high number of tasks on each executor; these tasks will compete with each other for resources, and that will reduce data I/O throughput.


Eg. spark-submit --class com.test.Example --master yarn --num-executors 15 --executor-cores 5 test.jar
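As a rough sanity check, total concurrent task slots are executors × cores per executor. A quick back-of-the-envelope calculation (plain Python, with made-up cluster numbers) helps verify the values in a submit command before running it:

```python
# Hypothetical cluster: 3 worker nodes with 16 cores each.
nodes, cores_per_node = 3, 16

# Common rule of thumb: around 5 cores per executor keeps I/O throughput
# healthy; leave one core per node for the OS and Hadoop daemons.
executor_cores = 5
executors_per_node = (cores_per_node - 1) // executor_cores   # 3 per node
num_executors = nodes * executors_per_node

# 9 executors * 5 cores = 45 tasks can run concurrently.
print(num_executors, executor_cores, num_executors * executor_cores)
```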


2. Changing the number of partitions - The number of tasks created in a stage is equal to the number of partitions of the data, so having a very large number of partitions can reduce your cluster's performance. This is a very common problem when you run Spark code locally: if your data has a huge number of partitions, you will see your code running very slowly. You can decrease the number of partitions with the coalesce method, which avoids a full shuffle; use the repartition method when you need to increase the number of partitions or redistribute the data evenly. Both methods are also available on Spark DataFrames.
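The difference between coalesce and repartition can be sketched in plain Python (a toy model, not the Spark implementation): coalesce only merges neighbouring partitions locally, while repartition reshuffles every record into fresh partitions.

```python
import itertools

partitions = [[1], [2, 3], [4], [5, 6], [7], [8]]   # 6 small partitions

def coalesce(parts, n):
    """Merge contiguous groups of partitions into n buckets -- no shuffle:
    records never leave the group of partitions they started in."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        merged[i * n // len(parts)].extend(part)
    return merged

def repartition(parts, n):
    """Full shuffle: every record is redistributed across n partitions."""
    flat = list(itertools.chain.from_iterable(parts))
    return [flat[i::n] for i in range(n)]

print(coalesce(partitions, 2))
print(repartition(partitions, 3))
```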


3. Use broadcast variables - This is like the distributed cache in Hadoop: it lets us share data across worker nodes in a read-only way, shipped once per executor instead of once per task. Used intelligently, broadcast variables can noticeably optimize our code.
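A typical use is a map-side join against a small lookup table. Here is a plain-Python sketch of the idea (illustrative toy data, not the PySpark broadcast API): the small table is shared read-only by every partition, so the large dataset is enriched in place without any shuffle.

```python
# Hypothetical small dimension table, shared read-only with every worker.
countries = {"IN": "India", "US": "United States"}

# Large "fact" data, split across partitions on different workers.
orders = [[("IN", 42), ("US", 7)], [("IN", 5)]]

# Map-side join: each partition is enriched locally using the shared
# read-only table -- no shuffle of the large dataset is needed.
joined = [[(countries[code], amount) for code, amount in part]
          for part in orders]
print(joined)
```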


4. Caching your data - If you are doing a lot of exploration on a single DataFrame, caching it will speed up every subsequent execution you try on that DataFrame, because Spark will not recompute the full lineage each time.
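Why caching helps can be seen with a small memoization sketch in plain Python (illustrative only, not Spark's persistence machinery): without caching, every action recomputes the whole lineage; with caching, the expensive computation runs once and every later action reuses the materialized result.

```python
compute_count = 0

def expensive_lineage():
    """Stands in for re-reading and re-transforming the source data."""
    global compute_count
    compute_count += 1
    return [x * x for x in range(5)]

# Uncached: each "action" recomputes the lineage from scratch.
for _ in range(3):
    expensive_lineage()
print("uncached recomputations:", compute_count)   # 3

# Cached: materialize once, then reuse for every subsequent action.
compute_count = 0
cached = expensive_lineage()
for _ in range(3):
    _ = len(cached)          # actions hit the cached data instead
print("cached recomputations:", compute_count)     # 1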


5. Prefer reduceByKey over groupByKey - In Hadoop terms, reduceByKey is a reduce operation with a combiner, while groupByKey is a reduce without one. Both are wide transformations that trigger a shuffle, but groupByKey sends every record across the network, whereas reduceByKey combines values locally on each partition first. groupByKey therefore transfers much more data and is slower than reduceByKey.
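The difference in shuffle volume can be counted in a plain-Python sketch (toy numbers, not Spark internals): reduceByKey pre-aggregates per partition before the shuffle, so at most one record per key leaves each partition, while groupByKey ships every record.

```python
partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

# groupByKey-style: every (key, value) record crosses the network.
shuffled_group = sum(len(p) for p in partitions)

# reduceByKey-style: values are combined locally per partition first
# (the "combiner"), so at most one record per key leaves each partition.
shuffled_reduce = 0
for part in partitions:
    local = {}
    for k, v in part:
        local[k] = local.get(k, 0) + v
    shuffled_reduce += len(local)

# 6 records shuffled without a combiner vs 4 with one.
print(shuffled_group, shuffled_reduce)
```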


6. flatMap-join-groupBy vs cogroup - Use cogroup wherever possible, because it performs better than the flatMap-join-groupBy pattern: it avoids the extra overhead of packing and unpacking the joined data.
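A cogroup can be sketched in plain Python (a conceptual model, not the Spark implementation): both datasets are grouped by key into one structure in a single pass, instead of materializing an intermediate joined dataset and regrouping it afterwards.

```python
def cogroup(left, right):
    """Group two (key, value) datasets by key in a single pass:
    one (left_values, right_values) pair of lists per key."""
    result = {}
    for k, v in left:
        result.setdefault(k, ([], []))[0].append(v)
    for k, v in right:
        result.setdefault(k, ([], []))[1].append(v)
    return result

# Hypothetical example datasets.
orders = [("u1", "book"), ("u2", "pen"), ("u1", "lamp")]
payments = [("u1", 30), ("u2", 5)]

# One grouped structure per key -- no intermediate joined pairs
# to pack, shuffle, and unpack again.
print(sorted(cogroup(orders, payments).items()))
```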


These were some of my thoughts on improving the performance of any work we do in Spark. Please feel free to share your thoughts.

