Spark Performance Optimization: Data Serialization

Serialization is the process of converting an object into a byte stream; deserialization is the reverse. This is useful when saving objects to disk or sending them over a network, both of which happen constantly in distributed environments. Apache Spark runs in a distributed environment, so objects frequently need to travel over the network from the driver to the executors, or between executors.
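The round trip can be illustrated with plain Java, using only the JDK's java.io classes (a minimal sketch; the Person class here is a hypothetical example, not from Spark):

```java
import java.io.*;

public class SerializationDemo {
    // A hypothetical class; it must implement Serializable to be serialized.
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        Person(String name) { this.name = name; }
    }

    // Serialize a Person to bytes, then deserialize it back.
    static String roundTrip(String name) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Person(name));          // object -> byte stream
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            Person copy = (Person) in.readObject();     // byte stream -> object
            return copy.name;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("Ada")); // prints "Ada"
    }
}
```

The byte array produced in the middle is exactly what would be written to disk or sent across the network in the scenarios described above.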

We can configure the serializer using the property => spark.serializer

There are two types of serializers provided in Spark:

1. Java serialization - By default, Spark serializes objects using Java's built-in mechanism. A class (or one of its superclasses) must implement the java.io.Serializable interface or its subinterface, java.io.Externalizable. Note that a class itself is never serialized; only objects (instances) of the class are serialized.

Advantages -

a. Simple and convenient to implement and use

b. No need to configure manually

Disadvantages - 

a. Not efficient for large objects

b. Serialization speed is low

c. Serialized data size is larger compared to Kryo, so memory consumption is higher.
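The size overhead is easy to observe with the JDK alone. The sketch below serializes a single boxed Integer; exact byte counts vary by JVM version, so the point is only the order of magnitude:

```java
import java.io.*;

public class JavaSerializationSize {
    // Measure how many bytes Java serialization produces for a given object.
    static int serializedSize(Object obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray().length;
    }

    public static void main(String[] args) throws IOException {
        // A boxed Integer holds 4 bytes of payload, but the serialized
        // stream also carries a header and class metadata (class name,
        // serialVersionUID, field descriptors), so it is far larger.
        System.out.println("serialized Integer: "
                + serializedSize(Integer.valueOf(42)) + " bytes");
    }
}
```

This per-object metadata is a large part of why Java serialization produces bigger payloads than Kryo, which (especially with registration) replaces class names with small numeric IDs.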

2. Kryo serialization - Kryo is a Java serialization framework with a focus on speed, efficiency, and a user-friendly API. 

Advantages - 

a. Faster than Java serialization mechanism

b. Serialized data size is smaller compared to Java serialization, so memory consumption is less.

c. Kryo serialization during shuffle reduces the amount of data transferred, which optimizes network transmission performance.

Disadvantages -

a. Needs to be configured manually.

b. Classes need to be registered in advance for best performance.

When an unregistered class is encountered, a serializer is automatically chosen from a list of “default serializers” that map classes to serializers. If no default serializer matches the class, the global default serializer is used.

The global default serializer is set to FieldSerializer by default.

How to Use Kryo serialization -

1. Initialize the Spark configuration by setting the property -> SparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

2. Register custom classes using SparkConf.registerKryoClasses()

Kryo's setWarnUnregisteredClasses(true) => logs a warning when an unregistered class is serialized

spark.kryo.registrationRequired = true => throws an exception when an unregistered class needs to be serialized
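Putting the two steps together, a configuration sketch using Spark's Java API (MyRecord is a hypothetical user class; this assumes Spark is on the classpath, so it is illustrative rather than runnable standalone):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class KryoSetup {
    // A hypothetical user class; with Kryo it does not need to implement
    // java.io.Serializable.
    static class MyRecord {}

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("kryo-example")
                // Step 1: switch from the default Java serializer to Kryo.
                .set("spark.serializer",
                     "org.apache.spark.serializer.KryoSerializer")
                // Optional: fail fast if a class was not registered.
                .set("spark.kryo.registrationRequired", "true");

        // Step 2: register the custom classes Spark will serialize.
        conf.registerKryoClasses(new Class<?>[]{ MyRecord.class });

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job logic ...
        sc.stop();
    }
}
```

Registering the classes up front lets Kryo write a small numeric ID instead of the full class name with each serialized object, which is where much of the size saving comes from.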


Please let me know your thoughts/inputs in case anything needs to be added.

I would add that class registration is not mandatory. The Kryo library will still work without registration, but it will most likely not achieve the same efficiency as with registration. I have seen and implemented jobs where simply changing spark.serializer already improved performance. This is also what the Spark documentation (source: https://spark.apache.org/docs/latest/tuning.html#data-serialization) refers to when it says "Finally, if you don’t register your custom classes, Kryo will still work" (last paragraph on data serialization). The Spark documentation officially recommends testing whether Kryo leads to performance improvements.


More articles by Rahul Chanda
