Spark Performance Optimization: Data Serialization

Serialization is the process of converting an object into a byte stream; deserialization is the reverse. This is useful when saving objects to disk or sending them over a network, both of which happen constantly in distributed environments. Apache Spark runs in a distributed environment, so objects frequently need to travel over the network from the driver to the executors, or between executors.
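The round trip can be illustrated with plain Java, using only the JDK's java.io classes (a minimal sketch; the Person class here is a hypothetical example, not from Spark):

```java
import java.io.*;

public class SerializationDemo {
    // A hypothetical class; it must implement Serializable to be serialized.
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        Person(String name) { this.name = name; }
    }

    // Serialize a Person to bytes, then deserialize it back.
    static String roundTrip(String name) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Person(name));          // object -> byte stream
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            Person copy = (Person) in.readObject();     // byte stream -> object
            return copy.name;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("Ada")); // prints "Ada"
    }
}
```

The byte array produced in the middle is exactly what would be written to disk or sent across the network in the scenarios described above.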

We can configure the serializer using the property => spark.serializer

There are two types of serializers provided in Spark:

1. Java serialization - By default, Spark serializes objects using Java's built-in mechanism. A class (or one of its superclasses) must implement the java.io.Serializable interface or its subinterface, java.io.Externalizable. Note that a class itself is never serialized; only objects (instances) of the class are serialized.

Advantages -

a. Simple and convenient to implement and use

b. No need to configure manually

Disadvantages - 

a. Not efficient for large objects

b. Serialization speed is low

c. Serialized data size is larger compared to Kryo, so memory consumption is higher.
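The size overhead is easy to observe with the JDK alone. The sketch below serializes a single boxed Integer; exact byte counts vary by JVM version, so the point is only the order of magnitude:

```java
import java.io.*;

public class JavaSerializationSize {
    // Measure how many bytes Java serialization produces for a given object.
    static int serializedSize(Object obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray().length;
    }

    public static void main(String[] args) throws IOException {
        // A boxed Integer holds 4 bytes of payload, but the serialized
        // stream also carries a header and class metadata (class name,
        // serialVersionUID, field descriptors), so it is far larger.
        System.out.println("serialized Integer: "
                + serializedSize(Integer.valueOf(42)) + " bytes");
    }
}
```

This per-object metadata is a large part of why Java serialization produces bigger payloads than Kryo, which (especially with registration) replaces class names with small numeric IDs.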

2. Kryo serialization - Kryo is a Java serialization framework with a focus on speed, efficiency, and a user-friendly API. 

Advantages - 

a. Faster than Java serialization mechanism

b. Serialized data size is smaller compared to Java serialization, so memory consumption is less.

c. Kryo serialization during shuffle reduces the amount of data transferred, which optimizes network transmission performance.

Disadvantages -

a. Needs to be configured manually.

b. Classes need to be registered in advance for best performance.

When an unregistered class is encountered, a serializer is automatically chosen from a list of “default serializers” that map classes to serializers. If no default serializer matches the class, the global default serializer is used.

The global default serializer is set to FieldSerializer by default.

How to Use Kryo serialization -

1. Initialize the Spark configuration by setting the property -> SparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

2. Register custom classes using SparkConf.registerKryoClasses()

Kryo's setWarnUnregisteredClasses(true) => logs a warning when an unregistered class is serialized

spark.kryo.registrationRequired = true => throws an exception when an unregistered class needs to be serialized
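Putting the two steps together, a configuration sketch using Spark's Java API (MyRecord is a hypothetical user class; this assumes Spark is on the classpath, so it is illustrative rather than runnable standalone):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class KryoSetup {
    // A hypothetical user class; with Kryo it does not need to implement
    // java.io.Serializable.
    static class MyRecord {}

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("kryo-example")
                // Step 1: switch from the default Java serializer to Kryo.
                .set("spark.serializer",
                     "org.apache.spark.serializer.KryoSerializer")
                // Optional: fail fast if a class was not registered.
                .set("spark.kryo.registrationRequired", "true");

        // Step 2: register the custom classes Spark will serialize.
        conf.registerKryoClasses(new Class<?>[]{ MyRecord.class });

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job logic ...
        sc.stop();
    }
}
```

Registering the classes up front lets Kryo write a small numeric ID instead of the full class name with each serialized object, which is where much of the size saving comes from.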


Please let me know your thoughts/inputs in case anything needs to be added.

I would add that class registration is not mandatory. The Kryo library will still work without registration, but it will most likely not achieve the same efficiency as with registration. I have seen and implemented jobs where simply changing spark.serializer already improved performance. This is also what the Spark documentation (source: https://spark.apache.org/docs/latest/tuning.html#data-serialization) refers to when it says "Finally, if you don’t register your custom classes, Kryo will still work" (last paragraph on data serialization). The Spark documentation officially recommends testing whether Kryo leads to performance improvements.


More articles by Rahul Chanda
