Begin your data engineering with Spark!
Data Engineering is a core discipline in data science and analytics, focused on the design, development, and maintenance of ETL systems that collect, store, process, and transform raw data into usable formats. It is the backbone of data-driven services in any organization that relies on data to drive innovation, shape strategy, and make informed decisions.
Key domains of data engineering include building data pipelines, data integration, data cleaning and transformation, database design and management, infrastructure and tooling, and collaboration.
An optimized distributed computing system like Apache Spark is critical for fast processing of large-scale data. Apache Spark is known for its speed, ease of use, and versatility across a wide range of data processing tasks, from batch and streaming to machine learning and graph processing.
Key strengths of Apache Spark include:
Faster Processing: Spark processes data in memory (RAM), which makes it far faster than traditional disk-based systems like Hadoop MapReduce.
Unified Engine: It supports multiple workloads, including batch processing, real-time streaming, SQL, machine learning (MLlib), and graph processing (GraphX), all within a single framework.
High Scalability: Spark can scale from a single machine to thousands of nodes in a cluster.
Multi-Language Support: It offers APIs in Scala, Java, Python (PySpark), and R, making it accessible to a wide range of developers and data scientists (a short PySpark example follows this list).
Robust Integration: Easily integrates with big data tools and platforms like Hadoop HDFS, Hive, Cassandra, HBase, and cloud storage systems.
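As a quick illustration of the Python API, here is a minimal PySpark session; the app name and sample rows are placeholders:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("HelloSpark").getOrCreate()  # entry point to the DataFrame API
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])  # tiny in-memory example table
df.show()  # prints the two rows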
In the basic Apache Spark execution model shown in the workflow diagram above, data flows through the Driver Node, undergoes transformations and actions across the Executor Nodes, and finally reaches the Destination Database.
=> Driver Node: The driver node manages Spark jobs and orchestrates execution. It reads data from the Source Database and partitions it into multiple chunks (Data 1 and Data 2 in the diagram). The driver performs two key operations: Partitioning, which distributes data across the executors, and Collect, which gathers results back when necessary.
=> Executor Nodes: Multiple executor nodes process the partitioned data in parallel using Resilient Distributed Datasets (RDDs).
=> Execution Flow: Transformations and actions run on the executors, and the driver collects the results when needed.
=> Final Data Storage: The processed results are written to the Destination Database.
This execution model highlights Spark’s capability for large-scale distributed data processing, making it ideal for ETL pipelines, machine learning, and analytics.
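To make the flow concrete, here is a minimal PySpark sketch of the driver partitioning data, the executors applying a transformation, and the driver collecting the results (the data and partition count are illustrative):
from pyspark import SparkContext
sc = SparkContext.getOrCreate()  # driver process: creates or reuses the SparkContext
rdd = sc.parallelize(range(10), numSlices=2)  # driver splits the data into two partitions, like Data 1 and Data 2
doubled = rdd.map(lambda x: x * 2)  # transformation runs in parallel on the executors
print(doubled.getNumPartitions())  # 2
print(doubled.collect())  # collect: the driver gathers the results back from the executors
In a real pipeline the final step would write to the destination database (for example via a DataFrame writer) rather than collecting everything to the driver.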
Optimizing the Apache Spark Execution Model in Python
To improve the performance of the execution model shown in the diagram, consider these optimizations:
1. Minimize Shuffle Operations
Issue: Wide operations such as groupByKey() move every record across the network before any aggregation happens, making the shuffle expensive.
Optimization: Prefer reduceByKey(), which combines values within each partition first, so far less data crosses the network.
✅ Use this:
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
optimized_rdd = rdd.reduceByKey(lambda x, y: x + y) # Less shuffle
❌ Avoid this:
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
non_optimized_rdd = rdd.groupByKey().mapValues(sum) # High shuffle
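Both versions produce the same result, but reduceByKey() performs the combine on each partition before the shuffle, so far less data moves across the network:
print(optimized_rdd.collect())  # e.g. [('a', 4), ('b', 2)]; ordering may vary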
2. Optimize Data Partitioning
Issue: Too few partitions leave executors idle, while too many add task-scheduling overhead; repartition() also always triggers a full shuffle.
Optimization: When you only need to reduce the number of partitions, use coalesce(), which merges existing partitions and avoids a full shuffle.
✅ Use this (reducing partitions efficiently after shuffle):
rdd = rdd.repartition(4) # Full shuffle
rdd = rdd.coalesce(2) # Minimized shuffle
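You can verify the effect by checking the partition count; given the two calls above, a quick sanity check:
print(rdd.getNumPartitions())  # 2, after repartition(4) followed by coalesce(2)
As a rule of thumb, coalesce() is the cheaper choice when only reducing partitions, while repartition() is needed to increase them or rebalance skewed data.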
3. Cache and Persist to Reduce Recomputations
Issue: Every action (count, collect, and so on) recomputes the RDD's entire lineage from the source data.
Optimization: Cache or persist an RDD that is reused across multiple actions so it is computed once and then served from memory.
✅ Use this:
rdd = rdd.map(lambda x: x * 2)
rdd.cache()  # mark the RDD to be kept in memory after it is first computed
result1 = rdd.count()  # first action: computes the RDD and fills the cache
result2 = rdd.collect()  # second action: reuses the cached data, no recomputation
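As an alternative to cache() (use one or the other on a given RDD), persist() lets you choose an explicit storage level, and unpersist() frees the space when you are done; a minimal sketch, with MEMORY_AND_DISK as just one reasonable choice:
from pyspark import StorageLevel
rdd.persist(StorageLevel.MEMORY_AND_DISK)  # instead of cache(): spills partitions to disk when memory is full
rdd.unpersist()  # release the cached data once it is no longer needed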
4. Use Broadcast Variables to Reduce Network Overhead
Issue: Any lookup table referenced inside a closure is serialized and shipped with every single task, wasting network bandwidth.
Optimization: Broadcast small, read-only datasets once so that each executor keeps a single local copy.
✅ Use this:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()  # reuse an existing context if one is already running
small_lookup_data = {"a": 1, "b": 2, "c": 3}
broadcast_var = sc.broadcast(small_lookup_data)  # sent to each executor only once
rdd = sc.parallelize([("a", 100), ("b", 200), ("c", 300)])
optimized_rdd = rdd.map(lambda x: (x[0], x[1] * broadcast_var.value[x[0]]))  # tasks read the local broadcast copy
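For contrast, a sketch of the pattern to avoid: referencing the dictionary directly in the map closure ships a serialized copy of it with every task.
❌ Avoid this:
non_optimized_rdd = rdd.map(lambda x: (x[0], x[1] * small_lookup_data[x[0]]))  # the dict is serialized into every task's closure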
5. Use DataFrames Instead of RDDs
Issue: RDD transformations are opaque to Spark, so it cannot optimize the execution plan and processes rows as plain Python objects.
Optimization: DataFrames run through the Catalyst optimizer and the Tungsten execution engine, so the same logic is typically faster and shorter.
✅ Use this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Optimization").getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df_filtered = df.filter(df["age"] > 30)
df_filtered.show()
❌ Avoid this (less optimized RDD approach):
rdd = sc.textFile("data.csv").map(lambda x: x.split(","))  # parses rows by hand, no schema
filtered_rdd = rdd.filter(lambda row: int(row[1]) > 30)  # assumes column 1 is the age and breaks on the header row
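To confirm that the DataFrame version is actually being optimized, you can print its physical plan (the exact output varies by Spark version and data source):
df_filtered.explain()  # shows the Catalyst-optimized plan, including any pushed-down filters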
Happy engineering with data!