Apache Spark Catalyst Optimizer
At the core of Spark SQL is the Catalyst Optimizer. Let's explore its different phases.
Parsing
Spark first parses the query into a parse tree and will raise org.apache.spark.sql.catalyst.parser.ParseException in case of any syntax issue. For example, I provided a wrong spelling of the cast function in a select expression, hence Spark raised this exception.
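A minimal sketch of this failure in a spark-shell session (the df_locations DataFrame is assumed to be loaded from the NYC taxi zone lookup file that accompanies the trip records listed under References; the exact error message varies by Spark version):

import org.apache.spark.sql.functions.col

// Assumed setup: taxi zone lookup CSV with columns locationid, borough, zone, service_zone
val df_locations = spark.read.option("header", "true").csv("taxi_zone_lookup.csv")

// "cast" is misspelled as "csat"; "locationid as int" is only legal syntax inside
// a real CAST expression, so the parser rejects the expression outright
df_locations.selectExpr("csat(locationid as int)")
// org.apache.spark.sql.catalyst.parser.ParseException: Syntax error at or near 'as'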
Once the syntax check passes, Spark creates the Unresolved Logical Plan for the given DataFrame.
Analysis
In this phase the Catalyst Optimizer converts the Unresolved Logical Plan into a resolved Logical Plan by resolving all references using the catalog.
Resolving references means checking that each provided column name is valid and that each column's type matches the computation performed on it. Spark SQL uses Catalyst rules and a Catalog object, which tracks the tables in all data sources, to resolve these attributes.
Let's say we have created a DataFrame and by mistake provided a wrong column name; here Spark will raise org.apache.spark.sql.AnalysisException.
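A minimal sketch, reusing the df_locations DataFrame defined above (the message format differs across Spark versions):

// "borough" is misspelled, so the analyzer cannot resolve the attribute against the catalog
df_locations.select(col("buorough"))
// org.apache.spark.sql.AnalysisException: cannot resolve 'buorough'
// given input columns: [locationid, borough, zone, service_zone]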
In this phase Spark also resolves the types of all columns. Notice in the explain plan below: in the Parsed Logical Plan the columns have no types associated with them, but in the Analyzed Logical Plan the types are resolved.
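A sketch of what explain(true) prints for a projection with a cast, again on df_locations (output abridged; attribute ids are illustrative):

df_locations.selectExpr("cast(locationid as int) as loc_id", "zone").explain(true)
// == Parsed Logical Plan ==
// 'Project ['cast('locationid as int) AS loc_id, 'zone]   <- unresolved, no types yet
// +- Relation [locationid,borough,zone,service_zone] csv
//
// == Analyzed Logical Plan ==
// loc_id: int, zone: string                               <- types now resolved
// Project [cast(locationid#10 as int) AS loc_id#20, zone#12]
// +- Relation [locationid#10,borough#11,zone#12,service_zone#13] csv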
Logical Optimization
In this stage the Catalyst Optimizer converts the Logical Plan into an Optimized Logical Plan by applying standard rule-based optimizations, such as predicate pushdown and projection pruning.
Here is an example of a filter being pushed down. Suppose we run a select and then filter the data, as shown below.
df_locations.selectExpr("locationid||'-'||zone as loc_zon").filter(col("borough") === "EWR")
Notice in the explain plan below: in the Analyzed Logical Plan the Filter is applied after the Project, while in the Optimized Logical Plan the filter has been pushed down.
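A sketch of the relevant fragments of explain(true) for this query (abridged; attribute ids are illustrative):

// == Analyzed Logical Plan ==
// Project [loc_zon#25]
// +- Filter (borough#11 = EWR)                              <- filter sits above the project
//    +- Project [concat(locationid#10, -, zone#12) AS loc_zon#25, borough#11]
//       +- Relation [locationid#10,borough#11,zone#12,service_zone#13] csv
//
// == Optimized Logical Plan ==
// Project [concat(locationid#10, -, zone#12) AS loc_zon#25]
// +- Filter (isnotnull(borough#11) AND (borough#11 = EWR))  <- pushed below the project
//    +- Relation [locationid#10,borough#11,zone#12,service_zone#13] csv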
Physical Planning
In this phase Spark SQL takes an Optimized Logical Plan and generates one or more Physical Plans, using physical operators that match the Spark execution engine. It then selects the best plan using a cost model.
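Since Spark 3.0 you can peek at the statistics that feed the cost model with explain("cost"), which annotates the optimized logical plan with size (and, where available, row count) estimates; a minimal sketch (output abridged):

df_locations.filter(col("borough") === "EWR").explain("cost")
// == Optimized Logical Plan ==
// Filter (isnotnull(borough#11) AND (borough#11 = EWR)), Statistics(sizeInBytes=...)
// +- Relation [locationid#10,borough#11,zone#12,service_zone#13] csv, Statistics(sizeInBytes=...)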
Here Spark also performs rule-based physical optimizations, such as pipelining projections or filters into a single Spark map operation. It can also push operations from the logical plan into data sources that support predicate or projection pushdown.
See this example of a pushed filter:
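A sketch, assuming the January 2022 yellow taxi trip records listed under References are read from Parquet (file name and column names assumed; output abridged):

spark.read.parquet("yellow_tripdata_2022-01.parquet")
  .select("passenger_count", "fare_amount")
  .filter(col("passenger_count") > 1)
  .explain()
// The FileScan node in the physical plan reports what was handed to the data source:
// ... FileScan parquet [passenger_count#5,fare_amount#10] ...
//     PushedFilters: [IsNotNull(passenger_count), GreaterThan(passenger_count,1.0)],
//     ReadSchema: struct<passenger_count:double,fare_amount:double>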
Code Generation
At the end, Spark deals only with the lower-level construct, i.e. RDDs (Resilient Distributed Datasets). In this final phase the Catalyst Optimizer generates Java bytecode that runs efficiently on each machine against these in-memory datasets.
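You can inspect the generated code through Spark's debug helpers; a minimal sketch, again on df_locations (output abridged):

import org.apache.spark.sql.execution.debug._

df_locations.filter(col("borough") === "EWR").debugCodegen()
// Found 1 WholeStageCodegen subtree.
// == Subtree 1 / 1 ==
// *(1) Filter (isnotnull(borough#11) AND (borough#11 = EWR))
// ...
// Generated code:
// /* 001 */ public Object generate(Object[] references) { ... }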
Please refer to the Spark documentation for more information.
References
Dataset Used for Analysis: 2022 January - Yellow Taxi Trip Records
Catalyst Optimizer Image Credit: Databricks