Apache Spark Catalyst Optimizer

At the core of Spark SQL is the Catalyst Optimizer. Let's explore its phases:

  • Analysis
  • Logical Optimization
  • Physical Planning
  • Code Generation

[Image: Catalyst Optimizer pipeline — Analysis, Logical Optimization, Physical Planning, Code Generation]

Spark raises org.apache.spark.sql.catalyst.parser.ParseException for any syntax issue. For example, I misspelled the cast function in a select expression, so Spark raised this exception.

[Image: ParseException raised for a misspelled cast function]
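As a minimal sketch of this behavior (the DataFrame and column names are illustrative stand-ins, not the original dataset):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative stand-in for the taxi zone lookup data.
val df_locations = Seq((1, "EWR", "Newark Airport"))
  .toDF("locationid", "borough", "zone")

// "cst" instead of "cast": the SQL expression fails to parse,
// raising org.apache.spark.sql.catalyst.parser.ParseException.
df_locations.selectExpr("cst(locationid as int)")
```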

Once the syntax check passes, Spark creates the Unresolved Logical Plan for the given DataFrame.

Analysis

In this phase the Catalyst optimizer converts the Unresolved Logical Plan into a Logical Plan by resolving all references using the catalog.

Resolving references means checking whether each column name is valid and whether the column's type matches the computation performed on it. Spark SQL uses Catalyst rules and a Catalog object, which tracks the tables in all data sources, to resolve these attributes.

Let's say we have created a DataFrame and, by mistake, referenced a wrong column name. Here Spark raises org.apache.spark.sql.AnalysisException.

[Image: AnalysisException raised for an invalid column name]

In this phase Spark also resolves the types of all columns. If you look at the explain plan below, column types are not yet associated in the Parsed Logical Plan, but they are in the Analyzed Logical Plan.

[Image: explain output — Parsed Logical Plan vs. Analyzed Logical Plan]
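Both checks can be sketched as follows, assuming an illustrative df_locations with columns locationid, borough, and zone:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative stand-in for the taxi zone lookup data.
val df_locations = Seq((1, "EWR", "Newark Airport"))
  .toDF("locationid", "borough", "zone")

// Misspelled column name: parsing succeeds, but analysis fails,
// raising org.apache.spark.sql.AnalysisException.
df_locations.select("boroug")

// explain(true) prints the Parsed, Analyzed, and Optimized Logical
// Plans plus the Physical Plan; column types appear only from the
// Analyzed Logical Plan onwards.
df_locations.select("borough").explain(true)
```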

Logical Optimization

In this stage the Catalyst optimizer converts the Logical Plan into an Optimized Logical Plan by applying standard rule-based optimizations, such as predicate pushdown and projection pruning.

See this example of a filter being pushed down. Suppose we run a select and then filter the data:

df_locations.selectExpr("locationid||'-'||zone as loc_zon").filter(col("borough") === "EWR")

Notice in the explain plan below: in the Analyzed Logical Plan the Filter is applied after the Project, while in the Optimized Logical Plan the Filter is pushed down below the Project.

[Image: explain output — Filter pushed below Project in the Optimized Logical Plan]
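The same pipeline can be inspected with explain. A sketch, again using an illustrative stand-in for df_locations:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative stand-in for the taxi zone lookup data.
val df_locations = Seq((1, "EWR", "Newark Airport"))
  .toDF("locationid", "borough", "zone")

// In the Analyzed Logical Plan the Filter sits above the Project;
// in the Optimized Logical Plan Catalyst pushes it below.
df_locations
  .selectExpr("locationid||'-'||zone as loc_zon")
  .filter(col("borough") === "EWR")
  .explain(true)
```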

Physical Planning

In this phase Spark SQL takes an Optimized Logical Plan and generates one or more Physical Plans, using physical operators that match the Spark execution engine. It then selects the best plan using a cost model.

Here Spark also performs rule-based physical optimizations, such as pipelining projections or filters into one Spark map operation. It can also push operations from the logical plan into data sources that support predicate or projection pushdown.

See this example of pushed filter:

[Image: Physical Plan showing PushedFilters in the FileScan node]
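A sketch of how to observe this, assuming a hypothetical local Parquet copy of the trip data (the path and column name are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical path to a Parquet copy of the Yellow Taxi trip data.
val trips = spark.read.parquet("/tmp/yellow_tripdata_2022-01.parquet")

// Parquet supports predicate pushdown: the FileScan node of the
// Physical Plan lists the condition under "PushedFilters".
trips.filter(col("passenger_count") > 1).explain()
```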

Code Generation

In the end, Spark operates on the lower-level construct, the RDD (Resilient Distributed Dataset). In this final phase the Catalyst optimizer generates Java bytecode that runs efficiently on each machine over these in-memory datasets.
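The generated code can be inspected directly through Spark's debug package. A sketch under the same illustrative setup:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.execution.debug._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative stand-in for the taxi zone lookup data.
val df_locations = Seq((1, "EWR", "Newark Airport"))
  .toDF("locationid", "borough", "zone")

// Prints the Java source produced by whole-stage code generation
// for each codegen subtree of the physical plan.
df_locations.filter(col("borough") === "EWR").queryExecution.debug.codegen()
```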

Please refer to the Spark documentation for more information.

References

https://www.databricks.com/glossary/catalyst-optimizer

https://www.databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

Dataset Used for Analysis: 2022 January - Yellow Taxi Trip Records

Catalyst Optimizer Image Credit: Databricks
