Apache Spark Catalyst Optimizer
At the core of Spark SQL is the Catalyst Optimizer. Let's explore its different phases.
Parsing
Spark first parses the query into a parse tree and will raise org.apache.spark.sql.catalyst.parser.ParseException in case of any syntax issue. For example, I provided a wrong spelling of the cast function in a select expression, hence Spark raised this exception.
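A minimal sketch of this failure in a spark-shell session (the df_locations DataFrame is assumed to be loaded from the NYC taxi zone lookup file that accompanies the trip records listed under References; the exact error message varies by Spark version):

import org.apache.spark.sql.functions.col

// Assumed setup: taxi zone lookup CSV with columns locationid, borough, zone, service_zone
val df_locations = spark.read.option("header", "true").csv("taxi_zone_lookup.csv")

// "cast" is misspelled as "csat"; "locationid as int" is only legal syntax inside
// a real CAST expression, so the parser rejects the expression outright
df_locations.selectExpr("csat(locationid as int)")
// org.apache.spark.sql.catalyst.parser.ParseException: Syntax error at or near 'as'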
Once the syntax check passes, Spark creates the Unresolved Logical Plan for the given DataFrame.
Analysis
In this phase the Catalyst Optimizer converts the Unresolved Logical Plan into a resolved Logical Plan by resolving all references using the catalog.
Resolving references means checking that each provided column name is valid and that each column's type matches the computation performed on it. Spark SQL uses Catalyst rules and a Catalog object, which tracks the tables in all data sources, to resolve these attributes.
Let's say we have created a DataFrame and by mistake provided a wrong column name; here Spark will raise org.apache.spark.sql.AnalysisException.
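A minimal sketch, reusing the df_locations DataFrame defined above (the message format differs across Spark versions):

// "borough" is misspelled, so the analyzer cannot resolve the attribute against the catalog
df_locations.select(col("buorough"))
// org.apache.spark.sql.AnalysisException: cannot resolve 'buorough'
// given input columns: [locationid, borough, zone, service_zone]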
In this phase Spark also resolves the types of all columns. Notice in the explain plan below: in the Parsed Logical Plan the columns have no types associated with them, but in the Analyzed Logical Plan the types are resolved.
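A sketch of what explain(true) prints for a projection with a cast, again on df_locations (output abridged; attribute ids are illustrative):

df_locations.selectExpr("cast(locationid as int) as loc_id", "zone").explain(true)
// == Parsed Logical Plan ==
// 'Project ['cast('locationid as int) AS loc_id, 'zone]   <- unresolved, no types yet
// +- Relation [locationid,borough,zone,service_zone] csv
//
// == Analyzed Logical Plan ==
// loc_id: int, zone: string                               <- types now resolved
// Project [cast(locationid#10 as int) AS loc_id#20, zone#12]
// +- Relation [locationid#10,borough#11,zone#12,service_zone#13] csv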
Logical Optimization
In this stage the Catalyst Optimizer converts the Logical Plan into an Optimized Logical Plan by applying standard rule-based optimizations, such as predicate pushdown and projection pruning.
Here is an example of a filter being pushed down. Suppose we run a select and then filter the data, as shown below.
df_locations.selectExpr("locationid||'-'||zone as loc_zon").filter(col("borough") === "EWR")
Notice in the explain plan below: in the Analyzed Logical Plan the Filter is applied after the Project, while in the Optimized Logical Plan the filter has been pushed down.
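A sketch of the relevant fragments of explain(true) for this query (abridged; attribute ids are illustrative):

// == Analyzed Logical Plan ==
// Project [loc_zon#25]
// +- Filter (borough#11 = EWR)                              <- filter sits above the project
//    +- Project [concat(locationid#10, -, zone#12) AS loc_zon#25, borough#11]
//       +- Relation [locationid#10,borough#11,zone#12,service_zone#13] csv
//
// == Optimized Logical Plan ==
// Project [concat(locationid#10, -, zone#12) AS loc_zon#25]
// +- Filter (isnotnull(borough#11) AND (borough#11 = EWR))  <- pushed below the project
//    +- Relation [locationid#10,borough#11,zone#12,service_zone#13] csv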
Physical Planning
In this phase Spark SQL takes an Optimized Logical Plan and generates one or more Physical Plans, using physical operators that match the Spark execution engine. It then selects the best plan using a cost model.
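Since Spark 3.0 you can peek at the statistics that feed the cost model with explain("cost"), which annotates the optimized logical plan with size (and, where available, row count) estimates; a minimal sketch (output abridged):

df_locations.filter(col("borough") === "EWR").explain("cost")
// == Optimized Logical Plan ==
// Filter (isnotnull(borough#11) AND (borough#11 = EWR)), Statistics(sizeInBytes=...)
// +- Relation [locationid#10,borough#11,zone#12,service_zone#13] csv, Statistics(sizeInBytes=...)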
Here Spark also performs rule-based physical optimizations, such as pipelining projections or filters into a single Spark map operation. It can also push operations from the logical plan into data sources that support predicate or projection pushdown.
See this example of a pushed filter:
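A sketch, assuming the January 2022 yellow taxi trip records listed under References are read from Parquet (file name and column names assumed; output abridged):

spark.read.parquet("yellow_tripdata_2022-01.parquet")
  .select("passenger_count", "fare_amount")
  .filter(col("passenger_count") > 1)
  .explain()
// The FileScan node in the physical plan reports what was handed to the data source:
// ... FileScan parquet [passenger_count#5,fare_amount#10] ...
//     PushedFilters: [IsNotNull(passenger_count), GreaterThan(passenger_count,1.0)],
//     ReadSchema: struct<passenger_count:double,fare_amount:double>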
Code Generation
At the end, Spark deals only with the lower-level construct, i.e. RDDs (Resilient Distributed Datasets). In this final phase the Catalyst Optimizer generates Java bytecode that runs efficiently on each machine against these in-memory datasets.
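You can inspect the generated code through Spark's debug helpers; a minimal sketch, again on df_locations (output abridged):

import org.apache.spark.sql.execution.debug._

df_locations.filter(col("borough") === "EWR").debugCodegen()
// Found 1 WholeStageCodegen subtree.
// == Subtree 1 / 1 ==
// *(1) Filter (isnotnull(borough#11) AND (borough#11 = EWR))
// ...
// Generated code:
// /* 001 */ public Object generate(Object[] references) { ... }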
Please refer to the Spark documentation for more information.
References
Dataset Used for Analysis: 2022 January - Yellow Taxi Trip Records
Catalyst Optimizer Image Credit: Databricks