Catalyst Optimizer in Apache Spark

Most of the power of Spark SQL comes from the Catalyst optimizer, so let's take a closer look at it.

Catalyst optimizer has two primary goals:

  • Make adding new optimization techniques easy
  • Enable external developers to extend the optimizer

Spark SQL uses Catalyst's transformation framework in four phases:

  • Analyzing a logical plan to resolve references
  • Logical plan optimization
  • Physical planning
  • Code generation to compile parts of the query to Java bytecode

Analysis

The analysis phase starts from a SQL query or a DataFrame and builds an initial logical plan that is still unresolved: the columns it refers to may not exist or may have the wrong data type. Catalyst then resolves these references using the Catalog object (which connects to the physical data sources), producing a resolved logical plan.
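As a rough Python sketch (not Spark's actual internals; the catalog contents and function names here are invented for illustration), resolution amounts to checking each referenced column against the schema the catalog reports:

```python
# Toy illustration of the analysis phase: an unresolved plan references
# columns by name; the "catalog" supplies the schema, and resolution either
# binds each reference to a concrete (name, type) pair or fails.

catalog = {"employees": {"id": "int", "name": "string", "salary": "double"}}

def resolve(table, columns):
    """Resolve column references against the catalog, as the analyzer does."""
    schema = catalog.get(table)
    if schema is None:
        raise ValueError(f"Table not found: {table}")
    resolved = []
    for col in columns:
        if col not in schema:
            raise ValueError(f"Cannot resolve column: {col}")
        resolved.append((col, schema[col]))
    return resolved

print(resolve("employees", ["name", "salary"]))
```

A bad reference such as `resolve("employees", ["age"])` raises an error here, which is exactly the kind of failure the analyzer surfaces.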

Logical plan optimization

The logical plan optimization phase applies standard rule-based optimization to the logical plan. These include constant folding, predicate pushdown, projection pruning, null propagation, Boolean expression simplification, and other rules.
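As an illustration of one such rule, here is a minimal constant-folding pass in Python over a toy expression tree (the tuple-based AST is invented for the example; Catalyst operates on its own tree node classes):

```python
# Toy constant-folding rule: collapse subtrees whose operands are all
# literals, the way a rule rewrites e.g. (1 + 2) * col into 3 * col.

def fold(expr):
    """Recursively fold constant subexpressions in a nested-tuple AST."""
    if not isinstance(expr, tuple):
        return expr                      # a literal or a column name
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return {"+": left + right, "*": left * right}[op]
    return (op, left, right)

# (1 + 2) * salary  ->  3 * salary
print(fold(("*", ("+", 1, 2), "salary")))
```

The column reference blocks folding at the top level, but the purely literal subtree still collapses; Catalyst applies rules like this repeatedly until the plan stops changing.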

I would like to draw special attention to the predicate pushdown rule here. The concept is simple: if you issue a query in one place against massive data that lives in another place, a lot of unnecessary data can end up moving across the network.


If we can push that part of the query down to where the data is stored, and thus filter out unnecessary data at the source, network traffic is reduced significantly.
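A toy Python model (the data, predicate, and row counts are all made up) makes the saving concrete: with pushdown, only the rows that pass the filter ever cross the "network":

```python
# Toy model of predicate pushdown: applying the filter at the data source
# means only matching rows are transferred.

rows = [{"id": i, "region": "EU" if i % 2 else "US"} for i in range(1000)]

def scan_then_filter(data, pred):
    """Ship everything, then filter on the query side."""
    transferred = list(data)
    return len(transferred), [r for r in transferred if pred(r)]

def pushdown_scan(data, pred):
    """Filter at the source; only matching rows are shipped."""
    transferred = [r for r in data if pred(r)]
    return len(transferred), transferred

pred = lambda r: r["region"] == "EU"
moved_naive, result_naive = scan_then_filter(rows, pred)
moved_pushed, result_pushed = pushdown_scan(rows, pred)
print(moved_naive, moved_pushed)   # 1000 vs 500 rows moved
```

Both strategies return the same result set; only the volume of data moved differs.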

Physical planning

In the physical planning phase, Spark SQL takes the logical plan and generates one or more physical plans. It then estimates the cost of each physical plan and selects one based on that cost model.
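In sketch form (with invented plan names and made-up cost numbers, purely for illustration), the selection step amounts to keeping the cheapest candidate:

```python
# Toy cost-based selection: each candidate physical plan carries an
# estimated cost, and the planner keeps the cheapest one.

candidate_plans = [
    {"name": "sort-merge join", "cost": 300},
    {"name": "broadcast hash join", "cost": 120},
    {"name": "shuffle hash join", "cost": 210},
]

chosen = min(candidate_plans, key=lambda p: p["cost"])
print(chosen["name"])   # broadcast hash join
```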

Code generation

The final phase of query optimization involves generating Java bytecode to run on each machine. Catalyst uses a special Scala feature called quasiquotes to accomplish this.
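Spark's actual code generation emits JVM bytecode via Scala quasiquotes; as a loose Python analogue (the generated expression here is invented), one can generate source text for the query's expression, compile it once, and then run the compiled function per row instead of re-interpreting the expression tree each time:

```python
# Rough analogue of runtime code generation: build source text for the
# plan's expression, compile it once, then evaluate it per row.

source = "lambda row: row['salary'] * 2"            # generated from the plan
compiled = compile(source, "<generated>", "eval")   # one-time compilation
double_salary = eval(compiled)

print(double_salary({"salary": 100}))   # 200
```

The win is the same in spirit: the per-row work becomes a direct call into compiled code rather than a tree walk.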


