Catalyst Optimizer in Apache Spark

Most of the power of Spark SQL comes from the Catalyst optimizer, so let's take a closer look at it.

Catalyst optimizer has two primary goals:

  • Make adding new optimization techniques easy
  • Enable external developers to extend the optimizer

Spark SQL uses Catalyst's transformation framework in four phases:

  • Analyzing a logical plan to resolve references
  • Logical plan optimization
  • Physical planning
  • Code generation to compile parts of the query to Java bytecode

Analysis

The analysis phase starts from a SQL query or a DataFrame and builds an initial logical plan that is still unresolved: the columns it refers to may not exist or may have the wrong data type. Catalyst then resolves these references using the Catalog object (which connects to the physical data sources), producing a resolved logical plan.
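As a rough Python sketch (not Spark's actual internals; the catalog contents and function names here are invented for illustration), resolution amounts to checking each referenced column against the schema the catalog reports:

```python
# Toy illustration of the analysis phase: an unresolved plan references
# columns by name; the "catalog" supplies the schema, and resolution either
# binds each reference to a concrete (name, type) pair or fails.

catalog = {"employees": {"id": "int", "name": "string", "salary": "double"}}

def resolve(table, columns):
    """Resolve column references against the catalog, as the analyzer does."""
    schema = catalog.get(table)
    if schema is None:
        raise ValueError(f"Table not found: {table}")
    resolved = []
    for col in columns:
        if col not in schema:
            raise ValueError(f"Cannot resolve column: {col}")
        resolved.append((col, schema[col]))
    return resolved

print(resolve("employees", ["name", "salary"]))
```

A bad reference such as `resolve("employees", ["age"])` raises an error here, which is exactly the kind of failure the analyzer surfaces.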

Logical plan optimization

The logical plan optimization phase applies standard rule-based optimization to the logical plan. These include constant folding, predicate pushdown, projection pruning, null propagation, Boolean expression simplification, and other rules.
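As an illustration of one such rule, here is a minimal constant-folding pass in Python over a toy expression tree (the tuple-based AST is invented for the example; Catalyst operates on its own tree node classes):

```python
# Toy constant-folding rule: collapse subtrees whose operands are all
# literals, the way a rule rewrites e.g. (1 + 2) * col into 3 * col.

def fold(expr):
    """Recursively fold constant subexpressions in a nested-tuple AST."""
    if not isinstance(expr, tuple):
        return expr                      # a literal or a column name
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return {"+": left + right, "*": left * right}[op]
    return (op, left, right)

# (1 + 2) * salary  ->  3 * salary
print(fold(("*", ("+", 1, 2), "salary")))
```

The column reference blocks folding at the top level, but the purely literal subtree still collapses; Catalyst applies rules like this repeatedly until the plan stops changing.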

I would like to draw special attention to the predicate pushdown rule here. The concept is simple: if you issue a query in one place against massive data that lives in another place, a lot of unnecessary data can end up moving across the network.


If we can push that part of the query down to where the data is stored, and thus filter out unnecessary data at the source, network traffic is reduced significantly.
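A toy Python model (the data, predicate, and row counts are all made up) makes the saving concrete: with pushdown, only the rows that pass the filter ever cross the "network":

```python
# Toy model of predicate pushdown: applying the filter at the data source
# means only matching rows are transferred.

rows = [{"id": i, "region": "EU" if i % 2 else "US"} for i in range(1000)]

def scan_then_filter(data, pred):
    """Ship everything, then filter on the query side."""
    transferred = list(data)
    return len(transferred), [r for r in transferred if pred(r)]

def pushdown_scan(data, pred):
    """Filter at the source; only matching rows are shipped."""
    transferred = [r for r in data if pred(r)]
    return len(transferred), transferred

pred = lambda r: r["region"] == "EU"
moved_naive, result_naive = scan_then_filter(rows, pred)
moved_pushed, result_pushed = pushdown_scan(rows, pred)
print(moved_naive, moved_pushed)   # 1000 vs 500 rows moved
```

Both strategies return the same result set; only the volume of data moved differs.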

Physical planning

In the physical planning phase, Spark SQL takes the logical plan and generates one or more physical plans. It then estimates the cost of each physical plan and selects one based on that cost model.
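In sketch form (with invented plan names and made-up cost numbers, purely for illustration), the selection step amounts to keeping the cheapest candidate:

```python
# Toy cost-based selection: each candidate physical plan carries an
# estimated cost, and the planner keeps the cheapest one.

candidate_plans = [
    {"name": "sort-merge join", "cost": 300},
    {"name": "broadcast hash join", "cost": 120},
    {"name": "shuffle hash join", "cost": 210},
]

chosen = min(candidate_plans, key=lambda p: p["cost"])
print(chosen["name"])   # broadcast hash join
```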

Code generation

The final phase of query optimization involves generating Java bytecode to run on each machine. Catalyst uses a special Scala feature called quasiquotes to accomplish this.
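Spark's actual code generation emits JVM bytecode via Scala quasiquotes; as a loose Python analogue (the generated expression here is invented), one can generate source text for the query's expression, compile it once, and then run the compiled function per row instead of re-interpreting the expression tree each time:

```python
# Rough analogue of runtime code generation: build source text for the
# plan's expression, compile it once, then evaluate it per row.

source = "lambda row: row['salary'] * 2"            # generated from the plan
compiled = compile(source, "<generated>", "eval")   # one-time compilation
double_salary = eval(compiled)

print(double_salary({"salary": 100}))   # 200
```

The win is the same in spirit: the per-row work becomes a direct call into compiled code rather than a tree walk.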


