SQL Optimization

Indexing

  • Suitable for traditional SQL databases (MySQL, PostgreSQL, SQL Server, Oracle, etc.).
  • Ensure proper indexes exist on columns used in WHERE, JOIN, and ORDER BY clauses.
  • Use composite indexes if multiple columns are frequently queried together.
  • Avoid over-indexing, as it can slow down INSERT, UPDATE, and DELETE operations.
  • Avoid functions on indexed columns in WHERE (e.g., WHERE LOWER(name) = 'john' prevents index usage).
  • Use ORDER BY and GROUP BY on indexed columns to improve performance.

CREATE INDEX idx_customer_id  ON transactions (customer_id);        
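As a quick self-contained check (SQLite via Python here, purely for illustration; table and column names are invented), EXPLAIN QUERY PLAN shows the index being used for an equality filter, and a function wrapped around an indexed column forcing a full scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, name TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_customer_id ON transactions (customer_id)")
conn.execute("CREATE INDEX idx_name ON transactions (name)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][3]

# Equality on the indexed column: SQLite reports an index SEARCH
indexed = plan("SELECT * FROM transactions WHERE customer_id = 42")
# Function on the indexed column: the index cannot be used, so a full SCAN
scanned = plan("SELECT * FROM transactions WHERE LOWER(name) = 'john'")
print(indexed)
print(scanned)
```

The same effect shows up in MySQL/PostgreSQL `EXPLAIN` output; only the plan wording differs.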

Query Structure

  • Use SELECT specific columns instead of SELECT * to minimize data retrieval.
  • Use EXISTS instead of IN for subqueries when checking existence: EXISTS stops as soon as it finds a match, whereas IN evaluates the entire subquery result set first.
  • Avoid correlated subqueries; prefer JOIN instead. Correlated subqueries are inefficient because they execute once per row in the outer query, whereas a JOIN processes all data at once.
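A minimal sketch of the EXISTS pattern (SQLite via Python, purely for illustration; the sample data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, customer_name TEXT);
CREATE TABLE transactions (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cat');
INSERT INTO transactions VALUES (1, 10.0), (1, 25.0), (3, 7.5);
""")

# EXISTS can stop at the first matching transaction per customer
rows = conn.execute("""
    SELECT c.customer_name
    FROM customers c
    WHERE EXISTS (SELECT 1 FROM transactions t
                  WHERE t.customer_id = c.customer_id)
    ORDER BY c.customer_name
""").fetchall()
print(rows)  # customers with at least one transaction
```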

Correlated subqueries: 

SELECT t1.customer_id, t1.customer_name, 
       (SELECT MAX(t2.amount) 
        FROM transactions t2 
        WHERE t2.customer_id = t1.customer_id) AS max_amount
FROM customers t1;

Optimized: 
SELECT t1.customer_id, t1.customer_name, t2.max_amount
FROM customers t1
JOIN (SELECT customer_id, MAX(amount) AS max_amount 
      FROM transactions 
      GROUP BY customer_id) t2
ON t1.customer_id = t2.customer_id;        
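To confirm the rewrite is equivalent, a small runnable check (SQLite via Python; the sample data is invented). Note that the inner JOIN drops customers with no transactions, where the correlated version would return NULL; use a LEFT JOIN if those rows must be kept:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, customer_name TEXT);
CREATE TABLE transactions (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO transactions VALUES (1, 10.0), (1, 25.0), (2, 5.0);
""")

# Correlated form: the inner SELECT runs once per outer row
correlated = conn.execute("""
    SELECT t1.customer_id, t1.customer_name,
           (SELECT MAX(t2.amount) FROM transactions t2
            WHERE t2.customer_id = t1.customer_id) AS max_amount
    FROM customers t1 ORDER BY t1.customer_id
""").fetchall()

# JOIN form: the aggregation runs once over all transactions
joined = conn.execute("""
    SELECT t1.customer_id, t1.customer_name, t2.max_amount
    FROM customers t1
    JOIN (SELECT customer_id, MAX(amount) AS max_amount
          FROM transactions GROUP BY customer_id) t2
      ON t1.customer_id = t2.customer_id
    ORDER BY t1.customer_id
""").fetchall()

print(correlated == joined)
```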

  • Use HAVING only when necessary (prefer WHERE for filtering before aggregation).
  • Optimize CASE statements by reducing conditions or moving them to derived tables (or CTEs).

  • Use Common Table Expressions (CTEs) for readability but check performance impact.
  • Use temporary tables if a result is reused multiple times in a query.
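Both approaches side by side (SQLite via Python; data invented): the CTE keeps the query readable, while the temporary table materializes the aggregation once so later queries can reuse it without recomputing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (customer_id INTEGER, amount REAL);
INSERT INTO transactions VALUES (1, 10.0), (1, 25.0), (2, 5.0);
""")

# CTE: readable, but may be recomputed wherever it is referenced
cte_rows = conn.execute("""
    WITH spending AS (
        SELECT customer_id, SUM(amount) AS total
        FROM transactions GROUP BY customer_id
    )
    SELECT * FROM spending WHERE total > 9 ORDER BY customer_id
""").fetchall()

# Temporary table: computed once, reusable across many statements
conn.execute("""
    CREATE TEMP TABLE spending_tmp AS
    SELECT customer_id, SUM(amount) AS total
    FROM transactions GROUP BY customer_id
""")
tmp_rows = conn.execute(
    "SELECT * FROM spending_tmp WHERE total > 9 ORDER BY customer_id"
).fetchall()

print(cte_rows == tmp_rows)
```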


Materialized Views

  • Use materialized views for expensive aggregations that don’t change frequently. A materialized view (MV) stores its results physically, unlike a normal view, which recomputes the query every time. MVs do not update automatically; they must be refreshed periodically.

CREATE MATERIALIZED VIEW customer_spending AS
SELECT customer_id, SUM(amount) AS total_spent, COUNT(*) AS num_transactions
FROM transactions
GROUP BY customer_id;        
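Because the stored results go stale, schedule a periodic refresh (PostgreSQL syntax shown; `CONCURRENTLY` additionally requires a unique index on the view):

```sql
-- Recompute the stored results; add CONCURRENTLY to avoid blocking readers
REFRESH MATERIALIZED VIEW customer_spending;
```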

🚫 Avoid using when: Data updates frequently, real-time query results are required, or the dataset is small.


Joins Optimization

  • Choose appropriate join type (INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN).
  • Ensure indexed columns are used for joining.
  • Avoid Cartesian joins (unintended cross joins).
  • Filter data as early as possible using WHERE before joining.
  • Implement table partitioning for large datasets to improve query performance.
  • Consider sharding for horizontal scalability in distributed systems.
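The "filter early" advice can be sketched by pushing the WHERE into a derived table so the join sees only qualifying rows (SQLite via Python; data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, customer_name TEXT);
CREATE TABLE transactions (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO transactions VALUES (1, 10.0), (1, 250.0), (2, 5.0);
""")

# Filter inside the derived table, so only large transactions reach the join
rows = conn.execute("""
    SELECT c.customer_name, t.amount
    FROM (SELECT customer_id, amount
          FROM transactions
          WHERE amount > 100) t
    JOIN customers c ON c.customer_id = t.customer_id
""").fetchall()
print(rows)
```

Modern optimizers often push the predicate down automatically, but writing it this way makes the intent explicit and is essential when the predicate involves the derived table's own aggregates.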


Spark SQL

  • Always select only required columns to reduce the amount of data shuffled across nodes.
  • Use broadcast joins (the /*+ BROADCAST(table) */ hint) for small tables to avoid costly shuffle joins.
  • Ensure JOIN keys are of the same data type to avoid expensive type conversions.

  • Apply WHERE filters as early as possible to minimize data shuffling (Predicate Pushdown).
  • Ensure predicate pushdown works for data sources like Parquet, ORC, Delta.
  • Use GROUP BY with partitions or bucketed columns to optimize performance.
  • Partition tables on low-cardinality columns (a small number of distinct values, such as a date). Partitioning on high-cardinality columns creates too many small partitions; bucket on those columns instead.
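Several of these bullets can be combined in one Spark SQL query. The table and column names below are hypothetical; the BROADCAST hint is standard Spark SQL syntax:

```sql
-- The small customers table is broadcast to every executor, avoiding a shuffle join
SELECT /*+ BROADCAST(c) */ t.customer_id, c.customer_name, SUM(t.amount) AS total
FROM transactions t
JOIN customers c
  ON t.customer_id = c.customer_id     -- join keys should share the same data type
WHERE t.txn_date >= '2024-01-01'       -- early filter; pushed down for Parquet/ORC/Delta
GROUP BY t.customer_id, c.customer_name;
```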


Avoid Unnecessary Shuffles

  • Prefer window functions over GROUP BY when applicable.
  • Avoid cross joins unless necessary.
  • Cache or persist intermediate results (cache()/persist()) that are reused across multiple actions.
  • Adaptive Query Execution (AQE) was introduced in Apache Spark 3.0 and is enabled by default in Spark 3.2 and later.
  • EXPLAIN helps analyze query execution plans and identify performance bottlenecks.

Key things to look for:

  • Exchange operators → check for unnecessary shuffles.
  • Filter pushdown → ensure predicates are applied early.
  • Join strategies → use broadcast joins where applicable.
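In Spark SQL these checks start from the plan output. A sketch (table names hypothetical; EXPLAIN FORMATTED is available from Spark 3.0):

```sql
EXPLAIN FORMATTED
SELECT /*+ BROADCAST(c) */ t.customer_id, SUM(t.amount)
FROM transactions t
JOIN customers c ON t.customer_id = c.customer_id
GROUP BY t.customer_id;
-- In the output, look for Exchange nodes (shuffles), PushedFilters on the
-- file scan (predicate pushdown worked), and BroadcastHashJoin (the hint took effect).
```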


Bucketing

Bucketing helps by controlling the number of files within each partition and improving query performance by reducing shuffling.

When to Use Bucketing?

  • When you frequently join or aggregate on a specific column (e.g., customer_id, transaction_id).
  • When partitions alone are not enough to control small files.
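A sketch of a Spark SQL table that is both partitioned and bucketed; the names and bucket count are illustrative:

```sql
CREATE TABLE transactions_bucketed (
  customer_id BIGINT,
  amount      DOUBLE,
  txn_date    DATE
)
USING parquet
PARTITIONED BY (txn_date)                    -- low-cardinality column for pruning
CLUSTERED BY (customer_id) INTO 32 BUCKETS;  -- high-cardinality join key
```

Joins and aggregations on customer_id between tables bucketed the same way can then skip the shuffle.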

Bucketing and Partitioning

Partitioning and bucketing complement each other: partition on a low-cardinality column to prune data at read time, and bucket on a high-cardinality join key to control file counts and reduce shuffling.

