SQL Optimization

Indexing

  • Suitable for traditional SQL databases (MySQL, PostgreSQL, SQL Server, Oracle, etc.).
  • Ensure proper indexes exist on columns used in WHERE, JOIN, and ORDER BY clauses.
  • Use composite indexes if multiple columns are frequently queried together.
  • Avoid over-indexing, as it can slow down INSERT, UPDATE, and DELETE operations.
  • Avoid functions on indexed columns in WHERE (e.g., WHERE LOWER(name) = 'john' prevents index usage).
  • Use ORDER BY and GROUP BY on indexed columns to improve performance.

CREATE INDEX idx_customer_id  ON transactions (customer_id);        
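As a quick self-contained check (SQLite via Python here, purely for illustration; table and column names are invented), EXPLAIN QUERY PLAN shows the index being used for an equality filter, and a function wrapped around an indexed column forcing a full scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, name TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_customer_id ON transactions (customer_id)")
conn.execute("CREATE INDEX idx_name ON transactions (name)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][3]

# Equality on the indexed column: SQLite reports an index SEARCH
indexed = plan("SELECT * FROM transactions WHERE customer_id = 42")
# Function on the indexed column: the index cannot be used, so a full SCAN
scanned = plan("SELECT * FROM transactions WHERE LOWER(name) = 'john'")
print(indexed)
print(scanned)
```

The same effect shows up in MySQL/PostgreSQL `EXPLAIN` output; only the plan wording differs.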

Query Structure

  • Use SELECT specific columns instead of SELECT * to minimize data retrieval.
  • Use EXISTS instead of IN for subqueries when checking existence: EXISTS stops as soon as it finds a match, whereas IN evaluates the entire subquery result set first.
  • Avoid correlated subqueries; prefer JOIN instead. Correlated subqueries are inefficient because they execute once per row in the outer query, whereas a JOIN processes all data at once.
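A minimal sketch of the EXISTS pattern (SQLite via Python, purely for illustration; the sample data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, customer_name TEXT);
CREATE TABLE transactions (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cat');
INSERT INTO transactions VALUES (1, 10.0), (1, 25.0), (3, 7.5);
""")

# EXISTS can stop at the first matching transaction per customer
rows = conn.execute("""
    SELECT c.customer_name
    FROM customers c
    WHERE EXISTS (SELECT 1 FROM transactions t
                  WHERE t.customer_id = c.customer_id)
    ORDER BY c.customer_name
""").fetchall()
print(rows)  # customers with at least one transaction
```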

Correlated subqueries: 

SELECT t1.customer_id, t1.customer_name, 
       (SELECT MAX(t2.amount) 
        FROM transactions t2 
        WHERE t2.customer_id = t1.customer_id) AS max_amount
FROM customers t1;

Optimized: 
SELECT t1.customer_id, t1.customer_name, t2.max_amount
FROM customers t1
JOIN (SELECT customer_id, MAX(amount) AS max_amount 
      FROM transactions 
      GROUP BY customer_id) t2
ON t1.customer_id = t2.customer_id;        
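To confirm the rewrite is equivalent, a small runnable check (SQLite via Python; the sample data is invented). Note that the inner JOIN drops customers with no transactions, where the correlated version would return NULL; use a LEFT JOIN if those rows must be kept:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, customer_name TEXT);
CREATE TABLE transactions (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO transactions VALUES (1, 10.0), (1, 25.0), (2, 5.0);
""")

# Correlated form: the inner SELECT runs once per outer row
correlated = conn.execute("""
    SELECT t1.customer_id, t1.customer_name,
           (SELECT MAX(t2.amount) FROM transactions t2
            WHERE t2.customer_id = t1.customer_id) AS max_amount
    FROM customers t1 ORDER BY t1.customer_id
""").fetchall()

# JOIN form: the aggregation runs once over all transactions
joined = conn.execute("""
    SELECT t1.customer_id, t1.customer_name, t2.max_amount
    FROM customers t1
    JOIN (SELECT customer_id, MAX(amount) AS max_amount
          FROM transactions GROUP BY customer_id) t2
      ON t1.customer_id = t2.customer_id
    ORDER BY t1.customer_id
""").fetchall()

print(correlated == joined)
```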

  • Use HAVING only when necessary (prefer WHERE for filtering before aggregation).
  • Optimize CASE statements by reducing conditions or moving them to derived tables (or CTEs).

  • Use Common Table Expressions (CTEs) for readability but check performance impact.
  • Use temporary tables if a result is reused multiple times in a query.
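Both approaches side by side (SQLite via Python; data invented): the CTE keeps the query readable, while the temporary table materializes the aggregation once so later queries can reuse it without recomputing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (customer_id INTEGER, amount REAL);
INSERT INTO transactions VALUES (1, 10.0), (1, 25.0), (2, 5.0);
""")

# CTE: readable, but may be recomputed wherever it is referenced
cte_rows = conn.execute("""
    WITH spending AS (
        SELECT customer_id, SUM(amount) AS total
        FROM transactions GROUP BY customer_id
    )
    SELECT * FROM spending WHERE total > 9 ORDER BY customer_id
""").fetchall()

# Temporary table: computed once, reusable across many statements
conn.execute("""
    CREATE TEMP TABLE spending_tmp AS
    SELECT customer_id, SUM(amount) AS total
    FROM transactions GROUP BY customer_id
""")
tmp_rows = conn.execute(
    "SELECT * FROM spending_tmp WHERE total > 9 ORDER BY customer_id"
).fetchall()

print(cte_rows == tmp_rows)
```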


Materialized Views

  • Use materialized views for expensive aggregations that don’t change frequently. A materialized view (MV) stores its results physically, unlike a normal view, which recomputes the query every time. MVs do not update automatically; they must be refreshed periodically.

CREATE MATERIALIZED VIEW customer_spending AS
SELECT customer_id, SUM(amount) AS total_spent, COUNT(*) AS num_transactions
FROM transactions
GROUP BY customer_id;        
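Because the stored results go stale, schedule a periodic refresh (PostgreSQL syntax shown; `CONCURRENTLY` additionally requires a unique index on the view):

```sql
-- Recompute the stored results; add CONCURRENTLY to avoid blocking readers
REFRESH MATERIALIZED VIEW customer_spending;
```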

🚫 Avoid using when: Data updates frequently, real-time query results are required, or the dataset is small.


Joins Optimization

  • Choose appropriate join type (INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN).
  • Ensure indexed columns are used for joining.
  • Avoid Cartesian joins (unintended cross joins).
  • Filter data as early as possible using WHERE before joining.
  • Implement table partitioning for large datasets to improve query performance.
  • Consider sharding for horizontal scalability in distributed systems.
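The "filter early" advice can be sketched by pushing the WHERE into a derived table so the join sees only qualifying rows (SQLite via Python; data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, customer_name TEXT);
CREATE TABLE transactions (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO transactions VALUES (1, 10.0), (1, 250.0), (2, 5.0);
""")

# Filter inside the derived table, so only large transactions reach the join
rows = conn.execute("""
    SELECT c.customer_name, t.amount
    FROM (SELECT customer_id, amount
          FROM transactions
          WHERE amount > 100) t
    JOIN customers c ON c.customer_id = t.customer_id
""").fetchall()
print(rows)
```

Modern optimizers often push the predicate down automatically, but writing it this way makes the intent explicit and is essential when the predicate involves the derived table's own aggregates.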


Spark SQL

  • Always select only required columns to reduce the amount of data shuffled across nodes.
  • Use broadcast joins (the /*+ BROADCAST(table) */ hint) for small tables to avoid costly shuffle joins.
  • Ensure JOIN keys are of the same data type to avoid expensive type conversions.

  • Apply WHERE filters as early as possible to minimize data shuffling (Predicate Pushdown).
  • Ensure predicate pushdown works for data sources like Parquet, ORC, Delta.
  • Use GROUP BY with partitions or bucketed columns to optimize performance.
  • Partition tables on low-cardinality columns (a small number of distinct values, such as a date). Partitioning on high-cardinality columns creates too many small partitions; bucket on those columns instead.
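Several of these bullets can be combined in one Spark SQL query. The table and column names below are hypothetical; the BROADCAST hint is standard Spark SQL syntax:

```sql
-- The small customers table is broadcast to every executor, avoiding a shuffle join
SELECT /*+ BROADCAST(c) */ t.customer_id, c.customer_name, SUM(t.amount) AS total
FROM transactions t
JOIN customers c
  ON t.customer_id = c.customer_id     -- join keys should share the same data type
WHERE t.txn_date >= '2024-01-01'       -- early filter; pushed down for Parquet/ORC/Delta
GROUP BY t.customer_id, c.customer_name;
```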


Avoid Unnecessary Shuffles

  • Prefer window functions over GROUP BY when applicable.
  • Avoid cross joins unless necessary.
  • Cache or persist intermediate results (cache()/persist()) that are reused across multiple actions.
  • Adaptive Query Execution (AQE) was introduced in Apache Spark 3.0 and is enabled by default in Spark 3.2 and later.
  • EXPLAIN helps analyze query execution plans and identify performance bottlenecks.

Key things to look for:

  • Exchange operators → check for unnecessary shuffles.
  • Filter pushdown → ensure predicates are applied early.
  • Join strategies → use broadcast joins where applicable.
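In Spark SQL these checks start from the plan output. A sketch (table names hypothetical; EXPLAIN FORMATTED is available from Spark 3.0):

```sql
EXPLAIN FORMATTED
SELECT /*+ BROADCAST(c) */ t.customer_id, SUM(t.amount)
FROM transactions t
JOIN customers c ON t.customer_id = c.customer_id
GROUP BY t.customer_id;
-- In the output, look for Exchange nodes (shuffles), PushedFilters on the
-- file scan (predicate pushdown worked), and BroadcastHashJoin (the hint took effect).
```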


Bucketing

Bucketing helps by controlling the number of files within each partition and improving query performance by reducing shuffling.

When to Use Bucketing?

  • When you frequently join or aggregate on a specific column (e.g., customer_id, transaction_id).
  • When partitions alone are not enough to control small files.
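A sketch of a Spark SQL table that is both partitioned and bucketed; the names and bucket count are illustrative:

```sql
CREATE TABLE transactions_bucketed (
  customer_id BIGINT,
  amount      DOUBLE,
  txn_date    DATE
)
USING parquet
PARTITIONED BY (txn_date)                    -- low-cardinality column for pruning
CLUSTERED BY (customer_id) INTO 32 BUCKETS;  -- high-cardinality join key
```

Joins and aggregations on customer_id between tables bucketed the same way can then skip the shuffle.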

Bucketing and Partitioning

Partitioning and bucketing complement each other: partition on a low-cardinality column to prune data at read time, and bucket on a high-cardinality join key to control file counts and reduce shuffling.

