Indexing
- Suitable for traditional SQL databases (MySQL, PostgreSQL, SQL Server, Oracle, etc.
- Ensure proper indexes exist on columns used in WHERE, JOIN, and ORDER BY clauses.
- Use composite indexes if multiple columns are frequently queried together.
- Avoid over-indexing, as it can slow down INSERT, UPDATE, and DELETE operations.
- Avoid functions on indexed columns in WHERE (e.g., WHERE LOWER(name) = 'john' prevents index usage).
- Use ORDER BY and GROUP BY on indexed columns to improve performance.
CREATE INDEX idx_customer_id ON transactions (customer_id);
Query Structure
- Use SELECT specific columns instead of SELECT * to minimize data retrieval.
- Use EXISTS instead of IN for subqueries when checking existence. EXISTS stops checking as soon as it finds a match. IN evaluates the entire subquery result set first
- Avoid correlated subqueries; prefer JOIN instead. Correlated subqueries are inefficient because they execute once per row in the outer query, whereas a JOIN processes all data at once.
Correlated subqueries:
SELECT t1.customer_id, t1.customer_name,
(SELECT MAX(t2.amount)
FROM transactions t2
WHERE t2.customer_id = t1.customer_id) AS max_amount
FROM customers t1;
Optimized:
SELECT t1.customer_id, t1.customer_name, t2.max_amount
FROM customers t1
JOIN (SELECT customer_id, MAX(amount) AS max_amount
FROM transactions
GROUP BY customer_id) t2
ON t1.customer_id = t2.customer_id;
- Use HAVING only when necessary (prefer WHERE for filtering before aggregation).
- Optimize CASE statements by reducing conditions or moving them to derived tables(or CTE).
- Use Common Table Expressions (CTEs) for readability but check performance impact.
- Use temporary tables if a result is reused multiple times in a query.
Materialized views
- Use materialized views for expensive aggregations that don’t change frequently. A Materialized View (MV) stores the results physically, unlike a normal view, which recomputes the query every time.MVs do not update automatically, they must be refreshed periodically.
CREATE MATERIALIZED VIEW customer_spending AS
SELECT customer_id, SUM(amount) AS total_spent, COUNT(*) AS num_transactions
FROM transactions
GROUP BY customer_id;
🚫 Avoid using when: Data updates frequently, real-time query results are required, or the dataset is small.
Joins Optimization
- Choose appropriate join type (INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN).
- Ensure indexed columns are used for joining.
- Avoid Cartesian joins (unintended cross joins).
- Filter data as early as possible using WHERE before joining.
- Implement table partitioning for large datasets to improve query performance.
- Consider sharding for horizontal scalability in distributed systems.
- Always select only required columns to reduce the amount of data shuffled across nodes.
- Use broadcast joins (BROADCAST JOIN) for small tables to avoid costly shuffle joins.
- Ensure JOIN keys are of the same data type to avoid expensive type conversions.
- Apply WHERE filters as early as possible to minimize data shuffling (Predicate Pushdown).
- Ensure predicate pushdown works for data sources like Parquet, ORC, Delta.
- Use GROUP BY with partitions or bucketed columns to optimize performance.
- Partition tables on high-cardinality(large number of unique values) columns to improve query performance.
Avoid Unnecessary Shuffles
- Prefer window functions over GROUP BY when applicable
- Avoid cross joins unless necessary.
- Use Caching & Persisting
- Adaptive Query Execution (AQE) was introduced in Apache Spark 3.0 and is enabled by default in Spark 3.2 and later.
- EXPLAIN helps analyze query execution plans and identify performance bottlenecks.
✔ Exchange operators → Check for unnecessary shuffles.
✔ Filter pushdown → Ensure predicates are applied early.
✔ Join strategies → Use broadcast joins where applicable
Bucketing helps by controlling the number of files within each partition and improving query performance by reducing shuffling.
When to Use Bucketing?
- When you frequently join or aggregate on a specific column (customer_id, transaction_id).
- When partitions alone are not enough to control small files.
Bucketing and partitioning