Thinking in SQL Mathematically

Jody Hesch

Published Apr 2, 2025

+ Follow

Intro

I wanted to quickly share one of my favorite questions to put in Data Engineering exams when I'm hiring.

It's a deceptively simple question that, once solved, leads to some pretty powerful insights:

1) How to adopt a "first principles" mindset when it comes to coding.

2) How to think of SQL, and optimization problems in particular, mathematically.

3) How to think in terms of set operations rather than row-level operations (which usually leads to better optimized SQL).

The question is just about how to rewrite range predicates.

Here goes!

Data Engineering Exam Question

"Imagine you've analyzed a slow-running query and found that the bottleneck, for whatever reason, is how the database optimizer is handling a range predicate in a join. More specifically, you find that this SQL condition is being evaluated particularly slowly by the database

...WHERE A BETWEEN X AND Y

If your query absolutely requires that same logic, but you have no choice but to try and rewrite it in the hopes of better leveraging the underlying query engine, how would you do so? What would your resulting SQL look like?

(This is based on a real-world optimization problem that involved a range join from SAP hierarchy tables, i.e. - SELECT ... FROM... WHERE CEPC.PRCTR BETWEEN SETLEAF.VALFROM AND SETLEAF.VALTO)

Start from this query (pseudocode):

SELECT ... FROM ... WHERE A BETWEEN X AND Y;

Hint: this is not about trying to outsmart the SQL optimizer, but simply get past an apparent bug in the optimizer. In other words, this is simply an exercise asking you to think about how you could rewrite the logic as an attempt to try another approach.

Solution

Don't keep reading right away. Spend 5 minutes thinking through the problem. 😉

Ok, here is the solution approach with comments. Each step refactors the logic slightly while maintaining functional equivalence.I generally award points for each step that a candidate provides, i.e. it's worth giving credit even for just steps 2 and/or 3, IMO.

 -- starting point 
1) SELECT... WHERE A BETWEEN X AND Y;

-- rewrite the predicate explicitly
2) SELECT... WHERE A >= X AND A <= Y; 

-- explicitly expand the otherwise implicit boolean operators
3) SELECT... (WHERE A > X OR A = X) AND (A < Y OR A = Y) 

-- the secret sauce: refactor boolean operators as set operators 
4) (SELECT... WHERE A > X UNION ALL SELECT... WHERE A = X) INTERSECT (SELECT... WHERE A < Y UNION ALL SELECT... WHERE A = Y)

Discussion

First Principles

The first step of rewriting the BETWEEN condition explicitly demonstrates an effort to remove the "syntactic sugar" of SQL to get at the root logic, to start examining how it could potentially be re-written.

Recommended by LinkedIn

What is SQL and What is the Importance of Learning it?

khadar valli 3 years ago

Comprehensive Guide to SQL: From Basics to Advanced…

Vivek Kumar 1 year ago

The Roadmap to Learning SQL

Ruth Yakubu 3 months ago

The next step of explicitly pulling out the boolean OR operators that are otherwise implicit to <= and >= follows this same line of thinking.

When it comes to optimizing code (or even optimizing physical objects, processes, etc.) it often helps to simplify/reduce what you're working on to its most basic components to then see how you might rearrange/reconfigure them (in a way that remains functionally equivalent).

Thinking Mathematically

Optimizing code requires rewriting code in a way that maintains its functional logic. This is the exact same process we all learned in high school when it came to writing proofs in Geometry, Algebra, and other classes.

You first show that A = B, then that B = C, and then you can finally demonstrate that A = C. By maintaining functional equivalence, you ensure the quality of your optimization without sacrificing logical consistency by trying to "vibe optimize" your code. 😎

Thinking in Sets

Many Data Engineers don't have formal training in set theory, and that's perfectly okay. Databases by definition are supposed make declarative logic simple to implement and auto-optimizing (to an extent).

That being said, anyone who's spent more than a couple years in industry has come to learn that there are nonetheless important patterns for writing SQL code efficiently, and one of the core such patterns is to consider writing logic in a set-based fashion rather than a row-based fashion.

What the code above does is demonstrates one way of doing so which takes advantage of the formal definition of UNION and INTERSECT set operations.

You can reference the linked Wikipedia articles above for a more detailed explanation, but what they basically say, in SQL terms, is that a boolean operator evaluated between records can be rewritten as a set operator between tables (or more generally - result sets, i.e. results of queries).

Here is the other formula in the case of set intersection. (The U above in the formula above represents the set operator UNION, and the upside down U in the formula below represents the set operator INTERSECT, both of which are represented directly in SQL as-is.)

(Just a friendly reminder in SQL that UNION will remove duplicate records, which in our case above, we don't want - hence we used UNION ALL).

Final Thoughts

Please, please, do NOT go rewriting all of your range joins with this logic. Most database optimizers, most of the time, do a pretty good job optimizing reasonably well-written SQL, and the code in the solution above is obviously excessively verbose.

The point of this exercise, as indicated in the hint above, is simply about getting you to think through other ways to conceptualize your logic, so that if/when/as you start thinking about optimization, you have reasonable things to try out.

That all being said, refactoring row-level logic as set-based logic is an overall strategy that can be very helpful in many other contexts when optimizing database queries, so the overall idea still holds. In the future, I will be blogging about more such examples.

Extra Credit

Some databases, especially newer ones, don't yet have an INTERSECT operator. In such cases, how might you further refactor the code to implement the functional equivalent of INTERSECT?

I'll gladly share the answer with anyone curious, just send me a note! 😃

Happy coding!

Ahmed Elhenawy 1mo

Spot on. This is huge for growth right now. Thanks Jody!

1 Reaction

Gordon Weakliem 1y

regarding the intersect operator, didn't you give the definition above? intersect => x in A and x in B and ...

1 Reaction

Ramona C. Truta, MSc 1y

And a quick primer I wrote on sets vs bags. SQL implements bag semantics; the relational model is rooted in set theory. https://www.garudax.id/posts/ramonactruta_datafundamentals-datawithheart-activity-7282547896242094080-rGL9?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAHkY7kB_HKb2pfub5YnxVY_MaKPfsRRVCA

1 Reaction

Brian Shanny 1y

I'd love to see real world examples of performance gains using the approach outlined here. It seems more likely to harm performance, especially on large tables (or complex queries). Plus, intersect will only return distinct rows, not necessarily the same result set.

Thinking in SQL Mathematically

Jody Hesch

Intro

Data Engineering Exam Question

Solution

Discussion

First Principles

Recommended by LinkedIn

Thinking Mathematically

Thinking in Sets

Final Thoughts

Extra Credit

More articles by Jody Hesch

Others also viewed

The Role of Mathematics in SQL: A Powerful Combination for Data Management

Complete SQL Tutorial: From Basics to Advanced

The Role of SQL in the Data Analyst’s Toolbox

My Journey into SQL: Unlocking the Power of Data and Data Manipulation

SQL MASTERY

SQL Tutorial: Dive into the World of Databases

How to Learn SQL for Data Analytics: A Comprehensive Guide

SQL Basics: Your Guide to Getting Started with Databases

Unlocking the Power of SQL: An Introduction to Data Management

SQL Order of Execution

Explore content categories

Intro

Data Engineering Exam Question

Solution

Discussion

First Principles

Recommended by LinkedIn

Thinking Mathematically

Thinking in Sets

Final Thoughts

Extra Credit

More articles by Jody Hesch

Tied With A Bow: Wrapping Up the Hierarchy Discussion (Part 6 of 6)

Edge Cases: Handling Ragged and Unbalanced Hierarchies (Part 5 of 6)

Flat Out: Introducing Level Hierarchies (4 of 6)

Family Matters: Introducing Parent-Child Hierarchies (3 of 6)

More Than Pipelines: DAGs as Precursors to Hierarchies (2 of 6)

Nodes, Edges and Graphs: Providing Context for Hierarchies (1 of 6)

Reverse Engineering a Source System - Metadata-Driven Code Generation (3 of 5)

Reverse Engineering a Source System - Metadata (2 of 5)

Reverse Engineering a Source System - Data Model (1 of 5)

Others also viewed

The Role of Mathematics in SQL: A Powerful Combination for Data Management

Complete SQL Tutorial: From Basics to Advanced

The Role of SQL in the Data Analyst’s Toolbox

My Journey into SQL: Unlocking the Power of Data and Data Manipulation

SQL MASTERY

SQL Tutorial: Dive into the World of Databases

How to Learn SQL for Data Analytics: A Comprehensive Guide

SQL Basics: Your Guide to Getting Started with Databases

Unlocking the Power of SQL: An Introduction to Data Management

SQL Order of Execution

Explore content categories