Thinking in SQL Mathematically
Intro
I wanted to quickly share one of my favorite questions to put in Data Engineering exams when I'm hiring.
It's a deceptively simple question that, once solved, leads to some pretty powerful insights:
1) How to adopt a "first principles" mindset when it comes to coding.
2) How to think of SQL, and optimization problems in particular, mathematically.
3) How to think in terms of set operations rather than row-level operations (which usually leads to better optimized SQL).
The question is just about how to rewrite range predicates.
Here goes!
Data Engineering Exam Question
"Imagine you've analyzed a slow-running query and found that the bottleneck, for whatever reason, is how the database optimizer is handling a range predicate in a join. More specifically, you find that this SQL condition is being evaluated particularly slowly by the database
...WHERE A BETWEEN X AND Y
If your query absolutely requires that same logic, but you have no choice but to try and rewrite it in the hopes of better leveraging the underlying query engine, how would you do so? What would your resulting SQL look like?
(This is based on a real-world optimization problem that involved a range join from SAP hierarchy tables, i.e. - SELECT ... FROM... WHERE CEPC.PRCTR BETWEEN SETLEAF.VALFROM AND SETLEAF.VALTO)
Start from this query (pseudocode):
SELECT ... FROM ... WHERE A BETWEEN X AND Y;
Hint: this is not about trying to outsmart the SQL optimizer, but simply get past an apparent bug in the optimizer. In other words, this is simply an exercise asking you to think about how you could rewrite the logic as an attempt to try another approach.
Solution
Don't keep reading right away. Spend 5 minutes thinking through the problem. 😉
Ok, here is the solution approach with comments. Each step refactors the logic slightly while maintaining functional equivalence.I generally award points for each step that a candidate provides, i.e. it's worth giving credit even for just steps 2 and/or 3, IMO.
-- starting point
1) SELECT... WHERE A BETWEEN X AND Y;
-- rewrite the predicate explicitly
2) SELECT... WHERE A >= X AND A <= Y;
-- explicitly expand the otherwise implicit boolean operators
3) SELECT... (WHERE A > X OR A = X) AND (A < Y OR A = Y)
-- the secret sauce: refactor boolean operators as set operators
4) (SELECT... WHERE A > X UNION ALL SELECT... WHERE A = X) INTERSECT (SELECT... WHERE A < Y UNION ALL SELECT... WHERE A = Y)
Discussion
First Principles
The first step of rewriting the BETWEEN condition explicitly demonstrates an effort to remove the "syntactic sugar" of SQL to get at the root logic, to start examining how it could potentially be re-written.
Recommended by LinkedIn
The next step of explicitly pulling out the boolean OR operators that are otherwise implicit to <= and >= follows this same line of thinking.
When it comes to optimizing code (or even optimizing physical objects, processes, etc.) it often helps to simplify/reduce what you're working on to its most basic components to then see how you might rearrange/reconfigure them (in a way that remains functionally equivalent).
Thinking Mathematically
Optimizing code requires rewriting code in a way that maintains its functional logic. This is the exact same process we all learned in high school when it came to writing proofs in Geometry, Algebra, and other classes.
You first show that A = B, then that B = C, and then you can finally demonstrate that A = C. By maintaining functional equivalence, you ensure the quality of your optimization without sacrificing logical consistency by trying to "vibe optimize" your code. 😎
Thinking in Sets
Many Data Engineers don't have formal training in set theory, and that's perfectly okay. Databases by definition are supposed make declarative logic simple to implement and auto-optimizing (to an extent).
That being said, anyone who's spent more than a couple years in industry has come to learn that there are nonetheless important patterns for writing SQL code efficiently, and one of the core such patterns is to consider writing logic in a set-based fashion rather than a row-based fashion.
What the code above does is demonstrates one way of doing so which takes advantage of the formal definition of UNION and INTERSECT set operations.
You can reference the linked Wikipedia articles above for a more detailed explanation, but what they basically say, in SQL terms, is that a boolean operator evaluated between records can be rewritten as a set operator between tables (or more generally - result sets, i.e. results of queries).
Here is the other formula in the case of set intersection. (The U above in the formula above represents the set operator UNION, and the upside down U in the formula below represents the set operator INTERSECT, both of which are represented directly in SQL as-is.)
(Just a friendly reminder in SQL that UNION will remove duplicate records, which in our case above, we don't want - hence we used UNION ALL).
Final Thoughts
Please, please, do NOT go rewriting all of your range joins with this logic. Most database optimizers, most of the time, do a pretty good job optimizing reasonably well-written SQL, and the code in the solution above is obviously excessively verbose.
The point of this exercise, as indicated in the hint above, is simply about getting you to think through other ways to conceptualize your logic, so that if/when/as you start thinking about optimization, you have reasonable things to try out.
That all being said, refactoring row-level logic as set-based logic is an overall strategy that can be very helpful in many other contexts when optimizing database queries, so the overall idea still holds. In the future, I will be blogging about more such examples.
Extra Credit
Some databases, especially newer ones, don't yet have an INTERSECT operator. In such cases, how might you further refactor the code to implement the functional equivalent of INTERSECT?
I'll gladly share the answer with anyone curious, just send me a note! 😃
Happy coding!
Spot on. This is huge for growth right now. Thanks Jody!
regarding the intersect operator, didn't you give the definition above? intersect => x in A and x in B and ...
And a quick primer I wrote on sets vs bags. SQL implements bag semantics; the relational model is rooted in set theory. https://www.garudax.id/posts/ramonactruta_datafundamentals-datawithheart-activity-7282547896242094080-rGL9?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAHkY7kB_HKb2pfub5YnxVY_MaKPfsRRVCA
I'd love to see real world examples of performance gains using the approach outlined here. It seems more likely to harm performance, especially on large tables (or complex queries). Plus, intersect will only return distinct rows, not necessarily the same result set.