My two cents on SQL: the IN condition

Pushpinder Singh Arora

Published Jun 24, 2020

I learnt something new today, by encountering a perplexing problem (for me) while creating a new visualization for my client. I was crunching some data using SQL, and here's a summary of what happened:

I was trying to create a dimension table, which would contain certain ids, with their respective columns. These ids were to be checked against another table. The logic was: keep only those ids in the final table, which don't exist in another table (I was going to union the second table to make a mutually exclusive and completely exhaustive set of ids).

I was left bewildered when I found out that the ids were not showing the looked up dimensions. I prodded a little and found out that my query wasn't fetching the desired results. The query was:

SELECT DISTINCT

id,
dim1,
dim2,
.
.
FROM table1 

WHERE table1.id NOT IN (SELECT DISTINCT id FROM table2)

-- to make this table exclusive of table2 ids

The missing data in question (a particular id say 12345) was not to be found in the result! And hence it wasn't carrying the required dimensions to wherever I was using this dimension table.

The id 12345 was definitely present in table1 (confirmed by an independent query); It wasn't present in table2 (confirmed by another independent query); Then how on earth is this simple exclusion condition deleting the 12345 id from the result?

It took my whole morning, firng queries, verifying data to find out what was happening here. Finally a google search brought me stackoverflow, wherein, people had discussed about the perils of "IN" (and NOT IN) query.

Apparently, the query fails whenever there are NULL values involved in any of the datasets. IN (hence NOT IN) condition returns a NULL whenever it encounters a NULL in the data set.

I think what was happening here was:

There were NULL values in table2. Id 12345, present in table1, when compared to table2 ids through IN condition, encountered a NULL value in one of the rows. It immediately returned a NULL. It was happening to all the ids in table1.

The culprit was caught red handed. There's a very simple fix to this situation: make each table free of NULL values.

Final query (which worked):

SELECT DISTINCT

id,
dim1,
dim2,
.
.

FROM table1 
WHERE table1.id NOT IN (SELECT DISTINCT id FROM table2 where id IS NOT NULL)

There! It took me, a good half of my day, music from Hans Zimmer, then Dans Macabre and finaly Vivaldi's Four Seasons and of course the issue explanation on Stackoverflow to finally solve this!

My two cents on SQL: the IN condition

Pushpinder Singh Arora

More articles by Pushpinder Singh Arora

Others also viewed

TIPS ON SOME SQL FUNCTIONS.

Understanding the Difference Between ON and USING in SQL Joins

The story of how I optimized a SQL code to improve our data analysis in my company

Dealing with Missing Data in SQL: NVL and COALESCE Explained

Static SQL Checks for BIP - Part 4: Divide-by-Zero and Functions in WHERE

SQL vs RLA - Round 2

⚡ SQL Query Optimization — Practical Techniques & Real Examples #SQLServer #QueryOptimization #DatabasePerformance #MSSQL #DataEngineering #TSQL #Perf

Unlock the Power of CTEs: Simplify and Supercharge Your SQL Queries!

SQL Day 27/50

Explore content categories