My two cents on SQL: the IN condition

I learnt something new today, by encountering a perplexing problem (for me) while creating a new visualization for my client. I was crunching some data using SQL, and here's a summary of what happened:

I was trying to create a dimension table, which would contain certain ids, with their respective columns. These ids were to be checked against another table. The logic was: keep only those ids in the final table, which don't exist in another table (I was going to union the second table to make a mutually exclusive and completely exhaustive set of ids).

I was left bewildered when I found out that the ids were not showing the looked up dimensions. I prodded a little and found out that my query wasn't fetching the desired results. The query was:

SELECT DISTINCT

id,
dim1,
dim2,
.
.
FROM table1 

WHERE table1.id NOT IN (SELECT DISTINCT id FROM table2)

-- to make this table exclusive of table2 ids

The missing data in question (a particular id say 12345) was not to be found in the result! And hence it wasn't carrying the required dimensions to wherever I was using this dimension table.

The id 12345 was definitely present in table1 (confirmed by an independent query); It wasn't present in table2 (confirmed by another independent query); Then how on earth is this simple exclusion condition deleting the 12345 id from the result?

It took my whole morning, firng queries, verifying data to find out what was happening here. Finally a google search brought me stackoverflow, wherein, people had discussed about the perils of "IN" (and NOT IN) query.

Apparently, the query fails whenever there are NULL values involved in any of the datasets. IN (hence NOT IN) condition returns a NULL whenever it encounters a NULL in the data set.

I think what was happening here was:

There were NULL values in table2. Id 12345, present in table1, when compared to table2 ids through IN condition, encountered a NULL value in one of the rows. It immediately returned a NULL. It was happening to all the ids in table1.

The culprit was caught red handed. There's a very simple fix to this situation: make each table free of NULL values.

Final query (which worked):

SELECT DISTINCT

id,
dim1,
dim2,
.
.

FROM table1 
WHERE table1.id NOT IN (SELECT DISTINCT id FROM table2 where id IS NOT NULL)

There! It took me, a good half of my day, music from Hans Zimmer, then Dans Macabre and finaly Vivaldi's Four Seasons and of course the issue explanation on Stackoverflow to finally solve this!

Here's the link to the Stackoverflow page where I found the result:


To view or add a comment, sign in

More articles by Pushpinder Singh Arora

  • Returning Customers vs New Customers Analysis using SQL

    In most of the marketing literature, one finds that the benefits of retaining a customer are more than acquiring a new…

    2 Comments
  • A marketing analysis of IPL

    Indian Premier League (IPL) is a cricket playing league in India. It was founded by the board of control for cricket in…

  • Experience of a first time speaker…

    “Speakers who talk about what life has taught them never fail to keep the attention of their listeners.” ~ Dale…

  • Data Analytics: An Outsider’s Take

    Data. The Merriam-Websters dictionary defines the word as: factual information (such as measurements or statistics)…

Others also viewed

Explore content categories