the "Sin"​ of the Inner Join
INNER JOIN SIN

the "Sin" of the Inner Join

People like to use INNER JOIN when connecting different tables together to get more information for analysis in a relational database sense. If you are one of them, you have to keep reading because you might have done something inappropriate to prevent you from getting what you are after.

The INNER JOIN can potentially hide some valuable information that you are after.

For everyone working in the data industry, the most common thing we do every day is using a lot of "JOIN"s. As a sceptical person, I have an obsessive preference of using LEFT JOIN/RIGHT JOIN when merging different data source together because I am very afraid of losing valuable information when using “INNER JOIN” and I am a perfectionist. I pursuit to acquire the most accurate real-world picture as I can. 

Let's take a closer look to understand why I say "losing" information when using the INNER JOIN?

Imagine we are having a staff_fact_table and a staff_type_dimension_table, and we need to find out the distribution of all the staff by staff_type.

#simple Fact table

staff_id   staff_name   staff_type_code 
 1         Mike         T1             
 2         Minnie       T2             
 3         Donald       T3


#simple Dimension Table

staff_type_code     staff_type_desc   
    T1              Data Analyst   
    T2              HR Manager

By applying inner join to the fact and dimension table, we will LOSE staff "Donald", only two staff left because there is no T3 code in the dimension code. The INNER JOIN outcome is as below:

staff_id   staff_name   staff_type_code   staff_type_desc
1          Mike         T1                Data Analyst   
2          Minnie       T2                HR Manager

However, if we use either LEFT JOIN or RIGHT JOIN we will be able to capture a better picture in a sense of not kicking anyone out of our company. The LEFT JOIN outcome follows:

staff_id 	staff_name 	staff_type_code 	staff_type_desc
1        	Mike       	T1              	Data Analyst   
2        	Minnie     	T2              	HR Manager     
3        	Donald     	T3              	<NA>

In the latter case by deploying the LEFT JOIN, we can easily find out what critical information we missed in the dimension table. Therefore, we can get back to our data source and fill up the gap quickly. Unfortunately, this can NOT be notified by using the inner join.

My example above does not provide you with an exciting discovery because we only have three records. But what if we are querying a dataset with millions of records, if that case, all the T3 staff will be missing. I believe someone will not be happy!

I cannot remember when I used INNER JOIN last time. Thanks to my personality, I have always liked LEFT JOIN/RIGHT JOIN more, which gave me a third eye to identify data gap and improved the data quality in my data career.

To increase your confidence in your data and masterpiece conclusions, it is just a simple change from INNER to LEFT/RIGHT.


If you have other thoughts, you are more than welcome to contact me.



It why the default option at Power BI, for instance is left join :). This enable you to have insights about the data you are dealing with, finding possibly data inconsistencies and make you think about creating process to avoid, in first instance, or at least highlight those cases to be fixed

To view or add a comment, sign in

Others also viewed

Explore content categories