the "Sin" of the Inner Join
People like to use INNER JOIN when connecting different tables together to get more information for analysis in a relational database sense. If you are one of them, you have to keep reading because you might have done something inappropriate to prevent you from getting what you are after.
The INNER JOIN can potentially hide some valuable information that you are after.
For everyone working in the data industry, the most common thing we do every day is using a lot of "JOIN"s. As a sceptical person, I have an obsessive preference of using LEFT JOIN/RIGHT JOIN when merging different data source together because I am very afraid of losing valuable information when using “INNER JOIN” and I am a perfectionist. I pursuit to acquire the most accurate real-world picture as I can.
Let's take a closer look to understand why I say "losing" information when using the INNER JOIN?
Imagine we are having a staff_fact_table and a staff_type_dimension_table, and we need to find out the distribution of all the staff by staff_type.
#simple Fact table
staff_id staff_name staff_type_code
1 Mike T1
2 Minnie T2
3 Donald T3
#simple Dimension Table
staff_type_code staff_type_desc
T1 Data Analyst
T2 HR Manager
By applying inner join to the fact and dimension table, we will LOSE staff "Donald", only two staff left because there is no T3 code in the dimension code. The INNER JOIN outcome is as below:
staff_id staff_name staff_type_code staff_type_desc
1 Mike T1 Data Analyst
2 Minnie T2 HR Manager
However, if we use either LEFT JOIN or RIGHT JOIN we will be able to capture a better picture in a sense of not kicking anyone out of our company. The LEFT JOIN outcome follows:
staff_id staff_name staff_type_code staff_type_desc
1 Mike T1 Data Analyst
2 Minnie T2 HR Manager
3 Donald T3 <NA>
In the latter case by deploying the LEFT JOIN, we can easily find out what critical information we missed in the dimension table. Therefore, we can get back to our data source and fill up the gap quickly. Unfortunately, this can NOT be notified by using the inner join.
My example above does not provide you with an exciting discovery because we only have three records. But what if we are querying a dataset with millions of records, if that case, all the T3 staff will be missing. I believe someone will not be happy!
I cannot remember when I used INNER JOIN last time. Thanks to my personality, I have always liked LEFT JOIN/RIGHT JOIN more, which gave me a third eye to identify data gap and improved the data quality in my data career.
To increase your confidence in your data and masterpiece conclusions, it is just a simple change from INNER to LEFT/RIGHT.
If you have other thoughts, you are more than welcome to contact me.
It why the default option at Power BI, for instance is left join :). This enable you to have insights about the data you are dealing with, finding possibly data inconsistencies and make you think about creating process to avoid, in first instance, or at least highlight those cases to be fixed