Differences between Azure Databricks SQL and Microsoft T-SQL: LEFT SEMI JOIN and LEFT ANTI JOIN
Azure Databricks SQL has several types of joins that are not directly supported in Microsoft T-SQL. One of these is the LEFT SEMI JOIN. It differs from the LEFT OUTER JOIN in that it only returns data from the first table, without duplicates if the joining IDs appear multiple times in the second table. Another join type is the LEFT ANTI JOIN, which returns records from the first table that do NOT have matching records in the second table.
Sample Data
Table1:
Table2:
LEFT OUTER JOIN
returns the following results in either Azure Databricks SQL or Microsoft T-SQL:
LEFT SEMI JOIN
returns the following results in Azure Databricks SQL:
Differences from LEFT OUTER JOIN:
To achieve the same results from Microsoft T-SQL would require a query like this:
Recommended by LinkedIn
LEFT ANTI JOIN
produces the following results in Azure Databricks SQL:
Differences from LEFT OUTER JOIN:
The equivalent query in Microsoft T-SQL would be:
More complex example
Azure Databricks version:
Microsoft T-SQL version:
The Azure Databricks example is more readable, because the joining conditions for Table2 and Table3 are next to the joining conditions for the other tables, in the FROM clause, rather than being buried among the rest of the WHERE conditions. Also, the Azure Databricks example will be more efficient, because it will filter Table1 using Table2 and Table3 before joining to Table4 and Table5, while the T-SQL version will join the full contents of Table1 with Table4 and Table5, and only filter the results using Table2 and Table3 later, when the WHERE clause is evaluated.
Conclusion
If you need to filter the contents of one table using another table in Azure Databricks, using LEFT SEMI JOIN or LEFT ANTI JOIN instead of LEFT OUTER JOIN will produce more efficient code that is easier to read.
That feels so weird, a join statement that doesn’t return all columns from joined tables. Why not all nulls like outer join? Anyway, there are reasons why this is not valid SQL. Semi- and anti- joins are filters not products. They belong to the where clause.
Thank you, interesting read.