Differences between Azure Databricks SQL and Microsoft T-SQL: LEFT SEMI JOIN and LEFT ANTI JOIN

Differences between Azure Databricks SQL and Microsoft T-SQL: LEFT SEMI JOIN and LEFT ANTI JOIN

Azure Databricks SQL has several types of joins that are not directly supported in Microsoft T-SQL. One of these is the LEFT SEMI JOIN. It differs from the LEFT OUTER JOIN in that it only returns data from the first table, without duplicates if the joining IDs appear multiple times in the second table. Another join type is the LEFT ANTI JOIN, which returns records from the first table that do NOT have matching records in the second table.

Sample Data

Table1:

Sample data - Table 1

Table2:

Sample data - Table 2

LEFT OUTER JOIN

Left Outer Join query:
SELECT *
FROM Table1 a
LEFT OUTER JOIN Table2 b ON a.ID = b.ID;

returns the following results in either Azure Databricks SQL or Microsoft T-SQL:

Left Outer Join

LEFT SEMI JOIN

Left Semi Join query in Azure Databricks:
SELECT *
FROM Table1 a
LEFT SEMI JOIN Table2 b ON a.ID = b.ID;

returns the following results in Azure Databricks SQL:

Left Semi Join

Differences from LEFT OUTER JOIN:

  1. The rows from Table1 with ID = 3 are omitted, because there is no match for them in Table 2.
  2. The rows from Table1 with ID = 4 are not duplicated, because the query only checks to see if there is at least one match in Table2; it does not return all matches.
  3. The columns from Table2 are omitted, because they are only used for filtering the data from Table1.

 To achieve the same results from Microsoft T-SQL would require a query like this:

LEFT SEMI JOIN query equivalent in T-SQL:
SELECT *
FROM Table1 a
WHERE EXISTS (SELECT * FROM Table2 b WHERE a.ID = b.ID);

LEFT ANTI JOIN

Left Anti Join query:
SELECT *
FROM Table1 a
LEFT ANTI JOIN Table2 b ON a.ID = b.ID;

produces the following results in Azure Databricks SQL:

Left Anti Join

Differences from LEFT OUTER JOIN:

  1. Only the rows from Table1 with ID = 3 are included, because they do NOT have a match in Table2.
  2. The columns from Table2 are omitted, because they are only used for filtering the data from Table1.

 The equivalent query in Microsoft T-SQL would be:

LEFT ANTI JOIN query equivalent in T-SQL:
SELECT *
FROM Table1 a
WHERE NOT EXISTS (SELECT * FROM Table2 b WHERE a.ID = b.ID);

More complex example

Azure Databricks version:

More complex Databricks example:
SELECT *
FROM Table1 a
LEFT SEMI JOIN Table2 b ON a.ID = b.ID
LEFT ANTI JOIN Table3 c ON a.ID = c.ID
LEFT OUTER JOIN Table4 d ON a.ID = d.ID
LEFT OUTER JOIN Table5 e ON a.ID = e.ID
WHERE a.X = 5
AND a.Y = 7;

Microsoft T-SQL version:

More complex Microsoft T-SQL example:
SELECT *
FROM Table1 a
LEFT OUTER JOIN Table4 d ON a.ID = d.ID
LEFT OUTER JOIN Table5 e ON a.ID = e.ID
WHERE a.X = 5
AND a.Y = 7
AND EXISTS (SELECT * FROM Table2 b WHERE a.ID = b.ID)
AND NOT EXISTS (SELECT * FROM Table3 c WHERE a.ID = c.ID);

The Azure Databricks example is more readable, because the joining conditions for Table2 and Table3 are next to the joining conditions for the other tables, in the FROM clause, rather than being buried among the rest of the WHERE conditions. Also, the Azure Databricks example will be more efficient, because it will filter Table1 using Table2 and Table3 before joining to Table4 and Table5, while the T-SQL version will join the full contents of Table1 with Table4 and Table5, and only filter the results using Table2 and Table3 later, when the WHERE clause is evaluated.

 Conclusion

If you need to filter the contents of one table using another table in Azure Databricks, using LEFT SEMI JOIN or LEFT ANTI JOIN instead of LEFT OUTER JOIN will produce more efficient code that is easier to read.

That feels so weird, a join statement that doesn’t return all columns from joined tables. Why not all nulls like outer join? Anyway, there are reasons why this is not valid SQL. Semi- and anti- joins are filters not products. They belong to the where clause.

Like
Reply

Thank you, interesting read.

Like
Reply

To view or add a comment, sign in

Others also viewed

Explore content categories