Find elements in a database table that also exist in another table
There are different ways to frame this question. In terms of set theory, we are interested in the intersection of 2 sets, i.e., find all members of a set that are also members of another set. In a relational database, this becomes "find all the keys in one table that are also present in another table." There are many business use cases for this kind of question. Two examples are:
Below are four ways to do this in SQL with the associated commentary on why to choose one over the other.
For the company that provides online education, we'll start with 2 tables: a customers table and a customers_courses table. The customers_courses table contains 1 row for each customer who signs up for a particular course. So, if someone signs up for 3 courses, there will be three records in the table and if someone has registered on the website and not signed up for any course, there will be zero records in the customers_courses table.
The customers table schema:
CREATE TABLE `customers` (
`ID` int NOT NULL AUTO_INCREMENT,
`fName` varchar(45) DEFAULT NULL,
`lName` varchar(45) DEFAULT NULL,
`UserID` int NOT NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `UserID_UNIQUE` (`UserID`)
)
And, the customers_courses schema:
CREATE TABLE `customers_courses` (
`ID` int NOT NULL AUTO_INCREMENT,
`Customer_ID` int NOT NULL,
`Course_ID` int NOT NULL,
PRIMARY KEY (`ID`),
KEY `customers_Customer_ID` (`Customer_ID`),
CONSTRAINT `Customer_ID` FOREIGN KEY (`Customer_ID`) REFERENCES `customers` (`ID`)
)
Next, we look at the four ways to find all customers who have signed up for at least one course.
1) Clearly, we need a JOIN in the SQL query since customer information is in one table and the association with courses is in another. Most SQL developers I've worked with instinctively use a LEFT JOIN. If we do that, we will have a query that has a left join and then a WHERE clause to filter out customers who haven't signed up for any course. Also, without the DISTINCT, we would get 1 record for each customer-course combination whereas we want only one reference to a customer. The use of DISTINCT gives us the desired result but is inefficient since the query gets all records and then removes duplicates!
Recommended by LinkedIn
select distinct customers.userid, customers.fName
from dbo.customers
left join dbo.customers_courses
on customers_courses.Customer_ID = customers.ID
where customers_courses.id is not null
;
For those interested in learning how the SQL engine will execute the query, run it after adding EXPLAIN ANALYZE before the SELECT. EXPLAIN works for all SQL implementations and ANALYZE is a sub-variant for MySQL; other implementations have their own version of EXPLAIN - it provides more details about the execution plan.
2) We can improve on the LEFT JOIN with the WHERE by using a JOIN - but we still need the inefficient DISTINCT.
select distinct customers.userid, customers.fName
from dbo.customers
join dbo.customers_courses
on customers_courses.Customer_ID = customers.ID
;
3) It's more effective to let the SQL engine remove the duplicates for us (rather than asking for all the result rows and then distilling with DISTINCT). The IN clause is one option.
select customers.userid, customers.fName
from dbo.customers
where customers.id in (
select customer_id from dbo.customers_courses)
;
My concern with IN is that it limits us to an equality test on one field - and one field only.
4) The last option is the WHERE EXISTS clause. I like it the best because
select
customers.userid, customers.fName
from dbo.customers
where exists (
select * from dbo.customers_courses
where customers_courses.customer_id = customers.id
)
;
Thanks for sharing your insights and tips!