Exists vs IN vs JOIN

Exists vs IN vs JOIN

I was wondering what is better EXISTS, IN or JOIN

Code for the demo:

We will see which operator performs best and when. Let's do some testing!

First, we create a big table and a small table, insert some random data into it and create clustered index for both tables:

CREATE TABLE BigTable (ID INT NOT NULL, 
						FirstName VARCHAR(100), 
						LastName VARCHAR(100), 
						City VARCHAR(100))
GO
-- Insert One Hundred Thousand Records
INSERT INTO BigTable (ID,FirstName,LastName,City)
SELECT TOP 100000 ROW_NUMBER() OVER (ORDER BY a.name) RowID, 
					'Bobby', 
		CASE WHEN  ROW_NUMBER() OVER (ORDER BY a.name)%2 = 1 THEN 'Howard' 
					ELSE 'Evans' END,
		CASE WHEN ROW_NUMBER() OVER (ORDER BY a.name)%10 = 1 THEN 'New York' 
			 WHEN  ROW_NUMBER() OVER (ORDER BY a.name)%10 = 5 THEN 'San Marino' 
			 WHEN  ROW_NUMBER() OVER (ORDER BY a.name)%10 = 3 THEN 'Los Angeles' 
			 WHEN ROW_NUMBER() OVER (ORDER BY a.name)%427 = 1 THEN 'Houston'
					ELSE 'New Mexico' END
FROM sys.all_objects a
CROSS JOIN sys.all_objects b
GO

-- Create Indexes 
-- Create Clustered Index
CREATE CLUSTERED INDEX IX_BigTable_ID
ON BigTable(ID)
GO

-- Create Table
CREATE TABLE SmallTable (ID INT NOT NULL, 
						FirstName VARCHAR(100), 
						LastName VARCHAR(100), 
						City VARCHAR(100))
GO
-- Insert One Hundred Thousand Records
INSERT INTO SmallTable (ID,FirstName,LastName,City)
SELECT TOP 1000 ROW_NUMBER() OVER (ORDER BY a.name) RowID, 
					'Bobby', 
		CASE WHEN  ROW_NUMBER() OVER (ORDER BY a.name)%2 = 1 THEN 'Howard' 
					ELSE 'Evans' END,
		CASE WHEN ROW_NUMBER() OVER (ORDER BY a.name)%10 = 1 THEN 'New York' 
			 WHEN  ROW_NUMBER() OVER (ORDER BY a.name)%10 = 5 THEN 'San Marino' 
			 WHEN  ROW_NUMBER() OVER (ORDER BY a.name)%10 = 3 THEN 'Los Angeles' 
			 WHEN ROW_NUMBER() OVER (ORDER BY a.name)%427 = 1 THEN 'Houston'
					ELSE 'New Mexico' END
FROM sys.all_objects a
CROSS JOIN sys.all_objects b
GO
-- Create Indexes 
-- Create Clustered Index
CREATE CLUSTERED INDEX IX_SmallTable_ID
ON SmallTable(ID)
GO

Now let’s see what is better in terms of performance:

-- IN Clause
SELECT ID, City
FROM BigTable 
WHERE ID IN (SELECT ID FROM SmallTable)
GO
-- Exists Clause
SELECT ID, City
FROM BigTable 
WHERE EXISTS (SELECT ID FROM SmallTable WHERE SmallTable.ID = BigTable.ID)
GO
-- Using JOIN
SELECT bt.ID, bt.City
FROM BigTable bt
INNER JOIN SmallTable st ON bt.ID = st.ID
GO

Looking at the execution plan we can see there is no difference!

No alt text provided for this image

What if we check if the IDs from the big table and not in the IDs of the small table?!

Here we have some interesting results:

No alt text provided for this image

We can see that NOT IN and NOT EXISTS is better than left join.

Setting statistics IO on will give us the answer to why this is happening.

SET STATISTICS IO ON

No alt text provided for this image

When using left join, a worktable is created in order to satisfy the HASH MATCH operator used in the execution plan.

But that’s not all! The eagle-eyed of you might have noticed that I have created intentionally created the ID columns with the NOT NULL constraint.

What if we change the column definition and we make the column NULLABLE?

No alt text provided for this image

WOW, That was unexpected! Let’s also look at the IO!

No alt text provided for this image

200.000 logical reads when using NOT IN! That is 400 times more IO used when you use NOT IN on nullable columns!

I will certainly be paying attention to this in the future!

Hope you enjoyed this one, let me know if this was helpful for you, and were you right?



Super Bogdan! Bravo, personal imi plac mult postarile tale de genul asta, very insightfull!

To view or add a comment, sign in

More articles by Bogdan Feodot

Others also viewed

Explore content categories