SQL Data Scientist Interview (Part 1)
SQL is the the most fundamental skill for a data analyst/data scientist-analytics/insights/product (whatever you want to call it). It is how you talk to the data. Not surprisingly, it is also an essential part of the interview process. Actually, it is a part of the tech screen and without that it is not possible to get to an onsite, where there is an additional SQL round.
As someone who used to to give interviews for fun, there were a couple of interesting SQL questions I got wrong and they exposed me to some recurring concepts. I was able to use these failures not only for my future interviews but I think it also helped me in my job.
I feel a lot of people currently interviewing would also benefit from these concepts as they are tested in almost every advanced SQL interview.
The first concept is a SELF JOIN and I will explain with an example:
The way to approach it would be to break it down and bring everything to user level.
First find users who bought coffee:
SELECT DISTINCT user_id
FROM T
WHERE LOWER(category)='coffee'
This will leave us with this dataset:
Then find users who bought Starbucks:
SELECT DISTINCT user_id
FROM T
WHERE LOWER(brand)='starbucks'
This will leave us with this dataset:
Combine these 2 pieces of information with a JOIN on user_id, essentially a join to self aka SELF JOIN:
WITH coffee_buyers AS
(
SELECT DISTINCT user_id
FROM T
WHERE LOWER(category)='coffee'
),
starbucks_buyers AS
(
SELECT DISTINCT user_id
FROM T
WHERE LOWER(brand)='starbucks'
)
SELECT DISTINCT coffee_buyers.user_id
FROM coffee_buyers
LEFT JOIN starbucks_buyers
ON coffee_buyers.user_id=starbucks_buyers.user_id
This will leave us with this dataset:
Since we want users who bought coffee but did NOT buy Starbucks, we can filter for users with NULL for starbucks_buyers.user_id. That can be done by adding this line to code above:
WHERE starbucks_buyers.user_id IS NULL
That brings us to the end of this question.
There are other variations of this question one can come across:
In the next part we will look at lead/lag and window functions.
Interesting problem and neat solution. Thank you for posting this.