SQL + Pandas: The Ultimate Data Duo
Who knew SQL and Pandas could team up to make data analysis this fast, fun, and a little addictive?
pandasql lets you run SQL queries directly on your Pandas DataFrames, so you get the power of SQL without leaving Python!
SQL is useful for easily filtering rows, aggregating data, or applying multi-condition logic. Python, on the other hand, offers advanced tools for statistical analysis and custom computations that extend beyond SQL’s capabilities. Used together, SQL simplifies data selection, while Python adds analytical flexibility and versatility.
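To make that division of labor concrete, here is a minimal pandas-only sketch (toy data and column names invented for illustration): the multi-condition boolean mask mirrors a SQL WHERE clause, while the z-score is the kind of custom statistic that is easier to express on the Python side.

```python
import pandas as pd

# Toy data, illustrative only
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy", "Dee"],
    "Age": [25, 34, 41, 29],
    "Score": [80, 92, 75, 88],
})

# SQL:    SELECT * FROM df WHERE Age > 30 AND Score >= 80
# pandas: the same logic as a multi-condition boolean mask
selected = df[(df["Age"] > 30) & (df["Score"] >= 80)]

# A custom computation that is clumsy in plain SQL: z-scores
df["Score_z"] = (df["Score"] - df["Score"].mean()) / df["Score"].std()

print(selected["Name"].tolist())  # -> ['Ben']
```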
1. How to Use pandasql in Google Colab?
To run pandasql inside Colab, start with the following steps:
Step 1: Install pandasql
Run this in a Colab cell:
!pip install pandasql
Step 2: Import the necessary libraries
import pandas as pd
from pandasql import sqldf

# A small example DataFrame so the query below has something to run on
df = pd.DataFrame({"Name": ["Ana", "Ben", "Cara"], "Age": [25, 34, 41]})

# Create a helper function for running queries
pysqldf = lambda q: sqldf(q, globals())

# Example: select all rows where Age > 30
query = "SELECT * FROM df WHERE Age > 30;"
result = pysqldf(query)
result
2. Real-World Project: Analyzing Uber Driver Performance Data
In this project, we analyze Uber driver performance data to evaluate and optimise bonus strategies, aiming to improve efficiency and driver engagement.
// Data Exploration and Analytics
Now, let’s explore the datasets. First, we will load the data.
// Initial Dataset Loading
Let’s load the dataset using plain pandas.
import pandas as pd
import numpy as np
df = pd.read_csv('dataset_2.csv')
// Exploring the Data
Now let’s review the dataset.
df.head(10)
The dataset includes columns such as driver name, number of trips completed, acceptance rate (percentage of trip requests accepted), total supply hours (hours spent online), and average rating. Before we dive into the analysis, it’s important to verify the column names, check the data types of each column, and identify any missing values. This ensures that we reference the data correctly and handle any data quality issues upfront, making our analysis accurate and reliable.
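These checks can be sketched in a few lines of pandas (toy rows and column names assumed from the description above; in the project itself they run on the DataFrame loaded from dataset_2.csv):

```python
import pandas as pd

# Toy stand-in for the Uber driver dataset (column names assumed)
df = pd.DataFrame({
    "Name": ["Abdul", "Abraham", "Adelina"],
    "Trips_Completed": [1, 12, 1],
    "Accept_Rate": [100, 83, 100],
    "Supply_Hours": [3, 5, 2],
    "Rating": [4.8, 4.7, 4.7],
})

print(df.columns.tolist())   # verify column names
print(df.dtypes)             # check the data type of each column
print(df.isnull().sum())     # count missing values per column
```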
The dataset has five different columns, and there are no missing values. I have used both SQL and Python to answer the questions.
Question 1: Who Qualifies for Bonus Option 1?
$50 for each driver who is online at least 8 hours, accepts 90% of requests, completes 10 trips, and has a rating of 4.7 or better during the time frame.
// Step 1: Filtering the Qualifying Drivers with SQL (pandasql)
In this step, I'm using pandasql.
In the following code, I select all drivers who meet the conditions for the Option 1 bonus using the WHERE clause, with the AND operator linking the multiple conditions.
import pandas as pd
from pandasql import sqldf

# Who qualifies for the Option 1 bonus
# (column names assumed to match the dataset)
query = """
SELECT Name
FROM df
WHERE Supply_Hours >= 8
  AND Accept_Rate >= 90
  AND Trips_Completed >= 10
  AND Rating >= 4.7
"""

# Run the query
result = sqldf(query, locals())
print(result)
// Step 2: Finishing in Pandas
Using the following Python code, I have calculated the total payout by multiplying the number of qualified drivers (using len()) by the $50 bonus per driver.
pay_out = 50 * len(result)
print(f"The total payout is ${pay_out:,}")
The total payout is $800
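As a cross-check, the same qualification filter and payout can be written entirely in pandas with a boolean mask (toy rows and column names assumed for illustration):

```python
import pandas as pd

# Toy data with the assumed column names
df = pd.DataFrame({
    "Name": ["Abdul", "Abraham", "Adelina"],
    "Supply_Hours": [9, 10, 3],
    "Accept_Rate": [95, 88, 100],
    "Trips_Completed": [12, 15, 4],
    "Rating": [4.8, 4.9, 4.6],
})

# Bonus Option 1 conditions as a boolean mask
qualified = df[
    (df["Supply_Hours"] >= 8)
    & (df["Accept_Rate"] >= 90)
    & (df["Trips_Completed"] >= 10)
    & (df["Rating"] >= 4.7)
]

pay_out = 50 * len(qualified)
print(f"Payout: ${pay_out:,}")  # -> Payout: $50 (only Abdul qualifies here)
```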
Here, I’ve presented Q1 as an example. For the complete set of questions and detailed analysis, check out the full file.