SQL for Data Projects: Key Commands You Need

SQL for Data Projects: Key Commands You Need

WSDA News | February 07, 2025

Structured Query Language (SQL) is an indispensable tool for data scientists and analysts. From retrieving data to performing complex aggregations, SQL helps unlock insights from raw data efficiently. If you're in data science, mastering a core set of SQL statements can save you time and improve the accuracy of your analysis.

In this guide, we’ll explore essential SQL queries that cover a wide range of data tasks. These queries form the foundation of most data projects, whether you're performing basic queries or working with large-scale data transformations.


1. Selecting Data (SELECT)

This is the most fundamental SQL statement, allowing you to fetch data from a table.

SELECT 
    * 
FROM 
    customers;        

What It Does: Retrieves all columns from the customers table. For more precise queries, specify the columns you want.


2. Filtering Rows (WHERE)

To narrow down your results, use the WHERE clause.

SELECT 
    name, 
    age
FROM 
    customers
WHERE 
    age > 30;        

What It Does: Filters the dataset to include only customers older than 30. You can combine multiple conditions using AND or OR.


3. Sorting Results (ORDER BY)

Organizing data by specific columns helps when reviewing large datasets.

SELECT 
    name, 
    age
FROM 
    customers
ORDER BY 
    age DESC;        

What It Does: Sorts customers by age in descending order. Use ASC for ascending order.


4. Grouping and Aggregation (GROUP BY)

You can summarize data with aggregate functions like COUNT(), SUM(), and AVG().

SELECT 
    city, 
    COUNT(*) AS customer_count
FROM 
    customers
GROUP BY 
    city;        

What It Does: Counts the number of customers in each city.


5. Joining Tables (JOIN)

Combining data from multiple tables is crucial in data science projects.

SELECT 
    customers.name, 
    orders.order_date
FROM 
    customers
JOIN 
    orders ON customers.customer_id = orders.customer_id;        

What It Does: Retrieves customer names along with their order dates by linking the customers and orders tables.


6. Handling Duplicates (DISTINCT)

Remove duplicates from your results with the DISTINCT keyword.

SELECT 
    DISTINCT city
FROM 
    customers;        

What It Does: Returns a list of unique cities where customers are located.


7. Limiting Results (LIMIT)

Use LIMIT to restrict the number of rows returned by your query.

SELECT 
    * 
FROM 
    customers
LIMIT 
    10;        

What It Does: Fetches only the first 10 rows from the customers table.


8. Calculating Averages (AVG)

You can calculate summary statistics using aggregate functions.

SELECT 
    AVG(salary) AS average_salary
FROM 
    employees;        

What It Does: Returns the average salary of all employees.


9. Finding Null Values (IS NULL)

Data cleaning often involves identifying missing values.

SELECT 
    * 
FROM 
    orders
WHERE 
    order_date IS NULL;        

What It Does: Finds all records where the order_date is missing.


10. Creating New Columns (CASE Statement)

Create conditional logic within your query using CASE.

SELECT 
    name, 
    CASE
        WHEN age < 18 THEN 'Minor'
        ELSE 'Adult'
    END AS age_group
FROM 
    customers;        

What It Does: Categorizes customers as either "Minor" or "Adult" based on their age.


11. Calculating Running Totals (Window Functions)

Window functions allow cumulative calculations across rows.

SELECT 
    order_id, 
    SUM(order_amount) OVER (ORDER BY order_date) AS running_total
FROM 
    orders;        

What It Does: Calculates a running total of order amounts based on the order date.


12. Union Queries (UNION)

Use UNION to combine the results of multiple queries.

SELECT 
    name 
FROM 
    customers_in_europe

UNION

SELECT 
    name 
FROM 
    customers_in_america;        

What It Does: Returns a combined list of customer names from both Europe and America, removing duplicates.


13. Deleting Data (DELETE)

If you need to remove specific records, use the DELETE statement carefully.

DELETE FROM 
    customers
WHERE 
    last_order_date < '2022-01-01';        

What It Does: Deletes all customers who haven’t placed an order since January 1, 2022.


Tips for Writing Efficient SQL Queries

  1. Index Usage: Use indexed columns in your WHERE and JOIN conditions to speed up query execution.
  2. **Avoid SELECT *: Be specific about the columns you need to minimize data transfer and improve performance.
  3. Optimize Joins: When joining large tables, ensure appropriate filtering to avoid excessive row combinations.
  4. Use Comments: Document your queries with comments to explain complex logic.


Conclusion

SQL is a powerful tool for data professionals, offering a wide array of commands to manipulate and analyze data. Mastering these core queries will enhance your productivity and help you tackle diverse data science challenges with confidence. Remember, consistent practice is key to becoming proficient.

Data No Doubt! Check out WSDALearning.ai and start learning Data Analytics and Data Science Today!

To view or add a comment, sign in

More articles by Walter Shields

Others also viewed

Explore content categories