Mastering SQL Window Functions for Data Engineering and Analytics

🚀 Mastering Window Functions in SQL (Quick Notes)

If you're preparing for Data Engineering or Analytics roles, window functions are a must-know concept. Here's a quick breakdown 👇

🔹 What are Window Functions?
Window functions perform calculations across a set of rows related to the current row without collapsing the result (unlike GROUP BY).
-----------------------------------------------------------------
🔹 Key Components
• PARTITION BY → divides data into groups
• ORDER BY → defines row sequence within each partition
• OVER() → defines the window scope
-----------------------------------------------------------------
🔹 Most Common Window Functions
1️⃣ ROW_NUMBER(): assigns a unique sequential number to each row. Example: find top records per group
2️⃣ RANK(): same rank for duplicates, skips the next rank. Example: ranking with gaps
3️⃣ DENSE_RANK(): same rank for duplicates, no gaps. Example: continuous ranking
4️⃣ LAG() / LEAD(): access previous/next row values. Example: compare current vs previous data
5️⃣ SUM() / AVG() OVER(): running totals and moving averages
-----------------------------------------------------------------
🔹 Examples
👉 Rank salaries within each department (ties share a rank, and the next rank is skipped):
SELECT emp_name, dept, salary, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank FROM employees;
👉 Get the previous row's value:
SELECT emp_name, salary, LAG(salary, 1) OVER (ORDER BY emp_id) AS prev_salary FROM employees;
-----------------------------------------------------------------
🔹 Why Window Functions?
• Avoid complex subqueries
• Perform row-level + aggregated analysis together
• Essential for real-world ETL & analytics
-----------------------------------------------------------------
🔹 Real Use Cases
✔ Top N per group
✔ Running totals
✔ Duplicate detection
✔ Time-based comparisons
-----------------------------------------------------------------
💡 Pro Tip: Always clarify the PARTITION BY and ORDER BY logic first. That's where most mistakes happen.

#SQL #DataEngineering #Analytics #WindowFunctions #LearningJourney
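To see how the ranking functions above differ on tied values, here is a minimal sketch that runs them side by side. It assumes Python's built-in sqlite3 module with SQLite 3.25 or newer (needed for window functions); the sample rows are made up for illustration, and only the table and column names come from the post.

```python
import sqlite3

# Hypothetical sample data; only the table/column names come from the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (emp_id INTEGER, emp_name TEXT, dept TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "Ana", "Eng", 120),
        (2, "Ben", "Eng", 120),   # ties with Ana on salary
        (3, "Cara", "Eng", 100),
        (4, "Dev", "Ops", 90),
    ],
)

ranking_rows = conn.execute(
    """
    SELECT emp_name, dept, salary,
           -- emp_id breaks ties so ROW_NUMBER is deterministic
           ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC, emp_id) AS row_num,
           RANK()       OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank,
           DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dense_salary_rank,
           LAG(salary, 1) OVER (ORDER BY emp_id) AS prev_salary
    FROM employees
    ORDER BY emp_id
    """
).fetchall()

for row in ranking_rows:
    print(row)
```

In this data Ana and Ben tie, so Cara gets RANK 3 but DENSE_RANK 2, while ROW_NUMBER simply counts 1, 2, 3 within the department.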


More Relevant Posts

Moe Ahad
1w
Window Functions: the SQL skill that separates basic queries from real analysis

A lot of SQL users hit a wall when queries start getting complex. They rely on:
• Self joins
• Nested subqueries
• Overuse of GROUP BY

It works, but it gets messy fast. 👉 That's where window functions come in.

What makes window functions different?
They let you perform calculations across a set of related rows without collapsing the data. That's the game changer.
• GROUP BY → aggregates and reduces rows
• Window functions → keep every row + add context

The core concept (this is what unlocks everything):
window_function([argument]) OVER ( [PARTITION BY column_list] [ORDER BY column_list] [ROWS | RANGE frame_clause] )
• PARTITION BY → defines the group (like a "virtual GROUP BY")
• ORDER BY → defines the sequence within that group
• The function runs across that "window" of data

Where this becomes powerful in real work:

🔥 Ranking within groups: who are the top customers per region?
• ROW_NUMBER() → unique ranking
• RANK() → handles ties (with gaps)
• DENSE_RANK() → handles ties (no gaps)

📈 Running totals (cumulative metrics):
• Track revenue growth over time without collapsing your dataset
• Perfect for dashboards (Power BI/Tableau)

🔁 Comparing rows (this replaces a LOT of joins): look at previous or next values
• LAG() → previous row
• LEAD() → next row
Example use cases:
• Month-over-month growth
• Detecting spikes/drops
• Customer behavior changes

📊 Advanced analytics without extra tables. You can:
• Calculate % of total per group
• Create moving averages
• Segment users dynamically
All in a single query.

Why this matters (especially in BI roles):
If you're building dashboards or answering stakeholder questions, you're constantly asked:
• "What changed compared to last period?"
• "Who are the top performers in each category?"
• "What's the trend over time?"
👉 Window functions answer these directly, without overcomplicating your SQL.

My rule of thumb:
• Comparing rows? → use LAG / LEAD
• Ranking? → use ROW_NUMBER / RANK
• Trends over time? → use running totals

Start simple, then layer complexity. Once you get comfortable with window functions, your SQL becomes:
• Cleaner
• More scalable
• Way easier to explain to stakeholders

Curious: what's the first window function that "clicked" for you?
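The "% of total per group" and "moving averages" ideas mentioned above can be sketched in one query. This is an illustrative example only: the monthly_revenue table, its columns, and its rows are hypothetical, and it assumes Python's sqlite3 module backed by SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical table and data, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_revenue (region TEXT, month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO monthly_revenue VALUES (?, ?, ?)",
    [
        ("East", "2024-01", 100.0),
        ("East", "2024-02", 200.0),
        ("East", "2024-03", 300.0),
        ("West", "2024-01", 50.0),
        ("West", "2024-02", 150.0),
    ],
)

revenue_rows = conn.execute(
    """
    SELECT region, month, revenue,
           -- no ORDER BY in the window: SUM covers the whole partition
           100.0 * revenue / SUM(revenue) OVER (PARTITION BY region) AS pct_of_region,
           -- frame clause: average of the current and the previous month
           AVG(revenue) OVER (
               PARTITION BY region
               ORDER BY month
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS moving_avg_2m
    FROM monthly_revenue
    ORDER BY region, month
    """
).fetchall()

for row in revenue_rows:
    print(row)
```

Leaving ORDER BY out of the first window is deliberate: it makes the SUM a per-region grand total rather than a running total, which is what a share-of-total calculation needs.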
Anudeep KADAVARTHI
3w
🚀 Day 26/50: The SQL "Self-Reflection": Mastering the SELF JOIN 🔄

Have you ever looked at a table and realized the answer you need is hidden in another row of that same table? In most cases, we join different tables (like Customers and Orders). But sometimes, a table needs to talk to itself. This is called a SELF JOIN.

🔎 What is a SELF JOIN?
A self join is a regular join in which a table is joined with itself. It is extremely useful for querying hierarchical data or comparing rows within the same dataset. To do this, you must use table aliases: since you are referencing the same table twice, SQL needs a way to distinguish the "first copy" from the "second copy."

If you're looking to build a strong foundation in SQL, I highly recommend checking this out: https://lnkd.in/eHTv-Qz8

📊 Real-World Scenario: The Manager Hierarchy
Imagine an employees table where every worker has a manager_id. That manager_id refers back to the employee_id of someone else in the same table!

Table: employees
| employee_id | name | manager_id |
| :--- | :--- | :--- |
| 1 | Alice (CEO) | NULL |
| 2 | Bob | 1 |
| 3 | Charlie | 1 |
| 4 | David | 2 |

The Goal: Show each employee's name next to their manager's name.

The Query:
SELECT e.name AS Employee, m.name AS Manager FROM employees e LEFT JOIN employees m ON e.manager_id = m.employee_id;

The Result:
| Employee | Manager |
| :--- | :--- |
| Alice (CEO) | NULL |
| Bob | Alice (CEO) |
| Charlie | Alice (CEO) |
| David | Bob |

Because this is a LEFT JOIN, Alice still appears even though she has no manager; an INNER JOIN would drop her row.

🧠 Why Self Joins are a "Senior-Level" Skill
Understanding how a table can reference itself is a hallmark of an advanced #DataAnalyst or #DataEngineer.
✅ Organizational Charts: Mapping who reports to whom in a company.
✅ Comparative Analysis: Comparing a product's price this month to its price last month in a single transaction table.
✅ Networking: Finding "friends of friends" in social media datasets.

🎯 Day 26 Mini Challenge
Let's map a hierarchy!
🧑‍💻 The Scenario: You have a table called categories with the columns category_id | category_name | parent_category_id
The Task: Write a SQL query to display each category_name next to its parent category's name by joining the table to itself.
Drop your query in the comments! 👇

📚 Hands-on Lab
Practice "talking to yourself" (in SQL) live: 👉 sql-practice.com

Follow along for Day 27: SET Operators (UNION and UNION ALL) 🏗️

#SQL #SelfJoin #DataAnalytics #DataScience #LearningInPublic #DataEngineering #BusinessIntelligence #anudeepdatajourney #TechCareers #DatabaseDesign #CareerGrowth #TorontoTech #Day26 #SQLTips
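The manager-hierarchy query from this post can be checked end to end with a small script. This is a sketch assuming Python's built-in sqlite3 module; the rows mirror the post's employees table, and the INNER JOIN variant is added here only for comparison.

```python
import sqlite3

# Rows mirror the employees table shown in the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Alice (CEO)", None), (2, "Bob", 1), (3, "Charlie", 1), (4, "David", 2)],
)

# LEFT JOIN keeps employees whose manager_id is NULL.
left_rows = conn.execute(
    """
    SELECT e.name AS employee, m.name AS manager
    FROM employees e
    LEFT JOIN employees m ON e.manager_id = m.employee_id
    ORDER BY e.employee_id
    """
).fetchall()

# INNER JOIN (added for comparison) drops the row without a manager.
inner_rows = conn.execute(
    """
    SELECT e.name AS employee, m.name AS manager
    FROM employees e
    JOIN employees m ON e.manager_id = m.employee_id
    ORDER BY e.employee_id
    """
).fetchall()

print(left_rows)
print(inner_rows)
```

The only difference between the two results is the CEO's row, which is why the choice of join type matters whenever the top of a hierarchy has no parent.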
Bogdan Topalov
2w
In today's SQL lesson, we answer: "Is this user becoming more or less active?"

You have monthly session data for every user. You want to know: Is Ana growing? Is Ben churning? A query that looks at one row at a time can't answer this, because you need context across rows. That's exactly what window functions are built for.

Breaking down the pieces of a window query:
• OVER() is what makes it a window function. Without it, SUM collapses rows like GROUP BY. With it (and an ORDER BY inside), SUM computes a running total and every single row survives in the output. That's the core idea.
• LAG(sessions) pulls the value from the previous row inside the window. For February, it returns January's sessions. That's your month-over-month comparison in one column, no self-join needed.
• PARTITION BY user_id resets the window for each user. Without it, Ben's last row would bleed into Ana's first and your LAG values become noise.
• ORDER BY month sets the row sequence inside each window. Without order, "previous row" has no definition. For any ranking or time-based function, this is required.
• LEAD() is the forward-looking twin of LAG. It looks at the next row's value. Useful for seeing what a user does after a specific event.

You can't get both the monthly breakdown and the running total from a GROUP BY alone. Window functions can give you both in the same query, on the same row.

Where does this show up in real work? Every question about change over time is a window function waiting to happen.
• Is revenue growing month over month? → LAG on monthly totals
• Who are the top 3 customers per region? → RANK + PARTITION BY region
• What's the 7-day rolling average of signups? → AVG OVER with a frame clause
• Which users dropped off after their first week? → LEAD on session dates

And that's how an entire category of business questions becomes answerable in a single query.

🔖 Save this if you work with data.
✅ Follow me for more practical SQL, data engineering tips and automation breakdowns for teams that run on data.
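A query along the lines this post describes might look like the sketch below. The monthly_sessions table, its column names, and the sample rows are assumptions made for illustration, not the author's actual query; it runs on Python's sqlite3 module with SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical schema and rows; the post only names user_id, month, and sessions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sessions (user_id TEXT, month TEXT, sessions INTEGER)")
conn.executemany(
    "INSERT INTO monthly_sessions VALUES (?, ?, ?)",
    [
        ("ana", "2024-01", 10),
        ("ana", "2024-02", 14),
        ("ana", "2024-03", 18),
        ("ben", "2024-01", 12),
        ("ben", "2024-02", 7),
    ],
)

activity_rows = conn.execute(
    """
    SELECT user_id, month, sessions,
           SUM(sessions)  OVER w AS running_total,
           LAG(sessions)  OVER w AS prev_sessions,
           sessions - LAG(sessions) OVER w AS mom_change,
           LEAD(sessions) OVER w AS next_sessions
    FROM monthly_sessions
    -- one named window, reused by every function above
    WINDOW w AS (PARTITION BY user_id ORDER BY month)
    ORDER BY user_id, month
    """
).fetchall()

for row in activity_rows:
    print(row)
```

A positive mom_change on every one of Ana's rows suggests growth, while Ben's negative value flags a drop. The first month of each user has no previous row, so LAG returns NULL there instead of leaking a value from another user.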
Harshini Ravi
1w
Day 12/365: Mastering SQL by Understanding SQL JOINs, a Must-Know for Data Professionals

SQL JOINs allow you to combine data from multiple tables and uncover meaningful insights. Here's a simple breakdown:

INNER JOIN
Returns only the matching rows from both tables.
Example: SELECT c.customer_name, o.order_id FROM customers c INNER JOIN orders o ON c.customer_id = o.customer_id;

LEFT JOIN (LEFT OUTER JOIN)
Returns all rows from the left table + matching rows from the right. (If no match is found, you'll see NULLs.)
Example: SELECT c.customer_name, o.order_id FROM customers c LEFT JOIN orders o ON c.customer_id = o.customer_id;
Shows all customers, even those without orders.

RIGHT JOIN (RIGHT OUTER JOIN)
Returns all rows from the right table + matching rows from the left.
Example: SELECT c.customer_name, o.order_id FROM customers c RIGHT JOIN orders o ON c.customer_id = o.customer_id;
Shows all orders, even if customer details are missing.

FULL JOIN (FULL OUTER JOIN)
Returns all rows from both tables, whether there's a match or not.
Example: SELECT c.customer_name, o.order_id FROM customers c FULL JOIN orders o ON c.customer_id = o.customer_id;

CROSS JOIN
Returns all possible combinations of rows between two tables.
Example: SELECT p.product_name, c.category_name FROM products p CROSS JOIN categories c;
Useful when generating combinations.

SELF JOIN
Joins a table with itself. Useful for hierarchical data (like employee-manager relationships).
Example: SELECT e.employee_name, m.employee_name AS manager FROM employees e JOIN employees m ON e.manager_id = m.employee_id;
Note that this inner join omits employees with no manager; use LEFT JOIN to keep them.

Why does this matter in the real world?
Think about analyzing customer orders, tracking user activity, or building dashboards: JOINs help you bring scattered data together into one clear picture.

#SQL #DataAnalytics #DataScience #Learning #TechCareers #SQLjoin
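The difference between INNER JOIN and LEFT JOIN described above is easiest to see on a few rows. The sketch below uses Python's sqlite3 module with made-up customers and orders data; RIGHT and FULL JOIN are left out because older SQLite builds do not support them.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the post's examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, customer_name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben"), (3, "Cara")])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(10, 1), (11, 1), (12, 2)])

# Only customers that have at least one matching order.
inner_join_rows = conn.execute(
    """
    SELECT c.customer_name, o.order_id
    FROM customers c
    INNER JOIN orders o ON c.customer_id = o.customer_id
    ORDER BY c.customer_id, o.order_id
    """
).fetchall()

# Every customer; order_id is NULL (None in Python) when there is no match.
left_join_rows = conn.execute(
    """
    SELECT c.customer_name, o.order_id
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    ORDER BY c.customer_id, o.order_id
    """
).fetchall()

print(inner_join_rows)
print(left_join_rows)
```

Cara has no orders, so she appears only in the LEFT JOIN result, paired with None.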
Venkatesh Gunasekaran
1mo
💬 SQL Challenge of the Day

Problem: You have a table named "orders" that contains order information including order_id, customer_id, order_date, and order_amount. Write a SQL query to calculate the running total of order_amount for each customer, ordered by order_date.

Query:
```sql
SELECT
    order_id,
    customer_id,
    order_date,
    order_amount,
    SUM(order_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total
FROM orders;
```

Answer: The query calculates the running total of order_amount for each customer, accumulated in order_date order within each customer_id group.

Explanation: The query uses a window function with the PARTITION BY clause to partition the data by customer_id, then applies SUM() over the rows of each partition ordered by order_date to produce the running total.

Example: Consider the "orders" table:

| order_id | customer_id | order_date | order_amount |
|----------|-------------|------------|--------------|
| 1 | 101 | 2022-01-01 | 100 |
| 2 | 101 | 2022-01-03 | 150 |
| 3 | 102 | 2022-01-02 | 200 |

The query will return:

| order_id | customer_id | order_date | order_amount | running_total |
|----------|-------------|------------|--------------|---------------|
| 1 | 101 | 2022-01-01 | 100 | 100 |
| 2 | 101 | 2022-01-03 | 150 | 250 |
| 3 | 102 | 2022-01-02 | 200 | 200 |

#PowerBIChallenge #PowerInterview #LearnPowerBi #LearnSQL #TechJobs #DataAnalytics #DataScience #BigData #DataAnalyst #MachineLearning #Python #SQL #Tableau #DataVisualization #DataEngineering #ArtificialIntelligence #CloudComputing #BusinessIntelligence #Data
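One subtlety of the running total above is what happens when two orders share the same order_date. With only ORDER BY in the window, the default frame is RANGE-based, so tied rows are summed together. The sketch below shows this behavior in SQLite 3.25 or newer through Python's sqlite3 module; the extra tied order (order_id 3) is a hypothetical row added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_date TEXT, order_amount INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 101, "2022-01-01", 100),
        (2, 101, "2022-01-03", 150),
        (3, 101, "2022-01-03", 50),   # hypothetical: same date as order 2
        (4, 102, "2022-01-02", 200),
    ],
)

frame_rows = conn.execute(
    """
    SELECT order_id,
           -- default frame: rows with the same order_date are peers and summed together
           SUM(order_amount) OVER (
               PARTITION BY customer_id ORDER BY order_date
           ) AS range_total,
           -- explicit ROWS frame plus a tie-breaker: strictly row-by-row accumulation
           SUM(order_amount) OVER (
               PARTITION BY customer_id ORDER BY order_date, order_id
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS rows_total
    FROM orders
    ORDER BY order_id
    """
).fetchall()

for row in frame_rows:
    print(row)
```

Orders 2 and 3 both show 300 under the default frame, while the ROWS version steps through 250 and then 300. If dates can repeat, an explicit frame with a unique tie-breaker column is the safer choice.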
Venkatesh Gunasekaran
1w
💬 SQL Challenge of the Day

Problem: You have a table named "sales_data" containing information about sales transactions. Each row represents a single transaction with columns: transaction_id, product_id, sale_amount, and transaction_date. Write a SQL query to calculate the cumulative sum of sale_amount for each product_id, ordered by transaction_date, resetting the sum when encountering a new product_id.

Query:
```sql
SELECT
    transaction_id,
    product_id,
    sale_amount,
    SUM(sale_amount) OVER (PARTITION BY product_id ORDER BY transaction_date) AS cumulative_sum
FROM sales_data;
```

Answer: The query calculates the cumulative sum of sale_amount for each product_id, accumulating in transaction_date order and restarting whenever the product_id changes.

Explanation: The query uses a window function with the PARTITION BY clause to calculate the cumulative sum of sale_amount for each product_id. The sum restarts for each new product_id because every partition is processed independently. The ORDER BY inside OVER() makes the sum accumulate in chronological order within each product; add an outer ORDER BY if the output rows themselves must come back sorted.

Example: Assume the "sales_data" table has the following data:

| transaction_id | product_id | sale_amount | transaction_date |
|----------------|------------|-------------|------------------|
| 1 | A | 100 | 2022-01-01 |
| 2 | A | 150 | 2022-01-03 |
| 3 | B | 200 | 2022-01-02 |
| 4 | A | 120 | 2022-01-05 |

The query will output:

| transaction_id | product_id | sale_amount | cumulative_sum |
|----------------|------------|-------------|----------------|
| 1 | A | 100 | 100 |
| 2 | A | 150 | 250 |
| 4 | A | 120 | 370 |
| 3 | B | 200 | 200 |

#PowerBIChallenge #PowerInterview #LearnPowerBi #LearnSQL #TechJobs #DataAnalytics #DataScience #BigData #DataAnalyst #MachineLearning #Python #SQL #Tableau #DataVisualization #DataEngineering #ArtificialIntelligence #CloudComputing #BusinessIntelligence #Data
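The cumulative sum in this challenge can be verified by loading the post's four sample rows into an in-memory database. This sketch assumes Python's sqlite3 module with SQLite 3.25 or newer; the outer ORDER BY is an addition made here so the output order is deterministic.

```python
import sqlite3

# The four sample rows from the post's sales_data table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_data (transaction_id INTEGER, product_id TEXT, sale_amount INTEGER, transaction_date TEXT)"
)
conn.executemany(
    "INSERT INTO sales_data VALUES (?, ?, ?, ?)",
    [
        (1, "A", 100, "2022-01-01"),
        (2, "A", 150, "2022-01-03"),
        (3, "B", 200, "2022-01-02"),
        (4, "A", 120, "2022-01-05"),
    ],
)

cumulative_rows = conn.execute(
    """
    SELECT transaction_id, product_id, sale_amount,
           SUM(sale_amount) OVER (
               PARTITION BY product_id ORDER BY transaction_date
           ) AS cumulative_sum
    FROM sales_data
    -- added here: without an outer ORDER BY the row order is not guaranteed
    ORDER BY product_id, transaction_date
    """
).fetchall()

for row in cumulative_rows:
    print(row)
```

Product A accumulates 100, then 250, then 370 across its three transactions, and product B starts again from its own first sale at 200.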
Venkatesh Gunasekaran
3w
💬 SQL Challenge of the Day

Problem: Given a table "orders" with columns order_id, customer_id, order_date, and total_amount, write a SQL query to calculate the running total of total_amount for each customer_id ordered by order_date.

Query:
```sql
SELECT
    order_id,
    customer_id,
    order_date,
    total_amount,
    SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total
FROM orders;
```

Answer: The query calculates the running total of total_amount for each customer_id, accumulated in order_date order.

Explanation: The SUM function with the OVER clause calculates the running total for each customer_id. The PARTITION BY clause partitions the data by customer_id, and the ORDER BY clause inside OVER determines the order in which the running total accumulates.

Example: Consider the "orders" table:

| order_id | customer_id | order_date | total_amount |
|----------|-------------|------------|--------------|
| 1 | 101 | 2021-01-01 | 50 |
| 2 | 101 | 2021-01-03 | 30 |
| 3 | 102 | 2021-01-02 | 40 |

The query will output:

| order_id | customer_id | order_date | total_amount | running_total |
|----------|-------------|------------|--------------|---------------|
| 1 | 101 | 2021-01-01 | 50 | 50 |
| 2 | 101 | 2021-01-03 | 30 | 80 |
| 3 | 102 | 2021-01-02 | 40 | 40 |

#PowerBIChallenge #PowerInterview #LearnPowerBi #LearnSQL #TechJobs #DataAnalytics #DataScience #BigData #DataAnalyst #MachineLearning #Python #SQL #Tableau #DataVisualization #DataEngineering #ArtificialIntelligence #CloudComputing #BusinessIntelligence #Data
Venkatesh Gunasekaran
2w
💬 SQL Challenge of the Day

Problem: You are given a table named "sales_data" with the following columns:
- date (date of the sale)
- product_id (unique identifier for the product)
- revenue (amount of revenue generated from the sale)

Write a SQL query to calculate the cumulative revenue for each product starting from the first sale date to the current sale date.

Query:
```sql
SELECT
    date,
    product_id,
    revenue,
    SUM(revenue) OVER (PARTITION BY product_id ORDER BY date) AS cumulative_revenue
FROM sales_data;
```

Answer: The query calculates the cumulative revenue for each product from the first sale date to the current sale date using a window function.

Explanation: The query uses the SUM() function along with the OVER() clause to calculate the cumulative revenue for each product. The PARTITION BY clause partitions the data by product_id, and the ORDER BY clause orders the rows of each partition by date. This ensures that the cumulative revenue is calculated in chronological order for each product.

Example: Consider the "sales_data" table:

| date | product_id | revenue |
|------------|------------|---------|
| 2021-01-01 | A | 100 |
| 2021-01-02 | A | 150 |
| 2021-01-01 | B | 200 |
| 2021-01-03 | A | 120 |
| 2021-01-02 | B | 180 |

The query will output:

| date | product_id | revenue | cumulative_revenue |
|------------|------------|---------|--------------------|
| 2021-01-01 | A | 100 | 100 |
| 2021-01-02 | A | 150 | 250 |
| 2021-01-03 | A | 120 | 370 |
| 2021-01-01 | B | 200 | 200 |
| 2021-01-02 | B | 180 | 380 |

#PowerBIChallenge #PowerInterview #LearnPowerBi #LearnSQL #TechJobs #DataAnalytics #DataScience #BigData #DataAnalyst #MachineLearning #Python #SQL #Tableau #DataVisualization #DataEngineering #ArtificialIntelligence #CloudComputing #BusinessIntelligence #Data
Imoleayo Sani
1w
What is SQL?
SQL (Structured Query Language) is used to communicate with databases. It helps you store, retrieve, filter, update, and analyze data. It is used by Data Analysts, Data Scientists, Developers, and Businesses.

Basic Database Terms
· Database: a collection of tables
· Table: stores data in rows and columns
· Row: a single record
· Column: a field/category
· Query: a SQL command

1. View Data: SELECT
SELECT * FROM SalesData;
SELECT CustomerName, Sales_Value FROM SalesData;

2. Filter Data: WHERE
SELECT * FROM SalesData WHERE Sales_Value > 5000;

3. Sort Data: ORDER BY
SELECT * FROM SalesData ORDER BY Sales_Value DESC;

4. Count / Sum Data
SELECT COUNT(*) FROM SalesData;
SELECT SUM(Sales_Value) FROM SalesData;

5. Group Data: GROUP BY
SELECT Product, SUM(Sales_Value) AS TotalSales FROM SalesData GROUP BY Product;

6. Update Data
UPDATE SalesData SET Sales_Value = 7000 WHERE ID = 1;

7. Rename Column (SQL Server syntax)
EXEC sp_rename 'SalesData.[Sales Value]', 'Sales_Value', 'COLUMN';

8. SQL Clause Order (as written in a query)
SELECT → FROM → WHERE → GROUP BY → ORDER BY

These keywords are tools to practice with regularly to build mastery.

@TechCrush.pro #RisewithTechCrush #Tech4Africans #LearningwithTechCrush
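Several of the basics listed above (WHERE, GROUP BY, ORDER BY, UPDATE) can be combined and tried out in a few lines. This is a small sketch using Python's built-in sqlite3 module; the SalesData rows are invented for illustration, and the sp_rename step is omitted because it is specific to SQL Server.

```python
import sqlite3

# Hypothetical rows; only the table and column names come from the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SalesData (ID INTEGER, CustomerName TEXT, Product TEXT, Sales_Value INTEGER)"
)
conn.executemany(
    "INSERT INTO SalesData VALUES (?, ?, ?, ?)",
    [
        (1, "Ada", "Laptop", 6000),
        (2, "Bayo", "Laptop", 4000),
        (3, "Chi", "Phone", 3000),
        (4, "Dayo", "Phone", 9000),
    ],
)

# Filter, then group, then sort: clauses written in the order the post lists.
product_totals = conn.execute(
    """
    SELECT Product, SUM(Sales_Value) AS TotalSales
    FROM SalesData
    WHERE Sales_Value > 3500
    GROUP BY Product
    ORDER BY TotalSales DESC
    """
).fetchall()
print(product_totals)

# Update one record, then re-check the overall total.
conn.execute("UPDATE SalesData SET Sales_Value = 7000 WHERE ID = 1")
grand_total = conn.execute("SELECT SUM(Sales_Value) FROM SalesData").fetchone()[0]
print(grand_total)
```

The WHERE clause removes Chi's 3000 sale before grouping, so the Phone total reflects only the remaining row.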
Venkatesh Gunasekaran
3w
💬 SQL Challenge of the Day

Problem: Given a table "orders" with columns (order_id, customer_id, order_date, total_amount), write a SQL query to calculate the running total of total_amount for each customer ordered by order_date.

Query:
```sql
SELECT
    order_id,
    customer_id,
    order_date,
    total_amount,
    SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total
FROM orders;
```

Answer: The query calculates the running total of total_amount for each customer, accumulated in order_date order.

Explanation: The query uses a window function with the PARTITION BY clause to separate the data into partitions by customer_id and then calculates the running total of total_amount within each partition ordered by order_date.

Example: Consider the following "orders" table:

| order_id | customer_id | order_date | total_amount |
|----------|-------------|------------|--------------|
| 1 | A | 2022-01-01 | 100 |
| 2 | A | 2022-01-03 | 150 |
| 3 | B | 2022-01-02 | 200 |
| 4 | A | 2022-01-05 | 120 |

The query will output:

| order_id | customer_id | order_date | total_amount | running_total |
|----------|-------------|------------|--------------|---------------|
| 1 | A | 2022-01-01 | 100 | 100 |
| 2 | A | 2022-01-03 | 150 | 250 |
| 4 | A | 2022-01-05 | 120 | 370 |
| 3 | B | 2022-01-02 | 200 | 200 |

#PowerBIChallenge #PowerInterview #LearnPowerBi #LearnSQL #TechJobs #DataAnalytics #DataScience #BigData #DataAnalyst #MachineLearning #Python #SQL #Tableau #DataVisualization #DataEngineering #ArtificialIntelligence #CloudComputing #BusinessIntelligence #Data