SQL and Spark Patterns for Scalable Data Systems

People think SQL problems are new ❌ They're not 🔸 They're the same real-world patterns repeating again and again 🔁

Here's the twist 👇
🔸 The same logic works in SQL and the Spark DataFrame API ⚡

🔷 Duplicate records 🔁 Same student marked twice
→ SQL: GROUP BY + HAVING
→ Spark: groupBy + count + filter

🔷 Second highest salary 🥈 Runner-up in a race
→ SQL: subquery / window
→ Spark: dense_rank()

🔷 Top 3 salaries 🏆 Top performers
→ SQL: ORDER BY + LIMIT
→ Spark: orderBy + limit

🔷 Revenue per product 💰 Which item earns most
→ SQL: SUM + GROUP BY
→ Spark: groupBy + agg

🔷 No department ❌ Missing relationships
→ SQL: LEFT JOIN + NULL check
→ Spark: left join + isNull

🔷 Loyal customers 🤝 Never returned items
→ SQL: NOT IN / NOT EXISTS
→ Spark: left anti join

🔷 Orders per customer 📊 Visit frequency
→ SQL: COUNT + GROUP BY
→ Spark: groupBy + count

🔷 Joined in 2023 📅 New employees
→ SQL: EXTRACT(YEAR)
→ Spark: year()

🔷 Avg order value 📈 Spending behavior
→ SQL: AVG
→ Spark: avg()

🔷 Latest order 🕒 Last interaction
→ SQL: MAX(date)
→ Spark: max()

Same logic. Two implementations.

The real skill? 🧠
🔸 Not SQL
🔸 Not Spark
🔹 Understanding patterns once and applying them everywhere 🚀

That's how you move from writing queries to building scalable data systems 🔥

#dataengineering #sql #pyspark #bigdata #datapipelines #learningjourney #careergrowth
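Two of the patterns above (duplicates via GROUP BY + HAVING, second-highest salary via a dense_rank window) can be sketched in plain SQL. The snippet below runs the queries through Python's stdlib sqlite3 so it is self-contained; the `employees` table and its columns are hypothetical, and the Spark equivalents in the comments are the usual DataFrame-API forms, not code from this post.

```python
import sqlite3

# In-memory database with a hypothetical employees table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", 90), ("Ben", 120), ("Cara", 120), ("Dev", 80), ("Ana", 90)],
)

# Duplicate records: GROUP BY + HAVING
# Spark equivalent (sketch): df.groupBy("name", "salary").count().filter("count > 1")
dupes = conn.execute(
    "SELECT name, COUNT(*) FROM employees "
    "GROUP BY name, salary HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # [('Ana', 2)]

# Second highest salary: DENSE_RANK() window
# Spark equivalent (sketch): dense_rank().over(Window.orderBy(desc("salary")))
second = conn.execute(
    """
    SELECT DISTINCT salary FROM (
        SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
        FROM employees
    ) WHERE rnk = 2
    """
).fetchone()[0]
print(second)  # 90
```

DENSE_RANK (rather than RANK or ROW_NUMBER) is the safe choice here because tied top salaries still leave rank 2 meaning "second distinct value". Note SQLite only supports window functions from version 3.25 onward.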

There are also patterns for self-joins, semi-/anti-joins, aggregate queries with different window frames, and CTEs vs temp tables; at times a sequence of updates on a temp table beats one huge SELECT.
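The anti-join pattern mentioned above (and in the post's "loyal customers" example) can be sketched the same way. This is a minimal demo using stdlib sqlite3 with hypothetical `customers` and `returned_items` tables; the Spark left-anti-join form in the comment is an assumption about column naming, not code from this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE returned_items (customer_id INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ana"), (2, "Ben"), (3, "Cara")],
)
conn.execute("INSERT INTO returned_items VALUES (2)")  # only Ben returned something

# NOT EXISTS: keep customers with no matching return row
# Spark equivalent (sketch):
#   customers.join(returns, customers.id == returns.customer_id, "left_anti")
loyal = conn.execute(
    """
    SELECT c.name FROM customers c
    WHERE NOT EXISTS (
        SELECT 1 FROM returned_items r WHERE r.customer_id = c.id
    )
    ORDER BY c.name
    """
).fetchall()
print(loyal)  # [('Ana',), ('Cara',)]
```

NOT EXISTS is generally preferred over NOT IN here: NOT IN silently returns no rows if the subquery produces any NULL, while NOT EXISTS handles NULLs as expected.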
