🚀 SQL Joins – Deep Dive (Hands-on Learning)

Spent some time strengthening my understanding of SQL joins using real datasets. Sharing both concepts and queries 👇

🔹 INNER JOIN – only the records that match in both tables
SELECT * FROM samples.bakehouse.sales_customers c
INNER JOIN samples.bakehouse.sales_transactions t
ON t.customerID = c.customerID;

🔹 LEFT JOIN / LEFT OUTER JOIN – all rows from the left table, plus matches from the right
SELECT * FROM samples.bakehouse.sales_customers c
LEFT JOIN samples.bakehouse.sales_transactions t
ON t.customerID = c.customerID;

🔹 LEFT ANTI JOIN – rows in the left table with NO match in the right (returns left-table columns only)
SELECT * FROM samples.bakehouse.sales_customers c
LEFT ANTI JOIN samples.bakehouse.sales_transactions t
ON t.customerID = c.customerID;

🔹 LEFT SEMI JOIN – rows in the left table that HAVE a match in the right (returns left-table columns only)
SELECT * FROM samples.bakehouse.sales_customers c
LEFT SEMI JOIN samples.bakehouse.sales_transactions t
ON t.customerID = c.customerID;

🔹 RIGHT JOIN / RIGHT OUTER JOIN – all rows from the right table, plus matches from the left
SELECT * FROM samples.bakehouse.sales_customers c
RIGHT JOIN samples.bakehouse.sales_transactions t
ON t.customerID = c.customerID;

🔹 SEMI JOIN – engine-specific: ANTI/SEMI joins are not standard SQL, and in Spark SQL, SEMI JOIN is simply a synonym for LEFT SEMI JOIN
SELECT * FROM samples.bakehouse.sales_customers c
SEMI JOIN samples.bakehouse.sales_transactions t
ON t.customerID = c.customerID;

🔹 CROSS JOIN – Cartesian product: every left row paired with every right row, no ON clause
SELECT * FROM samples.bakehouse.sales_customers c
CROSS JOIN samples.bakehouse.sales_transactions t;

💡 Key Learnings:
🔹 Choosing the right join = better performance + correct results
🔹 LEFT ANTI & LEFT SEMI joins are powerful for filtering datasets (especially in PySpark/Databricks)
🔹 Joins are core to ETL pipelines & analytics workflows

#SQL #DataEngineering #DataAnalytics #PySpark #Databricks #Learning #100DaysOfCode
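Since LEFT ANTI and LEFT SEMI come up most often in PySpark/Databricks work, here is a minimal DataFrame-API sketch of the same two filters. It assumes a Databricks workspace where the samples.bakehouse tables from the queries above exist; the variable names are illustrative:

# Assumes an active Databricks SparkSession named `spark`
customers = spark.table("samples.bakehouse.sales_customers")
transactions = spark.table("samples.bakehouse.sales_transactions")

# Customers with at least one transaction (returns left-side columns only)
active = customers.join(transactions, on="customerID", how="left_semi")

# Customers with no transactions at all, handy for data-quality checks
inactive = customers.join(transactions, on="customerID", how="left_anti")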
SQL Joins Explained with Examples
More Relevant Posts
Understanding explode() in PySpark (and how to reverse it)

Working with nested data is a common part of building data pipelines in PySpark, especially when dealing with arrays or semi-structured formats like JSON. One function that consistently proves useful in these scenarios is explode(). Here is a simple example to illustrate how it works.

Input Data:
Name  | Dept
John  | [Sales, Marketing]
Sarah | [Finance, HR, IT]

Using explode()
The explode() function turns each element of an array into its own row, making the data easier to process and analyze.

from pyspark.sql.functions import explode

df_exploded = df.select("Name", explode("Dept").alias("Department"))
df_exploded.show()

Output:
Name  | Department
John  | Sales
John  | Marketing
Sarah | Finance
Sarah | HR
Sarah | IT

Reversing explode() using collect_list()
After processing the flattened data, you may want to bring it back to its original structure. This can be done using collect_list().

from pyspark.sql.functions import collect_list

df_grouped = df_exploded.groupBy("Name") \
    .agg(collect_list("Department").alias("Dept"))
df_grouped.show()

Output (restored structure):
Name  | Dept
John  | [Sales, Marketing]
Sarah | [Finance, HR, IT]

One caveat: collect_list() does not guarantee element order after a shuffle, so the rebuilt arrays may not match the original ordering exactly.

Key Takeaways
- explode() converts arrays into multiple rows
- collect_list() aggregates rows back into arrays
- These functions are essential when working with nested or semi-structured data in ETL pipelines.

In many real-world scenarios, especially when working with JSON data or API responses, understanding how to flatten and reconstruct data can make your transformations much more efficient and readable.

#PySpark #DataEngineering #BigData #ApacheSpark #ETL #LearnData
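If you want to try the round trip end to end, here is a self-contained sketch; the sample data mirrors the input table above, and the app name is arbitrary:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, collect_list

spark = SparkSession.builder.appName("ExplodeDemo").getOrCreate()

# Sample data matching the input table above
df = spark.createDataFrame(
    [("John", ["Sales", "Marketing"]), ("Sarah", ["Finance", "HR", "IT"])],
    ["Name", "Dept"],
)

# Flatten: one row per array element
df_exploded = df.select("Name", explode("Dept").alias("Department"))

# Rebuild: collect the rows back into arrays (element order not guaranteed)
df_grouped = df_exploded.groupBy("Name").agg(collect_list("Department").alias("Dept"))
df_grouped.show(truncate=False)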
🔥 Free resource for data professionals: SQL ↔ PySpark cheat sheet 📊

If you use SQL daily and want to move into big data, this guide closes the gap fast.

What's covered:
🔹 SELECT, WHERE, HAVING → PySpark equivalents
🔹 Aggregations — SUM, AVG, COUNT at scale
🔹 GROUP BY → groupBy() with agg()
🔹 Joins — inner, left, anti, cross
🔹 Window functions — ranking, running totals, lag/lead
🔹 Subqueries → nested DataFrames
🔹 Performance tips — partitioning, broadcasting, caching

Designed for:
✔ SQL analysts moving into data engineering
✔ PySpark users who need a SQL mental model
✔ Anyone preparing for data engineering interviews

The format is clean and visual — SQL on one side, PySpark on the other, so you can compare at a glance without digging through docs.
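As a taste of that side-by-side format, here is one sketch of the GROUP BY → groupBy().agg() mapping; the employees table and its columns are invented for illustration:

from pyspark.sql import functions as F

# SQL:
#   SELECT department, SUM(salary) AS total_pay, COUNT(*) AS headcount
#   FROM employees
#   GROUP BY department
#   HAVING COUNT(*) > 5;

# PySpark equivalent, assuming df holds the same columns:
result = (
    df.groupBy("department")
      .agg(F.sum("salary").alias("total_pay"), F.count("*").alias("headcount"))
      .filter(F.col("headcount") > 5)   # HAVING becomes a filter on the aggregate
)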
When I started learning data, I thought SQL would eventually be replaced. Turns out, I was completely wrong.

SQL isn't dying. It's quietly running the world.

Every few years, something new promises to replace it — NoSQL, Spark, DataFrames, vector databases, AI-generated queries. And yet… here we are. SQL just turned 50 — and it still powers more of the world's data infrastructure than anything else.

Here's why it's not going anywhere:
→ It's declarative — you focus on what, not how
→ It's universal — Postgres, Snowflake, BigQuery, Redshift… same language
→ It's readable — queries written 10 years ago still make sense today
→ It bridges business logic and data better than any tool

But here's the real takeaway:

Data engineering isn't about the fanciest tools. It's about building systems people can trust. And trust starts with understanding your data — where it comes from, how it moves, and where it breaks.

SQL forces you to slow down and think. That's what saves you from broken pipelines at 3am.

The best engineers aren't the ones chasing trends — they're the ones who can write clean queries, explain them simply, and debug when things fail. That's the craft. And SQL is where you learn it.

Don't skip fundamentals chasing shiny tools. They only work because of the fundamentals.

Curious — do you think SQL will ever be replaced?

#SQL #DataEngineering #Analytics #TechCareers #Data
🚀 Stop settling for the default "SortMergeJoin" just because Spark says so!

If you're working with Big Data in PySpark, understanding the join strategy happening under the hood is the difference between a 2-minute job and a 2-hour nightmare. Here is the breakdown of the ultimate heavyweight battle in Spark optimization:

🔵 Sort Merge Join (The Heavyweight Champion)
This is Spark's default join strategy (since Spark 2.3). It's built for the "Big vs. Big" scenario.
How it works:
🖍️ Shuffle: rows with the same keys are moved to the same partitions.
🖍️ Sort: every partition is sorted by the join key.
🖍️ Merge: Spark iterates through both sorted sets and matches them.
🖍️ Best for: two very large datasets that don't fit in memory.
🖍️ Why use it: it's robust and avoids OutOfMemory (OOM) errors by spilling to disk if needed.

🟠 Shuffle Hash Join (The Speed Specialist)
The challenger that thrives when there is a size imbalance.
How it works:
🖍️ Shuffle: like SMJ, data moves to the same partitions.
🖍️ Hash: Spark builds a hash table from the smaller side in memory.
🖍️ Probe: it loops through the larger side to find matches in the hash table.
🖍️ Best for: one large table + one "medium" table (small enough to build a hash map, but too big to broadcast).
🖍️ Why use it: it skips the expensive sorting phase, making it significantly faster than SMJ if memory allows.

📊 Comparison at a Glance

Sort Merge Join
🖍️ Default? Yes ✅
🖍️ Mechanism: Shuffle → Sort → Merge
🖍️ Large vs large tables
🖍️ Low OOM risk (spills to disk)
🖍️ Performance: stable/reliable

Shuffle Hash Join
🖍️ Default? No ❌
🖍️ Mechanism: Shuffle → Hash → Probe
🖍️ Large vs medium tables
🖍️ High OOM risk (hash table must fit in memory)
🖍️ Performance: faster

💻 The Code Snippet

# To favor Shuffle Hash Join, you often need to disable the Sort Merge preference
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

# Standard join (the Spark planner decides the strategy)
df_joined = df_large.join(df_medium, "user_id", "inner")

💡 Pro-Tip for the Interview: if you are asked "How do you optimize a slow join?", don't just say "add more executors." Mention that you'd check the Spark UI, look at the physical plan using .explain(), and see whether switching from SortMergeJoin to ShuffleHashJoin (or Broadcast) would reduce the sorting overhead!

📍 Join my Personal Channels:
WhatsApp AI Data Guild: https://lnkd.in/dG2tj8m4
Telegram AI Data Guild: https://t.me/aidataguild

♻️ Repost if you found it useful. Follow 👉 Aniket Ghodake for more Data Engineering content 🚀

#ApacheSpark #PySpark #BigData #DataEngineering #PerformanceTuning #CloudComputing #DataScience
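Building on the pro-tip above, here is a minimal sketch of inspecting and steering the plan yourself. df_large, df_small, and user_id are placeholder names, and the broadcast hint only pays off when the small side genuinely fits in executor memory:

from pyspark.sql.functions import broadcast

# Hint Spark to broadcast the small side, skipping the shuffle entirely
df_joined = df_large.join(broadcast(df_small), "user_id", "inner")

# Inspect the physical plan: look for BroadcastHashJoin vs SortMergeJoin
df_joined.explain()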
🚀 Day 11 of 30 — SQL & PySpark Challenge Series

📌 String Functions — TRIM, SPLIT, REGEXP & CONCAT

Raw data is always messy: extra spaces, mixed case, comma-separated tags, emails as plain text. Here's the complete string-cleaning toolkit in SQL & PySpark.

─────────────────────────────
🗄️ #SQL:

SELECT
  UPPER(TRIM(name)) AS clean_name,
  LOWER(TRIM(city)) AS clean_city,
  SPLIT_PART(tags, ',', 1) AS first_tag,
  REGEXP_EXTRACT(email, '@(.+)', 1) AS email_domain,
  CONCAT(UPPER(TRIM(name)), ' | ', LOWER(TRIM(city))) AS label
FROM customers;

─────────────────────────────
⚡ #PySpark:

from pyspark.sql import functions as F

df_clean = df \
  .withColumn("clean_name", F.upper(F.trim(F.col("name")))) \
  .withColumn("clean_city", F.lower(F.trim(F.col("city")))) \
  .withColumn("first_tag", F.split(F.col("tags"), ",").getItem(0)) \
  .withColumn("email_domain", F.regexp_extract(F.col("email"), r'@(.+)', 1)) \
  .withColumn("label", F.concat(F.upper(F.trim(F.col("name"))), F.lit(" | "), F.lower(F.trim(F.col("city")))))

─────────────────────────────
💡 Quick reference:
TRIM → strip leading/trailing spaces (always do this first!)
SPLIT_PART → extract from a delimited string by position
REGEXP_EXTRACT → pull capture groups — the most powerful string tool in SQL
CONCAT → combine columns with literals (use F.lit() in PySpark)

⚠️ Dialect watch:
SPLIT_PART → Snowflake, Redshift, DuckDB
SPLIT()[0] → Spark SQL / BigQuery
Always check your platform's string docs!

#SQL #PySpark #Databricks #SparkSQL #Snowflake #BigQuery #DataEngineering #DataEngineer #ETL #dbt #ApacheSpark #30DayChallenge #SQLTips #DataQuality #DataPipeline
Yesterday I came across a simple comparison that made me pause for a moment. Not because it was new, but because it reminded me how far the journey from SQL to PySpark actually goes.

When I first started working with data, everything revolved around SQL. It was clean, structured, and predictable. You write a query, run it, and get your result. Whether it was filtering rows, grouping data, or joining tables, SQL always felt like speaking a well-defined language.

Then came PySpark. At first, it didn't feel like just another tool. It felt like a shift in thinking. The same operations I used to write in SQL were now part of a programmatic flow. Instead of just querying data, I was building transformations step by step. A simple SELECT became a df.select(), filtering turned into df.filter(), and joins started looking more like logic than just syntax.

What really changed for me was how I started looking at data pipelines. SQL helped me answer questions. PySpark helped me build systems.

Over time, I realized it's not about choosing one over the other. It's about knowing when to use each. SQL is still the fastest way to explore and validate data. But when the scale grows, when pipelines become complex, and when automation matters, PySpark starts to shine.

This comparison is a good reminder that the core concepts never change. Selecting columns, filtering rows, aggregating data — it's all the same logic underneath. Only the way we express it evolves.

And that's the beauty of working in data. You don't start from scratch when you learn a new tool. You just translate what you already know into a new language.

#DataEngineering #SQL #PySpark #BigData #DataAnalytics #DataScience #Snowflake #Databricks #ETL #ELT #DataPipelines #AnalyticsEngineering #LearningJourney #TechCareer #Upskilling #DataCommunity
🚀 PySpark Commands I Use Daily as a Data Engineer

Working with big data? These are some of the most commonly used PySpark commands I rely on in day-to-day data engineering tasks 👇

🔹 1. Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()

🔹 2. Read Data
df = spark.read.csv("file.csv", header=True, inferSchema=True)
df = spark.read.parquet("file.parquet")

🔹 3. Show Data
df.show()
df.printSchema()

🔹 4. Select & Filter
df.select("name", "age")
df.filter(df.age > 25)

🔹 5. Add/Modify Columns
from pyspark.sql.functions import col
df = df.withColumn("age_plus_1", col("age") + 1)

🔹 6. Group By & Aggregation
df.groupBy("department").count()

🔹 7. Join DataFrames
df.join(df2, df.id == df2.id, "inner")

🔹 8. Handle Missing Values
df.dropna()
df.fillna({"age": 0})

🔹 9. Write Data
df.write.csv("output.csv")
df.write.parquet("output.parquet")

🔹 10. Cache Data (Performance Optimization)
df.cache()

💡 Mastering these basics can significantly improve your efficiency when working with large-scale data pipelines.

#DataEngineering #PySpark #BigData #ApacheSpark #ETL #LearningInPublic
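Chained together, these commands already form a small end-to-end pipeline. A minimal sketch, assuming a hypothetical people.csv with name, age, and department columns (all paths and column names here are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("MiniPipeline").getOrCreate()

# Read the raw file, letting Spark infer column types
df = spark.read.csv("people.csv", header=True, inferSchema=True)

adults_by_dept = (
    df.filter(col("age") > 25)          # keep adults only
      .groupBy("department").count()    # headcount per department
)

# Parquet keeps the schema and compresses well for downstream jobs
adults_by_dept.write.mode("overwrite").parquet("output/adults_by_dept")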
🚀 PySpark Essential Commands Every Data Engineer Should Know

If you're working with big data, pipelines, or distributed systems, PySpark is not optional anymore. Here are the core building blocks I use almost daily 👇

⚙️ 1. Getting Started
- SparkSession → entry point to everything
- read() → load data from CSV, Parquet, etc.

🔍 2. Data Transformation Basics
- select() → pick columns
- filter() → apply conditions
- withColumn() → add/modify columns
👉 These are your bread & butter operations

📊 3. Aggregations & Joins
- groupBy().agg() → summarize data
- join() → combine datasets
💡 This is where most business logic lives

🧹 4. Data Cleaning
- Handle nulls → na.fill(), na.drop()
- Rename / drop columns
👉 Clean data = reliable pipelines

⚡ 5. Performance Optimization
- cache() → speed up repeated operations
- count() → trigger execution (lazy evaluation! see the sketch below)
💡 Remember: Spark is lazy — nothing runs until an action

💾 6. Output & Debugging
- write() → save results (Parquet, Delta)
- show() / printSchema() → quick checks

🔥 7. Advanced Concepts
- Functions (col, lit, udf)
- RDD access (for low-level control)

💡 Key Insight: PySpark is not just about code, it's about thinking in distributed data transformations.

Curious — do you prefer the DataFrame API or Spark SQL for your pipelines?

#PySpark #DataEngineering #BigData #ApacheSpark #ETL #DataPipeline #MachineLearning #Analytics #DataScience
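To make the lazy-evaluation point concrete, here is a minimal sketch with made-up data: the transformations return instantly because they only build a plan, and Spark does no work until the action at the end.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("LazyDemo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "tag"])

# Transformations: these return new DataFrames immediately, no job runs yet
filtered = df.filter(col("id") > 1)
projected = filtered.select("tag")

# Action: only now does Spark plan, optimize, and execute the whole chain
print(projected.count())  # -> 2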
🧑💻 The Only Cheatsheet You Need For SQL!

This cheat sheet gives you the exact commands, functions, and patterns you'll use daily as a Data Analyst or Data Engineer 👇

• Core Categories → DDL, DML, DQL, DCL, TCL — the foundation of how SQL works
• Operators → arithmetic, comparison, logical, and compound operators to build conditions
• Database Objects → tables, views, indexes, triggers — how databases are structured
• Constraints → NOT NULL, PRIMARY KEY, FOREIGN KEY to enforce data integrity
• Aggregation → SUM, AVG, COUNT, MAX, MIN to turn raw data into insights
• Filtering & Grouping → WHERE, GROUP BY, HAVING to slice and analyze data
• DDL Commands → CREATE, ALTER, DROP — designing and modifying tables
• DML Commands → INSERT, UPDATE, DELETE — managing data inside tables
• DQL Queries → SELECT, filtering, sorting, limiting — your everyday queries
• Joins → INNER, LEFT, RIGHT, FULL — combining data across tables
• Set Operations → UNION, INTERSECT, EXCEPT — merging result sets

This is the difference between knowing SQL syntax and actually using SQL to solve problems. Save this — you'll come back to it every time you write a query.

🔹 Useful resources to practice and level up:
• SQL Practice → https://lnkd.in/esAx8CTH
• Interactive SQL → https://sqlbolt.com/
• MySQL Docs → https://dev.mysql.com/doc/

♻️ I share cloud and data analysis/data engineering tips, real-world project breakdowns, and interview insights through my free newsletter.
🤝 Subscribe for free here → https://lnkd.in/ebGPbru9

Repost ♻️ if this helps. Follow Abhisek Sahu for more practical Data & AI cheat sheets 🚀

#sql