🚀 Day 11 of 30 — SQL & PySpark Challenge Series
📌 String Functions — TRIM, SPLIT, REGEXP & CONCAT

Raw data is always messy: extra spaces, mixed case, comma-separated tags, emails as plain text. Here's the complete string-cleaning toolkit in SQL & PySpark.

─────────────────────────────
🗄️ #SQL:

SELECT
  UPPER(TRIM(name)) AS clean_name,
  LOWER(TRIM(city)) AS clean_city,
  SPLIT_PART(tags, ',', 1) AS first_tag,
  REGEXP_EXTRACT(email, '@(.+)', 1) AS email_domain,
  CONCAT(UPPER(TRIM(name)), ' | ', LOWER(TRIM(city))) AS label
FROM customers;

─────────────────────────────
⚡ #PySpark:

from pyspark.sql import functions as F

df \
  .withColumn("clean_name", F.upper(F.trim(F.col("name")))) \
  .withColumn("clean_city", F.lower(F.trim(F.col("city")))) \
  .withColumn("first_tag", F.split(F.col("tags"), ",").getItem(0)) \
  .withColumn("email_domain", F.regexp_extract(F.col("email"), r'@(.+)', 1)) \
  .withColumn("label", F.concat(F.upper(F.trim(F.col("name"))), F.lit(" | "), F.lower(F.trim(F.col("city")))))

─────────────────────────────
💡 Quick reference:
TRIM → strip leading/trailing spaces (always do this first!)
SPLIT_PART → extract from a delimited string by position
REGEXP_EXTRACT → pull capture groups — the most powerful string tool in SQL
CONCAT → combine columns with literals (use F.lit() in PySpark)

⚠️ Dialect watch:
SPLIT_PART → Snowflake, Redshift, DuckDB
SPLIT()[0] → Spark SQL / BigQuery
REGEXP_EXTRACT → Spark SQL, BigQuery, DuckDB (Snowflake and Redshift use REGEXP_SUBSTR instead)
Always check your platform's string docs!

#SQL #PySpark #Databricks #SparkSQL #Snowflake #BigQuery #DataEngineering #DataEngineer #ETL #dbt #ApacheSpark #30DayChallenge #SQLTips #DataQuality #DataPipeline
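💡 To make the dialect watch concrete, here is the first_tag extraction in the 0-indexed dialects (same customers table assumed; a sketch of the difference, not a full port of the query above):

-- Spark SQL: split() returns an array, indexed from 0
SELECT split(tags, ',')[0] AS first_tag FROM customers;

-- BigQuery: array access needs OFFSET (0-based) or ORDINAL (1-based)
SELECT SPLIT(tags, ',')[OFFSET(0)] AS first_tag FROM customers;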
🚀 Stop settling for the default SortMergeJoin just because Spark says so!

If you're working with Big Data in PySpark, understanding the join strategy happening under the hood is the difference between a 2-minute job and a 2-hour nightmare. Here is the breakdown of the ultimate heavyweight battle in Spark optimization:

🔵 Sort Merge Join (The Heavyweight Champion)
This is Spark's default join strategy (since Spark 2.3). It's built for the "Big vs. Big" scenario.
How it works:
🖍️ Shuffle: rows with the same keys are moved to the same partitions.
🖍️ Sort: every partition is sorted by the join key.
🖍️ Merge: Spark iterates through both sorted sets and matches them.
🖍️ Best for: two very large datasets that don't fit in memory.
🖍️ Why use it: it's robust and avoids OutOfMemory (OOM) errors by spilling to disk if needed.

🟠 Shuffle Hash Join (The Speed Specialist)
The challenger that thrives when there is a size imbalance.
How it works:
🖍️ Shuffle: like SMJ, data moves to the same partitions.
🖍️ Hash: Spark builds a hash table from the smaller side in memory.
🖍️ Probe: it loops through the larger side to find matches in the hash table.
🖍️ Best for: one large table + one "medium" table (small enough to build a hash map, but too big to broadcast).
🖍️ Why use it: it skips the expensive sorting phase, making it significantly faster than SMJ if memory allows.

📊 Comparison at a Glance

Sort Merge Join
🖍️ Default? Yes ✅
🖍️ Shuffle → Sort → Merge
🖍️ Large vs Large tables
🖍️ Low OOM risk (spills to disk)
🖍️ Performance: stable, reliable

Shuffle Hash Join
🖍️ Default? No ❌
🖍️ Shuffle → Hash → Probe
🖍️ Large vs Medium tables
🖍️ High OOM risk (hash table must fit in memory)
🖍️ Performance: faster

💻 The Code Snippet

# To force Shuffle Hash Join, you often need to disable Sort Merge
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

# Standard join (the Spark planner decides)
df_joined = df_large.join(df_medium, "user_id", "inner")

💡 Pro-Tip for the Interview: if you are asked "How do you optimize a slow join?", don't just say "add more executors." Mention that you'd check the Spark UI, look at the physical plan using .explain(), and see whether switching from SortMergeJoin to ShuffleHashJoin (or Broadcast) would reduce the sorting overhead!

📍 Join my Personal Channels:
WhatsApp AI Data Guild: https://lnkd.in/dG2tj8m4
Telegram AI Data Guild: https://t.me/aidataguild

♻️ Repost if you found it useful. Follow 👉 Aniket Ghodake for more Data Engineering content 🚀

#ApacheSpark #PySpark #BigData #DataEngineering #PerformanceTuning #CloudComputing #DataScience
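One more option worth knowing: since Spark 3.0 you can request a strategy per join with a hint instead of flipping the global config. A minimal sketch, reusing df_large and df_medium from the snippet above:

# Spark 3.0+ join strategy hints: request Shuffle Hash for this join only
df_joined = df_large.join(df_medium.hint("shuffle_hash"), "user_id", "inner")

# Verify which strategy the planner actually picked
df_joined.explain()

The hint is a request, not a guarantee; the planner can still fall back if the hinted strategy doesn't support that join type.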
🚀 Day 10 of 30 — SQL & PySpark Challenge Series
📌 NULL Handling — COALESCE, NULLIF, IFNULL & fillna

NULLs don't throw errors. They silently corrupt your results.

─────────────────────────────
🗄️ SQL:

SELECT
  COALESCE(region, 'Unknown') AS region,   -- first non-NULL wins
  IFNULL(discount, 0) AS discount,         -- single fallback
  NULLIF(status, 'N/A') AS status,         -- sentinel → NULL
  CASE WHEN region IS NULL
         OR discount IS NULL
         OR delivery_days IS NULL
       THEN TRUE ELSE FALSE END AS has_nulls
FROM orders;

─────────────────────────────
⚡ PySpark:

from pyspark.sql import functions as F

df \
  .withColumn("region", F.coalesce(F.col("region"), F.lit("Unknown"))) \
  .withColumn("discount", F.coalesce(F.col("discount"), F.lit(0))) \
  .withColumn("status", F.nullif(F.col("status"), F.lit("N/A")))  # F.nullif requires Spark 3.5+

# Bulk fill multiple columns at once
df.fillna({"region": "Unknown", "discount": 0})

# Drop rows where critical columns are NULL
df.dropna(subset=["region", "delivery_days"])

─────────────────────────────
💡 Quick reference:
COALESCE(a,b,c) → first non-NULL from the list (most portable)
IFNULL / NVL → single fallback; dialect-specific
NULLIF(a, b) → returns NULL if a = b (great for 'N/A', '', 0)

⚠️ NULL traps to memorise:
→ NULL + 5 = NULL (not 5)
→ NULL = NULL → NULL, not TRUE (use IS NULL, never = NULL)
→ COUNT(col) skips NULLs — COUNT(*) does not

#SQL #PySpark #Databricks #SparkSQL #Snowflake #BigQuery #DataEngineering #DataEngineer #ETL #dbt #ApacheSpark #30DayChallenge #DataQuality #SQLTips #DataPipeline
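Note: F.nullif only landed in Spark 3.5. On older versions the same sentinel-to-NULL logic is a one-liner with when/otherwise; a minimal sketch assuming the same df:

from pyspark.sql import functions as F

# Pre-3.5 equivalent of F.nullif(F.col("status"), F.lit("N/A"))
df = df.withColumn(
    "status",
    F.when(F.col("status") == "N/A", F.lit(None)).otherwise(F.col("status"))
)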
#sql #data #AI #datafam #postgres #mysql

#SQL Twitch (Hard Level) #DAY65

"From users who had their first session as a viewer, how many streamer sessions have they had? Return the user id and number of sessions in descending order. In case there are users with the same number of sessions, order them by ascending user id."

__________________________________________________________________________________

𝐒𝐜𝐡𝐞𝐦𝐚 𝐚𝐧𝐝 𝐃𝐚𝐭𝐚𝐬𝐞𝐭

CREATE TABLE twitch_sessions (
  user_id BIGINT,
  session_start DATETIME,
  session_end DATETIME,
  session_id BIGINT PRIMARY KEY,
  session_type VARCHAR(20) CHECK (session_type IN ('viewer', 'streamer'))
);

INSERT INTO twitch_sessions (user_id, session_start, session_end, session_id, session_type) VALUES
(101, '2024-02-01 10:00:00', '2024-02-01 11:00:00', 1, 'viewer'),
(101, '2024-02-02 14:00:00', '2024-02-02 15:30:00', 2, 'streamer'),
(102, '2024-02-01 09:30:00', '2024-02-01 10:30:00', 3, 'viewer'),
(102, '2024-02-03 16:00:00', '2024-02-03 17:00:00', 4, 'streamer'),
(102, '2024-02-05 18:00:00', '2024-02-05 19:30:00', 5, 'streamer'),
(103, '2024-02-02 11:00:00', '2024-02-02 12:00:00', 6, 'viewer'),
(104, '2024-02-01 08:30:00', '2024-02-01 09:00:00', 7, 'viewer'),
(104, '2024-02-04 20:00:00', '2024-02-04 21:00:00', 8, 'streamer'),
(104, '2024-02-06 22:00:00', '2024-02-06 23:00:00', 9, 'streamer'),
(104, '2024-02-07 15:00:00', '2024-02-07 16:30:00', 10, 'streamer');
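One possible solution, using a window function to find each user's first session (other approaches work too; note this version returns only users who actually have streamer sessions):

WITH first_session AS (
  SELECT user_id,
         session_type,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY session_start) AS rn
  FROM twitch_sessions
)
SELECT s.user_id,
       COUNT(*) AS sessions
FROM twitch_sessions s
JOIN first_session f
  ON f.user_id = s.user_id AND f.rn = 1
WHERE f.session_type = 'viewer'
  AND s.session_type = 'streamer'
GROUP BY s.user_id
ORDER BY sessions DESC, s.user_id ASC;

On the sample data this returns 104 → 3, 102 → 2, 101 → 1 (user 103 never streamed, so they drop out of the inner join).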
🚀 SQL Joins – Deep Dive (Hands-on Learning)

Spent some time strengthening my understanding of SQL Joins using real datasets. Sharing both concepts + queries 👇

🔹 INNER JOIN – Only matching records
SELECT * FROM samples.bakehouse.sales_customers c
INNER JOIN samples.bakehouse.sales_transactions t
  ON t.customerID = c.customerID;

🔹 LEFT JOIN / LEFT OUTER JOIN – All from left + matches
SELECT * FROM samples.bakehouse.sales_customers c
LEFT JOIN samples.bakehouse.sales_transactions t
  ON t.customerID = c.customerID;

🔹 LEFT ANTI JOIN – Records in left NOT in right
SELECT * FROM samples.bakehouse.sales_customers c
LEFT ANTI JOIN samples.bakehouse.sales_transactions t
  ON t.customerID = c.customerID;

🔹 LEFT SEMI JOIN – Records in left that HAVE matches (returns left-side columns only)
SELECT * FROM samples.bakehouse.sales_customers c
LEFT SEMI JOIN samples.bakehouse.sales_transactions t
  ON t.customerID = c.customerID;

🔹 RIGHT JOIN / RIGHT OUTER JOIN – All from right + matches
SELECT * FROM samples.bakehouse.sales_customers c
RIGHT JOIN samples.bakehouse.sales_transactions t
  ON t.customerID = c.customerID;

🔹 SEMI JOIN (engine-specific shorthand for LEFT SEMI)
SELECT * FROM samples.bakehouse.sales_customers c
SEMI JOIN samples.bakehouse.sales_transactions t
  ON t.customerID = c.customerID;

🔹 CROSS JOIN – Cartesian product
SELECT * FROM samples.bakehouse.sales_customers c
CROSS JOIN samples.bakehouse.sales_transactions t;

💡 Key Learnings:
Choosing the right join = better performance + correct results
LEFT ANTI & SEMI joins are powerful for filtering datasets (especially in PySpark/Databricks; see the sketch below)
Joins are core to ETL pipelines & analytics workflows

#SQL #DataEngineering #DataAnalytics #PySpark #Databricks #Learning #100DaysOfCode
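Since the post calls out anti/semi joins as especially useful in PySpark, here are the DataFrame-API equivalents (customers_df and transactions_df are illustrative names for the two tables above):

# PySpark equivalents of the filtering joins above
customers_df.join(transactions_df, "customerID", "left_anti")  # in left, no match in right
customers_df.join(transactions_df, "customerID", "left_semi")  # in left, has a match; keeps left columns only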
🚀 Day 14 of 30 — SQL & PySpark Challenge Series
📌 Subquery Types — Scalar, Correlated & Lateral

3 types. 3 behaviours. All look similar — but perform completely differently at scale.

─────────────────────────────
🗄️ SQL:

-- 1. Scalar: runs once, injects one value per row
SELECT order_id, amount,
       (SELECT AVG(amount) FROM orders) AS overall_avg
FROM orders;

-- 2. Correlated: re-executes once PER outer row ⚠️
SELECT e.emp_name,
       (SELECT MAX(o.amount) FROM orders o
         WHERE o.emp_id = e.emp_id) AS personal_max
FROM employees e;

-- 3. Lateral: top-N per group (cleanest)
SELECT e.emp_name, top2.amount
FROM employees e,
LATERAL (
  SELECT amount FROM orders o
  WHERE o.emp_id = e.emp_id
  ORDER BY amount DESC
  LIMIT 2
) top2;

─────────────────────────────
⚡ PySpark equivalents:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Scalar → collect once, inject as F.lit()
avg_val = orders.agg(F.avg("amount")).collect()[0][0]
orders.withColumn("overall_avg", F.lit(avg_val))

# Correlated → groupBy + join (never loop row-by-row!)
emp_max = orders.groupBy("emp_id").agg(F.max("amount").alias("personal_max"))
employees.join(emp_max, on="emp_id")

# Lateral → ROW_NUMBER window + filter
win = Window.partitionBy("emp_id").orderBy(F.col("amount").desc())
orders.withColumn("rn", F.row_number().over(win)).filter(F.col("rn") <= 2)

─────────────────────────────
💡 Quick rule:
Scalar → global metric, runs once ✅
Correlated → runs N times — rewrite as JOIN at scale ⚠️
Lateral → top-N per group, most flexible ✅

⚠️ A correlated subquery on 10M rows can mean 10M subquery executions (unless the optimizer decorrelates it). Always check your query plan!

#SQL #PySpark #Databricks #SparkSQL #Snowflake #BigQuery #DataEngineering #DataEngineer #ETL #dbt #ApacheSpark #30DayChallenge #SQLTips #DataPipeline #QueryOptimization
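A quick way to act on that last warning in PySpark: print the physical plan and confirm the rewrite produced a single aggregate plus join, not repeated per-row work (continuing the emp_max example above):

# The plan should show one aggregate + one join, not a per-row subquery
employees.join(emp_max, on="emp_id").explain()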
Data is everywhere. Answers are not.
Not because the data doesn't have them, but because most people can't speak SQL.

I built something to fix that. And I thought: wait. Why does asking a question about data require you to know SQL? That frustration became my project. 🤓

I built DataBot. 🤖
A GenAI-powered data analysis chatbot where you just... talk to your data. No SQL. No Python. No data analyst needed.

You type things like:
"Which region has the highest profit?"
"Show me monthly sales trends for 2023"
"Which customers returned the most products?"

DataBot:
🌑 Converts your question into SQL automatically
🌒 Runs it on your database
🌓 Draws the right chart — bar, line, or pie
🌔 Explains what it means in plain English
All in under 5 seconds.

Here's what makes it special: you don't have to use my data. You can upload YOUR own data in 5 different ways:
→ Upload a CSV file
→ Paste a Google Sheets link
→ Drop in any public URL with a table
→ Copy-paste directly from Excel
→ Load multiple datasets and JOIN them 😊

Yes — JOIN. You can ask questions across 5 tables at once. "Show me sales by customer segment" pulls from the orders table AND the customers table — automatically.

The tech stack that made this possible:
↳ Claude AI — my coding partner throughout this entire build. I used Claude to architect the app, debug errors, write the prompt engineering logic, and think through every design decision. Honestly, I could not have shipped this without it.
↳ GenAI for the intelligence — converting plain English into perfect SQL every single time
↳ Groq for the speed — free API key, responses in under 2 seconds
↳ LLaMA 3.3 as the language model — open source and incredibly accurate
↳ SQLite for the database — lightweight and zero setup
↳ Streamlit for the interface — a full chat UI in pure Python
↳ Plotly for auto-generated charts — picks bar vs line vs pie on its own

What I learned building this: 📝
1. Prompt engineering is a real skill. Getting GenAI to return clean SQL — not markdown, not explanations, just raw SQL — took way more iterations than I expected.
2. Groq changed everything. I spent weeks stuck on API key issues with other providers. Groq gave me a free key in 2 minutes and responses faster than I ever expected.
3. Claude AI is genuinely a co-builder. Every time I hit a wall — a bug, a design question, a deployment issue — Claude helped me think through it clearly and ship faster.

The best products solve one frustration really well. Mine solves: "I have data but I can't talk to it."

I'm a recent MS Statistical Data Science grad from SFSU. This project is live, open source, and on my GitHub. If you're hiring for Data Analyst or Data Scientist roles, or just want to see a live demo, drop a comment or DM me. Thank you!

Link: https://lnkd.in/g6wE4Qbb

#DataScience #GenAI #SQL #Groq #LLaMA #TextToSQL #Streamlit #Python #OpenToWork #SFSU #MachineLearning #DataAnalytics #ClaudeAI #DataBot
🔷 𝗗𝗲𝗹𝘁𝗮 𝗟𝗮𝗸𝗲 𝗼𝗻 𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀 𝘃𝘀 𝗟𝗼𝗰𝗮𝗹/𝗢𝗻-𝗣𝗿𝗲𝗺 𝗦𝗽𝗮𝗿𝗸
𝗪𝗵𝗮𝘁 𝗡𝗼𝗯𝗼𝗱𝘆 𝗧𝗲𝗹𝗹𝘀 𝗬𝗼𝘂 𝗪𝗵𝗲𝗻 𝗬𝗼𝘂 𝗦𝘄𝗶𝘁𝗰𝗵 𝗘𝗻𝘃𝗶𝗿𝗼𝗻𝗺𝗲𝗻𝘁𝘀

If you've worked with Databricks, Delta Lake feels like magic: it just works. No setup, no configuration, no extra packages. But the moment you try to replicate the same workflow on a 𝗹𝗼𝗰𝗮𝗹 𝗦𝗽𝗮𝗿𝗸 𝗲𝗻𝘃𝗶𝗿𝗼𝗻𝗺𝗲𝗻𝘁 𝗼𝗿 𝗼𝗻-𝗽𝗿𝗲𝗺 𝗰𝗹𝘂𝘀𝘁𝗲𝗿, things break. And it's not a bug; it's by design.

Here's what's actually happening 👇

⚡ 𝗢𝗻 𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀
Delta Lake is pre-installed and pre-configured. Your SparkSession is ready to go out of the box:

# On Databricks, Delta just works
data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")

No pip installs. No extra configs. Databricks handles all of it under the hood.

🖥️ 𝗢𝗻 𝗟𝗼𝗰𝗮𝗹 / 𝗢𝗻-𝗣𝗿𝗲𝗺 𝗦𝗽𝗮𝗿𝗸
You need to do three things explicitly:

𝗦𝘁𝗲𝗽 1: 𝗜𝗻𝘀𝘁𝗮𝗹𝗹 𝘁𝗵𝗲 𝗽𝗮𝗰𝗸𝗮𝗴𝗲

# terminal or notebook cell
pip install pyspark==4.0.0
pip install delta-spark==4.0.0

⚠️ Check your pyspark version and pick the matching delta-spark version.

𝗦𝘁𝗲𝗽 2: 𝗖𝗼𝗻𝗳𝗶𝗴𝘂𝗿𝗲 𝘁𝗵𝗲 𝗦𝗽𝗮𝗿𝗸𝗦𝗲𝘀𝘀𝗶𝗼𝗻 𝗱𝗶𝗳𝗳𝗲𝗿𝗲𝗻𝘁𝗹𝘆

import pyspark
from delta import *

builder = pyspark.sql.SparkSession.builder.appName("deltaExample") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

Notice three critical differences:
• You must explicitly register DeltaSparkSessionExtension
• You must replace the default catalog with DeltaCatalog
• The builder alone isn't enough: we wrap it with configure_spark_with_delta_pip(), which handles JAR resolution automatically

𝗦𝘁𝗲𝗽 3: 𝗬𝗼𝘂𝗿 𝗗𝗲𝗹𝘁𝗮 𝗰𝗼𝗱𝗲 𝘁𝗵𝗲𝗻 𝘄𝗼𝗿𝗸𝘀 𝘁𝗵𝗲 𝘀𝗮𝗺𝗲 𝘄𝗮𝘆

data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")

💡 𝗞𝗲𝘆 𝗧𝗮𝗸𝗲𝗮𝘄𝗮𝘆
Databricks abstracts away the complexity of Delta Lake setup, which is great for productivity but can create a false sense of simplicity. When you move to local development, CI/CD pipelines, or on-prem Spark clusters, understanding what Databricks was silently doing for you becomes critical.

The feature set is identical. The setup is not.

𝗛𝗮𝘃𝗲 𝘆𝗼𝘂 𝗯𝗲𝗲𝗻 𝗰𝗮𝘂𝗴𝗵𝘁 𝗼𝗳𝗳 𝗴𝘂𝗮𝗿𝗱 𝘀𝘄𝗶𝘁𝗰𝗵𝗶𝗻𝗴 𝗯𝗲𝘁𝘄𝗲𝗲𝗻 𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀 𝗮𝗻𝗱 𝗮 𝗹𝗼𝗰𝗮𝗹 𝗦𝗽𝗮𝗿𝗸 𝗲𝗻𝘃𝗶𝗿𝗼𝗻𝗺𝗲𝗻𝘁? 𝗗𝗿𝗼𝗽 𝘆𝗼𝘂𝗿 𝗲𝘅𝗽𝗲𝗿𝗶𝗲𝗻𝗰𝗲 𝗶𝗻 𝘁𝗵𝗲 𝗰𝗼𝗺𝗺𝗲𝗻𝘁𝘀 👇

#DataEngineering #DeltaLake #ApacheSpark #Databricks #PySpark #BigData #DataPlatform #OpenSource
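A quick smoke test for the local setup, reusing the same session and path from Step 3 (just a sanity check sketch):

# Read the table back to confirm Delta is wired up correctly
df = spark.read.format("delta").load("/tmp/delta-table")
df.show()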
Real-World Concept — Schema Evolution

Columns get added. Columns go missing. If you don't handle this → you lose data silently. ❌

In production, data arrives from multiple parquet files — and schemas change over time.

𝐑𝐞𝐚𝐥-𝐖𝐨𝐫𝐥𝐝 𝐂𝐨𝐧𝐜𝐞𝐩𝐭: 𝐒𝐜𝐡𝐞𝐦𝐚 𝐄𝐯𝐨𝐥𝐮𝐭𝐢𝐨𝐧 (𝐀𝐮𝐭𝐨𝐦𝐚𝐭𝐢𝐜 𝐇𝐚𝐧𝐝𝐥𝐢𝐧𝐠)

👉 Day 1 parquet file: id, first_name, last_name, salary, department
👉 Day 30 parquet file: id, first_name, last_name, salary, department, bonus (newly added)
👉 Day 60 parquet file: id, first_name, last_name, department (salary missing!)

Run this dataset first 👇 (a merged read follows below)

----------------------------------------------------------------------
Dataset:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Schema_Evolution").getOrCreate()

# ── Day 1: Old Schema ───────────────────
data1 = [
    (101, "Alice", "Smith", 90000, "Engineering"),
    (102, "Bob", "Jones", 75000, "Marketing"),
    (103, "Charlie", "Brown", 85000, "Engineering"),
]
df1 = spark.createDataFrame(
    data1, ["id", "first_name", "last_name", "salary", "department"]
)
df1.write.mode("overwrite").parquet("/tmp/emp/day1")

# ── Day 30: New Schema (bonus column added) ────────
data2 = [
    (104, "Diana", "Prince", 95000, "Marketing", 8000),
    (105, "Ethan", "Hunt", 88000, "Engineering", 5000),
    (106, "Fiona", "Green", 70000, "Marketing", 3000),
]
df2 = spark.createDataFrame(
    data2, ["id", "first_name", "last_name", "salary", "department", "bonus"]
)
df2.write.mode("overwrite").parquet("/tmp/emp/day30")

----------------------------------------------------------------------

👉 Missing values are filled with NULL ✔️
👉 No manual column handling needed ✔️

🧠 𝐖𝐡𝐲 𝐭𝐡𝐢𝐬 𝐢𝐬 𝐂𝐫𝐢𝐭𝐢𝐜𝐚𝐥
• Handles schema evolution automatically
• Prevents data loss
• No need to write custom select/alias logic
• Used heavily in data lakes (S3, ADLS, HDFS)

----------------------------------------------------------------------
🔥 𝐈'𝗺 𝐩𝐨𝐬𝐭𝗶𝗻𝗴 𝐚 𝐏𝐲𝐒𝐩𝐚𝐫𝐤 𝐢𝐧𝐭𝐞𝐫𝐯𝐢𝐞𝐰 𝐪𝐮𝐞𝐬𝐭𝐢𝐨𝐧 𝐞𝐯𝐞𝐫𝐲 𝐝𝐚𝐲!
Follow Srinivasan E for more 🚀
Try solving & comment your approach 👇

#PySpark #DataEngineering #BigData #Spark #ETL #DataEngineer #SQL #AnalyticsEngineering #LearnInPublic #TechInterview #CareerGrowth #SchemaEvolution #DataPipeline #LinkedInLearning
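For plain parquet, the automatic NULL-filling claimed above comes from the mergeSchema option at read time; a minimal sketch completing the example, using the same paths as the dataset code:

# Read both days together; mergeSchema unions the two schemas,
# and Day-1 rows get NULL in the new `bonus` column
df_all = (
    spark.read
         .option("mergeSchema", "true")
         .parquet("/tmp/emp/day1", "/tmp/emp/day30")
)
df_all.printSchema()
df_all.show()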
Most people are comfortable writing SQL. But in interviews, the real challenge is:
👉 "Can you implement the same logic in PySpark?"

Let's try one 👇

Month-over-Month Revenue Growth (PySpark)

You are given a transactions dataset. Your task is to calculate the month-over-month percentage change in revenue.

Output should have:
year_month (YYYY-MM)
total_revenue
percentage_change (rounded to 2 decimal places)

Formula:
((current_month_revenue - previous_month_revenue) / previous_month_revenue) * 100

Note: percentage change should start from the second month.

---------------------------------------------------------------
Dataset:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MoM Revenue").getOrCreate()

data = [
    (1, '2019-01-01 00:00:00', 172692, 43),
    (2, '2019-01-05 00:00:00', 177194, 36),
    (3, '2019-01-09 00:00:00', 109513, 30),
    (4, '2019-01-13 00:00:00', 164911, 30),
    (5, '2019-01-17 00:00:00', 198872, 39),
    (6, '2019-01-21 00:00:00', 184853, 31),
    (7, '2019-01-25 00:00:00', 186817, 26),
    (8, '2019-01-29 00:00:00', 137784, 22),
    (9, '2019-02-02 00:00:00', 140032, 25),
    (10, '2019-02-06 00:00:00', 116948, 43),
    (11, '2019-02-10 00:00:00', 162515, 25),
    (12, '2019-02-14 00:00:00', 114256, 12),
    (13, '2019-02-18 00:00:00', 197465, 48),
    (14, '2019-02-22 00:00:00', 120741, 20),
    (15, '2019-02-26 00:00:00', 100074, 49),
    (16, '2019-03-02 00:00:00', 157548, 19),
    (17, '2019-03-06 00:00:00', 105506, 16),
    (18, '2019-03-10 00:00:00', 189351, 46),
    (19, '2019-03-14 00:00:00', 191231, 29),
    (20, '2019-03-18 00:00:00', 120575, 44),
    (21, '2019-03-22 00:00:00', 151688, 47),
    (22, '2019-03-26 00:00:00', 102327, 18),
    (23, '2019-03-30 00:00:00', 156147, 25)
]
columns = ["id", "created_at", "value", "purchase_id"]
df = spark.createDataFrame(data, columns)

---------------------------------------------------------------

Try solving this in:
1. SQL
2. PySpark

If you are preparing for Data Engineering interviews, this is a very common pattern:
* date handling
* aggregation
* window functions

Bonus: how would you optimize this for large data?

Drop your approach in the comments 🙂 (one possible sketch follows below)

Follow Srinivasan E for more practical Data Engineering and PySpark interview content

#PySpark #DataEngineering #GoogleInterview #MetaInterview #ApacheSpark #BigData #InterviewPrep #WindowFunctions #StreamingData
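One possible PySpark sketch: aggregate to months first, then apply a lag window (one valid approach among several, reusing the df defined above):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Aggregate transactions to monthly revenue
monthly = (
    df.withColumn("year_month", F.date_format(F.to_timestamp("created_at"), "yyyy-MM"))
      .groupBy("year_month")
      .agg(F.sum("value").alias("total_revenue"))
)

# A single-partition window is fine here: we aggregated first,
# so only one row per month flows through it
w = Window.orderBy("year_month")
result = (
    monthly.withColumn("prev_revenue", F.lag("total_revenue").over(w))
           .withColumn(
               "percentage_change",
               F.round((F.col("total_revenue") - F.col("prev_revenue"))
                       / F.col("prev_revenue") * 100, 2)
           )
           .drop("prev_revenue")
           .orderBy("year_month")
)
result.show()

The first month's percentage_change comes out NULL (lag has nothing to look back to), which satisfies the "starts from the second month" requirement. That aggregate-then-window order is also one answer to the bonus: shrink the data with groupBy before the unpartitioned window, so the single-partition sort touches one row per month rather than every raw transaction.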