🐍 Day 3/30 — Python for Data Engineers

Dictionaries & Sets. The tools that make pipelines fast.

Every Data Engineer works with dicts daily — parsing API responses, defining schemas, managing configs. But here's the part most beginners miss 👇

Set operators map neatly onto SQL joins:
A & B → INNER JOIN (intersection)
A | B → FULL OUTER JOIN (union)
A - B → LEFT ANTI JOIN (difference)
A ^ B → schema drift detector 🚨 (symmetric difference)

That last one is genuinely useful in production:

new_cols = incoming_cols - expected_cols
# → {"total"} ← a column you didn't expect. Alert!

And remember: dict/set lookup is O(1) — a hash table under the hood. List lookup is O(n) — it scans every element. On 10M rows, that's the difference between milliseconds and seconds.

📌 Full cheat sheet in the image — methods, comprehensions, real DE patterns.

Day 4 tomorrow: Functions & Lambda 🔧

What's your most-used dict method? .get() or .items()? Drop it below 👇

#Python #DataEngineering #30DaysOfPython #LearnPython #DataEngineer #SQL
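Here's a tiny runnable sketch of that drift check (the column names are made up for illustration):

# Minimal schema drift check using set operations (illustrative column names)
expected_cols = {"id", "name", "salary"}
incoming_cols = {"id", "name", "salary", "total"}

new_cols = incoming_cols - expected_cols        # columns we didn't expect
missing_cols = expected_cols - incoming_cols    # columns that disappeared
drift = incoming_cols ^ expected_cols           # any difference at all

if drift:
    print(f"Schema drift! new={new_cols}, missing={missing_cols}")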
🐍 Day 6/30 — Python for Data Engineers

Error Handling. What separates scripts from production pipelines.

I've seen pipelines crash in production because of one missing key in a JSON payload. No error handling. No logging. Just a silent failure at 2 AM.

Here's what I learned the hard way 👇

The full try/except structure most people don't use:

try:
    run_query(conn)
except ConnectionError as e:
    log.error(f"DB failed: {e}")
else:
    commit(conn)   # ← only runs if NO error
finally:
    conn.close()   # ← ALWAYS runs

Most engineers only write try/except. The else and finally blocks are gold.

And the pattern that saved me the most — dead-letter queues:

for row in records:
    try:
        validate(row)
        passed.append(row)
    except ValidationError:
        failed.append(row)   # quarantine bad rows

Don't crash the whole pipeline over one bad row. Isolate it.

Today's cheat sheet covers:
→ Full try/except/else/finally anatomy
→ 12 common built-in exceptions
→ Multiple except, raise, re-raise, chaining
→ Custom exceptions (production standard)
→ Context managers with with
→ Dead-letter queue · retry backoff · traceback logging

📌 Save the cheat sheet above.

Day 7 tomorrow: File I/O & CSV / JSON 📂

What's your go-to error handling pattern in pipelines? 👇

#Python #DataEngineering #30DaysOfPython #LearnPython #DataEngineer #DataAnalyst #Data #Software
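A self-contained sketch of the dead-letter pattern (the ValidationError class and sample records here are invented for illustration):

# Dead-letter queue sketch: quarantine bad rows instead of crashing
class ValidationError(Exception):
    pass

def validate(row):
    if "id" not in row:
        raise ValidationError(f"missing id: {row}")

records = [{"id": 1}, {"name": "no id here"}, {"id": 3}]
passed, failed = [], []

for row in records:
    try:
        validate(row)
        passed.append(row)
    except ValidationError:
        failed.append(row)   # write these to a dead-letter table/bucket later

print(f"passed={len(passed)}, quarantined={len(failed)}")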
🚀 Day 15/20 — Python for Data Engineering

Handling Missing Data (Pandas)

In real-world data…
👉 Missing values are everywhere
👉 Ignoring them = wrong results

So handling missing data is not optional.

🔹 What is Missing Data?
Data that is empty, null, or NaN.

🔹 Detect Missing Values
df.isnull() 👉 Shows missing values
df.isnull().sum() 👉 Counts missing values per column

🔹 Drop Missing Values
df.dropna() 👉 Removes rows with missing data

🔹 Fill Missing Values
df.fillna(0) 👉 Replace with a default value
df["salary"] = df["salary"].fillna(df["salary"].mean()) 👉 Replace with a meaningful value (assign back instead of inplace=True on a column selection; that pattern is deprecated in modern pandas)

🔹 Why This Matters
Avoid incorrect analysis
Improve data quality
Make pipelines reliable

🔹 Real-World Flow
👉 Raw Data → Missing Values → Clean → Analysis

💡 Quick Summary
Missing data must be handled before using data.

💡 Something to remember
Bad data doesn't break loudly…
It silently gives wrong results.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
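A minimal runnable version of the whole flow (toy data of my own, not from the post):

import pandas as pd
import numpy as np

# Tiny illustrative frame with one missing salary
df = pd.DataFrame({"name": ["a", "b", "c"], "salary": [50000, np.nan, 70000]})

print(df.isnull().sum())                                  # salary: 1 missing
df["salary"] = df["salary"].fillna(df["salary"].mean())   # fill with the column mean
print(df)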
There are two ways to traverse hierarchies in SQL. Only one scales 👇

Recursive CTEs and self-joins solve the same problem: navigating hierarchical data. But they behave very differently as the data grows.

Recursive CTEs let you define a single rule and let SQL iterate through the hierarchy until it reaches the end. No need to know the depth upfront. You also don’t need to keep adjusting the query every time the hierarchy changes, which makes it much more scalable in real-world systems.

With recursive CTEs, the query adapts to the data. With self-joins, the query is fixed to the structure you assumed.

For Python folks: think of recursive CTEs like a WHILE loop over a tree structure, with a termination condition to avoid infinite recursion.

Got other SQL topics you want explained like this? Comment them 👇

📌 Found it useful? Save it for later.

#SQLTips #DataAnalytics #DataScience #SQL #Analytics #BusinessIntelligence #DataEngineer #LearnSQL
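To make that WHILE-loop analogy concrete, here's a small Python sketch (the org-chart data is invented; the roots play the role of the anchor member, each loop pass the recursive member):

# A recursive CTE, expressed as a Python loop: walk a hierarchy level by level
# Illustrative data: employee id -> manager id, None = root
managers = {1: None, 2: 1, 3: 1, 4: 2, 5: 4}

frontier = [e for e, m in managers.items() if m is None]   # anchor: the roots
level = 0
while frontier:                                            # recursive step: find children
    print(f"level {level}: {frontier}")
    frontier = [e for e, m in managers.items() if m in frontier]
    level += 1                                             # terminates when no children remain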
🚀 Day 8/20 — Python for Data Engineering

Data Transformation Basics

After reading data, the next step is not storing it…
👉 It’s transforming it into usable form

Raw data is often:
messy
inconsistent
not analysis-ready

That’s where data transformation comes in.

🔹 What is Data Transformation?
Changing data into a cleaner, structured, and useful format.

🔹 Common Transformations

📌 Selecting Columns
df = df[["name", "salary"]]
👉 Keep only required data

📌 Filtering Rows
df = df[df["salary"] > 50000]
👉 Focus on relevant records

📌 Creating New Columns
df["bonus"] = df["salary"] * 0.1
👉 Add derived data

📌 Renaming Columns
df.rename(columns={"salary": "income"}, inplace=True)
👉 Improve readability

🔹 Why This Matters
Converts raw → usable data
Prepares data for analysis
Makes pipelines meaningful

🔹 Real-World Flow
👉 Raw Data → Clean → Transform → Store

💡 Quick Summary
Transformation is where data becomes valuable.

💡 Something to remember
Raw data is useless…
Until you transform it into something meaningful.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
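All four transformations together as one runnable sketch (toy data of my own):

import pandas as pd

# Illustrative raw data
df = pd.DataFrame({"name": ["a", "b", "c"], "salary": [40000, 60000, 90000], "noise": [1, 2, 3]})

df = df[["name", "salary"]]                    # select columns
df = df[df["salary"] > 50000]                  # filter rows
df["bonus"] = df["salary"] * 0.1               # derive a new column
df = df.rename(columns={"salary": "income"})   # rename for readability
print(df)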
A question I had when starting out: should I use Pandas or SQL for data transformation?

Here's how I now think about it:

Use SQL when:
→ Data lives in a database or warehouse
→ The dataset is large (millions of rows)
→ You need joins across multiple tables
→ You want the transformation to run server-side

Use Pandas when:
→ Data is in files (CSV, Excel, JSON)
→ You need complex Python logic
→ You're doing exploratory analysis
→ The dataset fits comfortably in memory

In data engineering, you'll use both. SQL for the heavy lifting, Pandas for the finishing touches.

What's your go-to for data transformation?

#Python #Pandas #SQL #DataEngineering
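A tiny illustration of the same aggregation both ways (an in-memory SQLite database stands in for a real warehouse here):

import sqlite3
import pandas as pd

df = pd.DataFrame({"dept": ["eng", "eng", "hr"], "salary": [90, 110, 70]})

# SQL: the engine does the work server-side
conn = sqlite3.connect(":memory:")
df.to_sql("employees", conn, index=False)
via_sql = pd.read_sql("SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept", conn)

# Pandas: the same transformation in local memory
via_pandas = df.groupby("dept", as_index=False)["salary"].mean()

print(via_sql)
print(via_pandas)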
Swipe through the slides first 👉 then read below 👇

🚀 Day 25 of 30 — Learning PySpark from Scratch

A pipeline that crashes with no error message is worse than a pipeline that doesn't run at all. 😬

Here's how I write robust PySpark pipelines now 👇

⚡ The 3-layer defence system
Layer 1 → try/except around each stage
Layer 2 → Python logging module (not print())
Layer 3 → Data quality assertions between stages

💻 The production pipeline template

import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s — %(levelname)s — %(message)s")
logger = logging.getLogger(__name__)

def run_pipeline():
    try:
        # Stage 1: Read
        df = spark.read.option("badRecordsPath", "output/bad/").csv("data.csv", header=True)
        logger.info(f"Read: {df.count()} rows")

        # Quality check — fail fast on bad data
        assert df.count() > 100, "Too few rows — data may be missing"

        # Stage 2: Clean
        df = df.dropna(subset=["revenue"])
        logger.info(f"After clean: {df.count()} rows")
    except Exception as e:
        logger.error(f"Pipeline failed at stage: {e}")
        raise

✅ 3 things I didn't know before today
→ badRecordsPath saves corrupt rows to a separate folder instead of crashing
→ print() has no timestamps or log levels — always use the Python logging module
→ Asserting row counts between stages catches silent data loss early

💡 My Day 25 takeaway
Anyone can write a pipeline that works on good data. A data engineer writes pipelines that handle bad data gracefully.

❓ Has a pipeline failure ever caused a wrong report to reach stakeholders? Drop it in the comments 👇

Follow me for Day 26 tomorrow → Testing PySpark code with pytest 🔔

#PySpark #DataEngineering #BigData #Python #LearnInPublic #30DaysOfPySpark
🚀 Stop scrolling and start shipping: PySpark DataFrame operations made simple.

PySpark DataFrame Cheat Sheet for Data Engineers (all snippets assume the usual imports):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

1. Handling nulls efficiently:
df.fillna({'col_name': 0}).dropna(subset=['id'])

2. Conditional logic with when/otherwise:
df.withColumn('status', F.when(F.col('val') > 100, 'high').otherwise('low'))

3. Aggregating with multiple metrics:
df.groupBy('category').agg(F.sum('sales'), F.avg('price'))

4. Window functions for row numbers:
win = Window.partitionBy('dept').orderBy(F.desc('salary'))
df.withColumn('rank', F.row_number().over(win))

5. String manipulation one-liners:
df.withColumn('clean_name', F.trim(F.upper(F.col('name'))))

6. Renaming columns in bulk:
df.select([F.col(c).alias(c.lower()) for c in df.columns])

I have used these snippets in our production pipelines to reduce boilerplate and keep our transformations readable. They save me hours of documentation digging every single week.

Save this for your next project!

What is the one PySpark function you find yourself typing out from memory every single day?

#PySpark #DataEngineering #BigData #Python #DataPipelines
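For context, a minimal end-to-end run of two of those snippets (the local SparkSession and toy data are my assumptions; pyspark must be installed):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local[1]").appName("cheatsheet-demo").getOrCreate()

df = spark.createDataFrame(
    [("eng", "ann", 120), ("eng", "bob", 100), ("hr", "cat", 90)],
    ["dept", "name", "salary"],
)

win = Window.partitionBy("dept").orderBy(F.desc("salary"))
(df.withColumn("rank", F.row_number().over(win))               # snippet 4: ranking
   .withColumn("clean_name", F.trim(F.upper(F.col("name"))))   # snippet 5: string cleanup
   .show())

spark.stop()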
🚀 Day 17/20 — Python for Data Engineering

Building a Simple Data Pipeline

So far, we’ve learned:
reading data
transforming data
working with APIs

Now it’s time to connect everything together.
👉 That’s called a data pipeline

🔹 What is a Data Pipeline?
A pipeline is a sequence of steps:
👉 Ingest → Process → Store

🔹 Simple Example

import pandas as pd
import requests

# Step 1: Fetch data
response = requests.get("https://lnkd.in/gTtgvXhZ")
data = response.json()

# Step 2: Convert to DataFrame
df = pd.DataFrame(data)

# Step 3: Transform
df["salary"] = df["salary"] * 1.1

# Step 4: Store
df.to_csv("output.csv", index=False)

🔹 Pipeline Flow
👉 API → Python → Transform → Output

🔹 Why This Matters
Automates data flow
Reduces manual work
Scalable processing
Foundation of data engineering

🔹 Real-World Use
ETL pipelines
Data ingestion systems
Batch processing jobs

💡 Quick Summary
A pipeline connects all steps into one flow.

💡 Something to remember
Individual steps are code…
Connected steps become a system.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
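One way to make that "system" explicit is to wrap each step in a function. A hedged sketch (the URL and the salary field are placeholders, not real endpoints):

import pandas as pd
import requests

def ingest(url):
    # placeholder URL; any JSON API returning a list of records works
    return requests.get(url, timeout=10).json()

def transform(records):
    df = pd.DataFrame(records)
    df["salary"] = df["salary"] * 1.1   # assumes a 'salary' field exists
    return df

def store(df, path):
    df.to_csv(path, index=False)

def run(url, path="output.csv"):
    store(transform(ingest(url)), path)   # ingest → process → store, one flow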
SQL vs PySpark vs Pandas cheat sheet

If you’re working in Data Engineering or switching between tools on the fly during projects/interviews, this can save you a lot of time.

📌 What’s included:
13 structured sections
70+ commonly used concepts
SELECT, JOINs, CTEs, Window Functions
Aggregations, Date & String operations, Pivot
Read/Write patterns + data quality checks

Everything is shown side by side across SQL, PySpark, and Pandas, so you don’t have to keep searching for syntax differences every time.

💡 The idea is simple — faster recall, fewer mistakes, and more confidence in interviews and real projects.

If you want the PDF, just drop a comment — I’ll share it for free.

Feel free to repost if it helps someone in your network 👍

#DataEngineering #SQL #PySpark #Pandas #Python #BigData #DataEngineer #InterviewPrep #CheatSheet
𝗦𝗽𝗮𝗿𝗸 𝗱𝗼𝗲𝘀𝗻’𝘁 𝗿𝘂𝗻 𝘆𝗼𝘂𝗿 𝗰𝗼𝗱𝗲. 𝗜𝘁 𝗯𝘂𝗶𝗹𝗱𝘀 𝗮 𝗽𝗹𝗮𝗻.

A lot of Spark confusion comes from thinking it executes “line by line” like a normal program. In reality, Spark mostly does this:

𝗬𝗼𝘂𝗿 𝗰𝗼𝗱𝗲 -> 𝗹𝗼𝗴𝗶𝗰𝗮𝗹 𝗽𝗹𝗮𝗻 -> 𝗼𝗽𝘁𝗶𝗺𝗶𝘇𝗲𝗱 𝗹𝗼𝗴𝗶𝗰𝗮𝗹 𝗽𝗹𝗮𝗻 -> 𝗽𝗵𝘆𝘀𝗶𝗰𝗮𝗹 𝗲𝘅𝗲𝗰𝘂𝘁𝗶𝗼𝗻 𝗽𝗹𝗮𝗻

So when you write PySpark or Spark SQL, Spark isn’t “running Python” or “running SQL”. It’s building a plan for a distributed engine to execute.

Here’s the simplified mental model I use:

𝟭) 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻𝘀 𝗯𝘂𝗶𝗹𝗱 𝘁𝗵𝗲 𝗽𝗹𝗮𝗻 (𝗹𝗮𝘇𝘆)
select, filter, join, groupBy... These don’t immediately run a job. They describe what should happen.

𝟮) 𝗔𝗰𝘁𝗶𝗼𝗻𝘀 𝘁𝗿𝗶𝗴𝗴𝗲𝗿 𝗲𝘅𝗲𝗰𝘂𝘁𝗶𝗼𝗻
count, show, collect, write… This is when Spark says: “ok, now I need to execute the plan”.

𝟯) 𝗧𝗵𝗲 𝗼𝗽𝘁𝗶𝗺𝗶𝘇𝗲𝗿 𝗿𝗲𝘄𝗿𝗶𝘁𝗲𝘀 𝘆𝗼𝘂𝗿 𝘄𝗼𝗿𝗸
Before running, Spark tries to make it cheaper:
• push filters earlier
• prune unused columns
• reorder operations
• pick join strategies

𝟰) 𝗧𝗵𝗲 𝗽𝗵𝘆𝘀𝗶𝗰𝗮𝗹 𝗽𝗹𝗮𝗻 𝗶𝘀 𝘄𝗵𝗮𝘁 𝗮𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝗿𝘂𝗻𝘀 𝗼𝗻 𝘁𝗵𝗲 𝗰𝗹𝘂𝘀𝘁𝗲𝗿
This is where you’ll see the real cost drivers:
• join strategy (broadcast vs shuffle)
• number of stages/tasks
• shuffles, scans, exchanges
• partitioning decisions

That’s why two bits of Spark code that look similar can behave completely differently.

𝗧𝗮𝗸𝗲𝗮𝘄𝗮𝘆: If you can read the plan, you can explain most performance issues without guessing.

Share your favourite Spark “aha” moment in the comments.

#Spark #PySpark #SparkSQL #DataEngineering #BigData #Databricks #PerformanceTuning #SQL
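To watch the plan being built, call explain() before any action fires. A small sketch (local session and toy data are assumptions of mine):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").appName("plan-demo").getOrCreate()

df = spark.createDataFrame([(1, "eng", 120), (2, "hr", 90)], ["id", "dept", "salary"])

# Transformations: nothing executes yet, Spark just extends the plan
high = df.filter(F.col("salary") > 100).select("dept", "salary")

high.explain(True)   # prints parsed / analyzed / optimized logical plans + physical plan
high.show()          # the action: only now does the plan actually run

spark.stop()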