🐍 Day 6/30 — Python for Data Engineers

Error Handling. What separates scripts from production pipelines.

I've seen pipelines crash in production because of one missing key in a JSON payload. No error handling. No logging. Just a silent failure at 2 AM.

Here's what I learned the hard way 👇

The full try/except structure most people don't use:

try:
    run_query(conn)
except ConnectionError as e:
    log.error(f"DB failed: {e}")
else:
    commit(conn)    # ← only runs if NO error
finally:
    conn.close()    # ← ALWAYS runs

Most engineers only write try/except. The else and finally blocks are gold.

And the pattern that saved me the most — dead-letter queues:

for row in records:
    try:
        validate(row)
        passed.append(row)
    except ValidationError:
        failed.append(row)  # quarantine bad rows

Don't crash the whole pipeline over one bad row. Isolate it.

Today's cheat sheet covers:
→ Full try/except/else/finally anatomy
→ 12 common built-in exceptions
→ Multiple except, raise, re-raise, chaining
→ Custom exceptions (production standard)
→ Context managers with with
→ Dead-letter queue · retry backoff · traceback logging

📌 Save the cheat sheet above.

Day 7 tomorrow: File I/O & CSV / JSON 📂

What's your go-to error handling pattern in pipelines? 👇

#Python #DataEngineering #30DaysOfPython #LearnPython #DataEngineer #DataAnalyst #Data #Software
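For the retry backoff and dead-letter items on that list, here is a minimal sketch of how the two can combine. The names validate, load_row, and ValidationError are assumptions for illustration, not the author's cheat-sheet code.

import logging
import time

log = logging.getLogger("pipeline")

class ValidationError(Exception):
    """Assumed custom exception raised when a row fails validation."""

def retry(func, *args, attempts=3, base_delay=1.0):
    """Call func, retrying with exponential backoff on transient errors."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except ConnectionError as e:
            if attempt == attempts:
                raise                      # out of attempts: re-raise the last error
            wait = base_delay * 2 ** (attempt - 1)
            log.warning("Attempt %d failed (%s), retrying in %.1fs", attempt, e, wait)
            time.sleep(wait)

def process(records, validate, load_row):
    """Dead-letter pattern: quarantine bad rows instead of crashing the run."""
    passed, failed = [], []
    for row in records:
        try:
            validate(row)                  # may raise ValidationError
            retry(load_row, row)           # transient load errors get retried
            passed.append(row)
        except ValidationError:
            failed.append(row)             # dead-letter queue for later inspection
    return passed, failed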
Jaswanth Thathireddy’s Post
More Relevant Posts
🐍 Day 3/30 — Python for Data Engineers

Dictionaries & Sets. The tools that make pipelines fast.

Every Data Engineer works with dicts daily — whether parsing API responses, defining schemas, or managing configs. But here's the one that most beginners miss 👇

Sets are basically SQL operations:
A & B → INNER JOIN (intersection)
A | B → FULL OUTER JOIN (union)
A - B → LEFT ANTI JOIN (difference)
A ^ B → schema drift detector 🚨

That last one is genuinely useful in production:

new_cols = incoming_cols - expected_cols
# → {"total"}  ← column you didn't expect. Alert!

And remember: dict/set lookup is O(1) — hash table under the hood. List lookup is O(n) — it scans every element. On 10M rows, that difference is seconds vs milliseconds.

📌 Full cheat sheet in the image — methods, comprehensions, real DE patterns.

Day 4 tomorrow: Functions & Lambda 🔧

What's your most-used dict method? .get() or .items()? Drop it below 👇

#Python #DataEngineering #30DaysOfPython #LearnPython #DataEngineer #SQL
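A minimal sketch of that schema-drift check in both directions. The column names and the alert action are assumptions for illustration only.

expected_cols = {"id", "customer", "amount"}
incoming_cols = {"id", "customer", "amount", "total"}

unexpected = incoming_cols - expected_cols   # columns you didn't plan for
missing = expected_cols - incoming_cols      # columns that disappeared
drift = incoming_cols ^ expected_cols        # both directions at once

if drift:
    print(f"Schema drift detected: unexpected={unexpected}, missing={missing}")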
🚀 Day 6/20 — Python for Data Engineering

Reading & Writing CSV / JSON (Deep Dive)

Now that we know basic file handling, let's go one step deeper into real data formats.

👉 In data engineering, most data comes as:
CSV (structured)
JSON (semi-structured)

🔹 Working with CSV (Structured Data)

import pandas as pd
df = pd.read_csv("data.csv")
print(df.head())

👉 Used when data is in rows & columns (tables)

🔹 Working with JSON (Semi-Structured)

import json
with open("data.json") as f:
    data = json.load(f)
print(data)

👉 Common in APIs and nested data

🔹 Writing Data Back

df.to_csv("output.csv", index=False)

👉 Save cleaned or transformed data

🔹 Real-World Flow
👉 CSV / JSON → Python → Process → Output file

🔹 Why This Matters
Data ingestion pipelines
API data handling
Data transformation workflows
Exporting processed data

💡 Quick Summary
CSV = structured data
JSON = flexible data
Python helps you handle both easily.

💡 Something to remember
Data engineers don't just read data…
They shape it for the next system.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
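A small end-to-end sketch of that JSON → process → CSV flow. The nested "orders" payload and the output file name are assumptions; in a real pipeline the payload would come from json.load(f) or an API response.

import pandas as pd

payload = {"orders": [
    {"id": 1, "amount": 120.5, "customer": {"name": "Asha", "country": "IN"}},
    {"id": 2, "amount": 80.0,  "customer": {"name": "Ben",  "country": "US"}},
]}

df = pd.json_normalize(payload["orders"])   # nested keys become customer.name, customer.country
df.to_csv("output.csv", index=False)        # structured output for the next system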
Python Chaos to dbt Clarity: Why I Upgraded My Data Pipeline Architecture

We've all been there. A "simple" Python script that starts with extracting data, and ends up being a 1,000-line monster handling cleaning, joining, testing, and documentation. It works... until it doesn't.

In my latest project, "SME-Modern-Sales-DWH," I decided to move away from the Monolithic ETL approach (Level 1) to a Modern ELT framework (Level 2).

The Shift: Decoupling the Logic 🏗️

Instead of forcing Python to do everything, I redistributed the workload to where it belongs:
🔹 Python (The Mover): Now only handles Extract & Load. It moves raw data from CSVs to the Bronze layer. Simple, fast, and easy to maintain.
🔹 dbt-core (The Brain): Once the data is in SQL Server, dbt takes over for the Transformations.

Why this is a game-changer for SMEs:
1. Automated Testing: I implemented 47 data quality tests. If the data isn't right, the build fails. No more "guessing" if the report is accurate.
2. Modular Modeling: Using Staging, Intermediate, and Marts layers. It's built like LEGO—modular and scalable.
3. Documentation on Autopilot: dbt docs now provide a full lineage of the data, making the system transparent for everyone.
4. Surrogate Keys & Hashing: Used MD5 hashing to merge CRM and ERP data seamlessly.

The Result? A reliable "Single Source of Truth" that turns fragmented data into actionable sales insights. No more "nuclear explosions" in the codebase! 💥✅

Check out the full architecture and code on GitHub: https://lnkd.in/d-BB9b9R

#DataEngineering #dbt #Python #ModernDataStack #DataAnalytics #SQL #ELT #SME
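Not the author's dbt code, but a minimal Python sketch of the MD5 surrogate-key idea in point 4: hash the concatenated business columns so records from different systems can join on one deterministic key (dbt_utils offers a generate_surrogate_key macro for the same idea in SQL). The column values below are made up.

import hashlib

def surrogate_key(*values) -> str:
    """Deterministic key built by hashing the concatenated business columns."""
    raw = "|".join("" if v is None else str(v) for v in values)
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

crm_key = surrogate_key("CRM", "cust-1042")   # same inputs always produce the same key
erp_key = surrogate_key("ERP", "1042")        # so CRM and ERP rows can merge on one column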
🚀 Day 17/20 — Python for Data Engineering

Building a Simple Data Pipeline

So far, we've learned:
reading data
transforming data
working with APIs

Now it's time to connect everything together.
👉 That's called a data pipeline

🔹 What is a Data Pipeline?
A pipeline is a sequence of steps:
👉 Ingest → Process → Store

🔹 Simple Example

import pandas as pd
import requests

# Step 1: Fetch data
response = requests.get("https://lnkd.in/gTtgvXhZ")
data = response.json()

# Step 2: Convert to DataFrame
df = pd.DataFrame(data)

# Step 3: Transform
df["salary"] = df["salary"] * 1.1

# Step 4: Store
df.to_csv("output.csv", index=False)

🔹 Pipeline Flow
👉 API → Python → Transform → Output

🔹 Why This Matters
Automates data flow
Reduces manual work
Scalable processing
Foundation of data engineering

🔹 Real-World Use
ETL pipelines
Data ingestion systems
Batch processing jobs

💡 Quick Summary
A pipeline connects all steps into one flow.

💡 Something to remember
Individual steps are code…
Connected steps become a system.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
My load test pipeline spent 4 minutes generating 1M rows of test data. The system under test ran in 38 seconds.

I wasn't benchmarking our system. I was benchmarking Faker.

So I replaced the Python generator with a Rust binary. Now it does 1M rows in under 2 seconds. ~1.47M rows/sec on the hot path. ~400K rows/sec streaming to Kafka, network I/O included.

But the speedup wasn't really about Rust. It was three decisions made before writing any generation code.

The hot path isn't what people think. It isn't random generation, it's field lookup, memory allocation, and string handling. So instead of HashMap<String, Value>, I used Vec<Option<DataValue>> with a precomputed field index. No hash lookups. No string comparisons per row. No per-field allocations. At 1M rows × N fields, that difference is everything.

The generator doesn't know where data goes. Kafka, Parquet, JSON, S3 — none of those exist in the core engine. Everything sits behind port traits: StreamingSinkPort, DataExporterPort, ObjectStoragePort. Adding Postgres or Snowflake later means implementing a trait. Zero changes to generation.

Configuration is data, not code. Schemas are YAML, versioned in Git, reviewed like code, executable in CI. The system is driven by config, not by branching logic. Data engineers who don't write Rust still own the pipelines.

Clean Architecture, enforced by the compiler. The core crate has zero infrastructure dependencies. If it's not in Cargo.toml, it's impossible to import. Not convention. Physics.

The pattern I keep coming back to:
- You don't optimize your way out of the wrong data model.
- You don't refactor your way out of tight coupling.
- You don't scale your way out of architectural leakage.

Most systems don't degrade because they're slow. They degrade because they become impossible to change safely.

Question for the senior folks: what's a design decision you've seen lock a system in place years later?

Repository: https://lnkd.in/dzSAYBeF
Medium Article: https://lnkd.in/dGqvPYtz

#DataEngineering #Rust #SoftwareArchitecture #Performance #SyntheticData
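The author's implementation is Rust; purely to illustrate the "precomputed field index" idea (resolve generators to positions once, then do only positional work per row), here is a language-agnostic sketch in Python. The schema fields and generator functions are hypothetical.

import random

def make_id(): return random.randint(1, 10**6)
def make_name(): return "user-" + str(random.randint(1, 999))
def make_amount(): return round(random.uniform(1, 500), 2)

schema = ["id", "name", "amount"]               # hypothetical fields
generators = [make_id, make_name, make_amount]  # resolved once, in schema order

def generate_row():
    row = [None] * len(schema)
    for i, gen in enumerate(generators):        # hot path: positional access only,
        row[i] = gen()                          # no field-name lookups per row
    return row

rows = [generate_row() for _ in range(1000)]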
🚀 Day 15/20 — Python for Data Engineering

Handling Missing Data (Pandas)

In real-world data…
👉 Missing values are everywhere
👉 Ignoring them = wrong results

So handling missing data is not optional

🔹 What is Missing Data?
Data that is: empty, null, NaN

🔹 Detect Missing Values

df.isnull()
👉 Shows missing values

df.isnull().sum()
👉 Count missing values per column

🔹 Drop Missing Values

df.dropna()
👉 Removes rows with missing data

🔹 Fill Missing Values

df.fillna(0)
👉 Replace with default value

df["salary"] = df["salary"].fillna(df["salary"].mean())
👉 Replace with meaningful value

🔹 Why This Matters
Avoid incorrect analysis
Improve data quality
Make pipelines reliable

🔹 Real-World Flow
👉 Raw Data → Missing Values → Clean → Analysis

💡 Quick Summary
Missing data must be handled before using data.

💡 Something to remember
Bad data doesn't break loudly…
It silently gives wrong results.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
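A tiny sketch tying those steps together on made-up data: count the gaps, fill the numeric column with a meaningful value, and drop rows where the categorical column can't be repaired. Column names are assumptions.

import pandas as pd

df = pd.DataFrame({"salary": [50000, None, 61000], "dept": ["eng", "eng", None]})

print(df.isnull().sum())                                  # how bad is it, per column?
df["salary"] = df["salary"].fillna(df["salary"].mean())   # numeric: fill with the mean
df = df.dropna(subset=["dept"])                           # categorical: drop unrecoverable rows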
🐍 Day 4/30 — Python for Data Engineers

Functions. The building blocks of every pipeline.

Every Airflow DAG task, every dbt macro, every ETL step — they're all just functions under the hood. Here's what separates beginner Python from production-grade DE code 👇

3 things I use in every pipeline:

1. Type hints — make your code self-documenting
def extract(table: str) -> list:

2. **kwargs — flexible config without breaking the signature
def load_data(table, schema="public", **opts):
load_data("orders", limit=1000, dry_run=True)

3. Lambda with sorted() — one of the most used patterns
sorted(jobs, key=lambda j: j["priority"])

And if you use Airflow, you already use decorators daily:

@task
def run_dbt_model(model: str):
    ...

That @task is just a decorator — a function that wraps your function.

Today's cheat sheet covers:
→ Function anatomy with type hints
→ All 4 parameter types (positional, default, *args, **kwargs)
→ Lambda syntax + real examples
→ map(), filter(), reduce()
→ LEGB scope rule
→ Decorators
→ Real ETL pipeline patterns

📌 Full cheat sheet above — save it.

Day 5 tomorrow: Conditionals & Loops 🔁

What's your go-to function pattern in pipelines? Drop it below 👇

#Python #DataEngineering #30DaysOfPython #Airflow #LearnPython #DataEngineer
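A minimal sketch of what "a function that wraps your function" means in practice. The timing behaviour is just an example; it is not Airflow's @task implementation.

import functools
import time

def log_duration(func):
    @functools.wraps(func)                       # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.2f}s")
        return result
    return wrapper

@log_duration
def run_dbt_model(model: str) -> None:
    time.sleep(0.1)                              # stand-in for real work

run_dbt_model("orders")                          # prints e.g. "run_dbt_model took 0.10s"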
🐍 Day 7/30 — Python for Data Engineers

File I/O, CSV & JSON. The bread and butter of every ingestion pipeline.

Before you touch pandas or Spark — you need to know how Python handles raw files. Because in real pipelines, you'll deal with:
→ CSVs dropped by vendors in S3
→ JSON payloads from REST APIs
→ JSONL files in your data lake raw layer
→ Config files that drive your pipeline logic

The #1 mistake I see beginners make:

# ❌ Wrong — file never closes if an error occurs
f = open("data.csv", "r")
data = f.read()

# ✅ Right — auto-closes even on exceptions
with open("data.csv", "r") as f:
    data = f.read()

And the thing that confused me for weeks:

json.load(f)    # reads from a FILE object
json.loads(s)   # parses a STRING
json.dump(d, f) # writes to a FILE
json.dumps(d)   # returns a STRING

The "s" = string. Once you know that, it sticks forever.

For data lake files, JSONL is king:

# One JSON object per line — memory efficient
with open("events.jsonl") as f:
    events = [json.loads(line) for line in f if line.strip()]

Today's cheat sheet covers:
→ open() with context managers
→ All 6 file modes explained
→ Key file methods (with memory warnings)
→ csv.DictReader / DictWriter
→ Common CSV gotchas (encoding, newline, delimiter)
→ json.load / loads / dump / dumps
→ JSONL pattern + CSV → JSON transform

📌 Every section has a plain-English explanation — save it.

Day 8 tomorrow: OS & Pathlib — Navigate the Filesystem Like a Pro 📁

Which format do you deal with most in your pipelines — CSV or JSON? 👇

#Python #DataEngineering #30DaysOfPython #LearnPython #DataEngineer #ETL #DataAnalyst #DataAnalysis #Data #PythonDev
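A minimal sketch of the CSV → JSON transform mentioned on that list, using csv.DictReader and json.dump. The file names and the "amount" column are assumptions.

import csv
import json

with open("data.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))            # each row becomes a dict keyed by the header

for row in rows:
    row["amount"] = float(row["amount"])      # DictReader gives strings; cast what you need

with open("data.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)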
Most data pipelines overwrite records. When something changes, the old version is gone.

I wanted to build something that preserves history so you can actually ask: "what did this repo look like 3 months ago?" and get a reliable answer.

So I built a GitHub trend tracker using Python, Postgres, and dbt.
- Pulls repositories across multiple queries (data engineering, LLMs, Airflow, dbt, machine learning)

How it works:
- Python handles ingestion (rate limiting, deduplication, controlled extraction across queries)
- Data lands in a Postgres staging layer first (ELT pattern, raw data is loaded before transformations)
- A fingerprint of key attributes detects meaningful changes without overwriting records
- A Slowly Changing Dimension Type 2 pattern versions every change (old record is closed, new one is opened)
- Set-based SQL handles the merge logic efficiently instead of row-by-row updates
- dbt is being layered in to structure transformations, manage dependencies, and move toward snapshot-based modeling

Still evolving, but the core pipeline is working: raw API data flowing into a clean, versioned dataset.

Building in iterations… more updates as it develops.
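Not the author's code, but a minimal sketch of the fingerprint idea: hash the attributes that matter, so a changed record can be detected without comparing every column, then let the SCD Type 2 logic close the old row and open a new one. The attribute names and the choice of SHA-256 are assumptions.

import hashlib

def fingerprint(repo: dict) -> str:
    tracked = (repo["full_name"], repo["stars"], repo["description"], repo["topics"])
    raw = "|".join(str(v) for v in tracked)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def has_changed(incoming: dict, current_fingerprint: str) -> bool:
    # If True: close the current record (set valid_to) and insert a new versioned row.
    return fingerprint(incoming) != current_fingerprint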
Data Engineering starts with robust Data Ingestion. 🕸️

If you are a data analyst relying on pre-packaged Kaggle datasets, you are missing out on the most valuable data available: the live web. However, writing web scrapers from scratch for every project is incredibly frustrating—between handling messy HTML, managing rate limits, and formatting the output, it's a massive time sink.

I hate manual data entry, so I built a production-ready Python scraping script to automate the collection process. Instead of fighting with boilerplate code, this script handles the heavy lifting and directly exports clean, structured data into CSV or JSON formats, ready to be ingested into a database or analyzed in Pandas.

#Python #DataEngineering #WebScraping #DataAnalytics #Automation
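Not the author's script — a minimal sketch of the scrape-and-export flow it describes, using requests and BeautifulSoup. The URL, CSS selectors, and field names are placeholders.

import csv
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/listings", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
rows = [
    {"title": card.select_one("h2").get_text(strip=True),
     "price": card.select_one(".price").get_text(strip=True)}
    for card in soup.select(".listing")
]

with open("listings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(rows)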