🐍 Day 5/30 — Python for Data Engineers: Conditionals & Loops. How pipelines make decisions.

Every pipeline does two things constantly:
1. Makes decisions → skip bad rows, branch on job status, alert on failure
2. Iterates → loop over files, tables, API pages, batches

Today's cheat sheet covers both — and a few patterns I use in production every day.

The one most engineers miss 👇

for...else — the else block runs only if the loop completed without a break:

for stage in pipeline:
    if stage.failed:
        break
else:
    notify("All stages passed ✅")

And the chunked insert pattern — essential for large loads:

for i in range(0, len(rows), 1000):
    db_insert(rows[i : i + 1000])

Sending 1M rows in one shot can crash or stall your DB. Send them in chunks of 1000. Always.

Today's sheet covers:
→ if / elif / else
→ Ternary + walrus operator :=
→ match/case (Python 3.10+)
→ for loops with enumerate, zip, break, continue
→ while loop + retry with backoff
→ All 3 comprehension types
→ 4 real DE pipeline patterns

📌 Save the cheat sheet above. Day 6 tomorrow: Error Handling & Exceptions 🛡️

Which loop pattern do you use most in your pipelines? 👇

#Python #DataEngineering #DataEngineer #LearnPython #BigData #ETL #Coding #TechCommunity #SoftwareEngineering #BackendDevelopment #CloudComputing #AWS #OpenToWork #JobsInFrance #TechJobsFrance
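The while-loop + retry with backoff from the sheet deserves its own snippet. A minimal sketch — the fetch_with_retry name, attempt count, and delay values are illustrative, not from the cheat sheet:

import time

def fetch_with_retry(fetch, max_attempts=5, base_delay=1.0):
    # Retry a flaky call (API page, DB query) with exponential backoff.
    attempt = 0
    while attempt < max_attempts:
        try:
            return fetch()
        except Exception as exc:
            attempt += 1
            if attempt == max_attempts:
                raise  # out of attempts — let the failure surface
            delay = base_delay * (2 ** (attempt - 1))
            print(f"Attempt {attempt} failed ({exc}), retrying in {delay:.0f}s")
            time.sleep(delay)

Usage would look like fetch_with_retry(lambda: requests.get(url)), swapping in whatever call your pipeline actually makes.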
🚀 Day 9/20 — Python for Data Engineering: Working with Large Files (Memory Optimization)

By now, we know how to read, write, and transform data. But in real-world scenarios…
👉 Data is not small
👉 Files can be GBs in size

If we try to load everything at once → ❌ crash / slow performance

🔹 The Problem

df = pd.read_csv("large_file.csv")

👉 Loads the entire file into memory
👉 Not scalable

🔹 Solution: Read in Chunks

import pandas as pd

for chunk in pd.read_csv("large_file.csv", chunksize=1000):
    process(chunk)

👉 Processes data piece by piece
👉 Memory efficient
👉 Scalable

🔹 Another Approach: Line-by-Line

with open("large_file.txt") as f:
    for line in f:
        process(line)

👉 Useful for logs and streaming data

🔹 Why This Matters
Prevent memory issues
Handle large datasets smoothly
Build scalable pipelines

🔹 Where You'll Use This
Log processing
Batch pipelines
Streaming systems
ETL workflows

💡 Quick Summary
Don't load everything at once. Process data in parts.

💡 Something to remember
Efficient data handling is not about power… it's about smart processing.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
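One way to combine the chunked read with real work is to aggregate as you go. A small sketch — the file name, chunk size, and the "amount" column are assumptions for illustration:

import pandas as pd

total = 0
for chunk in pd.read_csv("large_file.csv", chunksize=100_000):
    # each chunk is a regular DataFrame, so normal pandas operations apply
    total += chunk["amount"].sum()

print(f"Grand total: {total}")

The whole file never sits in memory — only one chunk of 100k rows at a time.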
Pick one. You can only use it for the rest of your career: SQL or Python?

I'll go first: SQL.

Not because it's better. Because at every company I've walked into — from startups to enterprise — the first thing anyone asks is "can you write a query?"

But here's the thing most people miss: SQL isn't just a query language. It's the language of data architecture.

Every table you design, every join you write, every view that powers a dashboard — you're making architectural decisions. You're defining how data lives, moves, and gets consumed.

Python opens doors. SQL keeps you in the room.

Data architects think in systems. SQL is how you speak that language fluently.

#DataEngineering #SQL #Python #DataArchitecture #TechCareer
Code is the easy part. Requirements are the hard part. 🧠

I've seen $1M data projects fail not because the #Python script broke, but because the "Business Logic" wasn't actually what the business needed.

As Data Engineers, our job is 40% building and 60% translating.

A "Real-time" requirement usually only needs 15-minute latency.
A "Single Source of Truth" usually just needs a better Data Catalog.
"AI" usually just needs a well-cleaned SQL table.

The best engineers I know are the ones who ask "Why?" three times before they write a single line of #Spark code.

#DataEngineering #SoftwareEngineering #TechStrategy #DataIntegration
🐍 Day 1/30 — Python for Data Engineers. Starting from scratch. No fluff.

Before you build Airflow DAGs, dbt models, or Spark pipelines — you need to speak Python.

Today's foundation:
→ Variables & Assignment
→ 8 Core Data Types
→ Type Conversion
→ Arithmetic, Comparison & Logical Operators
→ Strings (the most used type in pipelines)
→ Truthy/Falsy + None
→ Naming conventions that actually matter

One thing I wish I knew earlier: x is None ✅ — not x == None ❌ (quick example below 👇)

📌 Save the full cheat sheet below — bookmark it.

This is Day 1 of my #30DaysOfPython series. I'm documenting everything I know as a Data Engineer in 30 posts.

Follow Jaswanth Thathireddy if you're learning Python for Data Engineering 👇

#Python #DataEngineering #30DaysOfPython #DataEngineer #LearnPython #SQL #DataAnalyst #Software #Dev #Development #IT #Learning #Students
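On the is None point, a tiny example of why identity beats equality here — the pandas remark is just one illustration of how == can behave unexpectedly:

x = None

if x is None:        # identity check: "is x literally the None object?"
    print("missing value")

# x == None usually gives the same answer, but == can be overridden.
# e.g. comparing a pandas Series to None returns an element-wise result,
# not a single True/False — so `is None` is the safer habit.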
Ever been stuck with unstructured data in Excel sheets or spreadsheets and needed to push that messy data into a structured database? 🤯

Recently, I faced a similar challenge: a large spreadsheet filled with inconsistent, unstructured data that needed to be transformed into multiple clean tables. Doing it manually would've been time-consuming and error-prone.

Here comes Python 🐍

Instead of struggling with manual cleanup, I built a small data pipeline using Python to automate the entire process — from parsing and structuring the data to inserting it directly into a PostgreSQL (Supabase) database.

What could've taken hours was reduced to minutes, with better accuracy and scalability.

As software engineers, knowing the right tool can turn a messy problem into an elegant solution.

#Python #DataEngineering #Automation #PostgreSQL #Supabase #SoftwareEngineering
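For anyone curious what that kind of pipeline can look like, here's a rough sketch. The file name, header cleanup, table name, and connection string are all placeholders — not the actual project code:

import pandas as pd
from sqlalchemy import create_engine

# Read the messy spreadsheet (placeholder file name)
df = pd.read_excel("messy_sheet.xlsx")

# Normalise headers and drop fully empty rows
df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]
df = df.dropna(how="all")

# Push into Postgres — replace the connection string with your own Supabase credentials
engine = create_engine("postgresql://user:password@host:5432/postgres")
df.to_sql("clean_records", engine, if_exists="append", index=False)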
🚀 Day 5/20 — Python for Data Engineering: Error Handling (try / except)

When working with real-world data, things don't always go as expected.
👉 Files may be missing
👉 Data may be corrupted
👉 APIs may fail

If your code crashes every time something goes wrong, that's not data engineering.

🔹 What is Error Handling?
Error handling allows your program to:
👉 handle unexpected situations
👉 continue running without crashing

🔹 Basic Syntax

try:
    ...  # code that might fail
except:
    ...  # code to handle the error

🔹 Example

import pandas as pd

try:
    df = pd.read_csv("data.csv")
    print(df.head())
except:
    print("File not found")

👉 If the file is missing, your program won't crash

🔹 Handling Specific Errors (Better Practice)

try:
    value = int("abc")
except ValueError:
    print("Invalid number")

👉 More precise and professional

🔹 Why This Matters in Data Engineering
Prevent pipeline failures
Handle bad data gracefully
Improve reliability
Build production-ready systems

💡 Quick Summary
Error handling makes your code:
safer
more stable
production-ready

💡 Something to remember
Good engineers don't just write code that works…
They write code that doesn't break.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
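Building on the "specific errors" idea, a pipeline-flavoured sketch of what that looks like with logging — the function and file names here are made up for illustration:

import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)

def load_file(path):
    try:
        df = pd.read_csv(path)
    except FileNotFoundError:
        logging.warning("Skipping missing file: %s", path)
        return None
    except pd.errors.ParserError:
        logging.error("Corrupted file: %s", path)
        return None
    else:
        # runs only when no exception was raised
        logging.info("Loaded %d rows from %s", len(df), path)
        return df

Each failure mode gets its own branch, and the rest of the pipeline can decide what to do with a None result.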
🚀 Day 17/20 — Python for Data Engineering: Building a Simple Data Pipeline

So far, we've learned:
reading data
transforming data
working with APIs

Now it's time to connect everything together.
👉 That's called a data pipeline

🔹 What is a Data Pipeline?
A pipeline is a sequence of steps:
👉 Ingest → Process → Store

🔹 Simple Example

import pandas as pd
import requests

# Step 1: Fetch data
response = requests.get("https://lnkd.in/gTtgvXhZ")
data = response.json()

# Step 2: Convert to DataFrame
df = pd.DataFrame(data)

# Step 3: Transform
df["salary"] = df["salary"] * 1.1

# Step 4: Store
df.to_csv("output.csv", index=False)

🔹 Pipeline Flow
👉 API → Python → Transform → Output

🔹 Why This Matters
Automates data flow
Reduces manual work
Scalable processing
Foundation of data engineering

🔹 Real-World Use
ETL pipelines
Data ingestion systems
Batch processing jobs

💡 Quick Summary
A pipeline connects all steps into one flow.

💡 Something to remember
Individual steps are code…
Connected steps become a system.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
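The same flow reads even better once each step gets a name. A sketch of the function-per-step version — the URL is a placeholder, just like the shortened link in the example above:

import pandas as pd
import requests

def ingest(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()   # fail loudly on a bad API response
    return pd.DataFrame(response.json())

def transform(df):
    df["salary"] = df["salary"] * 1.1
    return df

def store(df, path):
    df.to_csv(path, index=False)

store(transform(ingest("https://api.example.com/employees")), "output.csv")

Splitting the steps makes each one testable and easy to swap out later.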
🚀 Python Generators – A Must-Know for Data Engineers & Developers

Ever worked with large datasets and faced memory issues? 🤯
👉 That's where Generators come into play!

✅ What are Generators?
👉 Generators are functions that use yield instead of return to produce values one at a time
✔️ Lazy evaluation
✔️ Memory efficient
✔️ Ideal for big data processing

🔍 Example

def my_generator():
    yield 1
    yield 2
    yield 3

gen = my_generator()
print(next(gen))  # 1
print(next(gen))  # 2

👉 The function pauses at each yield and resumes later

🔄 Generator vs Normal Function

🔹 Normal Function:

def normal():
    return [1, 2, 3]

🔹 Generator:

def gen():
    yield 1
    yield 2
    yield 3

👉 return → all at once
👉 yield → one by one

⚡ Generator Expression (Shortcut)

gen = (x*x for x in range(5))

🚀 Real-Time Use Case (Data Engineering)
👉 Processing large files:

def read_file(file):
    for line in file:
        yield line

✔️ Reads data line by line
✔️ Avoids memory overflow

🔥 Why Generators?
✔️ Saves memory
✔️ Improves performance
✔️ Perfect for streaming & ETL pipelines

💡 Interview One-Liner
👉 "Generators in Python use yield to produce values lazily, making them memory-efficient for large-scale data processing."

#Python #DataEngineering #Coding #ETL #BigData #InterviewPrep #LearnPython
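One generator pattern worth adding for ETL work: yielding fixed-size batches, so downstream inserts stay chunked. A sketch — batched and db_insert are illustrative names (on Python 3.12+, itertools.batched does something very similar):

def batched(iterable, size=1000):
    # Collect items into lists of up to `size`, yielding each full batch
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:          # don't lose the final partial batch
        yield batch

# Usage sketch, reusing read_file from above:
# for rows in batched(read_file(open("big.csv")), size=500):
#     db_insert(rows)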
Swipe through the slides first 👉 then read below 👇

🚀 Day 25 of 30 — Learning PySpark from Scratch

A pipeline that crashes with no error message is worse than a pipeline that doesn't run at all. 😬

Here's how I write robust PySpark pipelines now 👇

⚡ The 3-layer defence system
Layer 1 → try/except around each stage
Layer 2 → Python logging module (not print())
Layer 3 → Data quality assertions between stages

💻 The production pipeline template

import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s — %(levelname)s — %(message)s")
logger = logging.getLogger(__name__)

def run_pipeline():
    try:
        # Stage 1: Read
        df = spark.read.option("badRecordsPath", "output/bad/").csv("data.csv", header=True)
        logger.info(f"Read: {df.count()} rows")

        # Quality check — fail fast on bad data
        assert df.count() > 100, "Too few rows — data may be missing"

        # Stage 2: Clean
        df = df.dropna(subset=["revenue"])
        logger.info(f"After clean: {df.count()} rows")
    except Exception as e:
        logger.error(f"Pipeline failed at stage: {e}")
        raise

✅ 3 things I didn't know before today
→ badRecordsPath (a Databricks option) saves corrupt rows to a separate folder instead of crashing
→ print() has no timestamps or log levels — always use the Python logging module
→ Asserting row counts between stages catches silent data loss early

💡 My Day 25 takeaway
Anyone can write a pipeline that works on good data. A data engineer writes pipelines that handle bad data gracefully.

❓ Has a pipeline failure ever caused a wrong report to reach stakeholders? Drop it in the comments 👇

Follow me for Day 26 tomorrow → Testing PySpark code with pytest 🔔

#PySpark #DataEngineering #BigData #Python #LearnInPublic #30DaysOfPySpark
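A small add-on to layer 3: those inline asserts can be pulled into a reusable check. This is just a sketch of the idea, not from the original slides — the function name and thresholds are made up:

def check_quality(df, min_rows, required_cols, stage):
    # Fail fast with a clear message if columns are missing or rows vanished
    missing = [c for c in required_cols if c not in df.columns]
    assert not missing, f"{stage}: missing columns {missing}"
    n = df.count()
    assert n >= min_rows, f"{stage}: only {n} rows, expected at least {min_rows}"
    return df

# df = check_quality(df, min_rows=100, required_cols=["revenue"], stage="after clean")

Returning the DataFrame keeps it chainable between stages, and the stage label tells you exactly where the pipeline stopped.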
Over the past few days, I've been diving into PySpark and distributed data processing concepts.

Coming from a background in Python, SQL, and data-driven backend systems, it's been interesting to see how similar data transformations scale when working with large datasets. I've been exploring how Spark handles data processing across clusters and how it fits into real-world data pipelines.

Currently focusing on:
• Working with Spark DataFrames
• Performing transformations (filter, groupBy, joins)
• Understanding ETL workflows at scale

Still early in the learning process, but it's a valuable step toward building more scalable data solutions.

#PySpark #DataEngineering #BigData #Python #LearningJourney
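For anyone at the same stage, a tiny self-contained sketch of those three transformations — the toy data and column names are just for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("practice").getOrCreate()

orders = spark.createDataFrame(
    [(1, "A", 120.0), (2, "B", 80.0), (3, "A", 200.0)],
    ["order_id", "cust", "amount"],
)
customers = spark.createDataFrame([("A", "FR"), ("B", "DE")], ["cust", "country"])

result = (
    orders.filter(F.col("amount") > 100)            # filter
          .join(customers, on="cust", how="left")   # join
          .groupBy("country")                       # groupBy + aggregation
          .agg(F.sum("amount").alias("total_amount"))
)
result.show()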