🚀 Day 6/20 — Python for Data Engineering

Reading & Writing CSV / JSON (Deep Dive)

Now that we know basic file handling, let’s go one step deeper into real data formats.

👉 In data engineering, most data comes as:
- CSV (structured)
- JSON (semi-structured)

🔹 Working with CSV (Structured Data)

import pandas as pd

df = pd.read_csv("data.csv")
print(df.head())

👉 Used when data is in rows & columns (tables)

🔹 Working with JSON (Semi-Structured)

import json

with open("data.json") as f:
    data = json.load(f)

print(data)

👉 Common in APIs and nested data

🔹 Writing Data Back

df.to_csv("output.csv", index=False)

👉 Save cleaned or transformed data

🔹 Real-World Flow
👉 CSV / JSON → Python → Process → Output file

🔹 Why This Matters
- Data ingestion pipelines
- API data handling
- Data transformation workflows
- Exporting processed data

💡 Quick Summary
CSV = structured data. JSON = flexible data. Python helps you handle both easily.

💡 Something to remember
Data engineers don’t just read data… they shape it for the next system.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
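To tie the two snippets together, here is a minimal end-to-end sketch that reads a nested JSON file and flattens it into a DataFrame with pd.json_normalize before writing CSV. The file names, records, and field names are invented for illustration:

import json
import pandas as pd

# Assumed sample payload: user records with a nested address object
# (written out first so the sketch is self-contained)
records = [
    {"name": "Asha", "address": {"city": "Pune", "zip": "411001"}},
    {"name": "Ravi", "address": {"city": "Delhi", "zip": "110001"}},
]
with open("users.json", "w") as f:
    json.dump(records, f)

# Read the JSON back and flatten the nested object into columns
with open("users.json") as f:
    data = json.load(f)

df = pd.json_normalize(data)  # columns: name, address.city, address.zip
df.to_csv("users_flat.csv", index=False)

json_normalize turns nested objects into dotted columns, a common bridge from semi-structured JSON to tabular CSV.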
🚀 Day 7/20 — Python for Data Engineering

Writing / Exporting Data

Reading data is only half the job.

👉 In data engineering, we often:
- clean data
- transform it
- then store it for further use

That’s where writing/exporting data becomes important.

🔹 Why Exporting Data Matters

After processing, data needs to:
- be stored
- be shared
- be used by another system

👉 Output is what makes your pipeline useful.

🔹 Writing to CSV (Structured Data)

import pandas as pd

df.to_csv("output.csv", index=False)

👉 Saves data in tabular format
👉 Common for reporting and analysis

🔹 Writing to JSON (Flexible Data)

import json

with open("output.json", "w") as f:
    json.dump(data, f)

👉 Used for APIs and nested data
👉 Flexible and widely supported

🔹 Real-World Flow
👉 Raw Data → Processing → Clean Data → Export

🔹 Where You’ll Use This
- Data pipelines
- Reporting systems
- Data sharing between services
- Machine learning inputs

💡 Quick Summary
CSV → structured output. JSON → flexible output. Python makes exporting simple and efficient.

💡 Something to remember
Writing data is not the end… it’s what makes your pipeline useful.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
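As a worked example of the export step, here is a small self-contained sketch that writes the same DataFrame to both formats; the column names and values are assumptions:

import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ravi"], "salary": [50000, 60000]})

# Tabular output for reporting and analysis
df.to_csv("output.csv", index=False)

# Record-oriented JSON: one object per row, convenient for APIs
df.to_json("output.json", orient="records", indent=2)

DataFrame.to_json with orient="records" produces a list of row objects, which is usually what downstream services expect.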
🐍 Day 6/30 — Python for Data Engineers

Error Handling. What separates scripts from production pipelines.

I've seen pipelines crash in production because of one missing key in a JSON payload. No error handling. No logging. Just a silent failure at 2 AM.

Here's what I learned the hard way 👇

The full try/except structure most people don't use:

try:
    run_query(conn)
except ConnectionError as e:
    log.error(f"DB failed: {e}")
else:
    commit(conn)   # ← only runs if NO error
finally:
    conn.close()   # ← ALWAYS runs

Most engineers only write try/except. The else and finally blocks are gold.

And the pattern that saved me the most — dead-letter queues:

for row in records:
    try:
        validate(row)
        passed.append(row)
    except ValidationError:
        failed.append(row)  # quarantine bad rows

Don't crash the whole pipeline over one bad row. Isolate it.

Today's cheat sheet covers:
→ Full try/except/else/finally anatomy
→ 12 common built-in exceptions
→ Multiple except, raise, re-raise, chaining
→ Custom exceptions (production standard)
→ Context managers with with
→ Dead-letter queue · retry backoff · traceback logging

📌 Save the cheat sheet above.

Day 7 tomorrow: File I/O & CSV / JSON 📂

What's your go-to error handling pattern in pipelines? 👇

#Python #DataEngineering #30DaysOfPython #LearnPython #DataEngineer #DataAnalyst #Data #Software
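The retry backoff the cheat sheet mentions can be sketched with just the standard library. This is a hedged illustration, not a fixed recipe; the attempt count, delays, and the exception type to catch are assumptions:

import time
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger(__name__)

def with_retries(func, attempts=3, base_delay=1.0):
    """Call func, retrying with exponential backoff on connection errors."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except ConnectionError as e:
            if attempt == attempts:
                raise  # out of retries: let the caller handle it
            delay = base_delay * 2 ** (attempt - 1)
            log.warning(f"Attempt {attempt} failed ({e}); retrying in {delay}s")
            time.sleep(delay)

In practice this pairs with the dead-letter pattern above: retry transient failures, quarantine permanent ones.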
🚀 Day 17/20 — Python for Data Engineering

Building a Simple Data Pipeline

So far, we’ve learned:
- reading data
- transforming data
- working with APIs

Now it’s time to connect everything together.
👉 That’s called a data pipeline

🔹 What is a Data Pipeline?
A pipeline is a sequence of steps:
👉 Ingest → Process → Store

🔹 Simple Example

import pandas as pd
import requests

# Step 1: Fetch data
response = requests.get("https://lnkd.in/gTtgvXhZ")
data = response.json()

# Step 2: Convert to DataFrame
df = pd.DataFrame(data)

# Step 3: Transform
df["salary"] = df["salary"] * 1.1

# Step 4: Store
df.to_csv("output.csv", index=False)

🔹 Pipeline Flow
👉 API → Python → Transform → Output

🔹 Why This Matters
- Automates data flow
- Reduces manual work
- Scalable processing
- Foundation of data engineering

🔹 Real-World Use
- ETL pipelines
- Data ingestion systems
- Batch processing jobs

💡 Quick Summary
A pipeline connects all steps into one flow.

💡 Something to remember
Individual steps are code… connected steps become a system.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
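In production, the same four steps are usually wrapped in functions so each stage can be tested and monitored on its own. A hedged sketch of that refactor; the URL, column name, and file path are placeholders:

import pandas as pd
import requests

def ingest(url: str) -> pd.DataFrame:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return pd.DataFrame(response.json())

def transform(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["salary"] = out["salary"] * 1.1
    return out

def store(df: pd.DataFrame, path: str) -> None:
    df.to_csv(path, index=False)

if __name__ == "__main__":
    store(transform(ingest("https://example.com/employees.json")), "output.csv")

Splitting ingest/transform/store also makes it easy to swap any stage later, for example replacing the CSV sink with a database write.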
🚀 Day 8/20 — Python for Data Engineering

Data Transformation Basics

After reading data, the next step is not storing it…
👉 It’s transforming it into usable form

Raw data is often:
- messy
- inconsistent
- not analysis-ready

That’s where data transformation comes in.

🔹 What is Data Transformation?
Changing data into a cleaner, structured, and useful format.

🔹 Common Transformations

📌 Selecting Columns

df = df[["name", "salary"]]

👉 Keep only required data

📌 Filtering Rows

df = df[df["salary"] > 50000]

👉 Focus on relevant records

📌 Creating New Columns

df["bonus"] = df["salary"] * 0.1

👉 Add derived data

📌 Renaming Columns

df.rename(columns={"salary": "income"}, inplace=True)

👉 Improve readability

🔹 Why This Matters
- Converts raw → usable data
- Prepares data for analysis
- Makes pipelines meaningful

🔹 Real-World Flow
👉 Raw Data → Clean → Transform → Store

💡 Quick Summary
Transformation is where data becomes valuable.

💡 Something to remember
Raw data is useless… until you transform it into something meaningful.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
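The same four transformations can be written as one chained expression that reads top to bottom like a pipeline. A sketch with invented sample data:

import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ravi"], "salary": [48000, 72000]})

clean = (
    df[["name", "salary"]]                       # select columns
    .query("salary > 50000")                     # filter rows
    .assign(bonus=lambda d: d["salary"] * 0.1)   # create a derived column
    .rename(columns={"salary": "income"})        # rename for readability
)
print(clean)

Chaining avoids inplace=True and intermediate variables, and each step stays independently readable.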
🧩 Building Strong Foundations: Python for Data Validation

Before jumping into ETL testing, mastering the basics is critical. Here’s what the first phase of the journey looks like 👇

🔹 Start with core Python concepts:
- Data types, lists, dictionaries
- Loops and conditional logic
- Functions for reusable validation rules

🔹 Move into data handling:
- Reading CSV/JSON files
- Using Pandas for data manipulation
- Handling missing values & duplicates

💡 Detect duplicate records

data = [1, 2, 2, 3]
print(len(data) != len(set(data)))  # True → duplicates exist

💡 Basic data validation rule

def validate_null(val):
    return val is None  # True when the value is missing

These simple checks are the building blocks of real-world data quality frameworks.

🎯 The goal here is not just coding…
…it’s thinking like a data tester. What can go wrong with data? How do I catch it early?

Next step → Applying these skills to ETL validation scenarios.

Follow Khushboo Gupta for more.

#PythonForData #DataValidation #Pandas #DataAnalytics #ETL #DataEngineering #TechSkills #Upskilling #LearningJourney
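One way these single checks scale up is a small rule runner that applies every validation to a dataset and collects the failures. A hedged sketch; the column names and rules are invented:

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, None], "amount": [10, -5, 20, 30]})

rules = {
    "no null ids":          lambda d: d["id"].notna().all(),
    "ids are unique":       lambda d: not d["id"].duplicated().any(),
    "amounts non-negative": lambda d: (d["amount"] >= 0).all(),
}

failures = [name for name, check in rules.items() if not check(df)]
print("Failed rules:", failures or "none")  # all three fail on this sample

Keeping rules as named callables makes the report human-readable and the rule set easy to extend.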
🐍 Day 7/30 — Python for Data Engineers

File I/O, CSV & JSON. The bread and butter of every ingestion pipeline.

Before you touch pandas or Spark — you need to know how Python handles raw files.

Because in real pipelines, you'll deal with:
→ CSVs dropped by vendors in S3
→ JSON payloads from REST APIs
→ JSONL files in your data lake raw layer
→ Config files that drive your pipeline logic

The #1 mistake I see beginners make:

# ❌ Wrong — file never closes if an error occurs
f = open("data.csv", "r")
data = f.read()

# ✅ Right — auto-closes even on exceptions
with open("data.csv", "r") as f:
    data = f.read()

And the thing that confused me for weeks:

json.load(f)     # reads from a FILE object
json.loads(s)    # parses a STRING
json.dump(d, f)  # writes to a FILE
json.dumps(d)    # returns a STRING

The "s" = string. Once you know that, it sticks forever.

For data lake files, JSONL is king:

# One JSON object per line — memory efficient
with open("events.jsonl") as f:
    events = [json.loads(line) for line in f if line.strip()]

Today's cheat sheet covers:
→ open() with context managers
→ All 6 file modes explained
→ Key file methods (with memory warnings)
→ csv.DictReader / DictWriter
→ Common CSV gotchas (encoding, newline, delimiter)
→ json.load / loads / dump / dumps
→ JSONL pattern + CSV → JSON transform

📌 Every section has a plain-English explanation — save it.

Day 8 tomorrow: OS & Pathlib — Navigate the Filesystem Like a Pro 📁

Which format do you deal with most in your pipelines — CSV or JSON? 👇

#Python #DataEngineering #30DaysOfPython #LearnPython #DataEngineer #ETL #DataAnalyst #DataAnalysis #Data #PythonDev
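The CSV → JSON transform from the cheat sheet can be spelled out with only the standard library. A minimal sketch, assuming a data.csv with a header row:

import csv
import json

# Stream a CSV into JSONL: one JSON object per line, constant memory
with open("data.csv", newline="", encoding="utf-8") as src, \
     open("data.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):  # each row becomes a dict keyed by the header
        dst.write(json.dumps(row) + "\n")

Because both files are streamed line by line, this behaves the same on a 1 KB file and a 10 GB file.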
Python Chaos to dbt Clarity: Why I Upgraded My Data Pipeline Architecture

We’ve all been there. A "simple" Python script that starts with extracting data and ends up being a 1,000-line monster handling cleaning, joining, testing, and documentation. It works... until it doesn't.

In my latest project, "SME-Modern-Sales-DWH," I decided to move away from the Monolithic ETL approach (Level 1) to a Modern ELT framework (Level 2).

The Shift: Decoupling the Logic 🏗️
Instead of forcing Python to do everything, I redistributed the workload to where it belongs:

🔹 Python (The Mover): Now only handles Extract & Load. It moves raw data from CSVs to the Bronze layer. Simple, fast, and easy to maintain.
🔹 dbt-core (The Brain): Once the data is in SQL Server, dbt takes over for the Transformations.

Why this is a game-changer for SMEs:
1. Automated Testing: I implemented 47 data quality tests. If the data isn't right, the build fails. No more "guessing" if the report is accurate.
2. Modular Modeling: Using Staging, Intermediate, and Marts layers. It’s built like LEGO — modular and scalable.
3. Documentation on Autopilot: dbt docs now provide a full lineage of the data, making the system transparent for everyone.
4. Surrogate Keys & Hashing: Used MD5 hashing to merge CRM and ERP data seamlessly.

The Result? A reliable "Single Source of Truth" that turns fragmented data into actionable sales insights. No more "nuclear explosions" in the codebase! 💥✅

Check out the full architecture and code on GitHub: https://lnkd.in/d-BB9b9R

#DataEngineering #dbt #Python #ModernDataStack #DataAnalytics #SQL #ELT #SME
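The MD5 surrogate-key idea is easy to sketch in plain Python (dbt projects typically do the same in SQL, for example via the dbt_utils generate_surrogate_key macro). The normalization rules and sample values below are assumptions for illustration, not the project's actual logic:

import hashlib

def surrogate_key(*parts: str) -> str:
    """Deterministic MD5 key built from a record's natural-key columns."""
    raw = "|".join(p.strip().lower() for p in parts)
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

# The same customer arriving from CRM and ERP hashes to the same key
print(surrogate_key("ACME Corp", "DE"))   # e.g. the CRM spelling
print(surrogate_key(" acme corp", "DE"))  # e.g. the ERP spelling (identical key)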
🚀 Automating Data Workflows with Python & Pandas

I’ve been diving deeper into Python for data analysis, and I just built a script that automates a common (and often tedious) task: cleaning CSV data and converting it into multiple formats for different stakeholders.

🛠️ The Problem:
CSV files often come with "messy" formatting — like stray spaces after commas — that can break standard data pipelines. Plus, different teams need the same data in different formats (web devs want JSON, managers want Excel, and data engineers want CSV).

💡 The Solution:
Using pandas and os, I created a script that:
- Cleans on the fly: Used skipinitialspace=True to automatically trim whitespace issues that usually cause KeyErrors.
- Performs Vectorized Math: Calculated total sales across the entire dataset in a single line of code.
- Automates File Management: Dynamically creates output directories and exports the results into JSON, Excel, and CSV simultaneously.

📦 Key Tools Used:
- Pandas: For high-performance data manipulation.
- OS Module: For robust file path handling.
- Openpyxl: To bridge the gap between Python and Excel.

It’s a simple script, but it’s a foundational step toward building more complex, automated data pipelines! Check out the logic below 👇

import pandas as pd
import os

# Read & Clean: skipinitialspace=True is a lifesaver for messy CSVs!
df = pd.read_csv('data/sales.csv', skipinitialspace=True)

# Transform: Vectorized calculation for 'total'
df['total'] = df['quantity'] * df['price']

# Automate: Exporting to 3 different formats at once
os.makedirs('output', exist_ok=True)
df.to_json('output/sales_data.json', orient='records', indent=2)
df.to_excel('output/sales_data.xlsx', index=False)
df.to_csv('output/sales_with_totals.csv', index=False)

#Python #DataAnalysis #Pandas #Automation #CodingJourney #DataScience
🐍 Day 3/30 — Python for Data Engineers

Dictionaries & Sets. The tools that make pipelines fast.

Every Data Engineer works with dicts daily — whether parsing API responses, defining schemas, or managing configs.

But here's the one that most beginners miss 👇

Sets are basically SQL operations:

A & B → INNER JOIN (intersection)
A | B → FULL OUTER JOIN (union)
A - B → LEFT ANTI JOIN (difference)
A ^ B → schema drift detector 🚨

That last one is genuinely useful in production:

new_cols = incoming_cols - expected_cols
# → {"total"} ← column you didn't expect. Alert!

And remember: dict/set lookup is O(1) — hash table under the hood. List lookup is O(n) — it scans every element. On 10M rows, that difference is seconds vs milliseconds.

📌 Full cheat sheet in the image — methods, comprehensions, real DE patterns.

Day 4 tomorrow: Functions & Lambda 🔧

What's your most-used dict method? .get() or .items()? Drop it below 👇

#Python #DataEngineering #30DaysOfPython #LearnPython #DataEngineer #SQL
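A runnable version of the schema-drift check, with invented column names:

expected_cols = {"id", "name", "salary"}
incoming_cols = {"id", "name", "salary", "total"}

new_cols     = incoming_cols - expected_cols  # columns you didn't expect
missing_cols = expected_cols - incoming_cols  # columns that disappeared
drift        = incoming_cols ^ expected_cols  # both directions at once

if drift:
    print(f"Schema drift! new={new_cols}, missing={missing_cols}")

Using the symmetric difference (^) as the alarm catches both added and dropped columns in one check.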
🚀 Day 9/20 — Python for Data Engineering

Working with Large Files (Memory Optimization)

By now, we know how to read, write, and transform data. But in real-world scenarios…
👉 Data is not small
👉 Files can be GBs in size

If we try to load everything at once → ❌ crash / slow performance

🔹 The Problem

df = pd.read_csv("large_file.csv")

👉 Loads entire file into memory
👉 Not scalable

🔹 Solution: Read in Chunks

import pandas as pd

for chunk in pd.read_csv("large_file.csv", chunksize=1000):
    process(chunk)

👉 Processes data piece by piece
👉 Memory efficient
👉 Scalable

🔹 Another Approach: Line-by-Line

with open("large_file.txt") as f:
    for line in f:
        process(line)

👉 Useful for logs and streaming data

🔹 Why This Matters
- Prevent memory issues
- Handle large datasets smoothly
- Build scalable pipelines

🔹 Where You’ll Use This
- Log processing
- Batch pipelines
- Streaming systems
- ETL workflows

💡 Quick Summary
Don’t load everything at once. Process data in parts.

💡 Something to remember
Efficient data handling is not about power… it’s about smart processing.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
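A worked sketch of what process(chunk) might look like: aggregating across chunks so the full file never sits in memory. The file name, chunk size, and sales column are assumptions:

import pandas as pd

total_rows = 0
total_sales = 0.0

# Each iteration holds only ~100k rows in memory at a time
for chunk in pd.read_csv("large_file.csv", chunksize=100_000):
    total_rows += len(chunk)
    total_sales += chunk["sales"].sum()

print(f"{total_rows} rows, total sales = {total_sales}")

Running aggregates like these are the chunked equivalent of a single groupby on the full DataFrame.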