One shift that helped me understand data engineering better:
👉 Stop thinking in scripts.
👉 Start thinking in pipelines.

Instead of writing a single piece of code that does everything, I try to think in stages: Extract → Transform → Load

For example:
Extract: pull raw data from a source (CSV, API, database)
Transform: clean and reshape it using Python, Pandas, or SQL
Load: store it somewhere reliable (data warehouse or lake)

This structure makes things easier to:
✔ debug
✔ scale
✔ maintain

It also mirrors how real-world data platforms are built. Still practicing this mindset shift. A minimal code sketch of the flow is below 👇

#DataEngineering #ETL #Python #SQL
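A minimal sketch of the Extract → Transform → Load idea. The file paths and column names are placeholders, not taken from the original post:

import pandas as pd

# Extract: pull raw data from a source (a local CSV stands in for an API or database)
raw = pd.read_csv("orders_raw.csv")

# Transform: clean and reshape
clean = raw.dropna(subset=["order_id"])
clean["amount"] = clean["amount"].astype(float)

# Load: store it somewhere reliable (a file stands in for a warehouse or lake)
clean.to_csv("orders_clean.csv", index=False)

Each stage can later be swapped out (API extract, SQL transform, warehouse load) without touching the others, which is exactly what makes the staged structure easier to debug and scale.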
🐍 #LearnWithSoumava | Series 02: Python Fundamentals for Data Engineers

Transitioning from traditional ETL to AI-driven engineering requires more than just writing code: it requires choosing the right data structures for performance and integrity. I've realized that the "basics" are actually the most powerful tools in our kit. Today, I'm sharing my personal notebook on the building blocks of Python.

What's inside this guide?
✅ Variables & Dynamic Typing: how Python infers types (and how to verify them).
✅ Lists: why being "mutable" and "dynamic" makes them an ETL engineer's best friend.
✅ NumPy Arrays: the secret to high-speed mathematical operations over large datasets.
✅ Tuples: how to use "immutability" to protect your database credentials and constants.

Key takeaways from the guide:
🔹 Use a list for flexibility and changing data.
🔹 Use a tuple for security and read-only data.
🔹 Use an array for raw performance and math.

Swipe through my Colab notes below to see the code snippets and real-world ETL use cases! 👇

#LearnWithSoumava #PythonProgramming #DataEngineering #NumPy #ETL #TechCommunity #DataAnalytics
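A quick illustration of the list / tuple / NumPy array takeaway; the values are made up for the example:

import numpy as np

# List: mutable and dynamic, grows as records arrive during an ETL run
staging = [120.5, 98.0]
staging.append(210.75)

# Tuple: immutable, good for constants such as connection settings
db_config = ("localhost", 5432, "analytics")
# db_config[1] = 5433  # would raise TypeError: tuples cannot be changed

# NumPy array: fast vectorised math over large numeric data
salaries = np.array([50000, 62000, 71000])
adjusted = salaries * 1.1  # element-wise, no Python loop needed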
🚀 Day 9/20 — Python for Data Engineering
Working with Large Files (Memory Optimization)

By now, we know how to read, write, and transform data. But in real-world scenarios…
👉 Data is not small
👉 Files can be GBs in size
If we try to load everything at once → ❌ crash / slow performance

🔹 The Problem
df = pd.read_csv("large_file.csv")
👉 Loads the entire file into memory
👉 Not scalable

🔹 Solution: Read in Chunks
import pandas as pd
for chunk in pd.read_csv("large_file.csv", chunksize=1000):
    process(chunk)
👉 Processes data piece by piece
👉 Memory efficient
👉 Scalable

🔹 Another Approach: Line-by-Line
with open("large_file.txt") as f:
    for line in f:
        process(line)
👉 Useful for logs and streaming data

🔹 Why This Matters
Prevents memory issues
Handles large datasets smoothly
Builds scalable pipelines

🔹 Where You'll Use This
Log processing
Batch pipelines
Streaming systems
ETL workflows

💡 Quick Summary
Don't load everything at once. Process data in parts.

💡 Something to remember
Efficient data handling is not about power…
It's about smart processing.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
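As a concrete stand-in for process(), here is a hedged sketch that aggregates a large CSV chunk by chunk; the file name and the salary column are assumptions for illustration:

import pandas as pd

total_rows = 0
total_salary = 0.0

# Aggregate without ever holding the full file in memory
for chunk in pd.read_csv("large_file.csv", chunksize=1000):
    total_rows += len(chunk)
    total_salary += chunk["salary"].sum()

print(f"rows={total_rows}, average salary={total_salary / total_rows:.2f}")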
Here are 5 Python libraries I use every week that I never learned about in grad school.

Not pandas. Not scikit-learn. The ones nobody tells you about until you're debugging something at 11 PM.

1. pydantic — I used to validate data with if-else chains. Now I define data models that catch bad records before they hit my pipeline. One config change saved me hours of debugging clinical data feeds.

2. missingno — One visualization that shows every missing-value pattern in your dataset. In healthcare data, the pattern of what's missing matters more than the percentage. This library makes it obvious.

3. pandera — Schema validation for dataframes. Define what your columns should look like and it yells at you before bad data propagates downstream. Essential when your data comes from multiple sources.

4. rich — Better logging and console output. Sounds trivial. But when you're running a pipeline on a remote server and need to quickly understand what went wrong, pretty output saves real time.

5. janitor (pyjanitor) — Clean column names, remove empty rows, handle Excel messiness. The boring data cleaning that eats 30% of every project.

What's a library that changed how you work? The more niche, the better.

#Python #DataScience #MachineLearning
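A tiny sketch of the pydantic pattern from point 1; the record fields are invented for illustration:

from pydantic import BaseModel, ValidationError

class LabResult(BaseModel):
    patient_id: int
    test_name: str
    value: float

LabResult(patient_id=101, test_name="HbA1c", value=6.2)        # valid record passes

try:
    LabResult(patient_id="n/a", test_name="HbA1c", value=6.2)  # bad record is caught
except ValidationError as err:
    print(err)  # reports exactly which field failed and why

The same "declare the shape once, let the library complain" idea is what pandera brings to whole dataframes.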
What is Lazy Evaluation?

PySpark uses lazy evaluation → transformations are not executed immediately
Operations like filter(), select() only build a logical plan (DAG)
Execution happens only when an action is called (show(), count(), collect())
Spark optimizes the entire plan before execution → better performance
Avoids unnecessary computations and improves efficiency

💡 Example (Python):
df = spark.read.csv("data.csv", header=True)

# Transformations (no execution yet)
df_filtered = df.filter(df.salary > 5000)
df_selected = df_filtered.select("name", "salary")

# Action → triggers execution
df_selected.show()

⚡ Without lazy evaluation, each step would execute separately → slower performance. With lazy evaluation, Spark optimizes everything and runs it efficiently.

Still learning and exploring more PySpark concepts! 🚀

#PySpark #BigData #DataEngineering #PerformanceTuning #Learning
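To actually watch the optimization happen, you can ask Spark for the plan before triggering anything. A minimal sketch, assuming a local SparkSession and the same data.csv as above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = spark.read.csv("data.csv", header=True, inferSchema=True)
plan = df.filter(df.salary > 5000).select("name", "salary")

# Nothing has executed yet; explain() only prints the plan Spark has built
plan.explain()

# The action below is what finally triggers execution of the optimized plan
plan.show()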
🚀 Day 17/20 — Python for Data Engineering
Building a Simple Data Pipeline

So far, we've learned:
reading data
transforming data
working with APIs
Now it's time to connect everything together.
👉 That's called a data pipeline

🔹 What is a Data Pipeline?
A pipeline is a sequence of steps:
👉 Ingest → Process → Store

🔹 Simple Example
import pandas as pd
import requests

# Step 1: Fetch data
response = requests.get("https://lnkd.in/gTtgvXhZ")
data = response.json()

# Step 2: Convert to DataFrame
df = pd.DataFrame(data)

# Step 3: Transform
df["salary"] = df["salary"] * 1.1

# Step 4: Store
df.to_csv("output.csv", index=False)

🔹 Pipeline Flow
👉 API → Python → Transform → Output

🔹 Why This Matters
Automates data flow
Reduces manual work
Scalable processing
Foundation of data engineering

🔹 Real-World Use
ETL pipelines
Data ingestion systems
Batch processing jobs

💡 Quick Summary
A pipeline connects all steps into one flow.

💡 Something to remember
Individual steps are code…
Connected steps become a system.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
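The same flow reads even better when each stage is its own function, so steps can be tested and reused independently. A sketch only, reusing the shortened API URL and the salary column from the example above:

import pandas as pd
import requests

def extract(url: str) -> pd.DataFrame:
    # Ingest: fetch JSON records from the API
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return pd.DataFrame(response.json())

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Process: apply the 10% salary bump
    out = df.copy()
    out["salary"] = out["salary"] * 1.1
    return out

def load(df: pd.DataFrame, path: str) -> None:
    # Store: write the result to disk
    df.to_csv(path, index=False)

if __name__ == "__main__":
    load(transform(extract("https://lnkd.in/gTtgvXhZ")), "output.csv")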
🚀 Top 5 Pandas Codes Every Data Scientist Should Know

From loading datasets to performing powerful aggregations, these essential Pandas commands form the backbone of real-world data analysis. Whether you're a beginner or sharpening your skills, mastering these basics can significantly boost your productivity and confidence in handling data.

📌 Key Highlights:
• Efficient data loading
• Quick data insights & summary
• Smart filtering techniques
• Handling missing values
• Grouping & aggregating like a pro

One representative command for each is sketched below 👇

💡 Small commands, big impact — this is where every Data Science journey begins. If you're learning Data Science, don't just read: practice daily.

#DataScience #Python #Pandas #MachineLearning #DataAnalytics #Coding #LearnToCode #CareerGrowth
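A hedged sketch of the five highlights; the file name and columns are placeholders:

import pandas as pd

# 1. Efficient data loading
df = pd.read_csv("sales.csv")

# 2. Quick data insights & summary
print(df.head())
print(df.describe())

# 3. Smart filtering
high_value = df[df["amount"] > 1000]

# 4. Handling missing values
df = df.dropna(subset=["customer_id"])
df["amount"] = df["amount"].fillna(0)

# 5. Grouping & aggregating
print(df.groupby("region")["amount"].sum())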
𝗪𝗼𝗿𝗸𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗹𝗮𝗿𝗴𝗲 𝗱𝗮𝘁𝗮𝘀𝗲𝘁𝘀 𝗶𝗻 𝗣𝗮𝗻𝗱𝗮𝘀 𝘁𝗮𝘂𝗴𝗵𝘁 𝗺𝗲 𝗼𝗻𝗲 𝘀𝗶𝗺𝗽𝗹𝗲 𝗹𝗲𝘀𝘀𝗼𝗻 — 𝗺𝗲𝗺𝗼𝗿𝘆 𝗺𝗮𝘁𝘁𝗲𝗿𝘀 𝗺𝗼𝗿𝗲 𝘁𝗵𝗮𝗻 𝘄𝗲 𝘁𝗵𝗶𝗻𝗸.

In the beginning, I used to load dataframes without even thinking about how much memory they consume. Everything looked fine… until one day my script slowed down, and sometimes even crashed. That's when I realized it's not always about the data size, it's about how efficiently we handle it.

One simple habit that changed things for me is checking the memory usage of a dataframe. In Pandas, you can do this very easily:

df.info()

This gives a quick summary of your dataframe, including memory usage. If you want a more detailed view, you can use:

df.memory_usage(deep=True)

This shows how much memory each column is using. Adding deep=True gives you accurate results, especially for object-type columns like strings.

What I found interesting is that sometimes a few columns consume most of the memory. Object columns especially: they silently take up a lot of space.

Once you know where the memory is going, you can start optimizing (see the sketch after this list):
* Convert object columns to category if they have repeated values
* Use smaller data types like int32 instead of int64
* Drop unnecessary columns early

These small steps make a big difference, especially when working with large datasets.

For me, this was a small learning, but a very powerful one. Now, before doing any heavy operations, I take a few seconds to check memory usage, and it saves me minutes (sometimes hours) later.

If you're working with Pandas, give this a try. It might look small, but it can completely change how your code performs.

#BigData #Python #Pandas #DataAnalytics
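A small sketch of those three optimizations; the file and column names are placeholders:

import pandas as pd

df = pd.read_csv("events.csv")
print(df.memory_usage(deep=True))            # before: see where the memory goes

# 1. Repeated string values -> category
df["country"] = df["country"].astype("category")

# 2. Smaller numeric types where the value range allows it
df["clicks"] = df["clicks"].astype("int32")

# 3. Drop columns you never use, as early as possible
df = df.drop(columns=["raw_payload"])

print(df.memory_usage(deep=True))            # after: compare the footprint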
🚀 Day 8/20 — Python for Data Engineering
Data Transformation Basics

After reading data, the next step is not storing it…
👉 It's transforming it into usable form

Raw data is often:
messy
inconsistent
not analysis-ready
That's where data transformation comes in.

🔹 What is Data Transformation?
Changing data into a cleaner, structured, and useful format.

🔹 Common Transformations

📌 Selecting Columns
df = df[["name", "salary"]]
👉 Keep only required data

📌 Filtering Rows
df = df[df["salary"] > 50000]
👉 Focus on relevant records

📌 Creating New Columns
df["bonus"] = df["salary"] * 0.1
👉 Add derived data

📌 Renaming Columns
df.rename(columns={"salary": "income"}, inplace=True)
👉 Improve readability

🔹 Why This Matters
Converts raw → usable data
Prepares data for analysis
Makes pipelines meaningful

🔹 Real-World Flow
👉 Raw Data → Clean → Transform → Store

💡 Quick Summary
Transformation is where data becomes valuable.

💡 Something to remember
Raw data is useless…
Until you transform it into something meaningful.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
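Those four transformations also chain together neatly. A hedged sketch, assuming a CSV with name and salary columns:

import pandas as pd

employees = (
    pd.read_csv("employees.csv")                    # raw data
      .loc[:, ["name", "salary"]]                   # select columns
      .query("salary > 50000")                      # filter rows
      .assign(bonus=lambda d: d["salary"] * 0.1)    # create a derived column
      .rename(columns={"salary": "income"})         # improve readability
)

employees.to_csv("employees_clean.csv", index=False)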
I finally understand why data scientists say they spend 80% of their time on data. 📊

This week, instead of just reading about the ML lifecycle, I actually did the second step: Data Collection. 🎯

I built my own dataset called "TMDB Top Rated Movies" using their public API. 🎬

It was interesting to see how data can come from different sources: some datasets are already available in formats like CSV and JSON, while others have to be retrieved from SQL databases. I also learned that data can be collected through APIs or even web scraping, depending on the use case.

Nothing fancy. Just:
🐍 Python
📡 A bunch of API calls
🔄 Figuring out how to loop through pages without breaking everything

In the end, I pulled together 10,000+ movie records: clean, structured, and ready for actual analysis or ML. 📁✅

This part felt more like real engineering than anything I have done in a notebook. 🛠️

Small step. But it's real. 🚀

dataset link: https://lnkd.in/dG7EcE5q

#MachineLearning #DataScience #Python #LearningByDoing
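For anyone curious, the pagination loop looks roughly like this. This is a sketch, not the exact script: the endpoint and response fields should be checked against the TMDB API docs, and YOUR_API_KEY is a placeholder:

import time
import requests
import pandas as pd

API_KEY = "YOUR_API_KEY"   # placeholder: request a key from TMDB
URL = "https://api.themoviedb.org/3/movie/top_rated"

records = []
for page in range(1, 6):   # raise the page limit to collect the full dataset
    resp = requests.get(URL, params={"api_key": API_KEY, "page": page}, timeout=30)
    resp.raise_for_status()
    records.extend(resp.json().get("results", []))
    time.sleep(0.3)        # small pause between calls to stay polite to the API

pd.DataFrame(records).to_csv("tmdb_top_rated.csv", index=False)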
I got paid to NOT build an ML model. Here's why.

SQL > fancy ML models. Fight me. 🫵

Okay, hear me out: I've seen teams spend months building ML pipelines... when a 10-line SQL query would've answered the question in 10 minutes.

My actual toolkit after 4 years:
🗄️ SQL - find the truth in the data
🐍 Python - automate everything else
🤖 ML - deploy it when SQL genuinely can't do the job

The aha moment? They work best in that exact order. Most people jump straight to ML. The pros start with SQL.

Where are you in your data journey? 👇

#SQL #Python #MachineLearning #DataScience #HotTake #DataEngineering #TechOpinion #LearningInPublic #BuildingInPublic #DataAnalytics