Your Python pipeline loads 10 million rows. Then it crashes. Not because your code is wrong — because it loads everything into memory at once.

The fix? One word: yield.

Here's the before/after that every data engineer needs to see.

---

❌ BEFORE — loads all rows into RAM at once:

```python
def read_records(filepath):
    records = []
    with open(filepath) as f:
        for line in f:
            records.append(line.strip())
    return records  # 10M rows sitting in memory

for record in read_records("data.csv"):
    process(record)
```

With 10M rows, this can eat GBs of RAM before processing even starts.

---

✅ AFTER — processes one row at a time with a generator:

```python
def read_records(filepath):
    with open(filepath) as f:
        for line in f:
            yield line.strip()  # produces one row, pauses, waits

for record in read_records("data.csv"):
    process(record)
```

Same logic. Same output. Near-zero memory overhead.

---

Why does this work?

→ A generator doesn't compute all values upfront
→ It produces one item, pauses, and resumes only when the next is needed
→ Memory stays flat — whether you process 1K or 100M rows

This is the foundation behind Spark's lazy evaluation, Kafka consumers, and ETL streaming pipelines.

Master this pattern in Python first — and distributed systems start making a lot more sense.

#DataEngineering #Python #BigData #PythonForDataEngineers #ETL #LearnData #DataPipelines
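If you want to take the pattern one step further, generators compose: each stage yields into the next, so a whole cleaning-and-filtering pipeline stays streaming. A minimal sketch, assuming the same process() function as above; the delimiter and the validity check are made up for illustration:

```python
def read_lines(filepath):
    # Stream raw lines from disk, one at a time
    with open(filepath) as f:
        for line in f:
            yield line.rstrip("\n")

def parse_rows(lines, delimiter=","):
    # Split each line into fields, still lazily
    for line in lines:
        yield line.split(delimiter)

def keep_valid(rows):
    # Drop rows that are missing fields; nothing is materialized
    for row in rows:
        if len(row) >= 3 and row[0]:
            yield row

# The whole chain holds roughly one row in memory at any moment
for row in keep_valid(parse_rows(read_lines("data.csv"))):
    process(row)
```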
Python Pipeline Optimization: Yield for Efficient Data Processing
More Relevant Posts
UDF in PySpark is the most expensive 3 letters in your job config.

Every row crosses the JVM-Python boundary twice. Spark serializes data out to Python, your function runs, then it serializes back. On a billion-row table, that overhead doesn't add minutes - it adds hours.

We replaced ours with native pyspark.sql.functions expressions. Same logic, zero infrastructure changes. Runtime: 45 minutes.

The fix order when you see udf() in production:
→ Check if pyspark.sql.functions already has what you need - it usually does
→ If you need custom logic, use @pandas_udf instead: vectorized, with no row-by-row serialization overhead (a sketch follows below)
→ Only keep scalar UDFs when there is genuinely no other path

The issue isn't that people write bad UDFs. It's that Python UDFs look identical to native functions in the code, so nobody questions them until the job becomes a problem.

Have you run .explain(True) on a job with UDFs recently, or are you still trusting that the runtime is "just how Spark works"?
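To make the second fix concrete, here is a minimal sketch of a vectorized @pandas_udf next to its native equivalent. The card-masking logic and column names are invented for illustration; the point is only that the pandas UDF receives whole Arrow batches as pandas Series instead of one Python object per row:

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("4111111111111111",), ("5500005555555559",)], ["card_number"])

# Vectorized UDF: operates on a whole pandas Series per Arrow batch
@pandas_udf("string")
def mask_card(card: pd.Series) -> pd.Series:
    return "****" + card.str[-4:]

df.withColumn("masked", mask_card(F.col("card_number"))).show()

# When a native expression exists, skip Python entirely
df.withColumn(
    "masked", F.concat(F.lit("****"), F.substring("card_number", -4, 4))
).show()
```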
Excited to share my latest article on modern data processing!

I recently published "Polars: A High-Performance DataFrame Library in Python", where I dive into how Polars is emerging as a powerful alternative to traditional data manipulation libraries.

As datasets continue to grow in size and complexity, performance becomes critical. In this article, I explore how Polars addresses these challenges with a highly efficient architecture built on Apache Arrow, enabling faster computation and reduced memory usage.

Here's what I discuss in the article:
▪️ What Polars is and why it's gaining traction in the data ecosystem
▪️ Its core design principles, including lazy execution, which optimizes queries before execution
▪️ Built-in parallel processing, allowing operations to run significantly faster compared to traditional approaches
▪️ How Polars handles large datasets more efficiently with lower memory overhead
▪️ Practical examples showcasing its performance benefits in real-world data workflows

One of the most interesting aspects I found is how Polars shifts the mindset from step-by-step execution to an optimized query plan, making data pipelines not just faster, but smarter.

If you're working in data science, data engineering, or analytics, and dealing with performance bottlenecks, Polars is definitely worth exploring.

I'd love to hear your thoughts: have you tried Polars yet? How does it compare with your current tools?

#Python #DataScience #BigData #Analytics #Polars #MachineLearning

Read the full article here:
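For anyone who wants a quick feel for the lazy-execution mindset described here, a minimal sketch; the file and column names are placeholders, and the calls shown follow the current scan_csv / group_by style of the Polars API:

```python
import polars as pl

# scan_csv builds a lazy query plan; nothing is read or computed yet
lazy = (
    pl.scan_csv("events.csv")
    .filter(pl.col("amount") > 0)
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total_amount"))
)

# Polars optimizes the whole plan (predicate/projection pushdown,
# parallel execution) only when collect() is called
result = lazy.collect()
print(result)
```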
📘 Python for PySpark Series – Day 9
📂 Generators in Python (Memory Efficient Data Processing)

✨ What are Generators in Python?
Generators are functions that return values one at a time instead of all at once. They use the yield keyword instead of return.
➡️ This makes them very useful when working with large datasets.

⚙️ Why Do We Need Generators?
In data engineering:
❓ What if we are processing huge data (millions of records)?
➡️ Storing everything in memory can cause performance issues.
✔ Generators solve this by producing data on demand
✔ Save memory
✔ Improve performance

🔹 Normal Function vs Generator

Normal Function:
```python
def numbers():
    return [1, 2, 3]
```

Generator:
```python
def numbers():
    yield 1
    yield 2
    yield 3
```
➡️ A generator returns values one by one, not all together.

🔹 Using a Generator

Example:
```python
def count_up(n):
    for i in range(n):
        yield i

for num in count_up(3):
    print(num)
```
➡️ Output: 0 1 2

🔗 Why Generators Matter in Data Engineering
When working with big data:
✔ Avoid loading the entire dataset into memory
✔ Process data in chunks
✔ Efficient streaming of data
➡️ Very useful in ETL pipelines and PySpark concepts.

🏫 Real-Life Analogy (Water Tap 🚰)
Imagine:
🚰 Tap → Water flows as needed
🪣 Bucket → Stores all water at once
➡️ Generator = Tap (on-demand flow)
➡️ List = Bucket (stores everything)

🧠 Interview Key Points
✔ Generators use yield instead of return
✔ Produce values one at a time
✔ Memory efficient
✔ Useful for large datasets
✔ Support iteration using loops

🧠 Key Takeaway
Generators enable efficient data processing by producing values on demand, making them essential for handling large-scale data in data engineering workflows.

🔖 Hashtags
#python #pyspark #dataengineering #bigdata #generators #pythonbasics #learningjourney #dataprocessing
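Since the post calls out processing data in chunks, a small follow-on sketch: a generator that yields fixed-size batches instead of single records (the batch size and file name are just examples):

```python
def read_in_chunks(filepath, chunk_size=1000):
    """Yield lists of up to chunk_size stripped lines at a time."""
    chunk = []
    with open(filepath) as f:
        for line in f:
            chunk.append(line.strip())
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:  # flush the final partial batch
        yield chunk

for batch in read_in_chunks("big_file.csv"):
    print(len(batch))  # e.g. write each batch to a database or staging table here
```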
Python Data Source API — worth using?

Most data engineers have written the same pipeline at least once. Call an API. Handle pagination. Land the data. Repeat.

One of the more common challenges in data engineering is working with applications that expose APIs but don't have out-of-the-box connectors. No native integration. No supported ingestion pattern. So you end up building it yourself.

Most teams follow a similar approach. Write Python code to call the API. Handle authentication, pagination, and rate limits. Transform the response. Land the data. Schedule it. Maintain it. It works, but over time it becomes a collection of custom pipelines that are difficult to standardize and scale.

This is where the Python Data Source API becomes interesting. At a high level, it allows you to define a data source directly in Python and integrate it into your data workflows more natively. Instead of treating API-based data as something external that needs to be pulled in and managed separately, it becomes part of a more consistent ingestion pattern.

What stands out to me is the shift in how external data is handled. Rather than writing one-off ingestion scripts, you can start to define reusable, structured access patterns for API-based sources. That has implications for maintainability, consistency, and how teams scale their data platforms over time.

It also raises some architectural questions. Should API data be treated the same as file-based ingestion? How tightly should ingestion logic be coupled to processing? Where does this fit relative to patterns like landing raw data and processing downstream?

It's still early, but it feels like a meaningful step toward standardizing a problem most data teams have been solving in an ad hoc way.

Curious how others are thinking about this. In what scenarios would you use the Python Data Source API over more traditional ingestion patterns?

#Databricks #DataEngineering #Python #DataArchitecture
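To make the discussion a bit more concrete, here is a rough sketch of what a custom source can look like with the PySpark Python Data Source API introduced in Spark 4.0. The class names, options, and fake rows are illustrative only, and the exact interface should be checked against the Spark documentation for your version:

```python
from pyspark.sql.datasource import DataSource, DataSourceReader

class RestApiDataSource(DataSource):
    """Hypothetical batch source wrapping a paginated REST API."""

    @classmethod
    def name(cls):
        return "rest_api"  # the format name used in spark.read.format(...)

    def schema(self):
        return "id int, payload string"

    def reader(self, schema):
        return RestApiReader(self.options)

class RestApiReader(DataSourceReader):
    def __init__(self, options):
        self.base_url = options.get("url")

    def read(self, partition):
        # A real reader would page through self.base_url; here we fake two rows
        yield (1, "first page record")
        yield (2, "second page record")

# Register once per session (spark = the active SparkSession), then use like any format
spark.dataSource.register(RestApiDataSource)
df = spark.read.format("rest_api").option("url", "https://example.com/api").load()
```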
𝗦𝗽𝗮𝗿𝗸 𝗜𝗻𝘁𝗲𝗿𝗻𝗮𝗹𝘀 #2: 𝗨𝗗𝗙𝘀 — 𝗧𝗵𝗲 𝗦𝗺𝗮𝗿𝘁 𝗖𝗼𝗱𝗲 𝗧𝗵𝗮𝘁 𝗕𝗿𝗲𝗮𝗸𝘀 𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲

I used to think UDFs were the cleanest way to write Spark code… Clean. Reusable. Easy. Until a discussion with my architect changed my perspective. "This is why your job is slow." Then I looked under the hood… and everything clicked.

👉 UDFs (User Defined Functions) look powerful, but inside Spark they break optimization.

⚠️ What actually happens when you use a #UDF:
❌ #Spark treats it as a black box
❌ #CatalystOptimizer can't analyze your logic
❌ No predicate pushdown
❌ No #WholeStageCodegen

And it gets worse…
💥 #JVM ↔ #Python serialization overhead
💥 Execution becomes row-by-row
💥 Vectorized (batch) execution is lost

🧠 What's happening internally (the real reason)
Spark doesn't execute your Python code directly. It follows this pipeline:
1️⃣ Build Logical Plan
2️⃣ Optimize using Catalyst
3️⃣ Convert to Physical Plan
4️⃣ Generate JVM bytecode (WholeStageCodegen)
5️⃣ Execute in a distributed manner

✅ With Built-in Functions:
Spark understands expressions like:
→ when, filter, join, agg
So it can:
✔ Apply rule-based + cost-based optimization
✔ Push filters down to the data source
✔ Reorder joins
✔ Eliminate unnecessary columns
✔ Combine multiple operations into a single stage
👉 Result: Fewer stages + less I/O + faster execution

❌ With a UDF:
Your logic lives in Python, so for Spark it becomes an opaque expression. Meaning: Spark doesn't know what this function does.
Because of this:
🚫 Catalyst cannot rewrite or optimize it
🚫 Filters cannot be pushed below the UDF
🚫 Column pruning stops at the UDF boundary
🚫 WholeStageCodegen cannot include it

💥 The Real Bottleneck
👉 Crossing the JVM ↔ Python boundary. Spark runs in the JVM, but the UDF runs in Python.
So for every row (or batch):
→ Serialize data (JVM → Python)
→ Execute function
→ Deserialize back (Python → JVM)
This causes:
💥 High CPU overhead
💥 Serialization cost
💥 Loss of #vectorizedexecution
💥 More GC pressure

🔍 Example

❌ Using a UDF:
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def categorize(age):
    return "minor" if age < 18 else "adult"

df = df.withColumn("category", udf(categorize, StringType())(df.age))
```

✅ Using Built-in Functions:
```python
from pyspark.sql.functions import when

df = df.withColumn(
    "category",
    when(df.age < 18, "minor").otherwise("adult")
)
```

💡 Same logic. Completely different execution plan.
✔ Built-in → optimized DAG + codegen + vectorized
❌ UDF → isolated, row-based, non-optimizable

🚀 What to do instead:
✔ Prefer #SparkSQL functions (when, expr, concat)
✔ Think in columnar transformations
✔ Use #PandasUDF only when unavoidable

🧠 Spark is not a Python engine. It's a distributed SQL engine with a Python interface.

Happy to share more Databricks tutorials & Spark insights — just DM me.

#ApacheSpark #SparkInternals #DataEngineering #Databricks #BigData
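One way to verify all of this on your own cluster is to compare the plans directly. In my experience the Python UDF surfaces as a separate BatchEvalPython step in the physical plan, while the built-in CASE WHEN stays inside the generated stage; operator names can vary by Spark version, so treat this as a sketch (it assumes an active SparkSession named spark):

```python
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def categorize(age):
    return "minor" if age < 18 else "adult"

df = spark.range(100).withColumnRenamed("id", "age")

# UDF version: look for a BatchEvalPython node in the physical plan
df.withColumn("category", udf(categorize, StringType())(F.col("age"))).explain(True)

# Built-in version: the CASE WHEN stays inside the whole-stage-codegen block
df.withColumn("category", F.when(F.col("age") < 18, "minor").otherwise("adult")).explain(True)
```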
🐍 Stop Writing "Spaghetti" Data Science Code

We've all been there: a Jupyter Notebook with 47 cells, variables named df2, df_final, and df_final_v2_FIXED, and a loop that takes three hours to run.

Data analysis is about insights, but your code quality determines how fast (and how reliably) you get them. Here are 4 Python best practices to move from "it works on my machine" to "production-ready."

1. Embrace Vectorization (Forget the for loops)
If you're iterating over a Pandas DataFrame with a loop, you're likely doing it wrong. Python's numpy and pandas are built on C—let them do the heavy lifting.
Bad: Using .iterrows() to calculate a new column.
Good: Use vectorized operations like df['new_col'] = df['a'] * df['b']. It's orders of magnitude faster.

2. The Magic of Method Chaining
Clean code is readable code. Instead of creating five intermediate DataFrames, chain your operations. It keeps your namespace clean and your logic linear.

```python
# Instead of multiple assignments, try this:
df_clean = (df
    .query('age > 18')
    .assign(name=lambda x: x['name'].str.upper())
    .groupby('region')
    .agg({'salary': 'mean'})
)
```

3. Type Hinting & Docstrings
Data types in Python are flexible, which is a blessing and a curse. Use Type Hints to tell your team exactly what a function expects.
def process_data(df: pd.DataFrame) -> pd.DataFrame:
It saves hours of debugging when someone tries to pass a list into a function expecting a Series.

4. Memory Management Matters
Working with "Big-ish" data? Downcast your numerics (e.g., float64 to float32). Convert object columns with low cardinality to category types. Your RAM (and your IT department) will thank you.

The Bottom Line: Great data analysis isn't just about the model accuracy; it's about the maintainability of the pipeline.

Which Python habit changed your workflow the most? Let's swap tips in the comments! 👇

#Python #DataScience #Pandas #DataAnalysis #CodingBestPractices #MachineLearning
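Since tip 4 is the only one without a snippet, here is a small sketch of what downcasting and category conversion look like; the column names and sizes are invented, and the actual savings depend on your data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "salary": np.random.rand(1_000_000) * 100_000,  # float64 by default
    "region": np.random.choice(["north", "south", "east", "west"], size=1_000_000),
})
print(df.memory_usage(deep=True).sum())  # baseline bytes

# Downcast wide numerics and convert low-cardinality strings to category
df["salary"] = pd.to_numeric(df["salary"], downcast="float")  # float64 -> float32
df["region"] = df["region"].astype("category")
print(df.memory_usage(deep=True).sum())  # noticeably smaller
```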
Nothing humbles you quite like a PySpark Exit code 137 (Out of Memory) on a massive EMR cluster. 😅

Recently, while processing millions of financial position records, I watched a seemingly simple query bring a cluster to its knees. The culprit? Using a standard Python @udf and collect_list() on a highly skewed dataset.

It's easy to just throw more hardware at a problem, but real scale requires understanding what the JVM is actually doing under the hood.

Three architectural shifts that completely changed the game for our pipeline:

1️⃣ Ditching Python UDFs: Rewriting string logic into native PySpark functions (F.when().rlike()) keeps execution entirely within the JVM. The speed difference is staggering.

2️⃣ Swapping collect_list() for collect_set(): Dropping duplicates in memory during the shuffle phase prevents a single skewed partition from blowing up the node's RAM.

3️⃣ Leveraging Window Functions: Replacing massive .groupby().max() aggregations with Window.partitionBy() prevents destructive shuffles and keeps row-level data intact.

Data Engineering isn't just about moving data from A to B. It's about understanding the execution plan.

What's a performance tuning lesson you had to learn the hard way? 👇
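For anyone curious what the third shift looks like in code, here is roughly the shape of it; the DataFrame and column names are invented for illustration, not taken from the original pipeline:

```python
from pyspark.sql import Window, functions as F

# Instead of a groupby().max() plus a join back to recover the full rows,
# rank rows within each account and keep the latest position directly.
w = Window.partitionBy("account_id").orderBy(F.col("position_date").desc())

latest_positions = (
    positions_df  # assumed input DataFrame of position records
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
```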
🔥 𝐒𝐜𝐚𝐥𝐚 𝐯𝐬 𝐏𝐲𝐭𝐡𝐨𝐧 𝐟𝐨𝐫 𝐀𝐩𝐚𝐜𝐡𝐞 𝐒𝐩𝐚𝐫𝐤 — 𝐖𝐡𝐢𝐜𝐡 𝐎𝐧𝐞 𝐒𝐡𝐨𝐮𝐥𝐝 𝐘𝐨𝐮 𝐂𝐡𝐨𝐨𝐬𝐞 𝐢𝐧 𝟐𝟎𝟐𝟔?

⚡ Scala vs Python (Spark Edition):

🔹 Performance 🚀
👉 Scala = Faster (native Spark language)
👉 Python = Slightly slower (PySpark overhead)

🔹 Learning Curve 📚
👉 Scala = Harder (functional + JVM concepts)
👉 Python = Easy (beginner-friendly)

🔹 Concurrency ⚙️
👉 Scala = Strong (Akka, parallelism)
👉 Python = Limited (GIL constraints)

🔹 Type System 🧠
👉 Scala = Statically typed
👉 Python = Dynamically typed

🔹 Ease of Use 💡
👉 Scala = Verbose
👉 Python = Clean & simple

🔹 Ecosystem 🔥
👉 Scala = Deep Spark integration
👉 Python = Huge ML/AI libraries (Pandas, TensorFlow)

🎯 So… which one should YOU choose?

👉 Choose Scala if:
✔ You want performance & deep Spark optimization
✔ Working on large-scale production pipelines

👉 Choose Python if:
✔ You want faster development
✔ Working on Data Science / ML

💬 Question: Which one do you prefer for Spark — Scala or Python? 👇

#ApacheSpark #Scala #Python #PySpark #BigData #DataEngineering #DataEngineer #MachineLearning #Coding #TechCareers #LearnToCode
🚀 Day 2/20 — Python for Data Engineering
Understanding Data Types (Lists, Tuples, Sets, Dictionaries)

After understanding why Python is important, the next step is knowing how Python stores and works with data.

🔹 Why Data Types Matter?
In data engineering, we constantly deal with:
structured data
collections of records
key-value mappings
👉 Choosing the right data type makes processing easier and efficient.

🔹 Common Data Types:

📌 Lists
numbers = [3, 7, 1, 9]
names = ["Alice", "Bob"]
👉 Ordered and changeable
👉 Useful for processing sequences

📌 Tuples
point = (3, 4)
values = ("Alice", 95)
👉 Ordered but immutable
👉 Useful for fixed data

📌 Sets
unique_numbers = {3, 7, 1, 9}
👉 Unordered, no duplicates
👉 Useful for removing duplicates

📌 Dictionaries
employee = {"name": "Alice", "salary": 50000}
👉 Key-value pairs
👉 Useful for lookup and mapping

🔹 Where You'll Use Them
Lists → processing rows of data
Tuples → fixed records
Sets → removing duplicates
Dictionaries → mapping & transformations

💡 Quick Summary
Different data types serve different purposes. Choosing the right one helps you write better and cleaner code.

💡 Something to remember
Data types are not just syntax. They define how efficiently you handle data.

#Python #DataEngineering #DataAnalytics #LearningInPublic #TechLearning #Databricks
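A tiny sketch of how these show up together in a typical transform step; the records and the mapping are made up for illustration:

```python
# Raw records as tuples (fixed shape), including one exact duplicate
records = [("alice", "NY", 50000), ("bob", "LDN", 62000), ("alice", "NY", 50000)]

# Set: drop exact duplicate rows
unique_records = set(records)

# Dictionary: lookup table used for an enrichment / mapping step
region_names = {"NY": "New York", "LDN": "London"}

# List: the transformed output, built row by row
enriched = []
for name, code, salary in unique_records:
    enriched.append({"name": name, "region": region_names[code], "salary": salary})

print(enriched)
```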