I migrated our 50GB Pandas pipeline to Polars. The difference shocked me.

Our daily ETL was taking 4+ hours and burning through memory like crazy. The team was getting frustrated with constant OOM errors. I'd heard whispers about Polars but was skeptical. Another "revolutionary" tool? 🙄 But desperate times called for desperate measures.

Here's what I learned during the 3-week migration:

1. **Memory usage dropped 70%** - Polars' lazy evaluation only loads what it needs
2. **Query optimization is automatic** - No more manual .query() tweaking
3. **Parallel processing works out of the box** - Unlike Pandas' single-threaded operations
4. **The .lazy() API feels familiar** - Most Pandas logic translated smoothly
5. **Arrow backend makes file I/O lightning fast** - Parquet reads went from 20min to 4min ⚡

The real game-changer? Our pipeline now runs in 45 minutes instead of 4+ hours. My manager asked why we didn't switch sooner 😅

The syntax learning curve was maybe 2 days. The performance gains were immediate. Sure, Pandas has a massive ecosystem. But for pure data processing at scale, Polars is becoming my go-to.

One warning though - debugging can be trickier with lazy evaluation. Plan accordingly! 🚨

What's been your experience with Polars? Still team Pandas or making the switch? 🤔

#DataEngineering #Python #Polars #Pandas #ETL #DataProcessing #BigData #Performance #DataScience #Analytics #TechMigration #DataPipeline
I will definitely try Polars! Thanks for sharing your experience, Naveen Kumar
Great to read this practical migration story. A follow-up case study on debugging with lazy evaluation (with any sensitive data suitably scrubbed) would be very helpful.