Using Pandas, Polars, and DuckDB for the same task

As a regular Pandas user, I am usually satisfied with its performance. It can handle millions of records in minutes on my humble laptop, and I'm okay with waiting. However, fancy new tools keep coming out, and I am curious to explore them.

I usually use IMDb datasets for testing and demos, so today, I undertook a familiar routine: loading each TSV (tab-separated values) file in a directory and saving it as Parquet. The only difference was that I did it with three different tools: Pandas, Polars, and DuckDB.

IMDb's TSV files have some quirks:

  • While they are mostly like normal CSV files, the separator is a tab instead of a comma.
  • Numeric columns (years) mix in '\N' values; we can either replace them with nulls or read those columns as strings (I go with the second option).
  • Double quotes appear inside column values, so quoting has to be disabled explicitly.

Pandas

I'm starting with Pandas as the best-known tool. A simple version of the code looks like this:



The name.basics.tsv file contains 13,774,705 rows, and my four-year-old Apple MacBook M1 runs the script in 15-20 seconds.

Polars

Now, let's take a look at Polars. Coming from Pandas, the Polars API is intuitive, and I had no problems using it. The code came out a little longer because read_csv has no dtype parameter to override all column types at once, but this is not a problem:



This code runs in 14-15 seconds.

DuckDB

With DuckDB, you can use SQL:



Alternatively, you can use the SDK methods:



In the last example, no table is created: duckdb.read_csv returns a lazy relation that is only evaluated when we call .to_parquet. That makes execution even faster: around six seconds if we create a table, and four if we don't.

DuckDB's API is sometimes surprising. For example, read_csv has both sep and delim parameters for the same thing, which is confusing when you scroll through the method's parameter list, especially because the Python SDK has no docstrings. Also, when something goes wrong, you see output like this:

  delimiter =    (Set By User)        

So you have sep and delim, but the error message calls it delimiter.

But there is nothing to complain about seriously: the API is rich, and the tool works great.

Loading all IMDb datasets

In the next step, I prepared scripts that load each IMDb TSV file in a directory and save it as Parquet. I won't overload this article with code, but you can find the scripts here: https://github.com/romaklimenko/imdb

The performance results were surprising. I expected Polars to do the job faster than Pandas, but it was significantly slower. The average running times for each script are:

  • Pandas: 196 seconds
  • Polars: 257 seconds
  • DuckDB: 46 seconds

DuckDB is way ahead of the others, while Pandas and Polars each win and lose in different categories. For example, title.principals.tsv has 87,769,634 rows:

  • Pandas: load TSV - 49 seconds, write to Parquet - 32 seconds.
  • Polars: load TSV - 140 seconds, write to Parquet - 10 seconds.

But take title.basics.tsv, which has 11,054,773 rows:

  • Pandas: load TSV - 14 seconds, write to Parquet - 6 seconds.
  • Polars: load TSV - 6 seconds, write to Parquet - 4 seconds.

Conclusion

Benchmarking tools on a single scenario is a thankless task. Sure, there are cases where Polars is faster than Pandas (see https://duckdblabs.github.io/db-benchmark/), but the difference is not so obvious on "small" datasets of a few million records.

I hope both products find their niche; since the Pandas and Polars APIs are similar, it is easy to switch between them.

DuckDB shows impressive performance and solves many problems for which I would otherwise have to run a database server. Its API is not very polished yet, but that is easy to tolerate given its good features.

I'm looking forward to using these new tools for my daily tasks.


