Avoid using apply() on big dataframes

I've been using Pandas for a while now, but recently I learned something new: if you're dealing with large data sets stored as dataframes, you should avoid using the apply() method whenever possible.

As a reminder, the Pandas apply() method takes a function as a parameter and applies that function's logic to each row or column of a data frame. The result of applying the function to the rows or columns can then be assigned to a new column. On the face of it, it's a really useful built-in method ... but there is a catch: it can be really sloooooooow.
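To make that concrete, here's a minimal sketch of typical apply() usage on a toy data frame - the column names and values here are invented purely for illustration:

import pandas as pd

# A toy data frame - columns and values invented for this sketch
demo = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [4.0, 5.0, 6.0]})

# axis=1 passes each row to the function in turn; the resulting
# Series can be assigned to a new column
demo['total'] = demo.apply(lambda row: row['x'] + row['y'], axis=1)
print(demo)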

In my particular use case, I was processing large climate-related data sets stored in a Pandas data frame. I needed to add a new column to the data frame based on a somewhat complicated manipulation of two other existing columns.

The main issue was that my input data set had almost 25 million records for a single day's worth of data, and I needed to process ten years' worth. When I used the apply() method, the calculation took around 7 minutes for just one day! For a ten-year period, that meant performing the calculation around 10 * 12 * 30 = 3,600 times, and that wasn't going to fly. So what did I do? Simple: I performed the operations I needed directly on the data frame columns rather than via a function call - this is sometimes known as vectorisation.

The following, simpler, example shows what I mean. First, let's create a decent-sized input data set to work with. Ten million records ought to be enough: two columns of random numeric data with values between 5 and 10. We'll populate a Pandas data frame with this data and then create a third column that is the product of the other two.

import numpy as np
import pandas as pd
import datetime

# Our input dataframe - 10 million records, 2 columns of random numeric data
arr_random = np.random.default_rng().uniform(low=5, high=10, size=[10000000, 2])

df = pd.DataFrame(arr_random, columns=["x", "y"])

df

                x         y
0        6.093981  5.023192
1        6.982621  5.229063
2        8.987897  5.881186
...           ...       ...
9999997  8.356264  6.955952
9999998  8.007880  9.766712
9999999  8.006118  6.511490

10000000 rows × 2 columns

First, the apply() method, which takes just over 82 seconds to complete.


# We just want to create a third column that is the product
# of the other two

# Using apply - slow method of creating a new column based
# on the values of the existing columns
print(datetime.datetime.now())

df['prod'] = df.apply(lambda row: row['x'] * row['y'], axis=1)

print(datetime.datetime.now())

df

2022-11-22 20:52:52.133315
2022-11-22 20:54:14.838901

                x         y       prod
0        6.093981  5.023192  30.611233
1        6.982621  5.229063  36.512570
2        8.987897  5.881186  52.859491
...           ...       ...        ...
9999997  8.356264  6.955952  58.125775
9999998  8.007880  9.766712  78.210654
9999999  8.006118  6.511490  52.131762

10000000 rows × 3 columns

Now the vectorised method, which takes well under 1 second.


# Use vectorisation - runs in < 1 second
print(datetime.datetime.now())

df['prod'] = df['x'] * df['y']

print(datetime.datetime.now())

df


2022-11-22 20:55:24.501812
2022-11-22 20:55:24.534479

                x         y       prod
0        6.093981  5.023192  30.611233
1        6.982621  5.229063  36.512570
2        8.987897  5.881186  52.859491
...           ...       ...        ...
9999997  8.356264  6.955952  58.125775
9999998  8.007880  9.766712  78.210654
9999999  8.006118  6.511490  52.131762

10000000 rows × 3 columns
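As an aside, bracketing the call with datetime.now() works fine, but if you want a more repeatable measurement, Python's built-in timeit module can average over several runs. A minimal sketch, assuming the 10-million-row df from above is already in scope:

import timeit

# One run of the slow apply() version, five runs of the vectorised one
apply_time = timeit.timeit(
    lambda: df.apply(lambda row: row['x'] * row['y'], axis=1), number=1)
vector_time = timeit.timeit(lambda: df['x'] * df['y'], number=5)

print(f"apply:      {apply_time:.2f}s (1 run)")
print(f"vectorised: {vector_time / 5:.4f}s (mean of 5 runs)")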


So does that mean you should replace apply() everywhere with vectorised code? In general, I would say yes if you can, but it all depends on your use case - in particular, the data volumes you're dealing with and how complicated your logic is.
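For instance, even row-wise logic with an if/else in it - the kind of thing that looks like it needs apply() - can often be vectorised using numpy's where(). A quick sketch, where the threshold and the new column name are invented for illustration:

import numpy as np

# Row-wise version: double x whenever y exceeds 7
df['adjusted'] = df.apply(
    lambda row: row['x'] * 2 if row['y'] > 7 else row['x'], axis=1)

# Equivalent vectorised version: np.where evaluates the condition
# across the whole column at once
df['adjusted'] = np.where(df['y'] > 7, df['x'] * 2, df['x'])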

For me it was a no-brainer: processing thousands of data frames, each containing around 25 million records, had to be done quickly, and switching to vectorisation reduced my processing time from 7 minutes per day of data to a handful of seconds.

Ok, that's all I have for now - a short but hopefully illuminating article. If you found it useful please like, comment and share to help spread the knowledge around.
