Avoid using apply() on big dataframes

I've been using Pandas for a while now, but recently I learned something new: if you're dealing with large data sets stored as dataframes, you should avoid using the apply() method whenever possible.

As a reminder, the Pandas apply() method takes a function as a parameter and applies that function's logic to each row or column of a data frame. The result of applying the function to the rows or columns can then be assigned to a new column. On the face of it, it's a really useful built-in method ... but there is a catch: it can be really sloooooooow.
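To make that concrete, here's a minimal sketch of typical apply() usage on a toy data frame - the column names and values here are invented purely for illustration:

import pandas as pd

# A toy data frame - columns and values invented for this sketch
demo = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [4.0, 5.0, 6.0]})

# axis=1 passes each row to the function in turn; the resulting
# Series can be assigned to a new column
demo['total'] = demo.apply(lambda row: row['x'] + row['y'], axis=1)
print(demo)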

In my particular use case, I was processing large climate-related data sets stored in a Pandas data frame. I needed to add a new column to the data frame based on a somewhat complicated manipulation of two other existing columns.

The main issue was that my input data set had almost 25 million records for a single day's worth of data, and I needed to process ten years' worth. When I used the apply() method, the calculation took around 7 minutes for just one day! For a ten-year period, that meant performing the calculation around 10 * 12 * 30 = 3,600 times, and that wasn't going to fly. So what did I do? Simple: I performed the operations I needed directly on the data frame columns rather than via a function call - this is sometimes known as vectorisation.

The following, simpler, example shows what I mean. First, let's create a decent-sized input data set to work with. Ten million records ought to be enough: two columns of random numeric data with values between 5 and 10. We'll populate a Pandas data frame with this data and then create a third column that is the product of the other two.

import numpy as np
import pandas as pd
import datetime

# Our input dataframe - 10 million records, 2 columns of random numeric data
arr_random = np.random.default_rng().uniform(low=5, high=10, size=[10000000, 2])

df = pd.DataFrame(arr_random, columns=["x", "y"])

df

                x         y
0        6.093981  5.023192
1        6.982621  5.229063
2        8.987897  5.881186
...           ...       ...
9999997  8.356264  6.955952
9999998  8.007880  9.766712
9999999  8.006118  6.511490

10000000 rows × 2 columns

First, the apply() method, which takes just over 82 seconds to complete.


# We just want to create a third column that is the product
# of the other two

# Using apply - slow method of creating a new column based
# on the values of the existing columns
print(datetime.datetime.now())

df['prod'] = df.apply(lambda row: row['x'] * row['y'], axis=1)

print(datetime.datetime.now())

df

2022-11-22 20:52:52.133315
2022-11-22 20:54:14.838901

                x         y       prod
0        6.093981  5.023192  30.611233
1        6.982621  5.229063  36.512570
2        8.987897  5.881186  52.859491
...           ...       ...        ...
9999997  8.356264  6.955952  58.125775
9999998  8.007880  9.766712  78.210654
9999999  8.006118  6.511490  52.131762

10000000 rows × 3 columns

Now the vectorised method, which takes well under 1 second.


# Use vectorisation - runs in < 1 second
print(datetime.datetime.now())

df['prod'] = df['x'] * df['y']

print(datetime.datetime.now())

df


2022-11-22 20:55:24.501812
2022-11-22 20:55:24.534479

                x         y       prod
0        6.093981  5.023192  30.611233
1        6.982621  5.229063  36.512570
2        8.987897  5.881186  52.859491
...           ...       ...        ...
9999997  8.356264  6.955952  58.125775
9999998  8.007880  9.766712  78.210654
9999999  8.006118  6.511490  52.131762

10000000 rows × 3 columns
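As an aside, bracketing the call with datetime.now() works fine, but if you want a more repeatable measurement, Python's built-in timeit module can average over several runs. A minimal sketch, assuming the 10-million-row df from above is already in scope:

import timeit

# One run of the slow apply() version, five runs of the vectorised one
apply_time = timeit.timeit(
    lambda: df.apply(lambda row: row['x'] * row['y'], axis=1), number=1)
vector_time = timeit.timeit(lambda: df['x'] * df['y'], number=5)

print(f"apply:      {apply_time:.2f}s (1 run)")
print(f"vectorised: {vector_time / 5:.4f}s (mean of 5 runs)")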


So does that mean you should replace apply() everywhere with vectorised code? In general, I would say yes if you can, but it all depends on your use case - in particular, the data volumes you're dealing with and how complicated your logic is.
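For instance, even row-wise logic with an if/else in it - the kind of thing that looks like it needs apply() - can often be vectorised using numpy's where(). A quick sketch, where the threshold and the new column name are invented for illustration:

import numpy as np

# Row-wise version: double x whenever y exceeds 7
df['adjusted'] = df.apply(
    lambda row: row['x'] * 2 if row['y'] > 7 else row['x'], axis=1)

# Equivalent vectorised version: np.where evaluates the condition
# across the whole column at once
df['adjusted'] = np.where(df['y'] > 7, df['x'] * 2, df['x'])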

For me it was a no-brainer: processing thousands of data frames, each containing around 25 million records, had to be done quickly, and switching to vectorisation reduced my processing time from 7 minutes per day of data to a handful of seconds.

Ok, that's all I have for now - a short but hopefully illuminating article. If you found it useful please like, comment and share to help spread the knowledge around.
