The Importance of Benchmarking in Data Engineering
As a data engineer, you're concerned with the performance of jobs. People will ask you how long a given job will take, and you're expected to have an answer.
On a recent project, we were using magic commands to load data directly from a BigQuery table into a Pandas DataFrame (the data scientists were only comfortable with the Pandas API). However, that approach took over an hour even for a small amount of data. Given permission issues and time constraints, we worked around it by first exporting the BigQuery table as Parquet files to a GCS bucket, then copying the data from the bucket to a Kubernetes PVC, and finally reading it into a Pandas DataFrame.
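A minimal sketch of that pipeline might look like the following. The project, table, bucket, and mount path names are made up for illustration, and the exact calls on our project differed:

```python
from google.cloud import bigquery, storage
import pandas as pd

# Hypothetical names for illustration only.
PROJECT = "my-project"
TABLE = "my-project.my_dataset.my_table"
BUCKET = "my-bucket"
PVC_MOUNT = "/mnt/data"  # path where the Kubernetes PVC is mounted

# 1. Export the BigQuery table to Parquet files in GCS.
bq = bigquery.Client(project=PROJECT)
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.PARQUET
)
bq.extract_table(
    TABLE, f"gs://{BUCKET}/export/my_table-*.parquet", job_config=extract_config
).result()  # wait for the export job to finish

# 2. Copy the Parquet files from the GCS bucket onto the PVC.
gcs = storage.Client(project=PROJECT)
for blob in gcs.list_blobs(BUCKET, prefix="export/"):
    blob.download_to_filename(f"{PVC_MOUNT}/{blob.name.rsplit('/', 1)[-1]}")

# 3. Read the local Parquet files into a Pandas DataFrame.
df = pd.read_parquet(PVC_MOUNT)
```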
By benchmarking how long each of these steps took, we were able to optimize them further. And we don't want to benchmark only scheduled ETL jobs, but any kind of data transfer. Nor do we want to track only how long something takes; we also want to retain other context, such as how much data was moved and how much memory it required.
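One lightweight way to capture this, shown here only as a sketch (the helper name and record fields are made up for illustration), is a small context manager that records elapsed time alongside whatever metadata you care about:

```python
import json
import time
import tracemalloc
from contextlib import contextmanager

benchmarks = []  # accumulate one record per step

@contextmanager
def benchmark(step, **metadata):
    """Record wall-clock time and peak Python memory for a block of code."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        benchmarks.append(
            {"step": step, "seconds": elapsed, "peak_bytes": peak, **metadata}
        )

# Example usage with a made-up step name:
with benchmark("read_parquet", path="/mnt/data", file_size_gb=12):
    pass  # e.g. df = pd.read_parquet("/mnt/data")

print(json.dumps(benchmarks, indent=2))
```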
Suppose we were trying to load a 12 GB Parquet file on disk with Pandas. Would a node with 50 GB of memory be sufficient? After all, a 12 GB Parquet file on disk contains a lot more data than a 12 GB CSV file.
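The Parquet footer already stores the uncompressed size of each row group, so one rough way to answer that question without loading anything, assuming pyarrow is installed and using a hypothetical path, is:

```python
import pyarrow.parquet as pq

# Hypothetical path for illustration.
meta = pq.ParquetFile("/mnt/data/my_table.parquet").metadata

# Sum the uncompressed size of every row group as a lower bound on the
# in-memory footprint (Pandas object/string columns typically take more).
uncompressed_bytes = sum(
    meta.row_group(i).total_byte_size for i in range(meta.num_row_groups)
)
print(f"~{uncompressed_bytes / 1e9:.1f} GB uncompressed across {meta.num_rows:,} rows")
```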
When we attempt to read the Parquet file into a Pandas DataFrame, it will expand significantly in memory and could result in an OOM (Out Of Memory) error. Not only is the data in a Parquet file compressed, but if Pandas doesn't know the type of a field, it will allocate far more memory than necessary. That's why it's important to specify data types explicitly: for example, a string column with only a few distinct values should be stored as a category type.
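As a rough sketch, with made-up file and column names, explicit dtypes can cut the in-memory footprint considerably, and `memory_usage(deep=True)` lets you compare the result against the size on disk:

```python
import os
import pandas as pd

# Hypothetical CSV file and column names for illustration.
path = "/mnt/data/events.csv"

df = pd.read_csv(
    path,
    dtype={
        "country": "category",   # low-cardinality string -> category
        "status": "category",
        "user_id": "int64",
        "amount": "float32",
    },
    parse_dates=["event_ts"],
)

in_memory_gb = df.memory_usage(deep=True).sum() / 1e9
on_disk_gb = os.path.getsize(path) / 1e9
print(f"{on_disk_gb:.1f} GB on disk -> {in_memory_gb:.1f} GB in memory")
```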
If we're running into OOM errors, we can try reading only a subset of the columns, reading a subset of the rows, increasing the RAM available to the server, and so on. It's important to keep track of all the different approaches we've tried.
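For the first two of those options, a sketch along these lines (again with hypothetical paths and column names) keeps memory bounded by pulling only what's needed:

```python
import pandas as pd
import pyarrow.parquet as pq

path = "/mnt/data/my_table.parquet"  # hypothetical path

# Read only the columns we actually need.
df = pd.read_parquet(path, columns=["user_id", "amount", "event_ts"])

# Or stream the file in row batches instead of loading it all at once.
for batch in pq.ParquetFile(path).iter_batches(batch_size=1_000_000):
    chunk = batch.to_pandas()
    # ... process or aggregate each chunk here ...
```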
How do you approach benchmarking? Leave a comment below.