The Importance of Benchmarking in Data Engineering
As a data engineer, you're concerned with the performance of jobs. People will ask you how long a given job will take, and you're expected to have an answer.
On a recent project, we were using magic commands to load data directly from a BigQuery table into a Pandas DataFrame (the data scientists were only comfortable with the Pandas API). However, that approach took over an hour even for a small amount of data. Given permission issues and time constraints, we worked around it by first exporting the BigQuery table as Parquet files to a GCS bucket, then copying the data from the bucket to a Kubernetes PVC, and finally reading it into a Pandas DataFrame.
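A minimal sketch of that pipeline might look like the following. The project, table, bucket, and mount path names are made up for illustration, and the exact calls on our project differed:

```python
from google.cloud import bigquery, storage
import pandas as pd

# Hypothetical names for illustration only.
PROJECT = "my-project"
TABLE = "my-project.my_dataset.my_table"
BUCKET = "my-bucket"
PVC_MOUNT = "/mnt/data"  # path where the Kubernetes PVC is mounted

# 1. Export the BigQuery table to Parquet files in GCS.
bq = bigquery.Client(project=PROJECT)
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.PARQUET
)
bq.extract_table(
    TABLE, f"gs://{BUCKET}/export/my_table-*.parquet", job_config=extract_config
).result()  # wait for the export job to finish

# 2. Copy the Parquet files from the GCS bucket onto the PVC.
gcs = storage.Client(project=PROJECT)
for blob in gcs.list_blobs(BUCKET, prefix="export/"):
    blob.download_to_filename(f"{PVC_MOUNT}/{blob.name.rsplit('/', 1)[-1]}")

# 3. Read the local Parquet files into a Pandas DataFrame.
df = pd.read_parquet(PVC_MOUNT)
```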
By benchmarking how long each of these steps took, we were able to optimize them further. And we don't want to benchmark only scheduled ETL jobs, but any kind of data transfer. Nor do we want to track only how long something takes; we also want to retain other context, such as how much data was moved and how much memory it required.
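One lightweight way to capture this, shown here only as a sketch (the helper name and record fields are made up for illustration), is a small context manager that records elapsed time alongside whatever metadata you care about:

```python
import json
import time
import tracemalloc
from contextlib import contextmanager

benchmarks = []  # accumulate one record per step

@contextmanager
def benchmark(step, **metadata):
    """Record wall-clock time and peak Python memory for a block of code."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        benchmarks.append(
            {"step": step, "seconds": elapsed, "peak_bytes": peak, **metadata}
        )

# Example usage with a made-up step name:
with benchmark("read_parquet", path="/mnt/data", file_size_gb=12):
    pass  # e.g. df = pd.read_parquet("/mnt/data")

print(json.dumps(benchmarks, indent=2))
```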
Suppose we were trying to load a 12 GB Parquet file on disk with Pandas. Would a node with 50 GB of memory be sufficient? After all, a 12 GB Parquet file on disk contains a lot more data than a 12 GB CSV file.
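The Parquet footer already stores the uncompressed size of each row group, so one rough way to answer that question without loading anything, assuming pyarrow is installed and using a hypothetical path, is:

```python
import pyarrow.parquet as pq

# Hypothetical path for illustration.
meta = pq.ParquetFile("/mnt/data/my_table.parquet").metadata

# Sum the uncompressed size of every row group as a lower bound on the
# in-memory footprint (Pandas object/string columns typically take more).
uncompressed_bytes = sum(
    meta.row_group(i).total_byte_size for i in range(meta.num_row_groups)
)
print(f"~{uncompressed_bytes / 1e9:.1f} GB uncompressed across {meta.num_rows:,} rows")
```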
When we attempt to read the Parquet file into a Pandas DataFrame, it will expand significantly in memory and could result in an OOM (Out Of Memory) error. Not only is the data in a Parquet file compressed, but if Pandas doesn't know the type of a field, it will allocate far more memory than necessary. That's why it's important to specify data types explicitly: for example, a string column with only a few distinct values should be stored as a category type.
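As a rough sketch, with made-up file and column names, explicit dtypes can cut the in-memory footprint considerably, and `memory_usage(deep=True)` lets you compare the result against the size on disk:

```python
import os
import pandas as pd

# Hypothetical CSV file and column names for illustration.
path = "/mnt/data/events.csv"

df = pd.read_csv(
    path,
    dtype={
        "country": "category",   # low-cardinality string -> category
        "status": "category",
        "user_id": "int64",
        "amount": "float32",
    },
    parse_dates=["event_ts"],
)

in_memory_gb = df.memory_usage(deep=True).sum() / 1e9
on_disk_gb = os.path.getsize(path) / 1e9
print(f"{on_disk_gb:.1f} GB on disk -> {in_memory_gb:.1f} GB in memory")
```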
If we're running into OOM errors, we can try reading only a subset of the columns, reading a subset of the rows, increasing the RAM available to the server, and so on. It's important to keep track of all the different approaches we've tried.
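For the first two of those options, a sketch along these lines (again with hypothetical paths and column names) keeps memory bounded by pulling only what's needed:

```python
import pandas as pd
import pyarrow.parquet as pq

path = "/mnt/data/my_table.parquet"  # hypothetical path

# Read only the columns we actually need.
df = pd.read_parquet(path, columns=["user_id", "amount", "event_ts"])

# Or stream the file in row batches instead of loading it all at once.
for batch in pq.ParquetFile(path).iter_batches(batch_size=1_000_000):
    chunk = batch.to_pandas()
    # ... process or aggregate each chunk here ...
```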
How do you approach benchmarking? Leave a comment below.