The Importance of Benchmarking in Data Engineering

As a data engineer, you're concerned with the performance of jobs. People will ask you how long a given job will take, and you're expected to have an answer.

On a recent project, we were using magic commands to load data directly from a BigQuery table into a Pandas DataFrame (the data scientists were only comfortable using the Pandas API). However, this took over an hour even for a small amount of data. Due to permission issues and time constraints, we worked around it in stages: we first exported the BigQuery table as Parquet files, then moved the data from a GCS bucket to a Kubernetes PVC, and finally read the files into a Pandas DataFrame from there.
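The staged workaround can be sketched with the standard Google Cloud and Kubernetes CLIs. The project, dataset, bucket, namespace, pod, and mount path names below are all hypothetical placeholders, not the ones from the actual project:

```shell
# Export the BigQuery table to Parquet files in a GCS bucket.
bq extract --destination_format=PARQUET \
  'my-project:my_dataset.my_table' \
  'gs://my-bucket/export/*.parquet'

# Copy the Parquet files from GCS to the local machine.
gsutil -m cp 'gs://my-bucket/export/*.parquet' ./export/

# Push the files onto the Kubernetes PVC via a pod that mounts it.
kubectl cp ./export my-namespace/my-pod:/mnt/pvc/export
```

From inside the pod, the files on the PVC can then be read into a Pandas DataFrame as local Parquet.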

We don't want to benchmark only scheduled ETL jobs, but any kind of data transfer. By benchmarking how long each step took, we were able to optimize them further. However, we don't want to track only how long something takes; we want to retain other information like:

  • File format
  • File size on disk
  • Number of rows in the dataset
  • Number of columns in the dataset
  • Number of cores on the node
  • Amount of RAM on the node
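A minimal sketch of capturing that metadata alongside the timing, using only the standard library (the `benchmark` helper and its field names are my own invention, not an established API; RAM could be added via `psutil.virtual_memory().total` if psutil is installed):

```python
import json
import os
import time
from contextlib import contextmanager


@contextmanager
def benchmark(step, path=None, n_rows=None, n_cols=None, log=None):
    """Time a block of code and record metadata alongside the duration."""
    start = time.perf_counter()
    record = {
        "step": step,
        # File format inferred from the extension, size taken from disk.
        "file_format": os.path.splitext(path)[1].lstrip(".") if path else None,
        "file_size_bytes": os.path.getsize(path) if path else None,
        "n_rows": n_rows,
        "n_cols": n_cols,
        "n_cores": os.cpu_count(),
    }
    try:
        yield record
    finally:
        record["duration_s"] = time.perf_counter() - start
        if log is not None:
            log.append(record)
        print(json.dumps(record))
```

A step is then wrapped as `with benchmark("read_parquet", path="export/part-0.parquet", log=results): ...`, and the accumulated records can be dumped to a file for comparison across runs.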

Suppose we were trying to load a Parquet file that takes up 12 GB on disk with Pandas. Would a node with 50 GB of RAM be sufficient? It's hard to say: a 12 GB Parquet file on disk contains a lot more data than a 12 GB CSV file.

When we attempt to read the Parquet file into a Pandas DataFrame, the data will expand significantly in memory and could result in an OOM (Out Of Memory) exception. Not only is the data in a Parquet file compressed, but if Pandas doesn't know the type of a field, it will allocate far more memory than necessary. It's for this reason that it's important to specify the data types. For example, a string column with only a few distinct values should be stored as the category type.
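The savings from the category type are easy to see on a hypothetical low-cardinality column (the data below is made up for illustration; when reading from disk, the same effect is achieved by passing `dtype={"col": "category"}` to `pd.read_csv`):

```python
import pandas as pd

# Three distinct values repeated 100,000 times: the kind of string
# column that benefits from the category dtype.
s_obj = pd.Series(["red", "green", "blue"] * 100_000)
s_cat = s_obj.astype("category")

# object dtype stores one Python string per row; category stores
# small integer codes plus the 3 distinct labels.
print(s_obj.memory_usage(deep=True))
print(s_cat.memory_usage(deep=True))
```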

If we encounter OOM exceptions, we can try reading a subset of the columns, reading a subset of the rows, increasing the RAM available to the server, and so on. It's important to keep track of all the different approaches we've tried.
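Pandas supports both of the subsetting approaches directly at read time. A small sketch using an in-memory CSV as a stand-in for a file too large to load whole (the column names are hypothetical; `pd.read_parquet` accepts a `columns=` argument for the same column-pruning trick):

```python
import io

import pandas as pd

# Fake 3-column CSV standing in for a file that won't fit in memory.
csv = io.StringIO("a,b,c\n" + "\n".join(f"{i},{i*2},{i*3}" for i in range(1_000)))

# Read only the columns we need, and only the first 100 rows.
df = pd.read_csv(csv, usecols=["a", "b"], nrows=100)
print(df.shape)
```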

How do you approach benchmarking? Leave a comment below.

#dataengineering #dataengineer #datascientist
