The Art of Testing: Ensuring Reliability in Data Science & Machine Learning

Testing plays a crucial role in any software project. It is the key to ensuring the reliability, accuracy, and performance of our solutions. Data science and machine learning projects are no different in that regard. Nevertheless, in practice I often see data science projects that struggle to set up decent test automation.

Some possible reasons I have observed:

  • Lack of Proper Knowledge: Data scientists, while skilled in various aspects of data analysis and modeling, may not always possess a comprehensive understanding of software testing methodologies, frameworks, or techniques.
  • Difficult-to-Test Tools: Popular tools like Jupyter Notebooks or point-and-click services from Azure, AWS, or GCP were not designed with testing in mind.
  • Difficulty in Defining Testable Units: In traditional software development, code units are often well-defined, and tests can be written to validate specific functions or modules. In data science and machine learning, the units may not be as clearly defined if the initial code was derived from Jupyter Notebooks, for example.
  • Focus on Exploratory Analysis: Data science projects often involve exploratory analysis, where the primary goal is to gain insights and uncover patterns in the data. This exploratory code is not a good fit for automated testing processes.
  • Emphasis on Validation through Metrics: Data science projects typically involve evaluating model performance using metrics such as accuracy or RMSE. The focus is often on validating the model's output against these metrics rather than implementing extensive unit or integration tests.

There is already a wealth of knowledge and resources on testing principles and best practices in software development, and much of it applies to data projects as well. Rather than reiterating this existing knowledge, I will focus on examples of how we approach testing data science and machine learning solutions at BettercallPaul.

Fast, Narrow & Sociable

I am a fan of fast, narrow and sociable tests: Tests should run quickly, allowing for rapid feedback. Instead of broad, all-encompassing tests, narrow tests verify one specific behavior or function. At the same time, tests should be sociable: wherever possible, they avoid mocks so that components interact realistically with each other during the test.

Crucial Code: Unit Tests using PyTest

Any code that is crucial for the project and will be run regularly should be tested. For this kind of code we prefer normal Python modules over Jupyter notebooks. The PyTest library works great for us. We normally test all relevant code: feature creation, model training, validation, and serving. As we aim for fast and reliable tests, these normally run on small, synthetic data sets.
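
To make this concrete, here is a rough sketch of what such a unit test can look like. The feature function `add_rolling_mean` and its expected values are made up for this illustration and are not taken from a real project:

```python
import pandas as pd


def add_rolling_mean(df: pd.DataFrame, window: int = 3) -> pd.DataFrame:
    """Hypothetical feature: rolling mean of the 'value' column."""
    out = df.copy()
    out["value_rolling_mean"] = out["value"].rolling(window, min_periods=1).mean()
    return out


def test_add_rolling_mean_on_synthetic_data():
    # A tiny, synthetic input keeps the test fast and deterministic.
    df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]})

    result = add_rolling_mean(df, window=2)

    assert "value_rolling_mean" in result.columns
    # The first row has no predecessor, so its rolling mean equals the value itself.
    assert result["value_rolling_mean"].tolist() == [1.0, 1.5, 2.5, 3.5]
```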

Notebooks: Ensure they run without errors

Normally we do not "test" our notebooks. However, we check that they still execute without errors using nbmake. Otherwise, they would typically stop working after a few months because the code or the libraries have evolved in the meantime. To keep these tests fast, we normally have a feature switch at the top of the notebook (e.g. `ACTIVATE_FAST_TEST=True`). When it is activated (the default), the data is massively filtered to allow for fast execution.
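
As a sketch, the first cell of such a notebook might look like the following; the data path and the filtering logic are placeholders, only the switch variable follows the pattern described above:

```python
# First notebook cell: feature switch for fast automated runs.
ACTIVATE_FAST_TEST = True  # default: keep the automated check fast

import pandas as pd

# Hypothetical data source; the path is only an illustration.
df = pd.read_parquet("data/raw/transactions.parquet")

if ACTIVATE_FAST_TEST:
    # Massively filter the data so the whole notebook runs in seconds.
    df = df.head(1_000)
```

The notebooks can then be executed as part of the test suite, e.g. with `pytest --nbmake notebooks/`.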

Automate Testing using CI/CD pipelines

An automated CI/CD pipeline (e.g. using GitLab CI/CD or GitHub Actions) ensures that all tests run whenever someone changes the code. It is important to fix a failing pipeline as soon as possible. Otherwise, the problems will only get bigger with each commit.
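
As an illustrative sketch, a minimal GitHub Actions workflow could look like this; the file name, Python version, and paths are assumptions and will differ from project to project:

```yaml
# .github/workflows/tests.yml -- illustrative sketch, not a drop-in config
name: tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest tests/                # fast, narrow unit tests
      - run: pytest --nbmake notebooks/   # notebooks still execute without errors
```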

Measure Test Coverage

Fixating on test coverage as a KPI can result in really bad tests. But when used properly, it can give good hints about parts of the code that might still need some testing. We like to use pytest-cov.
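
For reference, a typical invocation might look like this; the package name `my_package` is a placeholder:

```bash
pytest --cov=my_package --cov-report=term-missing tests/
```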

Fight the nemesis: Data Leakage

A simple test case can save you from this enemy. The concrete implementation depends on the individual case. For a forecasting model, for example, one can randomly modify the raw data starting at a point in time and check whether features before this point in time remain unaffected by the modification.
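
One possible sketch of such a test, assuming a hypothetical `build_features` function that turns a raw, time-indexed DataFrame into purely backward-looking features:

```python
import numpy as np
import pandas as pd
import pandas.testing as pdt


def build_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Placeholder pipeline: lag and rolling-window features of 'value'."""
    feats = pd.DataFrame(index=raw.index)
    feats["lag_1"] = raw["value"].shift(1)
    feats["rolling_mean_7"] = raw["value"].rolling(7, min_periods=1).mean()
    return feats


def test_features_do_not_leak_future_data():
    rng = np.random.default_rng(42)
    idx = pd.date_range("2024-01-01", periods=60, freq="D")
    raw = pd.DataFrame({"value": rng.normal(size=60)}, index=idx)

    cutoff = pd.Timestamp("2024-02-01")
    features_before = build_features(raw).loc[raw.index < cutoff]

    # Randomly perturb all raw data at or after the cutoff ...
    modified = raw.copy()
    after_cutoff = modified.index >= cutoff
    modified.loc[after_cutoff, "value"] += rng.normal(size=after_cutoff.sum())

    # ... then the features strictly before the cutoff must remain unchanged.
    features_after = build_features(modified).loc[modified.index < cutoff]
    pdt.assert_frame_equal(features_before, features_after)
```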

Separate Testing Code and Testing Data

Tools like Pandera or Great Expectations can be used inside pipelines to check your data. However, I strongly advise separating the pipelines that load data into feature stores or train models from the pipelines that test your code. Loading and processing large amounts of data inevitably takes a lot of time and should be kept separate from the fast tests of your code.
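
For illustration, a minimal check with Pandera might look like this; the column names and value ranges are invented for the example:

```python
import pandas as pd
import pandera as pa

# Declare expectations about the data, not about the code.
schema = pa.DataFrameSchema(
    {
        "customer_id": pa.Column(int, pa.Check.ge(0)),
        "amount": pa.Column(float, pa.Check.in_range(0, 10_000)),
        "country": pa.Column(str, pa.Check.isin(["DE", "AT", "CH"])),
    }
)

df = pd.DataFrame(
    {"customer_id": [1, 2], "amount": [12.5, 99.0], "country": ["DE", "CH"]}
)

# Raises a SchemaError if the data violates the declared expectations.
schema.validate(df)
```

Such checks belong in the data pipelines; the fast code tests described above stay independent of real data.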

Embrace proper testing, but start simple!

This article provided a glimpse into some good testing practices in data science projects. It's important to note that there are other aspects, such as test-driven development or non-functional tests for runtime/model performance, privacy, and biases, which were not covered here. However, starting with basic automation and embracing proper testing practices is a crucial step towards building reliable and robust data science projects. As the field continues to evolve, it is my hope to see more and more data science projects incorporating automated tests, ensuring the delivery of high-quality and trustworthy solutions in the future.
