Let's test the Data pipeline
I recently came across a useful Python package for testing data called Great Expectations. We can use it for almost any data test or validation, such as ETL checks and DB schema tests, and it can be used in data pipelines orchestrated by tools like Airflow.
Data Pipeline tests are applied to data (instead of code) and at batch time (instead of compile or deploy time). Pipeline tests are like unit tests for datasets: they help you guard against upstream data changes and monitor data quality.
Why do we need pipeline tests?
Test-driven development has become a standard approach in the field of software development. Writing unit tests is part of good style and helps keep code quality high and avoid errors.
In the meantime, however, complexity often no longer manifests itself only in the code, but in the data as well. Unexpected behavior while running a machine-learning model may reflect genuinely anomalous data, or the data may simply be erroneous. Often, the cause is an uncommunicated change in the data model of the source system.
Machine-learning models themselves now consist of a pipeline with several transformation steps, such as:
- Loading data (from a raw-data source, a data warehouse, or another DBMS)
- Cleaning and preparation
- Aggregations
- Features
- Dimensionality reduction
- Normalization or scaling of data
Great Expectations helps teams save time and promote analytic integrity by offering a unique approach to automated testing: pipeline tests.
Python has established itself as a development language in the area of machine learning. The Great Expectations tool is a Python package, installable via pip or conda.
Key features:
- GE supports multiple data sources: Pandas (filesystem or S3), Spark dataframes, Databricks, and SQL databases (BigQuery, Redshift, MySQL, Postgres, MSSQL, Snowflake, etc.)
- We can define expectations in a configurable way without changing code (a minimal sketch follows the install commands below).
- Automated data profiling.
- We can configure GE's metadata stores on a SQL database or the filesystem.
pip install great-expectations
conda install conda-forge::great-expectations
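With the package installed, expectations can also be declared directly in Python using the classic Pandas-based API. This is only a minimal sketch; the CSV file and column names are hypothetical:

import great_expectations as ge

# wrap a CSV in a GE dataset so expectation methods become available on it
df = ge.read_csv("yellow_tripdata_sample.csv")

# each call checks the current data immediately and records the expectation
df.expect_column_values_to_not_be_null("passenger_count")
df.expect_column_values_to_be_between("fare_amount", min_value=0)

# run every recorded expectation against this batch
result = df.validate()
print(result.success)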
After installation of the Python package, each new project is started by:
great_expectations init
This creates a subdirectory called great_expectations: a skeleton directory for GE that holds the metadata stores, containing expectation suites, validation results, and configuration files.
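The generated skeleton looks roughly like this (the exact layout varies slightly between GE versions):

great_expectations/
  great_expectations.yml    # main project configuration
  expectations/             # expectation suites stored as JSON
  checkpoints/              # checkpoint configurations
  plugins/                  # custom extensions
  uncommitted/              # kept out of source control
    config_variables.yml    # credentials and other secrets
    data_docs/              # generated HTML documentation
    validations/            # validation results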
Initialize a Data Context
Configuring a data source
great_expectations datasource new
Configuring a data source for a flat file (.csv)
Configuring a data source for a relational database: you can configure any relational DB supported by SQLAlchemy, such as MySQL, Postgres, Redshift, Snowflake, BigQuery, etc.
conda install sqlalchemy
pip install sqlalchemy
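When the CLI prompts for connection details, it expects a SQLAlchemy connection string, for example (hypothetical host and credentials):

postgresql+psycopg2://user:password@localhost:5432/taxi_db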
Auto data profiling: this creates a default expectation suite with validations generated automatically from the data.
This opens a Jupyter notebook; just run all the cells in sequence. In cell 2, you can uncomment the columns you want to profile/test.
Auto data profiling report: the report below is created from the profiling results and opened in the default browser.
Edit the default validation suite by running the command below.
> great_expectations suite edit taxi_suite
You can refine the validations: remove the unwanted ones and add any you need. This command opens a Jupyter notebook; just follow the instructions and run the cells to edit your validation suite and run it against the same data. It also opens the validation report, which Great Expectations calls a Data Doc.
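Under the hood, the suite is stored as JSON in the expectations/ directory. A trimmed example, with a hypothetical column:

{
  "expectation_suite_name": "taxi_suite",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {
        "column": "fare_amount",
        "min_value": 0
      }
    }
  ]
}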
Validation report - Data Doc
The exercise explained here is shared at https://github.com/sanjaydub/GreatExpectations_DataPipelineTest
Integrations:
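Great Expectations is designed to slot into existing pipelines such as Airflow (mentioned above). As a minimal sketch using the classic API, a validation step can be wrapped in a plain Python function and called from an orchestrator task; the suite path and file names below are assumptions:

import great_expectations as ge

def validate_batch(csv_path):
    # wrap the incoming batch in a GE dataset
    df = ge.read_csv(csv_path)
    # validate against the suite saved by the CLI (path is an assumption)
    result = df.validate(
        expectation_suite="great_expectations/expectations/taxi_suite.json"
    )
    # fail the pipeline task if any expectation is not met
    if not result.success:
        raise ValueError("Data validation failed for {}".format(csv_path))
    return result

A function like this can be invoked, for example, from an Airflow PythonOperator so that a failed validation fails the DAG run.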
Learning notes:
Great Expectations is a fast-growing tool that allows comprehensive checks to ensure data quality for operating a machine-learning model. It has been developed with particular attention to providing the most generic possible framework, offering users many interfaces that allow them to adapt Great Expectations to their own project and extend it to their own needs. Available, for example, is a validation action that can automatically send a Slack notification after a validation run.
There is a great need to take data-quality testing into the world of modern data sources and development approaches with the help of a modern tool. It is therefore worth getting to know Great Expectations and integrating it into your own data pipelines.