Let's test the Data pipeline
https://greatexpectations.io/

I just came across a cool Python package for testing data, called Great Expectations. We can use it for tests and validations on data, such as ETL checks and DB schema tests, and it can be plugged into data pipelines orchestrated by tools like Airflow.

Data Pipeline tests are applied to data (instead of code) and at batch time (instead of compile or deploy time). Pipeline tests are like unit tests for datasets: they help you guard against upstream data changes and monitor data quality.
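As a minimal illustration of the idea in plain Python (this is not the Great Expectations API, and the column names are made up), a pipeline test validates each incoming batch of records against declared expectations at batch time, so the pipeline can halt or alert before bad data propagates downstream:

```python
# Sketch of a batch-time pipeline test in plain Python (not the GE API).
# Column names ("fare") and thresholds are illustrative assumptions.

def expect_column_values_not_null(batch, column):
    """Every record must have a non-null value in `column`."""
    return all(row.get(column) is not None for row in batch)

def expect_column_values_between(batch, column, low, high):
    """Every value in `column` must be present and fall within [low, high]."""
    return all(
        row.get(column) is not None and low <= row[column] <= high
        for row in batch
    )

def validate(batch):
    """Run all expectations against one batch; return (passed, details)."""
    checks = {
        "fare_not_null": expect_column_values_not_null(batch, "fare"),
        "fare_in_range": expect_column_values_between(batch, "fare", 0, 500),
    }
    return all(checks.values()), checks

good_batch = [{"fare": 12.5}, {"fare": 7.0}]
bad_batch = [{"fare": 12.5}, {"fare": None}]

print(validate(good_batch)[0])  # True
print(validate(bad_batch)[0])   # False
```

Great Expectations generalizes exactly this pattern: expectations are declared once, evaluated against every new batch, and the results are recorded and rendered as reports.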

Why do we need pipeline tests?

Test-driven development has become a standard approach in the field of software development. Writing unit tests is part of good style and helps keep code quality high and avoid errors.

In the meantime, however, complexity often no longer manifests only in the code itself, but in the data as well. Unexpected behaviour while running a machine-learning model may reflect genuinely anomalous behaviour, or the data may simply be erroneous. Often the cause is an uncommunicated change in the data model of the source system.

Machine-learning models themselves now consist of a pipeline with several transformation steps: loading data (from a raw-data source, a data warehouse, or another DBMS), cleaning and preparation, aggregations, feature engineering, dimensionality reduction, and normalization or scaling of data.


Great Expectations helps teams save time and promote analytic integrity by offering a unique approach to automated testing: pipeline tests.

Python has established itself as a development language in the area of machine learning. The Great Expectations tool is a Python package, installable via pip or conda.

Key features:

  • GE supports multiple data sources such as Pandas (S3 or filesystem), Spark DataFrames, Databricks, and SQL databases (BigQuery, Redshift, MySQL, Postgres, MSSQL, Snowflake, etc.).
  • We can define expectations in a configurable way without changing code.
  • Automated data profiling.
  • GE's metadata stores can be configured on a filesystem or in a SQL database.
pip install great-expectations

or

conda install conda-forge::great-expectations

After installing the Python package, each new project is started by running:

great_expectations init

This creates a subdirectory called great_expectations: a skeleton directory for GE that holds its metadata stores, containing expectations, validation results, and configuration files.
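In GE v2, the generated skeleton looks roughly like this (exact names may differ slightly between versions):

```
great_expectations/
├── great_expectations.yml      # project configuration
├── expectations/               # expectation suites (JSON)
├── plugins/                    # custom extensions
└── uncommitted/                # excluded from version control
    ├── config_variables.yml    # credentials/secrets
    ├── data_docs/              # rendered HTML reports
    └── validations/            # validation results
```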

Initialize a Data Context


Configuring a data source

great_expectations datasource new

Configuring a data source for a flat file (.csv)

Configuring a data source for a relational database: you can configure any relational DB supported by SQLAlchemy, such as MySQL, Postgres, Redshift, Snowflake, BigQuery, etc.

conda install sqlalchemy

pip install sqlalchemy
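A relational data source is identified by a standard SQLAlchemy connection URL. A few typical formats, sketched with placeholder hosts and credentials (the Snowflake and BigQuery forms additionally require their respective SQLAlchemy dialect packages):

```
postgresql+psycopg2://user:password@localhost:5432/mydb
mysql+pymysql://user:password@localhost:3306/mydb
snowflake://user:password@account/database/schema
bigquery://my-project/my-dataset
```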


Auto data profiling: this automatically creates a default expectation suite, with validations derived from the data.


This will open a Jupyter notebook; run all the cells in sequence. In cell 2 you can uncomment the columns you want to profile/test.

Auto data profiling report: based on the profiling results, the report below is created and opened in your default browser.

Auto data profiling report
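These rendered reports are GE's Data Docs. Where they are written is controlled in great_expectations.yml; a sketch of the default-style local site configuration (paths here follow the v2 defaults):

```yaml
data_docs_sites:
  local_site:
    class_name: SiteBuilder
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site
```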

Edit the default validation suite by running the command below.

great_expectations suite edit taxi_suite

You can refine the validations: remove the unwanted ones and add the ones you need. This opens a Jupyter notebook; just follow the instructions and run the cells to edit your validation suite and run it against the same data. This will also open the validation report, which Great Expectations calls a Data Doc.
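Under the hood, the suite is stored as JSON in the expectations/ directory. A trimmed, hypothetical example for taxi_suite (the column names are illustrative, but the expectation types are real GE expectations):

```json
{
  "expectation_suite_name": "taxi_suite",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "passenger_count"}
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {"column": "fare_amount", "min_value": 0, "max_value": 500}
    }
  ]
}
```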


Validation report - Data Doc


Explained exercise is shared @ https://github.com/sanjaydub/GreatExpectations_DataPipelineTest

Integrations:


Learning note:

Great Expectations is a fast-growing tool that enables comprehensive data-quality assurance for operating machine-learning models. Special importance has been attached to providing the most generic possible framework, with many interfaces that let users adapt Great Expectations to their own project and extend it to their own needs. For example, an action operator is available that can automatically send a Slack notification after validation.
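As a sketch, in GE v2 such a Slack notification was wired into a validation operator's action_list in great_expectations.yml, roughly like this (the webhook URL is a placeholder):

```yaml
validation_operators:
  action_list_operator:
    class_name: ActionListValidationOperator
    action_list:
      - name: send_slack_notification
        action:
          class_name: SlackNotificationAction
          slack_webhook: https://hooks.slack.com/services/XXX/YYY/ZZZ
          notify_on: all
          renderer:
            module_name: great_expectations.render.renderer.slack_renderer
            class_name: SlackRenderer
```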

There is a great need to bring data-quality testing into the world of modern data sources and development approaches with the help of a modern tool. It is therefore worth getting to know Great Expectations and integrating it into your own data pipelines.


