Let's test the Data pipeline
I recently came across a useful Python package for testing data called Great Expectations. We can use it for almost any data test or validation, such as ETL checks and DB schema tests, and it can be used in data pipelines orchestrated by tools like Airflow.
Data Pipeline tests are applied to data (instead of code) and at batch time (instead of compile or deploy time). Pipeline tests are like unit tests for datasets: they help you guard against upstream data changes and monitor data quality.
Why do we need pipeline tests?
Test-driven development has become a standard approach in the field of software development. Writing unit tests is part of good style and helps keep code quality high and avoid errors.
In the meantime, however, complexity often no longer manifests itself only in the code, but in the data as well. Unexpected behavior while running a machine-learning model may reflect genuinely anomalous data, or the data may simply be erroneous. Often, the cause is an uncommunicated change in the data model of the source system.
Machine-learning models themselves now consist of a pipeline with several transformation steps, such as:
- Loading data (from a raw-data source, a data warehouse, or another DBMS)
- Cleaning and preparation
- Aggregations
- Features
- Dimensionality reduction
- Normalization or scaling of data
Great Expectations helps teams save time and promote analytic integrity by offering a unique approach to automated testing: pipeline tests.
Python has established itself as a development language in the area of machine learning. The Great Expectations tool is a Python package, installable via pip or conda.
Key features:
- GE supports multiple data sources: Pandas (filesystem or S3), Spark dataframes, Databricks, and SQL databases (BigQuery, Redshift, MySQL, Postgres, MSSQL, Snowflake, etc.)
- We can define expectations in a configurable way without changing code (a minimal sketch follows the install commands below).
- Automated data profiling.
- We can configure GE's metadata stores on a SQL database or the filesystem.
pip install great-expectations
conda install conda-forge::great-expectations
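With the package installed, expectations can also be declared directly in Python using the classic Pandas-based API. This is only a minimal sketch; the CSV file and column names are hypothetical:

import great_expectations as ge

# wrap a CSV in a GE dataset so expectation methods become available on it
df = ge.read_csv("yellow_tripdata_sample.csv")

# each call checks the current data immediately and records the expectation
df.expect_column_values_to_not_be_null("passenger_count")
df.expect_column_values_to_be_between("fare_amount", min_value=0)

# run every recorded expectation against this batch
result = df.validate()
print(result.success)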
After installation of the Python package, each new project is started by:
great_expectations init
This creates a subdirectory called great_expectations: a skeleton directory for GE that holds the metadata stores, containing expectation suites, validation results, and configuration files.
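The generated skeleton looks roughly like this (the exact layout varies slightly between GE versions):

great_expectations/
  great_expectations.yml    # main project configuration
  expectations/             # expectation suites stored as JSON
  checkpoints/              # checkpoint configurations
  plugins/                  # custom extensions
  uncommitted/              # kept out of source control
    config_variables.yml    # credentials and other secrets
    data_docs/              # generated HTML documentation
    validations/            # validation results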
Initialize a Data Context
Configuring a data source
great_expectations datasource new
Configuring a data source for a flat file (.csv)
Configuring a data source for a relational database: you can configure any relational DB supported by SQLAlchemy, such as MySQL, Postgres, Redshift, Snowflake, BigQuery, etc.
conda install sqlalchemy
pip install sqlalchemy
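When the CLI prompts for connection details, it expects a SQLAlchemy connection string, for example (hypothetical host and credentials):

postgresql+psycopg2://user:password@localhost:5432/taxi_db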
Auto data profiling: this creates a default expectation suite with validations generated automatically from the data.
This opens a Jupyter notebook; just run all the cells in sequence. In cell 2, you can uncomment the columns you want to profile/test.
Auto data profiling report: the report below is created from the profiling results and opened in the default browser.
Edit the default validation suite by running the command below.
> great_expectations suite edit taxi_suite
You can refine the validations: remove the unwanted ones and add any you need. This command opens a Jupyter notebook; just follow the instructions and run the cells to edit your validation suite and run it against the same data. It also opens the validation report, which Great Expectations calls a Data Doc.
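Under the hood, the suite is stored as JSON in the expectations/ directory. A trimmed example, with a hypothetical column:

{
  "expectation_suite_name": "taxi_suite",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {
        "column": "fare_amount",
        "min_value": 0
      }
    }
  ]
}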
Validation report - Data Doc
The exercise explained here is shared at https://github.com/sanjaydub/GreatExpectations_DataPipelineTest
Integrations:
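Great Expectations is designed to slot into existing pipelines such as Airflow (mentioned above). As a minimal sketch using the classic API, a validation step can be wrapped in a plain Python function and called from an orchestrator task; the suite path and file names below are assumptions:

import great_expectations as ge

def validate_batch(csv_path):
    # wrap the incoming batch in a GE dataset
    df = ge.read_csv(csv_path)
    # validate against the suite saved by the CLI (path is an assumption)
    result = df.validate(
        expectation_suite="great_expectations/expectations/taxi_suite.json"
    )
    # fail the pipeline task if any expectation is not met
    if not result.success:
        raise ValueError("Data validation failed for {}".format(csv_path))
    return result

A function like this can be invoked, for example, from an Airflow PythonOperator so that a failed validation fails the DAG run.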
Learning notes:
Great Expectations is a fast-growing tool that allows comprehensive checks to ensure data quality for operating a machine-learning model. It has been developed with particular attention to providing the most generic possible framework, offering users many interfaces that allow them to adapt Great Expectations to their own project and extend it to their own needs. Available, for example, is a validation action that can automatically send a Slack notification after a validation run.
There is a great need to take data-quality testing into the world of modern data sources and development approaches with the help of a modern tool. It is therefore worth getting to know Great Expectations and integrating it into your own data pipelines.