Data Quality on AWS
AWS provides cloud-based Data Quality (DQ) solutions so that data can be monitored, measured against KPIs, and remediated before it is processed by AWS Glue pipelines, used for reporting, or made available to analytics solutions.
AWS currently provides two out-of-the-box options (AWS Glue Data Quality is still in preview in selected regions): AWS Glue DataBrew and a utility called Deequ. Deequ is a package with a few built-in technical DQ checks such as completeness and uniqueness. However, customization is limited, and building a full-fledged DQ solution, especially one with complex business DQ rules, is difficult with the Deequ package. In this article we focus on DataBrew as the AWS DQ solution. To make the Deequ comparison concrete, a minimal sketch of its technical checks follows.
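The sketch below uses pydeequ, the Python wrapper for Deequ, to run completeness and uniqueness checks. The S3 path and column names (customer_id, status) are illustrative assumptions, not from a real dataset.

```python
# Minimal pydeequ sketch of Deequ-style technical DQ checks.
# The input path and column names are illustrative.
import pydeequ
from pyspark.sql import SparkSession
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.read.parquet("s3://my-dq-bucket/customers/")  # hypothetical source

check = Check(spark, CheckLevel.Error, "Technical DQ checks")
result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check
                    .isComplete("customer_id")    # completeness check
                    .isUnique("customer_id")      # uniqueness check
                    .isContainedIn("status", ["ACTIVE", "INACTIVE"]))
          .run())

# One row per constraint, with pass/fail status for reporting
VerificationResult.checkResultsAsDataFrame(spark, result).show()
```

This is exactly where the limitation mentioned above shows up: the checks are technical and column-level, and expressing a complex cross-entity business rule quickly outgrows this API.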
AWS Glue DataBrew is a visual data preparation tool that enables users to clean and normalize data without writing any code, reducing the time it takes to prepare data for analytics and machine learning (ML) by up to 80% compared with conventional, code-based data preparation. It ships with over 250 pre-built transformations that automate data preparation tasks such as filtering anomalies, converting data to standard formats, and correcting invalid values, all without writing code.
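Although DataBrew is a visual tool, recipes built from these transformations can also be registered programmatically. Below is a hedged boto3 sketch; the recipe name, column name, and the single UPPER_CASE step are illustrative, with operation names following the DataBrew recipe action reference.

```python
# Hedged sketch: registering a small DataBrew recipe via boto3.
# Recipe and column names are illustrative assumptions.
import boto3

databrew = boto3.client("databrew")

databrew.create_recipe(
    Name="customer-cleanup-recipe",  # hypothetical name
    Description="Normalize customer data before DQ checks",
    Steps=[
        {
            # Standardize a text column to upper case; "UPPER_CASE" is a
            # documented DataBrew recipe operation.
            "Action": {
                "Operation": "UPPER_CASE",
                "Parameters": {"sourceColumn": "country"},
            }
        },
    ],
)
```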
Let's quickly look at the main AWS Glue DataBrew components and how they are used: a Dataset is a read-only connection to source data; a Project is the interactive workspace for exploring data; a Recipe is a reusable set of transformation steps; a Job either runs a recipe or profiles a dataset; a Schedule triggers jobs; and a Ruleset is a set of data quality rules validated by a profile job.
A sample architecture showing how these components can be combined into a complete DQ solution is shown below. Here we take Amazon Redshift or Amazon S3 buckets as the source. DataBrew can also connect to other databases through a JDBC connection, which is easy to set up in the UI; these are essentially AWS Glue connections, as the dataset sketch below illustrates.
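A hedged boto3 sketch of the two dataset flavours this architecture uses; bucket, table, and connection names are illustrative assumptions.

```python
# Hedged sketch: defining DataBrew datasets over S3 and a JDBC source.
# Bucket, table and Glue connection names are illustrative.
import boto3

databrew = boto3.client("databrew")

# Dataset backed by files in an S3 bucket
databrew.create_dataset(
    Name="orders-s3",
    Format="CSV",
    Input={"S3InputDefinition": {"Bucket": "my-dq-bucket", "Key": "orders/"}},
)

# Dataset backed by a database table, reached through an AWS Glue connection
databrew.create_dataset(
    Name="orders-redshift",
    Input={
        "DatabaseInputDefinition": {
            "GlueConnectionName": "redshift-jdbc-connection",
            "DatabaseTableName": "public.orders",
        }
    },
)
```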
There are two parts to this architecture.
The upper part of the diagram covers DQ reporting: data is read from Redshift/S3, rules are created and executed, and the failed dataset is captured along with rule output statistics (total rows, failed count). The rule output statistics can then be fed into a data governance tool for reporting the scores (here we have used Informatica Axon). The failed dataset can be written to Redshift tables or S3 buckets and used for drill-down reporting on those statistics. A hedged sketch of this reporting path follows.
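The sketch below creates a DQ ruleset and attaches it to a profile job whose validation results land in S3. The ARNs, names, column, and check expression are illustrative assumptions; the check-expression grammar follows the DataBrew ruleset documentation.

```python
# Hedged sketch of the reporting path: a DQ ruleset validated by a profile job.
# All ARNs, names and thresholds are illustrative.
import boto3

databrew = boto3.client("databrew")

databrew.create_ruleset(
    Name="orders-dq-rules",
    TargetArn="arn:aws:databrew:us-east-1:111122223333:dataset/orders-redshift",
    Rules=[
        {
            "Name": "amount must be positive",
            # Expression syntax assumed from DataBrew ruleset doc examples
            "CheckExpression": ":col1 > :val1",
            "SubstitutionMap": {":col1": "`amount`", ":val1": "0"},
        },
    ],
)

# The profile job validates the ruleset and writes statistics
# (total rows, failed counts) as JSON to S3 for downstream reporting.
databrew.create_profile_job(
    Name="orders-dq-profile",
    DatasetName="orders-redshift",
    RoleArn="arn:aws:iam::111122223333:role/DataBrewServiceRole",
    OutputLocation={"Bucket": "my-dq-bucket", "Key": "dq-results/"},
    ValidationConfigurations=[
        {
            "RulesetArn": "arn:aws:databrew:us-east-1:111122223333:ruleset/orders-dq-rules",
            "ValidationMode": "CHECK_ALL",
        }
    ],
)
databrew.start_job_run(Name="orders-dq-profile")
```

The JSON statistics written to the output location are what can be parsed and pushed into a governance tool such as Informatica Axon.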
The bottom part of the architecture is useful when data anomalies need to be remediated and fed into an AWS Glue pipeline. This process can be orchestrated with an AWS Step Functions workflow (see Prepare, transform, and orchestrate your data using AWS Glue DataBrew, AWS Glue ETL, and AWS Step Functions on the AWS Big Data Blog). A sketch of such a state machine is shown below.
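A hedged sketch of the orchestration step, using the Step Functions optimized integrations for DataBrew and Glue. The job, role, and state machine names are illustrative assumptions.

```python
# Hedged sketch: a Step Functions state machine that runs the DataBrew
# remediation job, then the downstream Glue ETL job. Names are illustrative.
import json
import boto3

definition = {
    "StartAt": "RemediateWithDataBrew",
    "States": {
        "RemediateWithDataBrew": {
            "Type": "Task",
            # Optimized integration: runs a DataBrew job and waits for completion
            "Resource": "arn:aws:states:::databrew:startJobRun.sync",
            "Parameters": {"Name": "orders-remediation-job"},
            "Next": "RunGlueETL",
        },
        "RunGlueETL": {
            "Type": "Task",
            # Optimized integration: runs a Glue job and waits for completion
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "orders-etl-job"},
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="dq-remediation-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::111122223333:role/StepFunctionsDataBrewRole",
)
```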
Deployment: AWS DataBrew resources can be defined in CloudFormation templates, and the resulting stack can be deployed to other environments. All the DataBrew components (e.g. Project, Recipe, Dataset, Ruleset) are available as CloudFormation resource types. Details of the CloudFormation templates for DataBrew can be found under the AWS Glue DataBrew resource type reference in the AWS CloudFormation documentation. A deployment sketch follows.
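The sketch below deploys a minimal template containing an AWS::DataBrew::Ruleset resource through boto3. The stack name, ARNs, and the rule itself are illustrative assumptions; note that in CloudFormation the SubstitutionMap is a list of ValueReference/Value pairs rather than the plain map the boto3 API uses.

```python
# Hedged sketch: deploying a DataBrew ruleset to another environment as a
# CloudFormation stack. All names and ARNs are illustrative.
import json
import boto3

template = {
    "Resources": {
        "OrdersRuleset": {
            "Type": "AWS::DataBrew::Ruleset",
            "Properties": {
                "Name": "orders-dq-rules",
                "TargetArn": "arn:aws:databrew:us-east-1:111122223333:dataset/orders-redshift",
                "Rules": [
                    {
                        "Name": "amount must be positive",
                        "CheckExpression": ":col1 > :val1",
                        "SubstitutionMap": [
                            {"ValueReference": ":col1", "Value": "`amount`"},
                            {"ValueReference": ":val1", "Value": "0"},
                        ],
                    }
                ],
            },
        }
    }
}

cfn = boto3.client("cloudformation")
cfn.create_stack(
    StackName="databrew-dq-rules-prod",
    TemplateBody=json.dumps(template),
)
```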