Data Quality on AWS

AWS provides cloud-based Data Quality solutions so that data can be monitored, measured against KPIs, and remediated before it is processed by AWS Glue pipelines, reported on, or made usable for analytics solutions.

AWS currently provides two out-of-the-box options (AWS Glue Data Quality is still available as a beta in selected Regions): AWS Glue DataBrew and a utility called Deequ. Deequ is a package with a few built-in technical DQ checks such as completeness and uniqueness. However, customization is limited, and building a full-fledged DQ solution, especially one with complex business DQ rules, is difficult with the Deequ package alone. In this article we focus on DataBrew as the AWS DQ solution.
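
To give a sense of what a Deequ technical check looks like, here is a minimal sketch using the PyDeequ wrapper; the column names, the S3 path, and the Spark setup are illustrative placeholders, not taken from any real pipeline.

```python
# Minimal PyDeequ sketch (illustrative column names and path; assumes a Spark
# session configured with the Deequ jar). Shows the kind of technical checks
# Deequ ships with: completeness and uniqueness.
from pyspark.sql import SparkSession
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/orders/")  # placeholder path

check = (Check(spark, CheckLevel.Error, "basic technical DQ")
         .isComplete("order_id")     # no NULLs allowed
         .isUnique("order_id")       # no duplicates
         .isComplete("customer_id"))

result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check)
          .run())

# Convert the result into a DataFrame of per-constraint statuses.
VerificationResult.checkResultsAsDataFrame(spark, result).show()
```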

AWS Glue DataBrew is a visual data preparation tool that enables users to clean and normalize data without writing any code, reducing the time it takes to prepare data for analytics and machine learning (ML) by up to 80% compared to conventional, code-based data preparation. It offers over 250 pre-built transformations to automate data preparation tasks, such as filtering anomalies, converting data to standard formats, and correcting invalid values, all without the need to write code.

Let's quickly look at the AWS Glue DataBrew components and how each of them is used:

  • Project: The interactive data preparation workspace in DataBrew is called a project. Using a data project, you manage a collection of related items: data, transformations, and scheduled processes. As part of creating a project, you choose or create a dataset to work on. Next, you create a recipe, which is a set of instructions or steps that you want DataBrew to act on. These actions transform your raw data into a form that is ready to be consumed by your data pipeline.
  • Dataset: Dataset simply means a set of data—rows or records that are divided into columns or fields. When you create a DataBrew project, you connect to or upload data that you want to transform or prepare. DataBrew can work with data from any source, imported from formatted files, and it connects directly to a growing list of data stores. For DataBrew, a dataset is a read-only connection to data. DataBrew collects a set of descriptive metadata to refer to the data. No actual data can be altered or stored by DataBrew.
  • DQ Ruleset and Rule: DQ rules are validation checks for a dataset. A ruleset is a group of rules; you can write any number of rules in a ruleset and mark each as active or inactive. Each ruleset must be associated with a dataset. The output of a ruleset run is stored in S3 as a JSON file and contains rule statistics such as sample size and failed count (a minimal boto3 sketch of wiring these components together follows this list).
  • Recipe: A recipe is a set of instructions or steps for data that you want DataBrew to act on. A recipe can contain many steps, and each step can contain many actions. You use the transformation tools on the toolbar to set up all the changes that you want to make to your data. Later, when you're ready to see the finished product of your recipe, you assign it to a job and schedule it. DataBrew stores the instructions about the data transformation, but it doesn't store any of your actual data. You can download and reuse recipes in other projects, and you can publish multiple versions of a recipe. This is where we obtain the failed dataset, by applying the DQ rules as recipe steps.
  • Jobs: DataBrew takes on the job of transforming your data by running the instructions that you set up when you made a recipe. The process of running these instructions is called a job. A job can put your data recipes into action on a preset schedule, but you aren't confined to a schedule; you can also run jobs on demand. There are two types of jobs you can create:

  1. To profile data, set up a Profile job to create a data profile. This evaluates the DQ rules along with the other statistical profiling of the dataset.
  2. To transform the data or to get a filtered output, create a Recipe job.
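
All of these components are also exposed through the DataBrew API, so the same setup can be scripted. The sketch below uses boto3 to register an S3 dataset, attach a ruleset, and run a profile job; every name, bucket, role ARN, and the rule check expression are illustrative placeholders, and the exact request shapes should be verified against the boto3 DataBrew documentation.

```python
# Minimal boto3 sketch: register an S3 dataset, attach a DQ ruleset, and run a
# profile job that evaluates the rules. All names, paths, and the role ARN are
# illustrative placeholders.
import boto3

databrew = boto3.client("databrew")
account_id = boto3.client("sts").get_caller_identity()["Account"]
region = boto3.session.Session().region_name

# 1. Dataset: a read-only pointer to the source data (here, files in S3).
databrew.create_dataset(
    Name="orders_dataset",
    Input={"S3InputDefinition": {"Bucket": "my-source-bucket", "Key": "orders/"}},
)

# 2. Ruleset: DQ rules attached to the dataset via its ARN.
databrew.create_ruleset(
    Name="orders_dq_rules",
    TargetArn=f"arn:aws:databrew:{region}:{account_id}:dataset/orders_dataset",
    Rules=[
        {
            "Name": "order_id_not_null",
            "CheckExpression": ":col1 IS NOT NULL",  # expression syntax is illustrative
            "SubstitutionMap": {":col1": "`order_id`"},
        }
    ],
)

# 3. Profile job: profiles the dataset, evaluates the attached ruleset, and
#    writes the JSON statistics to S3.
databrew.create_profile_job(
    Name="orders_profile_job",
    DatasetName="orders_dataset",
    RoleArn=f"arn:aws:iam::{account_id}:role/DataBrewServiceRole",  # placeholder role
    OutputLocation={"Bucket": "my-dq-results-bucket", "Key": "profiles/"},
    ValidationConfigurations=[
        {"RulesetArn": f"arn:aws:databrew:{region}:{account_id}:ruleset/orders_dq_rules"}
    ],
)
databrew.start_job_run(Name="orders_profile_job")
```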

A sample architecture showing how these components can be combined into a complete DQ solution is shown below. Here we consider the source to be a Redshift database or Amazon S3 buckets. DataBrew can also connect to other databases using a JDBC connection, which is easy to set up in the UI; these are essentially AWS Glue connections.

There are two parts to this architecture.

The upper part of the diagram focuses on DQ reporting: data is read from Redshift/S3, rules are created and executed, and the failed dataset is captured along with the rule output statistics (total rows, failed count). The DQ rule output statistics can then be fed into a Data Governance tool for reporting the scores (here we have used Informatica Axon). The failed dataset can be sent to either Redshift tables or S3 buckets and used for drill-down reporting of the statistics.
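
Since the rule statistics land in S3 as a JSON report, feeding the scores to a governance tool can be as simple as parsing that file. Below is a rough sketch; the bucket, key, and JSON field names are assumptions and should be checked against an actual profile-job validation report.

```python
# Rough sketch: read a DataBrew DQ validation report from S3 and compute a
# simple pass-rate per rule. The bucket/key and JSON field names are
# assumptions; inspect a real profile-job output to confirm the structure.
import json
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="my-dq-results-bucket",
    Key="profiles/orders_profile_job_dq-validation-report.json",  # placeholder key
)
report = json.loads(obj["Body"].read())

for rule in report.get("ruleResults", []):
    name = rule.get("name")
    status = rule.get("status")          # e.g. SUCCEEDED / FAILED
    total = rule.get("sampleSize", 0)
    failed = rule.get("failedCount", 0)
    score = 100.0 * (total - failed) / total if total else None
    print(f"{name}: status={status}, score={score}")
    # At this point the per-rule scores could be pushed to a governance tool
    # (e.g., via its REST API) for KPI reporting.
```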

The bottom part of the architecture is useful when data anomalies need to be remediated and fed into an AWS Glue pipeline. This process can be orchestrated with an AWS Step Functions workflow (see "Prepare, transform, and orchestrate your data using AWS Glue DataBrew, AWS Glue ETL, and AWS Step Functions" on the AWS Big Data Blog).
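
As a rough illustration of that hand-off, the sequencing can also be scripted with boto3 (for example from a Lambda task inside the state machine): run the DataBrew recipe job that remediates the data, wait for it to finish, then start the downstream Glue job. The job names below are placeholders.

```python
# Rough sketch of the remediation hand-off: run a DataBrew recipe job, poll
# until it reaches a terminal state, then trigger the downstream Glue ETL job.
# In practice a Step Functions state machine would handle the waiting and
# retries; the job names here are placeholders.
import time
import boto3

databrew = boto3.client("databrew")
glue = boto3.client("glue")

run_id = databrew.start_job_run(Name="orders_cleanup_recipe_job")["RunId"]

while True:
    state = databrew.describe_job_run(
        Name="orders_cleanup_recipe_job", RunId=run_id
    )["State"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)

if state == "SUCCEEDED":
    # Feed the remediated output into the Glue pipeline.
    glue.start_job_run(JobName="orders_etl_glue_job")
else:
    raise RuntimeError(f"DataBrew job run ended in state {state}")
```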

[Architecture diagram: DQ reporting (upper part) and remediation into the AWS Glue pipeline (lower part) using AWS Glue DataBrew]

Deployment: AWS Glue DataBrew resources can be defined as CloudFormation templates, and the resulting stack can be deployed to other environments. All of the DataBrew components (e.g., Project, Recipe, Dataset, Ruleset) are available as CloudFormation resource types. The details can be found in the AWS Glue DataBrew resource type reference in the AWS CloudFormation documentation.
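
As a hedged illustration, the same resource types can also be defined with the AWS CDK in Python, which synthesizes to those CloudFormation templates; the construct and property names follow the AWS::DataBrew::* resource schemas, but all names, buckets, and the check expression below are placeholders.

```python
# Illustrative AWS CDK (Python) stack using the L1 constructs that map to the
# AWS::DataBrew::* CloudFormation resource types. All names, bucket/key values,
# and the check expression are placeholders; verify property shapes against the
# CDK / CloudFormation documentation.
from aws_cdk import Aws, Stack
from aws_cdk import aws_databrew as databrew
from constructs import Construct


class DataBrewDqStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # AWS::DataBrew::Dataset pointing at source files in S3
        databrew.CfnDataset(
            self, "OrdersDataset",
            name="orders_dataset",
            input=databrew.CfnDataset.InputProperty(
                s3_input_definition=databrew.CfnDataset.S3LocationProperty(
                    bucket="my-source-bucket", key="orders/"
                )
            ),
        )

        # AWS::DataBrew::Ruleset attached to the dataset by ARN
        databrew.CfnRuleset(
            self, "OrdersRuleset",
            name="orders_dq_rules",
            target_arn=f"arn:aws:databrew:{Aws.REGION}:{Aws.ACCOUNT_ID}:dataset/orders_dataset",
            rules=[
                databrew.CfnRuleset.RuleProperty(
                    name="order_id_not_null",
                    check_expression=":col1 IS NOT NULL",  # illustrative syntax
                    substitution_map=[
                        databrew.CfnRuleset.SubstitutionValueProperty(
                            value_reference=":col1", value="`order_id`"
                        )
                    ],
                )
            ],
        )
```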

#aws #dataquality #dq #dqoncloud #kpmgindia #databrew #awsglue

