Data Quality on AWS

AWS provides cloud-based Data Quality solutions so that data can be monitored, measured against KPIs, and remediated before it is processed by AWS Glue pipelines, reported on, or made usable for analytics solutions.

AWS currently provides two out-of-the-box options (AWS Glue Data Quality is still available as a beta in selected Regions): AWS Glue DataBrew and a utility called Deequ. Deequ is a package with a few built-in technical DQ checks such as completeness and uniqueness. However, customization is limited, and building a full-fledged DQ solution, especially one with complex business DQ rules, is difficult with the Deequ package alone. In this article we focus on DataBrew as the AWS DQ solution.
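
To give a sense of what a Deequ technical check looks like, here is a minimal sketch using the PyDeequ wrapper; the column names, the S3 path, and the Spark setup are illustrative placeholders, not taken from any real pipeline.

```python
# Minimal PyDeequ sketch (illustrative column names and path; assumes a Spark
# session configured with the Deequ jar). Shows the kind of technical checks
# Deequ ships with: completeness and uniqueness.
from pyspark.sql import SparkSession
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/orders/")  # placeholder path

check = (Check(spark, CheckLevel.Error, "basic technical DQ")
         .isComplete("order_id")     # no NULLs allowed
         .isUnique("order_id")       # no duplicates
         .isComplete("customer_id"))

result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check)
          .run())

# Convert the result into a DataFrame of per-constraint statuses.
VerificationResult.checkResultsAsDataFrame(spark, result).show()
```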

AWS Glue DataBrew is a visual data preparation tool that enables users to clean and normalize data without writing any code, reducing the time it takes to prepare data for analytics and machine learning (ML) by up to 80% compared to conventional, code-based data preparation. It offers over 250 pre-built transformations to automate data preparation tasks, such as filtering anomalies, converting data to standard formats, and correcting invalid values, all without the need to write code.

Let's quickly look at the AWS Glue DataBrew components and how each of them is used:

  • Project: The interactive data preparation workspace in DataBrew is called a project. Using a data project, you manage a collection of related items: data, transformations, and scheduled processes. As part of creating a project, you choose or create a dataset to work on. Next, you create a recipe, which is a set of instructions or steps that you want DataBrew to act on. These actions transform your raw data into a form that is ready to be consumed by your data pipeline.
  • Dataset: Dataset simply means a set of data—rows or records that are divided into columns or fields. When you create a DataBrew project, you connect to or upload data that you want to transform or prepare. DataBrew can work with data from any source, imported from formatted files, and it connects directly to a growing list of data stores. For DataBrew, a dataset is a read-only connection to data. DataBrew collects a set of descriptive metadata to refer to the data. No actual data can be altered or stored by DataBrew.
  • DQ Ruleset and Rule: DQ rules are validation checks for a dataset. A ruleset is a group of rules; you can write any number of rules in a ruleset and mark each as active or inactive. Each ruleset must be associated with a dataset. The output of a ruleset run is stored in S3 as a JSON file and contains rule statistics such as sample size and failed count (a minimal boto3 sketch of wiring these components together follows this list).
  • Recipe: A recipe is a set of instructions or steps for data that you want DataBrew to act on. A recipe can contain many steps, and each step can contain many actions. You use the transformation tools on the toolbar to set up all the changes that you want to make to your data. Later, when you're ready to see the finished product of your recipe, you assign it to a job and schedule it. DataBrew stores the instructions about the data transformation, but it doesn't store any of your actual data. You can download and reuse recipes in other projects, and you can publish multiple versions of a recipe. This is where we obtain the failed dataset, by applying the DQ rules as recipe steps.
  • Jobs: DataBrew takes on the job of transforming your data by running the instructions that you set up when you made a recipe. The process of running these instructions is called a job. A job can put your data recipes into action on a preset schedule, but you aren't confined to a schedule; you can also run jobs on demand. There are two types of jobs you can create:

  1. To profile data, set up a Profile job to create a data profile. This evaluates the DQ rules along with the other statistical profiling of the dataset.
  2. To transform the data or to get a filtered output, create a Recipe job.
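
All of these components are also exposed through the DataBrew API, so the same setup can be scripted. The sketch below uses boto3 to register an S3 dataset, attach a ruleset, and run a profile job; every name, bucket, role ARN, and the rule check expression are illustrative placeholders, and the exact request shapes should be verified against the boto3 DataBrew documentation.

```python
# Minimal boto3 sketch: register an S3 dataset, attach a DQ ruleset, and run a
# profile job that evaluates the rules. All names, paths, and the role ARN are
# illustrative placeholders.
import boto3

databrew = boto3.client("databrew")
account_id = boto3.client("sts").get_caller_identity()["Account"]
region = boto3.session.Session().region_name

# 1. Dataset: a read-only pointer to the source data (here, files in S3).
databrew.create_dataset(
    Name="orders_dataset",
    Input={"S3InputDefinition": {"Bucket": "my-source-bucket", "Key": "orders/"}},
)

# 2. Ruleset: DQ rules attached to the dataset via its ARN.
databrew.create_ruleset(
    Name="orders_dq_rules",
    TargetArn=f"arn:aws:databrew:{region}:{account_id}:dataset/orders_dataset",
    Rules=[
        {
            "Name": "order_id_not_null",
            "CheckExpression": ":col1 IS NOT NULL",  # expression syntax is illustrative
            "SubstitutionMap": {":col1": "`order_id`"},
        }
    ],
)

# 3. Profile job: profiles the dataset, evaluates the attached ruleset, and
#    writes the JSON statistics to S3.
databrew.create_profile_job(
    Name="orders_profile_job",
    DatasetName="orders_dataset",
    RoleArn=f"arn:aws:iam::{account_id}:role/DataBrewServiceRole",  # placeholder role
    OutputLocation={"Bucket": "my-dq-results-bucket", "Key": "profiles/"},
    ValidationConfigurations=[
        {"RulesetArn": f"arn:aws:databrew:{region}:{account_id}:ruleset/orders_dq_rules"}
    ],
)
databrew.start_job_run(Name="orders_profile_job")
```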

A sample architecture showing how these components can be combined into a complete DQ solution is shown below. Here we consider the source to be a Redshift database or Amazon S3 buckets. DataBrew can also connect to other databases using a JDBC connection, which is easy to set up in the UI; these are essentially AWS Glue connections.

There are two parts to this architecture.

The upper part of the diagram focuses on DQ reporting: data is read from Redshift/S3, rules are created and executed, and the failed dataset is captured along with the rule output statistics (total rows, failed count). The DQ rule output statistics can then be fed into a Data Governance tool for reporting the scores (here we have used Informatica Axon). The failed dataset can be sent to either Redshift tables or S3 buckets and used for drill-down reporting of the statistics.
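
Since the rule statistics land in S3 as a JSON report, feeding the scores to a governance tool can be as simple as parsing that file. Below is a rough sketch; the bucket, key, and JSON field names are assumptions and should be checked against an actual profile-job validation report.

```python
# Rough sketch: read a DataBrew DQ validation report from S3 and compute a
# simple pass-rate per rule. The bucket/key and JSON field names are
# assumptions; inspect a real profile-job output to confirm the structure.
import json
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="my-dq-results-bucket",
    Key="profiles/orders_profile_job_dq-validation-report.json",  # placeholder key
)
report = json.loads(obj["Body"].read())

for rule in report.get("ruleResults", []):
    name = rule.get("name")
    status = rule.get("status")          # e.g. SUCCEEDED / FAILED
    total = rule.get("sampleSize", 0)
    failed = rule.get("failedCount", 0)
    score = 100.0 * (total - failed) / total if total else None
    print(f"{name}: status={status}, score={score}")
    # At this point the per-rule scores could be pushed to a governance tool
    # (e.g., via its REST API) for KPI reporting.
```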

The bottom part of the architecture is useful when data anomalies need to be remediated and fed into an AWS Glue pipeline. This process can be orchestrated with an AWS Step Functions workflow (see "Prepare, transform, and orchestrate your data using AWS Glue DataBrew, AWS Glue ETL, and AWS Step Functions" on the AWS Big Data Blog).
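
As a rough illustration of that hand-off, the sequencing can also be scripted with boto3 (for example from a Lambda task inside the state machine): run the DataBrew recipe job that remediates the data, wait for it to finish, then start the downstream Glue job. The job names below are placeholders.

```python
# Rough sketch of the remediation hand-off: run a DataBrew recipe job, poll
# until it reaches a terminal state, then trigger the downstream Glue ETL job.
# In practice a Step Functions state machine would handle the waiting and
# retries; the job names here are placeholders.
import time
import boto3

databrew = boto3.client("databrew")
glue = boto3.client("glue")

run_id = databrew.start_job_run(Name="orders_cleanup_recipe_job")["RunId"]

while True:
    state = databrew.describe_job_run(
        Name="orders_cleanup_recipe_job", RunId=run_id
    )["State"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)

if state == "SUCCEEDED":
    # Feed the remediated output into the Glue pipeline.
    glue.start_job_run(JobName="orders_etl_glue_job")
else:
    raise RuntimeError(f"DataBrew job run ended in state {state}")
```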

[Architecture diagram: DQ reporting (upper part) and remediation into the AWS Glue pipeline (lower part) using AWS Glue DataBrew]

Deployment: AWS Glue DataBrew resources can be defined as CloudFormation templates, and the resulting stack can be deployed to other environments. All of the DataBrew components (e.g., Project, Recipe, Dataset, Ruleset) are available as CloudFormation resource types. The details can be found in the AWS Glue DataBrew resource type reference in the AWS CloudFormation documentation.
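
As a hedged illustration, the same resource types can also be defined with the AWS CDK in Python, which synthesizes to those CloudFormation templates; the construct and property names follow the AWS::DataBrew::* resource schemas, but all names, buckets, and the check expression below are placeholders.

```python
# Illustrative AWS CDK (Python) stack using the L1 constructs that map to the
# AWS::DataBrew::* CloudFormation resource types. All names, bucket/key values,
# and the check expression are placeholders; verify property shapes against the
# CDK / CloudFormation documentation.
from aws_cdk import Aws, Stack
from aws_cdk import aws_databrew as databrew
from constructs import Construct


class DataBrewDqStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # AWS::DataBrew::Dataset pointing at source files in S3
        databrew.CfnDataset(
            self, "OrdersDataset",
            name="orders_dataset",
            input=databrew.CfnDataset.InputProperty(
                s3_input_definition=databrew.CfnDataset.S3LocationProperty(
                    bucket="my-source-bucket", key="orders/"
                )
            ),
        )

        # AWS::DataBrew::Ruleset attached to the dataset by ARN
        databrew.CfnRuleset(
            self, "OrdersRuleset",
            name="orders_dq_rules",
            target_arn=f"arn:aws:databrew:{Aws.REGION}:{Aws.ACCOUNT_ID}:dataset/orders_dataset",
            rules=[
                databrew.CfnRuleset.RuleProperty(
                    name="order_id_not_null",
                    check_expression=":col1 IS NOT NULL",  # illustrative syntax
                    substitution_map=[
                        databrew.CfnRuleset.SubstitutionValueProperty(
                            value_reference=":col1", value="`order_id`"
                        )
                    ],
                )
            ],
        )
```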

#aws #dataquality #dq #dqoncloud #kpmgindia #databrew #awsglue

