Solving Data Quality issues using Machine Learning
Background
Data quality issues have been prevalent in every organization for years. When every system was manual, resolving a DQ issue meant going through each transaction or document and correcting the errors by manual reconciliation. With the advent of new technologies, we started using Excel, database queries, and ETL tools to solve data quality issues.
The problem
The major drivers of data quality issues are:
1. Huge volume of data
2. Multiple source systems providing redundant data
3. Inconsistent data aggregation across the organization
4. Structural and semantic differences between systems
Many organizations have implemented systems and processes to address data quality issues using analytics, static rule-based checks, and/or manual review.
Any change to a source system requires the corresponding data quality programs/processes (DQPs) to be updated. Such frequent updates are a significant overhead for any organization.
The Solution
What is needed is a consolidated, seamless, scalable, and easily integrable solution that detects and improves data quality in near real time using an embedded hybrid (rule-based + machine learning) approach.
This hybrid machine-learning-based solution should include the following:
1. Static rule-based validation: checks on rows and columns for missing values, duplicates, nulls, and NaNs, plus metadata management.
2. Machine-learning-based validation: detection of anomalies (numerical and categorical) in data and metadata, with a feedback loop to improve monitoring, training, testing, and validation.
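The static rule-based checks above can be sketched with pandas. This is a minimal, illustrative example; the DataFrame columns (`customer_id`, `email`, `phone`) and the shape of the report are assumptions, not part of any specific product.

```python
import pandas as pd

def run_static_checks(df: pd.DataFrame, required_columns: list) -> dict:
    """Basic rule-based data quality report: schema, nulls/NaNs, duplicates."""
    return {
        # Metadata check: are all expected columns present?
        "missing_columns": [c for c in required_columns if c not in df.columns],
        # Null / NaN counts per column
        "null_counts": df.isna().sum().to_dict(),
        # Fully duplicated rows
        "duplicate_rows": int(df.duplicated().sum()),
        "row_count": len(df),
    }

# Illustrative customer data with one duplicate row and two missing emails.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, None, "b@x.com"],
})
print(run_static_checks(df, ["customer_id", "email", "phone"]))
```

In practice each rule would carry a severity and a remediation action (reject, quarantine, or auto-fix), which is where the feedback loop of the ML-based validation takes over.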
Use case
Optimizing Marketing Campaigns with better data quality
The company is a financial services provider in wealth management. Its marketing campaigns rely on correct customer information: contact details, buying and selling habits, and web-tracking data.
Problem
The customer data comes from various product systems with different semantics, formats, etc. Redundancy and errors in the customer data can undermine the outcome of a campaign.
Solution
To achieve the highest quality of customer data, the incoming data from the various sources must be cleansed and its errors resolved. The likely steps are:
1. Define / configure the input or independent parameters for which DQ checks are to be done
2. Identify whether each input parameter is numeric or categorical
3. Encode the categorical data into a numeric representation
4. Select the correct algorithm (for example: K-means, Gaussian/Elliptic, Markov Chain, Isolation Forest, SVM, RNN) based on the type of input data
5. Detect anomalies in the input data
6. Review and provide feedback on the anomalies
7. Re-run the model and check whether the anomalies are flagged correctly; repeat from step 6 until they are
8. Filter out the anomalies and proceed with the accurate data (which is generally sent to the upstream systems)
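Steps 2 through 5 and step 8 above can be sketched with scikit-learn. Isolation Forest is chosen here as one of the algorithms the list mentions; the column names, the injected outlier, and the contamination rate are illustrative assumptions that would be tuned through the review/feedback loop of steps 6 and 7.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Illustrative customer data: "trade_amount" is numeric, "channel" categorical.
df = pd.DataFrame({
    "trade_amount": [100, 120, 110, 95, 105, 9000],  # 9000 is an injected outlier
    "channel": ["web", "web", "app", "web", "app", "web"],
})

# Step 3: encode the categorical column (one-hot encoding here).
features = pd.get_dummies(df, columns=["channel"])

# Steps 4-5: Isolation Forest suits mixed tabular data; contamination is a
# guessed anomaly rate, to be refined via the feedback loop (steps 6-7).
model = IsolationForest(contamination=0.2, random_state=42)
df["anomaly"] = model.fit_predict(features)  # -1 = anomaly, 1 = normal

# Step 8: filter out anomalies and pass the clean rows to upstream systems.
clean = df[df["anomaly"] == 1].drop(columns="anomaly")
print(f"Flagged {int((df['anomaly'] == -1).sum())} anomalous row(s)")
```

The reviewer feedback of steps 6 and 7 would typically adjust the contamination rate or switch algorithms (e.g., to Elliptic Envelope or one-class SVM) when the flagged rows disagree with the reviewer's judgment.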
Conclusion
With these algorithms, it becomes much easier to identify data quality issues; even if the metadata of the incoming input files changes, the algorithm can still be reconfigured, retrained, and re-run. Auto-correction of errors can improve by up to 30%, depending on the data volume and the number of source systems.