Solving Data Quality issues using Machine Learning

Background

Data quality issues have been prevalent in every organization for years. In the past, every system was manual, and the only way to resolve a DQ issue was to go through each transaction or document and correct the errors by manual reconciliation. With the advent of new technologies, we started using Excel, database queries, and ETL tools to solve data quality issues.

The problem

The major drivers of data quality issues are:

1.     Huge volumes of data

2.     Multiple source systems providing redundant data

3.     Inconsistent data aggregation across the organization

4.     Differences in structure and semantics

Currently, many organizations have implemented systems and processes to address data quality issues using analytics, static rule-based checks, and/or manual review processes.

Any change to a source system requires the data quality programs/processes (DQP) to be updated accordingly. Such frequent updates to the DQP are a big overhead for any organization.

The Solution

A consolidated, seamless, scalable, and easily integrable solution that detects and improves data quality in near real time, with an embedded hybrid (rule-based + machine learning) approach to resolving data quality problems.

This hybrid machine-learning-based solution should include the following:

1. Static rule-based validation: checks on rows and columns for missing values, duplicates, nulls, NaNs, and metadata management.

2. Machine-learning-based validation: detect anomalies (numerical and categorical) in data and metadata, with a feedback loop to improve monitoring, training, testing, and validation.
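The static rule-based checks in point 1 can be sketched with pandas. This is a minimal illustration, not the actual solution: the DataFrame, the column names, and the specific check set are hypothetical assumptions chosen for the example.

```python
import pandas as pd

def run_static_checks(df: pd.DataFrame, required_cols: list) -> dict:
    """Run basic rule-based DQ checks and return a summary of violations."""
    return {
        # Columns expected by the metadata definition but absent from the file
        "missing_columns": [c for c in required_cols if c not in df.columns],
        # Fully duplicated rows
        "duplicate_rows": int(df.duplicated().sum()),
        # Null/NaN counts per column
        "null_counts": df.isna().sum().to_dict(),
        # Row/column shape, useful for volume checks against expectations
        "shape": df.shape,
    }

# Hypothetical customer extract with one duplicate row and one null email
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", "b@x.com", "b@x.com", None],
})
print(run_static_checks(df, ["customer_id", "email", "phone"]))
```

A report like this can feed the metadata management layer: each run is compared against the expected schema and thresholds, and violations are routed to the review process.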

Solving Data Quality Errors

Use case

Optimizing Marketing Campaigns with better data quality

The company is a financial services provider in wealth management. Its marketing campaigns rely on correct customer information. The information required for a campaign includes contact details, buying and selling habits, and web tracking data.

Problem

Customer data comes from various product systems with different semantics, formats, etc. Redundancy and errors in the customer data can impact the outcome of a campaign.

Solution

To achieve the highest quality of customer data, it is important to cleanse the incoming data from the various sources and resolve the errors. The likely steps are:

1.      Define / configure the input (independent) parameters for which DQ checks are to be done

2.      Identify whether each input parameter is numeric or categorical

3.      Transform the categorical data into a numeric representation

4.      Select a suitable algorithm (for example: K-means, Gaussian/Elliptic Envelope, Markov Chain, Isolation Forest, SVM, RNN) based on the type of input data

5.      Generate anomalies from the input data

6.      Review and provide feedback on the anomalies

7.      Re-run the model and check whether anomalies are detected correctly (otherwise return to step 6) until the anomalies generated are accurate

8.      Filter out all anomalies and proceed with the accurate data (which is generally sent to the upstream systems)
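The steps above can be sketched with scikit-learn, using Isolation Forest (one of the algorithms listed in step 4). This is an illustrative sketch, not the deployed solution: the customer records, the column names ("income", "segment"), and the contamination setting are hypothetical assumptions, and the human feedback loop of steps 6-7 is omitted.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical customer records; row 3 carries an extreme income value
df = pd.DataFrame({
    "income": [52_000, 48_000, 51_000, 1_000_000, 50_000, 49_500],
    "segment": ["retail", "retail", "retail", "retail", "retail", "retail"],
})

# Steps 1-2: configure the parameters to check and classify them by type
numeric_cols = ["income"]
categorical_cols = ["segment"]

# Step 3: transform categorical data into a numeric representation
preprocess = ColumnTransformer([
    ("num", "passthrough", numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Step 4: Isolation Forest, suitable for mixed tabular data
model = Pipeline([
    ("prep", preprocess),
    ("iforest", IsolationForest(contamination=0.2, random_state=42)),
])

# Step 5: generate anomalies (-1 = anomaly, 1 = normal)
df["is_anomaly"] = model.fit_predict(df[numeric_cols + categorical_cols]) == -1

# Step 8: filter out anomalies and proceed with the accurate data
clean = df[~df["is_anomaly"]]
print(df)
```

In practice, the reviewer feedback from steps 6-7 would adjust the configuration (e.g., the contamination rate or the feature set) before re-running the model, until the flagged anomalies match the reviewers' judgment.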

Conclusion

With such algorithms, it becomes much easier to identify data quality issues; even if the metadata of the incoming input files changes, it is still straightforward to reconfigure the algorithm, retrain it, and process the data. Auto-correction of errors improves by up to 30%, depending on the volume and number of source systems.
