Solving Data Quality issues using Machine Learning
Background
Data quality issues have been prevalent in every organization for years. When every system was manual, resolving a DQ issue meant going through each transaction or document and correcting the errors by manual reconciliation. With the advent of new technologies, we started using Excel, database queries, and ETL tools to solve data quality issues.
The problem
The major drivers of data quality issues are:
1. Huge volume of data
2. Multiple source systems providing redundant data
3. Inconsistent data aggregation across the organization
4. Structural and semantic differences between systems
Many organizations have implemented systems and processes to address data quality issues using analytics, static rule-based checks, and/or manual review.
Any change to a source system requires the corresponding data quality programs/processes (DQPs) to be updated. Such frequent updates are a significant overhead for any organization.
The Solution
What is needed is a consolidated, seamless, scalable, and easily integrable solution that detects and improves data quality in near real time using an embedded hybrid (rule-based + machine learning) approach.
This hybrid machine-learning-based solution should include the following:
1. Static rule-based validation: checks on rows and columns for missing values, duplicates, nulls, and NaNs, plus metadata management.
2. Machine-learning-based validation: detection of anomalies (numerical and categorical) in data and metadata, with a feedback loop to improve monitoring, training, testing, and validation.
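The static rule-based checks above can be sketched with pandas. This is a minimal, illustrative example; the DataFrame columns (`customer_id`, `email`, `phone`) and the shape of the report are assumptions, not part of any specific product.

```python
import pandas as pd

def run_static_checks(df: pd.DataFrame, required_columns: list) -> dict:
    """Basic rule-based data quality report: schema, nulls/NaNs, duplicates."""
    return {
        # Metadata check: are all expected columns present?
        "missing_columns": [c for c in required_columns if c not in df.columns],
        # Null / NaN counts per column
        "null_counts": df.isna().sum().to_dict(),
        # Fully duplicated rows
        "duplicate_rows": int(df.duplicated().sum()),
        "row_count": len(df),
    }

# Illustrative customer data with one duplicate row and two missing emails.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, None, "b@x.com"],
})
print(run_static_checks(df, ["customer_id", "email", "phone"]))
```

In practice each rule would carry a severity and a remediation action (reject, quarantine, or auto-fix), which is where the feedback loop of the ML-based validation takes over.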
Use case
Optimizing Marketing Campaigns with better data quality
The company is a financial services provider in wealth management. Its marketing campaigns rely on correct customer information: contact details, buying and selling habits, and web-tracking data.
Problem
The customer data comes from various product systems with different semantics, formats, etc. Redundancy and errors in the customer data can undermine the outcome of a campaign.
Solution
To achieve the highest quality of customer data, the incoming data from the various sources must be cleansed and its errors resolved. The likely steps are:
1. Define / configure the input or independent parameters for which DQ checks are to be done
2. Identify whether each input parameter is numeric or categorical
3. Encode the categorical data into a numeric representation
4. Select the correct algorithm (for example: K-means, Gaussian/Elliptic, Markov Chain, Isolation Forest, SVM, RNN) based on the type of input data
5. Detect anomalies in the input data
6. Review and provide feedback on the anomalies
7. Re-run the model and check whether the anomalies are flagged correctly; repeat from step 6 until they are
8. Filter out the anomalies and proceed with the accurate data (which is generally sent to the upstream systems)
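Steps 2 through 5 and step 8 above can be sketched with scikit-learn. Isolation Forest is chosen here as one of the algorithms the list mentions; the column names, the injected outlier, and the contamination rate are illustrative assumptions that would be tuned through the review/feedback loop of steps 6 and 7.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Illustrative customer data: "trade_amount" is numeric, "channel" categorical.
df = pd.DataFrame({
    "trade_amount": [100, 120, 110, 95, 105, 9000],  # 9000 is an injected outlier
    "channel": ["web", "web", "app", "web", "app", "web"],
})

# Step 3: encode the categorical column (one-hot encoding here).
features = pd.get_dummies(df, columns=["channel"])

# Steps 4-5: Isolation Forest suits mixed tabular data; contamination is a
# guessed anomaly rate, to be refined via the feedback loop (steps 6-7).
model = IsolationForest(contamination=0.2, random_state=42)
df["anomaly"] = model.fit_predict(features)  # -1 = anomaly, 1 = normal

# Step 8: filter out anomalies and pass the clean rows to upstream systems.
clean = df[df["anomaly"] == 1].drop(columns="anomaly")
print(f"Flagged {int((df['anomaly'] == -1).sum())} anomalous row(s)")
```

The reviewer feedback of steps 6 and 7 would typically adjust the contamination rate or switch algorithms (e.g., to Elliptic Envelope or one-class SVM) when the flagged rows disagree with the reviewer's judgment.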
Conclusion
With these algorithms, it becomes much easier to identify data quality issues; even if the metadata of the incoming input files changes, the algorithm can still be reconfigured, retrained, and re-run. Auto-correction of errors can improve by up to 30%, depending on the data volume and the number of source systems.