Data Quality
The goal of data quality is to ensure that:
Data is accurate – that is, it does not contain duplicates, outliers, etc.
Data is complete – all of the data is captured, with no missing information.
Data is ‘timely’ – the data is captured in the correct order, if order is a key attribute of the data.
The ultimate goal is to enhance ‘trust’ in the data! Data quality has an impact on decision making and can lead to ‘bad data cost’. ML models running on this data can amplify these errors.
Improving Data Quality
Some practices that are useful for improving the quality of the data:
1. Create a scorecard – containing information about the data sources – with the metrics measuring the attributes – Accuracy, completeness & timeliness.
2. Prioritization – Prioritize the different sources of data to focus on the most critical ones first.
3. Annotation – standardization of ‘quality information’ assigned to the data sources
4. Profiling – Profiling involves
a. Generating a data profile – missing values, out-of-bounds/data outliers, range, min/max, cardinality
b. Data deduplication – remove duplicates
c. Data Outliers – detect and eliminate them early on
5. Lineage Tracking: If you can identify source datasets of high quality, you can follow the use of that high-quality data downstream and express an opinion about the quality of the results.
6. Data completeness – Are missing values ‘OK’ to be missing? Can you add annotations to indicate that the records are ‘OK’ to be missing?
7. Merging data – Here you need to consider cases like ‘NULL’ and zero/missing values that might be equivalent; these equivalences need to be recorded.
8. Data conflict resolution
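A minimal sketch of step 4 (profiling) in Python. The record layout and field names here are hypothetical, purely for illustration; real profiling would typically use a dedicated tool or library:

```python
from collections import Counter

def profile(records, field):
    """Build a simple quality profile for one field across records.

    `records` is a hypothetical list of dicts; `field` is the column
    to profile. Captures the signals the notes mention: missing
    values, min/max range, cardinality, and duplicates.
    """
    values = [r.get(field) for r in records]
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "missing": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "cardinality": len(counts),
        "duplicates": sum(c - 1 for c in counts.values()),
    }

# A value like 120 for an age field would surface as the max,
# prompting an out-of-bounds check.
rows = [{"age": 34}, {"age": 34}, {"age": None}, {"age": 120}]
print(profile(rows, "age"))
```

Running the profile regularly and comparing it against earlier runs is one way to turn profiling into the quality signals used for monitoring later.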
Data Lineage
Lineage is the recording of the ‘path’ that data takes as it travels through the DLC (data life cycle).
It is possible to show lineage in multiple ways – as a graph at the dataset level, or in finer detail down to individual columns.
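A dataset-level lineage graph can be sketched as a simple mapping from each dataset to its parents; the dataset names below are hypothetical:

```python
# Minimal lineage graph: each dataset maps to the datasets it was
# derived from. Walking the graph upstream recovers the full ‘path’
# the data took through its life cycle.
lineage = {
    "raw_orders": [],
    "raw_customers": [],
    "clean_orders": ["raw_orders"],
    "sales_report": ["clean_orders", "raw_customers"],
}

def upstream(dataset, graph):
    """Return every source dataset that feeds into `dataset`."""
    sources = set()
    for parent in graph.get(dataset, []):
        sources.add(parent)
        sources |= upstream(parent, graph)
    return sources

print(sorted(upstream("sales_report", lineage)))
```

If `raw_orders` is known to be high quality, this traversal is what lets you extend that opinion to `sales_report`.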
Data Protection
Recommended by LinkedIn
How to plan for data protection?
a. Lineage & Quality – provide the capability to catch erroneous or malicious use of data
b. Level of Protection – should be justified by the cost & likelihood of a data breach
c. Data Classification – to provision access to data assets, prevent unauthorized access, and provide data security
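One way classification drives access provisioning is an ordered set of levels checked against a user’s clearance. The level names here are hypothetical examples, not a standard scheme:

```python
# Hypothetical classification levels, ordered from least to most
# sensitive. Access requires clearance at or above the asset's level.
LEVELS = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}

def can_access(user_clearance, asset_classification):
    """Return True if the clearance covers the asset's classification."""
    return LEVELS[user_clearance] >= LEVELS[asset_classification]

print(can_access("internal", "confidential"))  # an analyst vs. sensitive data
print(can_access("restricted", "public"))      # high clearance, open data
```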
On the cloud – Cloud providers have native security policies that offer more modern ways to achieve security goals like multi-tenancy etc.
Physical security – these are physical security protocols (access card etc.) employed at data centres.
Network security – generally uses network-perimeter techniques
Identity & Access Management
Differential privacy – a state achieved when a dataset containing sensitive values is shared without exposing the individuals involved. It is achieved by adding carefully calibrated statistical noise to query results; a related technique is k-anonymity (aggregation so that each individual is indistinguishable among at least k records).
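The noise-addition idea can be sketched for the simplest case, a counting query. This is an illustrative sketch, not a production mechanism; the function name and epsilon default are assumptions:

```python
import math
import random

def noisy_count(true_count, epsilon=1.0):
    """Return a count protected by epsilon-differential privacy.

    A counting query changes by at most 1 when one individual is
    added or removed (sensitivity 1), so Laplace noise with scale
    1/epsilon suffices. Smaller epsilon = more noise, more privacy.
    """
    u = random.random() - 0.5                  # uniform on [-0.5, 0.5)
    scale = 1.0 / epsilon
    sign = 1.0 if u >= 0 else -1.0
    # Inverse-CDF sampling from the Laplace(0, scale) distribution.
    noise = -scale * sign * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

random.seed(7)
print(noisy_count(1000, epsilon=0.5))   # the true count, slightly perturbed
```

The published value is close enough to be useful in aggregate, but no single individual’s presence can be inferred from it.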
Data Exfiltration
Data exfiltration is a scenario in which an authorized person or application extracts data they are allowed to access and shares it with an unauthorized third party or system. Some measures that can be employed to prevent or limit data exfiltration:
· Compartmentalization of data by line of business
· Grant access in a timebound manner
· Log the access trail
· Deploy anomaly detection to identify suspicious behaviour
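Two of the measures above — time-bound grants and logging the access trail — can be sketched together. The user and dataset names are hypothetical:

```python
import time

# Each grant records who may access what, and when the grant expires.
# Every access decision is appended to a log, keeping an audit trail.
grants = {}       # (user, dataset) -> expiry, in epoch seconds
access_log = []   # (user, dataset, allowed) tuples

def grant(user, dataset, ttl_seconds):
    """Allow `user` to read `dataset` for a limited time only."""
    grants[(user, dataset)] = time.time() + ttl_seconds

def check_access(user, dataset):
    """Return whether access is allowed right now, and log the attempt."""
    allowed = grants.get((user, dataset), 0) > time.time()
    access_log.append((user, dataset, allowed))
    return allowed

grant("ana", "payroll", ttl_seconds=3600)
print(check_access("ana", "payroll"))   # True: within the granted hour
print(check_access("bob", "payroll"))   # False: no grant exists
```

The `access_log` is also the raw input an anomaly detector would consume to flag suspicious behaviour.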
Monitoring
Monitoring is similar to performance management in that it measures the value generated from governance initiatives, with alerting, accounting and auditing capabilities. Auditing is required to show that the system is performing as designed.
The other key goal of monitoring should be to meet compliance needs – i.e. it meets all laws and regulations.
What should be monitored?
Data Quality – Data should be complete and accurate. The key ingredients are to establish a baseline and develop quality signals; anomalies can then be detected statistically against that baseline.
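The baseline-and-signal idea can be sketched with a z-score test. The daily row counts and the threshold of 3 standard deviations are illustrative assumptions:

```python
import statistics

def quality_signal(history, today, z_threshold=3.0):
    """Flag today's value as anomalous if it sits more than
    `z_threshold` standard deviations from the historical baseline.

    `history` is a hypothetical list of daily row counts for one
    pipeline; the baseline is simply their mean and spread.
    """
    baseline = statistics.mean(history)
    spread = statistics.stdev(history)
    z = (today - baseline) / spread
    return abs(z) > z_threshold

history = [1000, 1020, 980, 1010, 990]   # established baseline
print(quality_signal(history, 400))      # sudden drop -> likely incomplete load
print(quality_signal(history, 1005))     # within normal variation
```

In practice the same pattern applies to any profiled metric (missing-value rate, cardinality, etc.), not just row counts.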
Data Lineage Monitoring – provide the audit trail on how the data travelled and transformed
Security incidents & compliance failures.