Data Quality

The goal of data quality is to ensure that:

Data is accurate – it does not contain duplicates, outliers, etc.

Data is complete – all of the data is captured, with no missing information.

Data is timely – the data is captured in the correct order, if ordering is a key attribute of the data.

The ultimate goal is to enhance trust in the data. Poor data quality can distort decision making and lead to ‘bad data cost’, and ML models running on that data can amplify these errors.

Improving Data Quality

Some practices that are useful for improving the quality of the data:

1. Scorecard – create a scorecard containing information about the data sources, with metrics measuring the key attributes: accuracy, completeness and timeliness.

2. Prioritization – prioritize the different sources of data so the most critical ones are addressed first.

3. Annotation – standardize the ‘quality information’ assigned to each data source.

4. Profiling – profiling involves:

a. Generating a data profile – missing values, out-of-bounds values/outliers, range, min/max, cardinality

b. Data deduplication – removing duplicate records

c. Outlier detection – detect and eliminate outliers early on

5. Lineage tracking – if you can identify high-quality source datasets, you can follow where that data is used and form an opinion on the quality of the results.

6. Data completeness – are missing values acceptable? Can you add annotations to indicate that records are ‘OK’ to be missing?

7. Merging data – consider cases such as NULL, zero and missing values, which might be equivalent; these equivalences need to be recorded.

8. Data conflict resolution – how do we rank datasets providing the same information? Can we establish a hierarchy based on the quality of the data derived from these sources?
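As an illustration of the profiling and deduplication practices above, here is a minimal sketch in plain Python. The column profile, the 1.5 × IQR outlier rule and the record format are illustrative assumptions, not something prescribed by the article:

```python
def profile_column(values):
    """Generate a simple data profile for a numeric column:
    missing count, min/max range, cardinality and outliers."""
    present = [v for v in values if v is not None]
    profile = {
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "cardinality": len(set(present)),
    }
    # Flag out-of-bounds values with the common 1.5 * IQR rule
    s = sorted(present)
    q1, q3 = s[len(s) // 4], s[(3 * len(s)) // 4]
    iqr = q3 - q1
    profile["outliers"] = [v for v in present
                           if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    return profile


def deduplicate(records):
    """Remove exact duplicate records while preserving order."""
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

For example, `profile_column([1, 2, 2, 3, None, 100])` reports one missing value and flags 100 as an outlier.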

Data Lineage

Lineage is the recording of the ‘path’ that data takes as it travels through the data life cycle (DLC).

It is possible to show lineage in multiple ways – for example, as a graph, or at the column level.

[Figure: Lineage by Graphs]


[Figure: Lineage by Columns]
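The graph view of lineage can be modelled as a simple parent-pointer structure. The sketch below (the dataset names are hypothetical) walks the recorded edges to recover every upstream source of a derived dataset:

```python
def upstream_sources(lineage, dataset):
    """Collect every upstream ancestor of `dataset` by walking the
    recorded edges (each dataset maps to its direct parents)."""
    visited, stack = set(), [dataset]
    while stack:
        node = stack.pop()
        for parent in lineage.get(node, []):
            if parent not in visited:
                visited.add(parent)
                stack.append(parent)
    return visited


# Hypothetical lineage graph: report <- sales_clean <- {sales_raw, fx_rates}
lineage = {
    "report": ["sales_clean"],
    "sales_clean": ["sales_raw", "fx_rates"],
}
```

If `sales_raw` and `fx_rates` are known high-quality sources, a walk like this lets you express an opinion on the quality of `report`, as the lineage-tracking practice above suggests.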



Data Protection

How to plan for data protection?

a. Lineage & quality – provide the capability to catch erroneous or malicious use of data

b. Level of protection – should be justified by the cost and likelihood of a data breach

c. Data classification – to provision access to data assets and prevent unauthorized access

Mechanisms to provide data security:

On the cloud – cloud providers have native security policies that offer more modern ways to achieve security goals, such as multi-tenancy.

Physical security – the physical security protocols (access cards, etc.) employed at data centres.

Network security – generally uses network perimeter techniques.

Identity & Access Management – includes authentication (specifying who has access), authorization (specifying the level of access) and auditing (recording who has accessed what, and when).

Differential privacy – a situation achieved when a dataset containing sensitive values is shared in a form that does not expose the individuals involved. It is achieved with standard techniques such as k-anonymity (aggregating groups of k individuals) and adding statistically insignificant noise.
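A rough sketch of the two techniques just mentioned. The record layout, quasi-identifier choice and noise mechanism here are illustrative assumptions; real deployments need careful calibration of sensitivity and epsilon:

```python
import math
import random
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """k-anonymity check: every combination of quasi-identifier values
    must be shared by at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def noisy_sum(values, epsilon, sensitivity=1.0):
    """Release a sum with Laplace noise of scale sensitivity/epsilon
    (the standard Laplace mechanism)."""
    u = random.random() - 0.5
    noise = -(sensitivity / epsilon) * math.copysign(1, u) * math.log(1 - 2 * abs(u))
    return sum(values) + noise
```

With a large epsilon the noisy sum stays close to the true value; smaller epsilon means stronger privacy and more noise.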

Data Exfiltration

Data exfiltration is a scenario in which an authorized person or application extracts data they are allowed to access and shares it with an unauthorized third party or system. Some measures that can be employed to prevent or limit data exfiltration:

· Compartmentalize data by line of business

· Grant access in a time-bound manner

· Log the access trail

· Deploy anomaly detection to identify suspicious behaviour
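The time-bound access and audit-trail measures could be combined in a small access-control object; a sketch follows, where the class and field names are hypothetical:

```python
import time


class AccessGrant:
    """A time-bound data access grant that logs every access check,
    so the trail can feed downstream anomaly detection."""

    def __init__(self, user, dataset, ttl_seconds):
        self.user = user
        self.dataset = dataset
        self.expires_at = time.time() + ttl_seconds
        self.audit_log = []  # (timestamp, user, dataset, allowed)

    def check(self, now=None):
        now = time.time() if now is None else now
        allowed = now < self.expires_at
        self.audit_log.append((now, self.user, self.dataset, allowed))
        return allowed
```

Once the grant expires, `check()` denies access automatically, and every attempt – allowed or not – remains in the audit log.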



Monitoring Governance

Monitoring governance is similar to performance management: it measures the value generated from governance initiatives and provides alerting, accounting and auditing capabilities. Auditing is required to show that the system is performing as designed.

The other key goal of monitoring should be to meet compliance needs – i.e. to ensure that all applicable laws and regulations are met.

What should be monitored?

Data quality – data should be complete and accurate. The key ingredients are to establish a baseline and develop quality signals; anomalies can then be detected statistically.
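One way to turn a baseline into a statistical quality signal is a simple z-score test. In the sketch below, the null-rate metric and the 3-sigma threshold are illustrative assumptions:

```python
import statistics


def quality_alert(baseline_rates, current_rate, threshold=3.0):
    """Alert when the current metric (e.g. a column's null rate)
    deviates from the historical baseline by more than
    `threshold` standard deviations."""
    mean = statistics.mean(baseline_rates)
    stdev = statistics.stdev(baseline_rates)
    return abs(current_rate - mean) / stdev > threshold
```

For instance, against a baseline of daily null rates around 1%, a sudden jump to 30% would trigger the alert while normal day-to-day variation would not.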

Data lineage monitoring – provides the audit trail of how the data travelled and was transformed.

Security incidents & compliance failures.



