Data Quality
The goal of data quality is to ensure that:
Data is accurate – that is, it does not contain duplicates, outliers, etc.
Data is complete – all of the data is captured, with no missing information.
Data is ‘timely’ – the data is captured in the correct order, if order is a key attribute of the data.
The ultimate goal is to enhance ‘trust’ in the data! Data quality has an impact on decision making and can lead to ‘bad data cost’. ML models running on this data can amplify these errors.
Improving Data Quality
Some practices that are useful for improving the quality of the data:
1. Create a scorecard – containing information about the data sources – with the metrics measuring the attributes – Accuracy, completeness & timeliness.
2. Prioritization – Prioritize the different sources of data to focus on the most critical ones first.
3. Annotation – standardization of ‘quality information’ assigned to the data sources
4. Profiling – Profiling involves
a. Generating a data profile – missing values, out-of-bounds/data outliers, range, min/max, cardinality
b. Data deduplication – remove duplicates
c. Data Outliers – detect and eliminate them early on
5. Lineage Tracking: If you can identify source datasets of high quality, you can follow the use of that high-quality data downstream and express an opinion about the quality of the results.
6. Data completeness – Are missing values ‘OK’ to be missing? Can you add annotations to indicate that the records are ‘OK’ to be missing?
7. Merging data – Here you need to consider cases like ‘NULL’ and zero/missing values that might be equivalent; these equivalences need to be recorded.
8. Data conflict resolution
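A minimal sketch of step 4 (profiling) in Python. The record layout and field names here are hypothetical, purely for illustration; real profiling would typically use a dedicated tool or library:

```python
from collections import Counter

def profile(records, field):
    """Build a simple quality profile for one field across records.

    `records` is a hypothetical list of dicts; `field` is the column
    to profile. Captures the signals the notes mention: missing
    values, min/max range, cardinality, and duplicates.
    """
    values = [r.get(field) for r in records]
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "missing": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "cardinality": len(counts),
        "duplicates": sum(c - 1 for c in counts.values()),
    }

# A value like 120 for an age field would surface as the max,
# prompting an out-of-bounds check.
rows = [{"age": 34}, {"age": 34}, {"age": None}, {"age": 120}]
print(profile(rows, "age"))
```

Running the profile regularly and comparing it against earlier runs is one way to turn profiling into the quality signals used for monitoring later.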
Data Lineage
Lineage is the recording of the ‘path’ that data takes as it travels through the DLC (data life cycle).
It is possible to show lineage in multiple ways – as a graph at the dataset level, or in finer detail down to individual columns.
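A dataset-level lineage graph can be sketched as a simple mapping from each dataset to its parents; the dataset names below are hypothetical:

```python
# Minimal lineage graph: each dataset maps to the datasets it was
# derived from. Walking the graph upstream recovers the full ‘path’
# the data took through its life cycle.
lineage = {
    "raw_orders": [],
    "raw_customers": [],
    "clean_orders": ["raw_orders"],
    "sales_report": ["clean_orders", "raw_customers"],
}

def upstream(dataset, graph):
    """Return every source dataset that feeds into `dataset`."""
    sources = set()
    for parent in graph.get(dataset, []):
        sources.add(parent)
        sources |= upstream(parent, graph)
    return sources

print(sorted(upstream("sales_report", lineage)))
```

If `raw_orders` is known to be high quality, this traversal is what lets you extend that opinion to `sales_report`.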
Data Protection
Recommended by LinkedIn
How to plan for data protection?
a. Lineage & Quality – provide the capability to catch erroneous or malicious use of data
b. Level of Protection – should be justified by the cost & likelihood of a data breach
c. Data Classification – to provision access to data assets, prevent unauthorized access, and provide data security
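One way classification drives access provisioning is an ordered set of levels checked against a user’s clearance. The level names here are hypothetical examples, not a standard scheme:

```python
# Hypothetical classification levels, ordered from least to most
# sensitive. Access requires clearance at or above the asset's level.
LEVELS = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}

def can_access(user_clearance, asset_classification):
    """Return True if the clearance covers the asset's classification."""
    return LEVELS[user_clearance] >= LEVELS[asset_classification]

print(can_access("internal", "confidential"))  # an analyst vs. sensitive data
print(can_access("restricted", "public"))      # high clearance, open data
```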
On the cloud – Cloud providers have native security policies that offer more modern ways to achieve security goals like multi-tenancy etc.
Physical security – these are physical security protocols (access card etc.) employed at data centres.
Network security – generally uses network-perimeter techniques
Identity & Access Management
Differential privacy – a state achieved when a dataset containing sensitive values is shared without exposing the individuals involved. It is achieved by adding carefully calibrated statistical noise to query results; a related technique is k-anonymity (aggregation so that each individual is indistinguishable among at least k records).
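The noise-addition idea can be sketched for the simplest case, a counting query. This is an illustrative sketch, not a production mechanism; the function name and epsilon default are assumptions:

```python
import math
import random

def noisy_count(true_count, epsilon=1.0):
    """Return a count protected by epsilon-differential privacy.

    A counting query changes by at most 1 when one individual is
    added or removed (sensitivity 1), so Laplace noise with scale
    1/epsilon suffices. Smaller epsilon = more noise, more privacy.
    """
    u = random.random() - 0.5                  # uniform on [-0.5, 0.5)
    scale = 1.0 / epsilon
    sign = 1.0 if u >= 0 else -1.0
    # Inverse-CDF sampling from the Laplace(0, scale) distribution.
    noise = -scale * sign * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

random.seed(7)
print(noisy_count(1000, epsilon=0.5))   # the true count, slightly perturbed
```

The published value is close enough to be useful in aggregate, but no single individual’s presence can be inferred from it.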
Data Exfiltration
Data exfiltration is a scenario in which an authorized person or application extracts data they are allowed to access and shares it with an unauthorized third party or system. Some measures that can be employed to prevent or limit data exfiltration:
· Compartmentalization of data by line of business
· Grant access in a timebound manner
· Log the access trail
· Deploy anomaly detection to identify suspicious behaviour
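Two of the measures above — time-bound grants and logging the access trail — can be sketched together. The user and dataset names are hypothetical:

```python
import time

# Each grant records who may access what, and when the grant expires.
# Every access decision is appended to a log, keeping an audit trail.
grants = {}       # (user, dataset) -> expiry, in epoch seconds
access_log = []   # (user, dataset, allowed) tuples

def grant(user, dataset, ttl_seconds):
    """Allow `user` to read `dataset` for a limited time only."""
    grants[(user, dataset)] = time.time() + ttl_seconds

def check_access(user, dataset):
    """Return whether access is allowed right now, and log the attempt."""
    allowed = grants.get((user, dataset), 0) > time.time()
    access_log.append((user, dataset, allowed))
    return allowed

grant("ana", "payroll", ttl_seconds=3600)
print(check_access("ana", "payroll"))   # True: within the granted hour
print(check_access("bob", "payroll"))   # False: no grant exists
```

The `access_log` is also the raw input an anomaly detector would consume to flag suspicious behaviour.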
Monitoring
Monitoring is similar to performance management in that it measures the value generated from governance initiatives, with alerting, accounting and auditing capabilities. Auditing is required to show that the system is performing as designed.
The other key goal of monitoring should be to meet compliance needs – i.e. it meets all laws and regulations.
What should be monitored?
Data Quality – Data should be complete and accurate. The key ingredients are to establish a baseline and develop quality signals; anomalies can then be detected statistically against that baseline.
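The baseline-and-signal idea can be sketched with a z-score test. The daily row counts and the threshold of 3 standard deviations are illustrative assumptions:

```python
import statistics

def quality_signal(history, today, z_threshold=3.0):
    """Flag today's value as anomalous if it sits more than
    `z_threshold` standard deviations from the historical baseline.

    `history` is a hypothetical list of daily row counts for one
    pipeline; the baseline is simply their mean and spread.
    """
    baseline = statistics.mean(history)
    spread = statistics.stdev(history)
    z = (today - baseline) / spread
    return abs(z) > z_threshold

history = [1000, 1020, 980, 1010, 990]   # established baseline
print(quality_signal(history, 400))      # sudden drop -> likely incomplete load
print(quality_signal(history, 1005))     # within normal variation
```

In practice the same pattern applies to any profiled metric (missing-value rate, cardinality, etc.), not just row counts.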
Data Lineage Monitoring – provide the audit trail on how the data travelled and transformed
Security incidents & compliance failures.