Your Data Lineage Fabric
For those of you who don’t know what Data Lineage is, here is a great place to start. It summarizes Data Lineage as:
“Data lineage includes the data’s origins, what happens to it and where it moves over time.”
Data Lineage is central to the value your organization derives from its data. About 90% of the world's data has been created in the last few years, and data is becoming crucial for your organization to survive and eclipse your competitors. Trying to implement a top-class platform to handle this yourself is extremely difficult and expensive.
1. Why it’s Complicated
- Upstream Standards: Data flowing into your organisation will arrive in many different forms at many different times. The quality of this data will vary from robust and industry-controlled to defined ad hoc by in-house employees. Even when the data has a specification, these are often incomplete, have to be inferred from the data, or are constantly changing. On top of this, many sources have nuances that are only learned through experience.
- Storage of Source Data: A lot of organisations keep the primary record of their data in Excel. If this is the case, any change to the spreadsheet (which always happens) will break everything downstream. (By all means use Excel - we do, but make Excel a read-only copy of the data!)
- Distance to Source within the Organisation: Data is usually imported by different departments for different usages at different times. This data is transformed, enriched, manipulated and aggregated many times before being presented to management. It is then extremely hard to "explain" numbers, let alone track them "back out the door".
- Validation of Source Data: A lot of organisations will transform, enrich, manipulate and aggregate source data while importing it and will not store the original data. They will then validate their end data. Whilst this validation is required, without validating the source data, they are creating hours of manual "detective work" for their employees.
- Lack of Audit: A lot of systems will simply overwrite data that has changed upstream, leaving no audit trail in-house. Can you imagine handing a report to the regulators which you then cannot reproduce?
2. What it costs you
- Duplication of Effort: Without a visualization of the Data Space, it is easy for the same data to be imported by different teams, duplicating effort and running the risk of the copies getting out of sync.
- Impaired Downstream Functionality: If your source (and internal) data is not properly validated and/or tagged for errors, all downstream calculations will either be impaired or might simply fail.
- Employee time: Employees spend a lot of time re-keying data, and managers spend a lot of time manually double-checking calculations or data entry. Additionally, if a query comes in from a client or regulator to explain a number, many hours are wasted trying to trace it back to source and then explain the many transformations and steps that created it.
- Reputation Risk: If you are stuck doing many manual processes, errors will creep in. Once this starts recurring, your organisation will start to get a reputation for "bad data", and this can cost you existing as well as new clients.
- Growth Opportunities: If you are doing most of your data transformation in a non-automated fashion, it becomes extremely difficult, and costly, to scale up and/or respond rapidly to new growth opportunities.
- Risk of Identity Theft, Money Laundering (ML) and Financing of Terrorism (FT): If you have no automated data validation procedures, chances are your controls are not as good as they should be. Criminals are getting more and more sophisticated every day and the probability of you being an easy mark will simply increase.
- Money and License: If you are found in breach of compliance rules you could be fined and/or lose your trading license. With MiFID II, GDPR, RDR, FATCA and CRS just to name a few, can you afford not to have validation rules in place?
3. What you need
Every organisation needs a Data Lineage Fabric to mitigate the risks above. As a minimum this should contain:
- Point-in-Time: Each and every item of data entering or created in the organisation needs to be stored in "Point-in-Time" (PIT) format. This then allows you to run any query "as it was at point [T-N] in the past", even if the data has changed multiple times since then (see the first sketch after this list).
- Data Validation Framework: It must be easy to create new validation rules for both source and in-house data at both the "Data" and the "Model" level. "Data" validation would be something like validating a country ISO code. "Model" validation would be something like "if the Client is a US Resident, then TIN needs to be populated" (see the validation sketch after this list). These validation rules need to be applied to the data, and the results need to be visible to non-technical users in real time.
- Automated Traceability: It must be easy for a non-technical user to trace data "back out the door" to source.
- Proactive Data: It must enable the organisation to react proactively rather than reactively.
- Full Audit Trail: It must be easy to provide clients, regulators and auditors with a full audit trail of how, where and when data changed.
- Calculation Framework: It must be easy to visualize which calculation rules were applied to source data to create derived data. NOTE: This derived data can now itself become source data.
- Data Tagging: It must be easy to see visually whether aggregate data contains impaired components. For example, viewing Firm-Wide PnL would show it in RED. Drilling down by currency (for example) would then show GREEN for all except JPY, which is RED. Drilling down by Asset Class would show all GREEN except Bonds, which are tagged with "PRICE_ROLLED" (see the tag-propagation sketch after this list). The time from spotting a data issue (which is now instant!) to finding the cause is then hugely reduced. NOTE: This would also have been caught at the BOND level in the Data Validation stage, but it is still necessary to indicate to the end user that aggregates based on this data need to take it into account.
- Risk and Quality Scores: Each row can be assessed and scored according to validation or risk-based rules, and these scores can be monitored, reported or aggregated into other risk metrics. This enables risk-based AML/CFT as well as triggering actions when data quality drops below certain levels.
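To make the Point-in-Time requirement concrete, here is a minimal sketch in Python, assuming a simple append-only in-memory store (the PITStore class, its record/as_of methods and the field names are hypothetical, for illustration only): updates never overwrite existing rows, and an "as-of" query replays the data as it looked at any earlier moment.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Optional

@dataclass(frozen=True)
class Version:
    key: str                 # business key, e.g. an ISIN or client id
    recorded_at: datetime    # when this version entered our systems
    payload: Dict[str, Any]  # the data as it looked at that moment

class PITStore:
    """Append-only store: updates never overwrite, they add a new version."""

    def __init__(self) -> None:
        self._versions: List[Version] = []

    def record(self, key: str, payload: Dict[str, Any],
               recorded_at: Optional[datetime] = None) -> None:
        self._versions.append(
            Version(key, recorded_at or datetime.utcnow(), dict(payload)))

    def as_of(self, key: str, when: datetime) -> Optional[Dict[str, Any]]:
        """Return the data 'as it was' at `when`, ignoring later corrections."""
        candidates = [v for v in self._versions
                      if v.key == key and v.recorded_at <= when]
        if not candidates:
            return None
        return max(candidates, key=lambda v: v.recorded_at).payload

# Example: a price is loaded and later corrected, yet the original report
# run can still be reproduced exactly.
store = PITStore()
store.record("BOND_XYZ", {"price": 101.2}, datetime(2018, 3, 1, 9, 0))
store.record("BOND_XYZ", {"price": 99.8},  datetime(2018, 3, 2, 9, 0))
print(store.as_of("BOND_XYZ", datetime(2018, 3, 1, 17, 0)))  # {'price': 101.2}
print(store.as_of("BOND_XYZ", datetime(2018, 3, 3, 17, 0)))  # {'price': 99.8}
```

In a real Fabric this would sit on a database with recorded-from/recorded-to columns rather than an in-memory list, but the principle is the same: nothing is ever overwritten.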
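The two levels of validation described above can be sketched as ordinary functions over a row. The rule names, field names and trimmed country list below are hypothetical and are only meant to show the "Data" versus "Model" distinction.

```python
from typing import Callable, Dict, List, NamedTuple

class Issue(NamedTuple):
    level: str    # "DATA" or "MODEL"
    rule: str
    message: str

Row = Dict[str, str]
Rule = Callable[[Row], List[Issue]]

ISO_COUNTRIES = {"US", "GB", "JP", "DE"}  # trimmed list, for illustration

def valid_country(row: Row) -> List[Issue]:
    """'Data'-level rule: the field must be a known ISO country code."""
    if row.get("country") not in ISO_COUNTRIES:
        return [Issue("DATA", "valid_country",
                      f"Unknown country code: {row.get('country')!r}")]
    return []

def us_resident_needs_tin(row: Row) -> List[Issue]:
    """'Model'-level rule: if the client is a US resident, TIN must be populated."""
    if row.get("country") == "US" and not row.get("tin"):
        return [Issue("MODEL", "us_resident_needs_tin",
                      "US resident without a TIN")]
    return []

RULES: List[Rule] = [valid_country, us_resident_needs_tin]

def validate(row: Row) -> List[Issue]:
    """Run every registered rule and collect the issues for this row."""
    return [issue for rule in RULES for issue in rule(row)]

print(validate({"client": "A", "country": "US", "tin": ""}))  # one MODEL issue
print(validate({"client": "B", "country": "GB", "tin": ""}))  # clean
```

New rules are just new functions added to the list, which is what makes it cheap to keep the framework current as regulations and source feeds change.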
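Data Tagging amounts to carrying impairment tags upward through every aggregation, so a RED firm-wide figure can be traced down to the single tagged component. Here is a minimal sketch, with made-up PnL numbers and the PRICE_ROLLED tag from the example above; the class and function names are illustrative only.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

@dataclass(frozen=True)
class Tagged:
    value: float
    tags: FrozenSet[str] = frozenset()   # e.g. {"PRICE_ROLLED"}

    @property
    def impaired(self) -> bool:
        return bool(self.tags)           # any tag means: treat with caution

def aggregate(items: List[Tagged]) -> Tagged:
    """Sum the values and carry every underlying tag up to the aggregate."""
    if not items:
        return Tagged(0.0)
    return Tagged(sum(i.value for i in items),
                  frozenset().union(*(i.tags for i in items)))

jpy_bond_pnl   = Tagged(-1_200.0, frozenset({"PRICE_ROLLED"}))
jpy_equity_pnl = Tagged(3_400.0)
usd_bond_pnl   = Tagged(5_000.0)

jpy_pnl  = aggregate([jpy_bond_pnl, jpy_equity_pnl])   # impaired -> RED
usd_pnl  = aggregate([usd_bond_pnl])                   # clean    -> GREEN
firm_pnl = aggregate([jpy_pnl, usd_pnl])               # impaired -> RED

for name, node in [("Firm", firm_pnl), ("JPY", jpy_pnl), ("USD", usd_pnl)]:
    colour = "RED" if node.impaired else "GREEN"
    print(f"{name:4} PnL {node.value:>10,.1f}  {colour}  {sorted(node.tags)}")
```

The same pattern extends naturally to numeric risk or quality scores: replace the tag set with a score and aggregate it alongside the value, triggering an action when it drops below a threshold.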
Conclusion
We believe that a good Data Lineage Fabric is crucial to any organisation that is serious about growth, minimizing costs and freeing up employees to do what they should be doing. We are passionate about data and have spent many years in this space developing our Data Lineage Fabric, called DataView.
Next Up: Data Lineage Deltas