The Need for Data Lineage
I was recently working with a client to provide reporting on their historical membership and donation data, the two main sources of their organization’s revenue. The goal was to uncover the demographics of their new members and donors in order to focus the organization’s efforts and marketing budget in 2023. For anyone working in the data and analytics space, this is a common scenario. I started the project by defining the mission, scoping it, determining goals and deliverables, identifying stakeholders, and identifying data sources and their human data owners/stewards. One concern kept coming up: none of the data owners/stewards in the organization knew where their data came from (originated) or how it had previously been processed (transformed). I immediately knew I had a data lineage problem.
Where Does the Data Come From?
Everything within our modern lives produces data. Sometimes data is generated by activities and sometimes it is generated by events happening within an environment. Think about this for a minute…
The volume of data generated within organizations is extremely large and it comes from everywhere and everything: CRM (sales), POS (transactions), HRMS (employees, benefits), marketing, inventory management, environment, smartphone apps, IoT devices, and more. Each system collects data on different topics with differing levels of granularity and frequency.
Data can be collected through extraction or through streaming. Transaction-based data is typically extracted by querying a database or logs. Using your cardkey to enter a building is an example. Stream-based data is typically retrieved by processes listening for events to occur within a given context. Smart home thermostats that turn heating and cooling on and off by continually sampling the ambient temperature are an example.
Once collected, data is usually transformed based on rules the collector has defined. Common transformations include removing errors and outliers, summing by date, aggregating by a common attribute, and converting into a common unit of measure (currency, Celsius to Fahrenheit). From there it could be stored in a database, a file (comma delimited, Excel), a data lake (a collection of all types of stored data), or a data warehouse (providing better organization and normalization). Procedures within the organization then publish this data for others to consume and use.
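As a minimal sketch of the idea (the field names, steps, and system name are illustrative, not taken from any particular client system), a transformation process can record what it did to each dataset as it runs, so the lineage survives alongside the output:

```python
from datetime import datetime, timezone

def transform(records, source_system):
    """Clean raw donation records and keep a lineage log of each step taken."""
    lineage = {
        "source_system": source_system,
        "extracted_at_utc": datetime.now(timezone.utc).isoformat(),
        "steps": [],
    }

    # Removing errors: drop records that are missing an amount.
    cleaned = [r for r in records if r.get("amount") is not None]
    dropped = len(records) - len(cleaned)
    lineage["steps"].append(f"dropped {dropped} record(s) with no amount")

    # Aggregating by a common attribute: total donations per member.
    totals = {}
    for r in cleaned:
        totals[r["member_id"]] = totals.get(r["member_id"], 0) + r["amount"]
    lineage["steps"].append("summed amount by member_id")

    return totals, lineage

raw = [
    {"member_id": "m1", "amount": 25.0},
    {"member_id": "m1", "amount": 10.0},
    {"member_id": "m2", "amount": None},
]
totals, lineage = transform(raw, source_system="donations_pos")
```

Publishing the `lineage` dictionary next to `totals` means a downstream consumer can later answer "where did this number come from and what was done to it?" without re-tracing the source systems by hand.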
What Use Is This Data?
It is important to understand the overall goal of collecting all this data: it is collected to support the organization’s ongoing operations and decision making.
Visibility
All data is important to someone or something. Providing visibility exposes the data to the organization so that it can be used to make fact-based, data-driven decisions for future work. Surfacing all data, a concept in data democracy, empowers everyone in your organization to use all available data, reducing guesswork. Note that sensitive data is sometimes not directly available, but knowing that it exists and has been collected can still be important to the organization.
Viability
Viability focuses on the usefulness of the data as it applies to a purpose. If the purpose is to determine how to get the most bang-for-the-buck out of the marketing budget, then understanding the response rate for previously used marketing channels is important.
Note: Big Data also uses a term called the “5 V’s.” The 5 V’s are Velocity, Volume, Value, Variety, and Veracity. Don’t you think visibility and viability should also be added to the “5 V’s?”
The Problem I Faced
The good
My client had systems collecting data. They had processes which were extracting data. They had repositories of data in an Azure Data Lake. The data lake was made available to the organization through least-privilege access. There were some data governance procedures in place, although the procedures were limited.
The bad
None of the data owners/stewards could tell me anything about the lifecycle of the data. As a result, we had a great deal of uncertainty about the data’s accuracy, quality, and viability. I could not answer the following:
What Is Data Lineage?
Data lineage is the concept that every aspect of the data’s lifecycle is clearly documented and followed:
Why do we need Data Lineage?
Data lineage provides the clear visibility to track where your data comes from in the case of:
How does this fit with data governance?
Data governance is the set of standards, policies, and practices an organization uses to manage its data, including how that data is extracted, transformed, and loaded (ETL). Data governance is internal to an organization and has nothing to do with government or industry policy. A data governance policy should include rules for data lineage.
What You Should Do
My client did not have a sophisticated data governance policy, and the policy they did have included no data lineage practices. I was able to recreate data by manually tracing back to the source transaction systems and re-extracting it, but the source and justification for edits and other changes had been lost.
Here are some conversation topics and ideas to start discussing in your organization to ensure you do not run into a data lineage problem like I faced:
1. Ensure the organization has updated data governance policy and procedures
2. Ensure data lineage is addressed in the policy
3. Ensure you are capturing and providing data with a UTC timestamp
4. Ensure you are capturing the originating currency and conversion spot rates from the source system(s). Capturing the source of the spot rate is important too.
5. Put your physical data under source control, such as a Git repository
6. Lock-down edit access to data with a least privilege model
7. Investigate dedicated data lineage tools (there are many)
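To make items 3 and 4 above concrete, here is a minimal sketch of a lineage record that could accompany each extracted dataset. The field names and the example system and rate source are hypothetical; adapt them to whatever your governance policy defines:

```python
from datetime import datetime, timezone

def lineage_record(source_system, currency, spot_rate, rate_source):
    """Build a minimal lineage record for one extracted dataset.

    Field names are illustrative, not a standard schema.
    """
    return {
        "source_system": source_system,  # where the data originated
        # Item 3: a UTC timestamp for the extraction.
        "extracted_at_utc": datetime.now(timezone.utc).isoformat(),
        # Item 4: originating currency, conversion spot rate,
        # and the source of that spot rate.
        "currency": currency,
        "spot_rate_to_usd": spot_rate,
        "spot_rate_source": rate_source,
    }

record = lineage_record("membership_crm", "CAD", 0.74, "bank_daily_feed")
```

A record like this, stored with the data (and, per item 5, versioned in source control), lets you reconstruct exactly when a figure was extracted and how any currency conversion was performed.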
Good data is happy data. Do not be afraid to raise the issue now and start working on a progressive implementation plan.