The Need for Data Lineage
I was recently working with a client to provide reporting on their historical membership and donation data, the two main sources of their organization’s revenue. The goal was to uncover the demographics of their new members and donors in order to focus the organization’s efforts and marketing budget in 2023. For anyone working in the data and analytics space, this is a common scenario. I started the project by defining the mission, scoping it, determining goals and deliverables, identifying stakeholders, and identifying data sources and their human data owners/stewards. One concern kept coming up: none of the data owners/stewards in the organization knew where their data came from (originated) or how it had previously been processed (transformed). I immediately knew I had a data lineage problem.
Where Does the Data Come From?
Everything within our modern lives produces data. Sometimes data is generated by activities and sometimes it is generated by events happening within an environment. Think about this for a minute…
The volume of data generated within organizations is extremely large and it comes from everywhere and everything: CRM (sales), POS (transactions), HRMS (employees, benefits), marketing, inventory management, environment, smartphone apps, IoT devices, and more. Each system collects data on different topics with differing levels of granularity and frequency.
Data can be collected through extraction or through streaming. Transaction-based data is typically extracted by querying a database or logs. Using your cardkey to enter a building is an example. Stream-based data is typically retrieved by processes listening for events to occur within a given context. Smart home thermostats that turn heating and cooling on and off by continually sampling the ambient temperature are an example.
Once collected, data is usually transformed based on rules the collector has defined. Common transformations include removing errors and outliers, summing by date, aggregating by a common attribute, and converting into a common unit of measure (currency, Celsius to Fahrenheit). From there it could be stored in a database, a file (comma delimited, Excel), a data lake (a collection of all types of stored data), or a data warehouse (providing better organization and normalization). Procedures within the organization then publish this data for others to consume and use.
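As a minimal sketch of the idea (the field names, steps, and system name are illustrative, not taken from any particular client system), a transformation process can record what it did to each dataset as it runs, so the lineage survives alongside the output:

```python
from datetime import datetime, timezone

def transform(records, source_system):
    """Clean raw donation records and keep a lineage log of each step taken."""
    lineage = {
        "source_system": source_system,
        "extracted_at_utc": datetime.now(timezone.utc).isoformat(),
        "steps": [],
    }

    # Removing errors: drop records that are missing an amount.
    cleaned = [r for r in records if r.get("amount") is not None]
    dropped = len(records) - len(cleaned)
    lineage["steps"].append(f"dropped {dropped} record(s) with no amount")

    # Aggregating by a common attribute: total donations per member.
    totals = {}
    for r in cleaned:
        totals[r["member_id"]] = totals.get(r["member_id"], 0) + r["amount"]
    lineage["steps"].append("summed amount by member_id")

    return totals, lineage

raw = [
    {"member_id": "m1", "amount": 25.0},
    {"member_id": "m1", "amount": 10.0},
    {"member_id": "m2", "amount": None},
]
totals, lineage = transform(raw, source_system="donations_pos")
```

Publishing the `lineage` dictionary next to `totals` means a downstream consumer can later answer "where did this number come from and what was done to it?" without re-tracing the source systems by hand.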
What Use Is This Data?
It is important to understand the overall goal of collecting all this data: it is collected to support the organization’s ongoing operations and decision making.
Visibility
All data is important to someone or something. Providing visibility exposes the data to the organization so that it can be used to make fact-based, data-driven decisions for future work. Surfacing all data, a concept in data democracy, empowers everyone in your organization to use all available data, reducing guesswork. Note that sensitive data is sometimes not directly available, but knowing that it exists and has been collected can still be important to the organization.
Viability
Viability focuses on the usefulness of the data as it applies to a purpose. If the purpose is to determine how to get the most bang-for-the-buck out of the marketing budget, then understanding the response rate for previously used marketing channels is important.
Note: Big Data also uses a term called the “5 V’s.” The 5 V’s are Velocity, Volume, Value, Variety, and Veracity. Don’t you think visibility and viability should also be added to the “5 V’s?”
The Problem I Faced
The good
My client had systems collecting data. They had processes which were extracting data. They had repositories of data in an Azure Data Lake. The data lake was made available to the organization through least-privilege access. There were some data governance procedures in place, although the procedures were limited.
The bad
None of the data owners/stewards could tell me anything about the lifecycle of the data. As a result, we had a great deal of uncertainty about the data’s accuracy, quality, and viability. I could not answer the following:
What Is Data Lineage?
Data lineage is the concept that every aspect of the data’s lifecycle is clearly documented and followed:
Why do we need Data Lineage?
Data lineage provides the clear visibility to track where your data comes from in the case of:
How does this fit with data governance?
Data governance is the set of standards, policies, and practices an organization uses to manage its data, including how that data is extracted, transformed, and loaded (ETL). Data governance is internal to an organization and has nothing to do with government or industry policy. A data governance policy should include rules for data lineage.
What You Should Do
My client did not have a sophisticated data governance policy, and the policy they did have included no data lineage practices. I was able to recreate data by manually tracing back to the source transaction systems and re-extracting it, but the source and justification for edits and other changes had been lost.
Here are some conversation topics and ideas to start discussing in your organization to ensure you do not run into a data lineage problem like I faced:
1. Ensure the organization has updated data governance policy and procedures
2. Ensure data lineage is addressed in the policy
3. Ensure you are capturing and providing data with a UTC timestamp
4. Ensure you are capturing the originating currency and conversion spot rates from the source system(s). Capturing the source of the spot rate is important too.
5. Put your physical data under source control, such as a Git repository
6. Lock-down edit access to data with a least privilege model
7. Investigate dedicated data lineage tools (there are many)
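To make items 3 and 4 above concrete, here is a minimal sketch of a lineage record that could accompany each extracted dataset. The field names and the example system and rate source are hypothetical; adapt them to whatever your governance policy defines:

```python
from datetime import datetime, timezone

def lineage_record(source_system, currency, spot_rate, rate_source):
    """Build a minimal lineage record for one extracted dataset.

    Field names are illustrative, not a standard schema.
    """
    return {
        "source_system": source_system,  # where the data originated
        # Item 3: a UTC timestamp for the extraction.
        "extracted_at_utc": datetime.now(timezone.utc).isoformat(),
        # Item 4: originating currency, conversion spot rate,
        # and the source of that spot rate.
        "currency": currency,
        "spot_rate_to_usd": spot_rate,
        "spot_rate_source": rate_source,
    }

record = lineage_record("membership_crm", "CAD", 0.74, "bank_daily_feed")
```

A record like this, stored with the data (and, per item 5, versioned in source control), lets you reconstruct exactly when a figure was extracted and how any currency conversion was performed.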
Good data is happy data. Do not be afraid to raise the issue now and start working on a progressive implementation plan.