Enterprise Data Model in Redshift, actionable lineage information (v1)

Bruno Freitag

Published Jun 29, 2021

We have fully automated end-to-end lineage at column level, from 90+ different (transactional) source systems, through the integrated minimarts in our modular enterprise data warehouse, to the consumption of the data in self-service, machine learning and traditional BI reports.

Providing integrated data for data insights is a perpetual challenge. Data scientists often spend 80% of their time hunting for data and only 20% on actual data insights. We changed that. We’ve added automated lineage for trust and transparency, irrespective of source system and technology.

Trust is key to the adoption of consolidated data marts.

Our Enterprise Data Model (EDM) is a modular set of data marts in AWS Redshift. These data marts provide fully integrated data across the system, technology, process and product life cycle scope of Syngenta group. They assimilate data from more than 90 source system platforms and 190 business systems, ranging from R&D to various SAP systems, salesforce.com, finance systems and external data providers. Data scientists and visualization engineers use and combine the data without intrinsic knowledge of any source system. Quotes from data scientists: “Most of the heavy data work is already done in the integrated data mart, my model and dashboard run really fast” and “I can easily get the new feature from the weather within a few mins”.

People naturally think in “source systems”; they just don’t want the hassle of dealing with them. Trust, meaning knowing precisely where data originates, is key to the adoption of any consolidated data mart. That’s where lineage comes into play. We added lineage information to our meta board, complementing Alation as our data catalogue and giving data scientists metadata about the marts.

We realized the tremendous value of [automated] lineage only once we had experienced it

The challenge of lineage is twofold. The first part is getting end-to-end lineage information in complex multi-technology environments in the first place; the second is condensing it into actionable insights.

The picture above shows the incoming and outgoing flow, using material master data as the example. This minimart consolidates data from eleven source systems.

This results in complex graphs for even simple processes. The challenge is condensing this detailed lineage information into actionable insights that are understandable and helpful to the data scientist and data engineer. We use a combination of Gudusoft, their query language and simplified visualizations to translate the complex lineage data into meaningful insights.

The graph above results from Gudusoft’s analysis and visualization of our transformation processes. This detailed information is thereafter condensed using Gudusoft’s API and additional visualizations at various levels of aggregation.
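To make the idea of "various levels of aggregation" concrete, here is a minimal sketch in Python. It is not Gudusoft's API or our production code: the edge list, the `system.table.column` naming and the `condense` helper are assumptions invented for illustration. It collapses column-level lineage edges into table-level or system-level edges and counts how many detailed edges each condensed edge summarizes.

```python
from collections import defaultdict

# Hypothetical column-level lineage edges (source, target), each written as
# "system.table.column". All names are invented for illustration only.
COLUMN_EDGES = [
    ("sap_a.mara.matnr", "staging.stg_material.material_id"),
    ("sap_a.mara.mtart", "staging.stg_material.material_type"),
    ("sap_b.mara.matnr", "staging.stg_material.material_id"),
    ("staging.stg_material.material_id", "edm.material.material_id"),
    ("staging.stg_material.material_type", "edm.material.material_type"),
]


def condense(edges, level):
    """Collapse column-level edges to a coarser level of detail.

    level=1 keeps only the system, level=2 keeps system.table.
    Returns a dict mapping (source, target) to the number of
    column-level edges that the condensed edge summarizes.
    Edges that start and end in the same condensed node are dropped.
    """
    counts = defaultdict(int)
    for src, dst in edges:
        coarse_src = ".".join(src.split(".")[:level])
        coarse_dst = ".".join(dst.split(".")[:level])
        if coarse_src != coarse_dst:
            counts[(coarse_src, coarse_dst)] += 1
    return dict(counts)


table_level = condense(COLUMN_EDGES, level=2)
system_level = condense(COLUMN_EDGES, level=1)
```

The same detailed data can then feed a coarse system-to-system picture and a finer table-to-table picture without a second analysis run.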

Lineage is only useful if condensed to actionable insights

What looks like a simple flow is in reality a rather complex multistep process (see the initial picture). This information is condensed to the essentials.

Our successful lineage process was helped by three lucky, practical coincidences: using the power of Redshift for transforming data from various sources into the conforming marts, strict naming conventions for the ingestion folders in AWS S3, and the ability of Gudusoft to create qualified lineage graphs and lineage data in a simple process.
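Why a strict folder naming convention helps can be sketched in a few lines of Python. The layout `ingest/<source_system>/<dataset>/<load_date>/<file>` and every name below are assumptions made up for this example, not our actual convention. The point is only that a predictable path lets the source system be read off the object key without any lookup table.

```python
from typing import NamedTuple, Optional


class IngestionOrigin(NamedTuple):
    source_system: str
    dataset: str
    load_date: str


def parse_ingestion_key(key: str) -> Optional[IngestionOrigin]:
    """Derive the origin of an ingested file from its S3 object key.

    Assumes the hypothetical layout
    ingest/<source_system>/<dataset>/<load_date>/<file>.
    Returns None when the key does not follow that layout.
    """
    parts = key.split("/")
    if len(parts) < 5 or parts[0] != "ingest":
        return None
    if not all(parts[1:4]):  # reject empty path segments
        return None
    return IngestionOrigin(parts[1], parts[2], parts[3])


origin = parse_ingestion_key("ingest/sap_a/mara/2021-06-01/part-0000.csv")
unknown = parse_ingestion_key("tmp/scratch/file.csv")
```

Keys that do not match the convention return `None`, so exceptions surface immediately instead of producing a wrong origin.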

In addition to incoming lineage, there is also transparency on data usage. Whether a minimart is used directly or through a view, the outgoing lineage runs from the minimart to the view that ultimately uses the data, and on to the query statement, the user and the interaction frequency. This is presented in a simplified way to cope with the data volume and, again, to translate it into actionable insights and impact analysis.
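As a rough illustration of how usage can be condensed, the Python sketch below attributes each logged query to the underlying minimart, whether it hit the mart directly or through a view, and counts interactions per mart, endpoint and user. The log records, the view mapping and the `usage_by_mart` helper are all hypothetical; in practice the input would come from the warehouse's query history.

```python
from collections import Counter

# Hypothetical query log: which user touched which object with which statement.
QUERY_LOG = [
    {"user": "alice", "object": "edm.material", "query": "select * from edm.material"},
    {"user": "alice", "object": "reporting.v_material", "query": "select * from reporting.v_material"},
    {"user": "bob", "object": "reporting.v_material", "query": "select count(*) from reporting.v_material"},
    {"user": "bob", "object": "reporting.v_material", "query": "select count(*) from reporting.v_material"},
]

# Hypothetical mapping from each view to the minimart it reads from.
VIEW_TO_MART = {"reporting.v_material": "edm.material"}


def usage_by_mart(log, view_to_mart):
    """Count interactions per (minimart, endpoint, user).

    The endpoint is the object named in the log; if it is a known view,
    the usage is attributed to the minimart behind that view.
    """
    counts = Counter()
    for record in log:
        endpoint = record["object"]
        mart = view_to_mart.get(endpoint, endpoint)
        counts[(mart, endpoint, record["user"])] += 1
    return counts


usage = usage_by_mart(QUERY_LOG, VIEW_TO_MART)
users_of_material = {user for (mart, _, user) in usage if mart == "edm.material"}
```

For an impact analysis, `users_of_material` answers who would be affected by a change to the mart, regardless of whether they query it directly or through a view.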

Outgoing lineage is down to endpoint, user and query statement

Other implementations of lineage information may exist that are not tied to specific technologies, are fully automated and are condensed to actionable insights; I just have not seen them yet. The lineage information as implemented here is independent of the source technology (90+ source systems) and fully automated end to end. Its transparency and reliability answer many questions, not only those from data scientists about the origin of data, but also the reverse ones in any combination.

A quote from a data scientist about the data mart and its lineage information:

“I can easily change the pipeline on the back and scale the data for other countries and regions”.

The [almost] finest level of lineage information provides insights for individual columns, with or without the clutter of all the intermediate steps. It is, obviously, for the detail-oriented, yet vital for acceptance.
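Showing a column "without the clutter of all the intermediate steps" amounts to walking the lineage graph upstream and keeping only the columns that have no further predecessors. The Python sketch below shows one way to do that; the edges and column names are invented for illustration and the helper is an assumption, not part of Gudusoft's tooling.

```python
from collections import defaultdict

# Hypothetical column-level edges (source column, target column).
EDGES = [
    ("sap_a.mara.matnr", "staging.stg_material.material_id"),
    ("sap_b.mara.matnr", "staging.stg_material.material_id"),
    ("staging.stg_material.material_id", "edm.material.material_id"),
    ("sap_a.mara.mtart", "staging.stg_material.material_type"),
    ("staging.stg_material.material_type", "edm.material.material_type"),
]


def ultimate_sources(column, edges):
    """Return the origin columns of `column`, skipping intermediate steps.

    Walks the lineage upstream and collects the columns that have no
    predecessor of their own. Cycles are handled via the `seen` set.
    """
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)

    seen, stack, roots = set(), [column], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        upstream = parents.get(node)
        if upstream:
            stack.extend(upstream)
        elif node != column:
            roots.add(node)
    return roots


origins = ultimate_sources("edm.material.material_id", EDGES)
```

Keeping the `seen` set around instead of discarding it would give the detailed view with every intermediate step included.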

Altogether, the lineage information, based on Gudusoft’s automated analysis and compressed for simplicity, provides a coherent, understandable and actionable lineage framework, from coarse to fine and back, across the many source technologies we integrated into our data marts and out to the data’s usage, whether in graphical or tabular form.

Ang Soon Huat (Ash) 4y

awesome, thanks for sharing Bruno!

Gerald Wluka 4y

Great to learn from you how Syngenta derives value from CompilerWorks Lineage.

Swamycharan Avunooru 4y

Nice to see this ready Bruno. It has been your dream to simplify on how to expose the data that is trusted by the customers of it..

Ronald Baan 4y

Nice work and very useful!
