Enterprise Data Model in Redshift, actionable lineage information (v1)
superseded by the revised version:
We have fully automated end-to-end lineage at column level, from 90+ different (transactional) source systems, through the integrated minimarts in our modular enterprise data warehouse, to the consumption of the data in self-service analytics, machine learning and traditional BI reports.
Providing integrated data for data insights is a perpetual challenge. Data scientists often spend 80% of their time hunting for data and only 20% on actual data insights. We changed that. We've added automated lineage for trust and transparency, irrespective of source system and technology.
Trust is key to the adoption of consolidated data marts.
Our Enterprise Data Model (EDM) is a modular set of data marts in AWS Redshift. These data marts provide fully integrated data across the system, technology, process and product life cycle scope of Syngenta Group. They assimilate data from more than 90 source system platforms and 190 business systems, ranging from R&D to various SAP systems, salesforce.com, finance systems and external data providers. Data scientists and visualization engineers use and combine the data without intrinsic knowledge of any source system. Quotes from data scientists: “Most of the heavy data work is already done in the integrated data mart, my model and dashboard run really fast” and “I can easily get the new feature from the weather within a few mins”.
People naturally think in “source systems”; they just don’t want the hassle of them. Trust, meaning knowing precisely where data originates, is key to the adoption of any consolidated data mart. That’s where lineage comes into play. We added lineage information to our meta board, complementing Alation as our data catalogue and giving data scientists metadata about the marts.
We realized the tremendous value of [automated] lineage only once we had experienced it
The challenge about lineage is twofold. The first is getting end-to-end lineage information in complex multi-technology environments in the first place, the second is condensing it to actionable insights.
The picture above shows the incoming and outgoing flow, using material master data as an example. This minimart consolidates data from eleven source systems.
This results in complex graphs for even simple processes. The challenge is condensing that detailed lineage information into actionable insights that are understandable and helpful to the data scientist and data engineer. We use a combination of Gudusoft, their query language and simplified visualizations to translate the complex lineage data into meaningful insights.
The graph above results from Gudusoft’s analysis and visualization of our transformation processes. This detailed information is thereafter condensed using Gudusoft’s API and additional visualizations at various levels of aggregation.
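One of the aggregation levels mentioned above can be illustrated by rolling column-level edges up to table-level edges. The following is only a minimal sketch under stated assumptions: the edge list, the table and column names, and the `roll_up` helper are hypothetical, and the code does not use or represent Gudusoft's API.

```python
from collections import Counter

# Hypothetical column-level lineage edges (source column -> target column).
# All names are illustrative and do not come from the actual data model.
COLUMN_EDGES = [
    ("sap_a.mara.matnr", "staging.material.material_id"),
    ("sap_a.mara.mtart", "staging.material.material_type"),
    ("sap_b.mara.matnr", "staging.material.material_id"),
    ("staging.material.material_id", "edm.material_master.material_id"),
    ("staging.material.material_type", "edm.material_master.material_type"),
]


def table_of(column: str) -> str:
    """Drop the trailing column name: 'schema.table.column' -> 'schema.table'."""
    return column.rsplit(".", 1)[0]


def roll_up(edges):
    """Aggregate column-level edges into table-level edges.

    The value per table pair is the number of column-level edges it summarizes.
    """
    counts = Counter()
    for src, dst in edges:
        src_table, dst_table = table_of(src), table_of(dst)
        if src_table != dst_table:  # ignore edges within the same table
            counts[(src_table, dst_table)] += 1
    return dict(counts)


TABLE_EDGES = roll_up(COLUMN_EDGES)
for (src_table, dst_table), n in sorted(TABLE_EDGES.items()):
    print(f"{src_table} -> {dst_table} ({n} column mappings)")
```

The same idea could be applied one level higher, for example by grouping on the schema prefix to get a source-system view.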
Lineage is only useful if condensed to actionable insights
What looks like a simple flow is in reality a rather complex multistep process, see the initial picture. This information is condensed to the essentials.
Our successful lineage process was helped by three lucky, practical coincidences: using the power of Redshift to transform data from various sources into the conforming marts, strict naming conventions for the ingestion folders in AWS S3, and the ability of Gudusoft to create qualified lineage graphs and lineage data in a simple process.
In addition to incoming lineage, there is also transparency on data usage. Whether a minimart is used directly or through a view, the outgoing lineage runs from the minimart through the view that ultimately uses the data, down to the query statement, the user and the interaction frequency. This is presented in a simplified way to cope with the data volume and, again, to translate it into actionable insights and impact analysis.
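To illustrate why strict folder naming helps, here is a minimal sketch of deriving the originating source system from an S3 object key. The article does not state the actual convention, so the layout `ingest/<source_system>/<entity>/<YYYY-MM-DD>/<file>`, the pattern and the `parse_key` helper are assumptions made purely for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumed (hypothetical) key layout:
#   ingest/<source_system>/<entity>/<YYYY-MM-DD>/<file>
KEY_PATTERN = re.compile(
    r"^ingest/(?P<source>[a-z0-9_]+)/(?P<entity>[a-z0-9_]+)/"
    r"(?P<load_date>\d{4}-\d{2}-\d{2})/[^/]+$"
)


class IngestionOrigin(NamedTuple):
    source_system: str
    entity: str
    load_date: str


def parse_key(key: str) -> Optional[IngestionOrigin]:
    """Return the origin encoded in an S3 key, or None if the key
    does not follow the assumed naming convention."""
    match = KEY_PATTERN.match(key)
    if match is None:
        return None
    return IngestionOrigin(match["source"], match["entity"], match["load_date"])


print(parse_key("ingest/sap_emea/material/2021-03-01/part-0001.parquet"))
print(parse_key("tmp/adhoc_upload.csv"))
```

With a convention of this kind, every file loaded into Redshift carries its source system in its path, so the first hop of the lineage needs no manual mapping.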
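The condensing of usage data can be sketched as a simple aggregation of query log records per minimart and user. This is a hedged illustration only: the record layout, the `VIEW_TO_MART` mapping and all names are hypothetical, and a real implementation would read from the warehouse's query history rather than an in-memory list.

```python
from collections import Counter

# Hypothetical, already-parsed query log records. "object" is the table
# or view a query read from; all names are illustrative.
QUERY_LOG = [
    {"query_id": 1, "user": "alice", "object": "edm.v_material"},
    {"query_id": 2, "user": "alice", "object": "edm.v_material"},
    {"query_id": 3, "user": "bob", "object": "edm.material_master"},
    {"query_id": 4, "user": "bob", "object": "edm.v_material"},
]

# Hypothetical mapping from views to the minimart they are built on.
VIEW_TO_MART = {"edm.v_material": "edm.material_master"}


def usage_by_mart(log, view_to_mart):
    """Count queries per (minimart, user), resolving views to their minimart."""
    counts = Counter()
    for record in log:
        obj = record["object"]
        mart = view_to_mart.get(obj, obj)  # direct access keeps the object name
        counts[(mart, record["user"])] += 1
    return counts


USAGE = usage_by_mart(QUERY_LOG, VIEW_TO_MART)
for (mart, user), n in sorted(USAGE.items()):
    print(f"{mart} used by {user}: {n} queries")
```

Keeping only counts per minimart and user, while retaining the query identifiers elsewhere for drill-down, is one way to cope with the data volume.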
Outgoing lineage is down to endpoint, user and query statement
Other lineage implementations may exist that are independent of specific technologies, fully automated and condensed to actionable insights; I just have not seen them yet. The lineage information as implemented here is independent of the source technology (90+ source systems) and fully automated end-to-end. Its transparency and reliability answer many questions: not only those from data scientists about the origin of data, but also the reverse ones about where data is used, in any combination.
A quote from a data scientist about the data mart and its lineage information:
“I can easily change the pipeline on the back and scale the data for other countries and regions”.
The [almost] finest level of lineage information provides insights for individual columns, with or without the clutter of all the intermediate steps. It is, obviously, for the detail-oriented, yet vital for acceptance.
Altogether, the lineage information, based on Gudusoft’s automated analysis and compressed for simplicity, provides a coherent, understandable and actionable lineage framework, from coarse to fine and back, across the many source technologies we integrated into our data marts and out to the data’s usage, whether in graphical or tabular form.
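Viewing a column "without the clutter" amounts to skipping intermediate nodes and reporting only the original source columns. Below is a minimal sketch of that idea as a graph traversal; the edge list and the column names are hypothetical, and this is not how Gudusoft computes or exposes lineage.

```python
from collections import defaultdict

# Hypothetical column-level edges (source column -> target column),
# including one intermediate staging step. Names are illustrative only.
EDGES = [
    ("sap_a.mara.matnr", "staging.material.material_id"),
    ("sap_b.mara.matnr", "staging.material.material_id"),
    ("staging.material.material_id", "edm.material_master.material_id"),
]


def build_upstream(edges):
    """Map each column to the set of columns that feed it directly."""
    upstream = defaultdict(set)
    for src, dst in edges:
        upstream[dst].add(src)
    return upstream


def origins(column, upstream):
    """Return the sorted source columns of `column`, skipping intermediate
    steps. A source is any upstream column that has no upstream of its own."""
    seen, stack, result = set(), [column], set()
    while stack:
        node = stack.pop()
        if node in seen:  # guard against cycles and repeated visits
            continue
        seen.add(node)
        parents = upstream.get(node, set())
        if not parents and node != column:
            result.add(node)
        stack.extend(parents)
    return sorted(result)


UPSTREAM = build_upstream(EDGES)
print(origins("edm.material_master.material_id", UPSTREAM))
```

Reversing the edge direction in `build_upstream` would give the opposite view, from a source column to everything that depends on it.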