Zach Wilson’s Post


Senior data engineers often own master data sets. These are highly trusted data sets used by many people in the company.

Master data is handled and changed differently than regular data or raw data.

Here’s how master data is different in #dataengineering:

Changes to master data impact many more decisions. These changes need to be communicated effectively. If they aren’t, people will find discrepancies and lose trust in your data. This causes slower decisions and impacts revenue and/or costs.

Master data needs the highest quality possible. What does this look like?

- the pipeline has comprehensive data quality checks that stop data flow when they’re violated
- the pipeline uses write-audit-publish pattern so bad data leaking into production is minimized
- the pipeline has dedicated oncall rotations to troubleshoot problems when they arise
- the pipeline is unit tested and integration tested to prevent problems before they even enter production
- the data is efficiently modeled and answers 80% of the questions people have
- the data tables are documented so people understand definitions and KPIs and understand the gaps of the data
- the data has a guarantee on how stale it can be (this is called an SLA)
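
To make the first two bullets and the SLA bullet concrete, here is a minimal sketch of write-audit-publish in plain Python. Everything in it is an assumption for illustration: the `customer_id` and `updated_at` fields, the in-memory `tables` dict standing in for real staging and production tables, and the three checks themselves. A real pipeline would run the same steps against a warehouse with whatever data quality tooling the team already uses.

```python
from datetime import datetime, timedelta, timezone


class AuditError(Exception):
    """Raised when staged data fails a quality check, which stops the flow."""


def audit(rows, max_staleness, now):
    """Return a list of check failures; an empty list means the data passed."""
    if not rows:
        return ["staging table is empty"]
    failures = []
    ids = [r["customer_id"] for r in rows]  # hypothetical key column
    if any(i is None for i in ids):
        failures.append("null customer_id")
    if len(set(ids)) != len(ids):
        failures.append("duplicate customer_id")
    newest = max(r["updated_at"] for r in rows)  # hypothetical timestamp column
    if now - newest > max_staleness:
        failures.append("data is staler than the SLA allows")
    return failures


def write_audit_publish(new_rows, tables, max_staleness, now=None):
    """Write to staging, audit it, and publish only if every check passes."""
    now = now or datetime.now(timezone.utc)
    tables["staging"] = list(new_rows)                        # write
    failures = audit(tables["staging"], max_staleness, now)   # audit
    if failures:
        # Production is left untouched, so consumers never see the bad batch.
        raise AuditError("; ".join(failures))
    tables["production"] = list(tables["staging"])            # publish
    return len(tables["production"])
```

The point of the design is that production is only replaced after the audit passes, so a failed batch pages oncall instead of reaching the people who trust the data.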

In the "old days" we used our AS400 change control system to manage and deploy centrally managed sets of master data like that. Treating it the same as a program (or even better) seems entirely appropriate.


Both data freshness and data availability are important, Zach Wilson 👍

Zach Wilson Thanks for sharing, any recommendations for books/online resources for MDM implementation?

thank you for sharing ideas with us; we love it and appreciate a lot :)


Your posts are always so informative and detailed, thank you !

Love this post, especially the comment on how much bad-quality master data hurts trust. What I find difficult, though, is keeping the documentation and stakeholders fully aligned when the dataset (or the code generating it) is changed due to internal requests. I know this should be rare for stable, long-running datasets, but it still happens, and it usually triggers many escalations and alignment meetings.
