From the course: Data Quality: Measure, Improve, and Enforce Reliable Systems
Designing for data quality - Python Tutorial
- [Instructor] If you have ever tried to fix data quality issues after a pipeline is in production, you know it's like fixing a leaky roof in a thunderstorm. It's messy, it's reactive, and it often comes too late. We have talked about how to catch data issues. Now let's shift to how to prevent them, by designing pipelines that support quality from the very beginning. Designing for data quality means treating data quality as a design goal, not just a set of validation steps. So let's walk through practical design principles that help prevent problems before they start.

First, build in checkpoints, not just checks. Design your pipeline to include clear stages where data can be validated and routed: right after ingestion, before major transforms, before merges, and before the final loads. Think of this as putting data quality gates into your design, not patching problems later.

Next, define and enforce data contracts. A data contract is a shared agreement between data producers and…
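The checkpoint idea can be sketched in plain Python. This is a minimal, hypothetical illustration (the `gate` function, stage names, and check names are my own, not from the course): each gate runs a list of checks against a batch of records and routes failures aside instead of letting bad rows flow downstream.

```python
# Hypothetical sketch of a quality gate placed at a pipeline stage.
# Records that fail any check are quarantined rather than passed on.

def gate(records, checks, stage):
    """Split records into (passed, failed) based on named check functions."""
    passed, failed = [], []
    for record in records:
        errors = [name for name, check in checks if not check(record)]
        if errors:
            failed.append({"stage": stage, "record": record, "errors": errors})
        else:
            passed.append(record)
    return passed, failed

# Example checks you might apply right after ingestion (illustrative only).
ingestion_checks = [
    ("has_id", lambda r: r.get("id") is not None),
    ("amount_non_negative",
     lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0),
]

raw = [
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": 5.0},
    {"id": 3, "amount": -2.0},
]

clean, quarantined = gate(raw, ingestion_checks, stage="post_ingestion")
print(len(clean), len(quarantined))  # 1 record passes, 2 are quarantined
```

The same `gate` call would be repeated before transforms, merges, and the final load, so bad data is stopped and routed at the stage where it first appears rather than discovered downstream.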