From the course: Data Quality Testing with Great Expectations

From manual to automated testing

If you work with data, you might be familiar with this situation. It's Monday morning and a colleague on your company's sales team sends you a message: Hey, the numbers on the sales dashboard look a little off. Did we really have such a big spike last Friday? You start digging through tables in your data warehouse, run ad hoc queries, and try to trace recent changes to figure out what's wrong.

For a long time, data quality checks have relied on this kind of reactive process: one-off SQL queries or visual spot checks to see where things broke. Sure, this can catch obvious errors, but it's slow, inconsistent, and depends heavily on your knowledge of the data. What's worse, those kinds of checks usually only happen once the problem has already occurred. It's clear that this ad hoc manual approach doesn't scale.

Modern data teams need data quality testing that works more like software testing, meaning it's automated, repeatable, and built directly into the data pipeline. The ultimate goal is to catch problems before they make it into our production dashboards and analytics, ideally as far upstream as possible.

Automated testing doesn't just protect data quality, it can also save us time and money. Imagine you receive daily data extracts from a third party and kick off a long-running data pipeline. Hours later, you discover that the source data had errors that have now caused incorrect data downstream. With automated data testing in place at any stage of the pipeline, you can validate the source data early, trigger an alert, and stop the pipeline before wasting time and compute on bad input.
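To make the validate-early pattern concrete, here is a minimal, library-free sketch of the idea: a gate that checks an incoming extract before any expensive processing runs. The field names (`order_id`, `amount`) and the value range are illustrative assumptions, not from the course; Great Expectations, covered later, provides this kind of check as reusable, declarative expectations.

```python
# Sketch: validate source data before running an expensive pipeline step.
# Field names and thresholds below are illustrative assumptions.

def validate_extract(rows):
    """Return a list of error messages; an empty list means the data passed."""
    errors = []
    for i, row in enumerate(rows):
        if row.get("order_id") is None:
            errors.append(f"row {i}: missing order_id")
        amount = row.get("amount")
        if amount is None or not (0 <= amount <= 10_000):
            errors.append(f"row {i}: amount {amount!r} out of expected range")
    return errors

def run_pipeline(rows):
    errors = validate_extract(rows)
    if errors:
        # In production this is where you would trigger an alert
        # (email, Slack, pager) and halt before spending compute downstream.
        raise ValueError("source data failed validation: " + "; ".join(errors))
    # ... the long-running transformation would run here ...
    return sum(r["amount"] for r in rows)

good = [{"order_id": 1, "amount": 100.0}, {"order_id": 2, "amount": 250.5}]
bad = [{"order_id": None, "amount": -5.0}]
```

Running `run_pipeline(good)` proceeds normally, while `run_pipeline(bad)` fails fast at the validation gate, which is exactly the early stop the passage describes.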
