The surprisingly difficult challenge of evaluating if data is garbage or gold
Data quality feels like a topic that should be easy. Good and bad, right? It’s that simple. A binary choice that connects to the core of human decision-making and neatly ties to the ones and zeros we all love. The data in MY system is good and the data in SOMEONE ELSE’S system is bad.
But if pressed, can you really describe data quality? The Six Primary Dimensions for Data Quality Assessment by the Data Management Association (DAMA) does an excellent job of categorizing the attributes around data quality. Here’s how they break down the problem with very brief definitions:
- Complete: Absence of blank (NULL) values.
- Unique: Things are measured only once.
- Timely: Represents reality from the required point in time.
- Valid: Conforms to syntax and data types.
- Accurate: Correctly describes the real world.
- Consistent: Absence of difference on comparison.
These are excellent. They bring clarity and organization. Some systems even surface related measures as summary statistics, like the pg_stats view in PostgreSQL.
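To make a couple of these dimensions concrete, here is a minimal Python sketch (not PostgreSQL itself) that computes two statistics analogous to the `null_frac` and `n_distinct` columns of pg_stats: the fraction of missing values (completeness) and the count of distinct values (uniqueness). The sample email data is hypothetical.

```python
def column_stats(values):
    """Return (null_frac, n_distinct) for a column of values.

    None plays the role of SQL NULL; null_frac approximates completeness,
    n_distinct hints at uniqueness problems when it is lower than expected.
    """
    total = len(values)
    nulls = sum(1 for v in values if v is None)
    null_frac = nulls / total if total else 0.0
    n_distinct = len({v for v in values if v is not None})
    return null_frac, n_distinct

# Hypothetical sample column with a blank and a duplicate
emails = ["a@example.com", None, "b@example.com", "a@example.com", None]
null_frac, n_distinct = column_stats(emails)
print(null_frac)   # 0.4 -> 40% of rows are incomplete
print(n_distinct)  # 2   -> duplicates suggest a uniqueness issue
```

A real profiler would run this per column across the whole table, but the idea is the same: completeness and uniqueness reduce to simple counts you can automate.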
But there are business questions left unanswered. Have you heard these before?
- “I can’t get at that data”
- “It’s great, but I have to do a lot of work with it before I can do anything”
- “Yeah, but have you met the people who own that data? It might be ok now…”
Try adding three categories to the DAMA structure:
- Accessible: Available for access
- Usable: Understandable, simple, relevant, and in the way you need it
- Confident: Reputation of the data and how it is managed
With this full set of data characteristics, you may find yourself facing deep existential questions you have never considered. If a data warehouse is inaccessible, is it really a warehouse? If people only use a dashboard to export data to Excel, is it really a dashboard? Do you really trust the people, process, and technology providing that data?