Single Source of Truth?
By Thomas Speidel

Administrative data are data collected for the purpose of administering a service or for internal reporting. They are notoriously messy and often of poor quality, requiring special expertise to use beyond the purposes for which they were collected.

Researchers and statistical agencies have used administrative data for quite some time: for example, when insurance claim records are used to measure performance or effectiveness, or when records from family physicians are used to measure population health.

Statistics Canada has used these sources since the 1920s, and in 1969 Fellegi and Sunter, from the same agency, provided a landmark probabilistic framework for harmonizing (linking) across them. Linkage is a critical aspect of analysis because these data sources are seldom useful in isolation.

Wikipedia has a good example of matching persons across different datasets: we often expect nice properties from data, but in reality data are captured with error, or use different conventions, which makes matching difficult.

Record linkage, which goes by names such as data matching or fuzzy matching in other fields, has been used extensively in the census and in public health research since the 1940s. In the 1980s, computer scientists popularized aspects of it for database applications.

Bill Winkler, who also happens to be an active LinkedIn member, has made significant contributions in this area; for instance, the Jaro-Winkler distance is a commonly used string comparison method. Roughly, edit distances count the number of changes one has to make to turn one word into another (for instance, MacDonald to McDonald is a single deletion).
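To make the idea concrete, here is a minimal Python sketch of the classic Levenshtein edit distance (the function name and implementation are my own illustration, not code from any of the tools mentioned above):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = curr
    return prev[-1]

levenshtein("MacDonald", "McDonald")   # 1: delete the "a"
```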

Example of linkage (source: Wikipedia)

Here's a simple example I often face at work: suppose one is trying to harmonize across all variations of the same entity, say Fort McMurray. The most basic edit distance method results in the following:

> stringdist("Fort McMurray", c("Ft.McMurray", "Calgary"))

[1] 3 11

The edit distance between "Fort McMurray" and "Ft.McMurray" is 3, while the edit distance between "Fort McMurray" and "Calgary" is 11. There are numerous methods, from edit distances to q-grams to phonetic algorithms. Here are a few examples using R's stringdist package:

> stringdist("Fort McMurray", c("Ft.McMurray", "Calgary"), method = "jw", p = 0.1)

[1] 0.1865035 0.5934066

> stringdist("Fort McMurray", c("Ft.McMurray", "Calgary"), method = "cosine")

[1] 0.09850213 0.56356422
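The q-gram based methods, such as cosine above, treat each string as a vector of substring counts and compare the vectors. A rough Python sketch of the idea (q = 2 here; stringdist's defaults and normalization may differ):

```python
from collections import Counter
from math import sqrt

def qgrams(s: str, q: int = 2) -> Counter:
    """Multiset of the overlapping length-q substrings of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def cosine_distance(a: str, b: str, q: int = 2) -> float:
    """1 minus the cosine similarity of the two q-gram count vectors:
    0 for identical strings, 1 when no q-gram is shared."""
    va, vb = qgrams(a, q), qgrams(b, q)
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(c * c for c in va.values())) * \
           sqrt(sum(c * c for c in vb.values()))
    return 1.0 - dot / norm
```

With this sketch, "Ft.McMurray" shares many bigrams with "Fort McMurray" while "Calgary" shares none, so the distances order the candidates the same way stringdist does.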

> stringdist("Fort McMurray", c("Ft.McMurray", "Calgary"), method = "soundex")

[1] 1 1

In the soundex method, a result of 1 means the two strings produce different soundex signatures.

Source: http://mirlab.org/jang/book/dcpr/dpEd.asp?title=8-3%20Edit%20Distance
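Soundex itself is simple enough to sketch: keep the first letter, replace the remaining consonants with digit codes, collapse adjacent repeats, ignore vowels, and pad to four characters. A simplified Python version (my own illustration; it omits the special h/w rule of the full standard):

```python
# Consonant-to-digit table used by soundex; vowels, h, w and y get no code.
CODES = {c: d for d, group in
         {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
          "4": "l", "5": "mn", "6": "r"}.items() for c in group}

def soundex(name: str) -> str:
    letters = [c for c in name.lower() if c.isalpha()]
    first = letters[0].upper()
    digits = []
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        d = CODES.get(c, "")
        if d and d != prev:        # collapse runs of the same code
            digits.append(d)
        prev = d                   # an uncoded letter breaks the run
    return (first + "".join(digits) + "000")[:4]

soundex("Fort McMurray"), soundex("Ft.McMurray"), soundex("Calgary")
# → ('F635', 'F352', 'C426'): all three signatures differ, consistent
# with the two 1's returned by stringdist above.
```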

The Centers for Disease Control used to maintain a record linkage tool aimed at facilitating linkage between cancer registry data and administrative data; though, in truth, the data could be anything. It appears the tool is still available.

Today, more than ever, the idea of a "single source of truth", or of federated systems where all data sources are linked, remains impractical in all but the simplest of cases: the majority of these sources are administrative data, where matching can be challenging, tedious and expensive. Thomas Dinsmore writes:

The vision of the enterprise data warehouse as the single source of truth was never realized, even at the peak of the hype cycle. No large organization ever successfully integrated all of its data into a single centralized repository. If the vision of a unified enterprise data warehouse was unattainable in the 1990s, it is certainly unattainable today. There is simply too much data, the data is too diverse, and it moves too fast.

As is the case with many record linkage problems, there may be no deterministic solution, so linking records may require a probabilistic approach. Paradoxically, one ends up with "multiple sources of truth", each reflecting the different context the data were collected for. This quickly becomes unmanageable, especially in the absence of a clear analysis objective.
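The probabilistic approach can be sketched with the Fellegi-Sunter idea: each compared field contributes a log-likelihood-ratio weight, positive when the field agrees and negative when it disagrees, and the total is compared to thresholds to classify a pair as a match, a non-match, or an uncertain case for clerical review. The m- and u-probabilities below are made-up illustration values, not estimates:

```python
import math

# Hypothetical parameters per field:
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
FIELDS = {"name": (0.95, 0.01), "birth_year": (0.90, 0.05)}

def match_weight(agreements: dict) -> float:
    """Total Fellegi-Sunter weight: sum over fields of
    log2(m/u) on agreement, log2((1-m)/(1-u)) on disagreement."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELDS[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total

match_weight({"name": True, "birth_year": True})    # strongly positive: likely match
match_weight({"name": False, "birth_year": False})  # strongly negative: likely non-match
```

In practice the m- and u-probabilities are estimated from the data (often via EM), which is exactly where the expense and tedium mentioned above come in.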

One side effect is that we quickly lose faith in the data, deeming it inaccurate. It seldom is: in my experience, the linkage and transformations may not have taken into account the research objective, or perhaps attempted to create a one-size-fits-all dataset. In either case, the resulting data is unlikely to meet the intended objectives.

When data are used for reasons beyond what they were collected for, my suggestion to organizations is to harmonize, link and clean data during the data analysis stage, and only for the purpose of the analysis itself: only once the value of the analysis objective has been confirmed should we think of a larger harmonization effort.

Many times, we end up doing the precise opposite on the faulty belief that we can separate data from context and objectives.


References

Convenience sampling: https://en.wikipedia.org/wiki/Convenience_sampling

Fellegi-Sunter and record linkage: https://en.wikipedia.org/wiki/Record_linkage

Single source of truth: https://en.wikipedia.org/wiki/Single_source_of_truth

Dinsmore, Thomas W.: Disruptive Analytics: Charting Your Strategy for Next-Generation Business Analytics. http://a.co/2farUUL

Winkler, William E.: Matching and Record Linkage. https://www.census.gov/srd/papers/pdf/rr93-8.pdf
