Single Source of Truth?
Administrative data are data collected to administer a service or for internal reporting. They are notoriously messy and often of poor quality, requiring special expertise to use beyond the purposes for which they were collected.
Researchers, including statistical agencies, have used administrative data for quite some time: insurance claim records are used to measure performance or effectiveness, and records from family physicians are used to measure population health.
Statistics Canada has used such sources since the 1920s, and in 1969 Fellegi and Sunter, from the same agency, provided a landmark probabilistic framework for harmonizing (linking) across them. Linkage is a critical aspect of analysis because these data sources are seldom useful in isolation.
Wikipedia has a good example in which one wants to match persons across different datasets: we often expect nice properties from data, but in reality data are captured with error or use different conventions, which makes matching difficult.
Record linkage, which goes by the name of data matching or fuzzy matching in other fields, has been used extensively in the census and in public health research since the 1940s. In the 1980s, computer scientists popularized aspects of it for database applications.
Bill Winkler, who also happens to be an active LinkedIn member, has made significant contributions in this area; for instance, the Jaro-Winkler distance is a commonly used edit distance method. Roughly, edit distances count the number of changes one has to make to turn one word into another (for instance, MacDonald to McDonald).
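To make that concrete, here is what R's stringdist package (used in the examples below) returns for that pair; the default method counts the single deleted letter:

> library(stringdist)
> stringdist("MacDonald", "McDonald")
[1] 1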
Example of record linkage (source: Wikipedia)
Here's a simple example I often face at work: suppose one is trying to harmonize all the variations of the same entity, say Fort McMurray. The most basic edit distance method gives the following:
> stringdist("Fort McMurray", c("Ft.McMurray", "Calgary"))
[1] 3 11
The edit distance between "Fort McMurray" and "Ft.McMurray" is 3, while the edit distance between "Fort McMurray" and "Calgary" is 11. There are numerous methods, from edit distances to q-grams to phonetic algorithms. Here are a few examples using R's stringdist package:
> stringdist("Fort McMurray", c("Ft.McMurray", "Calgary"), method = "jw", p = 0.1)
[1] 0.1865035 0.5934066
> stringdist("Fort McMurray", c("Ft.McMurray", "Calgary"), method = "cosine")
[1] 0.09850213 0.56356422
> stringdist("Fort McMurray", c("Ft.McMurray", "Calgary"), method = "soundex")
[1] 1 1
In the soundex method, a distance of 1 means that the two strings being compared produce different soundex signatures.
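For the curious, stringdist's phonetic() function exposes the soundex codes themselves. The codes below are what I would expect from working the algorithm by hand (the exact handling of the space and the period may vary by implementation):

> phonetic(c("Fort McMurray", "Ft.McMurray", "Calgary"))
[1] "F635" "F352" "C426"

All three codes differ, hence distances of 1 across the board.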
Edit distance illustration (source: http://mirlab.org/jang/book/dcpr/dpEd.asp?title=8-3%20Edit%20Distance)
The Centers for Disease Control used to maintain a record linkage tool aimed at facilitating linkage between cancer registry data and administrative data, though in truth the data could be anything. The tool still appears to be available.
Today, more than ever, the idea of a "single source of truth", or of federated systems where all data sources are linked, remains impractical except in the simplest of cases: the majority of these sources are administrative data, where matching can be challenging, tedious and expensive. Thomas Dinsmore writes:
The vision of the enterprise data warehouse as the single source of truth was never realized, even at the peak of the hype cycle. No large organization ever successfully integrated all of its data into a single centralized repository. If the vision of a unified enterprise data warehouse was unattainable in the 1990s, it is certainly unattainable today. There is simply too much data, the data is too diverse, and it moves too fast.
As is the case with many record linkage problems, there may be no deterministic solution, so linking records may require a probabilistic approach. Paradoxically, one ends up with "multiple sources of truth", each reflecting a different context the data was collected for. This quickly becomes unmanageable, especially in the absence of a clear analysis objective.
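To give a flavour of what a probabilistic approach looks like, here is a minimal sketch in R, loosely in the spirit of Fellegi-Sunter: compare candidate pairs field by field, combine the comparisons into a weight, and classify against two thresholds. The records, weights and thresholds are all invented for illustration.

library(stringdist)

# Hypothetical records from two administrative sources (invented data)
a <- data.frame(name = c("Fort McMurray Clinic", "Riverside Medical"),
                yob  = c(1965, 1972), stringsAsFactors = FALSE)
b <- data.frame(name = c("Ft.McMurray Clinic", "Riverside Medical Ctr"),
                yob  = c(1965, 1973), stringsAsFactors = FALSE)

# All candidate pairs (real applications block/filter first)
pairs <- expand.grid(i = seq_len(nrow(a)), j = seq_len(nrow(b)))

# Field-level evidence: Jaro-Winkler similarity on names,
# exact agreement on year of birth
pairs$name_sim  <- 1 - stringdist(a$name[pairs$i], b$name[pairs$j],
                                  method = "jw", p = 0.1)
pairs$yob_agree <- as.numeric(a$yob[pairs$i] == b$yob[pairs$j])

# One combined weight (illustrative weights, not the estimated m/u
# probabilities a full Fellegi-Sunter model would use)
pairs$weight <- 0.8 * pairs$name_sim + 0.2 * pairs$yob_agree

# Two thresholds give match / clerical review / non-match zones
pairs$status <- cut(pairs$weight, breaks = c(-Inf, 0.5, 0.85, Inf),
                    labels = c("non-match", "review", "match"))
pairs

A full implementation would estimate per-field match and non-match probabilities (packages such as RecordLinkage do this), but the shape is the same: field comparisons, a combined weight, and a clerical-review zone between two thresholds.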
One side effect is that we quickly lose faith in the data, deeming it inaccurate. It seldom is: in my experience, the linkage and transformations may not have taken the research objective into account, or perhaps attempted to create a one-size-fits-all dataset. In either case, the resulting data is unlikely to meet the intended objectives.
When data are used for purposes beyond what they were collected for, my suggestion to organizations is to harmonize, link and clean data during the data analysis stage, and only for the purpose of the analysis itself: only once the value of the analysis objective has been confirmed should we think of a larger harmonization effort.
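As an example of what "only for the purpose of the analysis itself" can look like, a fit-for-purpose harmonization may be as small as mapping messy place names onto the short canonical list that one analysis needs, using stringdist's amatch(); the list, threshold and raw values below are invented for illustration:

library(stringdist)

# Canonical names this particular analysis cares about (invented list)
canonical <- c("Fort McMurray", "Calgary", "Edmonton")

# Messy values as they arrive in the administrative file (invented)
raw <- c("Ft.McMurray", "Ft McMurray", "CALGARY", "Edmontn")

# Map each raw value to the nearest canonical name, if close enough
idx <- amatch(tolower(raw), tolower(canonical),
              method = "jw", p = 0.1, maxDist = 0.25)
data.frame(raw = raw, clean = canonical[idx])

Values with no sufficiently close canonical name come back as NA, which is a feature: they surface for review instead of being silently forced into the wrong bucket.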
Many times, we end up doing the precise opposite on the faulty belief that we can separate data from context and objectives.
References
Convenience sampling: https://en.wikipedia.org/wiki/Convenience_sampling
Fellegi-Sunter and record linkage: https://en.wikipedia.org/wiki/Record_linkage
Single source of truth: https://en.wikipedia.org/wiki/Single_source_of_truth
Dinsmore, Thomas W.: Disruptive Analytics: Charting Your Strategy for Next-Generation Business Analytics, http://a.co/2farUUL
Winkler, William E.: Matching and Record Linkage