Properly Analyzing Data
A valid analysis depends on specific characteristics of the data and specific characteristics of the analysis
A valid analysis can depend very heavily on the characteristics of the data and on the specific type of analysis. Around five years ago, I was teaching a seminar on a specific type of analysis that had many associated programs and few theoretical results. In the seminar, I was able to show how the results of individual programs broke down when specific equalities (and inequalities) failed because of certain types of errors in the data. I was also able to show the participants how certain data-cleaning programs (new for this specific analysis) assured that the results of the main computational programs would then be nearly as correct as possible. Almost all errors (even small 1-5% errors) accumulate. With three to seven such errors, and without correction of the series of small data errors, the overall results of an analysis can be completely wrong.
The following describes a relatively simple demonstration that some individuals have carried out.
It is very easy to introduce minor distortions into any data set, perform the same 'statistical' analysis on both the original data and the 'distorted' data, and observe that the results change substantially.
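To make this concrete, here is a toy sketch (purely illustrative; the data, the 5% distortion scheme, and all names are invented): fit an ordinary least-squares slope on clean data and again after corrupting a small fraction of the values, then compare.

import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)   # true slope is 2.0

def ols_slope(x, y):
    # Ordinary least-squares slope of y on x (with an intercept).
    X = np.column_stack([np.ones_like(x), x])
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Distort roughly 5% of the x-values (e.g., a misplaced decimal point).
x_bad = x.copy()
bad = rng.choice(n, size=int(0.05 * n), replace=False)
x_bad[bad] *= 10.0

print("slope on clean data:    ", round(ols_slope(x, y), 3))
print("slope on distorted data:", round(ols_slope(x_bad, y), 3))

Even this crude corruption of a handful of records is enough to move the fitted slope well away from the value obtained on the clean data.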
It turns out that very few people can 'clean up' their data, and for those who can, the amount of work is substantial. If there is 5% error in the data, how will you know? What analyses will you be able to do? If there is 10% error in the data, are there any analyses that you can do?
During the course of developing the 1990 production system, almost all the improvements in our matching were due to better methods for name standardization, address standardization, and data structuring prior to running the algorithms that produced the 'optimal' parameters. Each time we improved a preprocessing routine or added another preprocessing routine, our matching results improved.
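To give a flavor of what such preprocessing looks like (a hand-rolled sketch, not the actual Census Bureau standardizers; the nickname table and function name are invented), name standardization typically upcases, strips punctuation, collapses whitespace, and maps common nicknames and abbreviations to standard forms before any comparison is attempted:

import re

# Invented nickname/abbreviation table; real standardizers use far larger ones.
NICKNAMES = {"WM": "WILLIAM", "BILL": "WILLIAM", "BOB": "ROBERT", "JIM": "JAMES"}

def standardize_name(raw):
    # Upcase, strip punctuation, collapse whitespace, expand common nicknames.
    cleaned = re.sub(r"[^A-Za-z ]", " ", raw).upper()
    tokens = [NICKNAMES.get(t, t) for t in cleaned.split()]
    return " ".join(tokens)

print(standardize_name("Wm. Smith, Jr"))   # -> WILLIAM SMITH JR

Without this kind of step, "Wm. Smith, Jr" and "WILLIAM SMITH JR" would never agree on the name field, no matter how good the matching parameters are.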
Winkler, W. E. (2014a), "Matching and Record Linkage," Wiley Interdisciplinary Reviews: Computational Statistics, DOI: 10.1002/wics.1317, https://wires.wiley.com/WileyCDA/WiresArticle/wisId-WICS1317.html; available from the author on request for academic purposes.
In the above paper, I describe how false match rates are estimated in the unsupervised situation and compare the results with the more accurate results of the semi-supervised situation, where moderate amounts of training data are available in each region.
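In rough outline (a simplification, not necessarily the exact procedure in the paper): if a model supplies a posterior match probability for each candidate pair, then among the pairs declared matches above a cutoff, each pair contributes its complementary probability as an expected false match, which yields an estimated false match rate. A minimal sketch, assuming such posteriors are already in hand (the numbers below are invented):

import numpy as np

def estimated_false_match_rate(posterior, cutoff=0.9):
    # Among pairs declared matches (posterior >= cutoff), each pair contributes
    # (1 - posterior) expected false matches; the mean is the estimated rate.
    declared = posterior >= cutoff
    if not declared.any():
        return 0.0
    return float((1.0 - posterior[declared]).mean())

posterior = np.array([0.99, 0.97, 0.95, 0.92, 0.70, 0.40, 0.05])  # invented
print(estimated_false_match_rate(posterior))   # about 0.04 for the four declared pairs

The hard part, of course, is getting trustworthy posteriors without training data; that is where the quality of the underlying data and the preprocessing matter most.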
There is no write-up of the specific problem, which relates somewhat to my comments on how analyses fail with 'bad' data. The seminar to which I refer consisted of several Ph.D.s and one Master's-level participant who were good at theory and very good at computation. We went through a series of theoretical/methodological theorems and programs and a large number of data-cleanup/structuring programs (thirty-plus, most of them data preparation/structuring). The 'real-world' problem was estimating record linkage (entity-resolution) false match rates (precision) when there is no training data. When there are massive amounts of training data, the situation is known as the 'regression problem' in the books by Vapnik and by Hastie, Tibshirani, and Friedman. It is considered one of the most difficult problems in statistical learning theory. I worked on the problem off and on for nearly twenty years.
As background, I describe some methods in the 1990 Decennial Census production matching system where we had to do unsupervised learning in ~500 regions and match the entire U.S. in less than six weeks using seven VAX 8700s (about the same speed as a fast 80386 machine). We had five very high quality test decks from various areas of the U.S. In Winkler (1988, 1989), I introduced a method for finding the best naive Bayes approximation of a general interaction Bayes net and showed that the 'optimal' record linkage parameters varied significantly from region to region. The naive Bayes approximation was rediscovered using much slower-to-compute generalized additive models by Kim Larsen (2005, SIGKDD Explorations).
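For readers less familiar with that setup: in its simplest conditional-independence (naive Bayes) form, the unsupervised estimation amounts to fitting a two-class mixture over agreement/disagreement patterns on the comparison fields by EM, which yields the match proportion and the per-field agreement probabilities. The following is a bare-bones sketch under those assumptions (the fields, data, and function name are invented; it is not the production code):

import numpy as np

def em_fellegi_sunter(gamma, n_iter=100):
    # gamma: 0/1 agreement patterns, one row per candidate pair, one column per field.
    n, k = gamma.shape
    p = 0.1                  # initial guess for P(pair is a match)
    m = np.full(k, 0.9)      # P(field agrees | match)
    u = np.full(k, 0.1)      # P(field agrees | nonmatch)
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        lm = p * np.prod(m ** gamma * (1 - m) ** (1 - gamma), axis=1)
        lu = (1 - p) * np.prod(u ** gamma * (1 - u) ** (1 - gamma), axis=1)
        w = lm / (lm + lu)
        # M-step: re-estimate the mixing proportion and per-field agreement rates.
        p = w.mean()
        m = (w[:, None] * gamma).sum(axis=0) / w.sum()
        u = ((1 - w)[:, None] * gamma).sum(axis=0) / (1 - w).sum()
    return p, m, u, w

# Tiny synthetic test: 5,000 pairs, 3 comparison fields, 5% true matches.
rng = np.random.default_rng(1)
is_match = rng.random(5000) < 0.05
gamma = np.where(is_match[:, None],
                 rng.random((5000, 3)) < 0.92,   # matches usually agree
                 rng.random((5000, 3)) < 0.08).astype(float)
print("estimated match proportion:", round(em_fellegi_sunter(gamma)[0], 3))

On badly standardized data the agreement patterns themselves are wrong, and no amount of iteration in a routine like this will recover good parameters; that is the point about preprocessing.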
In revisiting the problem over the years, I would periodically try to improve the methods. Generally, improved data preparation (developed very slowly and iteratively) improved the parameter estimation.
In the seminar, I went through the components of the theory and each of the associated parameter-estimation programs. For each parameter-estimation program, I showed how certain equations broke down when the pertinent data-cleaning/preprocessing programs were not used. The development of the full set of programs was very much a trial-and-error procedure.
<<If there is 5% error in the data, how will you know? What analyses will you be able to do? If there is 10% error in the data, are there any analyses that you can do?>> and <<The development of the full set of programs was very much a trial-and-error procedure.>> These could not be more important to consider or more often ignored. Thanks for sharing your insight, Bill Winkler