All data need to be cleaned

All data need to be cleaned. This is particularly true of Big Data. One of the first times I encountered (severe) problems with data was when I was working with some econometricians in the early 1980s. One mentioned that when he was a graduate student, his professor and two other graduate students were planning to do an analysis over the summer. After discovering severe errors in the data, the professor brought in ~10 additional students who spent 10+ months cleaning the data. Only then were they able to build a model on the cleaned data.

Data clean-up has come a long way. There are very good books by Redman, English, Loshin and others that serve as introductions. There have been a number of workshops and sessions at various CS and Stat conferences over the last twenty-plus years. The three main ideas involve (1) capturing the data into computer files accurately, (2) filling in missing data in a principled manner so that joint distributions are preserved and edits (business rules) are satisfied, and (3) identifying duplicate entities within and across files using (error-prone) quasi-identifiers such as names, addresses and phone numbers.
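As a toy illustration of idea (3), here is a sketch of flagging likely duplicates using error-prone quasi-identifiers. The field names, the crude string-similarity score, and the threshold are illustrative assumptions, not any production method:

```python
# Sketch of idea (3): flagging likely duplicates via quasi-identifiers.
# Fields, similarity measure, and threshold are illustrative only.
from difflib import SequenceMatcher

def normalize(s):
    """Crude standardization: case-fold and collapse whitespace."""
    return " ".join(s.upper().split())

def similarity(a, b):
    """String similarity in [0, 1] on the normalized values."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def likely_duplicate(rec_a, rec_b, threshold=0.85):
    """Average field similarity over the shared quasi-identifiers."""
    fields = ("name", "address", "phone")
    score = sum(similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)
    return score >= threshold, score

a = {"name": "Robert  Smith", "address": "12 Oak St", "phone": "555-0101"}
b = {"name": "ROBERT SMITH", "address": "12 Oak Street", "phone": "555-0101"}
print(likely_duplicate(a, b))   # (True, ~0.94): plausibly the same entity
```

In practice the string comparators, the standardization, and the weighting of the fields all require far more care than this sketch suggests.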

This is my sixth keynote at a workshop at a major international CS conference (https://sites.google.com/site/dinaworkshop2015/invited-speakers). I introduced the current software in two short courses at the University of London and more recently lectured on the methods when I was back at the Isaac Newton Institute in Cambridge. Some recent research reports are listed at https://www.census.gov/srd/csrmreports/byyear.html. The report 'Cleaning and Using Administrative Lists...' (https://www.census.gov/srd/papers/pdf/RRS2018-05.pdf) describes the development of the methods.

In the above paper, equations (3)-(6) describe the basic theoretical situation of Winkler (1988, 1989). For the 1990 Decennial Census, our production system needed to do the matching in ~500 regions, each with significantly different 'optimal' parameters, in 3-6 weeks. We did not have the resources to find training data in each region and had to do unsupervised learning. We used seven VAX 8700s (each about the same speed as a fast 80386 machine) to do the matching. Most of the improvements in the matching after equations (3)-(6) (also in Winkler 1988) were in the preprocessing and structuring of the data (about twenty steps) going into the parameter-estimation algorithms. These 'early' systematic data-cleaning steps allowed us to do the matching with a 0.2% false match rate.
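To make the parameter-estimation idea concrete, here is a toy sketch of EM under a conditional-independence Fellegi-Sunter model on binary agreement vectors, in the spirit of equations (3)-(6). It is an illustration of the general approach, not the production code; in production, each region gets its own parameter estimates, and the twenty-odd preprocessing steps come first:

```python
# Toy EM for unsupervised record-linkage parameter estimation under
# conditional independence. Starting values and the synthetic demo
# below are illustrative assumptions.
import numpy as np

def em_fellegi_sunter(gamma, n_iter=100):
    """gamma: (n_pairs, k) 0/1 agreement vectors on k quasi-identifiers.
    Returns (p, m, u): match proportion, P(agree|match), P(agree|nonmatch)."""
    n, k = gamma.shape
    p, m, u = 0.1, np.full(k, 0.9), np.full(k, 0.1)   # starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match
        lm = p * np.prod(m**gamma * (1 - m)**(1 - gamma), axis=1)
        lu = (1 - p) * np.prod(u**gamma * (1 - u)**(1 - gamma), axis=1)
        w = lm / (lm + lu)
        # M-step: reestimate parameters from the posterior weights
        p = w.mean()
        m = (w[:, None] * gamma).sum(axis=0) / w.sum()
        u = ((1 - w)[:, None] * gamma).sum(axis=0) / (1 - w).sum()
    return p, m, u

# Demo on synthetic pairs: 10% matches that agree often, 90% nonmatches
rng = np.random.default_rng(0)
true_m, true_u = np.array([0.95, 0.9, 0.85]), np.array([0.1, 0.15, 0.05])
match = rng.random(2000) < 0.1
gamma = np.where(match[:, None], rng.random((2000, 3)) < true_m,
                 rng.random((2000, 3)) < true_u).astype(float)
print(em_fellegi_sunter(gamma))   # roughly recovers 0.1, true_m, true_u
```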

The difficulty affecting clean-up is that there may be a few larger error types (each affecting 1-5% of records) and twenty-plus smaller ones (each affecting at most 0.1% of records). If there is 10% error in the data, then how will the data be cleaned up? How will someone even know how much error is in the data?

Here is an interesting exercise for those with access to commercial or other data-profiling software (which typically delineates 'data errors'). If you have data that has previously been cleaned, with an audit trail of the corrections, run the original uncleaned file through the profiling software. Which errors, if any, are correctly identified? Some profiling software has a button (or buttons) that can be clicked to 'automatically' correct the data errors. Are those automatic corrections accurate?
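To make the scoring concrete, here is a minimal sketch that compares the profiler's flagged cells against the audit trail. The (row, column) cell identifiers and both example sets are hypothetical stand-ins for your own files:

```python
# Score a profiler's flagged errors against a known audit trail.
# Inputs are hypothetical; substitute cells from your own files.
def score_profiler(flagged, audited):
    """flagged, audited: sets of (row_id, column) cells.
    Returns (precision, recall) of the profiler's error detection."""
    true_hits = flagged & audited
    precision = len(true_hits) / len(flagged) if flagged else 0.0
    recall = len(true_hits) / len(audited) if audited else 0.0
    return precision, recall

audited = {(1, "phone"), (3, "name"), (7, "address")}   # from the audit trail
flagged = {(1, "phone"), (2, "zip"), (7, "address")}    # from the profiler
print(score_profiler(flagged, audited))   # ≈ (0.67, 0.67)
```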

This is absolutely true and enormously frustrating for the researcher. I will distinguish between ensuring the accuracy of the data and cleaning based on diagnostic tests. The former is critical ("how can that number be six?") while the latter can be dangerous (we trim the data we don't understand/like because a test says we can).

There is no such thing as raw data. All data has inbuilt assumptions due to choices made when it was measured. Data is also almost never designed for the purpose we need it for. In my line of work, we developed some code to partially clean time-series data.
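As a rough illustration of the kind of thing I mean (a generic sketch, not our actual code): flag spikes against a rolling median and interpolate over them, leaving the rest of the series untouched.

```python
# Generic partial cleaning for a time series: despike against a rolling
# median, interpolate the flagged points, leave everything else alone.
import pandas as pd

def despike(series, window=5, n_mads=5.0):
    """Replace points far from the rolling median with interpolated values."""
    med = series.rolling(window, center=True, min_periods=1).median()
    mad = (series - med).abs().rolling(window, center=True, min_periods=1).median()
    spikes = (series - med).abs() > n_mads * mad.clip(lower=1e-9)
    return series.mask(spikes).interpolate()

s = pd.Series([1.0, 1.1, 9.9, 1.2, 1.0, 1.1])
print(despike(s).tolist())   # the 9.9 spike is replaced by interpolation
```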

I’ve never forgotten the assistance you provided to me concerning data-matching accuracy when I worked at ChannelInsight. It changed the trajectory of my life. I’ve also spent years cleansing data, both manually through Excel (in the way-early years) and through software (Informatica). I’m finally working at the Census Bureau myself! I love this dataset!!

Another consideration is whether it’s even the right data to solve your problem. If not, go get exactly what you need; you needn’t settle for what is both lying around and “dirty”.
