The Joy of Data Retrieval
In this digital age, data is constantly being collected and often made readily available for empirical analysis. Even though it is simple to gain access to large datasets of previously collected observations, everyone involved in data science should personally collect a dataset at least once for the following reasons:
- Raw data is likely to contain errors.
- Personally retrieved data is customizable.
- Using a personally retrieved dataset is exciting.
When getting data from a third-party source, it is easy to assume that the fresh, raw data must be error-free and that issues can only enter the data while it is being analyzed. This assumption is simply untrue. In fact, errors in raw datasets are so common that many different tools have been created to help identify and root out those problems (Abedjan et al., 2016). Gathering a dataset ourselves allows us to peer into the deeper levels of our data, see where errors are likely to appear in the raw data, and account for those errors in our analysis.
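To make the kinds of errors discussed above concrete, here is a minimal sketch of a few basic checks on a raw file. Everything in it is an assumption made for illustration: the tiny `RAW` dataset, its column names, the plausibility limits, and the `find_problems` helper are all hypothetical, and real error-detection tools go well beyond this.

```python
import csv
import io

# Hypothetical raw data: a tiny survey export with typical problems.
RAW = """id,age,income
1,34,52000
2,,48000
3,29,-1
3,29,-1
4,210,61000
"""

def find_problems(text):
    """Return a list of (line_number, description) for suspicious rows."""
    problems = []
    seen = set()
    reader = csv.DictReader(io.StringIO(text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        key = tuple(row.values())
        if key in seen:
            # An exact repeat of an earlier row; report it once and move on.
            problems.append((line_no, "duplicate row"))
            continue
        seen.add(key)
        for field, value in row.items():
            if value == "":
                problems.append((line_no, f"missing {field}"))
        age = row["age"]
        if age and not 0 <= int(age) <= 120:  # assumed plausible range
            problems.append((line_no, "implausible age"))
        income = row["income"]
        if income and int(income) < 0:
            problems.append((line_no, "negative income"))
    return problems

for line_no, description in find_problems(RAW):
    print(line_no, description)
```

Someone who collected this data personally would likely know whether `-1` is a typo or a placeholder for "declined to answer", which is exactly the kind of context a third-party file rarely carries.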
Data coming from a third party is rarely in the exact form necessary to answer the specific questions we are interested in. The manipulations and calculations required to clean and reshape the raw data introduce further sources of possible error. Retrieving our own data allows us to choose the exact form in which the data is collected. Additionally, deciding how we want the data retrieved pushes us to think more deeply about how it will be used to answer the research question before we reach the analysis step.
After working for weeks, months, or even years to gather a usable dataset, the joy of beginning analysis on that dataset is unrivaled. Seeing those first graphs and regressions coming from the carefully gathered dataset is akin to summiting a mountain after a long climb. Finally starting that analysis on a manually gathered dataset is a feeling that must be experienced to be truly understood.
Technology has allowed research to progress in ways and at speeds previously unattainable. That technological progression is sure to continue, and more and more researchers may spend their entire careers without gathering their own data. That said, the experience of getting into the weeds and collecting data for a project at least once is transformative. It changes how data is seen, how it is retrieved, and how it is analyzed.
References
Abedjan, Z., et al. (2016, August). Detecting Data Errors: Where Are We and What Needs to Be Done? Proceedings of the VLDB Endowment, 9(12), 993–1004. doi:10.14778/2994509.2994518