Demystifying Data
We live in an age of data. Data is in nearly everything we do. Google alone processes 8.5 billion searches per day, each of which generates data. Data has revolutionized modern life and the global economy. Take economics, my own field of study, for example. Gone are the days of economists in tweed suits weaving theories based on abstractions. Contemporary economists are required to be statisticians and to test each theory empirically. The goal of this article is to provide a framework for approaching data and to recommend initial steps when working with data. I have spent countless hours processing, building, and analyzing data. My hope is that these simple steps provide guideposts when you begin your own data journey.
Step 0: Before diving into the data, whatever that may look like, you must first establish your objective. Although a fully thought-out plan is not necessary, you should not approach your data aimlessly. There's no easier way to get lost in the sea of data than forgetting your "North Star."
Step 1: The biggest problem with data is volume—there is simply so much of it. Very rarely is raw data intuitive. This is why you should always take stock of what data you have. If you are a student, you might be working with data from the BLS or FRED. If you are a business owner, you may have sales data, payroll data, financial data, etc. Taking an inventory of your data is not trivial—it is quite easy to lose track of the data you have. Would you be amazed to hear that some of the largest companies in the world fail this step? Fortune 500 companies can get lost in their data just as easily as you or I. Thus, the first step to working with data is to evaluate what you have.
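If your data lives in a folder of files, a short script can take the inventory for you. The following is only a minimal sketch in Python, assuming your files sit under a single folder; the function name `inventory` is my own and hypothetical, not part of any library.

```python
from pathlib import Path


def inventory(folder):
    """Return (relative path, size in bytes) for every file under folder."""
    root = Path(folder)
    rows = []
    # rglob("*") walks the folder and all of its subfolders.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rows.append((str(path.relative_to(root)), path.stat().st_size))
    return rows
```

Calling `inventory("data")` on a hypothetical folder named `data` returns one entry per file, which you can paste into a notes sheet as your starting list.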
Step 2: After inventorying your data, it is time to delve in. If you are dealing with multiple data files, pick one to start with. Working with data is an exploration, not a race; those other files will still be there waiting for you. Depending on the size of your data, it can be helpful to analyze a sample first to orient yourself (e.g., the first 1,000 rows). At this stage, I recommend opening your sample in Excel. While I would not suggest building your data in Excel, it is more than adequate for getting your bearings. One of Excel’s many strengths is that it lets you filter your data easily (press Ctrl+Shift+L to toggle filters).
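If the full file is too large to open comfortably, one way to pull a sample is sketched below. It assumes the data is a CSV file with a header row; the function name `write_sample` and the file names in the usage note are hypothetical.

```python
import csv
from itertools import islice


def write_sample(source_path, sample_path, n_rows=1000):
    """Copy the header plus the first n_rows data rows to a new CSV file.

    Returns the number of data rows written.
    """
    with open(source_path, newline="", encoding="utf-8") as src, \
         open(sample_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            return 0  # the source file is empty
        writer.writerow(header)
        count = 0
        # islice stops after n_rows, so the rest of the file is never read.
        for row in islice(reader, n_rows):
            writer.writerow(row)
            count += 1
        return count
```

For example, `write_sample("sales.csv", "sales_sample.csv")` would produce a small file that opens quickly in Excel. Note that the first 1,000 rows are not a random sample; they are simply a quick way to get oriented.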
Step 3: Now that your data is in Excel, it is time to review its fields. I have found that the most methodical and organized approach is best. Scan column-by-column, field-by-field, and jot down what information might be contained in each. It can be helpful to open a separate Excel sheet with a list of the fields in your data as a separate space for note-taking; this encourages a structured approach to your review. The goal of this initial pass is not to master your data (if only it were that easy!), but to familiarize yourself with it and its contents. For example, do you have numeric or string type fields? Do you have both? Do you have something that looks like a date but is formatted as a string? The options are endless. Remember, the idea is not to know everything about every variable; rather, this is a simple aid to determine which fields might be of value (think back to Step 0 and what you are hoping to use the data for).
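The same first pass over the fields can be roughed out in code. The sketch below guesses whether each field looks numeric, looks like a date stored as a string, or is a plain string. It is a heuristic only: the two date formats are assumptions of mine, and the function names are hypothetical. Treat its output as a starting point for your notes, not a verdict.

```python
import csv
from datetime import datetime

# Assumed date layouts; extend this tuple to match your own data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def _is_number(text):
    try:
        float(text.replace(",", ""))  # tolerate thousands separators
        return True
    except ValueError:
        return False


def _is_date(text):
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


def guess_type(values):
    """Classify a column of raw text values."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return "empty"
    if all(_is_number(v) for v in non_empty):
        return "numeric"
    if all(_is_date(v) for v in non_empty):
        return "date-like string"
    return "string"


def summarize_fields(rows):
    """Map each field name to a guessed type, given rows from csv.DictReader."""
    rows = list(rows)
    if not rows:
        return {}
    return {field: guess_type([row[field] for row in rows]) for field in rows[0]}
```

Running `summarize_fields` on the rows of your sample gives a field-by-field list that can seed the note-taking sheet described above.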
Step 4: Now, consider how the fields are related to one another. For example, do you have fields that are encoded versions of longer string variables? Additionally, make sure to note any mathematical relationships between fields. For example, does price times quantity equal total sales? Excel is especially useful when inspecting these mathematical relationships. If common sense dictates that two fields should be related, like price and sales, but the math proves uncooperative, take note. This could indicate you are missing an element, or it could be a shortcoming in the data. Either way, such findings will influence your next steps.
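The price-times-quantity check can also be scripted once the sample outgrows a spreadsheet formula. This is a minimal sketch, assuming hypothetical field names `price`, `quantity`, and `total_sales` and a rounding tolerance of one cent; adjust both to your own data.

```python
import math


def find_mismatches(rows, tolerance=0.01):
    """Return the 1-based row numbers where price * quantity != total_sales.

    rows is a list of dicts with 'price', 'quantity', and 'total_sales'
    (hypothetical field names). Differences up to tolerance are ignored.
    """
    mismatches = []
    for number, row in enumerate(rows, start=1):
        expected = float(row["price"]) * float(row["quantity"])
        actual = float(row["total_sales"])
        if not math.isclose(expected, actual, abs_tol=tolerance):
            mismatches.append(number)
    return mismatches
```

The rows it flags are the ones to take note of: they may point to a missing element, such as a discount or tax field, or to a shortcoming in the data itself.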
Now that you have a sense of your data and its contents, you are ready to diagnose and validate your data. It is in this stage that you dig into the minutiae of the data and truly begin to grapple with it. However, this is a subject for another (and much lengthier) discussion. Suffice it to say, this is best approached in software optimized for data building (e.g., R, SAS, etc.). For now, I hope these steps allow you to start data building without fear. Data can be messy, but the insights it can reveal are well worth the investment.