5 Simple Steps to Cleaning Messy Data

Have you ever pulled a report, looked at the data, and said to yourself, “Well, that doesn’t look right”? Messy data is ubiquitous, especially when you’re pulling from multiple sources. Sometimes human error is to blame; other times it’s simply a bug. Whatever the case may be, a few common best practices can help prevent reporting on messy data.

As a Senior Analyst at Medical Guardian, I’ve wrangled my fair share of data and have picked up a few helpful tips for organizing it along the way:

Start with a quick scan

Whether you’re working with an .xlsx, .csv, .txt, or even a PDF file, the most important first step is simply to look at the data. Rather than scanning an entire data set, a quick trick is to look at the first 10 rows and then the last 10. This gives you a top and bottom glimpse of what type of data you’re working with and how much domain knowledge is required to make decisions about data structures and organization.
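The top-and-bottom glimpse is a one-liner in pandas. Here's a minimal sketch using a small in-memory CSV in place of a real export (the file contents and column names are hypothetical):

```python
import io

import pandas as pd

# A small in-memory CSV stands in for a real export (hypothetical data).
raw = io.StringIO(
    "customer_id,age,plan\n"
    + "\n".join(f"{i},{20 + i},basic" for i in range(1, 26))
)
df = pd.read_csv(raw)

# Top-and-bottom glimpse: first 10 rows, then last 10.
print(df.head(10))
print(df.tail(10))
```

With a real file you'd pass its path to `pd.read_csv` instead of the `StringIO` buffer.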

Separate unknown and known data

Unless you have complete domain knowledge of the data set, it’s likely a few data points will appear strange, misleading, or incomplete. Place these fields into an unknown bucket. Once you complete this exercise, connect with colleagues who can confirm the validity of these unknowns. Afterward, you’ll be left with a full data set you can understand, as well as increased domain knowledge.
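The bucketing step can be as simple as partitioning column names into two lists. A minimal sketch (the field names and the "known" set here are hypothetical):

```python
# Columns from a hypothetical export; cryptic names you can't explain
# go in the unknown bucket to confirm with colleagues.
columns = ["customer_id", "age", "plan", "chrn_flg", "xfer_cd"]
known_fields = {"customer_id", "age", "plan"}  # fields you already understand

known = [c for c in columns if c in known_fields]
unknown = [c for c in columns if c not in known_fields]

print("Ready for analysis:", known)
print("Confirm with colleagues:", unknown)
```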

Fix missing/blank/NA values

It’s normal for data sets to contain blank values. When faced with them, there are a couple of decisions you can make to smooth the data, depending on both their scale and impact. With medium to large data sets, it’s fairly common to replace missing numeric values with the column mean. However, depending on your domain knowledge, it might be safer to remove a row entirely if it has too many missing values.
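Both options are one call in pandas. A sketch using hypothetical data with gaps (`None` marks a missing value):

```python
import pandas as pd

# Hypothetical data set with blanks.
df = pd.DataFrame({
    "age": [34, None, 52, 41, None, 29],
    "monthly_fee": [30.0, 35.0, None, 30.0, 45.0, 35.0],
})

# Option 1: replace missing numeric values with the column mean.
filled = df.fillna(df.mean(numeric_only=True))

# Option 2: drop rows with too many missing values
# (here: keep only rows with at least 2 non-missing fields).
trimmed = df.dropna(thresh=2)

print(filled)
print(trimmed)
```

Which option is safer depends on the domain: mean-filling preserves rows but can mask real gaps, while dropping rows preserves accuracy at the cost of sample size.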

Fix data types

It’s likely there will be entire columns of data that a source did not export in the required format. For example, a customer’s age might have been exported as a character or string type. In order to perform a numeric analysis, the data type must be changed to a numeric type.
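Here's a minimal sketch of that conversion with pandas, using a hypothetical age column exported as strings. `errors="coerce"` turns unparseable entries into missing values instead of raising:

```python
import pandas as pd

# Ages exported as strings (hypothetical example).
df = pd.DataFrame({"age": ["34", "52", "41", "not provided"]})

# Coerce to numeric; unparseable entries become NaN rather than raising,
# which folds this step back into the missing-value handling above.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

print(df["age"].mean())
```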

Perform a brief exploratory analysis

At this point, your data will be in a workable state to begin grouping and analyzing for basic insights. It’s common practice to explore a few of the data set’s variables using histograms, correlation plots, and box plots. A histogram can quickly show whether a variable has a high concentration (density) within a specified bin or range. This could quickly reveal that customers have a strong affinity for a certain product, payment plan, or product color. It could also surface something unexpected or incorrect in the data, such as the majority of customers being between 5 and 10 years old for a product that requires buyers to be 18 or older. To resolve it, trace the issue back to the data entry point and determine how to clean the data. You may discover that a bug was reducing the age field by 20 years before export. Once the bug has been identified and confirmed, the age data can safely be restored.
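The age-bug scenario above can be sketched without a plotting library: binning the values and counting per bin gives a quick text histogram, and a conditional shift restores the affected rows once the bug is confirmed. The ages below are hypothetical:

```python
import pandas as pd

# Hypothetical ages exhibiting the bug above: most values shifted down 20 years.
ages = pd.Series([5, 7, 8, 6, 9, 10, 45, 6, 8, 7])

# A quick text "histogram": bin the ages and count per bin.
bins = pd.cut(ages, bins=[0, 10, 20, 40, 60, 80])
print(bins.value_counts().sort_index())

# A heavy 0-10 bin is suspicious for an 18+ product. Once the export
# bug (ages reduced by 20 years) is confirmed, shift affected rows back:
restored = ages.where(ages >= 18, ages + 20)
print(restored.min())
```

Only apply a correction like the last line after the bug is confirmed at the source; otherwise you risk overwriting legitimate (if surprising) values.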

Taking these simple steps when working with data can save a lot of time and headache. It also gives your team more confidence in reporting and in making business decisions based on your analysis.

Happy data wrangling!
