Does your data cleaning spark joy?

“You only get out what you put in.” Although Greg Norman was speaking about life in general, the sentiment holds equally true for data modelling. If you don’t spend the time cleaning and preparing your data, it will show in the results you achieve. In fact, even simple algorithms can produce impressive insights from properly cleaned data. Yet most Data Scientists find data cleaning the least enjoyable part of their work, even though they spend around 80% of their time doing it! So the question must be asked: is there a way to make data cleaning more joyful?

Marie Kondo and her Kon-Mari method of tidying are currently sweeping the world (pun intended), with its message of “helping people tidy their spaces by choosing joy” and by “developing the simplest and most effective tools to help you get there.” Is it possible to apply the fundamentals of the Kon-Mari method to data cleaning, to make the process less tedious and, ultimately, perhaps even joyful?

What follows is a light-hearted, and hopefully informative, attempt to apply the six central tenets, or ‘rules’, of the Kon-Mari method to cleaning data. With a little paraphrasing and imagination, I believe the Kon-Mari method lends itself well to data cleaning and does bring joy by providing a methodical, stepwise approach. So now that you’re keen, let's start the clean!

Rule 1: Commit yourself to tidying up

You have your data, be it from a client, Kaggle or data you have collected yourself, and you know what lies ahead of you. Instead of feeling daunted by the task, be energised! Commit yourself to cleaning your data, knowing that the time and effort that you invest now will pay off in the quality of your modelling when all is done and dusted. Import the libraries you’re going to need, grab a fresh cup of coffee, and load your data. The basic libraries you'll need are:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Rule 2: Imagine your ideal dataset

You know what kind of dataset you want, so visualise it. I find the ideals espoused in Hadley Wickham’s Tidy Data particularly helpful for this. For example, imagine all of your feature variables as columns ordered by their role in the analysis, all observations as rows, and every type of observational unit in its own table. You don’t need to follow the guidelines of tidy data to the letter, but they are definitely worth aspiring to; a sketch of reshaping a messy table into this form follows.
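As a minimal sketch, reshaping a messy, wide table into tidy form with pandas might look like this (the table and column names are hypothetical):

# A messy, wide table: one row per country, one column per year
messy = pd.DataFrame({'country': ['AU', 'NZ'],
                      '2018': [10, 7],
                      '2019': [12, 9]})

# Tidy form: each variable is a column, each observation is a row
tidy = messy.melt(id_vars='country', var_name='year', value_name='cases')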

Although raw (and even some cleaned) datasets are uniquely messy, by creating a tidy dataset, you will ease your data cleaning.

Rule 3: Finish discarding first

Once your dataset has the semblance of a tidy dataset, the next thing you want to deal with is what to discard. A good place to start after inspecting your data is to address the null values. I find this code snippet to be helpful:

# Count the missing values per column, then show only the columns that have any
null_count = df.isnull().sum()
null_count[null_count > 0]

Once you know how many values are missing from each column, you need to decide what to do with them. If only a handful of values are missing from a large dataset, I will often just remove those rows. However, if the dataset is not particularly large, it may be better to impute the missing values. If a missing value represents genuine absence, I will usually fill it with 0 (a sketch of these options follows the profiling snippet below). The Pandas Profiling Report is a great tool for assessing your dataframe as a whole:

import pandas_profiling as pp
pp.ProfileReport(df)
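Putting those missing-value options into code, here is a minimal sketch (the column names and the choice of median imputation are assumptions, not prescriptions):

# Option 1: drop rows containing any missing values (fine for large datasets)
df = df.dropna()

# Option 2: impute, e.g. fill a numeric column's gaps with its median ('age' is hypothetical)
df['age'] = df['age'].fillna(df['age'].median())

# Option 3: where a missing value genuinely means absence, fill with 0
df['num_purchases'] = df['num_purchases'].fillna(0)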

Once you have dealt with your missing values, it is time to determine which feature variables are necessary. Although you can use regularisation to minimise the noise these variables may create, I still find it worthwhile to go through the dataset and remove unnecessary columns. Pandas Profiling comes in handy here again, as it highlights columns with high collinearity and with low or high variance. It is then up to you to decide whether to discard or keep each feature, as sketched below.
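Dropping the columns you have decided against is a one-liner (the column names here are hypothetical):

# Remove features judged unnecessary after reviewing the profiling report
df = df.drop(columns=['customer_id', 'free_text_notes'])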

Rule 4: Tidy by category

Are your floats, floats? Are your strings, strings? Before you can do any statistical analysis on your dataset, it is critical to ensure each feature variable has the correct data type.

Replace, remove or modify any erroneous observations as required, then convert each feature to its correct data type; you are going to need it in the next step.
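A minimal sketch of checking and converting data types (the column names are hypothetical):

# Inspect the current data type of every column
print(df.dtypes)

# Coerce a numeric column stored as strings; unparseable entries become NaN
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Make sure a text column really is stored as strings
df['suburb'] = df['suburb'].astype(str)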

Rule 5: Follow the right order

Once you have the data types correctly assigned, the numerical and categorical feature variables can be processed. These have to be dealt with differently depending on what you intend to do with the dataset. However, it is good practice to standardise or normalise the numerical features, and to dummify the categorical variables (i.e. take each unique category and create a binary feature variable from it, assigning a value of 1 where the observation belongs to that category). This is especially useful if you will be doing subsequent modelling.

It is critical that this is done in the right order; otherwise you risk standardising the dummied features. The order of events that has worked well for me is to split the feature variables into numerical and categorical using:

# Select the numeric and categorical columns into separate dataframes
df_numeric = df.select_dtypes(include='number')
df_categorical = df.select_dtypes(include='object')

I drop the numerical columns that will not be used in subsequent steps, then use scikit-learn's StandardScaler to standardise the remaining feature columns:

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

# fit_transform returns a NumPy array, so wrap it back into a dataframe
scaled_numeric = pd.DataFrame(ss.fit_transform(df_numeric),
                              columns=df_numeric.columns,
                              index=df_numeric.index)

I then take the categorical variables and dummy them using pandas, which will one-hot encode the features; dropping the first level of each feature avoids creating perfectly collinear dummies:

dummy_categorical = pd.get_dummies(df_categorical, drop_first=True,
                                   dtype=float)

Once the preprocessing is finished, we recombine the numerical and categorical features into a single dataframe before splitting it into training and test sets, as sketched below.
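A minimal sketch of the recombination and split, assuming a hypothetical 'target' column holds the variable you want to predict:

from sklearn.model_selection import train_test_split

# Recombine the scaled numeric and dummied categorical features
df_clean = pd.concat([scaled_numeric, dummy_categorical], axis=1)

# Split into training and test sets ('target' is a hypothetical label column)
X_train, X_test, y_train, y_test = train_test_split(
    df_clean, df['target'], test_size=0.2, random_state=42)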

Rule 6: Ask yourself if it sparks joy

Last, but definitely not least: does it spark joy? I know, it seems impossible that a clean, preprocessed dataset wouldn’t spark tears of joy from even the most battle-hardened Data Scientist, but sometimes, it just ain’t enough.

When Marie Kondo speaks of sparking joy, she doesn’t mean the kind of joy that comes from winning the lotto, rather, she is referring to ‘Tokimeku’, the urge to use something. Think of it as picking up a pen and then looking for something to write on. Does seeing your data, cleaned and preprocessed, spark that same urge to use it? Do you want to do some modelling with it and see what you get back? If it does, then that is Tokimeku, and your clean data has successfully sparked joy.

So there you have it, the Kon-Mari method can be successfully applied to data cleaning! Yay! Now, hopefully, the next time you encounter a mountain of dirty data, you can find zen in the knowledge that, yes, this data cleaning will indeed spark joy.
