Data Cleaning for ML

Continuing from the previous post about the Machine Learning workshop, I finally organised a separate session on pre-processing data. I picked the classic PIMA Indian dataset on diabetes prediction - we are IQVIA after all. The dataset is the right size for training purposes and has a number of "defects" that allow demonstration of various data cleaning functions. We continued using the Python pandas, numpy and sklearn libraries. While some of the cleaning steps are intuitive (like removal of nulls or duplicates), I found scaling, normalising and centering less known or used among the participants. It was a good refresher for me as I tried to explain the process and invent "non-technical examples".
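For readers less familiar with centering, scaling and normalising, here is a rough sketch on a toy array. The numbers are made up, and the terms are used loosely in practice, so the mapping below is an assumption: centering subtracts the column mean, scaling (standardising) also divides by the column standard deviation, and normalising rescales each column to the 0 to 1 range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data: two columns on very different ranges (values are made up).
X = np.array([[50.0, 20.0],
              [100.0, 30.0],
              [150.0, 40.0]])

# Centering: subtract each column's mean, leave the spread untouched.
centered = StandardScaler(with_std=False).fit_transform(X)

# Scaling (standardising): subtract the mean and divide by the standard deviation.
standardised = StandardScaler().fit_transform(X)

# Normalising (min-max): rescale each column to the range 0 to 1.
normalised = MinMaxScaler().fit_transform(X)

print(centered)
print(standardised)
print(normalised)
```

Which of the three to use depends on the model: distance-based methods are usually sensitive to column ranges, while tree-based models mostly are not.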

The demo reinforced another belief of mine - data cleaning is a must not just for machine learning modelling but for any kind of data analysis. Even while loading data into a database or datamart you need to clean it at some point. While it is commonly said that a data scientist should spend more time analysing data than cleaning it, that is easier said than done. Existing data, the users feeding the data, and legacy and incompatible systems are not going to change overnight. You may use a number of tools to capture data "properly". But then each system and tool expects its way to be the "proper" way. There are no standards (or very few). In the end it is often left to the analyst or user to determine the cleaning process.

Going back to the PIMA Indian dataset: while discussing how to fill in missing values, participants decided that the value of insulin shouldn't be imputed, insulin being one of the basic parameters required to make a prediction. We decided to drop the records where the field was missing. However, for fields like BMI or BP it was decided that we could replace the missing values with the median value, since the dataset belongs to a very homogeneous population. I agree even this comes with a flaw, since using the median means we are ignoring the factor of age. Similarly, there is no way another field - "diabetes pedigree" - can be guessed from the rest of the fields.
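The two decisions above (drop rows with missing insulin, fill BMI and BP with the median) could look roughly like this in pandas. This is a sketch, not the workshop notebook: the column names and values are made up, and missing values are assumed to be already marked as NaN.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; names and values are illustrative, not the real dataset.
df = pd.DataFrame({
    "Insulin":       [94.0, np.nan, 168.0, 88.0, np.nan, 543.0],
    "BMI":           [28.1, 26.6, np.nan, 31.0, 43.1, 30.5],
    "BloodPressure": [66.0, 70.0, 64.0, np.nan, 40.0, 50.0],
})

# Insulin is not imputed: drop every record where it is missing.
cleaned = df.dropna(subset=["Insulin"]).copy()

# BMI and blood pressure: fill gaps with the column median,
# computed on the rows that survived the drop above.
for col in ["BMI", "BloodPressure"]:
    cleaned[col] = cleaned[col].fillna(cleaned[col].median())

print(cleaned)
```

Computing the median within age bands instead of over the whole column would be one way to address the age flaw mentioned above.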

After the workshop I could say one thing for certain - functional expertise matters in predictive analysis. Someone with a medical background was in a much better position to judge what kind of data cleaning to do. I had people who had spent over a decade analysing medical data, and sometimes even they were not sure how to proceed with our sample dataset. Machines may take over all of our jobs some day, but they can't replace the human thought process of "gut feeling".

I will post the machine learning model we finalised during the workshop. It should come in handy for some of you.


More articles by Virendra Pratap Singh
