DATA IMPUTATION

Data is the most basic element of any analysis, and raw data almost always contains missing values, recorded as blanks, NA, NaN, and so on. We therefore need algorithms to handle them.

Imputation is the process of replacing missing data with substituted values. When substituting for a whole data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation". Missing data causes three main problems: it can introduce a substantial amount of bias, make handling and analysis of the data more arduous, and reduce efficiency.

Because missing data can create problems for analyzing data, imputation is seen as a way to avoid the pitfalls of listwise deletion of cases that have missing values. That is to say, when one or more values are missing for a case, most statistical packages default to discarding any case that has a missing value, which may introduce bias or affect the representativeness of the results. Imputation preserves all cases by replacing missing data with an estimated value based on other available information. Once all missing values have been imputed, the data set can be analyzed using standard techniques for complete data. Scientists have embraced many approaches to account for missing data, but most of them introduce bias. A few of the well-known attempts to deal with missing data include: hot deck and cold deck imputation; listwise and pairwise deletion; mean imputation; non-negative matrix factorization; regression imputation; last observation carried forward; stochastic imputation; and multiple imputation.

Let’s take these up one by one:-

1) Mean/Median Imputation:-

The simplest methods to impute missing values fill in a constant, the mean of the variable, or another basic statistic such as the median or mode. This is easy to implement and suitable for smaller datasets. Here, the null values in a particular column are replaced by the mean/median of that column's non-missing values.
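A minimal sketch of mean/median imputation with pandas (the column names and values are illustrative):

```python
import pandas as pd

# Toy dataset with missing values (NaN)
df = pd.DataFrame({"age": [25.0, None, 40.0, 35.0, None],
                   "salary": [50000.0, 62000.0, None, 58000.0, 61000.0]})

# Replace NaNs in each column with that column's mean (or median)
df["age"] = df["age"].fillna(df["age"].mean())
df["salary"] = df["salary"].fillna(df["salary"].median())
```

After this, every NaN in `age` holds the mean of the observed ages, and the NaN in `salary` holds the median of the observed salaries.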

2) Mode-Imputation:-

Mode imputation (or mode substitution) replaces missing values of a categorical variable with the mode of the non-missing cases of that variable. This method is easy to implement but doesn’t consider correlations with other features.
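A short sketch of mode imputation for a categorical column (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", None, "red", None, "blue", "red"]})

# mode() can return several values on ties; take the first
df["color"] = df["color"].fillna(df["color"].mode()[0])
```

Both missing entries become "red", the most frequent observed category.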

3) Zero/Constant Imputation:-

Zero or constant imputation, as the name suggests, replaces the missing values with either zero or any constant value you specify. The chosen constant effectively acts as its own "missing" category.
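A minimal sketch of zero and constant imputation (column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"visits": [3.0, None, 7.0, None],
                   "segment": ["a", None, "b", "a"]})

df["visits"] = df["visits"].fillna(0)            # zero imputation
df["segment"] = df["segment"].fillna("missing")  # constant sentinel category
```

The numeric gaps become 0, and the categorical gaps become the sentinel value "missing".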

4) Hot-Deck Imputation:-

Pros: Uses existing data.

Cons: Multivariable relationships are distorted.

Handles: MCAR and MAR Item Non-Response.

This is another simple method, where missing values are replaced with randomly drawn observed values from the same column. While this has the advantage of simplicity, be extra careful if you’re trying to examine the nature of the features and how they relate to each other, since multivariable relationships will be distorted.
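A sketch of random hot-deck imputation, where each missing entry receives a randomly drawn observed value from the same column (the seed is fixed only for reproducibility):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": [40.0, None, 55.0, None, 60.0, 48.0]})

observed = df["income"].dropna().to_numpy()
mask = df["income"].isna()
# Each missing entry gets a value sampled from the observed donors
df.loc[mask, "income"] = rng.choice(observed, size=mask.sum())
```

Every imputed value is guaranteed to be a real observed value, which keeps the marginal distribution plausible even though cross-feature relationships are not preserved.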

5) Imputation Using k-NN:-

The k-nearest neighbors algorithm is commonly used for simple classification. It uses ‘feature similarity’ to predict the values of new data points: a new point is assigned a value based on how closely it resembles the points in the training set. This can be very useful for imputation: find the k closest neighbors of the observation with missing data and impute the gaps from the non-missing values in that neighborhood. The Impyute library provides a simple and easy way to use k-NN for imputation.
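The article mentions Impyute; as an assumption, the same idea can be sketched with scikit-learn's `KNNImputer`, which fills each NaN with the mean of that feature over the k nearest rows (distances computed on the non-missing features):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Second feature is roughly 2x the first; one value is missing
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])

# k=2: the NaN is replaced by the mean of its two nearest rows' values
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

Here the row `[2.0, NaN]` is closest to `[1.0, 2.0]` and `[3.0, 6.0]`, so the gap is filled with their mean, 4.0.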

6) Imputation Using Multivariate Imputation by Chained Equation (MICE):-

This type of imputation works by filling in the missing data multiple times. Multiple imputations (MIs) are much better than a single imputation because they measure the uncertainty of the missing values more faithfully. The chained-equations approach is also very flexible: it can handle variables of different data types (i.e., continuous or binary) as well as complexities such as bounds or survey skip patterns.
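As an assumption, the chained-equations idea can be sketched with scikit-learn's `IterativeImputer` (an experimental MICE-style imputer that regresses each column on the others in rounds); the data below is illustrative, with the second column roughly twice the first:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.1],
              [2.0, np.nan],
              [3.0, 5.9],
              [np.nan, 8.0],
              [5.0, 10.1]])

# Each column is iteratively regressed on the others (chained equations)
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```

Because the columns are strongly related, the regressions recover values close to the underlying pattern (about 4.0 in both gaps), rather than the column mean.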

7) Imputation Using Deep Learning (DataWig):-

This method works very well with categorical and other non-numerical features. DataWig is a library that trains machine learning models using deep neural networks to impute missing values in a data frame. It supports both CPU and GPU for training.

Pros:

  • Quite accurate compared to other methods.
  • It has some functions that can handle categorical data (Feature Encoder).
  • It supports CPUs and GPUs.

Cons:

  • Single Column imputation.
  • Can be quite slow with large datasets.
  • You have to specify the columns that contain information about the target column that will be imputed.

Other Imputation Methods:


§ Stochastic regression imputation:

It is quite similar to regression imputation: the missing value is predicted by regressing it on other related variables in the same dataset, and a random residual is then added to the prediction.
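A sketch of stochastic regression imputation on made-up data: fit a least-squares line on the observed pairs, predict the missing values, and add noise drawn from the residual distribution (the seed is fixed only for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(42)

# y is roughly 2x; two y values are missing
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.9, np.nan, 8.1, np.nan, 12.0])

obs = ~np.isnan(y)
# Ordinary least-squares fit on the observed pairs
slope, intercept = np.polyfit(x[obs], y[obs], 1)
residual_sd = np.std(y[obs] - (slope * x[obs] + intercept))

# Deterministic regression prediction plus a random residual
y_imputed = y.copy()
miss = np.isnan(y)
y_imputed[miss] = slope * x[miss] + intercept + rng.normal(0, residual_sd, miss.sum())
```

Adding the random residual preserves the natural scatter around the regression line, which plain regression imputation would artificially remove.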

§ Extrapolation and Interpolation:

Interpolation estimates a missing value from other observations within the range of a discrete set of known data points, while extrapolation estimates values beyond that range.
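A minimal sketch of linear interpolation with pandas, assuming the series is ordered (e.g., a time series):

```python
import pandas as pd

s = pd.Series([10.0, None, None, 16.0, 18.0])

# Linear interpolation fills each gap on the line between known neighbours
filled = s.interpolate(method="linear")
```

The two gaps between 10 and 16 are filled with the evenly spaced values 12 and 14.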

In practice, there is no perfect way to compensate for missing values in a dataset. Each strategy can perform well for certain datasets and missing-data types but much worse on others. There are some rules of thumb for deciding which strategy suits a particular type of missingness, but beyond that, you should experiment and check which approach works best for your dataset.


