DATA IMPUTATION
‘Imputation’ is the process of replacing missing data with substituted values. When substituting for an entire data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation". Missing data causes three main problems: it can introduce a substantial amount of bias, make the handling and analysis of the data more arduous, and reduce efficiency.
Missing data (or missing values) are data values that are not stored for a variable in the observation of interest. Missing data are common in almost all research and can have a significant effect on the conclusions that can be drawn from the data.
Simple Data Imputation
‘Simple data imputation’ imputes one value for each missing item. According to Little and Rubin [2019], simple imputations can be defined as averages or draws from a predictive distribution of the missing values. They require a method for creating that predictive distribution from the observed data, and they define two generic approaches for generating it: explicit modeling and implicit modeling.
Explicit modeling
In explicit modeling, the predictive distribution is based on a formal statistical model (for example, a multivariate normal distribution), so the assumptions are explicit. Examples of explicit modeling are average imputation, regression imputation, and stochastic regression imputation.
Implicit modeling
In implicit modeling, the focus is on an algorithm, which implies an underlying model. The assumptions are implicit, but they still need to be carefully evaluated to ensure they are reasonable. Examples of implicit modeling include hot deck imputation, imputation by replacement, and cold deck imputation.
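To make the hot deck idea concrete, here is a small sketch (a hypothetical helper, not taken from any of the cited works) that fills each missing value with a randomly drawn observed value, a "donor", from the same column:

```python
import numpy as np
import pandas as pd

def random_hot_deck(series, rng=None):
    """Fill each missing entry with a value drawn at random
    from the observed (non-missing) entries of the same column."""
    rng = np.random.default_rng(rng)
    result = series.copy()
    donors = series.dropna().to_numpy()        # pool of observed "donor" values
    n_missing = result.isna().sum()
    result[result.isna()] = rng.choice(donors, size=n_missing)
    return result

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
filled = random_hot_deck(s, rng=0)
```

Because every imputed value is an actually observed value, hot deck imputation never produces impossible numbers, though it ignores relationships between columns.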
Multiple Data Imputation
Single imputations replace an unknown missing value by a single value and then treat it as if it were a true value [Rubin, 1988]. As a result, single imputation ignores uncertainty and almost always underestimates the variance. Multiple imputations overcome this problem, by taking into account both within-imputation uncertainty and between-imputation uncertainty.
Composite Data Imputation
Proposed by Soares [2007], composite imputation represents a class of imputation techniques that combine the execution of one or more tasks from the KDD (Knowledge Discovery in Databases) process before predicting the value to be imputed: for example, running a clustering algorithm such as k-means and/or a feature selection algorithm such as PCA, and then applying a machine learning algorithm to predict the new value. This technique can be used in the context of single or multiple imputation. Soares [2007] also introduces the concept of a missing-data imputation committee, which uses a statistical method to select, among all predictions, the most plausible value.
Cascading Data Imputation
Proposed by Ferlin [2008], cascading imputation takes previous imputations into account when performing the next one: groups of data that have already been completed are reused to impute later groups in a cascade. This divide-and-conquer approach is intended to simplify the imputation process and improve the quality of the imputed data.
Missing values can be imputed with the following methods:
· Use Mean imputation
· Use Median imputation
· Use Most-Frequent imputation
· Use K-Nearest Neighbor imputation
· Use Logistic Regression imputation
· Use Deep Learning imputation
The sections below walk through code for each of these approaches.
INSTALLING ALL THE LIBRARIES
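The examples in this post rely on NumPy, pandas, and scikit-learn (install with `pip install numpy pandas scikit-learn`). The shared imports look like this; note that scikit-learn's `IterativeImputer` is still experimental and must be enabled explicitly:

```python
# Core libraries used throughout the examples below.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# IterativeImputer (used for MICE-style imputation) is experimental
# and must be enabled before it can be imported.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
```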
- We collected data from 100 participants holding the device while performing a bicep curl, as shown in the figure below.
- Our task is to merge the data from the 100 files into one DataFrame.
- Columns to keep (by position): 0, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
- Corresponding column names: 'Time','Accelerometer_X','Accelerometer_Y','Accelerometer_Z','Gyroscope_X',
'Gyroscope_Y','Gyroscope_Z','Magnetometer_X','Magnetometer_Y','Magnetometer_Z', 'Velocitemeter_X','Velocitemeter_Y','Velocitemeter_Z','LoadcellA','Quaternion_1', 'Quaternion_2','Quaternion_3','Quaternion_4'
- File suffix: fitmi_XX.txt
- Total number of files in "data_exploration" folder: 100
- Value separator: TAB
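Based on the file layout described above, the merge could look like the sketch below. The `glob` pattern, the sorting, and the added `participant` column (to track which file each row came from) are assumptions, not part of the original description:

```python
import pandas as pd
from pathlib import Path

# Column positions to keep and the names to assign to them
# (taken from the list above).
USECOLS = [0, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
NAMES = ['Time', 'Accelerometer_X', 'Accelerometer_Y', 'Accelerometer_Z',
         'Gyroscope_X', 'Gyroscope_Y', 'Gyroscope_Z',
         'Magnetometer_X', 'Magnetometer_Y', 'Magnetometer_Z',
         'Velocitemeter_X', 'Velocitemeter_Y', 'Velocitemeter_Z',
         'LoadcellA', 'Quaternion_1', 'Quaternion_2',
         'Quaternion_3', 'Quaternion_4']

def load_all(folder='data_exploration'):
    """Read every fitmi_*.txt file in the folder and stack the rows."""
    frames = []
    for path in sorted(Path(folder).glob('fitmi_*.txt')):
        df = pd.read_csv(path, sep='\t', header=None, usecols=USECOLS)
        df.columns = NAMES
        df['participant'] = path.stem      # track the source file (assumption)
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```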
DIFFERENT WAYS TO EXTRACT COLUMNS
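pandas offers several equivalent ways to pull columns out of a DataFrame; the tiny frame below is illustrative, not the actual sensor data:

```python
import pandas as pd

df = pd.DataFrame({'Time': [0.0, 0.1],
                   'Accelerometer_X': [1.2, 1.3],
                   'Accelerometer_Y': [0.4, 0.5]})

a = df['Accelerometer_X']            # single column by name -> Series
b = df.Accelerometer_X               # attribute access (name must be a valid identifier)
c = df[['Time', 'Accelerometer_X']]  # list of names -> DataFrame
d = df.loc[:, 'Accelerometer_X']     # label-based indexing
e = df.iloc[:, 1]                    # position-based indexing
f = df.filter(like='Accelerometer')  # all columns whose name contains a substring
```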
ORIGINAL PLOTS
Compare the results of each imputation method with the original distribution.
Mean Imputation
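A minimal mean-imputation example with scikit-learn's `SimpleImputer`; the array is a toy stand-in for the real data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan]])

# Replace each NaN with the mean of its column:
# column 0 -> mean(1, 7) = 4, column 1 -> mean(2, 4) = 3.
imputer = SimpleImputer(strategy='mean')
X_mean = imputer.fit_transform(X)
```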
Median Imputation
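Median imputation uses the same `SimpleImputer` with a different strategy; unlike the mean, the median is robust to outliers, as the toy array below shows:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [100.0, np.nan],
              [3.0, 6.0]])

# The outlier 100.0 barely affects the fill value:
# column 0 -> median(1, 100, 3) = 3, column 1 -> median(2, 4, 6) = 4.
X_med = SimpleImputer(strategy='median').fit_transform(X)
```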
KNN Imputation
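`KNNImputer` replaces each NaN with the mean of that feature over the k nearest rows, where nearness is computed from the features that are observed. A toy illustration:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 4.0],
              [8.0, 9.0]])

# Row 1 is missing its second value. Its two nearest rows by the
# observed feature are rows 0 and 2, so the fill is mean(2, 4) = 3.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```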
MICE Imputation
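scikit-learn's `IterativeImputer` is its take on MICE-style chained equations: each feature with missing values is modeled as a function of the other features, cycling until convergence. The sketch below uses synthetic data in which the third column is an exact linear function of the first two, so the imputer can recover the missing values almost perfectly:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
X[:, 2] = 2 * X[:, 0] + 0.5 * X[:, 1]   # column predictable from the others
X_missing = X.copy()
X_missing[::10, 2] = np.nan             # knock out every 10th value

# Each missing cell is predicted from the other columns by a
# regression model (BayesianRidge by default), iteratively refined.
X_imp = IterativeImputer(random_state=0).fit_transform(X_missing)
```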
Imputation through regression
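Regression imputation fits a model on the complete cases and predicts the missing values from the observed predictors. A minimal sketch with synthetic data (the linear relation `y = 3x + 1` is purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3 * x + 1
y[:10] = np.nan                     # pretend the first ten targets are missing

# Fit on the complete cases, then predict the missing values.
obs = ~np.isnan(y)
model = LinearRegression().fit(x[obs].reshape(-1, 1), y[obs])
y_filled = y.copy()
y_filled[~obs] = model.predict(x[~obs].reshape(-1, 1))
```

Plain regression imputation fills in values that lie exactly on the fitted line, which understates variability; stochastic regression imputation (mentioned under explicit modeling above) adds random residual noise to compensate.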
Deep Learning Imputation
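Deep-learning imputers are usually autoencoders trained in a framework such as TensorFlow or PyTorch: the network learns to reconstruct its input, and its reconstruction fills the missing cells. To keep the example self-contained, the sketch below trains a minimal single-hidden-layer linear autoencoder in plain NumPy on synthetic low-rank data; every name and the data itself are illustrative assumptions, not the author's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic low-rank data: four correlated channels driven by two factors.
Z = rng.normal(size=(200, 2))
A = rng.normal(size=(2, 4))
X_true = Z @ A

# Remove roughly 10% of the entries at random.
mask = rng.random(X_true.shape) < 0.1          # True = missing
X_obs = np.where(mask, np.nan, X_true)

# Start from mean imputation, then train the autoencoder to
# reconstruct the observed cells; its output fills the missing ones.
col_mean = np.nanmean(X_obs, axis=0)
X_fill = np.where(mask, col_mean, X_obs)
observed = ~mask

n, d, k = 200, 4, 2                            # samples, features, hidden units
W1 = 0.1 * rng.normal(size=(d, k))             # encoder weights
W2 = 0.1 * rng.normal(size=(k, d))             # decoder weights
lr = 0.05
for _ in range(500):
    H = X_fill @ W1                            # encode
    R = (H @ W2 - X_fill) * observed           # error on observed cells only
    G2 = (H.T @ R) / n                         # gradients of the squared loss
    G1 = (X_fill.T @ (R @ W2.T)) / n
    W1 -= lr * G1
    W2 -= lr * G2

X_deep = np.where(mask, X_fill @ W1 @ W2, X_obs)
```

Because the data is (by construction) rank two, the two-unit bottleneck can capture the cross-channel structure that mean imputation ignores, so the autoencoder's fills land much closer to the true values.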
Thanks for reading.