DATA IMPUTATION
‘Imputation’ is the process of replacing missing data with substituted values. When substituting for an entire data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation". Missing data causes three main problems: it can introduce a substantial amount of bias, make the handling and analysis of the data more arduous, and reduce efficiency.
Missing data (or missing values) are data values that are not stored for a variable in the observation of interest. Missing data are common in almost all research and can have a significant effect on the conclusions that can be drawn from the data.
Simple Data Imputation
‘Simple data imputation’ imputes one value for each missing item. According to Little and Rubin [2019], simple imputations can be defined as averages or draws from a predictive distribution of the missing values. They require a method for creating that predictive distribution from the observed data, and they define two generic approaches for generating it: explicit modeling and implicit modeling.
Explicit modeling
In explicit modeling, the predictive distribution is based on a formal statistical model (for example, a multivariate normal distribution), so the assumptions are explicit. Examples of explicit modeling are average imputation, regression imputation, and stochastic regression imputation.
Implicit modeling
In implicit modeling, the focus is on an algorithm, which implies an underlying model. The assumptions are implicit, but they still need to be carefully evaluated to ensure they are reasonable. Examples of implicit modeling include hot deck imputation, imputation by replacement, and cold deck imputation.
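To make the hot deck idea concrete, here is a small sketch (a hypothetical helper, not taken from any of the cited works) that fills each missing value with a randomly drawn observed value, a "donor", from the same column:

```python
import numpy as np
import pandas as pd

def random_hot_deck(series, rng=None):
    """Fill each missing entry with a value drawn at random
    from the observed (non-missing) entries of the same column."""
    rng = np.random.default_rng(rng)
    result = series.copy()
    donors = series.dropna().to_numpy()        # pool of observed "donor" values
    n_missing = result.isna().sum()
    result[result.isna()] = rng.choice(donors, size=n_missing)
    return result

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
filled = random_hot_deck(s, rng=0)
```

Because every imputed value is an actually observed value, hot deck imputation never produces impossible numbers, though it ignores relationships between columns.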
Multiple Data Imputation
Single imputations replace an unknown missing value by a single value and then treat it as if it were a true value [Rubin, 1988]. As a result, single imputation ignores uncertainty and almost always underestimates the variance. Multiple imputations overcome this problem, by taking into account both within-imputation uncertainty and between-imputation uncertainty.
Composite Data Imputation
Proposed by Soares [2007], composite imputation represents a class of imputation techniques that combine the execution of one or more tasks from the KDD (Knowledge Discovery in Databases) process before predicting the value to be imputed: for example, running a clustering algorithm such as k-means and/or a feature selection algorithm such as PCA, and then applying a machine learning algorithm to predict the new value. This technique can be used in the context of single or multiple imputation. Soares [2007] also introduces the concept of a missing-data imputation committee, which uses a statistical method to select, among all predictions, the most plausible value.
Cascading Data Imputation
Proposed by Ferlin [2008], cascading imputation takes previous imputations into account when performing the next one: groups of data that have already been completed are reused to impute later groups in a cascade. This divide-and-conquer approach is intended to simplify the imputation process and improve the quality of the imputed data.
Missing values can be imputed with the following methods:
· Use Mean imputation
· Use Median imputation
· Use Most-Frequent imputation
· Use K-Nearest Neighbor imputation
· Use Logistic Regression imputation
· Use Deep Learning imputation
The sections below walk through code for each of these approaches.
INSTALLING ALL THE LIBRARIES
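The examples in this post rely on NumPy, pandas, and scikit-learn (install with `pip install numpy pandas scikit-learn`). The shared imports look like this; note that scikit-learn's `IterativeImputer` is still experimental and must be enabled explicitly:

```python
# Core libraries used throughout the examples below.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# IterativeImputer (used for MICE-style imputation) is experimental
# and must be enabled before it can be imported.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
```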
- We collected data from 100 participants holding the device while performing a bicep curl, as shown in the figure below.
- Our task is to merge the data from the 100 files into one DataFrame.
- Columns to keep (by position): 0, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
- Corresponding column names: 'Time','Accelerometer_X','Accelerometer_Y','Accelerometer_Z','Gyroscope_X',
'Gyroscope_Y','Gyroscope_Z','Magnetometer_X','Magnetometer_Y','Magnetometer_Z', 'Velocitemeter_X','Velocitemeter_Y','Velocitemeter_Z','LoadcellA','Quaternion_1', 'Quaternion_2','Quaternion_3','Quaternion_4'
- File suffix: fitmi_XX.txt
- Total number of files in "data_exploration" folder: 100
- Value separator: TAB
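Based on the file layout described above, the merge could look like the sketch below. The `glob` pattern, the sorting, and the added `participant` column (to track which file each row came from) are assumptions, not part of the original description:

```python
import pandas as pd
from pathlib import Path

# Column positions to keep and the names to assign to them
# (taken from the list above).
USECOLS = [0, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
NAMES = ['Time', 'Accelerometer_X', 'Accelerometer_Y', 'Accelerometer_Z',
         'Gyroscope_X', 'Gyroscope_Y', 'Gyroscope_Z',
         'Magnetometer_X', 'Magnetometer_Y', 'Magnetometer_Z',
         'Velocitemeter_X', 'Velocitemeter_Y', 'Velocitemeter_Z',
         'LoadcellA', 'Quaternion_1', 'Quaternion_2',
         'Quaternion_3', 'Quaternion_4']

def load_all(folder='data_exploration'):
    """Read every fitmi_*.txt file in the folder and stack the rows."""
    frames = []
    for path in sorted(Path(folder).glob('fitmi_*.txt')):
        df = pd.read_csv(path, sep='\t', header=None, usecols=USECOLS)
        df.columns = NAMES
        df['participant'] = path.stem      # track the source file (assumption)
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```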
DIFFERENT WAYS TO EXTRACT COLUMNS
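pandas offers several equivalent ways to pull columns out of a DataFrame; the tiny frame below is illustrative, not the actual sensor data:

```python
import pandas as pd

df = pd.DataFrame({'Time': [0.0, 0.1],
                   'Accelerometer_X': [1.2, 1.3],
                   'Accelerometer_Y': [0.4, 0.5]})

a = df['Accelerometer_X']            # single column by name -> Series
b = df.Accelerometer_X               # attribute access (name must be a valid identifier)
c = df[['Time', 'Accelerometer_X']]  # list of names -> DataFrame
d = df.loc[:, 'Accelerometer_X']     # label-based indexing
e = df.iloc[:, 1]                    # position-based indexing
f = df.filter(like='Accelerometer')  # all columns whose name contains a substring
```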
ORIGINAL PLOTS
Compare the results of each imputation method with the original distribution.
Mean Imputation
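A minimal mean-imputation example with scikit-learn's `SimpleImputer`; the array is a toy stand-in for the real data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan]])

# Replace each NaN with the mean of its column:
# column 0 -> mean(1, 7) = 4, column 1 -> mean(2, 4) = 3.
imputer = SimpleImputer(strategy='mean')
X_mean = imputer.fit_transform(X)
```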
Median Imputation
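Median imputation uses the same `SimpleImputer` with a different strategy; unlike the mean, the median is robust to outliers, as the toy array below shows:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [100.0, np.nan],
              [3.0, 6.0]])

# The outlier 100.0 barely affects the fill value:
# column 0 -> median(1, 100, 3) = 3, column 1 -> median(2, 4, 6) = 4.
X_med = SimpleImputer(strategy='median').fit_transform(X)
```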
KNN Imputation
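`KNNImputer` replaces each NaN with the mean of that feature over the k nearest rows, where nearness is computed from the features that are observed. A toy illustration:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 4.0],
              [8.0, 9.0]])

# Row 1 is missing its second value. Its two nearest rows by the
# observed feature are rows 0 and 2, so the fill is mean(2, 4) = 3.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```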
MICE Imputation
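scikit-learn's `IterativeImputer` is its take on MICE-style chained equations: each feature with missing values is modeled as a function of the other features, cycling until convergence. The sketch below uses synthetic data in which the third column is an exact linear function of the first two, so the imputer can recover the missing values almost perfectly:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
X[:, 2] = 2 * X[:, 0] + 0.5 * X[:, 1]   # column predictable from the others
X_missing = X.copy()
X_missing[::10, 2] = np.nan             # knock out every 10th value

# Each missing cell is predicted from the other columns by a
# regression model (BayesianRidge by default), iteratively refined.
X_imp = IterativeImputer(random_state=0).fit_transform(X_missing)
```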
Imputation through regression
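Regression imputation fits a model on the complete cases and predicts the missing values from the observed predictors. A minimal sketch with synthetic data (the linear relation `y = 3x + 1` is purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3 * x + 1
y[:10] = np.nan                     # pretend the first ten targets are missing

# Fit on the complete cases, then predict the missing values.
obs = ~np.isnan(y)
model = LinearRegression().fit(x[obs].reshape(-1, 1), y[obs])
y_filled = y.copy()
y_filled[~obs] = model.predict(x[~obs].reshape(-1, 1))
```

Plain regression imputation fills in values that lie exactly on the fitted line, which understates variability; stochastic regression imputation (mentioned under explicit modeling above) adds random residual noise to compensate.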
Deep Learning Imputation
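Deep-learning imputers are usually autoencoders trained in a framework such as TensorFlow or PyTorch: the network learns to reconstruct its input, and its reconstruction fills the missing cells. To keep the example self-contained, the sketch below trains a minimal single-hidden-layer linear autoencoder in plain NumPy on synthetic low-rank data; every name and the data itself are illustrative assumptions, not the author's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic low-rank data: four correlated channels driven by two factors.
Z = rng.normal(size=(200, 2))
A = rng.normal(size=(2, 4))
X_true = Z @ A

# Remove roughly 10% of the entries at random.
mask = rng.random(X_true.shape) < 0.1          # True = missing
X_obs = np.where(mask, np.nan, X_true)

# Start from mean imputation, then train the autoencoder to
# reconstruct the observed cells; its output fills the missing ones.
col_mean = np.nanmean(X_obs, axis=0)
X_fill = np.where(mask, col_mean, X_obs)
observed = ~mask

n, d, k = 200, 4, 2                            # samples, features, hidden units
W1 = 0.1 * rng.normal(size=(d, k))             # encoder weights
W2 = 0.1 * rng.normal(size=(k, d))             # decoder weights
lr = 0.05
for _ in range(500):
    H = X_fill @ W1                            # encode
    R = (H @ W2 - X_fill) * observed           # error on observed cells only
    G2 = (H.T @ R) / n                         # gradients of the squared loss
    G1 = (X_fill.T @ (R @ W2.T)) / n
    W1 -= lr * G1
    W2 -= lr * G2

X_deep = np.where(mask, X_fill @ W1 @ W2, X_obs)
```

Because the data is (by construction) rank two, the two-unit bottleneck can capture the cross-channel structure that mean imputation ignores, so the autoencoder's fills land much closer to the true values.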
Thanks for reading.