Effective Imputation Techniques for Handling Null Values in Data Cleaning
1. Deletion Methods
When to use: When the proportion of missing data is very small and the missing values appear to be randomly distributed (Missing Completely at Random, MCAR).
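As a minimal sketch of deletion, the snippet below first measures how much data is missing and then drops incomplete rows or sparse columns. The sample values and the 0.5 threshold are assumptions chosen for illustration, not recommendations.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0, 5.0],
    "B": [np.nan, 2.0, 3.0, 4.0, 5.0],
})

# Share of missing values per column, to judge whether deletion is acceptable
missing_share = df.isna().mean()

# Listwise deletion: drop every row that contains at least one missing value
rows_dropped = df.dropna()

# Column deletion: keep only columns whose missing share is within a threshold
threshold = 0.5  # assumed cut-off, tune per dataset
cols_kept = df.loc[:, missing_share <= threshold]

print(missing_share)
print(rows_dropped)
```

Checking the missing share first matters because listwise deletion removes a whole row for a single missing cell, so the loss of data can be larger than the per-column figures suggest.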
2. Mean/Median/Mode Imputation
Mean Imputation
When to use: Suitable for numerical data where missing values are assumed to be random and the distribution is approximately symmetric.
Median Imputation
When to use: Preferred for numerical data with outliers or skewed distributions, since the median is less sensitive to extreme values than the mean.
Mode Imputation
When to use: Appropriate for categorical data or when the mode is representative of the missing values.
3. Forward Fill / Backward Fill
When to use: Time series data where missing values can be reasonably assumed to be similar to the previous or next observations.
4. Interpolation
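Linear Interpolation
When to use: Time series or sequential data where a linear relationship between neighboring observations is assumed.
Non-linear Interpolation (e.g., polynomial or spline)
When to use: When the relationship between observations is assumed to be more complex than a straight line.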
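To illustrate the difference between a linear and a more complex assumption, here is a small sketch on a made-up series that follows y = x**2. The non-linear variant uses a least-squares quadratic fit with NumPy as a stand-in; pandas also offers `interpolate(method='polynomial', order=...)`, which relies on SciPy being installed.

```python
import numpy as np
import pandas as pd

# Hypothetical series following y = x**2, with two interior gaps
s = pd.Series([0.0, 1.0, np.nan, 9.0, np.nan, 25.0])

# Linear interpolation draws a straight line between the neighboring points
linear = s.interpolate(method="linear")

# Quadratic least-squares fit on the observed points,
# evaluated only at the missing positions
observed = s.notna()
x = s.index.to_numpy(dtype=float)
coeffs = np.polyfit(x[observed], s[observed].to_numpy(), deg=2)
poly = s.copy()
poly[~observed] = np.polyval(coeffs, x[~observed])

print(pd.DataFrame({"original": s, "linear": linear, "poly": poly}))
```

On this curved data the linear fill overshoots the gaps, while the quadratic fit recovers values close to the underlying curve. With real data the degree of the polynomial is itself an assumption and should be checked against the observed points.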
5. K-Nearest Neighbors (KNN) Imputation
When to use: When there is a strong relationship between variables. Suitable for numerical data and, after encoding, for categorical data (scikit-learn's KNNImputer expects numeric input).
6. Multivariate Imputation by Chained Equations (MICE)
When to use: For complex datasets where missing values depend on multiple other variables.
7. Using Model-Based Methods
When to use: When there are strong predictors for the missing values within the dataset.
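One possible way to sketch a model-based approach is to train a regression model on the rows where the target is observed and predict it where it is missing. The column names, the values, and the choice of `LinearRegression` are all assumptions made for illustration; any suitable model could take its place.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: 'income' has gaps, the two predictors are complete
df = pd.DataFrame({
    "age": [25, 32, 41, 50, 29, 38],
    "years_employed": [2, 8, 15, 25, 5, 12],
    "income": [30.0, 45.0, np.nan, 90.0, 38.0, np.nan],
})

predictors = ["age", "years_employed"]
known = df["income"].notna()

# Fit only on rows where the target is observed
model = LinearRegression().fit(df.loc[known, predictors], df.loc[known, "income"])

# Predict the target for the rows where it is missing
df["income_model_imputed"] = df["income"]
df.loc[~known, "income_model_imputed"] = model.predict(df.loc[~known, predictors])

print(df)
```

This sketch requires the predictors themselves to be complete for the rows being imputed. When several columns have gaps at once, an iterative approach such as MICE is usually a better fit.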
8. Domain-Specific Imputation
When to use: When there is specific knowledge about the dataset that can inform the imputation strategy (e.g., filling missing ages in a medical dataset based on other health indicators).
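A simple sketch of a domain-informed rule is to fill a missing age with the median age of patients who share the same condition. The records, the column names, and the rule itself are hypothetical; in practice the rule should come from someone who knows the data.

```python
import numpy as np
import pandas as pd

# Hypothetical medical records; column names and values are made up
patients = pd.DataFrame({
    "condition": ["pediatric asthma", "osteoporosis", "pediatric asthma",
                  "osteoporosis", "osteoporosis", "pediatric asthma"],
    "age": [8.0, 71.0, np.nan, 66.0, np.nan, 10.0],
})

# Fill each missing age with the median age of patients sharing the condition
group_median = patients.groupby("condition")["age"].transform("median")
patients["age_imputed"] = patients["age"].fillna(group_median)

print(patients)
```

Compared with a single global median, the group-wise fill keeps imputed ages plausible for each condition, which is the kind of knowledge a purely statistical method would not have.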
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
# IterativeImputer is experimental: this enabling import must come first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
# Sample DataFrame with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': ['cat', 'dog', np.nan, 'mouse', 'cat']
})
# Mean Imputation
df['A_mean_imputed'] = df['A'].fillna(df['A'].mean())
# Median Imputation
df['A_median_imputed'] = df['A'].fillna(df['A'].median())
# Mode Imputation
df['C_mode_imputed'] = df['C'].fillna(df['C'].mode()[0])
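# Forward Fill (fillna(method=...) is deprecated in recent pandas; use ffill())
# The leading NaN in B stays NaN because no earlier value exists
df['B_ffill'] = df['B'].ffill()
# Backward Fill (use bfill() for the same reason)
df['B_bfill'] = df['B'].bfill()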
# Linear Interpolation
df['A_interpolated'] = df['A'].interpolate(method='linear')
# KNN Imputation
knn_imputer = KNNImputer(n_neighbors=2)
df[['A_knn', 'B_knn']] = knn_imputer.fit_transform(df[['A', 'B']])
# MICE Imputation
mice_imputer = IterativeImputer()
df[['A_mice', 'B_mice']] = mice_imputer.fit_transform(df[['A', 'B']])
print(df)