Effective Imputation Techniques for Handling Null Values in Data Cleaning

1. Deletion Methods

  • Complete Case Deletion (Listwise Deletion): Remove every row that contains at least one missing value. A column can also be dropped when most of its values are missing.

When to use: When the proportion of missing data is very small and the missing values appear to be randomly distributed (Missing Completely at Random, MCAR).
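As a minimal sketch of listwise deletion in pandas, the snippet below first checks how much data is missing and then keeps only the complete rows. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one gap in each column
df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0, 5.0],
    "B": [np.nan, 2.0, 3.0, 4.0, 5.0],
})

# Share of missing values per column, to judge whether deletion is affordable
missing_share = df.isna().mean()

# Listwise deletion: keep only the rows that have no missing value at all
complete_cases = df.dropna(axis=0, how="any")

print(missing_share)
print(complete_cases)
```

Even with only one gap per column, two of the five rows are lost here, which is why deletion is best reserved for small, randomly scattered amounts of missing data.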


2. Mean/Median/Mode Imputation

  • Mean Imputation: Replace missing values with the mean of the column.

When to use: Suitable for numerical data where missing values are assumed to be random and the distribution is approximately symmetric.

  • Median Imputation: Replace missing values with the median of the column.

When to use: Preferred for numerical data with outliers or skewed distributions.

  • Mode Imputation: Replace missing values with the mode (most frequent value) of the column.

When to use: Appropriate for categorical data or when the mode is representative of the missing values.


3. Forward Fill / Backward Fill

  • Forward Fill: Propagate the last valid observation forward.
  • Backward Fill: Propagate the next valid observation backward.

When to use: Time series data where missing values can be reasonably assumed to be similar to the previous or next observations.


4. Interpolation

  • Linear Interpolation: Estimate missing values using linear interpolation between surrounding points.

When to use: Time series or sequential data where a linear relationship between observations is assumed.

  • Other Interpolation Methods: Polynomial, spline, etc.

When to use: When a more complex relationship is assumed between the observations.
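The difference between linear and higher-order interpolation can be sketched with pandas as follows. This assumes SciPy is installed, since pandas delegates the 'polynomial' and 'spline' methods to it. The series is a toy example whose known points lie on the curve (x + 1)^2.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical values): the known points follow (x + 1)**2
s = pd.Series([1.0, 4.0, np.nan, 16.0, 25.0])

# Linear interpolation draws a straight line between the two neighbours
linear = s.interpolate(method="linear")

# Polynomial interpolation of order 2 (requires SciPy) follows the curvature
quadratic = s.interpolate(method="polynomial", order=2)

print(pd.DataFrame({"original": s, "linear": linear, "quadratic": quadratic}))
```

The linear method fills the gap with 10, the midpoint of its neighbours, while the quadratic method should land close to 9, the value on the underlying curve. On real data the true curve is unknown, so a higher order is only worth it when there is a reason to expect curvature.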


5. K-Nearest Neighbors (KNN) Imputation

  • KNN Imputation: Use the k-nearest neighbors to impute missing values based on the similarity of other observations.

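When to use: When there is a strong relationship between variables. The idea applies to both numerical and categorical data, although scikit-learn's KNNImputer accepts numeric input only, so categorical columns must be encoded first.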


6. Multivariate Imputation by Chained Equations (MICE)

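  • MICE: Model each variable that has missing values as a function of the other variables, typically with regression models, and cycle through the variables for several iterations so the imputed values can stabilize.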

When to use: For complex datasets where missing values depend on multiple other variables.


7. Model-Based Methods

  • Regression Imputation: Use a regression model to predict and fill missing values.

When to use: When there are strong predictors for the missing values within the dataset.
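One possible sketch of regression imputation uses scikit-learn's LinearRegression: fit the model on rows where the target is known, then predict the target for the rows where it is missing. The columns height_cm and weight_kg are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical columns): weight is missing for one person
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [50.0, 60.0, np.nan, 80.0, 90.0],
})

# Fit the regression only on rows where the target is observed
known = df["weight_kg"].notna()
model = LinearRegression()
model.fit(df.loc[known, ["height_cm"]], df.loc[known, "weight_kg"])

# Predict the target for the rows where it is missing
df["weight_imputed"] = df["weight_kg"]
df.loc[~known, "weight_imputed"] = model.predict(df.loc[~known, ["height_cm"]])

print(df)
```

Because every imputed value falls exactly on the fitted line, this approach tends to understate the natural variability of the data, which is one motivation for iterative methods such as MICE.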


8. Domain-Specific Imputation

  • Custom Imputation: Use domain knowledge to impute missing values.

When to use: When there is specific knowledge about the dataset that can inform the imputation strategy (e.g., filling missing ages in a medical dataset based on other health indicators).
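A simple way to encode domain knowledge is to impute within groups that are known to be homogeneous. The sketch below assumes a hypothetical patient_group column and fills each missing age with the median age of that group. A real medical dataset would call for rules agreed with domain experts.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns and values)
df = pd.DataFrame({
    "patient_group": ["pediatric", "pediatric", "pediatric",
                      "geriatric", "geriatric", "geriatric"],
    "age": [6.0, np.nan, 10.0, 70.0, 80.0, np.nan],
})

# Median age per group, aligned row by row with the original frame
group_median = df.groupby("patient_group")["age"].transform("median")

# Fill each gap with the median of the patient's own group
df["age_imputed"] = df["age"].fillna(group_median)

print(df)
```

A single global median would place both missing patients in the middle of the overall age range, whereas the group-wise fill keeps each imputed age plausible for its ward.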


import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
# IterativeImputer is experimental: this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Sample DataFrame with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': ['cat', 'dog', np.nan, 'mouse', 'cat']
})

# Mean Imputation
df['A_mean_imputed'] = df['A'].fillna(df['A'].mean())

# Median Imputation
df['A_median_imputed'] = df['A'].fillna(df['A'].median())

# Mode Imputation
df['C_mode_imputed'] = df['C'].fillna(df['C'].mode()[0])

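# Forward Fill (the first value of B stays NaN because nothing precedes it)
# fillna(method=...) is deprecated in recent pandas; use ffill()/bfill() instead
df['B_ffill'] = df['B'].ffill()

# Backward Fill
df['B_bfill'] = df['B'].bfill()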

# Linear Interpolation
df['A_interpolated'] = df['A'].interpolate(method='linear')

# KNN Imputation
knn_imputer = KNNImputer(n_neighbors=2)
df[['A_knn', 'B_knn']] = knn_imputer.fit_transform(df[['A', 'B']])

# MICE Imputation
mice_imputer = IterativeImputer()
df[['A_mice', 'B_mice']] = mice_imputer.fit_transform(df[['A', 'B']])

print(df)
        
