Effective Imputation Techniques for Handling Null Values in Data Cleaning

Bhargava Naik Banoth

Published May 24, 2024

1. Deletion Methods

Complete Case Deletion (Listwise Deletion): Remove every row that contains at least one missing value. A column can be dropped instead when most of its values are missing.

When to use: When the proportion of missing data is very small and the values are Missing Completely at Random (MCAR), so that dropping rows does not bias the remaining sample.
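A minimal sketch of listwise deletion with pandas, assuming a small illustrative DataFrame (the column names and values are made up for the example). It checks the share of missing values per column first, since that share is what decides whether deletion is affordable.

```python
import numpy as np
import pandas as pd

# Illustrative data; columns A, B, C are placeholders.
df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0, 5.0],
    "B": [np.nan, 2.0, 3.0, 4.0, 5.0],
    "C": ["cat", "dog", None, "mouse", "cat"],
})

# Fraction of missing values in each column.
missing_share = df.isna().mean()

# Listwise deletion: keep only rows with no missing value in any column.
complete_cases = df.dropna(axis=0, how="any")

print(missing_share)
print(complete_cases)
```

Here each column is missing only one value, yet two of the five rows are lost, which is why deletion gets expensive quickly when the gaps fall in different rows.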

2. Mean/Median/Mode Imputation

Mean Imputation: Replace missing values with the mean of the column.

When to use: Suitable for numerical data where missing values are assumed to be random and the distribution is approximately symmetric.

Median Imputation: Replace missing values with the median of the column.

When to use: Preferred for numerical data with outliers or skewed distributions.

Mode Imputation: Replace missing values with the mode (most frequent value) of the column.

When to use: Appropriate for categorical data, especially when one category clearly dominates and is a plausible stand-in for the missing entries.
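To see why the median is preferred for skewed data, the sketch below compares the two fill values on a small series with one outlier. The numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Five observed values, one of them an extreme outlier, plus one missing entry.
income = pd.Series([30.0, 32.0, 35.0, np.nan, 31.0, 400.0])

mean_fill = income.mean()      # pulled upward by the outlier
median_fill = income.median()  # stays near the bulk of the data

filled_with_mean = income.fillna(mean_fill)
filled_with_median = income.fillna(median_fill)

print(mean_fill, median_fill)
```

The mean of the observed values is 105.6 while the median is 32.0, so the mean-imputed value would sit far away from four of the five real observations.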

3. Forward Fill / Backward Fill

Forward Fill: Propagate the last valid observation forward.
Backward Fill: Propagate the next valid observation backward.

When to use: Time series data where missing values can be reasonably assumed to be similar to the previous or next observations.
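A short sketch of both fills on a daily series, assuming made-up readings. It also uses the `limit` parameter, which caps how many consecutive gaps are filled so that a stale value is not carried across a long outage.

```python
import numpy as np
import pandas as pd

# Illustrative daily readings with gaps.
s = pd.Series(
    [10.0, np.nan, np.nan, 13.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

ffilled = s.ffill()               # carry the last seen value forward
bfilled = s.bfill()               # pull the next seen value backward
ffill_limited = s.ffill(limit=1)  # fill at most one consecutive gap

print(pd.DataFrame({"raw": s, "ffill": ffilled, "bfill": bfilled, "ffill_limit1": ffill_limited}))
```

Note that a backward fill cannot fill the final gap because no later observation exists, and the same holds for a forward fill at the start of a series.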

4. Interpolation

Linear Interpolation: Estimate missing values using linear interpolation between surrounding points.

When to use: Time series or sequential data where a linear relationship between observations is assumed.

Other Interpolation Methods: Polynomial, spline, etc.

When to use: When a more complex relationship is assumed between the observations.
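One practical detail is that pandas' `method="linear"` treats rows as equally spaced and ignores the index. The sketch below, using invented readings on irregular dates, contrasts it with `method="time"`, which weights by the actual time gap. Polynomial and spline variants are also available through `interpolate` (with an `order` argument), but they require SciPy and are left out here.

```python
import numpy as np
import pandas as pd

# Illustrative readings taken on irregularly spaced dates.
readings = pd.Series(
    [10.0, np.nan, 50.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"]),
)

# Positional: the gap is treated as the midpoint between its neighbours.
linear = readings.interpolate(method="linear")

# Time-weighted: Jan 2 is one day into a four-day interval.
time_weighted = readings.interpolate(method="time")

print(linear.iloc[1], time_weighted.iloc[1])
```

The positional estimate is 30.0 and the time-weighted estimate is 20.0, so the choice matters whenever observations are not evenly spaced.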

5. K-Nearest Neighbors (KNN) Imputation

KNN Imputation: Use the k-nearest neighbors to impute missing values based on the similarity of other observations.

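When to use: When there is a strong relationship between variables. The method works directly on numerical data; categorical variables must be encoded numerically first (scikit-learn's KNNImputer, for example, accepts only numeric input).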
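As a small worked example, assuming a toy two-column array, the sketch below shows how the KNN estimate differs from the column mean. scikit-learn's `KNNImputer` measures distance only on the features both rows have observed, then averages the missing feature over the nearest rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: the second feature is roughly ten times the first.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],  # value to impute
    [8.0, 80.0],
])

imputer = KNNImputer(n_neighbors=2)  # uniform weights by default
X_imputed = imputer.fit_transform(X)

knn_value = X_imputed[2, 1]       # average over the two closest rows
column_mean = np.nanmean(X[:, 1])  # what mean imputation would use

print(knn_value, column_mean)
```

The two nearest rows on the first feature are those with values 2 and 1, so the imputed value is the average of 20 and 10, namely 15.0, whereas the column mean is about 36.7 because of the distant fourth row.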

6. Multivariate Imputation by Chained Equations (MICE)

MICE: Model each variable that has missing values as a function of the other variables, then cycle through the variables for several iterations, updating the imputed values on each pass.

When to use: For complex datasets where missing values depend on multiple other variables.
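The sketch below approximates the "multiple" part of MICE with scikit-learn's `IterativeImputer`, which by itself returns a single completed dataset. Setting `sample_posterior=True` and varying `random_state` yields several plausible completions whose spread reflects imputation uncertainty. The synthetic data, the number of imputations (3), and the simple averaging at the end are assumptions for illustration, not a full pooling procedure.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Synthetic data: y depends linearly on x, with noise.
rng = np.random.default_rng(0)
x = rng.normal(size=60)
y = 2.0 * x + rng.normal(scale=0.5, size=60)
X_missing = np.column_stack([x, y])

# Knock out 12 values of y at random.
X_missing[rng.choice(60, size=12, replace=False), 1] = np.nan
mask = np.isnan(X_missing)

# Draw several completed datasets, each with a different seed.
imputations = []
for seed in range(3):
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    imputations.append(imputer.fit_transform(X_missing))

stacked = np.stack(imputations)             # shape: (imputations, rows, columns)
pooled = stacked.mean(axis=0)               # simple average across completions
between_spread = stacked.std(axis=0)[mask]  # disagreement on the imputed cells

print(between_spread.round(3))
```

In a full analysis the downstream model would be fitted on each completed dataset separately and the results combined afterwards, rather than averaging the imputed values up front.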

7. Using Model-Based Methods

Regression Imputation: Use a regression model to predict and fill missing values.

When to use: When there are strong predictors for the missing values within the dataset.
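A minimal sketch of regression imputation, assuming hypothetical `height_cm` and `weight_kg` columns: fit a model on the rows where the target is observed, then predict the target for the rows where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: weight is missing for two rows, height is complete.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [50.0, 60.0, np.nan, 80.0, np.nan],
})

observed = df["weight_kg"].notna()

# Fit only on rows where the target is known.
model = LinearRegression()
model.fit(df.loc[observed, ["height_cm"]], df.loc[observed, "weight_kg"])

# Predict the target for the rows where it is missing.
df["weight_imputed"] = df["weight_kg"]
df.loc[~observed, "weight_imputed"] = model.predict(df.loc[~observed, ["height_cm"]])

print(df)
```

Because the predictions fall exactly on the fitted line, this kind of imputation understates the natural scatter of the variable; adding noise to the predictions or using multiple imputation are common ways to address that.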

8. Domain-Specific Imputation

Custom Imputation: Use domain knowledge to impute missing values.

When to use: When there is specific knowledge about the dataset that can inform the imputation strategy (e.g., filling missing ages in a medical dataset based on other health indicators).
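One simple way to encode domain knowledge is to impute within groups that are known to be homogeneous. The sketch below assumes a hypothetical `patient_group` column and fills a missing age with the median age of the same group instead of the overall median.

```python
import numpy as np
import pandas as pd

# Hypothetical records; the group label carries the domain knowledge.
df = pd.DataFrame({
    "patient_group": ["pediatric", "pediatric", "pediatric",
                      "geriatric", "geriatric", "geriatric"],
    "age": [6.0, np.nan, 10.0, 70.0, 80.0, np.nan],
})

# Median age within each group, aligned to the original rows.
group_median = df.groupby("patient_group")["age"].transform("median")

df["age_imputed"] = df["age"].fillna(group_median)
overall_median = df["age"].median()

print(df)
print(overall_median)
```

The group-wise fill gives 8.0 and 75.0, whereas the overall median of 40.0 would be implausible for either group.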

import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
# IterativeImputer is experimental; this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Sample DataFrame with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': ['cat', 'dog', np.nan, 'mouse', 'cat']
})

# Mean Imputation
df['A_mean_imputed'] = df['A'].fillna(df['A'].mean())

# Median Imputation
df['A_median_imputed'] = df['A'].fillna(df['A'].median())

# Mode Imputation
df['C_mode_imputed'] = df['C'].fillna(df['C'].mode()[0])

# Forward Fill (the first value of B has no earlier observation, so it stays NaN)
df['B_ffill'] = df['B'].ffill()

# Backward Fill
df['B_bfill'] = df['B'].bfill()

# Linear Interpolation
df['A_interpolated'] = df['A'].interpolate(method='linear')

# KNN Imputation
knn_imputer = KNNImputer(n_neighbors=2)
df[['A_knn', 'B_knn']] = knn_imputer.fit_transform(df[['A', 'B']])

# MICE-style imputation (IterativeImputer returns a single imputed dataset)
mice_imputer = IterativeImputer()
df[['A_mice', 'B_mice']] = mice_imputer.fit_transform(df[['A', 'B']])

print(df)
