Strategies for dealing with missing data

Pierre Baudin 丁俊杰

Published Feb 7, 2021

Following up on my previous article on missing data and how it is categorized (see: missing data: classification), this post looks at the different strategies for dealing with these information gaps.

As mentioned, missing data (or missing values) is the absence of a value for a variable in an observation (see: the concept of tidy data). The problem is common across all data professions and can significantly affect the conclusions drawn from an analysis, because the quality of an analysis is closely tied to the quality of the data used.

Managing missing data is one of the critical tasks of the data cleaning and preparation step. It goes hand in hand with exploratory analysis, which helps to understand where missing values come from, what impact they have, and which strategy is best suited to handle them.

It is also important to note that there is no single "right way" to deal with missing data. Whichever method you choose, it is virtually impossible to recover the lost information.

Prevention

When you discover that a dataset contains missing values, the best way to deal with them is, where possible, to prevent the problem at its source.

Identifying the type of missing data and where it comes from makes it possible to correct the data collection process and thus eliminate the problem.

For example, this could mean modifying the validation rules of an application form to ensure that the user provides the necessary information.

During data processing, we may also notice that a cleaning operation itself introduces missing data, for example after converting a variable's data type (text to numbers) or after using an incorrect encoding. Fixing the faulty processing step is another way of preventing missing data from being generated.
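As a minimal sketch of that last point, the snippet below shows how a text-to-number conversion can create missing values, and how fixing the conversion step prevents some of them. It uses only the Python standard library; the function names and the sample values are hypothetical and chosen purely for illustration.

```python
from typing import List, Optional


def to_number(raw: str) -> Optional[float]:
    """Convert text to a float; return None (missing) when conversion fails."""
    try:
        return float(raw)
    except ValueError:
        return None


def to_number_cleaned(raw: str) -> Optional[float]:
    """Same conversion, but normalize a decimal comma before converting."""
    cleaned = raw.strip().replace(",", ".")
    return to_number(cleaned) if cleaned else None


# Hypothetical raw column: one value uses a decimal comma, one is truly unknown.
raw_prices: List[str] = ["12.5", "7,25", " 3 ", "n/a"]

naive = [to_number(v) for v in raw_prices]
fixed = [to_number_cleaned(v) for v in raw_prices]

naive_missing = sum(v is None for v in naive)  # missing values created by the naive step
fixed_missing = sum(v is None for v in fixed)  # missing values left after the fix

print(naive_missing, fixed_missing)
```

Here the naive conversion loses the value written with a decimal comma, while the corrected step keeps it. The "n/a" entry stays missing in both cases, because that information was never collected.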

Unlike the prevention strategy, the following methods act directly on the dataset. The two main approaches to handling missing information are deletion and imputation.

Deletion

Deletion techniques consist of eliminating the observations or variables that contain missing values.

The impact of deleting data depends on the type of missing data. In the MCAR (missing completely at random) case, excluding observations with missing values does not bias the analysis, although it reduces the sample size and therefore the statistical power. In the MAR (missing at random) case, deletion is often tolerated, but it can introduce bias unless the variables that explain the missingness are accounted for in the analysis.

In contrast, in the MNAR (missing not at random) case, removing observations with missing values may produce bias in the model. You must therefore be very careful before deleting observations.
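The difference can be illustrated with a small simulation. This is only a sketch on synthetic data: the population, the 30% missingness rate and the cut-off of 60 are assumptions made up for the example, not values from a real dataset.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic variable (assumption): 10,000 values drawn around 50.
population = [random.gauss(50, 10) for _ in range(10_000)]
true_mean = mean(population)

# MCAR: every value has the same 30% chance of being missing,
# regardless of what the value is.
mcar_observed = [v for v in population if random.random() > 0.3]

# MNAR: missingness depends on the value itself.
# Here, values above 60 are never reported.
mnar_observed = [v for v in population if v <= 60]

# Bias of the mean after deleting the missing observations.
mcar_bias = mean(mcar_observed) - true_mean
mnar_bias = mean(mnar_observed) - true_mean

print(f"MCAR bias: {mcar_bias:+.2f}")
print(f"MNAR bias: {mnar_bias:+.2f}")
```

With deletion under MCAR, the estimated mean stays close to the true mean. Under MNAR, it is pulled systematically downward because the high values are the ones that disappear.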

There are three ways of dealing with missing data by deletion:

Listwise deletion (complete case analysis): by far the most common approach, it simply excludes every observation that has a missing value in any variable.
Pairwise deletion (available case analysis): observations are excluded only when the variables needed for a given analysis contain missing values. This preserves more data than listwise deletion.
Removal of variables: if the missing data rate for a variable is very high (a common rule of thumb is more than 60%), dropping the variable in question can be considered.
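The three deletion methods can be sketched as follows, using plain Python dictionaries for the rows. The column names, the sample rows and the helper functions are hypothetical, and `None` stands for a missing value.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]

# Hypothetical dataset: None marks a missing value.
rows: List[Row] = [
    {"age": 34, "income": 52000, "city": "Lyon"},
    {"age": None, "income": 48000, "city": None},
    {"age": 29, "income": None, "city": None},
    {"age": 41, "income": 61000, "city": None},
    {"age": 55, "income": 58000, "city": None},
]


def listwise_deletion(data: List[Row]) -> List[Row]:
    """Keep only rows with no missing value in any column."""
    return [r for r in data if all(v is not None for v in r.values())]


def pairwise_deletion(data: List[Row], needed: Iterable[str]) -> List[Row]:
    """Keep rows that are complete on the columns used by one analysis."""
    needed = list(needed)
    return [r for r in data if all(r[c] is not None for c in needed)]


def drop_sparse_columns(data: List[Row], threshold: float = 0.6) -> List[Row]:
    """Drop columns whose share of missing values exceeds the threshold."""
    n = len(data)
    keep = [
        c for c in data[0]
        if sum(r[c] is None for r in data) / n <= threshold
    ]
    return [{c: r[c] for c in keep} for r in data]


complete_cases = listwise_deletion(rows)                     # 1 row survives
age_income_cases = pairwise_deletion(rows, ["age", "income"])  # 3 rows survive
reduced = drop_sparse_columns(rows)                          # "city" is dropped (80% missing)
```

In this example, listwise deletion keeps a single row, whereas an analysis that only needs age and income can still use three rows. Dropping the mostly empty "city" column first has the same effect on the number of usable rows.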

Imputation

When data is missing, deleting the affected observations may be a reasonable choice. However, it is not always the most efficient option. For example, if too much information is excluded, it may not be possible to perform a reliable analysis. Insufficient data can also impact the quality and reliability of a prediction model.

Imputation methods offer an alternative to deletion. Depending on why the data is missing, they can provide reasonably reliable results by estimating the missing values. They are most effective when the percentage of missing data is relatively low. If the proportion of missing data is too high, the imputed dataset lacks natural variation, which limits the quality of the estimates.

Which imputation method works best depends on the nature and characteristics of the data.

For a temporal dataset:

Without trend and seasonality: replace missing values with mean, mode, median or by random sample imputation
With trend but without seasonality: linear regression
With trend and seasonality: seasonal adjustment and regression
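For the "trend without seasonality" case, a minimal sketch could fit a straight line to the observed points by ordinary least squares and use it to fill the gaps. The series below is invented for the example, the time index is simply the position in the list, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple


def fit_line(points: List[Tuple[int, float]]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = intercept + slope * t."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return slope, intercept


def impute_with_trend(series: List[Optional[float]]) -> List[float]:
    """Replace None values with the value predicted by the fitted trend line."""
    observed = [(t, y) for t, y in enumerate(series) if y is not None]
    slope, intercept = fit_line(observed)
    return [
        y if y is not None else intercept + slope * t
        for t, y in enumerate(series)
    ]


# Hypothetical series with an upward trend and two gaps.
sales = [10.0, 12.0, None, 16.0, 18.0, None, 22.0]
filled = impute_with_trend(sales)
print([round(v, 2) for v in filled])
```

Observed values are left untouched and only the gaps are replaced by the trend estimate. A series with seasonality would first need a seasonal adjustment, which this sketch does not attempt.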

For a general dataset:

For a categorical variable: treat missing values as a category of their own, or apply multiple imputation or logistic regression.
For a continuous variable: replace missing values with the mean, the median or the mode, or estimate them with multiple imputation or linear regression.
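The two simplest options in this list can be sketched as follows: median imputation for a continuous variable and a dedicated category for a categorical one. The sample values, the "Missing" label and the function names are assumptions made for the illustration.

```python
from statistics import median
from typing import List, Optional


def impute_median(values: List[Optional[float]]) -> List[float]:
    """Replace missing values with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def impute_category(values: List[Optional[str]], label: str = "Missing") -> List[str]:
    """Treat missing values as a category of their own."""
    return [label if v is None else v for v in values]


# Hypothetical columns: None marks a missing value.
ages = [34, None, 29, 41, None, 90]
contracts = ["full-time", None, "part-time", "full-time"]

ages_filled = impute_median(ages)
contracts_filled = impute_category(contracts)

print(ages_filled)       # [34, 37.5, 29, 41, 37.5, 90]
print(contracts_filled)  # ['full-time', 'Missing', 'part-time', 'full-time']
```

The median is used here rather than the mean because it is less sensitive to extreme values such as the 90 in the sample. Either way, filling every gap with the same constant reduces the variance of the variable.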

There are also imputation techniques based on machine learning algorithms such as k-nearest neighbors (KNN). KNN can estimate both discrete values (the most frequent value among the k nearest neighbors) and continuous values (the average among the k nearest neighbors).
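A minimal sketch of the idea, written from scratch with the standard library rather than with a machine learning package, could look like this. It assumes that all other columns are numeric and complete, and it uses unscaled Euclidean distance. The data and the function name are hypothetical.

```python
import math
from collections import Counter
from typing import Any, Dict, List

Row = Dict[str, Any]


def knn_impute(rows: List[Row], target: str, k: int = 3) -> List[Row]:
    """Fill missing values of `target` from the k nearest complete rows.

    Assumes every other column is numeric and has no missing values.
    Numeric targets get the neighbors' average, others the most frequent value.
    """
    features = [c for c in rows[0] if c != target]
    donors = [r for r in rows if r[target] is not None]
    result = []
    for row in rows:
        if row[target] is not None:
            result.append(dict(row))
            continue
        point = [row[c] for c in features]
        # Sort the complete rows by Euclidean distance and keep the k closest.
        nearest = sorted(
            donors, key=lambda d: math.dist(point, [d[c] for c in features])
        )[:k]
        values = [d[target] for d in nearest]
        if all(isinstance(v, (int, float)) for v in values):
            estimate = sum(values) / len(values)             # continuous: average
        else:
            estimate = Counter(values).most_common(1)[0][0]  # discrete: majority vote
        result.append({**row, target: estimate})
    return result


# Discrete target: clothing size estimated from height and weight.
people = [
    {"height": 160, "weight": 55, "size": "S"},
    {"height": 162, "weight": 58, "size": "S"},
    {"height": 175, "weight": 72, "size": "M"},
    {"height": 178, "weight": 75, "size": "M"},
    {"height": 161, "weight": 56, "size": None},
]
sizes = knn_impute(people, target="size", k=3)

# Continuous target: weight estimated from height.
measures = [
    {"height": 160, "weight": 55.0},
    {"height": 170, "weight": 65.0},
    {"height": 180, "weight": 75.0},
    {"height": 172, "weight": None},
]
weights = knn_impute(measures, target="weight", k=2)
```

In the first call the missing size becomes "S", the majority among the three nearest neighbors. In the second the missing weight becomes 70.0, the average of the two nearest neighbors. With features on different scales, standardizing them before computing distances would normally be advisable.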

To conclude, the different strategies for dealing with missing data make it possible to limit the impact of the absence of information. They make the dataset easier to use and thereby improve the quality of the analysis.

It should also be kept in mind that preserving the integrity of the dataset requires testing several techniques in order to assess their effectiveness and to be aware of the biases that could be introduced by their use.

In a future article, I'll walk you through the technical details with examples of how to use the available deletion and imputation methods for handling missing data.

Until then, let's stay connected!
