Strategies for dealing with missing data

Following up on my previous article on missing data and how it is categorized (see: missing data: classification), this post looks at the different strategies for dealing with these information gaps.

As mentioned, missing data (or missing values) is the absence of a value for a variable in an observation (see: the concept of tidy data). The problem is very common in all data professions and can significantly affect the conclusions drawn from an analysis, because the quality of an analysis is closely tied to the quality of the data used.

Managing missing data is one of the critical tasks of data cleaning and preparation. It goes hand in hand with exploratory analysis, which helps to understand the origins and impact of missing values and to find the best strategy for handling them.

It is also important to note that there is no single "right way" to deal with missing data. Whichever method you choose, it is virtually impossible to recover the lost information.

Prevention

When you discover that a dataset contains missing values, the best way to deal with them is first to prevent the problem at its source.

To do this, identify the type of missing data and its source. This makes it possible to correct the data collection and thus eliminate the problem.

For example, you could modify the validation rules of an application form to ensure that the user provides the necessary information.

In data processing, we may notice that a cleaning operation itself introduces missing data, for example after converting the data type of a variable (text into numbers) or using an incorrect encoding. Fixing the faulty processing step is another way of preventing missing data from being generated.
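To make this concrete, here is a minimal sketch of a text-to-number conversion that creates missing values by itself, and of a repaired version. It uses only the Python standard library. The column content and the helper names (to_float, to_float_fixed) are hypothetical and only serve the illustration.

```python
def to_float(raw):
    """Convert a raw text field to float; return None when parsing fails."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


def to_float_fixed(raw):
    """Same conversion, but also accepts a decimal comma (assumed source format)."""
    if isinstance(raw, str):
        raw = raw.strip().replace(",", ".")
    return to_float(raw)


# Hypothetical raw text column: one decimal comma, one genuine gap ("n/a")
raw_prices = ["12.5", "7", "8,40", "n/a", "15.0"]

converted = [to_float(v) for v in raw_prices]

# Raw values that became missing only because of the conversion step
introduced = [v for v, c in zip(raw_prices, converted) if c is None]

fixed = [to_float_fixed(v) for v in raw_prices]

print(converted)   # [12.5, 7.0, None, None, 15.0]
print(introduced)  # ['8,40', 'n/a']
print(fixed)       # [12.5, 7.0, 8.4, None, 15.0]
```

Listing the raw values that turned into gaps shows which ones are processing errors ("8,40") and which are genuine gaps ("n/a"). Only the first kind can be prevented by fixing the conversion.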

Unlike prevention, the following methods act directly on the dataset. The two main approaches to resolving missing information are deletion and imputation.

Deletion

Deletion techniques consist of eliminating certain observations or variables that contain missing values.

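The impact of deleting data depends on the type of missing data. In the MCAR (missing completely at random) case, excluding observations with missing values reduces the sample size but generally does not bias the analysis. In the MAR (missing at random) case, deletion can still be acceptable, but only if the analysis accounts for the observed variables that explain the missingness; otherwise it may introduce bias.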

In contrast, in the MNAR (missing not at random) case, removing observations with missing values may produce bias in the model. You must therefore be very careful before deleting observations.

There are three ways to deal with missing data by deletion:

  • Listwise deletion (complete case analysis): by far the most common approach, it simply excludes every observation that has at least one missing value.
  • Pairwise deletion (available case analysis): this excludes an observation only when the variables needed for a given analysis contain missing values. It preserves more data than deleting every row with a missing value.
  • Removal of variables: if a variable has a missing data rate of more than 60% (a rule of thumb rather than a strict limit), dropping the variable can be considered.
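The three deletion methods above can be sketched as follows, using only the Python standard library and None to mark a missing value. The dataset, the column names and the function names are hypothetical. The 0.6 threshold simply mirrors the 60% rule of thumb mentioned above.

```python
# Hypothetical dataset: None marks a missing value
rows = [
    {"age": 34, "income": 52000, "city": "Lyon"},
    {"age": None, "income": 48000, "city": None},
    {"age": 29, "income": None, "city": None},
    {"age": 41, "income": 61000, "city": None},
    {"age": 38, "income": 57000, "city": None},
]


def listwise(rows):
    """Keep only observations with no missing value at all."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise(rows, columns):
    """Keep observations that are complete for the columns of one analysis."""
    return [r for r in rows if all(r[c] is not None for c in columns)]


def missing_rate(rows, column):
    """Share of observations where the column is missing."""
    return sum(r[column] is None for r in rows) / len(rows)


def drop_sparse_columns(rows, threshold=0.6):
    """Drop variables whose missing rate exceeds the threshold."""
    keep = [c for c in rows[0] if missing_rate(rows, c) <= threshold]
    return [{c: r[c] for c in keep} for r in rows]


print(len(listwise(rows)))                     # 1
print(len(pairwise(rows, ["age", "income"])))  # 3
print(missing_rate(rows, "city"))              # 0.8
print(list(drop_sparse_columns(rows)[0]))      # ['age', 'income']
```

In this small example, listwise deletion keeps a single observation, while an analysis of age and income alone keeps three. Dropping the sparse "city" variable first has the same effect.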

Imputation

When data is missing, deleting the affected observations or variables can be a reasonable choice. However, it is not always the most efficient option. For example, if too much information is excluded, it may not be possible to perform a reliable analysis. Insufficient data can also affect the quality and reliability of a prediction model.

Imputation methods offer an alternative to deletion. Depending on why the data is missing, they can provide reasonably reliable results by calculating estimates for the missing values. They are particularly effective when the percentage of missing data is relatively low. If the proportion of missing data is too high, the imputed dataset lacks natural variation, which limits the quality of the estimates.
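The loss of natural variation can be seen with a small, deliberately extreme sketch in which half of the values are missing and are replaced by the mean of the recorded ones. The numbers are hypothetical and only the Python standard library is used.

```python
from statistics import mean, pvariance

observed = [2.0, 4.0, 6.0, 8.0]  # hypothetical values that were actually recorded
n_missing = 4                    # hypothetical number of missing entries

# Mean imputation: every gap receives the same value
fill = mean(observed)
imputed = observed + [fill] * n_missing

var_before = pvariance(observed)  # spread of the recorded values
var_after = pvariance(imputed)    # spread after mean imputation

print(fill)        # 5.0
print(var_before)  # 5.0
print(var_after)   # 2.5
```

The mean is unchanged, but the variance is halved because every imputed point sits exactly on the mean. The more values are imputed this way, the more the spread of the data is understated.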

Some imputation methods work better than others, depending on the nature and characteristics of the data.

For a temporal dataset:

  • Without trend and seasonality: replace missing values with the mean, the median or the mode, or use random sample imputation
  • With trend but without seasonality: linear regression
  • With trend and seasonality: seasonal adjustment and regression
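As an illustration of the "with trend but without seasonality" case only, here is a minimal sketch that fits a straight line to the observed points by ordinary least squares and uses it to fill the gaps. The series and the function names are hypothetical. It assumes evenly spaced observations, so the position in the list serves as the time index.

```python
def fit_line(ts, ys):
    """Ordinary least squares fit of y = intercept + slope * t."""
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    slope = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys)) / sum(
        (t - t_mean) ** 2 for t in ts
    )
    return slope, y_mean - slope * t_mean


def impute_trend(series):
    """Replace None entries with the value predicted by the fitted trend line."""
    observed = [(t, y) for t, y in enumerate(series) if y is not None]
    slope, intercept = fit_line([t for t, _ in observed], [y for _, y in observed])
    return [y if y is not None else intercept + slope * t for t, y in enumerate(series)]


# Hypothetical series with a steady upward trend and two gaps
series = [10.0, 12.0, None, 16.0, None, 20.0]
filled = impute_trend(series)

print(filled)  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```

Observed values are left untouched and only the gaps receive a fitted value. For a series with seasonality, the seasonal component would have to be removed before this step and added back afterwards.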

For a general dataset:

  • For a categorical variable: consider missing values as a category, apply multiple imputation or logistic regression.
  • For a continuous variable: replace missing values with the mean, the median, the mode, the result of multiple imputation or the prediction of a linear regression.
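The simplest options from the two lists above can be sketched as follows: mean, median or mode replacement, and treating missing values as their own category. The data and the impute helper are hypothetical, and multiple imputation and regression are not covered here.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Fill None entries with a statistic computed on the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


# Hypothetical continuous variable with one outlier (95.0) and one gap
incomes = [30.0, 32.0, None, 31.0, 95.0]
print(impute(incomes, mean))    # [30.0, 32.0, 47.0, 31.0, 95.0]
print(impute(incomes, median))  # [30.0, 32.0, 31.5, 31.0, 95.0]

# Hypothetical categorical variable
colors = ["red", None, "blue", "red"]
print(impute(colors, mode))                                # ['red', 'red', 'blue', 'red']
print([c if c is not None else "Missing" for c in colors]) # ['red', 'Missing', 'blue', 'red']
```

The outlier pulls the mean far above the typical values, whereas the median stays close to them. This is one reason to compare several strategies before settling on one.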

There are also imputation techniques based on machine learning algorithms such as k-nearest neighbors (KNN). KNN can predict both discrete values (the most frequent value among the k nearest neighbors) and continuous values (the average of the k nearest neighbors).
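Here is a minimal, hand-written sketch of KNN imputation, so that the logic is visible without relying on a machine learning library. It assumes that only the target column has gaps and that the other columns are numeric and complete. It uses Euclidean distance on unscaled features, which is a simplification. The function name and the data are hypothetical.

```python
import math
from collections import Counter


def knn_impute(rows, target, k=2, discrete=False):
    """Fill None in column `target` from the k nearest rows that have a value there."""
    features = [i for i in range(len(rows[0])) if i != target]
    donors = [r for r in rows if r[target] is not None]
    result = []
    for r in rows:
        filled = list(r)
        if r[target] is None:
            point = [r[i] for i in features]
            nearest = sorted(
                donors, key=lambda d: math.dist(point, [d[i] for i in features])
            )[:k]
            values = [d[target] for d in nearest]
            if discrete:
                # Most frequent value among the k nearest neighbors
                filled[target] = Counter(values).most_common(1)[0][0]
            else:
                # Average of the k nearest neighbors
                filled[target] = sum(values) / len(values)
        result.append(filled)
    return result


# Hypothetical continuous target in the last column
numeric = [[1.0, 1.0, 10.0], [1.2, 0.9, 12.0], [8.0, 9.0, 50.0], [1.1, 1.0, None]]
print(knn_impute(numeric, target=2, k=2)[3])  # [1.1, 1.0, 11.0]

# Hypothetical discrete target in the last column
labeled = [[1.0, 1.0, "a"], [1.2, 0.9, "a"], [8.0, 9.0, "b"], [1.1, 1.0, None]]
print(knn_impute(labeled, target=2, k=3, discrete=True)[3])  # [1.1, 1.0, 'a']
```

In practice the features usually need to be put on comparable scales before computing distances, otherwise the variable with the largest range dominates the choice of neighbors.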

To conclude, the different strategies for dealing with missing data make it possible to limit the impact of missing information. They make the dataset more usable and thereby improve the quality of the analysis.

It should also be kept in mind that preserving the integrity of the dataset requires testing several techniques in order to assess their effectiveness and to be aware of the biases that could be introduced by their use.


In a future article, I'll walk you through the technical details with examples of how to use the available deletion and imputation methods for handling missing data.

Until then, let's stay connected!
