Missing data: classification

Anyone who works with data will tell you the same thing: at some point they have had to deal with missing data. Mining a dataset riddled with missing values can be a headache, and sometimes a nightmare. As we know, data quality is one of the main keys to carrying out a data project successfully (see: the concept of clean data).

However, before choosing a strategy for dealing with missing data, it is important to identify and understand the reasons behind these gaps (see: Do you speak data?).


Understand and identify missing data

There are several categories of missing data, based on the reasons and mechanisms that cause values to be missing. In the following paragraphs, I detail the three main types.

Missing Completely At Random (MCAR)

Missing data are classified as MCAR (Missing Completely At Random) if the events that lead to a particular piece of information being absent are independent of both the observable variables and the unobservable parameters. In other words, the gaps arise entirely at random: the causes of the missing data are not related to the data itself.

An example of MCAR is a scale running out of batteries. Some data will be missing just because of bad luck.

In the context of a company collecting information on its website, MCAR data appears when the site is temporarily unavailable for any reason (breakdown, temporary stoppage of services, maintenance, etc.).

When the data is MCAR, the observed values are in effect a random sample of the full dataset, so no variable is affected more than another. The statistical advantage is that the analysis remains unbiased; the cost is a loss of information, and therefore of precision.
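To make this concrete, here is a minimal simulation sketch of the scale example. The weights, the 20% loss rate and the variable names are all assumptions chosen for illustration; the point is only that dropping readings completely at random leaves the average essentially unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" weights of 10,000 objects, in grams.
weights = rng.normal(loc=500.0, scale=50.0, size=10_000)

# MCAR: every reading has the same 20% chance of being lost
# (for example a flat battery), whatever the weight is.
missing = rng.random(weights.size) < 0.20
observed = np.where(missing, np.nan, weights)

true_mean = weights.mean()
observed_mean = np.nanmean(observed)  # mean of the readings we still have

print(f"true mean:     {true_mean:.1f}")
print(f"observed mean: {observed_mean:.1f}")
print(f"share missing: {missing.mean():.1%}")
```

The two means should differ only by sampling noise, even though roughly a fifth of the readings are gone.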

However, data is rarely MCAR.

Missing at Random (MAR)

Modern statistical methods for handling missing data generally start from the Missing At Random (MAR) assumption.

MAR is a more general and more realistic assumption than MCAR. Data are MAR when the missingness is not completely random but can be fully accounted for by variables for which there is complete information.

For example, our scale may produce more missing values when placed on a soft surface than on a hard one. These data are therefore not MCAR, because we know that different surfaces give different results. However, if we know the surface type and can assume that the data are MCAR within each surface type, then the data are considered MAR.
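The surface example can be sketched as a small simulation. Everything numeric here is an assumption made for illustration: the loss rates per surface, and in particular the idea that objects weighed on the soft surface happen to be heavier. Under those assumptions, ignoring the surface biases the average, while working within each surface and recombining recovers it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Fully observed variable: the surface each object was weighed on.
soft = rng.random(n) < 0.5

# Hypothetical assumption: objects on the soft surface are heavier on average.
weights = np.where(soft,
                   rng.normal(600.0, 50.0, n),
                   rng.normal(400.0, 50.0, n))

# MAR: the chance of a lost reading depends only on the surface
# (40% on soft, 5% on hard), not on the weight itself.
p_missing = np.where(soft, 0.40, 0.05)
missing = rng.random(n) < p_missing
seen = ~missing

true_mean = weights.mean()
naive_mean = weights[seen].mean()  # ignores the surface

# Within each surface the data are MCAR, so each group mean is reliable.
mean_soft = weights[seen & soft].mean()
mean_hard = weights[seen & ~soft].mean()

# Recombine the group means using the known share of each surface.
adjusted_mean = soft.mean() * mean_soft + (~soft).mean() * mean_hard

print(f"true mean:     {true_mean:.1f}")
print(f"naive mean:    {naive_mean:.1f}")
print(f"adjusted mean: {adjusted_mean:.1f}")
```

The naive mean is pulled toward the hard-surface objects because more of their readings survive; the adjusted mean is not, because the surface is known for every object.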

In our company website context, an example of MAR data may be a difference in how data flows behave between desktop browsing (via a computer) and mobile browsing (via a smartphone). In this case, the differences in collection between the two types of navigation can be known. By isolating the data from one access mode, we can then treat the data as MCAR.

Missing Not At Random (MNAR)

If the characteristics of the missing data do not match those of MCAR or MAR, they fall under the category of Missing Not At Random (MNAR).

MNAR means that the probability of a value being missing varies for reasons unknown to us, possibly including the missing value itself. For example, the mechanism of our scale may wear down over time, producing more and more missing data without our noticing. If the heaviest objects are measured later in the study, the resulting distribution of measurements will be distorted. MNAR also covers the case where the scale produces more missing values for heavier objects, a phenomenon that can be difficult to identify and manage.
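The "more missing values for heavier objects" case can be sketched as follows. The logistic curve and its parameters are assumptions picked for illustration, not a model of any real scale. Because the chance of losing a reading depends on the very weight we fail to record, the surviving readings underestimate the true average, and no observed variable is available to correct for it.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical "true" weights, in grams.
weights = rng.normal(loc=500.0, scale=50.0, size=10_000)

# MNAR: the heavier the object, the more likely its reading is lost.
# The logistic curve below (midpoint 550 g, scale 25 g) is an assumption.
p_missing = 1.0 / (1.0 + np.exp(-(weights - 550.0) / 25.0))
missing = rng.random(weights.size) < p_missing

true_mean = weights.mean()
observed_mean = weights[~missing].mean()

print(f"true mean:     {true_mean:.1f}")
print(f"observed mean: {observed_mean:.1f}")
print(f"share missing: {missing.mean():.1%}")
```

In a real dataset the `weights` of the lost readings would not be available, so this gap could not be measured directly; the simulation only shows that it exists.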

An example of MNAR data for our company and its website may be changes in site behavior and in data collection components as the underlying frameworks and systems are updated. In this case, it can be very difficult to identify the mechanisms that generate the missing data.

MNAR cases are problematic. The only way to obtain an unbiased estimate of the parameters in such a case is to model the missing data mechanism itself. That model can then be incorporated into a more complex model to estimate the missing values.


Understanding the concept of missing data is essential to managing and using data successfully. If missing values are not handled correctly, they can lead to inaccurate conclusions about the data. In a future article, I will share the strategies available for dealing with missing data.
