Handling Data Imbalance

Amit Raj

Published Sep 21, 2020

Data imbalance is a situation we data scientists face on a daily basis. It is a scenario where one class is heavily under-represented or over-represented compared to its counterparts in the data. We rarely see datasets with almost equally distributed classes. Let's take an example: say we are solving a classic binary classification problem and are trying to identify 0s and 1s. While going through the data, there can roughly be 3 scenarios:

Equally distributed data, i.e. 1s make up ~40-60% of the data
Slightly imbalanced, i.e. 1s make up ~20-30% of the data
Highly imbalanced, i.e. 1s make up ~5-15% of the data
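Before picking a strategy, it helps to measure which of these scenarios we are actually in. The snippet below is a minimal sketch using only the standard library; the helper name class_shares and the toy labels are made up for illustration and do not come from any particular dataset.

```python
from collections import Counter


def class_shares(labels):
    """Return each class's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Toy labels (made up for illustration): 90 zeros and 10 ones.
labels = [0] * 90 + [1] * 10
shares = class_shares(labels)
print(shares)  # {0: 0.9, 1: 0.1}
```

Here the 1s make up 10% of the data, which falls in the highly imbalanced scenario.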

In the cases listed above, the first one is never a problem; in fact, it's the best-case scenario. The second case can usually be tackled by training for longer or by using boosting algorithms like AdaBoost, Gradient Boosting or XGBoost. The third one, i.e. the highly imbalanced class problem, is difficult to handle. If we train longer, i.e. increase the number of epochs, we'll see the accuracy increasing and the loss dropping steadily, but when we check the confusion matrix, we notice that the model performs poorly on the minority class. The accuracy increases because the model fits (or overfits) the majority class, which makes up 90-95% of the data, while ignoring the minority class. Boosting algorithms can offer only limited help in the highly imbalanced case.
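To see why accuracy alone is misleading here, consider a toy illustration (the labels and the "model" are made up, not taken from a real experiment): a classifier that always predicts the majority class on data with 5% positives. The helper confusion_counts is a hypothetical name written for this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true negatives, false positives, false negatives and true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tn, fp, fn, tp


# Toy data: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that ignores the minority class entirely.
y_pred = [0] * 100

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
# Recall on the minority class: the share of actual 1s the model found.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2f}, minority recall={recall:.2f}")
# accuracy=0.95, minority recall=0.00
```

The accuracy looks excellent at 95%, yet the model never finds a single 1, which is exactly what the confusion matrix exposes.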

The solution for such a scenario can be obtained using the Python library imblearn (imbalanced-learn). This library offers a variety of oversampling and undersampling techniques. This article focuses on oversampling techniques, which avoid the data loss that comes with undersampling. There are 3 main algorithms in this library that I recommend as a good place to start and that yield quick results:

Random Oversampling
Synthetic Minority Oversampling Technique (SMOTE and Borderline-SMOTE)
ADASYN

What follows is a crash course based on research and various online sources to showcase these techniques, and I hope it proves useful for your next ML application. Towards the end of this article, I've shared links to the papers behind these techniques for a deeper understanding.

External links to the papers for the respective algorithms:

If you are not able to read any of the papers, copy the links and paste them here

Atul Awad 5y

Great article! Just adding to your thoughts: mostly when using SMOTE we can only make the event rate as high as 1, but I have created a tool which enables customizing the event rate.
