Imbalanced classification

Welcome!

I'm Ali Mirzaei, a data scientist, and I help developers get results with machine learning.


Imbalanced classification involves developing predictive models on classification datasets that have a severe class imbalance. The challenge of working with imbalanced datasets is that most machine learning techniques will ignore, and in turn perform poorly on, the minority class, even though it is typically performance on the minority class that matters most.

Before we proceed to the topic, let's define our terms. A balanced dataset means the target classes, say class A and class B, appear in roughly a 50:50 or 60:40 ratio. When class A and class B are split 80:20 or 90:10, the dataset is considered imbalanced.

If we train on such a dataset, the model will become biased, and this can lead to model overfitting.

To avoid this situation, we resample the dataset.

The Problem with Imbalanced Datasets

Let's say you are working at a leading tech company, and the company gives you the task of training a model for fraud detection. But here's the catch: fraudulent transactions are relatively rare. You train your model and get over 95% accuracy. You feel good and present your model to the company's CEO and shareholders. But when they feed inputs to your model, it predicts "Not a Fraud Transaction" every time. This is clearly a problem, because many machine learning algorithms are designed to maximize overall accuracy.

Now what happened? You got 95% accuracy, yet your model predicts the wrong class whenever it matters.

Let's find out why.

The 3 Mistakes Made By Beginners

Imbalanced classification problems look like normal classification problems.

As such, beginners wander in and start using their usual techniques. It may even look like they are getting good results, but they are falling into the most common trap (one that you really want to avoid)!

The common mistakes that beginners make when working on imbalanced classification problems are as follows:

1. They Use Classification Accuracy

Beginners will use classification accuracy to estimate performance.

Accuracy is dangerously misleading.

If 99% of examples in a dataset belong to one class, a model that always predicts that class will achieve a classification accuracy of 99%. This looks good to a beginner, but in fact is the worst case performance.
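To make the trap concrete, here is a rough sketch with synthetic data (the numbers are made up purely for illustration): a "model" that always predicts the majority class reaches 99% accuracy while being useless on the minority class.

```python
# Illustration of the accuracy trap on a synthetic 99:1 dataset.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

X = np.random.rand(1000, 5)            # 1,000 examples, 5 features
y = np.array([0] * 990 + [1] * 10)     # 99% majority (0), 1% minority (1)

# A "model" that always predicts the most frequent class
model = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = model.predict(X)

print("Accuracy:", accuracy_score(y, y_pred))  # 0.99 -- looks great
print("F1 score:", f1_score(y, y_pred))        # 0.0  -- useless on the minority class
```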

2. They Fit Models on Raw Data

Beginners will fit standard models on raw data.

Fitting on raw data will result in terrible performance.

If 99% of examples in a dataset belong to one class, then standard models fit on this dataset would focus attention on the majority class at the expense of the minority class.

3. They Use Standard Algorithms

Beginners will use standard machine learning algorithms.

Standard algorithms treat all classification errors as the same.

If 99% of examples in a dataset belong to one class, then misclassification errors for the minority class should be a lot more important to the model than misclassification errors for the majority class.

What is Sampling?

Sampling means increasing the minority class records or deleting majority class records in order to turn the dataset into a balanced dataset. Sampling can be applied to binary or multiclass classification problems.

Techniques to Convert an Imbalanced Dataset into a Balanced Dataset

Imbalanced data is not always a bad thing, and in real data sets there is always some degree of imbalance. That said, there should not be any big impact on your model performance if the level of imbalance is relatively low. Now, let's cover a few techniques to solve the class imbalance problem.

It is a challenging problem in general, especially if little is known about the dataset, as there are tens, if not hundreds, of machine learning algorithms to choose from. The problem is made significantly more difficult if the distribution of examples across the classes is imbalanced. This requires the use of specialized methods to either change the dataset or change the learning algorithm to handle the skewed class distribution.

A common way to deal with the overwhelm on a new classification project is to use a favorite machine learning algorithm like Random Forest, or a favorite resampling technique like SMOTE. Another common approach is to scour the research literature for descriptions of vaguely similar problems and attempt to re-implement the algorithms and configurations that are described.

These approaches can be effective, although they are hit-or-miss and time-consuming respectively. Instead, the shortest path to a good result on a new classification task is to systematically evaluate a suite of machine learning algorithms in order to discover what works well, then double down. This approach can also be used for imbalanced classification problems, tailored for the range of data sampling, cost-sensitive, and one-class classification algorithms that one may choose from.


In this article, I will focus only on data sampling algorithms.

1 - Over-sampling (Up-sampling): This technique modifies the unequal class distribution to create a balanced dataset. When the quantity of data is insufficient, over-sampling tries to balance the dataset by increasing the number of rare samples. In other words, over-sampling increases the number of minority class members in the training set. The advantage of over-sampling is that no information from the original training set is lost, as all observations from the minority and majority classes are kept. On the other hand, it is prone to overfitting.


Advantages

  • No loss of information
  • Synthetic variants such as SMOTE can mitigate the overfitting caused by naive over-sampling.

Disadvantages

  • Prone to overfitting, since minority-class examples are duplicated or closely replicated.
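As a minimal sketch of random over-sampling with the imbalanced-learn library (assuming X_train and y_train already exist; the variable names are illustrative):

```python
# Random over-sampling: duplicate minority-class examples until the classes balance.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X_train, y_train)

print("Before:", Counter(y_train))
print("After: ", Counter(y_res))
```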

2 - Under-sampling (Down-sampling): Unlike over-sampling, this technique balances the imbalanced dataset by reducing the size of the class that is in abundance. There are various methods for this, such as cluster centroids and Tomek links. The cluster centroid method replaces a cluster of majority samples with the cluster centroid of a K-means algorithm, and the Tomek link method removes unwanted overlap between classes until all minimally distanced nearest neighbours belong to the same class. In other words, under-sampling, in contrast to over-sampling, aims to reduce the number of majority samples to balance the class distribution. Since it removes observations from the original dataset, it might discard useful information.


Advantages

  • Run time can be improved by reducing the size of the training dataset.
  • Helps alleviate memory problems.

Disadvantages

  • Losing some critical information
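A minimal sketch of random under-sampling with imbalanced-learn (same assumptions as above about X_train and y_train):

```python
# Random under-sampling: drop majority-class examples until the classes balance.
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)

print("Before:", Counter(y_train))
print("After: ", Counter(y_res))
```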

3 - Feature selection: In order to tackle the imbalance problem, we calculate one-sided metrics such as the correlation coefficient (CC) and odds ratio (OR), or two-sided metrics such as information gain (IG) and chi-square (CHI), on both the positive class and the negative class. Based on the scores, we then identify the significant features from each class and take the union of these features to obtain the final feature set. Then, we use this data to classify the problem.

Identifying these features will help us generate a clear decision boundary with respect to each class. This helps the models classify the data more accurately. It performs the function of intelligent subsampling and potentially helps reduce the imbalance problem.
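As a simplified sketch of this idea, here is chi-square-based selection with scikit-learn (this uses a single two-sided score rather than the per-class union described above; k=20 and the variable names are assumptions):

```python
# Chi-square feature selection -- a simplified sketch.
# chi2 requires non-negative features (e.g. one-hot encoded dummies).
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=20)
X_selected = selector.fit_transform(X_train, y_train)

print(selector.get_support(indices=True))  # indices of the retained features
```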

4 - Combining Over-sampling and Under-sampling

  • It is often better to combine over-sampling and under-sampling.
  • First apply over-sampling to the minority class labels (for example, by 50%), and then apply under-sampling to the majority class labels (by 20 or 30%).
  • By doing so, we do not lose a large share of the data points; we only drop around 20 or 30% of them (see the sketch below).
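Here is a minimal sketch of that combination using imbalanced-learn (the exact ratios are illustrative, echoing the 50% / 20-30% idea above; X_train and y_train are assumed to exist):

```python
# Combine over-sampling and under-sampling in one pipeline.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Over-sample the minority class until it is 50% the size of the majority class,
# then under-sample the majority class until the minority:majority ratio is 0.8.
over = RandomOverSampler(sampling_strategy=0.5, random_state=42)
under = RandomUnderSampler(sampling_strategy=0.8, random_state=42)
pipeline = Pipeline(steps=[("over", over), ("under", under)])

X_res, y_res = pipeline.fit_resample(X_train, y_train)
print(Counter(y_res))
```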

Data

The dataset comes from the UCI Machine Learning repository, and it is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe (1/0) to a term deposit (variable y).

source: https://raw.githubusercontent.com/madmashup/targeted-marketing-predictive-engine/master/banking.csv
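A quick sketch for loading the data with pandas (the URL is the one above; dropping missing rows is an assumption made for simplicity):

```python
import pandas as pd

url = ("https://raw.githubusercontent.com/madmashup/"
       "targeted-marketing-predictive-engine/master/banking.csv")
df = pd.read_csv(url, header=0).dropna()

print(df.shape)
print(list(df.columns))
```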

Data exploration
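The class balance can be checked with a few lines (a sketch, assuming the dataframe df loaded above, with the target stored as 0/1 in column y):

```python
# Percentage of each target class
count_no_sub = len(df[df["y"] == 0])
count_sub = len(df[df["y"] == 1])
total = count_no_sub + count_sub

print("percentage of no subscription is", count_no_sub / total * 100)
print("percentage of subscription", count_sub / total * 100)
```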


percentage of no subscription is 88.73458288821988

percentage of subscription 11.265417111780131

Our classes are imbalanced, and the ratio of no-subscription to subscription instances is 89:11. Before we go ahead to balance the classes, let’s do some more exploration.

Collinearity
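The original post showed an image here; one common way to inspect collinearity is a correlation heatmap. The sketch below is illustrative and not necessarily the exact figure from the original article (it assumes seaborn and matplotlib are installed):

```python
# Correlation heatmap over the numeric features -- an illustrative sketch.
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.select_dtypes(include="number").corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="coolwarm")
plt.title("Correlation matrix of numeric features")
plt.show()
```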

Fit the model using Logistic Regression
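The original code was shown as an image; here is a baseline sketch under my own assumptions (numeric features only, a 70/30 split, and default hyperparameters apart from max_iter):

```python
# Baseline logistic regression on the still-imbalanced data -- a sketch.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = df.drop("y", axis=1).select_dtypes(include="number")
y = df["y"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, logreg.predict(X_test)))
```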


We can see about 91% accuracy. We are getting such high accuracy only because the model is mostly predicting the majority class, 0 (no subscription).

Create dummy variables

That is, variables with only two values: zero and one.
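A sketch with pandas get_dummies (the list of categorical columns is an assumption based on the UCI banking dataset):

```python
# One-hot encode the categorical columns into 0/1 dummy variables -- a sketch.
import pandas as pd

cat_vars = ["job", "marital", "education", "default", "housing", "loan",
            "contact", "month", "day_of_week", "poutcome"]
df_dummies = pd.get_dummies(df, columns=cat_vars)
print(df_dummies.shape)
```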


Over-sampling using SMOTE

One approach to addressing imbalanced datasets is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, although these examples don't add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.

With our training data created, I'll up-sample the minority class (subscription) using the SMOTE algorithm (Synthetic Minority Oversampling Technique). At a high level, SMOTE:

  1. Works by creating synthetic samples from the minority class (subscription) instead of creating copies.
  2. Randomly chooses one of the k nearest neighbours and uses it to create a similar, but randomly tweaked, new observation. We are going to implement SMOTE in Python.
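A sketch of the SMOTE step with imbalanced-learn (variable names follow the split above, and the resampling is applied to the training data only, as discussed below):

```python
# SMOTE: synthesize new minority-class examples on the training data only.
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=0)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

print("Before SMOTE:", Counter(y_train))
print("After SMOTE: ", Counter(y_train_sm))
```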

Now we have perfectly balanced data! You may have noticed that I over-sampled only the training data: because the synthetic observations are created from the training set alone, none of the information in the test data is used, and therefore no information bleeds from the test data into model training.

Under-sampling: Tomek links

Tomek links are pairs of very close instances of opposite classes. Removing the majority-class instance of each pair increases the space between the two classes, facilitating the classification process. A Tomek link exists if the two samples are each other's nearest neighbours.
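A minimal sketch of Tomek-link removal with imbalanced-learn (same variable-name assumptions as before):

```python
# Remove Tomek links: drop majority-class members of cross-class nearest-neighbour pairs.
from collections import Counter
from imblearn.under_sampling import TomekLinks

tl = TomekLinks()
X_train_tl, y_train_tl = tl.fit_resample(X_train, y_train)

print("Before:", Counter(y_train))
print("After: ", Counter(y_train_tl))
```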

Combining Oversample and Undersample

Next, the dataset is transformed, first by oversampling the minority class, then undersampling the majority class.
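One convenient way to do this, given the two previous sections, is imbalanced-learn's SMOTETomek, which applies SMOTE and then cleans the result with Tomek links. This is my own choice of implementation, sketched under the same variable-name assumptions:

```python
# SMOTE over-sampling followed by Tomek-link cleaning -- a sketch.
from collections import Counter
from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=0)
X_train_st, y_train_st = smt.fit_resample(X_train, y_train)
print(Counter(y_train_st))
```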


Recursive Feature Elimination

Recursive Feature Elimination (RFE) is based on the idea of repeatedly constructing a model, choosing either the best or worst performing feature, setting that feature aside, and then repeating the process with the remaining features. This process is applied until all features in the dataset are exhausted. The goal of RFE is to select features by recursively considering smaller and smaller sets of features.
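A sketch of RFE with a logistic regression estimator (selecting 20 features to match the list reported below; it assumes X_train now holds the dummy-encoded features as a DataFrame):

```python
# Recursive Feature Elimination with logistic regression -- a sketch.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=20)
rfe = rfe.fit(X_train, y_train)

print(list(X_train.columns[rfe.support_]))  # names of the selected features
```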

The RFE has helped us select the following features: “euribor3m”, “job_blue-collar”, “job_housemaid”, “marital_unknown”, “education_illiterate”, “default_no”, “default_unknown”, “contact_cellular”, “contact_telephone”, “month_apr”, “month_aug”, “month_dec”, “month_jul”, “month_jun”, “month_mar”, “month_may”, “month_nov”, “month_oct”, “poutcome_failure”, “poutcome_success”.

Penalize Algorithms (Cost-Sensitive Training)

The next tactic is to use penalized learning algorithms that increase the cost of classification mistakes on the minority class. A popular algorithm for this technique is penalized SVM. During training, we can use the argument class_weight='balanced' to penalize mistakes on the minority class by an amount proportional to how under-represented it is. We also want to include the argument probability=True if we want to enable probability estimates for SVM algorithms.

Let’s train a model using Penalized-SVM on the original imbalanced dataset:
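A sketch with scikit-learn's SVC (a linear kernel and the split from earlier are my assumptions; on a dataset of this size the fit can be slow):

```python
# Cost-sensitive (penalized) SVM on the original imbalanced training data -- a sketch.
from sklearn.svm import SVC
from sklearn.metrics import classification_report

svc = SVC(kernel="linear", class_weight="balanced", probability=True, random_state=0)
svc.fit(X_train, y_train)

print(classification_report(y_test, svc.predict(X_test)))
```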

Random Forests algorithm

While it is a good rule of thumb to try a variety of algorithms on every machine learning problem, doing so can be especially beneficial with imbalanced datasets. Decision trees frequently perform well on imbalanced data, and in modern machine learning tree ensembles (Random Forests, Gradient Boosted Trees, etc.) almost always outperform single decision trees, so we'll jump right into those. Tree-based algorithms work by learning a hierarchy of if/else questions, which can force both classes to be addressed.
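A minimal Random Forest sketch under the same assumptions (100 trees is an arbitrary illustrative choice):

```python
# Random Forest on the training data -- a sketch.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

print(classification_report(y_test, rf.predict(X_test)))
```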


Below is a list of example problem domains where the class distribution of examples is inherently imbalanced.

  1. Fraud detection
  2. Claim prediction
  3. Churn prediction
  4. Spam detection
  5. Anomaly detection
  6. Outlier detection
  7. Intrusion detection
  8. Conversion prediction
  9. Road traffic prediction

Conclusion

To summarize, in this article we have seen various techniques to handle class imbalance in a dataset. There are many more methods to try when dealing with imbalanced data. I hope this article was useful; if so, please share and like it.

Thanks for reading…!

Ali Mirzaei
