Imbalanced classification
Welcome!
I'm Ali Mirzaei, a data scientist, and I help developers get results with machine learning.
Imbalanced classification involves developing predictive models on classification datasets that have a severe class imbalance. The challenge of working with imbalanced datasets is that most machine learning techniques will ignore, and in turn perform poorly on, the minority class, even though it is typically performance on the minority class that matters most.
Before we proceed to the topic, let's define the terms. A dataset is balanced when the target classes A and B occur in roughly even proportions, say a 50:50 or 60:40 ratio. When classes A and B occur in an 80:20 or 90:10 ratio, the dataset is considered imbalanced.
If we train on such a dataset as-is, the model will be biased toward the majority class. To avoid this situation, we try to resample the dataset.
Problem with an Imbalanced Datasets
Let’s say you are working at a leading tech company, and the company gives you the task of training a model to detect fraudulent transactions. But here’s the catch: fraudulent transactions are relatively rare. You train your model and get over 95% accuracy. Feeling good, you present your model to the company’s CEO and shareholders. But when they feed inputs to your model, it predicts “Not a Fraud Transaction” every time. This is clearly a problem, and it arises because many machine learning algorithms are designed to maximize overall accuracy.
Now what happened? You got 95% accuracy, yet your model is wrong on every fraud case?
Let’s find out why.
The 3 Mistakes Made By Beginners
Imbalanced classification problems look like normal classification problems.
As such, beginners wander in and start using their normal techniques. It may even look like they are getting good results, but they are falling into the most common trap (one that you really want to avoid)!
The common mistakes that beginners make when working on imbalanced classification problems are as follows:
1. They Use Classification Accuracy
Beginners will use classification accuracy to estimate performance.
Accuracy is dangerously misleading.
If 99% of examples in a dataset belong to one class, a model that always predicts that class will achieve a classification accuracy of 99%. This looks good to a beginner, but it is in fact worst-case performance on the minority class.
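A minimal sketch of this trap, using a synthetic 99:1 dataset (the numbers here are illustrative, not from any real data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# 1,000 examples: 99% class 0, 1% class 1.
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))  # features are irrelevant to the point being made

# A "model" that always predicts the majority class.
majority = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = majority.predict(X)

print(accuracy_score(y, y_pred))  # 0.99 -- looks great
print(f1_score(y, y_pred))        # 0.0  -- useless on the minority class
```

The F1 score, which accounts for the minority class, exposes what accuracy hides.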
2. They Fit Models on Raw Data
Beginners will fit standard models on raw data.
Fitting on raw data will result in terrible performance.
If 99% of examples in a dataset belong to one class, then standard models fit on this dataset would focus attention on the majority class at the expense of the minority class.
3. They Use Standard Algorithms
Beginners will use standard machine learning algorithms.
Standard algorithms treat all classification errors as the same.
If 99% of examples in a dataset belong to one class, then misclassification errors for the minority class should be a lot more important to the model than misclassification errors for the majority class.
- What is Sampling?
Sampling means increasing the number of minority-class records or deleting majority-class records in order to turn the dataset into a balanced one. Sampling can be applied to both binary and multiclass classification problems.
Techniques to Convert Imbalanced Dataset into Balanced Dataset
Imbalanced data is not always a bad thing, and in real data sets, there is always some degree of imbalance. That said, there should not be any big impact on your model performance if the level of imbalance is relatively low. Now, let’s cover a few techniques to solve the class imbalance problem.
It is a challenging problem in general, especially if little is known about the dataset, as there are tens, if not hundreds, of machine learning algorithms to choose from. The problem is made significantly more difficult if the distribution of examples across the classes is imbalanced. This requires the use of specialized methods to either change the dataset or change the learning algorithm to handle the skewed class distribution.
A common way to deal with the overwhelm on a new classification project is to reach for a favorite machine learning algorithm like Random Forest, or a favorite sampling technique like SMOTE. Another common approach is to scour the research literature for descriptions of vaguely similar problems and attempt to re-implement the algorithms and configurations that are described.
These approaches can be effective, although they are hit-or-miss and time-consuming respectively. Instead, the shortest path to a good result on a new classification task is to systematically evaluate a suite of machine learning algorithms in order to discover what works well, then double down. This approach can also be used for imbalanced classification problems, tailored for the range of data sampling, cost-sensitive, and one-class classification algorithms that one may choose from.
In this article, I will focus only on data sampling algorithms.
1- Over-sampling (Up Sampling): This technique balances unequal classes by increasing the number of minority-class members in the training set. When the quantity of minority data is insufficient, oversampling tries to balance the dataset by incrementing the number of rare samples. The advantage of over-sampling is that no information from the original training set is lost, as all observations from the minority and majority classes are kept. On the other hand, it is prone to overfitting.
Advantages
- No loss of information, since all original observations are kept.
Disadvantages
- Overfitting, since minority examples are duplicated.
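A minimal sketch of random oversampling using scikit-learn's `resample` utility (the toy 90:10 dataset below is illustrative):

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced dataset: 90 majority (0) rows vs 10 minority (1) rows.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_min, X_maj = X[y == 1], X[y == 0]

# Duplicate minority rows (sampling with replacement) until the classes match.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(np.bincount(y_bal))  # [90 90]
```

Because the minority rows are exact copies, a model can memorize them, which is where the overfitting risk comes from.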
2- Under-sampling (Down Sampling): Unlike oversampling, this technique balances an imbalanced dataset by reducing the size of the class that is in abundance, i.e., the number of majority samples. There are various methods for this, such as cluster centroids and Tomek links. The cluster-centroid method replaces a cluster of majority samples with the centroid found by a K-means algorithm, and the Tomek-link method removes unwanted overlap between classes until all minimally distanced nearest neighbors belong to the same class. Since under-sampling removes observations from the original dataset, it might discard useful information.
Advantages
- Run-time can be improved by decreasing the size of the training dataset.
- Helps solve memory problems.
Disadvantages
- We may lose some critical information.
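The mirror image of the oversampling sketch above: random undersampling with `resample`, again on an illustrative toy dataset:

```python
import numpy as np
from sklearn.utils import resample

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_min, X_maj = X[y == 1], X[y == 0]

# Keep only as many majority rows as there are minority rows.
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min), random_state=42)

X_bal = np.vstack([X_maj_down, X_min])
y_bal = np.array([0] * len(X_maj_down) + [1] * len(X_min))
print(np.bincount(y_bal))  # [10 10]
```

Here 80 of the 90 majority rows are thrown away, which illustrates the information-loss disadvantage directly.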
3- Feature selection: In order to tackle the imbalance problem, we calculate a one-sided metric such as the correlation coefficient (CC) or odds ratio (OR), or a two-sided metric such as information gain (IG) or chi-square (CHI), on both the positive class and the negative class. Based on the scores, we then identify the significant features from each class and take the union of these features to obtain the final feature set. Then, we use this data to classify the problem.
Identifying these features will help us generate a clear decision boundary with respect to each class. This helps the models classify the data more accurately. It performs the function of intelligent subsampling and potentially helps reduce the imbalance problem.
4- Combining Oversampling and Undersampling
- It is often better to combine oversampling and undersampling.
- First apply oversampling to the minority class (say, by 50%), and then apply undersampling to the majority class (by 20 or 30%).
- By doing so, we do not lose major portions of the data; we drop only 20 or 30% of the majority-class datapoints.
Data
The dataset comes from the UCI Machine Learning repository, and it is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe (1/0) to a term deposit (variable y).
source: https://raw.githubusercontent.com/madmashup/targeted-marketing-predictive-engine/master/banking.csv
Data exploration
percentage of no subscription: 88.73%
percentage of subscription: 11.27%
Our classes are imbalanced, and the ratio of no-subscription to subscription instances is 89:11. Before we go ahead to balance the classes, let’s do some more exploration.
Collinearity
Fit the model using Logistic Regression
We get 91% accuracy. We are getting very high accuracy because the model mostly predicts the majority class, 0 (no subscription).
Create dummy variables
That is, variables with only two values, zero and one.
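A small sketch of how this looks with pandas (the `marital` column and its values are illustrative examples of the kind of categorical features in the banking data):

```python
import pandas as pd

df = pd.DataFrame({"marital": ["married", "single", "married", "divorced"]})

# Each category becomes its own 0/1 column.
dummies = pd.get_dummies(df, columns=["marital"])
print(dummies.columns.tolist())
# ['marital_divorced', 'marital_married', 'marital_single']
```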
Over-sampling using SMOTE
One approach to addressing imbalanced datasets is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, although these examples don’t add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.
With our training data created, I’ll up-sample the minority class (subscription) using the SMOTE algorithm (Synthetic Minority Oversampling Technique). At a high level, SMOTE:
- Works by creating synthetic samples from the minority class (subscription) instead of creating copies.
- Randomly chooses one of the k nearest neighbors of a minority point and uses it to create a similar, but randomly tweaked, new observation.
We are going to implement SMOTE in Python.
Now we have a perfectly balanced dataset! You may have noticed that I over-sampled only the training data: by oversampling only the training data, none of the information in the test data is used to create synthetic observations, so no information bleeds from the test data into model training.
Under-sampling: Tomek links
Tomek links are pairs of very close instances of opposite classes. Removing the majority-class instance of each pair increases the space between the two classes, facilitating the classification process. A Tomek link exists if the two samples are each other’s nearest neighbors.
- Combining Oversampling and Undersampling
Next, the dataset is transformed, first by oversampling the minority class, then undersampling the majority class.
Recursive Feature Elimination
Recursive Feature Elimination (RFE) is based on the idea to repeatedly construct a model and choose either the best or worst performing feature, setting the feature aside and then repeating the process with the rest of the features. This process is applied until all features in the dataset are exhausted. The goal of RFE is to select features by recursively considering smaller and smaller sets of features.
The RFE has helped us select the following features: “euribor3m”, “job_blue-collar”, “job_housemaid”, “marital_unknown”, “education_illiterate”, “default_no”, “default_unknown”, “contact_cellular”, “contact_telephone”, “month_apr”, “month_aug”, “month_dec”, “month_jul”, “month_jun”, “month_mar”, “month_may”, “month_nov”, “month_oct”, “poutcome_failure”, “poutcome_success”.
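As a sketch of how RFE is typically run with scikit-learn (the synthetic dataset and the choice of 10 features are illustrative, not the article's actual configuration, which selected the 20 features listed above):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the banking data: 20 candidate features.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           weights=[0.89, 0.11], random_state=7)

# Repeatedly fit, drop the weakest feature, refit -- until 10 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X, y)

print(rfe.support_.sum())  # 10 features kept
```

`rfe.support_` is a boolean mask over the columns; on a DataFrame you would index the column names with it to get the surviving feature list.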
Penalize Algorithms (Cost-Sensitive Training)
The next tactic is to use penalized learning algorithms, which increase the cost of classification mistakes on the minority class. A popular algorithm for this technique is Penalized-SVM. During training, we can use the argument class_weight='balanced' to penalize mistakes on the minority class by an amount proportional to how under-represented it is. We also want to include the argument probability=True if we want to enable probability estimates for SVM algorithms.
Let’s train a model using Penalized-SVM on the original imbalanced dataset:
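A sketch with scikit-learn's `SVC` on a synthetic imbalanced dataset (the data and hyperparameters are illustrative; only `class_weight` and `probability` are the arguments named above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)

# class_weight='balanced' scales the error penalty inversely to class frequency;
# probability=True enables predict_proba (at some extra training cost).
svm = SVC(kernel="linear", class_weight="balanced", probability=True,
          random_state=3)
svm.fit(X_tr, y_tr)

proba = svm.predict_proba(X_te)
print(f1_score(y_te, svm.predict(X_te)))
```

Note that no resampling is needed here: the imbalance is handled inside the loss function rather than in the data.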
Random Forests algorithm
While in every machine learning problem it’s a good rule of thumb to try a variety of algorithms, doing so can be especially beneficial with imbalanced datasets. Decision trees frequently perform well on imbalanced data. In modern machine learning, tree ensembles (Random Forests, Gradient Boosted Trees, etc.) almost always outperform single decision trees, so we’ll jump right into those. Tree-based algorithms work by learning a hierarchy of if/else questions, which can force both classes to be addressed.
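A sketch of a Random Forest on a synthetic imbalanced dataset (illustrative data; `class_weight='balanced'` is an optional extra that re-weights the splits toward the minority class, as with the SVM above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=2000, weights=[0.89, 0.11], random_state=9)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=9)

# An ensemble of trees; class_weight shifts attention to the minority class.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                            random_state=9)
rf.fit(X_tr, y_tr)
print(classification_report(y_te, rf.predict(X_te)))
```

The per-class precision and recall in the report are far more informative here than overall accuracy.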
Below is a list of nine examples of problem domains where the class distribution of examples is inherently imbalanced.
- Fraud Detection
- Claim Prediction
- Churn Prediction
- Spam Detection
- Anomaly Detection
- Outlier Detection
- Intrusion Detection
- Conversion Prediction
- Road Traffic Prediction
Conclusion
To summarize, in this article we have seen various techniques to handle class imbalance in a dataset. There are many more methods to try when dealing with imbalanced data. I hope this article was useful; if so, please share and like it.
Thanks for reading…!
Ali Mirzaei
Great article, Ali Mirzaei! My two cents: 1) I'm not a big fan of over-sampling. In my experience, it's best to under-sample (if possible) or use cost-sensitive training instead. 2) you didn't mention it but class imbalance can make machine learning models less fair towards underrepresented classes which is one of the reasons practitioners should be extra careful with this sort of problem.