Classification Models on imbalanced data sets

One of the key challenges we come across while building machine learning models is a highly skewed class distribution, a scenario we quite often face in text classification problems. Taking a binary classification problem as an example, let's consider classifying the text in a PDF or Word document as 'Legal' language or not. At the sentence level, we often end up in a scenario where more than 97% of the sentences are Non-Legal and fewer than 3% are Legal. We have to deal with this skewed distribution of the target class variable so that the classification model we build has high precision in its class predictions, high recall in classifying the actual Legal sentences as Legal, and, finally, generalises well to unseen data sets in production.

There are many approaches, such as up-sampling, down-sampling, SMOTE, and Tomek-link (T-Link) methods, available to treat class imbalance. In this article I would like to discuss one down-sampling method which yielded consistent results and a steady increase in % recall with continuous re-training of the model.


Using random sampling with replacement, down-sample the majority class (class 0 in our example) to the value_counts of class 1. Perform this down-sampling for S sample sets (S is 5 in this example). After merging each class 0 sample with the class 1 data, build a separate machine learning model for each of these sets. Each set will of course go through the process of model building, model tuning, model evaluation and testing, cross-validation, etc. before we finalise the models.
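The down-sampling step above can be sketched as follows. This is a minimal illustration on toy data, assuming a pandas DataFrame with a 0/1 `label` column; the column names and imbalance ratio are placeholders, not the author's actual data.

```python
import numpy as np
import pandas as pd

# Toy imbalanced data set: roughly 97% class 0 (Non-Legal), 3% class 1 (Legal).
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "feature": rng.normal(size=n),
    "label": (rng.random(n) < 0.03).astype(int),
})

S = 5  # number of down-sampled training sets
minority = df[df["label"] == 1]
majority = df[df["label"] == 0]

sample_sets = []
for s in range(S):
    # Random sampling with replacement from the majority class,
    # matched to the minority class size (value_counts of class 1).
    majority_sample = majority.sample(n=len(minority), replace=True,
                                      random_state=s)
    # Merge with the full class 1 data and shuffle the rows.
    balanced = pd.concat([majority_sample, minority]).sample(frac=1.0,
                                                             random_state=s)
    sample_sets.append(balanced)
```

Each of the S balanced sets then feeds its own model-building pipeline.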

Week-wise % recall improvement

Each of these models has learnt different information about class 0, and all of the information about class 1, from the data set. Based on what it has learnt, each model will answer differently as to whether or not an input sentence is Legal. By stacking all of these models behind a voting method, we arrive at the final prediction by majority vote. This method makes the model very robust in the production scenario and helps continuously increase % recall and precision as the model sees more and more data during the re-training process. The actual model performance is shown in the week-wise % recall improvement chart.
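The hard majority vote described above can be sketched in a few lines. This assumes each of the S models has already emitted a 0/1 prediction per sentence; the example predictions are made up for illustration.

```python
import numpy as np

def majority_vote(predictions):
    """Combine 0/1 votes from S models; majority decides Legal vs Non-Legal."""
    votes = np.asarray(predictions)          # shape (S, n_sentences)
    # A sentence is labelled 1 when strictly more than half the models vote 1.
    return (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)

# Example: 5 models (one per down-sampled set) voting on 4 sentences.
preds = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 0, 1, 0],
]
print(majority_vote(preds))  # → [1 0 1 0]
```

With an odd S (5 here) a strict majority always exists, so no tie-breaking rule is needed.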

This is one of the methods that has consistently provided reliable results across different text classification use cases. Interestingly, a recent scikit-learn release includes a StackingClassifier, which is similar to the approach mentioned above. You can refer to the details here: https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.StackingClassifier.html
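For reference, a minimal StackingClassifier sketch looks like the following. Note that StackingClassifier trains a meta-learner on the base models' outputs rather than taking a plain majority vote, and the base estimators chosen here are illustrative, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic imbalanced data standing in for the Legal / Non-Legal sentences.
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svc", LinearSVC(random_state=0)),
    ],
    # Meta-learner that combines the base estimators' predictions.
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
predictions = stack.predict(X[:5])
```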

Many a time we start with little data in hand and little chance of collecting more, which limits our ability to explore more advanced methods like deep neural networks with LSTM or CNN layers. Customised approaches like this one, and the power of randomness, definitely help build more generalised models that run in production with very high accuracy. I hope this article was useful, and I look forward to any further inputs or comments.

