Classification Models on imbalanced data sets

One of the key challenges we come across while building machine learning models is a highly skewed class distribution, a scenario we quite often face in text classification problems. Taking a binary classification problem as an example, let's consider classifying the text in a PDF or Word document as 'Legal' language or not. At the sentence level, we often end up in a scenario where more than 97% of the sentences are Non-Legal and fewer than 3% are Legal. We have to deal with this skewed distribution of the target class variable so that the classification model we build has high precision in its class predictions, high recall in classifying the actual Legal sentences as Legal, and, finally, generalises well to unseen data sets in production.

There are many approaches, such as up-sampling, down-sampling, SMOTE, and Tomek-link (T-Link) methods, available to treat class imbalance. In this article I would like to discuss one down-sampling method which yielded consistent results and a steady increase in % recall with continuous re-training of the model.


Using random sampling with replacement, down-sample the majority class (class 0 in our example) to the value_counts of class 1. Perform this down-sampling for S sample sets (S is 5 in this example). After merging each class 0 sample with the class 1 data, build a separate machine learning model for each of these sets. Each set will of course go through the process of model building, model tuning, model evaluation and testing, cross-validation, etc. before we finalise the models.
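The down-sampling step above can be sketched as follows. This is a minimal illustration on toy data, assuming a pandas DataFrame with a 0/1 `label` column; the column names and imbalance ratio are placeholders, not the author's actual data.

```python
import numpy as np
import pandas as pd

# Toy imbalanced data set: roughly 97% class 0 (Non-Legal), 3% class 1 (Legal).
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "feature": rng.normal(size=n),
    "label": (rng.random(n) < 0.03).astype(int),
})

S = 5  # number of down-sampled training sets
minority = df[df["label"] == 1]
majority = df[df["label"] == 0]

sample_sets = []
for s in range(S):
    # Random sampling with replacement from the majority class,
    # matched to the minority class size (value_counts of class 1).
    majority_sample = majority.sample(n=len(minority), replace=True,
                                      random_state=s)
    # Merge with the full class 1 data and shuffle the rows.
    balanced = pd.concat([majority_sample, minority]).sample(frac=1.0,
                                                             random_state=s)
    sample_sets.append(balanced)
```

Each of the S balanced sets then feeds its own model-building pipeline.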

Week-wise % recall improvement

Each of these models has learnt different information about class 0, and all of the information about class 1, from the data set. Based on what it has learnt, each model will answer differently as to whether or not an input sentence is Legal. By stacking all of these models behind a voting method, we arrive at the final prediction by majority vote. This method makes the model very robust in the production scenario and helps continuously increase % recall and precision as the model sees more and more data during the re-training process. The actual model performance is shown in the week-wise % recall improvement chart.
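The hard majority vote described above can be sketched in a few lines. This assumes each of the S models has already emitted a 0/1 prediction per sentence; the example predictions are made up for illustration.

```python
import numpy as np

def majority_vote(predictions):
    """Combine 0/1 votes from S models; majority decides Legal vs Non-Legal."""
    votes = np.asarray(predictions)          # shape (S, n_sentences)
    # A sentence is labelled 1 when strictly more than half the models vote 1.
    return (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)

# Example: 5 models (one per down-sampled set) voting on 4 sentences.
preds = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 0, 1, 0],
]
print(majority_vote(preds))  # → [1 0 1 0]
```

With an odd S (5 here) a strict majority always exists, so no tie-breaking rule is needed.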

This is one of the methods that has consistently provided reliable results across different text classification use cases. Interestingly, a recent scikit-learn release includes a StackingClassifier, which is similar to the approach mentioned above. You can refer to the details here: https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.StackingClassifier.html
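For reference, a minimal StackingClassifier sketch looks like the following. Note that StackingClassifier trains a meta-learner on the base models' outputs rather than taking a plain majority vote, and the base estimators chosen here are illustrative, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic imbalanced data standing in for the Legal / Non-Legal sentences.
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svc", LinearSVC(random_state=0)),
    ],
    # Meta-learner that combines the base estimators' predictions.
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
predictions = stack.predict(X[:5])
```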

Many a time we start with little data in hand and little chance of collecting more, which limits our ability to explore more advanced methods like deep neural networks with LSTM or CNN layers. Customised approaches like this one, and the power of randomness, definitely help build more generalised models that run in production with very high accuracy. I hope this article was useful, and I look forward to any further inputs or comments.

