Handling Data Imbalance
Data imbalance is a situation we data scientists face on a daily basis. It is a scenario where one class is heavily over- or under-represented compared to its counterparts in the data. We hardly ever see datasets with almost equally distributed classes. Let's take an example: say we are solving a classic binary classification problem and are trying to identify 0s and 1s. While going through the data, there can roughly be 3 scenarios:
- Equally distributed, i.e. 1s make up ~40-60% of the data
- Slightly imbalanced, i.e. 1s make up ~20-30% of the data
- Highly imbalanced, i.e. 1s make up ~5-15% of the data
Of the cases listed above, the first one is never a problem; in fact, it's the best-case scenario. The second case can often be tackled by training for a longer time or by using boosting algorithms like AdaBoost, Gradient Boosting or XGBoost. The third one, i.e. the highly imbalanced class problem, is difficult to handle. If we train longer, i.e. increase the number of epochs, we'll see our accuracy increase and the loss drop steadily, but when we check the confusion matrix, we notice that our model performs poorly on the minority class. The accuracy increases because the model gets better at classifying (fitting, or even overfitting) the majority class, which makes up 85-95% of the data, while ignoring the minority class. Boosting algorithms can offer only limited help in the highly imbalanced case.
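To make the accuracy trap above concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (950 zeros and 50 ones), and the "model" is a hypothetical stand-in that always predicts the majority class. It is not a real trained model, only the extreme case of what the paragraph above describes.

```python
# Hypothetical toy labels: 950 zeros (majority) and 50 ones (minority).
y_true = [0] * 950 + [1] * 50

# Stand-in "model" that always predicts the majority class.
y_pred = [0] * len(y_true)


def confusion_counts(actual, predicted):
    """Return (tn, fp, fn, tp) for binary labels, treating 1 as the positive class."""
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    return tn, fp, fn, tp


tn, fp, fn, tp = confusion_counts(y_true, y_pred)

# Accuracy looks healthy because the majority class dominates the count.
accuracy = (tp + tn) / len(y_true)

# Recall on the minority class shows what the confusion matrix reveals.
minority_recall = tp / (tp + fn)

print("confusion counts (tn, fp, fn, tp):", (tn, fp, fn, tp))
print("accuracy:", accuracy)
print("minority recall:", minority_recall)
```

Accuracy comes out at 95% even though not a single 1 is identified, which is why the confusion matrix (or per-class recall) is the number to watch on imbalanced data.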
A solution for such a scenario can be obtained using the Python library imbalanced-learn (imported as imblearn). This library offers a variety of oversampling and undersampling techniques. This article focusses on oversampling techniques, which avoid the data loss that comes with undersampling. There are 3 main algorithms in this library that I recommend as a good place to start and that will yield faster results:
- Random Oversampling
- Synthetic Minority Oversampling Technique (SMOTE & Borderline SMOTE)
- ADASYN
What follows is a crash course, based on research and various online sources, on these techniques (please ignore typos). I hope it proves useful for your next ML application. Towards the end of this article, I've shared links to the papers on these techniques for a deeper understanding.
External links to the papers of respective algorithms:
If you are not able to read any of the papers, copy the links and paste them here