"How Much Does Machine Learning Depend on Bias in the Data?" - An Example
I recently participated in a Kaggle NLP competition called Tweet Sentiment Extraction. The aim was to build a model that, given a tweet and its labeled sentiment (positive, negative, or neutral), extracts the word or phrase that best supports that sentiment.
Since it was tweet data, it contained a lot of noise (slang, words with repeated characters like sleeeeepppyyy, happppy, etc.). I suspected that words with repeated characters were hurting the model significantly: they are usually highly sentimental, yet every time the model encountered one, it was out of vocabulary.
So I decided to build something that could turn such words into meaningful ones by removing the extra repeated characters. I used the nltk library to decide whether to keep two repeated characters or only one, by checking which variant appears in the English dictionary. With that, I built a tool that converts each data instance into a meaningful sentence, maintains that meaningfulness throughout the model, and traces predictions back to their original form.
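The original kernel is linked below; as a rough sketch of the idea, the converter can be reconstructed like this. All names here (`reduce_repeats`, `clean_sentence`) are hypothetical, and instead of nltk's English word list the functions take any dictionary set, so the snippet stays self-contained (in the actual kernel, `set(nltk.corpus.words.words())` would play that role).

```python
import re
from itertools import product

def reduce_repeats(word, dictionary):
    """Collapse each run of repeated characters to 2 or 1 copies,
    returning the first variant found in `dictionary` (else all singles)."""
    runs = list(re.finditer(r"(.)\1+", word))
    if not runs:
        return word
    # Try every keep-2 / keep-1 combination across the runs,
    # preferring 2 copies first (e.g. "happppy" -> "happy").
    for counts in product((2, 1), repeat=len(runs)):
        out, pos = [], 0
        for run, n in zip(runs, counts):
            out.append(word[pos:run.start()])
            out.append(run.group(1) * n)
            pos = run.end()
        out.append(word[pos:])
        candidate = "".join(out)
        if candidate in dictionary:
            return candidate
    # Fallback: collapse every run to a single character.
    return re.sub(r"(.)\1+", r"\1", word)

def clean_sentence(sentence, dictionary):
    """Clean each token and keep a cleaned->original map,
    so predictions on the cleaned text can be traced back."""
    cleaned, mapping = [], {}
    for tok in sentence.split():
        new = reduce_repeats(tok, dictionary)
        cleaned.append(new)
        mapping[new] = tok
    return " ".join(cleaned), mapping
```

For example, with `{"so", "sleepy"}` as the dictionary, `clean_sentence("soooo sleeeeepppyyy", ...)` yields `"so sleepy"` plus a map that recovers the original spellings from any predicted span.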
The conversions were quite meaningful, and I was fairly confident this would give me a good lead.
But, but, but! When I tried it, my score decreased. I was confused; this should not have happened. So I dug deeper into the dataset and found something really insightful.
The dataset was labeled in such a way that the selected phrases never contained words with repeated characters, even when such a word was highly positive or negative. So before I created the converter, the model was effectively ignoring those words because they were out of vocabulary. After the converter made them meaningful, the model got more confused: those words were highly sentimental, yet according to the labels they were never supposed to be selected.
That shows the importance of labels and data. While designing an ML algorithm, it is crucial to know in which direction the data and labels are biased. I wasted almost a day and a half, but ultimately I learned something :)
Below is the link to the Kaggle kernel that contains the converter mentioned above.
- Special thanks to Abhishek Thakur, whose kernels and videos taught me a lot, and to Bhoomit Vasani for constant support and guidance.
P.S.: I was about to land in bronze (top 10%) on the public leaderboard, but the private leaderboard threw me down to 30% 😓.