ML Nugget #4: Data Quality vs. Quantity
A common mantra in machine learning is that “more data is always better.” And it’s true that large datasets have fueled much of the progress in modern AI — from ImageNet powering the deep learning revolution to trillion-token corpora driving today’s large language models. But there’s a catch: not all data is created equal. Bigger isn’t always better.
The quality of your data often matters as much as — and sometimes more than — the sheer quantity. In fact, poor-quality, irrelevant, or mislabeled data can actively hurt performance by distracting the model from the true signal. At the same time, “low-quality” data isn’t always bad: if designed carefully (as in data augmentation), it can make models more robust to the messy conditions they will face in the real world.
So the nuance is this: some noise is harmful and some is helpful, and telling the two apart is crucial for both traditional ML and modern deep learning.
Experiments
To make this point concrete, we ran a series of simple experiments:
1. Label Noise: When labels were corrupted in a synthetic classification task, performance dropped sharply. More data didn’t help, since contradictions in labels overwhelmed the model. This mirrors real-world annotation problems (e.g., medical imaging or weakly labeled web data). See Figure 1 for more details on the experiment and the results.
2. Relevance: We built two sentiment classifiers: one on a small set of movie reviews, and another on a much larger set of financial news. Tested on movie sentiment, the small-but-relevant model clearly outperformed the large irrelevant one. Relevance beats raw scale. Figure 2 compares performance and the top features for both scenarios: alongside common sentiment words, each domain has its own vocabulary, which the out-of-domain model never gets a chance to learn.
3. Data Noise (Augmentation vs. Degradation): Starting from 5,000 clean MNIST digits, we added 10,000 mildly corrupted copies (blur and slight distortions). Surprisingly, performance improved compared to clean-only training. This is the principle of data augmentation: noise that preserves labels makes models more robust. But when we increased the corruption severity (heavy occlusion, extreme noise), performance dropped.
4. Garbage Data: Finally, we mixed the clean MNIST digits with 10,000 random noise images assigned random labels. Accuracy collapsed compared to training on the 5,000 clean digits alone. More data made the model worse by drowning the signal in irrelevant junk.
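Experiments 1 and 4 can be sketched in a few lines of scikit-learn. This is a simplified stand-in for our setup, not the actual experiment code: `make_classification` replaces the real synthetic task, and a decision tree stands in for the model. Flipping a fraction of training labels plays the role of both annotation errors and randomly labeled junk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# A synthetic binary classification task (stand-in for our dataset)
X, y = make_classification(n_samples=4000, n_features=20,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def accuracy_with_label_noise(noise_frac):
    """Flip a fraction of training labels, train, and score on a clean test set."""
    y_noisy = y_tr.copy()
    n_flip = int(noise_frac * len(y_noisy))
    if n_flip:
        idx = rng.choice(len(y_noisy), size=n_flip, replace=False)
        y_noisy[idx] = 1 - y_noisy[idx]  # binary task: flip the label
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_noisy)
    return model.score(X_te, y_te)

for frac in (0.0, 0.2, 0.4):
    print(f"label noise {frac:.0%}: test accuracy = "
          f"{accuracy_with_label_noise(frac):.3f}")
```

Note that the test labels stay clean, so the accuracy drop measures exactly how much the contradictory training labels pulled the model away from the true signal.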
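Experiment 2 (relevance beats scale) can be illustrated with a toy version of the same setup. The sentences, labels, and dataset sizes below are invented stand-ins for the real movie-review and financial-news corpora; the point is only that the out-of-domain model never sees the test domain's vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Small in-domain training set (movie reviews), 1 = positive, 0 = negative
movie_train = [
    "a gripping story with brilliant acting",
    "the film was moving and beautifully shot",
    "an absolute masterpiece of cinema",
    "wonderful performances and a clever script",
    "a dull plot and wooden acting",
    "the film was boring and badly shot",
    "a tedious mess of a movie",
    "awful performances and a lazy script",
]
movie_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Larger but irrelevant training set (financial news)
finance_train = [
    "profits surged past expectations this quarter",
    "the stock rallied on strong earnings",
    "revenue growth beat analyst forecasts",
    "shares climbed after the upbeat report",
    "margins improved and guidance was raised",
    "the company posted record quarterly income",
    "profits collapsed amid weak demand",
    "the stock plunged on disappointing earnings",
    "revenue fell short of analyst forecasts",
    "shares tumbled after the bleak report",
    "margins shrank and guidance was cut",
    "the company posted a steep quarterly loss",
]
finance_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Evaluation is always on movie sentiment
movie_test = [
    "brilliant acting and a clever story",
    "a gripping and moving film",
    "boring plot and awful acting",
    "a dull and tedious script",
]
test_labels = [1, 1, 0, 0]

def fit_and_score(train_texts, train_labels):
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(train_texts, train_labels)
    return clf.score(movie_test, test_labels)

acc_in = fit_and_score(movie_train, movie_labels)
acc_out = fit_and_score(finance_train, finance_labels)
print(f"in-domain (small): {acc_in:.2f}   out-of-domain (larger): {acc_out:.2f}")
```

The financial model has sensible sentiment weights for words like "surged" and "plunged", but those features are simply absent from movie reviews, so its predictions degenerate toward chance.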
Figure 3 illustrates scenarios 3 and 4. In both cases, the clean baseline is trained on 5,000 images and the expanded set contains 15,000.
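The augmentation side of experiment 3 can be sketched as follows. This is a simplified stand-in: scikit-learn's small 8x8 digits dataset replaces MNIST, Gaussian pixel noise replaces blur and distortion, and the noise levels and model size are illustrative choices, not our actual settings. Evaluation is on a noisy test set, mimicking messy real-world inputs.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel values to [0, 1]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
# Fixed noisy test set: the messy conditions the model will face
X_te_noisy = np.clip(X_te + rng.normal(0, 0.15, X_te.shape), 0, 1)

def augment(X_base, y_base, sigma, copies=2):
    """Append Gaussian-noised copies of the training images (labels unchanged)."""
    noisy = [np.clip(X_base + rng.normal(0, sigma, X_base.shape), 0, 1)
             for _ in range(copies)]
    return np.vstack([X_base] + noisy), np.concatenate([y_base] * (copies + 1))

def score(Xs, ys):
    model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
    return model.fit(Xs, ys).score(X_te_noisy, y_te)

acc_clean = score(X_tr, y_tr)
X_mild, y_mild = augment(X_tr, y_tr, sigma=0.15)  # mild, label-preserving noise
X_hard, y_hard = augment(X_tr, y_tr, sigma=0.80)  # severe, signal-destroying noise
acc_mild = score(X_mild, y_mild)
acc_hard = score(X_hard, y_hard)
print(f"clean only:          {acc_clean:.3f}")
print(f"+ mild augmentation: {acc_mild:.3f}")
print(f"+ severe corruption: {acc_hard:.3f}")
```

The key design point is that `augment` keeps the original labels: mild noise leaves the digit recognizable, so the extra copies teach invariance, whereas severe noise breaks the image-label link and the extra copies become contradictory evidence.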
Takeaway: quality matters as much as quantity.
In practice, blindly scaling with unfiltered data is risky. Smart data curation and augmentation often achieve better results than brute-force collection. More data is not always better — better data is better.
This is counter-intuitive and fascinating.