ML Nugget #4: Data Quality vs Quantity

A common mantra in machine learning is that “more data is always better.” And it’s true that large datasets have fueled much of the progress in modern AI — from ImageNet powering the deep learning revolution to trillion-token corpora driving today’s large language models. But there’s a catch: not all data is created equal. Bigger isn’t always better.

The quality of your data often matters as much as — and sometimes more than — the sheer quantity. In fact, poor-quality, irrelevant, or mislabeled data can actively hurt performance by distracting the model from the true signal. At the same time, “low-quality” data isn’t always bad: if designed carefully (as in data augmentation), it can make models more robust to the messy conditions they will face in the real world.

So the nuance is this:

  • High-quality, relevant data is always valuable.
  • Irrelevant or mislabeled data can dilute or degrade performance, even at scale.
  • Controlled perturbations that preserve the true label (blur, rotation, noise) can be beneficial, because they mimic deployment variability and improve robustness.

This distinction — between harmful noise and helpful noise — is crucial for both traditional ML and modern deep learning.

Experiments

To make this point concrete, we ran a series of simple experiments:

1. Label Noise: When labels were corrupted in a synthetic classification task, performance dropped sharply. More data didn't help, since contradictory labels overwhelmed the model. This mirrors real-world annotation problems (e.g., medical imaging or weakly labeled web data). See Figure 1 for details on the experiment and the results.

Figure 1: Illustrating the effect of data quality vs quantity.
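The label-noise effect is easy to reproduce. Below is a minimal sketch, assuming a synthetic binary task, a decision tree, and illustrative sizes and noise rates (not necessarily the exact setup behind Figure 1):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic binary task (sizes are illustrative).
X, y = make_classification(n_samples=4000, n_features=20, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
rng = np.random.default_rng(0)

def accuracy_with_label_noise(noise_rate):
    """Flip a fraction of training labels, refit, and score on the clean test set."""
    y_noisy = y_tr.copy()
    flip = rng.choice(len(y_noisy), size=int(noise_rate * len(y_noisy)),
                      replace=False)
    y_noisy[flip] = 1 - y_noisy[flip]  # binary labels: flip 0 <-> 1
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_noisy)
    return accuracy_score(y_te, clf.predict(X_te))

for rate in (0.0, 0.2, 0.4):
    acc = accuracy_with_label_noise(rate)
    print(f"label noise {rate:.0%}: test accuracy {acc:.3f}")
```

Note that the test set stays clean throughout: the point is that contradictory training labels degrade generalization even though the evaluation data is unchanged.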

2. Relevance: We built two sentiment classifiers: one on a small set of movie reviews, and another on a much larger set of financial news. Tested on movie sentiment, the small-but-relevant model clearly outperformed the large irrelevant one. Relevance beats raw scale. Figure 2 compares performance and top features for both scenarios (the domains share some common sentiment words, but each also has domain-specific terms that the irrelevant model cannot learn).


Figure 2: Illustrating the effect of training sentiment models on relevant vs irrelevant data
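A toy version of this comparison can be sketched with tiny hand-made corpora. All phrases, sizes, and model choices below are illustrative assumptions, not the article's actual datasets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# In-domain corpus: movie reviews (1 = positive, 0 = negative).
movie_train = [
    "brilliant acting and a gripping story",
    "loved the plot and the cast",
    "a moving performance, wonderful film",
    "boring plot and flat acting",
    "terrible pacing, a dull film",
    "weak script and awful cast",
]
movie_y = [1, 1, 1, 0, 0, 0]

# Out-of-domain corpus: financial news sentiment.
finance_train = [
    "profits soared as revenue beat forecasts",
    "shares rallied on strong earnings",
    "the stock surged after upbeat guidance",
    "shares plunged on weak earnings",
    "losses widened as revenue missed forecasts",
    "the stock tumbled after poor guidance",
]
finance_y = [1, 1, 1, 0, 0, 0]

# Evaluation: movie-domain sentences only.
movie_test = [
    "the acting was brilliant and the story gripping",
    "loved the wonderful performance",
    "a dull plot and terrible acting",
    "awful film with a weak script",
]
movie_test_y = [1, 1, 0, 0]

def fit(corpus, labels):
    """Bag-of-words sentiment classifier: TF-IDF features + logistic regression."""
    return make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(corpus, labels)

in_domain = fit(movie_train, movie_y)
out_domain = fit(finance_train, finance_y)
print("in-domain accuracy:    ", in_domain.score(movie_test, movie_test_y))
print("out-of-domain accuracy:", out_domain.score(movie_test, movie_test_y))
```

The out-of-domain model shares almost no vocabulary with the movie test set, so it falls back to near-chance predictions, while the much smaller in-domain model has learned exactly the words that matter.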

3. Data Noise — Augmentation vs Degradation: On MNIST digits (5,000 base examples), we added mild corruptions such as blur and slight distortions (10,000 augmented images). Surprisingly, performance improved compared to clean-only training. This is the principle of data augmentation: noise that preserves labels makes models more robust. But when we increased the corruption severity (heavy occlusion, extreme noise), performance dropped.

4. Garbage Data: Finally, we mixed clean MNIST digits with 10,000 random-noise images assigned random labels. Accuracy collapsed compared to training on 5,000 clean digits alone. More data made the model worse by drowning the signal in irrelevant junk.

Figure 3 illustrates scenarios 3 and 4. In both cases, the clean model is trained on 5,000 examples and the augmented set contains 15,000.


Figure 3: Illustrating the effect of adding garbage data (left), which significantly degrades performance, vs controlled perturbations (augmentations, right), which slightly improve model performance.

Takeaway: quality matters as much as quantity.

  • Clean, relevant data consistently boosts performance.
  • Mislabeled or irrelevant data dilutes it.
  • Carefully designed perturbations (augmentation) can help by preparing the model for real-world conditions.

In practice, blindly scaling with unfiltered data is risky. Smart data curation and augmentation often achieve better results than brute-force collection. More data is not always better — better data is better.
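One way to make "smart data curation" concrete is to flag training points whose given label disagrees with cross-validated predictions, in the spirit of confident learning. A minimal sketch, assuming a synthetic binary task and illustrative models and thresholds:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=4000, n_features=20, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Inject 30% symmetric label noise into the training set.
rng = np.random.default_rng(0)
y_noisy = y_tr.copy()
flip = rng.choice(len(y_noisy), size=int(0.3 * len(y_noisy)), replace=False)
y_noisy[flip] = 1 - y_noisy[flip]

# Curation step: score each point by the cross-validated probability the model
# assigns to its *given* label, and keep only points the model does not dispute.
probs = cross_val_predict(LogisticRegression(max_iter=1000), X_tr, y_noisy,
                          cv=5, method="predict_proba")
p_given = probs[np.arange(len(y_noisy)), y_noisy]
keep = p_given >= 0.5  # illustrative (fairly aggressive) threshold

def tree_acc(Xs, ys):
    """Train a decision tree and score on the clean test set."""
    clf = DecisionTreeClassifier(random_state=0).fit(Xs, ys)
    return accuracy_score(y_te, clf.predict(X_te))

acc_noisy = tree_acc(X_tr, y_noisy)
acc_curated = tree_acc(X_tr[keep], y_noisy[keep])
print(f"noisy: {acc_noisy:.3f}  curated: {acc_curated:.3f}")
```

Here the curated model trains on fewer examples yet generalizes better, because filtering removed mostly the flipped labels: a small, concrete instance of "better data is better."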

This is counter-intuitive and fascinating.


More articles by Rishabh Iyer
