ML Nugget #4: Data Quality vs. Quantity
A common mantra in machine learning is that “more data is always better.” And it’s true that large datasets have fueled much of the progress in modern AI — from ImageNet powering the deep learning revolution to trillion-token corpora driving today’s large language models. But there’s a catch: not all data is created equal. Bigger isn’t always better.
The quality of your data often matters as much as — and sometimes more than — the sheer quantity. In fact, poor-quality, irrelevant, or mislabeled data can actively hurt performance by distracting the model from the true signal. At the same time, “low-quality” data isn’t always bad: if designed carefully (as in data augmentation), it can make models more robust to the messy conditions they will face in the real world.
So the nuance is this: some noise is harmful and some is helpful, and telling the two apart is crucial for both traditional ML and modern deep learning.
Experiments
To make this point concrete, we ran a series of simple experiments:
1. Label Noise: When labels were corrupted in a synthetic classification task, performance dropped sharply. More data didn’t help, since contradictions in labels overwhelmed the model. This mirrors real-world annotation problems (e.g., medical imaging or weakly labeled web data). See Figure 1 for more details on the experiment and the results.
2. Relevance: We built two sentiment classifiers: one on a small set of movie reviews, and another on a much larger set of financial news. Tested on movie sentiment, the small-but-relevant model clearly outperformed the large irrelevant one. Relevance beats raw scale. Figure 2 compares performance and the top features for both scenarios: alongside common sentiment words, each domain has its own vocabulary, which the out-of-domain model never gets a chance to learn.
3. Data Noise (Augmentation vs. Degradation): Starting from 5,000 clean MNIST digits, we added 10,000 mildly corrupted copies (blur and slight distortions). Surprisingly, performance improved compared to clean-only training. This is the principle of data augmentation: noise that preserves labels makes models more robust. But when we increased the corruption severity (heavy occlusion, extreme noise), performance dropped.
4. Garbage Data: Finally, we mixed the clean MNIST digits with 10,000 random noise images assigned random labels. Accuracy collapsed compared to training on the 5,000 clean digits alone. More data made the model worse by drowning the signal in irrelevant junk.
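Experiments 1 and 4 can be sketched in a few lines of scikit-learn. This is a simplified stand-in for our setup, not the actual experiment code: `make_classification` replaces the real synthetic task, and a decision tree stands in for the model. Flipping a fraction of training labels plays the role of both annotation errors and randomly labeled junk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# A synthetic binary classification task (stand-in for our dataset)
X, y = make_classification(n_samples=4000, n_features=20,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def accuracy_with_label_noise(noise_frac):
    """Flip a fraction of training labels, train, and score on a clean test set."""
    y_noisy = y_tr.copy()
    n_flip = int(noise_frac * len(y_noisy))
    if n_flip:
        idx = rng.choice(len(y_noisy), size=n_flip, replace=False)
        y_noisy[idx] = 1 - y_noisy[idx]  # binary task: flip the label
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_noisy)
    return model.score(X_te, y_te)

for frac in (0.0, 0.2, 0.4):
    print(f"label noise {frac:.0%}: test accuracy = "
          f"{accuracy_with_label_noise(frac):.3f}")
```

Note that the test labels stay clean, so the accuracy drop measures exactly how much the contradictory training labels pulled the model away from the true signal.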
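Experiment 2 (relevance beats scale) can be illustrated with a toy version of the same setup. The sentences, labels, and dataset sizes below are invented stand-ins for the real movie-review and financial-news corpora; the point is only that the out-of-domain model never sees the test domain's vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Small in-domain training set (movie reviews), 1 = positive, 0 = negative
movie_train = [
    "a gripping story with brilliant acting",
    "the film was moving and beautifully shot",
    "an absolute masterpiece of cinema",
    "wonderful performances and a clever script",
    "a dull plot and wooden acting",
    "the film was boring and badly shot",
    "a tedious mess of a movie",
    "awful performances and a lazy script",
]
movie_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Larger but irrelevant training set (financial news)
finance_train = [
    "profits surged past expectations this quarter",
    "the stock rallied on strong earnings",
    "revenue growth beat analyst forecasts",
    "shares climbed after the upbeat report",
    "margins improved and guidance was raised",
    "the company posted record quarterly income",
    "profits collapsed amid weak demand",
    "the stock plunged on disappointing earnings",
    "revenue fell short of analyst forecasts",
    "shares tumbled after the bleak report",
    "margins shrank and guidance was cut",
    "the company posted a steep quarterly loss",
]
finance_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Evaluation is always on movie sentiment
movie_test = [
    "brilliant acting and a clever story",
    "a gripping and moving film",
    "boring plot and awful acting",
    "a dull and tedious script",
]
test_labels = [1, 1, 0, 0]

def fit_and_score(train_texts, train_labels):
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(train_texts, train_labels)
    return clf.score(movie_test, test_labels)

acc_in = fit_and_score(movie_train, movie_labels)
acc_out = fit_and_score(finance_train, finance_labels)
print(f"in-domain (small): {acc_in:.2f}   out-of-domain (larger): {acc_out:.2f}")
```

The financial model has sensible sentiment weights for words like "surged" and "plunged", but those features are simply absent from movie reviews, so its predictions degenerate toward chance.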
Figure 3 illustrates scenarios 3 and 4. In both cases, the clean baseline is trained on 5,000 images and the expanded set contains 15,000.
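The augmentation side of experiment 3 can be sketched as follows. This is a simplified stand-in: scikit-learn's small 8x8 digits dataset replaces MNIST, Gaussian pixel noise replaces blur and distortion, and the noise levels and model size are illustrative choices, not our actual settings. Evaluation is on a noisy test set, mimicking messy real-world inputs.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel values to [0, 1]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
# Fixed noisy test set: the messy conditions the model will face
X_te_noisy = np.clip(X_te + rng.normal(0, 0.15, X_te.shape), 0, 1)

def augment(X_base, y_base, sigma, copies=2):
    """Append Gaussian-noised copies of the training images (labels unchanged)."""
    noisy = [np.clip(X_base + rng.normal(0, sigma, X_base.shape), 0, 1)
             for _ in range(copies)]
    return np.vstack([X_base] + noisy), np.concatenate([y_base] * (copies + 1))

def score(Xs, ys):
    model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
    return model.fit(Xs, ys).score(X_te_noisy, y_te)

acc_clean = score(X_tr, y_tr)
X_mild, y_mild = augment(X_tr, y_tr, sigma=0.15)  # mild, label-preserving noise
X_hard, y_hard = augment(X_tr, y_tr, sigma=0.80)  # severe, signal-destroying noise
acc_mild = score(X_mild, y_mild)
acc_hard = score(X_hard, y_hard)
print(f"clean only:          {acc_clean:.3f}")
print(f"+ mild augmentation: {acc_mild:.3f}")
print(f"+ severe corruption: {acc_hard:.3f}")
```

The key design point is that `augment` keeps the original labels: mild noise leaves the digit recognizable, so the extra copies teach invariance, whereas severe noise breaks the image-label link and the extra copies become contradictory evidence.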
Takeaway: quality matters as much as quantity.
In practice, blindly scaling with unfiltered data is risky. Smart data curation and augmentation often achieve better results than brute-force collection. More data is not always better — better data is better.
This is counter-intuitive and fascinating.