The Data Hoarding Trap

There's a persistent myth that more data is always better. Too many organizations treat data collection as a catch-all: if it comes in, dump it in the lakehouse and worry about it later.

Over the past year, working across industries and with a variety of customers, I've seen the same pattern: massive data lakes brimming with low-signal, outdated, and often irrelevant data. Teams are hoarding everything—logs, user events, sensor outputs—without any clear strategy. The result?

  • Models get noisier.
  • Storage and maintenance costs skyrocket.
  • The attack surface for privacy breaches grows.
  • Teams duplicate effort on the same datasets.

So here’s the question:

Are we indiscriminately hoarding data even when it adds noise, privacy risks, and unnecessary overhead?

If you’re nodding, you're not alone. This approach was once considered "safe." But now? It's expensive, risky, and often worthless.

What Actually Works: Curated, Targeted Data Collection

Here’s what separates the Data Avengers from the hoarders:

1. Define Use Cases First

  • Only collect the data that directly supports your specific problem or use case. Don’t collect just because it’s available.

2. Regular Audits and Data Pruning

  • Constantly assess what data is still relevant.
  • Prune old, irrelevant data and consolidate duplication across teams.
  • Clean up redundant datasets—both to reduce storage and maintenance costs and to avoid pulling in noise for your models.
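A regular audit like this can start small. Here is a minimal sketch of the pruning step, assuming a hypothetical in-memory `catalog` of dataset metadata (in practice this would come from your lakehouse catalog): it flags datasets that haven't been touched past a cutoff and finds exact duplicates by content hash.

```python
import hashlib
from datetime import datetime, timedelta

# Hypothetical catalog: name -> (last_accessed, content_bytes).
# A real audit would read this metadata from the lakehouse catalog.
catalog = {
    "events_2021": (datetime(2021, 3, 1), b"user_id,event\n1,click\n"),
    "events_copy": (datetime(2024, 5, 1), b"user_id,event\n1,click\n"),
    "orders": (datetime(2024, 6, 1), b"order_id,total\n7,9.99\n"),
}

def audit(catalog, stale_after_days=365, now=None):
    """Flag datasets unused past the cutoff, and exact duplicates by content hash."""
    now = now or datetime.now()
    stale, seen, duplicates = [], {}, []
    for name, (last_accessed, content) in catalog.items():
        if now - last_accessed > timedelta(days=stale_after_days):
            stale.append(name)
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            duplicates.append((name, seen[digest]))  # (copy, original it duplicates)
        else:
            seen[digest] = name
    return stale, duplicates
```

Content hashing only catches byte-identical copies; near-duplicates (re-partitioned or re-encoded datasets) need schema- and sample-level comparison on top of this.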

3. Architect Smart

  • Develop scalable architectures that reduce the ingestion time of new data.
  • Implement a metadata-driven framework for faster ingestion and automatic cleansing of incoming data.
  • Create sandbox environments or self-serve tools for rapid prototyping, ensuring data makes sense before it’s dumped into your data lake.
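The metadata-driven idea can be sketched in a few lines. This is an illustrative toy, not a framework: each source registers a schema up front, and incoming records are trimmed to declared fields and type-checked before anything lands in the lake (the `clickstream` source and its fields are made-up names).

```python
# Hypothetical schema registry: source name -> {field: expected type}.
SCHEMAS = {
    "clickstream": {"user_id": int, "page": str},
}

def ingest(source, records, schemas=SCHEMAS):
    """Keep only declared fields with the right types; reject everything else."""
    schema = schemas[source]
    clean, rejected = [], []
    for record in records:
        typed = {k: record.get(k) for k in schema}  # drop undeclared fields
        if all(isinstance(typed[k], t) for k, t in schema.items()):
            clean.append(typed)
        else:
            rejected.append(record)  # route to quarantine, not the lake
    return clean, rejected
```

The key design choice: undeclared fields are dropped at the door, so nothing reaches the lake that nobody asked for.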

4. Intentional Sampling

  • Sample smartly for AI models: stratified, adversarial, or uncertainty-based sampling can surface high-value examples without overloading the system.
  • You don't need everything. 10x less data can often yield the same results, so make your sampling intentional.
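To make the sampling idea concrete, here is a minimal stratified-sampling sketch using only the standard library. The point of stratifying is that rare-but-important classes (fraud, failures, edge cases) survive even aggressive downsampling, because every stratum keeps at least one record.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from every stratum so rare classes survive."""
    rng = random.Random(seed)  # seeded for reproducible samples
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample

# 1% fraud in the full data; a 10% stratified sample still keeps all strata.
events = [{"label": "fraud"}] * 10 + [{"label": "ok"}] * 990
sample = stratified_sample(events, key=lambda r: r["label"], fraction=0.1)
```

A naive uniform 10% sample of the same data could easily miss the fraud class entirely; the stratified version cannot.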

5. Privacy-By-Design Collection

  • Architect systems to collect only the minimum fields necessary to achieve the goal.
  • Anonymize by default, and pseudonymize identifiers as early in the pipeline as possible.
  • Implement automatic retention policies to purge low-value or stale data regularly.
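These three bullets fit in one small sketch. The field allowlist, salt, and function names below are illustrative assumptions, not a prescription: collection keeps only allowlisted fields, the direct identifier is replaced with a salted hash at ingest, and a retention policy drops anything older than the cutoff.

```python
import hashlib
from datetime import datetime, timedelta

ALLOWED_FIELDS = {"event", "timestamp"}  # collect only what the use case needs
SECRET_SALT = b"rotate-me"               # hypothetical; keep in a secrets manager

def pseudonymize(user_id, salt=SECRET_SALT):
    """Replace a direct identifier with a salted hash as early as possible."""
    return hashlib.sha256(salt + user_id.encode()).hexdigest()[:16]

def collect(raw):
    """Keep only allowlisted fields; never store the raw identifier."""
    record = {k: raw[k] for k in ALLOWED_FIELDS & raw.keys()}
    record["user_key"] = pseudonymize(raw["user_id"])
    return record

def purge_stale(records, retention_days=90, now=None):
    """Automatic retention: drop anything older than the policy allows."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=retention_days)
    return [r for r in records if r["timestamp"] >= cutoff]
```

Note that a salted hash is pseudonymization, not anonymization: whoever holds the salt can re-link records, so the salt needs the same protection as the identifier itself.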


It’s Not About More Data—It’s About Better Data

Ask yourself these questions:

  • What’s the signal-to-noise ratio of the data we’re collecting?
  • Is this data actually solving a problem, or are we collecting “just in case”?
  • Could we achieve the same results with 10x less data?

If you haven’t critically examined your data collection pipeline lately, now is the time.

Accumulating data blindly is a drag on innovation—not a catalyst for it. It's a drain on resources—human, computational, and financial—diverting attention from what truly matters. Smart organizations curate data with intent. Without that focus, leaders end up questioning the ROI of the entire data lake instead of seeing what efficient, purposeful data use could deliver.


That’s it. No hype, just hard-earned lessons from the field.

Until next time,

The Data Avengers Team


More articles by Aadi Manchanda
