Why data pipelines fail before the code


TL;DR

  • Most data pipelines don’t fail because of tools
  • They fail because of missing decisions
  • Architecture is about trade-offs, not diagrams
  • Interviews test how you think, not what you install


The real problem (before any line of code)

When data projects fail, the post-mortem usually blames:

  • the framework
  • the cloud service
  • performance

That’s rarely the root cause.

Most failures happen before the first line of code is written:

  • unclear business expectations
  • undefined data ownership
  • wrong assumptions about volume, latency, or quality

By the time code is involved, the outcome is already decided.


A typical (and broken) scenario

Business asks:

“We need this data available every morning.”

Engineering assumes:

  • batch job
  • daily schedule
  • best-effort freshness

No one clarifies:

  • what happens if data is late
  • how wrong data is handled
  • who is accountable if numbers change

The pipeline runs. Dashboards are built. Trust slowly erodes.

Not a tooling issue. A decision issue.
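
For example, "what happens if data is late" only stops being ambiguous once it becomes an explicit check instead of a hope. A minimal sketch, assuming a freshness SLA agreed with the business (the threshold and failure behavior here are illustrative, not from the scenario above):

```python
from datetime import datetime, timedelta, timezone

# Illustrative value: the real SLA is a business decision, not an engineering default.
FRESHNESS_SLA = timedelta(hours=6)


def check_freshness(latest_partition_ts: datetime) -> None:
    """Fail loudly if the newest data is older than the agreed SLA."""
    age = datetime.now(timezone.utc) - latest_partition_ts
    if age > FRESHNESS_SLA:
        # The decision being made explicit: late data blocks the refresh
        # instead of silently serving stale numbers to the dashboard.
        raise RuntimeError(f"Data is {age} old, exceeding the SLA of {FRESHNESS_SLA}")
```

Whether the right reaction is to block, to alert, or to publish with a warning is exactly the kind of decision no one made in the scenario above.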


The engineering decision that matters

Before choosing Spark, Glue, Lambda, or anything else, a data engineer should answer:

  • What is the acceptable latency?
  • What is the cost of wrong data?
  • Can this be recomputed safely?
  • Who consumes this and how critical is it?

These answers define the architecture.

Tools come later.
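
One way to make those answers explicit before any tool shows up is to write them down as a structure the team reviews together. A minimal sketch, with field names and example values chosen for illustration only:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineRequirements:
    acceptable_latency_minutes: int  # how stale can the data be before it is a problem?
    cost_of_wrong_data: str          # e.g. "reporting noise" vs "regulatory risk"
    safely_recomputable: bool        # can we rebuild from raw data without side effects?
    consumers: list[str]             # who reads this, and how critical are they?


# Example: the "available every morning" dashboard from the scenario above.
daily_sales = PipelineRequirements(
    acceptable_latency_minutes=24 * 60,
    cost_of_wrong_data="finance numbers shown to leadership",
    safely_recomputable=True,
    consumers=["BI dashboard", "finance team"],
)
```

The value is not the code itself; it is that every field forces a conversation that usually never happens.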


Simple architecture (with real trade-offs)

Even a simple stack built from S3, Glue, and Lambda comes with trade-offs:

  • S3 → cheap, durable, slower queries
  • Glue → scalable, slower startup
  • Lambda → fast, limited execution time

There is no “best” option. Only best for this context.
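
To make that concrete, here is a minimal sketch of the lightweight end of the spectrum: a Lambda function triggered when a new file lands in S3. Bucket names and the transformation are placeholders; the moment the work outgrows Lambda's 15-minute execution limit, the same step belongs in a Glue job instead.

```python
import json

import boto3

s3 = boto3.client("s3")

OUTPUT_BUCKET = "curated-data-bucket"  # placeholder name


def handler(event, context):
    """Triggered by an S3 'object created' event.

    Appropriate only while the work fits comfortably inside
    Lambda's execution-time limit.
    """
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    rows = json.loads(raw)

    # Placeholder transformation: keep only completed orders.
    cleaned = [r for r in rows if r.get("status") == "completed"]

    s3.put_object(
        Bucket=OUTPUT_BUCKET,
        Key=f"cleaned/{key}",
        Body=json.dumps(cleaned).encode("utf-8"),
    )
```

The point is not the handler; it is that choosing Lambda over Glue (or the reverse) should follow from the latency, volume, and recomputation answers above, not from familiarity.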


Lessons I've learned

  • Architecture is a business decision expressed in code
  • If you can’t explain why, you don’t own the system
  • Reliability starts with assumptions, not retries
  • Simpler pipelines fail less — and are easier to fix


How this shows up in interviews

You’ll hear questions like:

“How would you design a reliable data pipeline?”

What they’re really asking:

  • Do you clarify requirements?
  • Do you think in trade-offs?
  • Can you connect data to impact?

A strong answer doesn’t list tools. It explains decisions.



Good data engineering doesn’t start with code. It starts with clarity.

Clarity about assumptions. Clarity about trade-offs. Clarity about impact.

That’s the kind of thinking this newsletter will focus on in 2026.

I design, therefore I exist.

Data Science: a Game Changer

