Why data pipelines fail before the code


TL;DR

  • Most data pipelines don’t fail because of tools
  • They fail because of missing decisions
  • Architecture is about trade-offs, not diagrams
  • Interviews test how you think, not what you install


The real problem (before any line of code)

When data projects fail, the post-mortem usually blames:

  • the framework
  • the cloud service
  • performance

That’s rarely the root cause.

Most failures happen before the first line of code is written:

  • unclear business expectations
  • undefined data ownership
  • wrong assumptions about volume, latency, or quality

By the time code is involved, the outcome is already decided.


A typical (and broken) scenario

Business asks:

“We need this data available every morning.”

Engineering assumes:

  • batch job
  • daily schedule
  • best-effort freshness

No one clarifies:

  • what happens if data is late
  • how wrong data is handled
  • who is accountable if numbers change

The pipeline runs. Dashboards are built. Trust slowly erodes.

Not a tooling issue. A decision issue.
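
For example, "what happens if data is late" only stops being ambiguous once it becomes an explicit check instead of a hope. A minimal sketch, assuming a freshness SLA agreed with the business (the threshold and failure behavior here are illustrative, not from the scenario above):

```python
from datetime import datetime, timedelta, timezone

# Illustrative value: the real SLA is a business decision, not an engineering default.
FRESHNESS_SLA = timedelta(hours=6)


def check_freshness(latest_partition_ts: datetime) -> None:
    """Fail loudly if the newest data is older than the agreed SLA."""
    age = datetime.now(timezone.utc) - latest_partition_ts
    if age > FRESHNESS_SLA:
        # The decision being made explicit: late data blocks the refresh
        # instead of silently serving stale numbers to the dashboard.
        raise RuntimeError(f"Data is {age} old, exceeding the SLA of {FRESHNESS_SLA}")
```

Whether the right reaction is to block, to alert, or to publish with a warning is exactly the kind of decision no one made in the scenario above.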


The engineering decision that matters

Before choosing Spark, Glue, Lambda, or anything else, a data engineer should answer:

  • What is the acceptable latency?
  • What is the cost of wrong data?
  • Can this be recomputed safely?
  • Who consumes this and how critical is it?

These answers define the architecture.

Tools come later.
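
One way to make those answers explicit before any tool shows up is to write them down as a structure the team reviews together. A minimal sketch, with field names and example values chosen for illustration only:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineRequirements:
    acceptable_latency_minutes: int  # how stale can the data be before it is a problem?
    cost_of_wrong_data: str          # e.g. "reporting noise" vs "regulatory risk"
    safely_recomputable: bool        # can we rebuild from raw data without side effects?
    consumers: list[str]             # who reads this, and how critical are they?


# Example: the "available every morning" dashboard from the scenario above.
daily_sales = PipelineRequirements(
    acceptable_latency_minutes=24 * 60,
    cost_of_wrong_data="finance numbers shown to leadership",
    safely_recomputable=True,
    consumers=["BI dashboard", "finance team"],
)
```

The value is not the code itself; it is that every field forces a conversation that usually never happens.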


Simple architecture (with real trade-offs)

Even a simple stack built from S3, Glue, and Lambda comes with trade-offs:

  • S3 → cheap, durable, slower queries
  • Glue → scalable, slower startup
  • Lambda → fast, limited execution time

There is no “best” option. Only best for this context.
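
To make that concrete, here is a minimal sketch of the lightweight end of the spectrum: a Lambda function triggered when a new file lands in S3. Bucket names and the transformation are placeholders; the moment the work outgrows Lambda's 15-minute execution limit, the same step belongs in a Glue job instead.

```python
import json

import boto3

s3 = boto3.client("s3")

OUTPUT_BUCKET = "curated-data-bucket"  # placeholder name


def handler(event, context):
    """Triggered by an S3 'object created' event.

    Appropriate only while the work fits comfortably inside
    Lambda's execution-time limit.
    """
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    rows = json.loads(raw)

    # Placeholder transformation: keep only completed orders.
    cleaned = [r for r in rows if r.get("status") == "completed"]

    s3.put_object(
        Bucket=OUTPUT_BUCKET,
        Key=f"cleaned/{key}",
        Body=json.dumps(cleaned).encode("utf-8"),
    )
```

The point is not the handler; it is that choosing Lambda over Glue (or the reverse) should follow from the latency, volume, and recomputation answers above, not from familiarity.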


Lessons I've learned

  • Architecture is a business decision expressed in code
  • If you can’t explain why, you don’t own the system
  • Reliability starts with assumptions, not retries
  • Simpler pipelines fail less — and are easier to fix


How this shows up in interviews

You’ll hear questions like:

“How would you design a reliable data pipeline?”

What they’re really asking:

  • Do you clarify requirements?
  • Do you think in trade-offs?
  • Can you connect data to impact?

A strong answer doesn’t list tools. It explains decisions.



Good data engineering doesn’t start with code. It starts with clarity.

Clarity about assumptions. Clarity about trade-offs. Clarity about impact.

That’s the kind of thinking this newsletter will focus on in 2026.

I design, therefore I exist.

Data Science: a Game Changer

