Why data pipelines fail before the code
TL;DR
The real problem (before any line of code)
When data projects fail, the post-mortem usually blames:
That’s rarely the root cause.
Most failures happen before the first line of code is written:
By the time code is involved, the outcome is already decided.
A typical (and broken) scenario
Business asks:
“We need this data available every morning.”
Engineering assumes:
No one clarifies:
The pipeline runs. Dashboards are built. Trust slowly erodes.
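The ambiguity in that exchange can be made concrete. A minimal sketch (the threshold and names are hypothetical, not from the article) of a freshness check that turns "available every morning" into an explicit, testable bound instead of an unstated assumption:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical contract: "every morning" pinned down as a concrete
# staleness bound — one daily load plus a two-hour grace window.
MAX_STALENESS = timedelta(hours=26)

def is_fresh(last_loaded: datetime, now: datetime) -> bool:
    """True if the latest load is within the agreed staleness bound."""
    return now - last_loaded <= MAX_STALENESS

# Yesterday's 06:00 UTC load, checked at 07:00 UTC today: 25h old.
last = datetime(2025, 1, 1, 6, 0, tzinfo=timezone.utc)
check_time = datetime(2025, 1, 2, 7, 0, tzinfo=timezone.utc)
print(is_fresh(last, check_time))  # True: 25h is within the 26h bound
```

Once the bound is written down, "the data wasn't there this morning" becomes a measurable SLA breach rather than a disagreement about expectations.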
Not a tooling issue. A decision issue.
The engineering decision that matters
Before choosing Spark, Glue, Lambda, or anything else, a data engineer should answer:
These answers define the architecture.
Tools come later.
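One way to force those answers before any tooling debate is to write the requirements down as an explicit record. A sketch under stated assumptions (every field name and threshold here is illustrative, not the article's exact question list):

```python
from dataclasses import dataclass

# Illustrative requirements record; fields are assumptions, not a standard.
@dataclass(frozen=True)
class PipelineRequirements:
    freshness_hours: int        # how stale can the data be?
    daily_volume_gb: float      # how much arrives per day?
    consumers: tuple[str, ...]  # who reads it, and how?
    on_failure: str             # "retry", "alert", or "page someone"

reqs = PipelineRequirements(
    freshness_hours=24,
    daily_volume_gb=2.0,
    consumers=("bi_dashboard",),
    on_failure="alert",
)

# 2 GB/day with a 24h freshness window: a scheduled batch job is enough;
# the (arbitrary, illustrative) 500 GB/day cutoff is where distributed
# compute would even enter the conversation.
needs_distributed_compute = reqs.daily_volume_gb > 500
print(needs_distributed_compute)  # False for this context
```

The point is not the dataclass; it is that each field is a decision someone has to own before Spark or Lambda is mentioned.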
Simple architecture (with real trade-offs)
Trade-offs involved:
There is no “best” option. Only best for this context.
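To make the trade-off discussion tangible, here is a deliberately simple daily batch load, a sketch only (file layout, table schema, and SQLite are all assumptions for illustration). Its trade-off is typical of the simple end of the spectrum: easy to operate, rerun-safe, but the data is only as fresh as the last scheduled run.

```python
import csv
import sqlite3
from pathlib import Path

def load_daily_extract(csv_path: Path, db_path: Path) -> int:
    """Replace the previous snapshot with today's file; returns rows loaded.

    Drop-and-recreate keeps reruns idempotent: running the job twice for
    the same day leaves the same result, at the cost of a brief window
    where the table is empty mid-load.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("DROP TABLE IF EXISTS daily_sales")
        conn.execute("CREATE TABLE daily_sales (sku TEXT, qty INTEGER)")
        with open(csv_path, newline="") as f:
            rows = [(r["sku"], int(r["qty"])) for r in csv.DictReader(f)]
        conn.executemany("INSERT INTO daily_sales VALUES (?, ?)", rows)
        conn.commit()
        return len(rows)
    finally:
        conn.close()
```

Whether this is the right design depends entirely on the context above: for a small daily extract feeding one dashboard it is hard to beat; for high-volume or intraday needs it is the wrong answer.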
Lessons I’ve learned
How this shows up in interviews
You’ll hear questions like:
“How would you design a reliable data pipeline?”
What they’re really asking:
A strong answer doesn’t list tools. It explains decisions.
Good data engineering doesn’t start with code. It starts with clarity.
Clarity about assumptions. Clarity about trade-offs. Clarity about impact.
That’s the kind of thinking this newsletter will focus on in 2026.