In coding: To err is AI

We are deep into the AI coding era: agents are writing production code, running multi-step tasks across entire codebases, and operating with permissions equivalent to your most senior engineers. The tools are better than they have ever been, the models are smarter than ever, and adoption is at an all-time high.

In December 2025, Amazon’s Kiro agent was tasked with fixing a minor error in AWS Cost Explorer. Kiro had operator-level permissions, equivalent to a human developer, and no mandatory peer review existed for AI-initiated production changes. Unfortunately, Kiro decided that the optimal approach was to delete the entire environment and rebuild it from scratch, causing an outage of over 13 hours. And such error-causing decisions are not limited to a single incident: in a single week of March 2026, four high-severity incidents hit Amazon's retail website, including a six-hour meltdown that locked shoppers out of checkout, account information, and product pricing.

Amazon's internal post-mortem flagged "a trend of incidents and unsafe practices with a high blast radius" and "novel GenAI usage, for which best practices and safeguards are not yet fully established."

This is one of the most sophisticated engineering organizations on the planet with thousands of engineers and billions in infrastructure, and the post-mortem conclusion was that nobody told the agent what it wasn't allowed to do.

That is not a model problem or a tooling problem. It is a spec problem and it is a pandemic.

We have come a long way on the journey of agentic coding, yet it is not uncommon to hear things like "Claude deleted my entire database." However random that may sound, it is not: looked at closely, these failures are predictable outputs of an agent acting without a constant source of truth.

The journey of agentic coding was always building up to this problem

Phase I: Autocomplete. Copilot-style suggestions. Useful, but you were still the one holding the wheel.

Phase II: LLMs generating full functions, files, and features. The model could write code, but without direction.

Phase III: Specialized agents powered by context engineering, running on increasingly effective harnesses. Now the conversation is entirely about orchestration: which tools, which memory, which retrieval strategy, which model for which subtask.

Reaching Phase III is great, but the results do not hold up in large enterprise codebases, because the single source of truth is still missing, and it cannot live in your codebase alone.

You can't run enterprise engineering on a bag of magic tricks

Ever read a post titled "Here's the best way to use Claude Code"? Well, that doesn't work.

I recently interviewed a power user who has spent the last several months running Claude Code, Codex, Amp, Opencode, and a few others back to back. His read: Claude is bloated, with a hundred ways to do things, and context compression is still unreliable. Codex one-shots thinking tasks; it does fewer things and gets them right, and that comes down to model capability, not a bag of tricks. Amp is opinionated and expensive, and if your first prompt didn't land, you're stuck in a loop. Opencode gets a lot right with the right setup but feels like it has hit an evolutionary ceiling. The consensus across all of them: TUI and CLI are not the right interfaces for this.

It's interesting that at the end of the day, all of them converge on the same product. The industry has empirical consensus on what works.

Which means the question of what to build and how to build it correctly becomes the only one that matters.

Agents without a spec are a country without a constitution

You can obsess over context engineering and pick the best harness. You can fine-tune retrieval, memory, and tool routing. But if you don't have a spec, a single source of truth about how your system is supposed to behave, you are running a country with no constitution.

Coding agents cannot operate on the belief that there is no right and wrong, only instructions. That is how you get five different implementations of the same function: none of them wrong by the model's logic, yet all of them wrong for your system.

The model doesn't know your blast radius and doesn't know which abstractions are load-bearing. It doesn't know that the auth module was deliberately kept separate, or that the retry logic in the queue processor has a specific reason for being the way it is.

That knowledge lives somewhere. In your senior engineer's head, in a Notion doc nobody has updated since Q3, in a comment buried three files deep.

That is not a knowledge management problem. That is a spec problem.

The Kiro incident makes this concrete. The AI did not go rogue. It operated within the permissions it was given. A static developer tool with the same misconfigured permissions would have waited for a human to type a specific command. Kiro decided what the command should be. Nobody told it that deleting a production environment was off the table. That single missing constraint, one sentence in a requirements document, caused a 13-hour outage. The internal briefing note flagged "high blast radius" as a recurring characteristic of the failures.

This does not scale. And it does not get safer with a better model. It gets safer with a spec.
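To make the missing constraint concrete, here is a minimal sketch of spec-level enforcement. Everything in it is hypothetical: the `FORBIDDEN_ACTIONS` set, the `guard` function, and the action names are illustrative, not Kiro's or AWS's actual API. The point is that one explicit deny rule in a machine-checkable spec is enough to stop the failure mode described above.

```python
# Illustrative sketch: gate every agent-proposed action behind explicit
# spec constraints before anything touches production. All names here
# are hypothetical, for illustration only.

FORBIDDEN_ACTIONS = {            # encoded once, in the spec
    "delete_environment",
    "drop_database",
    "rotate_all_credentials",
}

def guard(action: str, target: str, allow_destructive: bool = False) -> bool:
    """Return True only if the spec permits this action on this target.

    A destructive action on a production target is rejected unless the
    spec explicitly allows it -- the one sentence that was missing.
    """
    if action in FORBIDDEN_ACTIONS and target.startswith("prod") and not allow_destructive:
        return False
    return True

# The agent's "optimal" fix is rejected; a scoped fix passes.
assert guard("delete_environment", "prod-cost-explorer") is False
assert guard("patch_config", "prod-cost-explorer") is True
```

A real implementation would need semantic matching rather than string lookups, but the shape is the same: the constraint lives outside the model, so the model cannot lose it.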

What does Spec-Driven Development enable?

A spec doesn't just tell the agent what to build; more importantly, it tells the agent what not to break. It defines the boundaries of acceptable output and encodes the decisions that were already made, so the model stops reopening them.

Sean Grove of OpenAI put it plainly at the AI Engineer World's Fair last year: "Code is sort of 10 to 20% of the value that you bring. The other 80 to 90% is in structured communication." His argument, that specifications, not prompts or code, are becoming the fundamental unit of programming, got a million views and sparked a genuine debate. The debate is interesting. But the underlying observation is not controversial: if you can't communicate what correct looks like, no agent can build it.


The spec is neither the PRD nor the README. It is the document that encodes decisions already made, so the model stops reopening them. Every time a pattern is established, it should be named there. Every time an architectural call is made, it should land there. Grove's closing advice was to start with the specification: describe the feature's goal, assumptions, constraints, and success criteria before a single line of code is written.
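Grove's ordering, goal, assumptions, constraints, success criteria, can be sketched as a simple data structure with a completeness check. The field names and the example contents below are hypothetical, not a standard schema:

```python
# Illustrative spec skeleton: goal, assumptions, constraints, success
# criteria -- written before any code. Field names are hypothetical.

spec = {
    "goal": "Fix rounding error in daily cost totals",
    "assumptions": [
        "Billing events arrive at most 24h late",
    ],
    "constraints": [
        "Do not modify the auth module",               # deliberately separate
        "Do not change retry logic in the queue processor",
        "No destructive operations on production environments",
    ],
    "success_criteria": [
        "Daily totals match the ledger to the cent",
        "No new high-severity alerts for 48h",
    ],
}

def is_complete(s: dict) -> bool:
    """Every section must be present and non-empty before coding starts."""
    required = ("goal", "assumptions", "constraints", "success_criteria")
    return all(s.get(k) for k in required)

assert is_complete(spec)
assert not is_complete({"goal": "Fix rounding error"})  # missing sections
```

The value is not the data structure; it is that an incomplete spec is detectable before the agent starts, rather than after the rollback.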

This is what we built Potpie around

Potpie's agents were always good at navigating codebases. They understood structure and could trace dependencies, map blast radius, and surface relevant context. But good is not the same as consistent. And consistency at enterprise scale requires something the model alone cannot provide: a reference point that does not disappear when the context window runs out.

We made one decision: agents would always reference the spec. Not just as a starting prompt or a one-time instruction, but as a persistent anchor that every agent action is measured against, regardless of how deep into a task the session has gone and regardless of what the model has lost along the way.
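A minimal sketch of that persistent-anchor idea, with a hypothetical `SpecAnchor` class (illustrative names, not Potpie's actual API): the spec is held outside the conversation context, so it survives compression, and every proposed step is re-checked against it, however deep the session goes.

```python
# Sketch: the spec lives outside the conversation context, so it
# cannot be summarized away. Every step is re-checked against it.
# Names are illustrative, not Potpie's actual API.

class SpecAnchor:
    def __init__(self, constraints: list[str]):
        self.constraints = constraints  # persisted, never compressed

    def violates(self, proposed_step: str) -> bool:
        # Toy check: a real system would do semantic matching,
        # not substring lookup.
        return any(c.lower() in proposed_step.lower() for c in self.constraints)

anchor = SpecAnchor(["delete environment", "drop table"])

steps = [
    "read failing test output",
    "patch rounding in totals module",
    "delete environment and rebuild",   # looks "optimal" to the model
]

executed = [s for s in steps if not anchor.violates(s)]
assert "delete environment and rebuild" not in executed
assert len(executed) == 2
```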




That one decision changed everything.

Enterprise codebases are not greenfield projects. They are decades of decisions, accumulated patterns, implicit contracts between systems that nobody fully documented but everybody understands. When an agent operates without a spec in that environment, it is guessing inside a minefield. One wrong assumption can trigger a production incident, block multiple teams, and end in a 2 am rollback.

Potpie's agents now operate with the spec as a constant. Context can compress. Memory can fade. The spec cannot be lost. It is the one thing the agent always knows. And for enterprise companies, especially the large ones, that is not a feature. It is the only acceptable way to run agents on code that actually matters.

The difference between our agents then and now is not a better model or a smarter retrieval strategy. It is one structural guarantee: the agent always knows what correct looks like.

The teams that will win this are the ones who treat the spec as a first-class, living artifact.

Without it, you are not doing agentic development; rather, you are doing very fast, very confident, very expensive guessing.
