Stop vibe coding blind. Start using Test-Driven Navigation
This is a pattern I observe in posts and articles on agentic development. A developer prompts Claude Code or Cursor to build a feature. The agent generates code. The developer scans the output, tweaks a variable name, and ships it.
A week later, something breaks in production. Nobody traces the failure to a specific decision. Because there was no decision. There was a prompt and an output. The space between them was unstructured.
Steve Yegge and Gene Kim describe this exact failure mode in their book “Vibe Coding”. Steve puts it bluntly: the coding agent had silently deleted or hacked tests to make them pass, and in one large suite had outright deleted 80 percent of the test cases. Gene frames 4,000 lines in four days as a redefinition of what is possible. The question Test-Driven Navigation answers is: possible for whom? Without tests derived from the business context, those 4,000 lines are unverified assumptions.
The problem is not the speed. The problem is the absence of verifiable success criteria.
Alex Bunardzic discussed the Test-Driven Navigation concept at Devoxx Belgium 2025, demonstrating how AI, when paired with tests, safely guides refactoring through legacy code.
In traditional TDD, tests are about design discipline. Kent Beck's Red-Green-Refactor loop has been around for 25 years. Most developers know it. Most do not practice it because the upfront cost feels high when you write both the tests and the implementation.
Agentic tools change the equation. The cost of generating code dropped to near-zero. But the cost of wrong code did not drop. The agent generates confidently, quickly, and at volume. A human developer who is uncertain will pause and think. An LLM will not. It will write more code.
Test-Driven Navigation reframes what tests do in this new workflow. Tests are not about design. They are about navigation. They tell the agent where to go, confirm it arrived, and prevent it from drifting into confident-sounding nonsense.
But tests need to come from somewhere. Prompting the agent with "write tests for a discount calculator" produces tests validating whatever the agent decides to build. This tells you nothing about whether the code does what your business needs.
In my book “The AI-Native Software Development Lifecycle”, I describe a Five-Layer Prompt Architecture. Layer 1, the Context Prompt, defines the application's domain, purpose, constraints, and success criteria. Every downstream layer inherits from it. Layer 4, Testing and Validation Prompts, inherits directly from the Context Prompt and the module specifications above it.
This is the key discipline: you define the Context Prompt. You use AI to help you generate tests based on it. Then the agent implements code to satisfy those tests. The Context Prompt is the source of truth. The tests are the executable form of the truth. The agent is the engine.
Let me make this concrete with Python.
You have a Context Prompt specifying: customers in the gold tier get a 20% discount, customers in the silver tier get 10%, customers in unknown tiers get no discount, and negative prices are invalid.
You generate the tests from this context:
# test_discount.py
import pytest

from discount import calculate_discount


def test_gold_tier_gets_20_percent():
    assert calculate_discount(100.0, "gold") == 80.0


def test_silver_tier_gets_10_percent():
    assert calculate_discount(100.0, "silver") == 90.0


def test_unknown_tier_gets_no_discount():
    assert calculate_discount(100.0, "bronze") == 100.0


def test_negative_price_raises_error():
    with pytest.raises(ValueError):
        calculate_discount(-50.0, "gold")
Now you tell the agent: implement calculate_discount in discount.py to make these tests pass.
The agent produces:
# discount.py
TIER_DISCOUNTS = {"gold": 0.20, "silver": 0.10}


def calculate_discount(price: float, tier: str) -> float:
    if price < 0:
        raise ValueError("Price cannot be negative")
    rate = TIER_DISCOUNTS.get(tier, 0.0)
    return round(price * (1 - rate), 2)
Clean. Minimal. Exactly what the tests demanded. No surprise "platinum" tier nobody asked for. No 50-line class hierarchy when a dictionary lookup does the job.
One of the biggest risks in agentic development is not wrong code. It is too much code. Code you did not need. Code introducing complexity you now have to maintain. Tests derived from a Context Prompt constrain the output to exactly the scope you specified.
In practice, the loop runs like this:
Context Prompt -> AI generates tests -> Agent implements -> pytest runs
If RED: agent reads failures, iterates
If GREEN: you review the diff, commit
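The RED branch of that loop is easy to mechanize. The sketch below shows one way to wire it up; the only real dependency is pytest invoked through subprocess, and ask_agent_to_fix is a hypothetical placeholder standing in for whatever agent API or CLI you actually use.

# run_loop.py: a minimal sketch of the RED/GREEN navigation loop.
# ask_agent_to_fix() is a hypothetical placeholder, not a real agent API.
import subprocess

MAX_ATTEMPTS = 5


def run_tests() -> subprocess.CompletedProcess:
    # Run the suite quietly and capture the failure report as text.
    return subprocess.run(
        ["pytest", "test_discount.py", "-q"],
        capture_output=True,
        text=True,
    )


def ask_agent_to_fix(failure_report: str) -> None:
    # Placeholder: hand the pytest output back to the agent, together with
    # the Context Prompt, and let it revise discount.py.
    print(failure_report)


for attempt in range(1, MAX_ATTEMPTS + 1):
    result = run_tests()
    if result.returncode == 0:
        print("GREEN: review the diff, then commit.")
        break
    print(f"RED (attempt {attempt}): feeding failures back to the agent.")
    ask_agent_to_fix(result.stdout)
else:
    print("Still RED after the iteration budget: a human takes over.")

The specific script does not matter. What matters is that the agent only ever iterates against failures produced by tests you own.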
The moment you let the agent write both the tests and the implementation, you lose the feedback loop. The agent will write tests validating its own code. I have seen teams try this shortcut. The agent produces tests and implementation together, all tests pass, everyone feels productive. Then in code review someone notices the tests do not cover negative prices, or the edge case where tier is None, or the behavior when the price is zero. The tests were written to validate the implementation, not to define the requirement.
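When you catch that in review, the fix is to add the missing cases to the test file yourself, derived from the Context Prompt rather than from the implementation. The tests below are a sketch; they assume the Context Prompt treats None as just another unknown tier and a zero price as valid, which is my wording of those rules, not something the agent decided.

# test_discount_edges.py: edge cases an agent-authored suite tends to skip.
# Assumes the Context Prompt counts None as an unknown tier and zero as a valid price.
import pytest

from discount import calculate_discount


def test_none_tier_gets_no_discount():
    assert calculate_discount(100.0, None) == 100.0


def test_zero_price_is_valid():
    assert calculate_discount(0.0, "gold") == 0.0


def test_negative_price_rejected_for_any_tier():
    with pytest.raises(ValueError):
        calculate_discount(-1.0, "silver")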
An LLM does not "know" what your function should do. It has a statistical model of what functions like yours tend to look like, which is a fundamentally different thing. When a human developer writes code without tests, there is still a brain running informal verification. The LLM does not have this. Tests give it something it lacks on its own: a concrete, executable definition of correct behavior.
If you are using an agentic coding tool today and you are not doing this, start small. Write a Context Prompt for one module. Four or five lines of business rules and constraints. Use AI to help you generate a test file from the context. Then ask the agent to implement against those tests. Compare the output to what you get from a bare "build me a function" prompt.
The difference is usually obvious on the first try.
The Context Prompt concept extends beyond code generation into a full five-layer prompt hierarchy governing architecture, modules, testing, and operations. I cover this in “The AI-Native Software Development Lifecycle”. The code generation layer is where most teams start, because it is where the pain is most visible. But the architectural thinking is the same at every layer.
the "too much code" problem is a definition problem in disguise. AI builds more when it hasn't been told what "enough" looks like. it fills the blank - a 50-line hierarchy, three abstraction layers, helper functions four deep. not because it's bad at coding. because no one defined the constraints. the code bloat is a planning failure, not a model failure.
In many cases the risk is not correctness but complexity inflation. Without clear constraints, agents tend to optimize for completeness rather than simplicity. Guardrails like tests or scoped prompts become essential.
I've learned to stop agents before they over-engineer solutions. I request the simplest version first, then iterate only when necessary. This approach saves hours of refactoring unnecessary complexity.