Test-Driven Development for Agents
The Agentic Shift


Tests as Specification, Not Verification

A few years ago, I was leading a source code repository migration at a large financial services company. The business unit hadn't renewed their vendor contract, giving us a tight cutover window. We were migrating their repositories to our new platform, and failure was not an option. We were talking about legacy COBOL tied to general ledger systems and applications maintaining insurance products whose code companies were required to retain for regulatory reasons. Lost code couldn't just be rewritten.

We didn't start with 20,000 repos. We started with our own. The engineering team in Toronto tested the migration scripts against our own team's repositories first. We defined what a successful migration looked like for a small set of repos we knew intimately, ran the scripts, and verified the output. Once we proved it worked for our team, we expanded to other teams. Once it worked for individual teams, we moved to entire orgs with multiple teams. Each scale-up was a test. The scripts earned trust incrementally, not all at once.

By the time we were ready for the enterprise cutover, we had evaluated over 20,000 repositories and put about half into active organizations. The scripts could migrate hundreds of repos in minutes. But as Product Director, I kept coming back to one question. What if we missed repositories?

During a pre-mortem planning session with my engineering manager, we walked through the system design, asking what could go wrong. When the "missed repositories" scenario came up, he made a casual observation that just clicked. What if we loaded everything into a staging org first, then transformed and migrated the required repos from there? That way, the original copy of every repository would be left behind for emergency purposes. I instantly approved the idea and put it into the sprint plan. We audited the number of repos, files, and users, so we knew what was delivered matched the original source. The audit was the specification. The hidden org was the safety net.

Then the cutover happened.

I was sitting in our open office in Boston when the first email came in. Someone had forgotten to migrate a repository. They wanted to know if we could recover it from one of their backups. I smiled. I knew the procedure. We had practiced it. I messaged the engineering manager, and five minutes later I was responding to the customer that it was complete. The transfer from the hidden org into their organization took under 60 seconds, regardless of whether the repo had ever been flagged for migration.

That first email wasn't the last. Over the next few months, the same scenario played out dozens of times. Different business units, different repos, same result. Five minutes. Done.

That story taught me something I keep coming back to. We defined what "correct" looked like before the migration ran. We proved it at small scale before trusting it at enterprise scale. And when things went wrong, the safety net caught what the scripts missed.

That is the principle behind test-driven development.

Figure 1: Tests as Specification

When agents generate code faster than you can review it, you need the same discipline. Tests are not afterthoughts. They are the specification the agent builds against.

Why TDD Matters More With Agents

When I'm working with an agent, I'm not watching every line of code get written. The agent might generate 200 lines in seconds. I can't review all of it in real time. What I can do is make sure the agent has clear success criteria before it starts. Because when the next task builds on the last one, I need to know that foundation is solid. If it isn't, everything after it is built on sand. That's why I use TDD with agents. Not because it is a best practice. Because I can't afford to move forward on a foundation I haven't verified.

Issue 6 introduced three reasons why TDD matters more with agents. Let me expand on each.

Speed is the first reason. Agents generate code faster than you can review it. Without automated tests, you're trusting without verifying. In the repository migration, the automated scripts could migrate hundreds of repositories in minutes. No human could manually verify each one. The tests (the audit counts, the rollback capabilities) were the only way to operate at that speed safely. The same principle applies when an agent scaffolds an entire API layer in thirty seconds. You need something faster than your eyes to verify the output.

Clarity is the second reason. Tests force you to define what "done" looks like before you start. For agents, this is specification, not verification. A test that says "this endpoint returns a 200 response with these three fields" tells the agent more about your intent than a paragraph of natural language. Tests are the most precise form of specification you can give an agent. They provide unambiguous success criteria, so that when the next task executes, its foundation is solid. You need to know the domain to write the test. As one developer put it, "I'm primarily using AI in cases where I know what the answer should be or should look like." That is the TDD mindset.
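Here is a minimal sketch of that kind of executable specification. The endpoint path and field names are hypothetical placeholders, and I'm assuming Playwright's request fixture as the runner with a baseURL set in the config; the framework matters far less than the fact that the success criteria run as code.

```typescript
import { test, expect } from '@playwright/test';

// Hypothetical endpoint and fields; the point is the executable contract.
// Assumes playwright.config.ts sets baseURL to the running app.
test('GET /api/scans/latest returns 200 with the three required fields', async ({ request }) => {
  const response = await request.get('/api/scans/latest');
  expect(response.status()).toBe(200);

  const body = await response.json();
  expect(body).toHaveProperty('id');
  expect(body).toHaveProperty('status');
  expect(body).toHaveProperty('completedAt');
});
```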

Accumulation is the third reason. Every caught bug becomes a permanent test. Traditional TDD accumulates a regression suite over months. With agents, this happens in hours. Each failed test from one generation session becomes a guard for the next. Your test suite is a growing body of institutional knowledge about what can go wrong. I saw this play out at enterprise scale during the Log4j crisis in December 2021. At a previous company, I had spent three months migrating our Software Composition Analysis (SCA) tooling, which automatically inventories every third-party library in your codebase and flags known vulnerabilities, to a SaaS platform and onboarding repositories. That SCA inventory was a specification defined before the crisis; running the query was running the test. When Log4j hit, I opened the console, selected all applications with a Log4j dependency, and roughly 300 appeared. Within two weeks, every vulnerable dependency was remediated. The other response streams, using manual scanning and spreadsheets, kept meeting for another month.

Kent Beck introduced TDD as part of Extreme Programming almost three decades ago. The practice hasn't changed. What has changed is that agents make TDD mandatory, not optional. Even non-deterministic outputs like writing can follow TDD. The tests are rubrics and scoring thresholds instead of unit tests, but the principle is identical.

In OOP, unit tests verify that implementations satisfy interfaces. In AOD, tests ARE the interface. They define the contract the agent must satisfy. TDD is a form of spec-driven development where the specification is executable code, a natural extension of the principles we explored in Issue 7. That OOP parallel matters because it shows this isn't a new idea. It is an old discipline applied to a new context.

Let me show you exactly how this works. When I was building my AI Threat Scanner SaaS application, I started with one test. Can I reach the home page?


A Practical Example You Can Recreate

One Test, One Goal

When I started building the AI Threat Scanner, I wrote one end-to-end test before writing any code. Navigate to the home page and assert you get there. That's it. One test. One assertion. Can the app load?
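Here is a minimal sketch of what that first test can look like, assuming a Playwright setup with a baseURL pointing at the running app. It is not the actual AI Threat Scanner code, just the shape of the specification.

```typescript
import { test, expect } from '@playwright/test';

// One test, one assertion: can the app load?
// Assumes playwright.config.ts sets baseURL to the running app.
test('home page is reachable', async ({ page }) => {
  const response = await page.goto('/');
  expect(response?.ok()).toBeTruthy(); // the app answered and the page loaded
});
```

Notice the test says nothing about how the page gets built. Routes, framework, rendering, all of that is left to the agent.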

That test drove everything that followed.

The Iteration Loop

First run. Test fails. No app exists yet. The agent scaffolds the app, creates routes, renders a home page. Test passes.

Add authentication. I tell the agent the app needs a login page. Now the test fails again. The home page is behind auth. The agent has to implement login, wire it up, and get the test back to green. It keeps iterating until the end-to-end test passes.

Add OAuth. Requirements grow. Now login needs OAuth. Test fails again. The agent adds the OAuth flow, handles redirects, manages tokens. The test keeps failing until the full auth chain works end-to-end.

Each feature breaks the test. The agent fixes it. The test never changes. The implementation evolves to satisfy it.

Figure 2: TDD Iteration Loop

Why This Works

The test is the specification. "Reach the home page" is an unambiguous success criterion. The agent can't game it or satisfy it partially.

The test survives scope changes. Login, OAuth, role-based access. None of these change the test. They change the path to satisfying it.

The agent re-iterates against the failing test. When OAuth breaks the redirect chain, the agent doesn't need you to diagnose it. The test tells the agent exactly what is broken. "I cannot reach the home page." The agent iterates until it can.

And you stay in the design seat. You're not reviewing auth code line by line. You're defining what "working" means and letting the agent figure out how to get there.

The Pattern Generalized

This isn't specific to login flows. The pattern works for any feature. Write a test for an API endpoint that calls it and asserts the response shape. The agent builds the handler, validation, and database queries until the test passes. Write a test for a data pipeline that sends input and asserts output format. The agent builds the transformation chain. Write a test for a security gate that attempts an unauthorized action and asserts it gets blocked. The agent implements the access control.

The principle: define the destination as a test, let the agent find the route.
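As an illustration of the pipeline case, the specification might look like the sketch below. The normalizeFindings function, the record shape, and the Vitest runner are all assumptions for the example; the test is written before the module exists, which is the whole point.

```typescript
import { describe, it, expect } from 'vitest';
// Hypothetical pipeline entry point; the module does not exist yet when this test is written.
import { normalizeFindings } from './pipeline';

describe('findings pipeline', () => {
  it('emits records in the agreed output format', () => {
    const raw = [{ tool: 'scanner-a', sev: 'HIGH', msg: 'Prompt injection in /chat' }];

    const output = normalizeFindings(raw);

    // The assertion is the specification: fields renamed, severity normalized.
    expect(output).toEqual([
      { source: 'scanner-a', severity: 'high', description: 'Prompt injection in /chat' },
    ]);
  });
});
```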

Four TDD Patterns at Work

That walkthrough demonstrates all four patterns from the TDD toolkit.

Figure 3: Four TDD Patterns for Agents

Test-First Prompting. The end-to-end test existed before any code. The agent built against it from the start.

Incremental Trust. I started with one simple test (reach the home page). As the agent proved reliable, scope expanded (add auth, add OAuth). Trust was earned through passing tests, not assumed.

Edge Case Accumulation. Every time OAuth broke the redirect chain or a token expired mid-flow, that failure became a new test. The suite grew from one test to dozens, each one born from a real failure.

Security Gates. "Unauthenticated users cannot reach the home page" is not a policy document. It is an executable test.
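A sketch of that gate as an executable test, assuming unauthenticated visitors are redirected to a /login route; the route names are placeholders.

```typescript
import { test, expect } from '@playwright/test';

// Security gate as an executable test, not a policy document.
test('unauthenticated users cannot reach the home page', async ({ browser }) => {
  const context = await browser.newContext(); // fresh context: no cookies, no saved session
  const page = await context.newPage();

  await page.goto('/');

  await expect(page).toHaveURL(/\/login/); // bounced to the login page
  await context.close();
});
```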


From Supervisor to Architect

There is a mindset shift that makes TDD with agents work. Without it, you default to either blind trust or constant micromanagement. The shift is moving from "human in the loop" to "human on the loop."

Issue 9 previewed the idea of the human as implicit eval, where the developer sitting with the agent constantly judges "good enough to proceed?" Here is that concept in practice. TDD is the mechanism. It is how you codify that judgment into something an agent can execute against.

There is a distinction in autonomous systems between these two postures. In the loop means you are the gatekeeper, reviewing every output before it moves forward. On the loop means you have designed the verification system and you monitor outcomes.

Elli Shlomo captures this shift well. The difference between being a manual supervisor for the agent and being the architect of the mission. Martin Fowler's team frames the practical side. The in-the-loop response to bad agent output is to fix the artifact. The on-the-loop response is to fix the harness that produced it. Tests are that harness.

Every time you catch yourself thinking "that looks wrong," you just identified a test you should have written. The discipline is capturing that judgment before the next session.

Go back to the AI Threat Scanner example. When OAuth broke the redirect chain, the "in the loop" response would be to debug the OAuth code myself. The "on the loop" response was to add a test for the redirect chain. The next time the agent touched auth, it couldn't break what was already working.
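That regression test looked roughly like the sketch below. The protected route and the loginAsTestUser helper are hypothetical stand-ins for the real flow, but the structure is the point: the failure became a permanent guard.

```typescript
import { test, expect } from '@playwright/test';
// Hypothetical helper that completes the OAuth login flow for a test user.
import { loginAsTestUser } from './helpers/auth';

// Regression test born from a real failure: once the redirect chain broke,
// this test made sure the agent could never break it silently again.
test('login redirect chain returns the user to the page they asked for', async ({ page }) => {
  await page.goto('/scans/new');                // protected page, triggers the auth redirect
  await expect(page).toHaveURL(/\/login/);      // first hop: bounced to login

  await loginAsTestUser(page);                  // complete the (mocked) OAuth flow

  await expect(page).toHaveURL(/\/scans\/new/); // final hop: back where the user started
});
```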

This is the deeper shift. In the loop, you're verifying after the fact. On the loop, you're specifying before the fact. You stop asking "did the agent get it right?" and start asking "did I define 'right' clearly enough?"

Figure 4: Implicit to Explicit

The TDD Delegation Spectrum

That mindset shift isn't binary. Human-agent pair programming operates on a spectrum based on confidence and risk, with failure modes at both extremes.

I find it useful to think of it as five positions. The two extremes are where things go wrong. The three middle positions are the healthy operating range.

Figure 5: TDD Delegation Spectrum

Vibe Testing is one extreme. You trust the agent blindly. It writes the tests, generates the code, and you approve without reading either. Tests pass but you can't explain what they verify. This is vibe coding wearing a testing costume.

Drive is where you write the tests and review the output. The agent generates candidates. This is the right position for new domains, unfamiliar APIs, or any situation where you catch errors the agent doesn't know to test for.

Guide is where you write the tests and spot-check the output. The agent generates code and self-tests against your specifications. This works in familiar territory where agent output matches your expectations most of the time.

Delegate is where you define acceptance criteria only. The agent generates code, writes detailed tests, and iterates autonomously. You review the acceptance criteria results. You know you're here when agent test failures teach you edge cases you missed. This is where TDD produces its highest leverage.

Micromanage is the other extreme. You write every test, review every line, and rewrite the agent's output. You spend more time reviewing than writing it yourself would take. The agent becomes an expensive autocomplete.

What Moves You Between Levels

The key is that movement should be evidence-based, not vibes.

Move from Drive to Guide when the agent passes three or more sessions without a test gap you had to catch. Move from Guide to Delegate when agent-written tests catch bugs you didn't anticipate. Move from Delegate back to Guide when the agent misses a category of test, like never testing error paths. Move from Guide back to Drive when there's a domain shift, a new API, a new framework, unfamiliar territory. And any production incident traceable to undertested agent output should send you back to Drive, regardless of where you were.

The asymmetry matters. Erring toward more control is always safer than erring toward less. A micromanaged agent wastes time. A vibe-tested agent ships bugs. When in doubt, stay one level closer to Drive.

There is one more factor beyond track record. The blast radius of the domain determines the ceiling on delegation, not just the agent's history of passing tests. Some domains permanently stay closer to Drive because the cost of failure is too high. You might Delegate on API routes and Drive on auth logic in the same session, not because the agent failed at auth, but because the consequences of an auth bug are categorically different. The principle is simple. The higher the blast radius, the lower the ceiling.

The Specification Angle

The spectrum is really about who writes the specification. At Drive, you author the full specification as tests. At Guide, you author the specification and the agent fills in detail. At Delegate, you define acceptance criteria and the agent authors the detailed specification. The shift across the spectrum is specification authorship, not just supervision level.

Most developers enter this spectrum from one of the two extremes. They either vibe test (excited about agent speed, trust everything) or micromanage (skeptical, trust nothing). Both are natural starting points. The maturity path is learning to read the signals and move fluidly across the healthy range based on evidence, not emotion. An expert-level practitioner operates differently per task. The spectrum is per-task, not per-developer.

Tests also allow you to maintain flow state with agents. Instead of context-switching between "generate" and "verify" modes, you write the specification once and let the agent iterate against it. You stay in the design mindset.


Verification vs. Validation

There is a critical distinction that most developers conflate. Getting it wrong leads to false confidence.

Verification asks whether we built it right. Validation asks whether we built the right thing.

Figure 6: Verification vs Validation

These are fundamentally different questions. Verification is answered by tests, automated, running on every iteration. Validation is answered by stakeholders, evals, and users, at milestones and checkpoints.

Verification catches implementation errors, regressions, and broken contracts. The failure mode is "the code doesn't work." Validation catches wrong requirements, misunderstood intent, and misaligned value. The failure mode is "the code works perfectly but solves the wrong problem."

Why does this matter more with agents? Because agents are excellent at verification. Give them a test, they will iterate until it passes. But agents can't validate. They can't tell you whether the test itself is testing the right thing. That is the human's job. And it always will be.

I have watched teams fall into this trap. The test suite goes green. Everyone feels confident. But nobody asked whether the tests were testing the right behavior in the first place. In the AI Threat Scanner, "reach the home page" was a valid specification because I knew what the home page should do. If I had written a test for the wrong endpoint or the wrong user flow, the agent would have built exactly what I specified. And it would have been wrong.

Issue 9's eval framework is primarily validation infrastructure. Multi-dimensional scorecards, polymorphic correction agents, human-as-evaluator. Those are validation mechanisms. This chapter's TDD patterns are verification mechanisms. The complete system needs both.

Verification (TDD, this chapter) answers whether the implementation satisfies the specification. Validation (Evals, Issue 9) answers whether the specification satisfies the intent.

The Governance Triad (the PM, Architect, and Team Lead agents introduced in Issue 5) maps cleanly to this distinction. The PM validates, asking "Is this the right thing to build?" The Architect verifies, asking "Is this built correctly?" The Team Lead validates and verifies, asking "Is this deliverable and does it meet standards?"

Here's the practical implication. When you write a test-first specification for an agent, you are doing verification design. When you review whether those tests capture the right requirements, you are doing validation. Both happen before the agent writes a single line of code. The maturity is knowing which hat you're wearing.

Before your next sprint, separate your tests into two columns: verification and validation. If you have zero validation tests, that is your gap.

TDD gives you verification for free. The agent iterates until tests pass. Validation is never free. It always requires human judgment. The discipline is not skipping validation just because verification is passing.


Why This Matters Now

Agents are getting faster. The speed gap between generation and human review widens every month. Tests are the only scalable verification mechanism that keeps pace with that speed.

TDD is not new. Kent Beck formalized it nearly three decades ago. What is new is that agents make TDD mandatory, not optional. The discipline that senior developers practiced voluntarily becomes the minimum bar when agents are generating code.

And verification without validation is dangerous. Passing tests feel like progress, but only if the tests themselves are right. The human's job is shifting from writing code to validating specifications. That shift changes everything about how you work with agents. You move from implicit evaluation (eyeballing agent output) to semi-explicit checks (manually running tests) to explicit test specification (test-first) to accumulated institutional knowledge (a growing test suite born from real failures). Each stage compounds on the one before it.

TDD with agents is not about testing code. It is about specifying intent.

Figure 7: Specifying Intent

What's Next

Every week, The Agentic Shift covers a new topic in agentic-oriented development, from product strategy and security architecture to software engineering best practices. Subscribe to keep up with the shift.

About the Author: David Matousek is Solution Compliance Lead for the BEST Program at the Commonwealth of Massachusetts, sharing practical insights on agentic development and enterprise risk. He leads compliance, risk, and security architecture for the Business Enterprise System Transformation program, and maintains tachi, an open-source agentic threat modeling tool that puts AOD principles into practice. Connect with him on LinkedIn to continue the conversation, share your experiences, or visit davidmatousek.com to explore how these principles apply to your organization.

