Verifiable Throughput: A Control Framework for AI Coding Evolution in Legacy Systems
This post is a reflection on what I learned using AI-assisted coding throughout 2025 while working on a large-scale, business-logic-heavy service. The biggest takeaway: AI dramatically reduces the cost of implementation—but it raises the bar on verification.
“Vibe coding” is incredible for greenfield projects: you can move fast because there’s little surface area, few invariants, and limited blast radius. The agent can invent APIs, reshape the architecture, and iterate freely—and the cost of being wrong is low.
But in existing, business-logic-heavy services, vibe coding hits a wall: invariants and business rules are encoded across years of edge cases, the blast radius of a wrong change is large, and failures surface at interfaces and integration seams far from where the change was made.
So the question for legacy code isn’t “Can AI write code?” It’s: How do we let AI change code safely—at high velocity—without breaking what the business depends on?
That’s where Verifiable Throughput comes in.
The bottleneck shift: from implementation to verification
AI coding agents have moved the bottleneck in software delivery. Implementation is now cheap and abundant; the scarce resource is verification.
To use AI effectively on legacy systems, teams need to shift from “shipping faster” to Verifiable Throughput: the rate at which you can ship changes with confidence because correctness and performance are continuously—and automatically—validated against reality.
From judgment to signal
Traditional development leans on human judgment to catch issues: “this looks risky,” “that might be slow,” “this feels wrong.” In mature systems, those instincts are built from context: tribal knowledge, outages, and years of edge cases.
AI agents don’t have that context. They optimize whatever signals we provide.
If the only signal is “tests are green,” an agent will happily generate code that passes tests while quietly degrading latency, increasing cost, or drifting from business rules.
Verifiable Throughput turns your test and measurement stack into a control system: guardrails prevent unsafe moves, and the gradient guides safe iteration.
The three pillars
To enable autonomous improvement (or even high-velocity assisted improvement) in legacy services, your verification suite must evolve in three ways.
1) Enhanced unit testing: intent as executable specification
In legacy code, the hardest part isn’t writing code—it’s preserving intent. Unit tests become the primary language for communicating that intent to an agent.
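To make that concrete, here is a minimal sketch of a unit test suite written as an executable specification. The domain (loyalty discounts), the function, and every rule below are hypothetical illustrations: each test name states a business rule the agent must preserve, not an implementation detail.

```python
# A minimal sketch: unit tests as executable business intent.
# The domain (loyalty discounts) and every name below are hypothetical.
def apply_discount(list_price: float, cost: float, loyalty_years: int) -> float:
    """Toy rule: 1% off per loyalty year, capped at 20%, never below cost."""
    rate = min(loyalty_years, 20) / 100
    return max(list_price * (1 - rate), cost)


# Each test name states a business rule the agent must preserve.
def test_new_customers_pay_list_price():
    assert apply_discount(100.0, 50.0, loyalty_years=0) == 100.0


def test_loyalty_discount_caps_at_twenty_percent():
    # Years beyond the cap add no further discount.
    assert apply_discount(100.0, 0.0, 35) == apply_discount(100.0, 0.0, 20)


def test_price_never_drops_below_cost():
    # Invariant learned the hard way in production: never sell below cost.
    assert apply_discount(100.0, 95.0, loyalty_years=20) >= 95.0
```

An agent asked to refactor `apply_discount` can change the implementation freely; what it cannot do is violate a rule a test has made explicit.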
2) Correctness testing: guardrails for global stability
Agents optimize locally. Legacy systems fail globally—at interfaces, dependencies, and integration seams.
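One way to guard those seams is a characterization (or "golden") test that pins down the full observable output of an interface, so a locally sensible change cannot silently alter what downstream consumers see. A minimal sketch, with a hypothetical handler and payload shape:

```python
# A minimal sketch of a correctness guardrail at an integration seam.
# The handler and payload shape are hypothetical.
import json


def render_invoice(order: dict) -> dict:
    """Toy seam: downstream consumers depend on this exact shape."""
    return {
        "id": order["id"],
        "total_cents": order["qty"] * order["unit_price_cents"],
        "currency": "USD",
    }


def test_invoice_contract_is_stable():
    # The golden snapshot would normally live in a versioned file;
    # it is inlined here to keep the sketch self-contained.
    golden = {"id": "ord-42", "total_cents": 300, "currency": "USD"}
    actual = render_invoice({"id": "ord-42", "qty": 3, "unit_price_cents": 100})
    # Compare the full serialized output, not a few fields: seams fail
    # on the fields nobody thought to assert.
    assert json.dumps(actual, sort_keys=True) == json.dumps(golden, sort_keys=True)
```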
3) Performance testing: a gradient the agent can follow
Legacy services often have performance shaped by real-world distributions, concurrency patterns, caches, and data skew. “It ran fast locally” is not a meaningful signal.
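A more useful signal is a benchmark over production-shaped inputs that reports a signed delta against a baseline, giving the agent a gradient rather than a binary verdict. A minimal sketch; the handler, workload, and baseline numbers are all hypothetical:

```python
# A minimal sketch of a performance gate that yields a gradient, not
# just pass/fail. In practice the baseline comes from the last good
# build and the workload is replayed from production traffic.
import random
import statistics
import time


def handle_request(payload: list[int]) -> int:
    """Toy handler standing in for the code path under test."""
    return sum(sorted(payload))


def p95_latency_ms(runs: int = 200) -> float:
    random.seed(7)  # fixed seed: a reproducible, production-shaped workload
    samples = []
    for _ in range(runs):
        payload = [random.randrange(10_000) for _ in range(5_000)]
        start = time.perf_counter()
        handle_request(payload)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.quantiles(samples, n=20)[18]  # ~p95


def test_p95_within_budget_and_report_gradient():
    baseline_ms = 5.0   # hypothetical: recorded from the last good build
    tolerance = 1.10    # reject regressions beyond +10%
    measured = p95_latency_ms()
    # The signed delta is the "gradient" an agent can optimize against.
    print(f"perf-gradient: p95 {measured:.2f}ms vs baseline {baseline_ms:.2f}ms")
    assert measured <= baseline_ms * tolerance
```

Fixing the random seed keeps the workload identical across runs, which is what makes the delta meaningful as a gradient rather than noise.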
Operationalizing: the Gate System
Make CI/CD explicitly produce two kinds of feedback:
- Correctness Gates (Guardrails): binary pass/fail checks (unit, contract, and integration tests) that automatically reject unsafe changes.
- Performance Gates (Gradient): production-grounded measurements of latency, cost, and throughput against baselines, telling the agent whether a change made things better or worse (both gates are sketched below).
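Here is a minimal sketch of how a single CI step might wire the two gates together. The thresholds, the hardcoded measurement, and the report format are hypothetical stand-ins; real gates would wrap your existing test and benchmark runners.

```python
# A minimal sketch of the two-gate loop as one CI step. All names,
# thresholds, and values are hypothetical.
import subprocess
import sys


def correctness_gate() -> bool:
    """Guardrail: binary pass/fail. Any failure rejects the change."""
    result = subprocess.run([sys.executable, "-m", "pytest", "-q"])
    return result.returncode == 0


def performance_gate(baseline_p95_ms: float, measured_p95_ms: float,
                     tolerance: float = 1.10) -> tuple[bool, float]:
    """Gradient: pass/fail plus a signed score the agent can follow."""
    delta = (measured_p95_ms - baseline_p95_ms) / baseline_p95_ms
    return measured_p95_ms <= baseline_p95_ms * tolerance, delta


if __name__ == "__main__":
    if not correctness_gate():
        print("REJECT: correctness guardrail failed")
        sys.exit(1)
    # 4.6 stands in for the output of the benchmark step above.
    ok, delta = performance_gate(baseline_p95_ms=5.0, measured_p95_ms=4.6)
    print(f"perf gradient: {delta:+.1%} vs baseline")  # feedback for the agent
    sys.exit(0 if ok else 1)
```

The key design choice: the correctness gate is binary and terminal, while the performance gate emits a number the agent can minimize across iterations.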
This is the loop you want for legacy code: fast, automated rejection of unsafe changes; fast, production-grounded guidance toward better ones.
The path forward
The future looks like humans setting constraints and priorities while AI handles implementation and iteration. In legacy systems, the limiting factor is no longer “can we write the code?”—it’s “can we prove we didn’t break the business?”
Without Verifiable Throughput, you’re forced into a bad tradeoff: restrict agents so tightly they’re only marginally useful, or give them freedom and accept instability.
Verifiable Throughput offers a third path: continuous, safe evolution of legacy services—driven by automated guardrails and an optimization gradient grounded in production reality.