I use Claude Code with a custom harness (skills, rules, hooks, sub-agents) to drive consistent quality on autonomous work. After hitting a wall on further improvements, I had Claude run some structured experiments to figure out which knobs actually matter.
I have a macro-flow for feature definition and decomposition, and a micro-flow for sub-task execution. The experiments focused on the micro-flow, which runs fully autonomously.
It's a TDD loop. Every step has a specialized reviewer that checks its output before downstream stages see it.
- Plan: decompose the sub-task into outcomes, behavior, acceptance criteria (AC), and scope. Pull in relevant requirements/architecture. No code, no tests.
- Scaffold: minimum API surface (interfaces, models, DI) for the tester to bind to.
- Test (Red): write the tests as executable acceptance criteria.
- Code (Green): implement until tests pass.
- Review: specialized reviewers (architecture, testing, security, etc.) run rounds until no P1/P2 findings remain. Only reviewers that previously surfaced P1/P2s get re-activated.
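Concretely, the gating looks roughly like this (illustrative Python, not my harness's actual code; `run_agent` is a stand-in for spawning a sub-agent):

```python
# Illustrative micro-flow: each stage's output is gated by a specialized
# reviewer before downstream stages see it; the final review rounds only
# re-activate reviewers that raised P1/P2 findings in the previous round.

STAGES = ["plan", "scaffold", "test", "code"]
REVIEWERS = ["architecture", "testing", "security"]


def run_agent(role: str, context: dict) -> dict:
    """Stand-in: spawn a sub-agent for `role` and return its output/findings."""
    raise NotImplementedError


def blocking(findings: list[dict]) -> list[dict]:
    """Keep only P1/P2 findings, which are the ones that block progress."""
    return [f for f in findings if f["severity"] in ("P1", "P2")]


def micro_flow(subtask: dict) -> dict:
    context = {"subtask": subtask}
    for stage in STAGES:
        output = run_agent(stage, context)
        # Per-stage mini-review: nothing flows downstream unreviewed.
        review = run_agent(f"{stage}-reviewer", {**context, "output": output})
        while blocking(review["findings"]):
            output = run_agent(stage, {**context, "fixes": review["findings"]})
            review = run_agent(f"{stage}-reviewer", {**context, "output": output})
        context[stage] = output

    # Final review rounds: only reviewers with open P1/P2 findings re-run.
    active = set(REVIEWERS)
    while active:
        still_open = set()
        for reviewer in active:
            findings = run_agent(reviewer, context)["findings"]
            if blocking(findings):
                context["code"] = run_agent("code", {**context, "fixes": blocking(findings)})
                still_open.add(reviewer)
        active = still_open
    return context
```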
The experiments ran on three test tasks:
- Postfix calculator (greenfield, simple)
- LRU cache (greenfield, complex)
- A fix to my back-end service (brownfield)
The knobs I varied:
- Task type (greenfield vs brownfield)
- Per-stage model selection
- Review frequency and placement
- Step granularity (one sub-agent per step vs merged)
- Session continuity (persistent vs fresh) per role
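To give a feel for the space, here's a rough sketch of the grid in Python; the keys and the full cross-product are illustrative, not the exact set of runs:

```python
# Rough shape of the experiment space: three tasks crossed with the knobs
# above. Keys and values are illustrative labels, not real config options,
# and the full cross-product here is a sketch rather than the exact runs.
from itertools import product

TASKS = {
    "postfix-calculator": "greenfield-simple",
    "lru-cache": "greenfield-complex",
    "backend-fix": "brownfield",
}

KNOBS = {
    "implementer_model": ["opus", "sonnet"],
    "reviewer_model": ["opus", "sonnet"],
    "review_placement": ["per-stage", "final-only"],
    "granularity": ["agent-per-step", "merged-worker"],
    "worker_session": ["persistent", "fresh"],
    "reviewer_session": ["persistent", "fresh"],
}

runs = [
    {"task": task, **dict(zip(KNOBS, combo))}
    for task in TASKS
    for combo in product(*KNOBS.values())
]
```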
What I found:
- Sometimes slow is fast. Opus costs more per token, but on brownfield it got things right the first time and spent less time in review, so it was cheaper end to end and produced better code. Sonnet sometimes won on greenfield, but it kept drifting out of scope on brownfield. For mixed work, a Sonnet implementer with an Opus reviewer was the reliable middle ground: top-3 on every task type, second-cheapest overall.
- Review early and often. Skipping the per-stage reviews cut cost but shipped a measurably thinner test suite. Mini-reviews push the next stage to cover harder edges. One big review at the end doesn't.
- Persistent workers, fresh reviewers. Resuming worker sessions across review/fix cycles saved cost and improved quality because the worker accumulated context. Doing the same for reviewers made them worse: they anchored on their prior pushback. Fresh reviewers re-derive findings each round and catch more (see the sketch after this list).
- Specialist beats generalist, when the task warrants it. Merging steps into one sub-agent was dramatically cheaper. On a simple coherent task, quality held. On brownfield and on tasks with subtle AC requirements, the merged worker drifted out of scope. Specialization is a guardrail you pay for. Pay it when scope discipline matters.
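The session split from the third bullet, in the same illustrative style (the `Session`/`spawn_session` plumbing is hypothetical, not Claude Code's actual API; the Sonnet-worker/Opus-reviewer pairing matches the middle-ground config from the first bullet):

```python
# Illustrative review/fix loop: one persistent worker session accumulates
# context across rounds, while every round gets a fresh reviewer session so
# it re-derives findings instead of anchoring on its earlier pushback.
# Session/spawn_session are hypothetical stand-ins, not a real API.

class Session:
    """Stand-in for a sub-agent session handle."""

    def run(self, payload: dict) -> dict:
        raise NotImplementedError


def spawn_session(role: str, model: str) -> Session:
    """Stand-in: create (or resume) a sub-agent session on a given model."""
    raise NotImplementedError


def review_fix_loop(task: dict, max_rounds: int = 5) -> dict:
    worker = spawn_session("implementer", model="sonnet")    # persistent across rounds
    result = worker.run(task)
    for _ in range(max_rounds):
        reviewer = spawn_session("reviewer", model="opus")   # fresh every round
        findings = reviewer.run({"task": task, "result": result})["findings"]
        if not any(f["severity"] in ("P1", "P2") for f in findings):
            break
        # Same worker session: it already holds the plan, AC, and prior diffs.
        result = worker.run({"fixes": findings})
    return result
```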
These results are highly contextual, tied to my harness and task type. I'm leaning toward granular flow with heavy models as the default given the kind of work I'm doing right now. Higher latency and cost per task, but it reduces review friction enough that I can orchestrate more parallel work, which is the lever I actually care about.
I also found that `effort:high` is a sweet spot for me. It ended up cheaper than `medium` because it takes fewer review rounds, and the quality gain from `high` to `xhigh` was insignificant compared to the cost increase.
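The arithmetic behind that tradeoff, with made-up numbers purely to show the shape:

```python
# Illustrative only: why higher effort can be cheaper end to end.
# These numbers are invented to show the shape, not measured results.
cost_per_round = {"medium": 1.0, "high": 1.4, "xhigh": 2.2}   # relative cost per round
rounds_needed  = {"medium": 3,   "high": 2,   "xhigh": 2}     # review/fix rounds to converge
total = {k: cost_per_round[k] * rounds_needed[k] for k in cost_per_round}
# -> medium: 3.0, high: 2.8, xhigh: 4.4  (high wins despite the higher per-round cost)
```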