I use Claude Code with a custom harness (skills, rules, hooks, sub-agents) to drive consistent quality on autonomous work. After hitting a wall on further improvements, I had Claude run some structured experiments to figure out which knobs actually matter.
I have a macro-flow for feature definition and decomposition, and a micro-flow for sub-task execution. The experiments focused on the micro-flow, which runs fully autonomously.
It's a TDD loop. Every step has a specialized reviewer that checks its output before downstream stages see it.
- Plan: decompose the sub-task into outcomes, behavior, acceptance criteria (AC), and scope. Pull in relevant requirements/architecture. No code, no tests.
- Scaffold: minimum API surface (interfaces, models, DI) for the tester to bind to.
- Test (Red): write the tests as executable acceptance criteria.
- Code (Green): implement until tests pass.
- Review: specialized reviewers (architecture, testing, security, etc.) run rounds until no P1/P2 findings remain. Only reviewers that previously surfaced P1/P2s get re-activated.
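Concretely, the gating looks roughly like this (illustrative Python, not my harness's actual code; `run_agent` is a stand-in for spawning a sub-agent):

```python
# Illustrative micro-flow: each stage's output is gated by a specialized
# reviewer before downstream stages see it; the final review rounds only
# re-activate reviewers that raised P1/P2 findings in the previous round.

STAGES = ["plan", "scaffold", "test", "code"]
REVIEWERS = ["architecture", "testing", "security"]


def run_agent(role: str, context: dict) -> dict:
    """Stand-in: spawn a sub-agent for `role` and return its output/findings."""
    raise NotImplementedError


def blocking(findings: list[dict]) -> list[dict]:
    """Keep only P1/P2 findings, which are the ones that block progress."""
    return [f for f in findings if f["severity"] in ("P1", "P2")]


def micro_flow(subtask: dict) -> dict:
    context = {"subtask": subtask}
    for stage in STAGES:
        output = run_agent(stage, context)
        # Per-stage mini-review: nothing flows downstream unreviewed.
        review = run_agent(f"{stage}-reviewer", {**context, "output": output})
        while blocking(review["findings"]):
            output = run_agent(stage, {**context, "fixes": review["findings"]})
            review = run_agent(f"{stage}-reviewer", {**context, "output": output})
        context[stage] = output

    # Final review rounds: only reviewers with open P1/P2 findings re-run.
    active = set(REVIEWERS)
    while active:
        still_open = set()
        for reviewer in active:
            findings = run_agent(reviewer, context)["findings"]
            if blocking(findings):
                context["code"] = run_agent("code", {**context, "fixes": blocking(findings)})
                still_open.add(reviewer)
        active = still_open
    return context
```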
The experiments ran on three test tasks:
- Postfix calculator (greenfield, simple)
- LRU cache (greenfield, complex)
- A fix to my back-end service (brownfield)
The knobs I varied:
- Task type (greenfield vs brownfield)
- Per-stage model selection
- Review frequency and placement
- Step granularity (one sub-agent per step vs merged)
- Session continuity (persistent vs fresh) per role
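To give a feel for the space, here's a rough sketch of the grid in Python; the keys and the full cross-product are illustrative, not the exact set of runs:

```python
# Rough shape of the experiment space: three tasks crossed with the knobs
# above. Keys and values are illustrative labels, not real config options,
# and the full cross-product here is a sketch rather than the exact runs.
from itertools import product

TASKS = {
    "postfix-calculator": "greenfield-simple",
    "lru-cache": "greenfield-complex",
    "backend-fix": "brownfield",
}

KNOBS = {
    "implementer_model": ["opus", "sonnet"],
    "reviewer_model": ["opus", "sonnet"],
    "review_placement": ["per-stage", "final-only"],
    "granularity": ["agent-per-step", "merged-worker"],
    "worker_session": ["persistent", "fresh"],
    "reviewer_session": ["persistent", "fresh"],
}

runs = [
    {"task": task, **dict(zip(KNOBS, combo))}
    for task in TASKS
    for combo in product(*KNOBS.values())
]
```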
What I found:
- Sometimes slow is fast. Opus costs more per token, but on brownfield it got things right the first time and spent less time in review, so it was cheaper end to end and produced better code. Sonnet sometimes won on greenfield, but it kept drifting out of scope on brownfield. For mixed work, a Sonnet implementer with an Opus reviewer was the reliable middle ground: top-3 on every task type, second-cheapest overall.
- Review early and often. Skipping the per-stage reviews cut cost but shipped a measurably thinner test suite. Mini-reviews push the next stage to cover harder edges. One big review at the end doesn't.
- Persistent workers, fresh reviewers. Resuming worker sessions across review/fix cycles saved cost and improved quality because the worker accumulated context. Doing the same for reviewers made them worse: they anchored on their prior pushback. Fresh reviewers re-derive findings each round and catch more (see the sketch after this list).
- Specialist beats generalist, when the task warrants it. Merging steps into one sub-agent was dramatically cheaper. On a simple coherent task, quality held. On brownfield and on tasks with subtle AC requirements, the merged worker drifted out of scope. Specialization is a guardrail you pay for. Pay it when scope discipline matters.
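The session split from the third bullet, in the same illustrative style (the `Session`/`spawn_session` plumbing is hypothetical, not Claude Code's actual API; the Sonnet-worker/Opus-reviewer pairing matches the middle-ground config from the first bullet):

```python
# Illustrative review/fix loop: one persistent worker session accumulates
# context across rounds, while every round gets a fresh reviewer session so
# it re-derives findings instead of anchoring on its earlier pushback.
# Session/spawn_session are hypothetical stand-ins, not a real API.

class Session:
    """Stand-in for a sub-agent session handle."""

    def run(self, payload: dict) -> dict:
        raise NotImplementedError


def spawn_session(role: str, model: str) -> Session:
    """Stand-in: create (or resume) a sub-agent session on a given model."""
    raise NotImplementedError


def review_fix_loop(task: dict, max_rounds: int = 5) -> dict:
    worker = spawn_session("implementer", model="sonnet")    # persistent across rounds
    result = worker.run(task)
    for _ in range(max_rounds):
        reviewer = spawn_session("reviewer", model="opus")   # fresh every round
        findings = reviewer.run({"task": task, "result": result})["findings"]
        if not any(f["severity"] in ("P1", "P2") for f in findings):
            break
        # Same worker session: it already holds the plan, AC, and prior diffs.
        result = worker.run({"fixes": findings})
    return result
```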
These results are highly contextual, tied to my harness and task type. I'm leaning toward granular flow with heavy models as the default given the kind of work I'm doing right now. Higher latency and cost per task, but it reduces review friction enough that I can orchestrate more parallel work, which is the lever I actually care about.
I also found that `effort:high` is a sweet spot for me. It ended up cheaper than `medium` because it takes fewer review rounds, and the quality gain from `high` to `xhigh` was insignificant compared to the cost increase.
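The arithmetic behind that tradeoff, with made-up numbers purely to show the shape:

```python
# Illustrative only: why higher effort can be cheaper end to end.
# These numbers are invented to show the shape, not measured results.
cost_per_round = {"medium": 1.0, "high": 1.4, "xhigh": 2.2}   # relative cost per round
rounds_needed  = {"medium": 3,   "high": 2,   "xhigh": 2}     # review/fix rounds to converge
total = {k: cost_per_round[k] * rounds_needed[k] for k in cost_per_round}
# -> medium: 3.0, high: 2.8, xhigh: 4.4  (high wins despite the higher per-round cost)
```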