5 Levels of Agentic Engineering

There Are 5 Levels of Agentic Engineering. Most Engineers Are Stuck at Level 2.

I've been using Claude Code daily for months now. And the most common question I get from other engineers isn't "does it work?" — they've seen the demos. It's "how do I actually use it well?"

The answer is: it depends on which level you're operating at.

Because here's what nobody tells you: agentic engineering isn't one thing. There's a wide spectrum from "Claude Code helps me think" to "Claude Code ships 1,000 PRs a week without a human touching the keyboard." Treating them as the same thing is why most engineers either under-invest (using it as a thinking partner only) or over-rotate (expecting it to autonomously build their entire system and then wondering why it went sideways).

Let me break down the five levels I've seen in the wild — including what they're good for, what they're bad for, and when it's time to level up.


Level 1: Interactive Prompting — The Thinking Partner

This is where everyone starts, and honestly, it's the right place to start.

You're in a conversation. You paste code, ask questions, and explore ideas. You're driving; Claude Code is riding shotgun. The classic use case here is what people call "vibe coding" — you describe roughly what you want, Claude Code generates something, you tweak it, iterate, ship. It works surprisingly well for greenfield feature work or throwaway scripts.

But vibe coding has real limits, and I want to be direct about it: vibe coding is great for exploration, dangerous for production systems. When you're not writing the code yourself, you can easily end up with something that looks right but doesn't fit your architecture, skips error handling, or quietly breaks a contract your system depends on.

The legitimate use of Level 1 is learning and exploration. When I'm working in an unfamiliar codebase or trying to understand a new API, I'm not asking Claude Code to write code for me — I'm using it to compress weeks of learning into hours. "Walk me through how this auth middleware works." "What's the idiomatic way to handle retries in Go?" That's Level 1 at its best: a brilliant teammate who's read every docs page and can explain it in plain English.

Good for: exploring unfamiliar codebases, rapid prototyping, learning new languages/frameworks, rubber duck debugging

Bad for: production code without review, anything requiring deep system context you haven't provided


Level 2: Single Agent, Engineer-Directed — The Pair Programmer

This is where most professional engineers are (or should be) operating today.

You're still in control, but you're delegating meaningful work rather than just asking questions. The key difference: you give Claude Code a spec, not just a prompt. You write the requirements, define the interface, specify the constraints — then let it implement. You review the output critically, the same way you'd review a PR from a junior engineer.

The pattern that actually works here (there's a minimal sketch of it right after this list):

  1. Write a brief spec (what it does, not how it does it)
  2. Include context: relevant existing code, architectural decisions, constraints
  3. Ask Claude Code to generate with tests
  4. Review like it's a PR — not just "does it look right" but "does this fit how we actually build things here"
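
To make that concrete, here's roughly what steps 1 through 3 look like when I drive them from a script instead of an interactive session. This is a minimal sketch: the file paths, spec text, and constraints are placeholders for your own project, and it assumes Claude Code's headless `claude -p` mode. Step 4 stays human.

```python
import subprocess
from pathlib import Path

# Step 1: a brief spec -- what it does, not how it does it.
# Everything below is a placeholder, not a recipe.
spec = """\
Task: put the retry logic in payments/client.py behind a RetryPolicy interface.
Behavior: callers pass a RetryPolicy; the default preserves current behavior
(3 attempts, exponential backoff with jitter).
Constraints:
- Do not change the public signature of PaymentsClient.charge().
- The existing tests in tests/test_payments_client.py must still pass.
- Add tests for each RetryPolicy implementation.
"""

# Step 2: include context -- the actual code the agent needs, not a guess at it.
current_code = Path("payments/client.py").read_text()

# Step 3: ask for the implementation plus tests (headless `claude -p` mode).
prompt = f"{spec}\nCurrent implementation of payments/client.py:\n{current_code}"
subprocess.run(["claude", "-p", prompt], check=True)

# Step 4 stays human: review the resulting diff like a PR before anything merges.
```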

I've used this pattern for refactoring tasks that used to eat entire afternoons. "Here's the current implementation, here's the interface I want to end up at, here are the constraints — refactor this and make sure the existing tests still pass." Works really well. The output isn't always perfect on the first pass, but it's a solid 70% that gets me to done in a fraction of the time.

What makes Level 2 actually work is context quality. The engineers who say "Claude Code's output is garbage" are usually the ones giving it three sentences and expecting production-grade code. The engineers getting real leverage are writing almost as much text in the spec as they would have written in the code — they're just offloading the mechanical implementation.

Good for: refactoring with clear specs, boilerplate-heavy features, test generation, migrations

Bad for: open-ended architectural decisions, anything where the "right answer" isn't well-defined yet


Level 3: Orchestration — The Engineering Team

Here's where things get genuinely interesting — and where most engineers haven't gone yet.

Claude Code supports spawning sub-agents. Instead of one agent doing everything, you have a team: an architect that designs the approach, an implementer that writes the code, a reviewer that audits the output. Each agent has a focused role and limited scope. They communicate through structured hand-offs.

The architect + implementer + reviewer pattern is the one I keep coming back to:

  • Architect: given requirements, produce a technical design with explicit trade-offs. No code yet, just decisions.
  • Implementer: given the design, produce working code. It knows the plan; it doesn't need to make architectural choices.
  • Reviewer: given the code and the original design, audit for correctness, edge cases, and deviation from intent.

Why does this work better than a single agent doing all three? Because the cognitive modes are different. An agent that's simultaneously designing and implementing tends to cut corners in the design to make implementation easier. Separation forces rigor.
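
Here's a minimal sketch of that hand-off, again assuming the headless `claude -p` mode. The role prompts and the example feature are illustrative, not a recipe; in practice you'd persist each artifact and sanity-check the design before the implementer runs.

```python
import subprocess

def run_agent(role_prompt: str, payload: str) -> str:
    """Run one scoped agent via Claude Code's headless mode and return its output."""
    result = subprocess.run(
        ["claude", "-p", f"{role_prompt}\n\n{payload}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

requirements = "Add per-key rate limiting to the public API: 100 req/min, respond 429 on breach."

# Architect: decisions and trade-offs only, no code.
design = run_agent(
    "You are the architect. Produce a technical design with explicit trade-offs. Do not write code.",
    requirements,
)

# Implementer: follows the design; it does not get to make architectural choices.
implementation = run_agent(
    "You are the implementer. Produce working code with tests that follows this design. "
    "Flag anything the design leaves ambiguous instead of deciding it yourself.",
    design,
)

# Reviewer: fresh eyes, auditing the code against the original design.
review = run_agent(
    "You are the reviewer. Audit the code against the design for correctness, edge cases, "
    "and deviation from intent.",
    f"Design:\n{design}\n\nCode:\n{implementation}",
)
print(review)
```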

This pattern genuinely changes the quality ceiling. I've used it on non-trivial features and the output was closer to "senior engineer's first draft" than "intern who just joined." The reviewer pass especially — having a separate agent audit with fresh eyes catches things the implementer glosses over.

The trade-off: orchestration adds complexity. More moving parts, more potential for hand-off failures, more tokens burned. Don't reach for this when Level 2 gets the job done. Reach for it when the problem is complex enough that a single agent loses context coherence or when quality requirements are high enough to justify the overhead.

Good for: complex multi-file features, anything requiring distinct design/build/review phases, higher-stakes code

Bad for: simple tasks (you'll spend more time wrangling agents than shipping), situations where iteration speed matters more than first-pass quality


Level 4: Event-Driven — The Workflow Automation

This is where the paradigm shifts. Levels 1-3 are all you initiating the work. Level 4 is systems initiating the work.

An agent wakes up because something happened. A Jira ticket was created. A PR was opened. A webhook fired. A CI job failed. The agent handles it, produces an output, and goes back to sleep.

Here's the pattern that's emerging in software engineering orgs: Product creates a Jira ticket for a new feature — complete with detailed requirements, acceptance criteria, and design specs. The ticket gets assigned to a sprint and its status toggles to "Ready to Start." That status change fires a webhook, and a coordinating agent picks up the ticket.

The coordinating agent reads the requirements, analyzes the scope, and breaks the work into subtasks — each one a Jira sub-ticket with a focused description. Then it spins up a team of specialized agents. One agent handles the database schema changes and migrations. Another writes the API endpoints. Another builds the frontend components. Another writes the test suite. Each agent picks up its assigned sub-ticket, does the work, and opens a PR linked to that subtask.

Engineers arrive in the morning to find PRs ready for review, each linked to its respective Jira sub-ticket, each scoped to a specific slice of the feature. The human effort shifts from "write all of this code" to "review what the agents produced and decide what ships."

This is the model that scales. You're not asking people to remember to invoke the agent — the system does it automatically based on events. It fits naturally into how engineering workflows already operate.

The webhook pattern generalizes broadly. PR opened → run an automated review agent. Deploy succeeded → run a smoke test agent. Error rate spiked → run a root cause analysis agent. None of these require a human to remember to do anything. They're just part of the pipeline.
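
To make the trigger side concrete, here's a minimal sketch of a webhook receiver that dispatches a scoped agent. Flask is just for illustration, the Jira payload fields are what I'd expect but treat them as assumptions, and `dispatch_agent` is a stand-in for however you actually run the agent in your environment.

```python
import subprocess
import threading

from flask import Flask, request

app = Flask(__name__)

def dispatch_agent(ticket_key: str, summary: str, description: str) -> None:
    """Stand-in for however you run the agent; keeps the scope pinned to one ticket."""
    prompt = (
        f"You are handling Jira ticket {ticket_key} and nothing else. "
        f"Stop when its acceptance criteria are met and open a PR linked to the ticket.\n\n"
        f"Summary: {summary}\n\nRequirements:\n{description}"
    )
    subprocess.run(["claude", "-p", prompt], check=True)

@app.route("/jira-webhook", methods=["POST"])
def jira_webhook():
    event = request.get_json(force=True) or {}
    issue = event.get("issue", {})
    fields = issue.get("fields", {})
    status = (fields.get("status") or {}).get("name", "")

    # Well-defined trigger condition: only act on the one status change we care about.
    if status == "Ready to Start":
        threading.Thread(
            target=dispatch_agent,
            args=(issue.get("key", ""), fields.get("summary", ""), fields.get("description", "")),
            daemon=True,
        ).start()
    return {"ok": True}, 200
```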

What you need to make this work: well-defined trigger conditions, clear scope per trigger (the agent should know exactly what it's responsible for and where to stop), and a review layer before any agent output has real consequences. The human doesn't disappear at Level 4 — they shift from executor to reviewer.

Good for: workflow automation, reducing toil on repetitive process tasks, first-pass triage/analysis

Bad for: anything requiring real-time judgment calls, situations where incorrect automation causes more work than it saves


Level 5: Fully Autonomous — The Outcome Machine

Stripe built a system called Minions. It merges over 1,300 pull requests per week — with no human-written code.

Let that sink in for a second.

Minions operates across a codebase of hundreds of millions of lines of code, primarily Ruby with Sorbet typing — an uncommon stack that standard LLMs aren't specifically trained on. The system handles real engineering work in this codebase, and the flow is remarkably clean: a task starts in a Slack message and ends in a pull request that passes CI and is ready for human review, with no interaction in between.

Under the hood, Minions is built on a fork of Block's open-source Goose agent, customized with what Stripe calls "blueprints" — hybrid workflows that combine deterministic nodes (linting, git operations, pushing changes) with agentic nodes (implementation, CI failure fixes). This split is key: by constraining which decisions require an LLM and which are just fixed operations, they reduce token consumption and compound reliability. The agents run in isolated "devboxes" — pre-warmed EC2 environments that spin up in under 10 seconds with Stripe's code and services already loaded. No production access. No internet access. Full permissions within the sandbox.
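
Stripe hasn't published the blueprint format, so treat this as a conceptual sketch of the deterministic/agentic split rather than their implementation. All names are hypothetical stubs; the point is that only some nodes spend tokens or make judgment calls.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Node:
    name: str
    agentic: bool                      # True = LLM call, False = fixed operation
    run: Callable[[dict], dict]

def implement_change(state: dict) -> dict:
    # Agentic node: the only step here that spends tokens or exercises judgment.
    return state  # stub

def lint(state: dict) -> dict:
    # Deterministic node: a fixed linter invocation, same result every time.
    return state  # stub

def push_branch(state: dict) -> dict:
    # Deterministic node: plain git operations, no model involved.
    return state  # stub

blueprint = [
    Node("implement change", agentic=True, run=implement_change),
    Node("lint", agentic=False, run=lint),
    Node("push branch", agentic=False, run=push_branch),
]

def run_blueprint(state: dict) -> dict:
    # Fewer agentic nodes means fewer tokens burned and fewer places
    # where a non-deterministic step can compound into a bad branch.
    for node in blueprint:
        state = node.run(state)
    return state
```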

The tooling is extensive. Stripe's internal MCP server, called Toolshed, exposes nearly 500 tools spanning internal systems and SaaS platforms — documentation, ticket details, build statuses, code intelligence via Sourcegraph. The agents read the same coding rules that human engineers use in Cursor and Claude Code, scoped to specific subdirectories rather than dumped as global context. When CI fails, the agent gets at most two rounds to fix it — one standard iteration against the full CI suite, then one retry if tests fail without auto-fixes. After that, the branch goes to a human. Diminishing returns on unlimited retries, so they cut it off.
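
That bounded-retry policy is worth sketching, because unlimited retries is the trap most teams fall into. Everything here is a hypothetical stand-in (the CI script, the fix prompt, the hand-off), not Stripe's code:

```python
import subprocess

MAX_FIX_ROUNDS = 2  # one pass against the full CI suite, then one retry

def run_ci(branch: str) -> str:
    """Stub: run CI for the branch, return failure output ('' means green)."""
    result = subprocess.run(["./run_ci.sh", branch], capture_output=True, text=True)
    return "" if result.returncode == 0 else result.stdout

def ask_agent_to_fix(branch: str, failures: str) -> None:
    """Stub: one agentic fix pass against the failing checks."""
    subprocess.run(["claude", "-p", f"Fix these CI failures on {branch}:\n{failures}"], check=True)

def assign_to_human(branch: str) -> None:
    """Stub: route the branch to a person instead of retrying forever."""
    print(f"{branch}: still red after {MAX_FIX_ROUNDS} rounds, handing off to a human")

def iterate_until_green(branch: str) -> None:
    for _ in range(MAX_FIX_ROUNDS):
        failures = run_ci(branch)
        if not failures:
            return  # green: the branch goes straight to human review
        ask_agent_to_fix(branch, failures)
    if run_ci(branch):  # still failing after the allowed rounds
        assign_to_human(branch)
```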

Humans don't supervise the agent's process — they review the output. Engineers receive branches after the automated iterations complete, and their review surface is the PR itself: does the code do what it should, does it pass CI, does it follow the team's patterns. The step-by-step of how the agent got there is logged in a web UI but isn't the primary review artifact.

Amazon Q Developer is a different flavor of Level 5 — depth instead of breadth. Where Stripe's Minions handles many types of tasks across a massive codebase, Amazon used Q Developer internally to execute one specific transformation at enormous scale: upgrading tens of thousands of production Java applications from Java 8 to Java 17.

The numbers are staggering. A team of 5 people upgraded 1,000 production applications in just 2 days. The average upgrade took 10 minutes per application; the longest took less than an hour. Across the full effort, more than 50% of Amazon's production Java applications were upgraded within 6 months, saving an estimated 4,500 developer-years of work and producing $260 million in annualized efficiency gains. 79% of the auto-generated code changes were shipped without any additional human modifications.

The agent autonomously analyzes source code, generates the upgraded code, runs tests, and produces a change ready for human approval. It's the same pattern as Minions — full autonomy in execution, human review of outcomes — just applied to a single, well-defined transformation type rather than general-purpose engineering tasks.

Both examples demonstrate what Level 5 actually requires: the agent has enough autonomy, context, and quality controls that you trust the output without supervising the process. The human review surface is the outcome, not the journey.

The prerequisite for Level 5 is almost always Level 4 working well first. You don't jump from Level 2 to Level 5 and hope for the best. You build trust incrementally: event-driven automation with tight scope, then gradually expand scope as you validate output quality.

The trade-off at Level 5 is serious: when things go wrong, they can go wrong at scale. An agent that merges 1,300 PRs a week has 1,300 places to introduce a bad pattern. Your safety net is test coverage, code review tooling, and the quality of your agent's training/prompting. If those are weak, Level 5 will cost you more than it saves.

Good for: high-volume, well-defined, repetitive engineering tasks where quality can be validated programmatically

Bad for: novel architectural decisions, anything where "wrong" isn't caught by automated checks, early-stage codebases without strong test coverage


So, Which Level Should You Be At?

Most engineers I talk to are operating at Level 1-2. That's fine — there's real leverage there. But if you're not at least exploring Level 3 on complex features, you're leaving productivity on the table. The honest progression:

  • Start at Level 1 to learn the model, understand its failure modes, and calibrate your expectations
  • Move to Level 2 when you've got a workflow for writing good specs and reviewing critically
  • Try Level 3 on your next complex feature — the architect/implementer/reviewer split pays off fast
  • Build a Level 4 trigger for one repetitive process task you handle manually today (PR review triage is a good first one)
  • Level 5 is earned, not assumed — get there by proving the earlier levels work reliably in your environment

The mistake I see most often is engineers jumping to whatever the hype is (right now: autonomous agents, Level 5) without building the foundation. If you don't have strong test coverage and code review discipline, a Level 5 agent isn't going to save you — it's going to generate technical debt at superhuman speed.

Build the foundation. Level up deliberately. The ceiling is genuinely high, but you have to climb the ladder to get there.

When I first started using Claude Code, I thought the question was "can it code?" That's not the question. The question is "what am I asking it to be — a thinking partner, a pair programmer, a team, a workflow, or a factory?" Each one is valuable. Each one has different prerequisites. And knowing which one you need is half the skill.


I'm Jon Cortez, a Principal Software Engineer at Minted. I write about the intersection of AI and software engineering — the real stuff, not the hype. Follow along if you're interested in what Agentic Engineering actually looks like in practice.

