The 20x Gap

In February 2026, Anthropic published “Measuring AI Agent Autonomy in Practice”. It is a cute piece of research that you should probably read before your board starts nagging you. Two numbers really stood out to me. I don’t know if they are good or bad, but they are something.

49.7% of agent tool calls across their public API are in service of software engineering.

2.4% are in service of cybersecurity.

That is roughly a 20x gap. For every hour an agent spends writing and shipping code, it spends about three minutes hacking and defending it. That was the snapshot in February. I bet the gap is larger now, given how fast Anthropic has grown on the back of its Claude Code product.

What the number actually measures

Anthropic counted tool calls across their public API and complemented that with Claude Code session data. Software engineering dominates because that is where agents got good first, and where the business case is obvious. You type a sentence, a feature appears, your velocity chart goes up and to the right.

Cyber sits at 2.4% not because cyber matters less. It sits there because the operating model has not caught up. Much of security work still looks the way it looked in 2019, and most security tooling still assumes a human is the thing reading the output.

Quick caveat before we keep going. Anthropic publishes the classifier taxonomy in the appendix PDF, and it is worth reading before you interpret these numbers. The exact definitions:

  • Software engineering: “Writing, fixing, reviewing code, debugging, software development”
  • Cybersecurity: “Pentesting, CTF challenges, vulnerability research, security tasks”

Three of the four cyber examples lean offensive. The fourth, “security tasks,” is a catch-all the paper does not define further. So when a developer asks an agent to fix an SQL injection, or a detection engineer writes a SIEM pipeline, or someone patches a known CVE, we genuinely do not know whether that lands in Cybersecurity or gets absorbed into the 49.7% software engineering bucket. The classifier is Claude making judgment calls per tool call, and we do not get to see the split.

So 2.4% is a narrow slice, however you read it. It is either mostly offensive work (if Claude read the examples literally) or a mix of offensive work and some unknown amount of defensive work (if the “security tasks” catch-all was read generously). Either way, a meaningful amount of agent-driven security work is invisible inside the software engineering bucket.

The ratio still points where it points. More agent energy probably goes into security work than the headline number shows, and far more still goes into building than into any of it. But if you quote this to a board, quote it carefully.

The question the board is asking today

If you sit near a board right now, the questions rhyme:

  • Are you using AI?
  • What is our token budget per employee?
  • How many roles can we consolidate?
  • What efficiency are we gaining?
  • When do we see it in the margins?

Every one of these is a productivity question. None of them are governance questions. None are security questions. That is rational, for now. When the cost of being slow is losing the market to whoever ships faster, you optimize for throughput.

The question the board is going to ask

At some point the steady generation of poorly written, unmanaged attack surface catches up with you. Agents do not stop writing code because the code was bad. They write more. The pile grows. And the pile is where incidents live.

When that bill arrives, the questions will change:

  • How are we governing agent-generated code?
  • What fraction of our AI spend defends the thing the other 49.7% is building?
  • Is 2.4% our number too? Should it be higher?
  • Who owns it when an agent ships insecure infrastructure at three in the morning?

The CISO stops being a cost center item on slide 47 and starts being the board’s translator between velocity and survivability.

The ratio we used to run

For a long time the rule of thumb for tech companies was one security person per 20 to 50 engineers. That is not outrageous. It reflected an honest belief about how much code humans could write, how much attack surface that code generated, and how much of it one security engineer could reasonably keep an eye on.

Agents break the denominator. If an agent supervised by an engineer ships three or five or ten times more code, the attack surface multiplies with it. The ratio does not automatically follow. Most orgs I talk to are quietly assuming it stays flat. One security engineer per 20 to 50 humans, regardless of how much code those humans now produce. Some orgs are cutting security headcount faster than engineering headcount because “AI can do cyber”.

Maybe that is fine. Maybe defensive agents close the gap without needing more headcount. Maybe not. The point is nobody is measuring it. The 2.4% is the first public number I have seen that even lets you ask “is our ratio of build-to-defend getting worse?”

Smaller companies are almost certainly not inside that 2.4%. If you are a twenty-person startup, your cyber agent usage is probably zero. Your SWE agent usage is ninety-something percent of your engineering hours. The gap at the long tail is not 20x. It is infinite.

At the very least this is a number worth monitoring. It is a leading indicator of something, and I would rather we all figure out what before the incident report tells us.

The old operating model

The old operating model looked like this. Scan the environment. Surface findings. Enrich with context. Ticket, triage, assign. Wait for the engineer who owns the resource to fix it. Chase. Close.

Every stage of that chain assumes the bottleneck is information. If we can just surface the right finding, de-duplicated, false positives removed, risk-adjusted, color-coded to the right person, someone will act. The whole stack optimizes for finding fidelity and dashboard clarity. Sometimes that meant vendors competing on false positive rate, integrations, and how pretty the severity pie chart looked.

That made sense when humans wrote the code and humans fixed the findings. Human throughput bounded both sides of the equation, so a neat ticket queue was mostly sufficient. I mean, there were still problems of scale, like every tool racing to produce the most high-risk findings to prove its own value, but some teams managed to figure it out. Agents break the symmetry. One side of the equation got a 10x powerup overnight. The other side, in most orgs, is still feeding the ticket queue to humans.

What does a 2.4% shop look like under agent-speed build? Findings per week grow with code volume. Or worse, they grow with system complexity, which itself grows with code volume. Or worse still, they grow exponentially, because unsupervised agents write sh%t code. Humans in the ticket loop do not grow. Queue depth goes vertical. Dashboards get prettier and MTTR gets worse. The honest answer to “are we secure” becomes “we have excellent visibility into how insecure we are.”
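A toy model makes the queue math concrete. This is a minimal sketch in Python; every parameter is made up, and the only claim is the shape of the curve when inflow compounds and triage capacity does not.

    # Toy model: finding inflow compounds with agent-amplified code volume,
    # human triage capacity stays flat. All parameters are illustrative.

    HUMAN_CAPACITY = 200    # findings the team can close per week
    BASE_INFLOW = 150       # weekly findings before agents arrived
    WEEKLY_GROWTH = 1.05    # 5% weekly growth in agent-written code volume

    queue = 0.0
    for week in range(1, 53):
        inflow = BASE_INFLOW * (WEEKLY_GROWTH ** week)
        queue = max(0.0, queue + inflow - HUMAN_CAPACITY)
        if week % 13 == 0:
            print(f"week {week:2d}: inflow {inflow:6.0f}/wk, backlog {queue:7.0f}")

In this sketch inflow passes capacity around week six, and from there the backlog only compounds.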

The new operating model

The new model is not “hire more humans.” It is not even “buy better dashboards.” It is agents on the defensive side doing the same kind of work agents are already doing on the offensive side, with the same autonomy budget.

That looks like:

  • Analysis and triage done by agents. A finding arrives. An agent pulls the resource, reads the code, reads the policy, decides if it is a real issue, decides who owns it, decides how urgent it is. Humans see the short list, not the raw firehose.
  • Fixing done by agents. For the large class of findings that are reversible and well scoped (a public S3 bucket, an over permissive role, a missing tag, a known CVE with a clean patch) the agent proposes and applies the fix. The study notes only 0.8% of agent actions are irreversible. That is exactly why defensive automation works. Most remediations are safe to try.
  • Guardrails enforced by agents. Hook a security agent into the dev loop directly. Every tool call the build agent makes (every file write, every IAM change, every Terraform plan) gets reviewed in line, the moment it happens. Not at build time. Not at PR time. The second the build agent tries something sketchy, the defensive agent waves hello like a GTA NPC. Not a linter, not a policy engine, an agent that understands intent and can argue with the other agent. There is a sketch of what that hook could look like after this list.
  • Humans moved up the stack. Humans stop being triage workers. They write policy. They decide what defensive agents are allowed to do alone, what needs approval, what is always off limits. They investigate the hard 5% that the agents flag as weird.
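
To make the guardrail idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: review_tool_call, the risky-action list, and the verdicts are illustrative stand-ins for whatever your agent runtime actually exposes, not a real API.

    # Hypothetical in-line guardrail: a defensive reviewer sits between the
    # build agent and its tools, vetoing or escalating risky calls before
    # they execute. Names and policies here are illustrative, not a real API.

    RISKY_ACTIONS = {"iam:PutRolePolicy", "s3:PutBucketAcl", "iam:CreateAccessKey"}

    def review_tool_call(tool_name: str, args: dict) -> str:
        """Return 'allow', 'block', or 'escalate' for a proposed tool call."""
        if tool_name == "aws_api" and args.get("action") in RISKY_ACTIONS:
            return "escalate"  # privilege changes go to a human or a senior agent
        if tool_name == "file_write" and args.get("path", "").endswith(".tf"):
            # Terraform that opens the world gets stopped before it lands.
            if "0.0.0.0/0" in args.get("content", ""):
                return "block"
        return "allow"

    # The agent runtime calls the hook on every tool invocation, before execution:
    print(review_tool_call("file_write", {
        "path": "network.tf",
        "content": 'cidr_blocks = ["0.0.0.0/0"]',
    }))  # -> block

A real version would be an agent reasoning about intent rather than a keyword match. The sketch only fixes where the hook sits: per tool call, before execution, not at PR time.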

The question is not whether this is possible. The study already shows agents doing work like this, just not very much of it. The question is who runs the experiment first, and whether the board of your company is the one asking for it, or the one finding out about it after the fact.

The gap between copilot and teammate

Which brings us to the product question.

Most vendor conversations right now pitch AI as a feature bolted onto an existing dashboard. A copilot in the console. A summary at the top of a report. A natural language filter on a findings list. That is not where the value is.

The value is in execution. Agents that do not just surface findings but investigate them, validate exploitability against the real environment, write the fix, open the PR, and follow it to merge. The gap between “AI inside my SIEM” and “AI running as a teammate on my security team” is the gap between 2023 and 2026.

The industry will spend the next 18 months figuring out which side of that line each product is on. Cyber leaders do not have 18 months. The decisions you make this quarter about where to invest, what to automate, and what to wind down will compound for the rest of the decade.

Three numbers actually worth watching

Forget findings count. Forget the severity pie chart. Here are the numbers that actually tell you whether you are keeping up.

1. Agent-generated code as attack surface. Assume a large share of every PR was written or heavily edited by an agent. Treat it as a first-class source of risk, not a curiosity. Your code review, SAST, IaC scanning, and secrets detection have to be tuned for volume and speed, not just accuracy. Catching 99% of vulnerabilities does not help when the denominator just went up 10x: a 1% miss rate on ten times the code is ten times the escapes.

2. Machine identities as crown jewels. Every agent that touches your cloud needs an identity, a scope, a lifecycle, and an audit trail. Nobody is reviewing non-human identities at the cadence humans get reviewed. If you cannot answer how many NHIs you have, who created them, what they can reach, and when they last did something, you do not know your blast radius. (There is a rough inventory sketch after this list.)

3. Manual triage hours as the real KPI. Hours your humans spent triaging, ticketing, chasing, and writing the same remediation for the hundredth time. That is the number your board is actually asking about when they ask about AI efficiency, even if they do not know they are asking about it. If it is flat or growing while you boast about agent adoption, something is off. If it is shrinking, show the receipts.
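
On the second number, here is a rough starting point, assuming AWS and boto3. The 90-day threshold is arbitrary, and IAM roles are only one kind of non-human identity; users, access keys, and service accounts in other clouds need the same treatment.

    # Rough NHI inventory sketch for AWS, using boto3 (assumed installed and
    # credentialed). It lists IAM roles and flags any with no recorded use in
    # 90 days. The threshold is arbitrary, and roles are only one kind of NHI.

    from datetime import datetime, timedelta, timezone

    import boto3

    iam = boto3.client("iam")
    stale_cutoff = datetime.now(timezone.utc) - timedelta(days=90)

    total = stale = 0
    for page in iam.get_paginator("list_roles").paginate():
        for role in page["Roles"]:
            total += 1
            # list_roles omits RoleLastUsed; it takes a get_role call per role.
            detail = iam.get_role(RoleName=role["RoleName"])["Role"]
            last_used = detail.get("RoleLastUsed", {}).get("LastUsedDate")
            if last_used is None or last_used < stale_cutoff:
                stale += 1
                print(f"stale or never used: {role['RoleName']}")

    print(f"{stale} of {total} roles show no activity in the last 90 days")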

2.4% is a leading indicator, not a target

I do not think 2.4% is wrong. I think it is early. The boards asking about token budgets today will be asking about agent governance by this time next year, and the specific cyber governance question is going to be some version of “what is our cyber-to-engineering agent ratio, and is it moving in the right direction?”

If your answer is “we are still working on dashboards” that is going to be a long meeting. Pick your three numbers and post them somewhere your board can see. Start running defensive agents on the reversible 80%. Decide what your number should be before someone on a board decides it for you.


Daniel is Chief Innovation Officer at Plerion. Hire Pleri, our AI security engineer, to help bridge the 20x gap.

