The Wrong Frontier: Why Coding Benchmarks Cannot Tell Us We're Approaching AGI
Everyone celebrating LLMs "solving" software engineering is measuring the wrong thing.
I want to be precise about that claim — because it's easy to dismiss as contrarianism. So let me reason from first principles and see where it leads.
Start With the Metric Itself
A metric is only valid if it measures what actually matters for the outcome being claimed.
The claim being made — by serious people, not just hype merchants — is that LLMs achieving high scores on coding benchmarks represents meaningful progress toward artificial general intelligence. Sam Altman has described coding as one of the clearest demonstrations of near-AGI capability, and others have pointed to AlphaCode-style results as evidence that general reasoning is within reach.
Dario Amodei went further than most. In a February 2026 conversation with Dwarkesh Patel, he placed a 90% probability on a "country of geniuses in a data centre" within ten years — and pointed to coding as the domain where he is most confident, precisely because its outcomes are verifiable. "With coding," he said, "except for that irreducible uncertainty, I think we'll be there in one or two years."
These aren't throwaway remarks. They're considered positions from people who understand the technology. And they share a common load-bearing assumption: that coding is a domain in which progress is measurable, and that measurability makes it a reliable marker of AGI's frontier.
That assumption is where the first-principles analysis needs to start.
So what are the benchmarks actually measuring?
HumanEval measures whether a model can complete a function given a docstring. SWE-bench measures whether a model can resolve a GitHub issue in isolation. LiveCodeBench tests competitive programming problems.
All of them share a common structure: a controlled environment, a defined problem, and a verifiable output. The verifiable output is code that compiles and passes a test suite.
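To make that structure concrete, here is a minimal, hypothetical sketch of what a HumanEval-style item reduces to: a signature with a docstring, a model-produced body, and a checker whose asserts are the entire definition of "correct." The function and tests are invented for illustration, not drawn from the actual benchmark.

```python
# A HumanEval-style item, reduced to its essentials:
# a prompt (signature plus docstring), a completion, and a checker.

PROMPT = '''
def running_total(numbers):
    """Return a list of cumulative sums of `numbers`."""
'''

# What a model would be asked to produce as the body.
def running_total(numbers):
    total, out = 0, []
    for n in numbers:
        total += n
        out.append(total)
    return out

def check(candidate):
    # The entire notion of "correct" is whatever these asserts cover.
    assert candidate([1, 2, 3]) == [1, 3, 6]
    assert candidate([]) == []
    assert candidate([-1, 1]) == [-1, 0]

check(running_total)
print("pass")  # The benchmark records a pass; nothing else is observed.
```

Everything the benchmark can ever know about this solution is contained in those three asserts.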
This is not nothing. It's genuinely impressive. But it is not what software engineering actually is.
The Necessary vs Sufficient Problem
Code that compiles and passes tests is the floor of software quality — not the ceiling.
Every working engineer knows this. Code can be syntactically correct, logically sound within its test coverage, and completely wrong for the system it's meant to serve. Test suites are written by humans who can only anticipate the failure modes they've already imagined. The failure modes that matter most — the ones that cause production incidents, user abandonment, and system fragility — are precisely the ones nobody thought to write a test for.
The benchmark, by design, cannot capture what it didn't know to look for.
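A toy example of that blind spot, with hypothetical names and numbers: the function below passes every test its author thought to write, and still does exactly the wrong thing the first time an unanticipated input arrives.

```python
def apply_discount(price_cents, percent):
    """Return the discounted price in cents."""
    return int(price_cents * (1 - percent / 100))

# The test suite covers every case its author imagined.
def test_apply_discount():
    assert apply_discount(1000, 10) == 900
    assert apply_discount(2000, 50) == 1000
    assert apply_discount(999, 0) == 999

test_apply_discount()
print("all tests pass")

# The failure mode nobody wrote a test for: an upstream system sends a
# percentage above 100, and the "verified" function happily returns a
# negative price, so the store now pays the customer.
print(apply_discount(1000, 150))  # -> -500
```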
Amodei himself acknowledged a version of this when pushed on what "90% of code written by AI" actually means in practice. In the Dwarkesh conversation, he drew an explicit distinction between the proportion of lines written by a model and the proportion of genuine engineering value delivered — describing them as "worlds apart." He laid out a spectrum: 90% of lines written, 100% of lines written, 90% of end-to-end SWE tasks, 100% of today's SWE tasks — each a fundamentally different claim, each measuring something different. The benchmark conflates the first with the last.
Now add the next layer of complexity: software is not consumed in isolation. It is consumed by humans and by agents — increasingly, by both simultaneously. And both introduce non-deterministic behaviour.
A human user brings context, intent, frustration tolerance, and workarounds that no requirements document ever captures. An AI agent brings its own probabilistic reasoning, its own assumptions about error handling, and its own interpretation of what "working correctly" means in a given context. Agents have agency. Their behaviour is variable, not fixed.
Dwarkesh pressed this point directly: even in greenfield projects where engineers use Claude Code from the start, we are not yet seeing the renaissance in software products that full automation should, in theory, produce. Amodei's response was telling — he pointed to change management, security provisioning, compliance loops, and the simple fact that "the model can do that, but I have to tell the model to do that." That final clause is not a diffusion problem. It is a judgment problem. The decision about what to tell the model, when to tell it, and which context should inform that instruction is nowhere in the benchmark.
When you compose non-deterministic consumers on top of code that was only verified in deterministic conditions, you have not proven the code works. You have proven it works in the absence of the conditions that matter most.
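Here is a deliberately small sketch of that composition problem (the parser, formats, and values are invented for illustration): a function "verified" against the exact inputs its tests anticipated, then fed the kinds of variations a probabilistic agent might plausibly emit.

```python
def parse_amount(text):
    """Parse a currency string like '$1,250.50' into cents."""
    cleaned = text.replace("$", "").replace(",", "")
    dollars, cents = cleaned.split(".")
    return int(dollars) * 100 + int(cents)

# Deterministic verification: the only input shapes the tests ever saw.
assert parse_amount("$1,250.50") == 125050
assert parse_amount("$0.99") == 99

# Phrasings a non-deterministic consumer might emit for the same value.
agent_phrasings = ["$12.50", "12.5", "USD 12.50", "$12,50"]

for call in agent_phrasings:
    try:
        print(repr(call), "->", parse_amount(call))
    except ValueError as exc:
        print(repr(call), "-> unhandled:", exc)

# "$12.50" is fine; "12.5" parses without error but returns 1205 cents,
# which is silently wrong; the other two crash. None of this was visible
# to the tests that "verified" the function.
```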
The Tacit Knowledge Problem
Here is the deeper issue — and this is where the benchmark critique becomes something more fundamental.
Michael Polanyi observed that "we know more than we can tell." In software engineering, this is not a philosophical observation. It's an operational reality.
Human engineers accumulate knowledge that never makes it into a repository. The institutional memory of why a particular architectural decision was made three years ago, before the team changed. The understanding that a specific API behaves differently under load than its documentation suggests. The intuition that a certain class of edge cases always surfaces six months after launch, not in QA. The judgment calls that a technically correct solution will create a maintenance burden that outweighs its immediate elegance.
None of this is written down. It's transmitted through proximity — through code reviews, post-mortems, war stories, mentorship, and the accumulated scar tissue of systems that failed in unexpected ways.
LLMs are trained on the explicit residue of software engineering. The commits. The documentation. The Stack Overflow answers. The papers. The blog posts.
This is the layer that got written down.
The tacit layer — the judgment, the institutional memory, the failure patterns that were learned but never documented — is structurally inaccessible to a model that can only train on what was made explicit.
And AGI, if the term means anything coherent, requires the judgment layer. It requires knowing what to do when the test passes, but the system is still wrong. It requires understanding the context that was never written down. It requires navigating the gap between what was specified and what was meant.
Passing a benchmark tells us a model can replicate the documented layer of software engineering. It tells us nothing about the judgment layer. And it is precisely the judgment layer that separates a junior engineer who can implement a spec from a senior engineer who knows when the spec is wrong.
Here is the sharpest confirmation that this distinction matters — and it comes from Amodei himself. In the Dwarkesh conversation, he draws an explicit boundary around his AGI confidence: he is "very confident on tasks that can be verified" but flags "a little bit of fundamental uncertainty" on tasks that are not verifiable — planning a Mars mission, fundamental scientific discovery like CRISPR, writing a novel. "It's hard to verify those tasks," he says.
This is the first-principles argument the optimist makes. Amodei's AGI confidence is load-bearing on verifiability. Remove verifiability, and his certainty drops. The implication is direct: if software engineering in production is not a fully verifiable task — and the argument above is that it is not — then it cannot carry the evidentiary weight being placed on it. Coding benchmarks are valid evidence of progress in verifiable, controlled coding tasks. They are not evidence of progress in the unverifiable judgment layer that defines engineering at the level AGI requires.
"But What About RLHF and Production Feedback Loops?"
A sophisticated reader will push back here — and the pushback deserves a direct answer.
The argument is that RLHF (reinforcement learning from human feedback) and, increasingly, deployment-time feedback signals are beginning to close the tacit knowledge gap. Models trained on user corrections, preference rankings, and real production behaviour are absorbing what was previously untransmitted. The tacit layer, on this view, is becoming trainable.
Amodei makes the strongest case for this. In the Dwarkesh conversation, he describes RL scaling as following the same log-linear improvement curves seen in pre-training, and argues that, just as pre-training on narrow datasets (early GPT models trained on fanfiction) failed to generalise while pre-training on broad internet data did, RL training on broad task distributions will eventually generalise beyond the tasks it was trained on. The implication: tacit knowledge gaps are a matter of breadth and scale, not a structural ceiling.
This is partially true. And it's the most important partial truth in this debate.
RLHF does capture some tacit signal: the preferences of the humans doing the rating, at the moment of rating, for the task they were shown. That is genuinely valuable. It is not the same thing as accumulated institutional knowledge.
Here is why the gap remains structural, not just a matter of scale.
Tacit knowledge in software engineering is consequential and longitudinal. An engineer learns that a particular pattern creates fragility not because they rated it poorly in a feedback session, but because they were paged at 2 a.m. two months later, when it failed in production. The learning is inseparable from the consequences, the context, and the texture of the specific system that is failing.
RLHF compresses this into a preference signal at a point in time. It captures what a human rater thought was better in a controlled session — not what a production system revealed was wrong across months of real load, real users, and real edge cases.
The deeper problem is selection bias. Feedback loops only close on failures that were noticed, correctly attributed, and fed back into training. The silent failures — the architectural decisions that created a five-year maintenance burden, the edge case that only surfaces at enterprise scale, the integration assumption that breaks when a third-party API changes its behaviour — generate no feedback signal at all. They are precisely the category of tacit knowledge that matters most, and they remain structurally invisible to any training process that depends on explicit human correction.
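To see why this is selection bias rather than merely sparse data, here is a toy simulation with entirely made-up failure classes and probabilities. The point is structural: the failure classes that matter most over the long run are the least likely to ever be noticed, attributed, and turned into a training signal.

```python
import random

random.seed(7)

# Hypothetical failure classes and the (assumed, illustrative) probability
# that each is noticed and correctly attributed back to the code change.
FAILURE_CLASSES = {
    "test failure in CI": 0.99,
    "user-visible bug, reported": 0.60,
    "performance regression at scale": 0.20,
    "architectural decision that hurts maintainability for years": 0.02,
}

def sample_feedback(n_incidents=10_000):
    """Count how many incidents of each class ever become a training signal."""
    signal = {cls: 0 for cls in FAILURE_CLASSES}
    for _ in range(n_incidents):
        cls = random.choice(list(FAILURE_CLASSES))
        if random.random() < FAILURE_CLASSES[cls]:
            signal[cls] += 1
    return signal

for cls, count in sample_feedback().items():
    print(f"{cls:60s} {count:5d} signals")
# The classes that matter most over the long run contribute almost nothing
# to the feedback data a model could ever be trained on.
```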
Amodei's analogy to pre-training generalisation is instructive but incomplete. Pre-training generalised because internet text, despite being narrow relative to all human knowledge, is explicit — it was written down, indexed, and therefore trainable. The tacit layer of software engineering is defined by its resistance to being written down. It cannot generalise from RL training on broader task distributions because the signal it would need to generalise from was never captured in the first place.
Closing the feedback loop is necessary progress. It is not sufficient to claim access to the tacit layer.
What the Right Metrics Would Look Like
If we take this seriously, the question becomes: what would a valid metric for AGI-level coding capability actually measure?
Three properties would need to be present:
1. Outcome in production, not output in isolation. The metric should measure whether software performs correctly when used by real users and agents under real-world conditions — not whether it passes a pre-defined test suite. This means longitudinal evaluation: does the system behave correctly six weeks after deployment, under usage patterns that weren't anticipated at the time of writing?
2. Reliability under agent composition. As software increasingly runs inside agentic pipelines — where one model's output becomes another model's input — the relevant question is not "does this code work" but "does this code work reliably when consumed by a non-deterministic agent?" That requires evaluation frameworks that introduce deliberate variability in the consuming agent and measure whether the code holds (a minimal harness sketch follows this list).
3. Judgment under ambiguity. The hardest and most important metric: can the system identify when a problem is underspecified, when a technically correct solution is contextually wrong, or when the right answer is to refuse the task rather than complete it? This is not measurable by a test suite. It requires evaluation by domain experts assessing decisions, not outputs.
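For the second property, here is a minimal harness sketch of what evaluation under agent composition could look like; every name and behaviour in it is hypothetical. Instead of a single pass/fail against a fixed suite, code is scored by the fraction of varied, agent-generated calls it survives.

```python
import random
from typing import Callable

def reliability_under_agent_composition(
    code_under_test: Callable[[str], object],
    agent_behaviours: list,
    trials: int = 1000,
    seed: int = 0,
) -> float:
    """Score code by the fraction of varied agent-generated calls it survives,
    rather than by a single pass/fail against a fixed test suite."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        behaviour = rng.choice(agent_behaviours)
        try:
            code_under_test(behaviour(rng))
            survived += 1
        except Exception:
            pass  # any unhandled failure counts against reliability
    return survived / trials

# Illustrative subject and behaviours, invented for the sketch.
def parse_quantity(text: str) -> int:
    return int(text.strip())

behaviours = [
    lambda rng: str(rng.randint(0, 99)),         # the well-formed case
    lambda rng: f"{rng.randint(0, 99)} units",   # agent appends a unit word
    lambda rng: f"{rng.random():.2f}",           # agent emits a float
]

print(reliability_under_agent_composition(parse_quantity, behaviours))
# Prints roughly 0.33: the code "works", but only for one of the three
# behaviours a consuming agent actually exhibits.
```

The first and third properties resist this kind of harness far more: longitudinal outcomes and expert judgment cannot be scripted into a loop, which is exactly why they go unmeasured.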
None of these metrics is easy to construct. That is precisely why they haven't been constructed. We have defaulted to measuring what is measurable — and then made claims that extend far beyond what the measurements support.
The Steelman, and Why It Doesn't Hold
The strongest counterargument is this: human software engineers are also evaluated primarily on whether their code compiles and passes tests. If that's the bar for human intelligence applied to coding, why is it the wrong bar for AI?
It's a fair challenge. But it misses a structural difference.
Human engineers are not evaluated on a snapshot. They are embedded in a continuous feedback loop — with users, with systems, with colleagues, with production incidents. They learn from the feedback that never makes it into a ticket. They accumulate tacit knowledge over years of consequence.
The benchmark evaluates a single transaction. A model receives a problem, produces a solution, and is scored. There is no feedback loop. There is no consequence. There is no accumulation of the kind of knowledge that makes a senior engineer different from a junior one.
Even Amodei acknowledges this asymmetry implicitly. When Dwarkesh asks whether AI automating software engineering would mean software engineers are out of a job, Amodei's answer is no — because "there are new higher-level things they can do, where they can manage." He points to design documents and architectural decisions, directing the model toward the right problem. That higher-level layer — the judgment about what to build, why, and for whom — is precisely the tacit layer the benchmark cannot reach. Amodei is conceding the existence of the judgment gap even while arguing the benchmark is valid.
The metric isn't just measuring the wrong thing. It's measuring a fundamentally different activity and calling it the same thing.
The Conclusion First Principles Lead To
Coding benchmarks are valid measures of a model's ability to produce syntactically correct, locally coherent code in controlled conditions.
They are not valid measures of software engineering capability in production environments.
They are not valid measures of the ability to navigate tacit knowledge and institutional context.
They are not valid measures of performance under non-deterministic agent consumption.
And therefore, they cannot serve as evidence that coding is the frontier where AGI will first emerge — or that high benchmark scores represent AGI-level capability.
The most clarifying thing about the Dwarkesh/Amodei conversation is not where they disagree — it's where Amodei draws his own lines. His confidence peaks in verifiable domains. It softens for Mars missions, CRISPR discoveries, and novels. He doesn't place software engineering fully in the first category. He places it on a spectrum, acknowledging that the judgment layer exists, that closing the loop takes time, and that "the model can do that, but I have to tell the model to do that."
That acknowledgement is the crack in the frontier thesis. If the human judgment about what to tell the model — informed by context that was never written down — remains the rate-limiting step, then we have not crossed the threshold the benchmark claims to measure.
The frontier isn't where we can measure most easily. It's where the hardest judgment calls live. And those, almost by definition, are the ones that were never written down.
This post was shaped in part by Dario Amodei's February 2026 conversation with Dwarkesh Patel — worth reading in full for anyone thinking seriously about AGI timelines and how we measure progress toward them.
What metrics do you think would actually capture AGI-level capability in software engineering? Are you building evaluation frameworks that go beyond compilation and test coverage? I'd value perspectives from those closer to the benchmark research.
#AIStrategy #ArtificialGeneralIntelligence #SoftwareEngineering #EnterpriseAI #AIGovernance