Why the Same Model Performs Differently
Why the same model can vary by 16 points, and how harness design determines agent performance in production
An interesting benchmark has been floating around recently.
Opus 4.6 scored 77% inside Claude Code. The exact same model scored 93% inside Cursor. Same model, different result. The only thing that changed was the environment around it.
That environment has a name. It's called a harness. And if you're deploying AI agents without understanding what a harness is, I'd say you're flying blind.
What a Harness Is and Why It Matters
An AI model, at its core, only does one thing: takes text in and produces text out. That's it. Left to its own devices, it cannot read your files, run commands, edit code, or touch your database. It generates text; that's the whole job.
So, how does Claude Code rewrite a codebase? How does an agent book a meeting or update a CRM?
Tool calls. The model outputs a piece of syntax — essentially "run this command" — and then stops. The harness, a piece of software running around the model, picks that up, executes the command, takes the result, adds it back to the conversation history, and sends everything back to the model to continue. That loop — model asks, harness executes, result feeds back — runs hundreds of times every time you use any agentic tool.
Stanford researcher Mihail Eric made this concrete with an article that circulated widely this year. His argument: the core of Claude Code is not magic. It is 200 lines of Python. Three tools — read file, list files, edit file — a system prompt, and a loop. That is the whole architecture.
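To make that shape concrete, here is a toy sketch of the three-tools-plus-a-loop architecture. The `call_model` parameter stands in for whatever model API you use, and the message format and JSON tool-call syntax are illustrative placeholders, not any vendor's real protocol:

```python
import json
from pathlib import Path

# The three tools from the minimal architecture: read a file, list files, edit a file.
def read_file(path: str) -> str:
    return Path(path).read_text()

def list_files(path: str = ".") -> str:
    return "\n".join(sorted(p.name for p in Path(path).iterdir()))

def edit_file(path: str, content: str) -> str:
    Path(path).write_text(content)
    return f"wrote {len(content)} characters to {path}"

TOOLS = {"read_file": read_file, "list_files": list_files, "edit_file": edit_file}

def run_agent(call_model, user_message: str, max_turns: int = 50) -> str:
    """The loop: model asks, harness executes, result feeds back."""
    history = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        reply = call_model(history)  # the model only ever produces text
        history.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)  # a tool call is just structured text
        except json.JSONDecodeError:
            return reply  # plain prose means the model is done
        if not isinstance(call, dict) or call.get("tool") not in TOOLS:
            return reply
        result = TOOLS[call["tool"]](**call.get("args", {}))
        history.append({"role": "tool", "content": result})  # feed the result back
    return "stopped: turn limit reached"
```

Everything a real harness adds — the system prompt, the tool descriptions, error handling, context management — lives around this same skeleton.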
What Cursor did was spend thousands of engineering hours on those prompts and tool descriptions. They have people whose entire job is to update the system prompt every time a new model ships — testing obsessively, adjusting descriptions, steering the model away from bad habits. That investment shows up directly in the benchmark: 16 percentage points on the same model.
However, Anthropic's own engineering team found something that should make every builder uncomfortable. Harness assumptions go stale as models improve. Context anxiety that required full resets in Sonnet 4.5 simply disappeared in Opus 4.5. If you over-engineer control flow, the next model update breaks your system. Manus refactored their harness five times in six months. LangChain is rebuilt four times a year. Vercel removed 80% of its agents’ tools, and performance went up.
What This Means for You
If you are deploying AI agents inside your organization, the first question worth asking is not "Which model?" It is "What is our harness?"
Most teams don't have a real answer. They have a prompt, maybe a framework, and a hope that the model figures the rest out. That holds for demos. It does not hold across long workflows, multiple users, and real-world edge cases.
The gap between teams with mature harnesses and teams without one is still wide open. The companies that close it first will have agents running reliably when everyone else is debugging why theirs stopped at step 80.
A lot of organizations are frustrated that their AI agents aren't living up to expectations. The model does what they ask in a demo, then falls apart in production after twenty minutes of real work.
Every single time, when we dig in, it's the same thing. They picked a model. They wrote a prompt. They shipped. Nobody had built the harness.
The benchmark I opened with is the clearest illustration I've seen of why this matters. You can get a 16-point performance improvement on the same model just by improving the environment around it. Not a new model. Not a bigger context window. Just better infrastructure.
Most organizations are leaving that on the table. The ones that won't are the ones investing in the boring work of building, testing, breaking, and rebuilding the layer the model runs inside.
Haroon
P.S. If you're starting to think seriously about harness infrastructure for agents running at team scale, Clutch is worth a look.