Why the Same Model Performs Differently
Why the same model can vary by 16 points, and how harness design determines agent performance in production
An interesting benchmark has been floating around recently.
Opus 4.6 scored 77% inside Claude Code. The exact same model scored 93% inside Cursor. Same model, different result. The only thing that changed was the environment around it.
That environment has a name. It's called a harness. And if you're deploying AI agents without understanding what a harness is, I'd say you're flying blind.
What a Harness Is and Why It Matters
An AI model, at its core, only does one thing: takes text in and produces text out. That's it. Left to its own devices, it cannot read your files, run commands, edit code, or touch your database. It generates text; that's the whole job.
So, how does Claude Code rewrite a codebase? How does an agent book a meeting or update a CRM?
Tool calls. The model outputs a piece of syntax — essentially "run this command" — and then stops. The harness, a piece of software running around the model, picks that up, executes the command, takes the result, adds it back to the conversation history, and sends everything back to the model to continue. That loop — model asks, harness executes, result feeds back — runs hundreds of times every time you use any agentic tool.
Stanford researcher Mihail Eric made this concrete with an article that circulated widely this year. His argument: the core of Claude Code is not magic. It is 200 lines of Python. Three tools — read file, list files, edit file — a system prompt, and a loop. That is the whole architecture.
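To make that shape concrete, here is a toy sketch of the three-tools-plus-a-loop architecture. The `call_model` parameter stands in for whatever model API you use, and the message format and JSON tool-call syntax are illustrative placeholders, not any vendor's real protocol:

```python
import json
from pathlib import Path

# The three tools from the minimal architecture: read a file, list files, edit a file.
def read_file(path: str) -> str:
    return Path(path).read_text()

def list_files(path: str = ".") -> str:
    return "\n".join(sorted(p.name for p in Path(path).iterdir()))

def edit_file(path: str, content: str) -> str:
    Path(path).write_text(content)
    return f"wrote {len(content)} characters to {path}"

TOOLS = {"read_file": read_file, "list_files": list_files, "edit_file": edit_file}

def run_agent(call_model, user_message: str, max_turns: int = 50) -> str:
    """The loop: model asks, harness executes, result feeds back."""
    history = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        reply = call_model(history)  # the model only ever produces text
        history.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)  # a tool call is just structured text
        except json.JSONDecodeError:
            return reply  # plain prose means the model is done
        if not isinstance(call, dict) or call.get("tool") not in TOOLS:
            return reply
        result = TOOLS[call["tool"]](**call.get("args", {}))
        history.append({"role": "tool", "content": result})  # feed the result back
    return "stopped: turn limit reached"
```

Everything a real harness adds — the system prompt, the tool descriptions, error handling, context management — lives around this same skeleton.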
What Cursor did was spend thousands of engineering hours on those prompts and tool descriptions. They have people whose entire job is to update the system prompt every time a new model ships — testing obsessively, adjusting descriptions, steering the model away from bad habits. That investment shows up directly in the benchmark: 16 percentage points on the same model.
However, Anthropic's own engineering team found something that should make every builder uncomfortable. Harness assumptions go stale as models improve. Context anxiety that required full resets in Sonnet 4.5 simply disappeared in Opus 4.5. If you over-engineer control flow, the next model update breaks your system. Manus refactored their harness five times in six months. LangChain is rebuilt four times a year. Vercel removed 80% of its agents’ tools, and performance went up.
What This Means for You
If you are deploying AI agents inside your organization, the first question worth asking is not "Which model?" It is "What is our harness?"
Most teams don't have a real answer. They have a prompt, maybe a framework, and a hope that the model figures the rest out. That holds for demos. It does not hold across long workflows, multiple users, and real-world edge cases.
The gap between teams with mature harnesses and teams without one is still wide open. The companies that close it first will have agents running reliably when everyone else is debugging why theirs stopped at step 80.
A lot of organizations are frustrated that their AI agents aren't living up to expectations. The model does what they ask in a demo, then falls apart in production after twenty minutes of real work.
Every single time, when we dig in, it's the same thing. They picked a model. They wrote a prompt. They shipped. Nobody had built the harness.
The benchmark I opened with is the clearest illustration I've seen of why this matters. You can get a 16-point performance improvement on the same model just by improving the environment around it. Not a new model. Not a bigger context window. Just better infrastructure.
Most organizations are leaving that on the table. The ones that won't are the ones investing in the boring work of building, testing, breaking, and rebuilding the layer the model runs inside.
Haroon
P.S. If you're starting to think seriously about harness infrastructure for agents running at team scale, Clutch is worth a look.