Field notes from three weeks of real testing

I've been using Claude Code daily for months and pushed my entire team onto it. It genuinely changed how we build software. When I started planning a large deployment, one question came up that I had never thought to ask before: does the interface actually need Claude to work well? So I ran an experiment to find out.

In This Edition

  1. Why I ran this experiment
  2. What four models produced on the same task, and what each cost
  3. Where benchmarks match reality, and where the gap shows up
  4. What it costs per engineer per month
  5. The tradeoffs you need to decide before rolling this out
  6. What it means for your AI tooling budget


1. Why I Ran This Experiment

We launched an AI-driven development certification program at Ideas2IT Technologies for around 600 engineers, built entirely around Claude Code. When I worked out what licensing 600+ seats would actually cost, the number was significant enough to force the question.

Claude Code is configurable. The model it calls for reasoning and code generation sits underneath the interface, and you can swap it. I wanted to know what happens when you do.
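For anyone who wants to try the swap themselves, here is a minimal sketch of how it works. Claude Code picks up its API endpoint and credentials from environment variables, so pointing it at an Anthropic-compatible API is just a matter of launching it with a different base URL and key. The endpoint URLs and key-variable names below reflect what Moonshot and DeepSeek documented at the time of writing; treat them as assumptions and confirm against the providers' current docs.

```python
import os
import subprocess

# Minimal launcher sketch: point Claude Code at an alternative backend by
# overriding its endpoint and credentials. The endpoint URLs are the
# Anthropic-compatible values the providers documented at the time of
# writing; verify against current provider docs before relying on them.
BACKENDS = {
    "kimi": ("https://api.moonshot.ai/anthropic", "MOONSHOT_API_KEY"),
    "deepseek": ("https://api.deepseek.com/anthropic", "DEEPSEEK_API_KEY"),
}

def launch(backend: str) -> None:
    """Start an interactive Claude Code session against the chosen backend."""
    base_url, key_var = BACKENDS[backend]
    env = {
        **os.environ,
        "ANTHROPIC_BASE_URL": base_url,               # where requests go
        "ANTHROPIC_AUTH_TOKEN": os.environ[key_var],  # provider API key
    }
    subprocess.run(["claude"], env=env, check=True)

if __name__ == "__main__":
    launch("kimi")
```

Nothing about the interface changes; only the model answering the requests does.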


2. What I Tested and What Happened

I tested five models on the same task: build a Flask web application with SQLite, HTML frontend, CRUD operations, unit tests, and git setup.

The five were GPT-OSS 20B (local), Qwen3-Coder 30B (local), Kimi K2.5 (API), DeepSeek V3.2 (API), and Claude Sonnet 4.6 (API). Every model got the same prompt through Claude Code. I tracked code quality, whether it completed, response time, and what it cost.

The two local models did not finish. GPT-OSS 20B had broken tool calling and produced nothing usable. Qwen3-Coder 30B got partway through, but the output quality was too basic to work with. Both are free to run, which sounds appealing right up until the output is unusable.

All three API models completed the task fully.

  • Kimi K2.5 produced production-grade code in 5 to 15 seconds and cost $0.33 per run.
  • DeepSeek V3.2 did the same at $0.15, and its frontend UI was noticeably better designed than Kimi's, which genuinely surprised me.
  • Claude Sonnet produced identical quality to Kimi at $1.66, five times the cost.

Throughout all three API runs, Claude Code behaved exactly as it does with Claude natively. The commands, the workflow, the file operations were all the same. When I asked engineers to review the outputs without telling them which model produced what, none of them could tell the difference.


3. Where Benchmarks Match Reality, and Where the Gap Shows Up

Benchmarks tell you which model scores higher on a controlled test. They do not tell you whether that difference shows up in the work your engineers actually do every day.

Article content

Kimi beats Claude Sonnet on five of eight benchmarks, including SWE-Bench Verified at 76.8% versus roughly 72%. On BrowseComp, which tests agentic web tasks, Kimi leads both Claude models at 74.9%. Kimi trails on the Overall AI Reasoning Index: 47 versus 52 and 53 for Sonnet and Opus. That gap surfaces on architecturally complex and genuinely novel problems. Worth noting: Cursor, valued at $29.3B with $2B+ ARR, built its Composer 2 on Kimi K2.5.

More practical than benchmarks for most teams is a month-long daily coding comparison by LLMx Tech from February 2026, which I summarize below.

For everyday engineering work, including writing APIs, generating tests, and building frontends, Kimi holds up well against Sonnet. On generating frontends from mockups or images, it is actually the strongest of the three. Where Claude Opus pulls ahead is architecture, large refactors, and legacy code. That gap is real and consistent. The question worth asking is how much of your team's week lives in that second group of tasks versus the first.

If benchmarks interest you, I went deeper and wrote about it on Medium, breaking down what the scores actually tell you and what they don't.


4. What It Costs per Engineer per Month

Every Claude Code request carries a roughly 16,000-token system context; that context is how it keeps track of your project across a session. Kimi K2.5 caches it automatically, and cached tokens cost $0.10 per million versus $0.60 uncached, an 83% discount. Since the same context travels with every request throughout a session, your actual Kimi spend ends up well below what the raw token price suggests.
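To make the effect concrete, here is the blending arithmetic as a short sketch. The two rates come straight from the pricing above; the cache-hit rate is whatever your sessions actually achieve.

```python
# Effective input price per million tokens at a given cache-hit rate,
# using the cached/uncached rates quoted above.
CACHED, UNCACHED = 0.10, 0.60  # USD per million input tokens

def blended_input_rate(cache_hit_rate: float) -> float:
    """Blend cached and uncached prices by the share of cached tokens."""
    return cache_hit_rate * CACHED + (1 - cache_hit_rate) * UNCACHED

for rate in (0.0, 0.4, 0.8):
    print(f"{rate:.0%} cache hits -> ${blended_input_rate(rate):.2f}/M input tokens")
```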

Monthly Cost per Engineer at Active Daily Usage

  • Kimi K2.5: $7.86 per engineer per month, with caching
  • Claude Sonnet 4.6: $44.44 per engineer per month, same usage
  • Claude Opus 4.6: $75.64 per engineer per month, same usage

These numbers assume 20+ prompts per day, 22 working days, roughly 18K input and 4K output tokens per prompt, and approximately 40% cache hits for Kimi. All figures come from each provider's published API pricing.
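If you want to sanity-check the Kimi figure, the assumptions above are enough to rebuild most of it. One caveat: the output-token rate in the sketch is an assumed list price, not a number stated anywhere in this piece, so the result lands near the quoted figure rather than exactly on it.

```python
# Rebuilding the per-engineer monthly estimate from the stated assumptions.
PROMPTS_PER_DAY = 20
WORK_DAYS = 22
INPUT_TOKENS = 18_000   # per prompt, mostly the recurring system context
OUTPUT_TOKENS = 4_000   # per prompt
CACHE_HIT = 0.40        # share of input tokens served from cache

CACHED_IN, UNCACHED_IN = 0.10, 0.60  # USD per million tokens, quoted above
OUTPUT_RATE = 2.50                   # USD per million tokens, ASSUMED list price

prompts = PROMPTS_PER_DAY * WORK_DAYS
input_m = prompts * INPUT_TOKENS / 1e6    # millions of input tokens per month
output_m = prompts * OUTPUT_TOKENS / 1e6  # millions of output tokens per month

input_cost = input_m * (CACHE_HIT * CACHED_IN + (1 - CACHE_HIT) * UNCACHED_IN)
output_cost = output_m * OUTPUT_RATE
print(f"~${input_cost + output_cost:.2f} per engineer per month")  # ~$7.57
```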

At team scale, the difference compounds further. For 50 engineers over a six-week training program at 15 prompts per day, Kimi totals approximately $563. The same cohort on Claude Max costs $7,500. On Claude Team, $11,250. The gap is not marginal; it is a different order of magnitude for the same interface and the same workflow.
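Another way to read the same gap: how much daily usage it would take before usage-based spend caught up with a flat seat. The seat price below is an illustrative round number, not a quote from any current rate card.

```python
# Break-even between usage-based pricing and a flat monthly seat,
# using the per-prompt cost implied by the monthly sketch above.
COST_PER_PROMPT = 7.57 / (20 * 22)  # ~$0.017 per prompt for Kimi
SEAT_PRICE = 100.0                  # illustrative flat seat, USD per month
WORK_DAYS = 22

breakeven = SEAT_PRICE / COST_PER_PROMPT
print(f"Break-even: {breakeven:,.0f} prompts/month "
      f"(~{breakeven / WORK_DAYS:,.0f} per working day)")
```

At 20 prompts a day, nobody on the team gets anywhere near that line.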


5. The Tradeoffs You Need to Decide Before Rolling This Out

Every cost reduction comes with something you give up. I want to be direct about what that is here, because I have seen teams skip this conversation and run into problems later.

Two of the four tradeoffs we identified need an active decision from you before you roll anything out. They are not technical problems you can patch later. The other two show up in day-to-day use and are manageable once your engineers know about them.

Most AI coding tools bundle two things into one price: the interface and the model. For a long time that was fine because the model was the only thing that mattered. That has changed.

Claude Code's value is in the interface. The way it reads your project, maintains context, and runs the agentic loop is what changes how engineers work. The model underneath it is configurable, and the alternatives have caught up on everyday engineering tasks.

The teams that ask early which interface they want and which model they actually need for which kind of work are making a sharper decision than the teams that default to the bundled rate and never question it.

Based on what we found, we decided to move forward with Kimi K2.5 and DeepSeek V3.2 as our Claude Code backends. 


Closing Thought

If you are running Claude Code or evaluating AI coding tools at team scale, I hope this experiment gives you a sharper framework for what to ask before you commit to a pricing tier. Running the experiment yourself, even on a small task, takes a few hours and tells you more than any benchmark reading.

Have you tested something similar with your team? I would genuinely like to know what you found in the comments.

If you found this useful, subscribe and follow along. I write about what AI-driven development actually looks like from inside a real engineering organization: experiments, tool evaluations, team transitions, and honest lessons from building with AI every day.
