Why AI Code Generation Is Solved and Still Accelerating

Some people who are deep in the AI world have started saying that AI code generation is a solved problem. I agree.

Hold on, before you close the tab.

I'll bet a lot of you just disagreed without reading another word. That reaction is exactly why I'm writing this. For the past two and a half years, I've been using AI for coding daily, almost seven days a week. Over a year on Claude Code specifically. Thousands of hours of hands-on time. I teach engineers and engineering leaders how to adopt these tools. And what I'm watching every six months is the same conversation, with the goalposts moved.

This article is about agentic coding capability specifically. Not AI in general. Not whether AGI is near. Just: is AI good enough at writing software to call the problem solved, and is it going to keep getting better? My answer is yes and yes. Here's why.

What "Solved" Actually Means

The wrong definition: AI never makes mistakes. AI writes code exactly the way a senior engineer at my company would write it. AI handles every edge case I throw at it without prompting.

By that standard, no human engineer is solved either. I've worked with hundreds of engineers across my career. Every single one of them made mistakes. Sometimes they caught their own mistakes. Sometimes they didn't, and the mistakes shipped to prod. None of them write code exactly the way another senior engineer would. None of them anticipate every edge case without context.

The right definition is much simpler. AI code generation is solved when, given the same context a competent engineer would need to do the work, it produces results as reliable as that engineer would.

Context here is whatever a real engineer would need to ship the task: clear intent, the constraints that real engineering work always involves (language, framework, version, testing approach, architectural conventions), and the actual feature or bug to address. How that context reaches the model matters less than that it reaches the model. Prompts, scaffolding, skills, conversation history, documented conventions, the project's existing code. Give it what a human would need, and it produces work in the same league.
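
For a concrete picture of the "documented conventions" channel, here's a hypothetical fragment of a project conventions file (every detail invented for illustration), the kind of thing Claude Code picks up from a CLAUDE.md:

```markdown
# Project conventions (hypothetical example)
- Language: TypeScript 5.x, strict mode; target Node 20
- Framework: Express for HTTP; Vitest for tests
- Testing: every new endpoint ships with unit and integration tests
- Architecture: keep handlers thin; business logic lives in src/services
- Never: commit directly to main; add a dependency without flagging it first
```

A file like this reaches the model on every task, which is exactly the standing context a human teammate would absorb in their first week.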

That's the bar. Any stricter definition is moving the goalposts to a place no human meets either.

There's a pattern hiding inside the skeptic argument that's the strongest piece of evidence I have. Every six months, the skeptics point to a different set of mistakes. The mistakes from a year ago aren't being made anymore. The mistakes pointed to today won't be made six months from now. That goalpost movement is the skeptic argument quietly conceding the point.

Why I Say It's Solved

I'm telling you what I see every day.

Code quality is night and day from a year ago, and still night and day from six months ago. A year ago, the output required heavy review and rework. Six months ago, moderate review. Today, the code is often shippable on the first pass when the plan is solid.

Instruction following has stepped up substantially. A year ago the model would drift from explicit instructions on longer tasks. Today it more often holds the spec, asks clarifying questions when constraints conflict, and respects negative instructions. Does it always remember? No. Sometimes you've got to remind it. Humans do that too. The drift rate now is dramatically lower than it was even six months ago.

Long-horizon focus is the most recent shift, and it's the one that surprised me. With Opus 4.6 and especially 4.7, tasks that used to fall apart after thirty minutes of agentic work now hold together for multi-hour sessions. The model maintains intent across many subtasks, picks the thread back up after a tool call, and stays on the plan. Not perfectly. But when the plan is detailed and approved up front, the model follows it more often than not. In my own work, well over 90% of the time, which is honestly a higher rate than I see from human engineers on equivalently complex tasks.

End-to-end implementation quality is now coherent, not stitched. When the plan is solid, what comes out is far more often a single coherent implementation: production code, unit tests, and integration tests wired together as one piece of work, rather than three things produced in isolation and forced to fit.

Self-correction is now inside the loop. The model catches more of its own mistakes than it used to. It reviews its own code. It thinks through its answer before committing. A year ago this had to be scaffolded externally. Today it happens on its own much more often.

One honest caveat the skeptics most often miss: none of this is the model alone. It's the model plus the human, plus the scaffolding the human builds around the model. The teams who tell me Claude routinely goes off the rails are almost always teams that haven't yet done the work to provide good context, build the right workflow, and develop the skills (literally, in the Claude Code sense) that keep the model on track. The gap between "AI coding works great" and "AI coding doesn't work" is, in my experience, almost entirely a gap in human skill and scaffolding.

The Benchmarks Corroborate It

I'm a benchmark skeptic. Benchmarks tell one part of the story, never the whole story. I'm not pointing at the 90-day data to claim definitive proof of anything. I'm pointing at it as corroboration of what I'm already seeing in daily use.

The last 90 days produced four major frontier coding-model releases: Anthropic Opus 4.6 (Feb 5, 2026), Anthropic's research-tier Mythos (April 7), Anthropic Opus 4.7 (April 16), and OpenAI's GPT-5.5 (April 23). Coding capability was the headline for every one of them.

Anthropic's two flagships in ten weeks: SWE-bench Pro went from 53.4% (Opus 4.6) to 64.3% (Opus 4.7). Nearly eleven points on the harder benchmark in ten weeks, inside the 90-day window. The labs are now leading with the harder benchmark precisely because the older one is saturating. Saturation is what the late stage of solving a problem looks like.

The Mythos signal is the most interesting piece of recent evidence and the one almost no one's talking about. Anthropic released Mythos to a small set of partners and publicly conceded that Opus 4.7 trails it. Why isn't Mythos generally available? Anthropic concluded that its cybersecurity capabilities required restricted deployment. A frontier lab voluntarily withholding its highest-capability coding model is a fundamentally different kind of evidence than a benchmark number. It says capability is now ahead of what's comfortable to ship broadly.

And It's Going to Keep Getting Better. Three Reasons.

If it's solved, why am I still writing? Because solved isn't the end of the story. The same forces that solved the problem are still pushing, and the next twelve months are going to look like the last ninety days, only more so.

Driver One: Enterprise Revenue Is Where the Money Is

This is the surface story. Anthropic generates roughly 80% of revenue from enterprise customers, Claude Code crossed $2.5B in annualized run-rate within roughly a year of launch, and Anthropic just passed OpenAI in total ARR at around $30B. They got there by making coding their differentiation.

OpenAI publicly pivoted in March 2026 from "consumer hype to business reality." Sora got scaled back, consumer experiments got shelved, resources got refocused on coding and enterprise ahead of a planned IPO. OpenAI's own internal forecast acknowledges that consumer paid-WAU conversion will plateau at only 8.5% by 2030.

Two labs, identical convergence target. The commercial incentive to keep improving is locked in.

Driver Two: Recursive Self-Improvement

Here's the part most coverage misses, and the one that matters most for your mental model of where this goes next.

The second reason the labs are racing on coding is recursive self-improvement. AI improving AI. The idea that a sufficiently capable AI can write, refactor, and optimize the code that AI itself is built from, accelerating its own progress in a way that humans alone can't.

To see why this matters, look at what AI actually is. AI is made of code. Training pipelines are code. Evaluation harnesses are code. Model architectures are code. Agent scaffolding is code. So a lab that gets better at making AI good at code is getting better at making AI good at the very stuff AI itself is built from.

Why coding specifically and not any other domain? Because code has automatic verification. Tests pass or they don't. Loss goes down or it doesn't. Compilation succeeds or it fails. Every other domain (language, reasoning, design, science) needs human evaluation to close the loop. Coding is the only domain where AI can verifiably improve AI without humans in the loop.
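
To make that concrete, here's a minimal sketch of what "automatic verification" means in practice. The task (a slugify function), the tests, and the candidate implementations are all invented for illustration; the point is that the accept/reject signal is a program, not a human:

```python
# Hypothetical sketch: the pass/fail signal for a coding task is itself a program.
# Task, tests, and candidates below are invented for illustration.
import subprocess
import sys
import textwrap

# Fixed acceptance tests for the (made-up) task: implement slugify().
TESTS = textwrap.dedent("""
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  spaces  ") == "spaces"
""")

def verify(candidate_src: str) -> bool:
    """Run a candidate implementation plus the tests in a subprocess.

    The exit code is the entire verdict: no human judgment in the loop.
    """
    proc = subprocess.run(
        [sys.executable, "-c", candidate_src + TESTS],
        capture_output=True,
    )
    return proc.returncode == 0

good = (
    "import re\n"
    "def slugify(s):\n"
    "    return re.sub(r'[^a-z0-9]+', '-', s.lower()).strip('-')\n"
)
bad = "def slugify(s):\n    return s.lower()\n"
```

An AI system can call `verify` on its own output and retry until it passes; that closed loop is the thing most other domains don't have.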

In March 2026, Andrej Karpathy released autoresearch, an LLM training script that an AI agent reads, modifies, tests, and optimizes autonomously. The repo went viral. Karpathy called it "the final boss battle" that all LLM frontier labs are now racing to fight.
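
I haven't reproduced the actual autoresearch code here, but the shape of the loop it popularized is easy to sketch. In this toy version (the score function and all numbers are invented), an "agent" proposes a change, measures, and keeps only what improves the metric:

```python
# Toy propose -> measure -> keep loop (not the real autoresearch script).
# score() stands in for an expensive training-run metric; it peaks at lr = 0.1.
import random

def score(lr: float) -> float:
    return -(lr - 0.1) ** 2

random.seed(0)                # deterministic for the example
lr = 0.5                      # starting "hyperparameter" in the training setup
best = score(lr)

for _ in range(200):
    candidate = lr + random.uniform(-0.05, 0.05)  # proposed edit to the setup
    s = score(candidate)
    if s > best:              # automatic verification closes the loop
        lr, best = candidate, s
```

After 200 proposals the loop lands close to the optimum with no human judging intermediate results. Swap `score` for a real training run and `candidate` for a code edit, and you have the skeleton of the loop the labs are racing to industrialize.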

Inside the labs, public signals from OpenAI suggest the same loop is now running on production-scale infrastructure. Greg Brockman has publicly described using GPT-5.3-Codex internally to find bugs in OpenAI's own training runs, manage rollout, and analyze evaluation results. OpenAI's chief scientist, Jakub Pachocki, told MIT Technology Review in March 2026 that OpenAI's research is "building towards automating scientific research," with a stated timeline of an automated AI research intern by September 2026 and a fully automated multi-agent research system by 2028.

OpenAI's leadership is saying, on the record, that they're building AI to do AI research. The substrate they're building it on is code.

The labs aren't pouring resources into coding only because it's the most lucrative product. They're pouring resources into coding because it's the one product category that improves the lab. Enterprise revenue is the runway. Recursive self-improvement is the prize.

Driver Three: Real-World Usage Feeds the Next Generation

There's a third, more practical reason. It's almost never discussed.

The two largest commercial AI labs are now training their next-generation models on the real-world coding sessions of millions of consumer-plan users. As of an August 28, 2025 policy update, Anthropic's Free, Pro, and Max plans (including Claude Code on those tiers) feed training data by default, with retention up to five years. Users have to actively opt out. Team, Enterprise, API, Government, and Education plans are excluded. OpenAI's policy has the same shape.

Think about what that data stream actually contains. Real coding tasks. Real bugs. Real edge cases. Real moments where the model produced bad output and the user followed up with a correction. Multiply by millions of developers, every single day. The dataset of what real engineers actually struggle with when they use AI to write code exists nowhere else.

The Honest Caveats

I'm not telling you AI takeoff is here. The argument doesn't need that.

I'm also not telling you AI is currently writing 90% of the code at any particular company. Capability and impact aren't the same thing, and people conflate them constantly. Electricity existed long before wires were run to every home. The wires didn't change what electricity could do. They determined whether electricity reached your house. Same with AI right now. The capability is here. Whether it's reaching your team is a different question entirely.

Across the teams I work with, the two biggest missing wires are these. First, most teams haven't accepted that an engineer's primary role is now specification: patient, detailed articulation of intent, goals, and constraints in a way an AI agent can actually act on. Most engineers were trained to do the work, not to brief the worker. They haven't put in the reps yet to be good at the new job.
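
Here's what that specification skill looks like on the page — a hypothetical task brief (all details invented) with the intent, constraints, and definition of done an agent needs in order to act without drifting:

```markdown
## Task: rate-limit the public API (hypothetical example)
Intent: protect /api/* from abuse without touching authenticated internal traffic.
Constraints:
- Sliding-window limit of 100 requests/min per API key; respond 429 with Retry-After
- Use the existing Redis instance for counters; no new dependencies
- Follow the error-envelope format already used in src/http/errors.ts
Done means: unit tests for the window math, one integration test that triggers a 429,
and no files changed outside src/middleware and src/http.
```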

Second, most teams are retrofitting existing pre-AI workflows with AI tools sprinkled on top, rather than stepping back, reimagining how the work should flow when an AI agent is a core participant, and codifying new AI-native workflows the whole team uses consistently. If your organization isn't seeing the value yet, the gap almost certainly isn't in the model. It's in the wires.

What This Means for Engineering Leaders

The skill that compounds isn't typing. It's specification. The model can implement well now, but only when the plan is solid. Engineers who get good at defining what good looks like, in language a model can act on, compound. Engineers who stay in the typing layer don't. This is the most important career calibration any engineer can make in 2026.

Stop arguing about whether it's solved. Start designing for the world in which it is. Teams that have rebuilt their workflows around current-generation agentic coding (planning gates, integrated test generation, self-review loops, well-scoped specifications) are out-shipping teams that haven't. The gap is widening every quarter.

Engineering leaders specifically: make AI-native non-negotiable, and put real budget behind it. This is where most engineering organizations are quietly stalling out. They let individual engineers experiment with AI tools, but they don't step back as a leadership team and ask the harder question: what does our workflow actually look like when an AI agent is a core participant, not a side-tool?

Answering that takes real time and real money. It requires designing, codifying, and continuously refining AI-native workflows the whole team uses consistently. And it requires a message from the top that doesn't waver. We're moving toward an autonomous, agentic software engineering world. The vision isn't up for debate. The how is. Teams define the how. Leaders insist on the what. AI-native is not optional.

Closing

AI code generation is solved by the only definition of solved that survives contact with reality. Code as reliable as a competent engineer's, given the same context that engineer would need.

If you still disagree, ask yourself: by your definition, is human code generation solved? Because if your bar is "never makes a mistake," no human ships either.

The labs aren't done. They're not even close to done. The mistakes the skeptics are pointing to right now? Six months from now, those won't be made either. The labs aren't just incentivized to make that happen. They're now using AI itself to do it.


This article is also published on my blog. For ongoing writing on AI-native leadership and engineering, visit [codingthefuture.ai](https://codingthefuture.ai).

I'm Tim Kitchens, Founder of Coding the Future with AI (https://codingthefuture.ai/). I help leaders and teams build real AI fluency, not performative adoption. If you'd like to talk about what this looks like in your organization, book a consultation (https://codingthefuture.ai/book-a-consultant/).
