Teaching a Model to Code
Every AI coding assistant you've heard of has a dirty secret: it's reading your code.
Every keystroke, every file, every private API key, all shipped off to someone else's servers. You paste in your company's proprietary logic, hit enter, and that code leaves your machine. Most developers know this on some level. Most of us try not to think about it too hard.
We decided to build something different: a coding assistant that runs entirely on your machine, handles long-context problems across hours of work, and never phones home. No cloud. No data exfiltration. Just a model running locally that’s actually good enough to use.
At this point, most people assume the tradeoff is obvious: if it runs locally, it won’t be as good.
That assumption is wrong, but not for the reason you might expect.
The thing nobody tells you about local coding models: it’s easy to teach a model to respond correctly once to a perfectly crafted prompt.
It’s incredibly hard to teach a model to behave like a developer across a multi-hour programming session.
To teach a model that, you need a kind of training data that doesn’t exist. At least, not unless you’re Anthropic or OpenAI.
This is the story of how we did just that: how we built a data and RL pipeline for long-trajectory, multi-turn AI coding conversations across 52 languages and over 10,000 repos.
The training data everyone uses is fundamentally wrong
The standard playbook for training a code model goes something like this: scrape GitHub, filter for quality, fine-tune. And it works… if all you want is autocomplete. A model trained that way can finish your function. It can guess the next line. That's table stakes.
But that's not what a useful coding assistant actually does.
Think about what happens when a real developer sits down with an unfamiliar codebase and a bug report. They don't just start typing.
They explore.
They read files.
They search for symbols.
They run the tests, see what fails, trace through the call chain.
They form a theory, make an edit, run the tests again.
If it breaks, they back up and try something else.
The whole process might take hours. It involves dozens of decisions, dead ends, and course corrections.
None of that is captured in a GitHub scrape. Static code is a snapshot of the answer. It tells you nothing about the process of getting there. And process is exactly what an agent needs to learn.
We needed something different. We needed full trajectories: complete, multi-turn episodes of an AI navigating a real codebase, reading files, running commands, making mistakes, recovering, and ultimately solving the problem. That data doesn't exist in the wild.
So we built a system to generate it. We call it AgentGen.
The insight that changed everything
Here's the thing that unlocked the whole approach, and it's so simple it's almost embarrassing: git history is a goldmine of implicitly labeled software engineering tasks.
Think about it. Every commit that changes source code and has associated tests is, effectively, a solved problem with a built-in verifier. The commit message tells you what the task was. The diff tells you the solution. The test suite tells you whether it's correct. Millions of these exist, across every language, in public repositories. The work has already been done and verified by real humans. We just needed to turn it into training episodes.
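To make that concrete, here's a rough sketch of the mining step, using plain git commands against a local clone. The file-pattern heuristics and function names are illustrative, not the exact filters AgentGen uses.

```python
# Illustrative mining of "gold-standard" commits from a local clone.
# The file-pattern heuristics are placeholders, not production filters.
import subprocess

def changed_files(repo: str, sha: str) -> list[str]:
    out = subprocess.run(
        ["git", "-C", repo, "show", "--name-only", "--pretty=format:", sha],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line.strip()]

def is_gold_commit(repo: str, sha: str) -> bool:
    files = changed_files(repo, sha)
    touches_source = any(f.endswith((".py", ".ts", ".go", ".rs", ".java")) for f in files)
    touches_tests = any("test" in f.lower() for f in files)
    return touches_source and touches_tests  # a solved problem with a built-in verifier

def candidate_commits(repo: str) -> list[str]:
    shas = subprocess.run(
        ["git", "-C", repo, "log", "--pretty=format:%H"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return [sha for sha in shas if is_gold_commit(repo, sha)]
```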
But a commit diff isn't something you can hand to a model and say "learn from this." A diff shows you the destination. It doesn't show you how to get there. An agent doesn't magically know which files to edit. It needs to discover the solution through exploration — the same way a developer would. Reading files. Searching for symbols. Understanding the architecture. Running tests. Iterating.
So we built a system that generates those exploration trajectories, automatically, at scale.
How an episode actually gets made
Let me walk you through what happens for a single training episode, because I think the mechanics are important.
We start with a gold-standard commit: one that's linked to an issue, accompanied by test changes, or part of a well-described PR. Something meaningful, not a typo fix.
Then we spin up a full environment. The repository gets checked out to the state before the commit was made. The agent sees only what a developer would have seen at that moment: the task description and a codebase it needs to navigate.
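Here's roughly what that setup looks like, again sketched with plain git commands; the function and field names are hypothetical stand-ins for the real environment harness.

```python
# Illustrative environment setup: reset a scratch clone to the state just
# before a gold commit, and carry the commit message along as the task.
import subprocess
from dataclasses import dataclass

@dataclass
class TaskEnv:
    repo: str       # path to the scratch clone
    base_sha: str   # the pre-commit state the agent starts from
    task: str       # what the agent is asked to do
    gold_sha: str   # held out; used only to verify the result

def make_env(repo: str, gold_sha: str) -> TaskEnv:
    def git(*args: str) -> str:
        return subprocess.run(["git", "-C", repo, *args],
                              capture_output=True, text=True, check=True).stdout.strip()

    base_sha = git("rev-parse", f"{gold_sha}^")        # parent of the gold commit
    task = git("log", "-1", "--pretty=%B", gold_sha)   # commit message as the task description
    git("checkout", "--detach", base_sha)              # the agent never sees the diff
    return TaskEnv(repo=repo, base_sha=base_sha, task=task, gold_sha=gold_sha)
```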
We give it a full toolkit: 15+ tools for file I/O, regex search, codebase search, shell execution, and more. We build a retrieval index that combines dense embeddings, lexical search, and a code graph that understands symbol relationships and call chains. And we provide a user simulator that can present the task and provide hints or corrections mid-conversation, because that's how real development works. You ask questions. You get clarification.
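To give a flavor of what the agent sees, here are a few tool definitions in a generic function-calling schema. The names and parameters are examples, not an exhaustive listing of the 15+ tools.

```python
# A handful of illustrative tool definitions; names and parameters are examples.
TOOLS = [
    {"name": "read_file",
     "description": "Read a slice of a file from the repository.",
     "parameters": {"path": "string", "start_line": "int", "end_line": "int"}},
    {"name": "grep",
     "description": "Regex search across the codebase.",
     "parameters": {"pattern": "string", "glob": "string"}},
    {"name": "code_search",
     "description": "Query the retrieval index (dense + lexical + code graph).",
     "parameters": {"query": "string", "top_k": "int"}},
    {"name": "run_shell",
     "description": "Run a shell command (e.g. the test suite) and return its output.",
     "parameters": {"command": "string", "timeout_s": "int"}},
]
```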
Then the agent works the problem. It reads files. It searches for relevant code. It forms a plan, makes edits, runs tests, hits a wall, tries again. Every single step. Every tool call, every file read, every search query and its results gets recorded.
But here's the part that matters most: we don't just record what the agent did. We record what information was available at each step. The retrieval context that gets injected before each response is captured exactly as the model would see it at inference time. This is critical. The training data matches the inference format precisely, so the model learns to work with its own retrieval pipeline rather than fighting against it.
That last point sounds like a technical detail. It's not. It's the difference between a model that tolerates retrieval context and a model that actively leverages it.
One that learns which retrieved snippets are relevant, how to cross-reference them, and when to ask for more information versus acting on what it has.
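In recording terms, a single turn might be stored something like this. The field names are illustrative; the important property is that retrieval injections are preserved verbatim and flagged so they never contribute to the training loss (more on that below).

```python
# Hedged sketch of per-turn recording; field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str                      # "user", "assistant", "tool", or "retrieval"
    content: str                   # message text, tool output, or injected snippets
    tool_call: dict | None = None  # tool name + arguments, if the assistant called one
    train_on: bool = True          # loss is masked when False (e.g. retrieval content)

@dataclass
class Episode:
    episode_id: str                # deterministic, derived from repo + commit
    repo: str
    gold_sha: str
    turns: list[Turn] = field(default_factory=list)

def record_retrieval(episode: Episode, snippets: str) -> None:
    # Injected exactly as the model will see it at inference time, never trained on.
    episode.turns.append(Turn(role="retrieval", content=snippets, train_on=False))
```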
Scale changes everything (and breaks everything)
Building one episode is straightforward. Building 200,000 of them across 50+ programming languages while maintaining quality is a completely different problem.
Start with language distribution. If you train on 90% Python, congratulations: you've built a Python assistant. Your model will be mediocre at everything else. We use a carefully tuned distribution that follows real-world language prevalence: heavier on Python, JavaScript, and TypeScript, while ensuring meaningful coverage of systems languages like Rust, Go, and C++, and enterprise languages like Java, C#, and yes, even COBOL (we see you, finance industry).
A multi-pass generation strategy with redistribution handles the inevitable gaps: if we can't hit our target for a rare language, that budget flows elsewhere without distorting the overall balance.
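One way to picture the redistribution step: each pass compares what was produced against per-language targets, and any shortfall from rare languages flows to languages that can absorb it, in proportion to their original budgets. The numbers and the proportional rule here are illustrative, not our exact quotas.

```python
# Illustrative multi-pass budgeting; quotas and the proportional rule are placeholders.
def redistribute(targets: dict[str, int], produced: dict[str, int]) -> dict[str, int]:
    shortfall = sum(max(0, t - produced.get(lang, 0)) for lang, t in targets.items())
    absorbers = {lang: t for lang, t in targets.items() if produced.get(lang, 0) >= t}
    total = sum(absorbers.values()) or 1
    new_targets = {}
    for lang, t in targets.items():
        if lang in absorbers:
            # Hand out the unmet budget in proportion to original targets.
            new_targets[lang] = t + round(shortfall * t / total)
        else:
            # Keep what the rare language actually managed to produce.
            new_targets[lang] = produced.get(lang, 0)
    return new_targets

targets = {"python": 60_000, "typescript": 30_000, "rust": 8_000, "cobol": 500}
produced = {"python": 60_000, "typescript": 30_000, "rust": 8_000, "cobol": 120}
next_pass = redistribute(targets, produced)  # COBOL's missing 380 episodes flow elsewhere
```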
Then there's task diversity. Real coding work isn't just "implement this feature." Developers debug mysterious failures. They review pull requests. They explain unfamiliar code. They solve algorithmic problems. Our episodes span five categories: replay tasks that reproduce real commits (~60%), code review (~9%), debugging (~5%), fill-in-the-middle completions (~20%), and pure reasoning problems (~5%). The mix is deliberate. A model that only knows how to implement features is half a developer.
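As a config, the mix is nothing fancy: a weighted sampler over the five categories, using the rough percentages above.

```python
# Task-category weights (approximate percentages from the text); the sampler is a sketch.
import random

TASK_MIX = {
    "replay_commit": 0.60,   # reproduce a real commit
    "fill_in_middle": 0.20,
    "code_review": 0.09,
    "debugging": 0.05,
    "reasoning": 0.05,
}

def sample_task_type(rng: random.Random) -> str:
    categories, weights = zip(*TASK_MIX.items())
    return rng.choices(categories, weights=weights, k=1)[0]
```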
And then there's cost.
Generating hundreds of thousands of agent trajectories with a frontier model is expensive, full stop. We use smart routing: a lightweight classifier evaluates task difficulty upfront and sends easy and medium tasks to faster, cheaper models, while reserving the expensive model for the hard problems. Quality stays high where it matters. The compute budget doesn't explode.
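The routing logic itself is simple. In place of the learned difficulty classifier, the sketch below uses a size-based heuristic, and the model names are placeholders.

```python
# Illustrative routing: cheap difficulty estimate -> choice of generator model.
# A size-based heuristic stands in for the learned classifier described above.
def estimate_difficulty(files_touched: int, diff_lines: int) -> str:
    if diff_lines > 200 or files_touched > 5:
        return "hard"
    if diff_lines > 40 or files_touched > 1:
        return "medium"
    return "easy"

ROUTES = {
    "easy": "fast-cheap-model",     # placeholder model names
    "medium": "fast-cheap-model",
    "hard": "frontier-model",
}

def pick_generator(files_touched: int, diff_lines: int) -> str:
    return ROUTES[estimate_difficulty(files_touched, diff_lines)]
```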
The retrieval problem nobody talks about
One of the most important, and most subtle, things we figured out is how to teach a model to actually use retrieval context.
Here's the problem. At inference time, our model has access to a retrieval pipeline that surfaces relevant code snippets, type definitions, and related files. Great. But a model trained on raw code has no idea what to do with that context. It's never seen it before. It doesn't know that the chunk of code suddenly appearing in its context window is there to help. It doesn't know how to integrate retrieval results into its reasoning.
So during episode generation, we run our full retrieval pipeline at every single turn. The retrieved context gets injected as a system message right before the agent responds. When we export the episode for training, those injection points are preserved. But we mask the loss on the retrieval content itself. The model learns to consume retrieval context without trying to memorize or reproduce it.
The whole retrieval-and-injection step takes less than 50ms per turn.
The result is a model that doesn't just tolerate retrieval-augmented context. It relies on it. It learns which snippets matter, how to cross-reference across them, and when to request more versus acting on what it has. This is something you can't bolt on after the fact. It has to be baked into training.
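Concretely, the export step might look something like the sketch below: every turn stays in the token stream, but only the agent's own outputs carry loss. The text above only guarantees that retrieval content is masked; masking user and tool turns as well is a common convention that this sketch assumes, and the tokenizer interface follows the usual encode-to-token-ids convention.

```python
# Minimal sketch of exporting an episode with a per-token loss mask.
def build_training_example(turns, tokenizer):
    input_ids, loss_mask = [], []
    for turn in turns:
        ids = tokenizer.encode(f"<|{turn.role}|>\n{turn.content}\n")
        train = turn.train_on and turn.role == "assistant"  # retrieval/user/tool turns masked
        input_ids.extend(ids)
        loss_mask.extend([1 if train else 0] * len(ids))
    return {"input_ids": input_ids, "loss_mask": loss_mask}
```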
We actually had to change our grading rubric: the agent wasn't reading files before editing because it didn't need to. The context engine was providing exactly what it needed before it had to ask for it.
Autocomplete isn't dead… it just needed better data
Not all coding assistance is conversational. Autocomplete (predicting what comes next as you type) is the bread and butter of daily developer experience. We didn't ignore it.
Our fill-in-the-middle pipeline generates training data specifically for this. For each commit, we extract code spans at AST boundaries and split them into prefix, target, and suffix. But raw fill-in-the-middle is table stakes. What makes our approach different is context-aware completions: we assemble the surrounding files, imported dependencies, and retrieval evidence into the prompt. The model learns to make completions informed by the broader codebase, not just the current file.
We generate these in three tiers. Lightweight: just retrieval snippets, for fast short-context completions. Full context: dependency files plus retrieval evidence, matching the actual inference format; this is where most of the value lives. And files-only: dependency context without retrieval, as a graceful fallback. Each example is assembled to hit a target sequence length, with intelligent padding that exercises long-context attention without wasting tokens.
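For Python, the span extraction can lean on the standard ast module; the sketch below pulls a whole function as the target. In the full-context tier, dependency files and retrieval evidence would be prepended to the prefix; sentinel tokens and tier assembly are omitted here.

```python
# Sketch of fill-in-the-middle extraction at an AST boundary (a whole function).
import ast

def fim_example(source: str) -> dict | None:
    tree = ast.parse(source)
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    if not funcs:
        return None
    node = funcs[0]  # in practice the span would be sampled, not always the first
    lines = source.splitlines(keepends=True)
    start, end = node.lineno - 1, node.end_lineno
    return {
        "prefix": "".join(lines[:start]),
        "target": "".join(lines[start:end]),  # the span the model must fill in
        "suffix": "".join(lines[end:]),
    }
```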
What comes out the other end
The pipeline produces three output formats from a single run: supervised fine-tuning episodes (complete trajectories with tool calls, retrieval context, and per-turn slice points), reinforcement learning prompts (task descriptions with metadata, where the model generates its own trajectories and learns from reward signals), and fill-in-the-middle examples split between supervised and RL training.
Everything is resumable. Crash midway through 190k episodes? The pipeline picks up where it left off, skips processed commits, and appends to existing output files. Deterministic episode IDs ensure consistency across restarts. When you're running a pipeline this large, reliability isn't a nice-to-have. It's survival.
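The resumability trick is mostly bookkeeping, something like the sketch below: episode IDs are a deterministic hash of the inputs, and output files are only ever appended to. Here `generate_episode` is a stand-in for the whole pipeline described above.

```python
# Illustrative resumable driver: deterministic IDs, skip what's done, append-only output.
import hashlib, json, os

def episode_id(repo: str, sha: str, task_type: str) -> str:
    return hashlib.sha256(f"{repo}:{sha}:{task_type}".encode()).hexdigest()[:16]

def already_done(output_path: str) -> set[str]:
    if not os.path.exists(output_path):
        return set()
    with open(output_path) as f:
        return {json.loads(line)["episode_id"] for line in f if line.strip()}

def generate_episode(repo: str, sha: str, task_type: str) -> dict:
    # Stand-in for the full agent + retrieval + recording pipeline.
    return {"repo": repo, "gold_sha": sha, "task_type": task_type, "turns": []}

def run(commits, output_path: str) -> None:
    done = already_done(output_path)
    with open(output_path, "a") as out:              # append, never rewrite
        for repo, sha, task_type in commits:
            eid = episode_id(repo, sha, task_type)
            if eid in done:
                continue                             # skip already-processed commits
            episode = generate_episode(repo, sha, task_type)
            out.write(json.dumps({"episode_id": eid, **episode}) + "\n")
```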
Why this matters
The data from AgentGen isn't just "more training data." It's a fundamentally different kind of training data: one that teaches behavioral patterns rather than static knowledge.
A model trained on these episodes learns when to explore broadly versus when to act on what it knows. It learns how to use tools effectively: don't grep for a function name when you can search the code graph; don't read an entire file when you only need the class definition. It learns to recover from mistakes: if an edit breaks tests, read the error, understand what went wrong, try a different approach. And it learns to work with retrieval naturally, integrating context across snippets and knowing when it needs more information.
These aren't things you can learn from reading code. You learn them by doing the work. We built a system that does the work hundreds of thousands of times so the model doesn't have to start from zero.
In Part 2, we'll cover how we take a massive model and compress it down to run on your laptop… without losing these hard-won capabilities. It'll be expensive and it'll be difficult. But it'll be worth it.
The original version of this post is available here.
Me> Why don't we just use VERL?
Jacob> It doesn't handle multi-shot agent training
Adam> So we're building our own? From scratch?
Jacob> Yes
Me> 🤯
We're thinking about open sourcing the RL pipeline. There are a few really neat ones out there already, like verl, but I haven't found any that are conducive to back-and-forth with a user sim, or custom agent loops out of the box. Let me know if you'd like to see the code.