Lessons Learned Getting Coding Agents to Handle Complex Tasks
Task and data decomposition to work effectively with coding agents on complex tasks with large data sets

We keep pushing more ambitious tasks onto coding agents — implementing large features, mining insights from vast amounts of data, synthesizing research from hundreds of sources. And we keep hitting the same wall. These agents are impressively capable, but they have real limits. Give them too much at once and you get context rot, hallucinations, silently dropped details, and output that reads plausibly but falls apart under scrutiny. The agents don't warn you they're struggling. They just quietly produce something mediocre.

None of this should surprise us, though. We've been dealing with complexity and scale in software for decades. Multithreading, microservices, map-reduce, sharding — they're all variations of the same idea: when something is too big or too complex for a single unit to handle, you break it down and distribute it. We've internalized this so deeply for systems and data that we do it almost instinctively. But for some reason, when it comes to coding agents, we tend to forget all of that and throw the entire problem at them in one shot. Coding agents today are remarkably good when given a clearly defined single task with bounded context. They lose their way when the task is sprawling or the data is massive. It's our job to do what we've always done — decompose the work into well-defined smaller chunks and feed them to the agents one at a time so they can do what they're good at.

The good news is that there are well-proven ways to do exactly this and get genuinely high-quality output even on very complex tasks. The common thread across all of them is divide and conquer — applied in different ways depending on the problem.

For complex feature implementation, spec-driven development has been a huge win. Frameworks like GitHub's spec-kit break a feature down through rigorous stages — spec.md captures requirements, plan.md captures architecture, tasks.md breaks implementation into bite-sized steps — each artifact stored on the filesystem, each stage building on the last. When a coding agent goes to implement, it doesn't need to hold the entire feature in its head. It reads the relevant artifact, picks up one small step, does the work, moves on. The filesystem keeps the agent grounded with just the right context at the right time, and the result is dramatically better than asking the agent to figure out a large feature from a vague description.

But spec-driven development doesn't help when the problem isn't "build this feature" but "wrangle hundreds of documents and distill them into a coherent report." Here too, if you just hand the agent all the data and a complex prompt, you'll hit the same limits — context windows overflow, attention degrades, and the output comes out diluted and shallow. The agent bites off more than it can chew, and these over-eager agents are absolutely guilty of doing that if you don't guard against it.

The fix is the same principle, different shape. You do the upfront thinking to break the task into stages where each stage deals with a small chunk of data, does a thorough job with it, and writes results to the filesystem. Then you plan how to consolidate the pieces — merging insights in small batches, hierarchically, never letting any single step exceed what the agent can handle well. You bake all of that thinking into a Python orchestrator that runs the whole plan, invoking the coding agent as needed, giving it a fresh context window for each small task, and never allowing it to drown in accumulated data.

At the heart of both approaches — spec-kit for features, orchestrators for data — is the same discipline: context engineering. Designing your system so that the right information reaches the LLM at the right time, in the right quantity, without overwhelming it. This article is about the orchestrator-worker side of that coin — what works, what doesn't, and the common mistakes to avoid.

A Concrete Example

Let's take a concrete example of a complex task that has to wrangle large amounts of data and walk through how to design an orchestrator for it.

Say you want to harvest the collective wisdom your team has built up in Jira over years of solving issues and turn it into a knowledge base. You have indexed all your team's Jira tickets into a RAG system and wired it up as an MCP server so Claude Code can search it. Now you want to ask Claude something ambitious: "Look for all stability-related issues in Jira — crashes, hung threads, memory leaks, that kind of thing — and do a detailed review of every resolved ticket. Pull in the descriptions, comments, PR diffs, everything. Then synthesize all of it into a lessons-learned document with common failure patterns, decision trees for debugging different types of failures, failure patterns across modules, and so on."

That's a real ask. And Claude Code will take a crack at it. It'll start searching Jira using the MCP server, pulling ticket details, reading through them. But here's what happens in practice: it finds maybe 180 relevant tickets. It starts doing deep dives on the first few and they're pretty good. By ticket 20 it's getting sloppy. By ticket 50 the context window is groaning under the weight of all the accumulated ticket data, search results, PR diffs, and partial analyses. The synthesis at the end — the part that's supposed to tie it all together — ends up shallow and full of gaps because the model is drowning in its own prior output.

The problem isn't that Claude isn't smart enough. The problem is that it was asked to bite off more than any single session can chew. The community has started calling this "context rot" — as the context window fills up, the model's ability to accurately attend to everything in it degrades.

Now imagine a different approach. Instead of handing Claude the whole problem, you sit down for an hour and think through the stages:

1. Use Claude with the MCP server to do targeted semantic searches — "crash," "hung thread," "stability regression" — and just collect the ticket IDs into a file. That's it. Don't analyze anything yet.

2. For each ticket ID, spawn a fresh Claude Code session that does a deep-dive RCA synthesis on that one ticket alone — pulling in descriptions, comments, PR diffs — and writes a detailed markdown file. Claude Code does an amazing job capturing the RCA of a single ticket, complete with detailed decision trees and before/after code snippets with explanations. One session per ticket. 180 sessions, 180 focused analyses in 180 markdown files.

3. Normalize each RCA markdown file into a common JSON structure with fields like issue trigger, root cause category, affected module, and severity. Spawn a fresh Claude Code session to do this conversion for a small batch of files, using token-aware batching (more on that later). Now you have uniform data you can work with programmatically; a hypothetical example of one normalized record appears just after this list.

4. Classify and cluster the documents by their root cause patterns.

5. Do hierarchical merging within each cluster — batch 4-6 related analyses together (respecting token limits), have Claude merge them, then merge the batches. Each merge gets a fresh session that only sees the documents it's merging.

6. Write the final document section by section, with each section getting a fresh Claude session loaded with just the relevant merged data for that topic.
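To make stage 3 concrete, here is a rough sketch of what one normalized record could look like. The ticket ID and field names are hypothetical; the point is that every RCA, whatever its prose looks like, ends up in the same machine-readable shape.

```json
{
  "ticket_id": "PLAT-4312",
  "issue_trigger": "burst of concurrent cache invalidations during failover",
  "root_cause_category": "race condition",
  "affected_module": "cache-coordinator",
  "severity": "high",
  "symptoms": ["hung threads", "request timeouts"],
  "fix_summary": "replaced double-checked locking with a per-key mutex",
  "source_rca": "rca/PLAT-4312.md"
}
```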

Notice what's happening between stages. Every stage writes its output to the filesystem — JSON files, markdown files, a cluster plan. The next stage reads what it needs from disk and nothing more. The filesystem becomes the pipeline's shared memory, a blackboard that each stage reads from and writes to. No single Claude session needs to hold the entire picture. Each one picks up the specific artifacts left behind by the previous stage, does its focused work, and drops its results for the next one.
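As an illustration, the on-disk layout for this pipeline might look something like the sketch below. The names are hypothetical; what matters is that each stage owns a directory and the next stage reads only the directories it needs.

```
output/
  progress.json              # persistent work queue: phase, done, failed, pending
  ticket_ids/ids.json        # stage 1: ticket IDs from the semantic searches
  rca/PLAT-4312.md           # stage 2: one deep-dive RCA per ticket
  normalized/PLAT-4312.json  # stage 3: uniform records
  clusters/plan.json         # stage 4: cluster assignments
  merged/cluster_03.md       # stage 5: hierarchically merged analyses
  report/section_07.md       # stage 6: final document sections
```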

And here's a piece that proves invaluable: a progress.json file that the orchestrator maintains as a persistent work queue. Every time a ticket analysis completes or a merge finishes, the orchestrator atomically updates this file — which ticket IDs are done, which failed, which are still pending, what phase we're in. This is a lifeline. Processing 180 tickets takes hours. Internet connections drop, VPNs disconnect, API rate limits kick in, and runs get aborted. When you come back, you run --resume and the orchestrator reads progress.json, sees that tickets 1 through 137 are done, ticket 138 failed, and picks up at 138. No wasted work. No reprocessing. You can even re-run just the failures with --only-failed after figuring out what went wrong.
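A minimal sketch of such a tracker, assuming a single JSON file and a write-to-temp-then-rename for atomicity. The file layout and method names are illustrative, not a prescription.

```python
import json
import os
import tempfile

class ProgressTracker:
    """Persistent work queue: which items are done, failed, or still pending."""

    def __init__(self, path="progress.json"):
        self.path = path
        self.state = {"phase": None, "done": [], "failed": [], "pending": []}
        if os.path.exists(path):
            with open(path) as f:
                self.state = json.load(f)

    def _save(self):
        # Write to a temp file and rename, so a crash never leaves a half-written file.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(self.state, f, indent=2)
        os.replace(tmp, self.path)

    def start_phase(self, phase, items):
        # On a resume we keep the existing state; only a new phase resets the queue.
        if self.state["phase"] != phase:
            self.state.update(phase=phase, done=[], failed=[], pending=list(items))
            self._save()

    def remaining(self, only_failed=False):
        return list(self.state["failed" if only_failed else "pending"])

    def mark(self, item, ok):
        for bucket in ("pending", "failed"):
            if item in self.state[bucket]:
                self.state[bucket].remove(item)
        self.state["done" if ok else "failed"].append(item)
        self._save()
```

The --resume and --only-failed flags then reduce to a question of which bucket the orchestrator pulls its next work item from.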

The total number of Claude sessions goes from 1 (overwhelmed) to 300+ (each one focused on a single, well-defined task and working within its limits). The quality difference is dramatic. It's the difference between a document you'd be embarrassed to share and one that actually captures what the team has learned across 180 resolved issues.

The upfront effort — that hour of thinking through the decomposition, the intermediate data formats, the progress tracking — is what makes it work. You're doing the hard cognitive work of figuring out how the problem breaks apart so that each piece is small enough for Claude to handle well. The orchestrator code that ties it all together is maybe 400 lines of Python. The payoff is enormous.

The Orchestrator-Worker Pattern

The solution is a Python orchestrator. It breaks the work into phases, and for each unit of work within a phase, it spawns a brand-new Claude Code session with just the data for that one task. It does both task and data decomposition to feed Claude Code with small digestible chunks.

The quality improvement is dramatic, and it comes from one structural change: giving each task a clean context window with only the data it needs. The core idea is simple: Python controls the flow, Claude does the reasoning, and the filesystem connects them.
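Here is a sketch of what that core might look like in Python, assuming the claude CLI's non-interactive print mode (claude -p) is available on the PATH. The prompt text, paths, and phase wiring are illustrative, and the tracker is the hypothetical ProgressTracker sketched earlier.

```python
import subprocess
from pathlib import Path

def run_claude_task(prompt: str, timeout: int = 900) -> str:
    """Spawn one fresh, non-interactive Claude Code session for a single unit of work."""
    result = subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(f"claude exited {result.returncode}: {result.stderr[:500]}")
    return result.stdout

def analyze_ticket(ticket_id: str, out_dir: Path) -> None:
    """One ticket, one session, one markdown artifact on disk."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{ticket_id}.md"
    run_claude_task(
        f"Do a deep root-cause analysis of Jira ticket {ticket_id} using the MCP server: "
        f"description, comments, linked PR diffs. Write the full analysis to {out_file} "
        "as markdown, including a debugging decision tree and before/after code snippets."
    )

def run_extraction_phase(tracker, ticket_ids, out_dir: Path) -> None:
    """Phase driver: work through whatever is still pending, recording progress as we go."""
    tracker.start_phase("extract", ticket_ids)
    for ticket_id in tracker.remaining():
        try:
            analyze_ticket(ticket_id, out_dir)
            tracker.mark(ticket_id, ok=True)
        except Exception:
            tracker.mark(ticket_id, ok=False)
```

Every other phase follows the same shape: read artifacts from the previous stage's directory, build a focused prompt, spawn a fresh session, write the result to disk, update the tracker.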

Why write code for the orchestration?

There's another way to do this that doesn't involve writing an orchestrator. You could just sit there and be the orchestrator yourself. Run step 1 manually in Claude Code, wait for it to finish, review the output, then prompt it with step 2, wait again, prompt step 3, and so on. Human-in-the-loop orchestration. It works for small stuff. But it falls apart fast.

A pipeline like this takes several hours end to end. 180 document extractions alone can run for a couple of hours. Then merging, summarization, writing — each phase takes its own chunk of time. If you're the one shepherding every step, you're sitting there for an entire workday babysitting prompts. And if you step away and lose track of where you were? You're scrolling through outputs trying to figure out which document you left off on.

When the orchestration is in code, you do the deep thinking upfront, write the orchestration code, test it on a small set of data, and when you're satisfied, you kick it off and walk away. It runs overnight. You wake up and the job is done — or if something failed, the progress tracker tells you exactly where it stopped so you can fix the issue and resume. That alone is worth writing the orchestrator.

But there's a less obvious benefit that turns out to be just as valuable: the orchestrator is a record of your thinking. All that effort put into figuring out how to decompose the problem — the phase breakdown, the batching strategy, the quality thresholds, the intermediate data formats — it's captured in code. It's not lost after the task is done. It's not scattered across a chat history you'll never find again.

It's the same idea behind GitHub spec-kit storing planning artifacts in spec.md, plan.md, tasks.md — all that thinking about the problem doesn't evaporate after the feature ships, it lives on as artifacts you can revisit and build on. An orchestrator is the same thing for data processing: your decomposition strategy, captured as code. Next time there's a similar problem — different domain, different documents, but same general shape of "extract, classify, merge, synthesize" — you don't start from scratch. You adapt the orchestrator. The prompts change, the schemas change, but the structure carries over. That's a compounding advantage you don't get from manual prompting, no matter how good you are at it.

Why Fresh Sessions Matter So Much

When you keep working in a single Claude Code session, every previous exchange is sitting in the context window competing for attention. By step 10, the model is sifting through the residue of steps 1 through 9. Old tool call results, previous outputs, intermediate reasoning — it all accumulates. The model doesn't forget it, but it starts paying less attention to it, and more importantly, it has less room for the thing you actually want it to focus on right now. When you start a fresh session for each task, that task gets the full context window. This is a game changer when it comes to the quality of the output the model produces in each step.

The Filesystem Is Your Agent's Long-Term Memory

It's worth spending a minute on the filesystem thing because people underestimate how much mileage you get from just writing JSON files to disk.

In this kind of pipeline, every phase writes its outputs into its own directory. Phase 1 extraction creates 180 JSON files. Phase 2 clustering writes a plan. Phase 3 merging writes unified cluster files. Phase 4 writing creates 21 markdown sections. It's all sitting right there on disk.

This buys you so much:

You get crash recovery. Pipeline dies at document 147? Restart it and the progress tracker knows to skip documents 1-146 and pick up at 147. This is invaluable when you're processing hundreds of items and each one takes 30-60 seconds. Nobody wants to restart from scratch because of a network hiccup.

You get inspectability. When the merged output for cluster 17 looks weird, you open the intermediate JSON and see exactly what went in. You don't have to replay a conversation or dig through logs. The data is right there.

You get selective re-runs. Five documents failed extraction out of 180? Run --only-failed instead of reprocessing everything.

You get phase independence. Unhappy with how the sections read? Tweak the writing prompts and re-run just Phase 4. Phases 0-3 are untouched.

The Manus team (builders of one of the more battle-tested production agent systems out there) put it in a way that sticks: treat the filesystem as structured external memory. The context window is working memory. The filesystem is long-term memory. Don't confuse the two.

Token-Aware Batching

Here's something that bites people often. You have 180 documents of wildly varying sizes — some are 2,000 tokens, some are 8,000. You need to merge related ones in batches. The naive approach is "put 6 in each batch." But 6 large documents blow past the context limit, and 6 small ones waste most of it.

The fix is simple but most people skip it: estimate tokens before batching. Write a small utility that does rough token estimation (~4 characters per token, adjusted for whitespace) and bin-packs documents into batches under a hard token budget.

The estimation doesn't need to be exact. What matters is that you have a budget and you respect it.
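A minimal version of that utility, using the rough 4-characters-per-token heuristic mentioned above and greedy first-fit packing. The default budget is an arbitrary placeholder; swap in a real tokenizer if you need precision.

```python
from pathlib import Path

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token, discounted a little for whitespace.
    return int((len(text) - text.count(" ") * 0.5) / 4) + 1

def pack_batches(paths: list[Path], budget: int = 30_000) -> list[list[Path]]:
    """Greedily pack documents into batches that stay under a hard token budget."""
    batches, current, used = [], [], 0
    for path in paths:
        tokens = estimate_tokens(path.read_text())
        if tokens > budget:
            # A single oversized document gets its own batch (or gets split upstream).
            batches.append([path])
            continue
        if current and used + tokens > budget:
            batches.append(current)
            current, used = [], 0
        current.append(path)
        used += tokens
    if current:
        batches.append(current)
    return batches
```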

Of course, let's use Claude to build the orchestrator

You can use Claude Code itself to help design the decomposition. Use it interactively to help figure out the phase breakdown, design the JSON schemas, write the token estimation code, build the progress tracker. You can also use Claude Code to build the Python orchestrator once you have full clarity on the design of the pipeline. Claude helps you build the pipeline code, which in turn invokes Claude Code to execute the various phases in the pipeline. It is a strange world we inhabit where you use an AI system to build the harness that will then be used to control that same AI system to do the actual work.

This Will Get Easier

To close with some perspective: everything described here is fundamentally a set of workarounds for the current limitations of these coding agents. Context windows are finite. Attention degrades over long inputs. Token budgets constrain how much data you can process at once.

The pace of improvement is wild though. In a year or two, a lot of this orchestration complexity will likely be handled natively by the platforms.

But right now, in early 2026, we're in the phase where the engineers who can do this decomposition well — who can think through how to break a complex task into LLM-sized pieces, manage context intelligently, and orchestrate it all with reliable code — are the ones building things that actually ship. It's all about understanding the system you're working with, respecting its constraints, and engineering around them.

It's the oldest trick in computer science: divide and conquer. The twist is that now we're applying it to manage the cognitive limitations of an AI system instead of the computational limitations of a machine.

