The Context Window Problem
What the machine remembers.
Every AI system working on a long task faces the same wall eventually. The context fills. The tokens run out. The machine forgets.
It is not a bug. It is architecture. The transformer attention mechanism at the heart of every major language model has a finite working memory: a context window measured in tokens. Feed it more than it can hold and something has to give. Three philosophies have emerged to answer that problem. The way each one works reveals something deeper than an engineering preference. It reveals a set of assumptions about what it means to remember, and what it means to think.
I want to explore each through an analogous film. Not because the metaphors are decorative, but because the questions these films ask are the same questions the engineers were asking. How do you build something capable of sustained, coherent work when memory is broken, unlimited, or distributed?
And does each approach get its happy ending?
The Morning Video
"The heart has its reasons which reason knows nothing of." -- Blaise Pascal
In 50 First Dates, Lucy wakes up every morning with no memory of the previous day. A brain injury has locked her in a permanent reset. Henry, who has fallen in love with her, faces a specific human problem: how do you build a relationship with someone who cannot hold a session?
His solution is a video. Every morning, Lucy watches a recording that covers the essential facts of her life: who she is, what has happened, who Henry is, and why she loves him. It is not the full story. It cannot be. But it is enough to continue.
This is compaction.
Anthropic's implementation triggers when a conversation approaches the token limit. The API runs an additional sampling step: the model reads everything it has processed and generates a compaction block, a dense summary replacing the full history. On the next request, the model wakes up from the summary, not the raw record.
The implementation is clean. The compaction block can be passed back into subsequent requests, prompt-cached to reduce cost, and customised with your own summarisation instructions if the default is too broad. For long-running agentic workflows, the kind where a model iterates through hundreds of tool calls over hours, compaction remains a core strategy. It extends effective context without requiring you to restructure your application.
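To make the shape of that loop concrete, here is a minimal sketch in Python, assuming the official anthropic SDK. The model name, token budget, and compact_history helper are all illustrative; this shows the pattern, not Anthropic's server-side compaction feature.

```python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-5"      # illustrative model name
TOKEN_BUDGET = 150_000           # illustrative threshold, below the hard limit

def compact_history(messages: list[dict]) -> list[dict]:
    """Ask the model to write its own morning video, then replace
    the full history with that summary."""
    summary = client.messages.create(
        model=MODEL,
        max_tokens=2_000,
        system=("Summarise this conversation into the minimum context "
                "needed to continue: goals, decisions made, open items."),
        messages=messages + [{"role": "user", "content": "Summarise now."}],
    )
    return [{"role": "user",
             "content": "[Compacted context]\n" + summary.content[0].text}]

def send(messages: list[dict], user_turn: str) -> tuple[list[dict], str]:
    messages = messages + [{"role": "user", "content": user_turn}]
    # Count tokens before sending; compact if we are near the budget.
    used = client.messages.count_tokens(model=MODEL, messages=messages)
    if used.input_tokens > TOKEN_BUDGET:
        messages = compact_history(messages[:-1]) + [messages[-1]]
    reply = client.messages.create(model=MODEL, max_tokens=1_000,
                                   messages=messages)
    messages.append({"role": "assistant", "content": reply.content[0].text})
    return messages, reply.content[0].text
```

The line that matters is the return in compact_history: the full history is discarded and the summary becomes the new opening message. That is Lucy's morning video, verbatim.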
The tradeoff is the same one Henry faces. The summary is not the history. Nuance compresses away. A detail that seemed minor at turn 40 may have been load-bearing at turn 80. The model cannot tell you what it forgot, because the forgetting was the point.
The relationship survives. But you should understand what the morning video leaves out.
The Implanted Memory
"To be is to be perceived." -- George Berkeley
Blade Runner 2049 is a film about a replicant named K who discovers a childhood memory that turns out not to belong to him. It was implanted. But it was real enough to shape his identity, his choices, his willingness to sacrifice everything. The film asks: if you hold enough memory with sufficient fidelity, does it matter whether it was yours to begin with?
The answer to the context problem here is capacity. Raise the ceiling until most workloads never touch it.
Google pioneered this approach. Gemini ships with a one-million-token context window as standard. The Pro tier extends to two million. And as of March 2026, Anthropic has joined the same bet: Claude Opus 4.6 and Sonnet 4.6 now offer a one-million-token context window at standard pricing. No long-context premium. Generally available. The same company that built the morning video decided it also wanted the implanted memory.
That convergence is worth pausing on. The engineers who understood compression better than anyone still decided raw capacity was worth having.
To make those numbers concrete: one million tokens can hold fifty thousand lines of code, eight average-length novels, or five years of your text messages. Research benchmarks report retrieval above ninety-nine percent on information access tasks across that range. The claim is that the ceiling is high enough that most workloads never need to engineer around it.
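The arithmetic behind those claims is easy to check, assuming rough conversion rates of about twenty tokens per line of code and about 120,000 tokens per average novel. Both ratios are conventions, not measurements:

```python
WINDOW = 1_000_000             # tokens in the context window

TOKENS_PER_CODE_LINE = 20      # rough average for typical source code
TOKENS_PER_NOVEL = 120_000     # ~90k words at ~1.3 tokens per word

print(WINDOW // TOKENS_PER_CODE_LINE)   # 50000 lines of code
print(WINDOW // TOKENS_PER_NOVEL)       # 8 novels
```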
For certain problems, this is genuinely the right answer. Single-shot analysis of a large codebase, question-answering over a document corpus that cannot be easily segmented, in-context learning with hundreds of examples: these are cases where holding everything in view simultaneously is the value, not a workaround. Retrieval-augmented generation looks elegant until you realise you have spent six months building a chunking and retrieval pipeline for something a large context window handles natively.
The cost question has shifted even in the week since this article was first conceived. A million tokens of input used to be expensive, and the pricing compounded quickly in production. Google's mitigation was context caching: cached content billed at a 75-90% discount when reused. Anthropic's one-million-token window changes the calculus differently: standard pricing across the full window, no multiplier. A 900K-token request is billed at the same per-token rate as a 9K one. The economic objection to long context is weaker than it was, and it weakened in a week.
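A toy cost model makes the comparison concrete. The price here is an illustrative placeholder, not a current rate card:

```python
def input_cost(tokens: int, price_per_mtok: float,
               cached_fraction: float = 0.0,
               cache_discount: float = 0.0) -> float:
    """Input cost in dollars for one request. The cached fraction of the
    prompt is billed at (1 - cache_discount) of the base rate."""
    fresh = tokens * (1 - cached_fraction)
    cached = tokens * cached_fraction * (1 - cache_discount)
    return (fresh + cached) * price_per_mtok / 1_000_000

PRICE = 3.00  # illustrative $/MTok; check the current rate cards

# Flat pricing across the window: a 900K request is just 100x a 9K one.
print(input_cost(900_000, PRICE))                           # 2.70
print(input_cost(9_000, PRICE))                             # 0.027
# Caching: the same 900K request with 800K reused at a 75% discount.
print(input_cost(900_000, PRICE, 800_000 / 900_000, 0.75))  # 0.90
```

Under flat pricing the big request simply costs proportionally more; under caching, the discount only pays off when most of the prompt is reused between requests.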
But Blade Runner's deeper question still holds. K's implanted memory was internally consistent, richly detailed, and completely misleading. A large context guarantees capacity. It does not guarantee that the model attends to the right things, weights the right signals, or draws the right conclusions from what it holds.
Capacity and comprehension are not the same thing.
The memory is real. It is cheaper than it was. Whether the model is wiser for holding it is a different question entirely.
The Clone
"None of us is as smart as all of us." -- Ken Blanchard
In Multiplicity, Doug Kinney solves his work-life problem by cloning himself. One clone handles the job. One handles the family. One handles the house. Each specialises, each operates in parallel, and for a while the system works better than the original. Then the clones start diverging from each other. And when one clone makes a copy of itself, the result is degraded: a simpler, noisier version of the original, capable of less.
Cursor's multi-agent architecture is Multiplicity, built correctly.
Their engineering team ran experiments at scale: hundreds of concurrent agents working on a single project, producing over a million lines of code, including a working web browser implementation spanning a thousand files and a Solid-to-React migration that ran for three weeks and logged two hundred and sixty-six thousand additions. The context window problem is solved not by extending or compressing any single agent's memory, but by ensuring no single agent ever needs more memory than a standard context window provides.
The key architectural insight was that flat coordination fails. In early experiments, agents sharing a codebase through distributed file locking hit bottlenecks and became brittle. Optimistic concurrency made agents risk-averse: they avoided hard tasks rather than risk collision with a peer. The solution was hierarchy. Planners explore the codebase and create tasks, spawning sub-planners recursively for complex domains. Workers focus exclusively on completing assigned tasks, with no inter-worker coordination whatsoever. Judges evaluate whether progress is real and whether to continue iterating.
Planner thinks. Worker executes. Judge validates. Nobody tries to do all three.
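A toy version of that hierarchy looks something like the sketch below. Everything in it is illustrative: Cursor has not published this code, and the decompose, execute, and evaluate stubs stand in for real model calls.

```python
from dataclasses import dataclass

@dataclass
class Task:
    description: str
    result: str | None = None

# Placeholder model calls; a real system would prompt an LLM at each step.
def decompose(goal: str) -> list[str]:
    return [f"{goal} / part {i}" for i in (1, 2)]

def execute(description: str) -> str:
    return f"done: {description}"

def evaluate(task: Task) -> bool:
    return task.result is not None

def plan(goal: str, depth: int = 0, max_depth: int = 2) -> list[Task]:
    """Planner: decompose the goal into bounded tasks, recursing into
    sub-planners for complex domains."""
    if depth >= max_depth:
        return [Task(goal)]
    tasks: list[Task] = []
    for subgoal in decompose(goal):
        tasks.extend(plan(subgoal, depth + 1, max_depth))
    return tasks

def work(task: Task) -> Task:
    """Worker: complete one task in a fresh, bounded context.
    No coordination with other workers."""
    task.result = execute(task.description)
    return task

def judge(tasks: list[Task]) -> bool:
    """Judge: decide whether progress is real and whether to stop."""
    return all(evaluate(t) for t in tasks)

def run(goal: str, max_rounds: int = 3) -> None:
    for _ in range(max_rounds):
        tasks = [work(t) for t in plan(goal)]  # workers could run in parallel
        if judge(tasks):
            return
```

The max_depth cap in plan is the guard against the clone-of-a-clone: decomposition stops before the original intent dilutes across generations.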
The failure modes from the film are the failure modes of the system. Planners that over-scope produce tasks too large for workers to complete within a bounded context. Workers that drift introduce inconsistency that propagates before a judge catches it. And Multiplicity's sharpest warning, the degraded clone-of-a-clone, is a genuine risk in recursive sub-planner architectures where the original intent dilutes across generations of task decomposition.
The clones work. Keep the hierarchy clean. Do not let them make copies of themselves ad infinitum.
Why the Divergence
"Two roads diverged in a wood, and I took the one less traveled by, and that has made all the difference." -- Robert Frost
Lucy gets her happy ending in 50 First Dates. K in Blade Runner 2049 finds something more ambiguous: purpose not in being "special", but in doing something genuinely human. Doug in Multiplicity gets his life back, roughly. All three reach a form of resolution. But the resolutions are not equal, and the paths are not interchangeable.
Choosing the wrong one for your problem does not mean you fail. It means you spend a long time in the wrong film.
Here is what has changed since these philosophies first diverged: they are no longer mutually exclusive. Anthropic built compaction, proved it worked, and then shipped a one-million-token context window at standard pricing anyway. They did not abandon the morning video. They added the implanted memory alongside it. That decision, by the team that understood the tradeoffs of both approaches better than most, tells you something important: this is not a single-answer problem. It never was.
The selection criteria are architectural, and the framing has shifted from which company to which problem:

- Compaction, when the problem is one long-running session that must stay coherent past the token limit.
- Long context, when the problem is a large corpus that needs to be held in view all at once.
- Multi-agent hierarchy, when the problem decomposes into parallel tasks, each scoped to fit a standard window.
None of these is the correct answer. Each is correct for the problem it was designed to solve. And sometimes the correct answer is two of them at once.
What the Machine Remembers
The context window problem is not going away. The transformer architecture has a ceiling. That ceiling just got significantly higher. The question persists: is it high enough that most systems never hit it, or is the ceiling always worth engineering around?
What these three approaches share is not a mechanism. It is a goal. Each one is trying to build something that keeps working when the naive approach runs out of room. Each makes a different bet about what matters most: the continuity of a single long session, the richness of a large held corpus, or the coordination of distributed effort scoped to fit.
The most interesting development is that the bets are converging. The company that built compaction now ships long context. The company that pioneered long context offers caching to make it economical. The company that built multi-agent coordination gives each agent the largest context window it can get. The philosophies were never as separate as the marketing suggested. The engineering is meeting in the middle.
Lucy's morning video is a system for preserving love across a broken memory. K's implanted memory shapes an identity: replicant, human or not. Doug's clones carry pieces of a life that became too large for one person to hold.
The machine faces the same problem. What it remembers, and how, determines what it becomes.
Memory is not neutral. Neither is architecture. And neither, it turns out, is choosing just one.