Token Composition of Agentic Workloads
Agentic AI workloads have a fundamentally different token profile than chatbot interactions. Agents run iterative loops — plan, retrieve, call tools, incorporate results, repeat — replaying accumulated context on every iteration. The token composition that results is dominated by cached prior context, not new compute.
Software Development Workflow
Callan Fox at WEKA analyzed ~20 billion tokens from real Claude Code sessions (presented at the NVIDIA GTC 2026 Dynamo session). The analysis assumes unlimited cache.
Each API request decomposes into cache-read tokens, cache-write tokens, uncached new input tokens, and generated output tokens.
Key numbers from the dataset:
The NVIDIA Dynamo team shared complementary data: an 11.7× read/write ratio in Claude Code, meaning the system reads from KV cache nearly 12 times for every write. Agent swarms achieved a 97.2% aggregate cache hit rate.
Manus AI reported a 100:1 input-to-output token ratio and called KV cache hit rate "the single most important metric for a production-stage AI agent."
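As a concrete illustration, here is a minimal sketch of how these ratios fall out of per-request token accounting. The field names follow the Anthropic Messages API usage object, and the sample numbers are invented for illustration, not taken from the WEKA dataset:

```python
# Derive cache hit rate, read/write ratio, and input:output ratio from
# per-request token accounting. Sample values are illustrative only.

requests = [
    {"cache_read_input_tokens": 95_000, "cache_creation_input_tokens": 8_000,
     "input_tokens": 500, "output_tokens": 900},
    {"cache_read_input_tokens": 103_000, "cache_creation_input_tokens": 1_200,
     "input_tokens": 300, "output_tokens": 1_100},
]

reads   = sum(r["cache_read_input_tokens"] for r in requests)      # served from KV cache
writes  = sum(r["cache_creation_input_tokens"] for r in requests)  # newly cached prefill
fresh   = sum(r["input_tokens"] for r in requests)                 # uncached input
outputs = sum(r["output_tokens"] for r in requests)                # generated tokens

total_input = reads + writes + fresh
print(f"cache hit rate     : {reads / total_input:.1%}")
print(f"read/write ratio   : {reads / writes:.1f}x")
print(f"input:output ratio : {total_input / outputs:.0f}:1")
```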
Customer Support and Content Generation Workflows
The structural pattern is the same: iterative agent loops that replay growing context.
These two workloads also converge on the same conclusion: cached prior context dominates, new compute per turn is marginal, and the working set must persist across minutes to hours.
With content-aware scheduling (routing requests by user, repository, company, and so on), cache hit rates can be pushed even higher.
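A minimal sketch of the idea: hash a cache-affinity key (user, repository, session) to pick the worker, so every turn of the same agent session lands where its KV blocks are already resident. The worker count, key fields, and hashing scheme are illustrative assumptions, not a specific scheduler's implementation:

```python
import hashlib

# Content-aware routing sketch: requests that share context (same org/repo/session)
# hash to the same worker, keeping their prefix KV cache warm on that worker.

WORKERS = [f"decode-worker-{i}" for i in range(8)]   # illustrative pool size

def affinity_key(request: dict) -> str:
    """Coarser keys (company only) balance load; finer keys maximize cache hits."""
    return f'{request["org"]}/{request["repo"]}/{request["session_id"]}'

def route(request: dict) -> str:
    digest = hashlib.sha256(affinity_key(request).encode()).digest()
    return WORKERS[int.from_bytes(digest[:8], "big") % len(WORKERS)]

turn = {"org": "acme", "repo": "billing-service", "session_id": "s-123"}
print(route(turn))   # every turn of session s-123 routes to the same worker
```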
So how much does this actually matter at the system level? Let's model it.
Modeling Prefill vs Decode
To make this concrete, here is a rough calculation:
KV-cache reuse frees 32 GPUs from prefill. Those GPUs move to decode duty, yielding up to 2× the throughput on identical hardware.
KV-cache reuse reduces prefill compute and shifts GPUs to decode, increasing AI factory throughput in tokens/s.
Note that Amdahl's law bounds the gain: if we begin with 50% prefill and 50% decode, then allocating 100% of GPUs to decode limits the maximum speedup to 2×. Same rack, same silicon, different memory architecture: KV-caching techniques can roughly double AI factory throughput, with Amdahl's law setting the ceiling.
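A back-of-the-envelope sketch of that bound, assuming a 64-GPU pool split 50/50 between prefill and decode; the pool size and split are illustrative assumptions, not figures from the WEKA analysis:

```python
# Model throughput gain when KV-cache reuse shrinks prefill work and the
# freed GPUs are reassigned to decode. All numbers are illustrative.

def amdahl_speedup(prefill_fraction: float, prefill_reduction: float) -> float:
    """Overall speedup when only the prefill share of total work is reduced."""
    remaining = (1.0 - prefill_fraction) + prefill_fraction / prefill_reduction
    return 1.0 / remaining

TOTAL_GPUS = 64
PREFILL_FRACTION = 0.5   # 32 GPUs on prefill, 32 on decode before cache reuse

for reduction in (2.0, 10.0, 100.0):
    freed = TOTAL_GPUS * PREFILL_FRACTION * (1.0 - 1.0 / reduction)
    speedup = amdahl_speedup(PREFILL_FRACTION, reduction)
    print(f"prefill cut {reduction:5.0f}x -> frees {freed:4.1f} GPUs, throughput x{speedup:.2f}")

# As the prefill cut grows without bound, all 32 prefill GPUs move to decode
# and the speedup approaches 1 / (1 - 0.5) = 2x, the Amdahl bound in the text.
```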
Not Everything Belongs in the Context Window
The numbers above assume the full conversation history is replayed into the context window. In practice, systems are starting to adapt.
The Claude Code source leak (March 2026) revealed a three-layer memory architecture designed to minimize context window usage. In effect, it is a bandwidth-aware tiering design.
DeepSeek's Engram architecture takes a complementary approach at the model level. Engram separates static pattern recall (O(1) hash-based lookup into parametric memory tables) from dynamic reasoning (MoE expert computation).
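A toy illustration of that split (this is not DeepSeek's implementation; the table size, hash, and pattern set are invented): static patterns resolve through an O(1) hash lookup into a fixed memory table, while everything else falls through to a stand-in for the expert compute path:

```python
import hashlib
import numpy as np

DIM, SLOTS = 64, 4096
memory_table = np.zeros((SLOTS, DIM), dtype=np.float32)   # stands in for parametric memory

def slot(pattern: str) -> int:
    """Hash a pattern to a memory-table slot in O(1)."""
    h = hashlib.blake2b(pattern.encode(), digest_size=8).digest()
    return int.from_bytes(h, "little") % SLOTS

def dynamic_reasoning(pattern: str) -> np.ndarray:
    """Stand-in for the expensive MoE expert path."""
    return np.random.randn(DIM).astype(np.float32)

STATIC_PATTERNS = {"import numpy as np", "SELECT * FROM"}   # assumed memorized patterns

def lookup_or_compute(pattern: str) -> np.ndarray:
    if pattern in STATIC_PATTERNS:
        return memory_table[slot(pattern)]     # cheap static recall, no expert compute
    return dynamic_reasoning(pattern)          # full dynamic-reasoning path
```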
Both approaches reduce the active KV cache working set in HBM — but they don't eliminate the data, they relocate it. Topic files, Engram tables, and searchable transcripts all live in the warm tier now. The memory hierarchy problem isn't solved; it's reshaped, with more pressure pushed down one level into the warm tier. The access patterns shift from "reload everything into context" to "fetch on demand and search in place" — which makes warm-tier bandwidth, capacity, and tiering intelligence even more important.
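A minimal sketch of that fetch-on-demand pattern, assuming a two-tier KV-block store (an HBM-resident hot tier and a DRAM/NVMe warm tier) with LRU demotion; the class, capacities, and policy are illustrative, not any particular serving stack's API:

```python
from collections import OrderedDict

class TieredKVStore:
    """Toy hot/warm KV-block store: demote on pressure, promote on demand."""

    def __init__(self, hot_capacity_blocks: int):
        self.hot = OrderedDict()   # stands in for HBM-resident KV blocks
        self.warm = {}             # stands in for CPU DRAM / local NVMe tier
        self.hot_capacity = hot_capacity_blocks

    def put(self, block_id: str, block) -> None:
        self.hot[block_id] = block
        self.hot.move_to_end(block_id)
        while len(self.hot) > self.hot_capacity:
            evicted_id, evicted = self.hot.popitem(last=False)   # LRU eviction
            self.warm[evicted_id] = evicted                      # demote, don't discard

    def get(self, block_id: str):
        if block_id in self.hot:                   # hot hit: no data movement
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        if block_id in self.warm:                  # warm hit: fetch on demand, promote
            self.put(block_id, self.warm.pop(block_id))
            return self.hot[block_id]
        return None                                # cold miss: recompute via prefill
```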
What Drives Tokens/$
In the agentic era, the primary drivers of factory throughput are KV-cache tiering, HBM bandwidth, and intelligent schedulers.
But the model architecture is also shifting beneath this.
While these trends shrink HBM pressure, they often redistribute the working set into lower memory tiers, increasing demand for warm-tier and cold-tier bandwidth and capacity.
The tip of the iceberg stays visible, and it stays important: HBM bandwidth and decode throughput remain primary levers. But the bulk of new research and architecture is now reaching below the waterline, into the warm and cold tiers. That's where the next round of gains will come from. The interesting work has moved underwater. 🧊