Token Composition of Agentic Workloads

Agentic AI workloads have a fundamentally different token profile than chatbot interactions. Agents run iterative loops — plan, retrieve, call tools, incorporate results, repeat — replaying accumulated context on every iteration. The token composition that results is dominated by cached prior context, not new compute.

Software Development Workflow

Callan Fox at WEKA analyzed ~20 billion tokens from real Claude Code sessions (presented at the NVIDIA GTC 2026 Dynamo session). The analysis assumes an unlimited cache.

[Figure] Callan Fox, WEKA token composition; each stacked bar is one API request.

Each API request decomposes into:

  • Decode output (blue): barely visible; the generated code, tool call, or reply. ~1% of tokens.
  • New prefill (salmon): the new user prompt or tool result for this turn. ~5% of tokens.
  • Cached prior context (grey): the conversation history replayed on each turn. ~94% of tokens.

Key numbers from the dataset:

  • 94% average cache hit rate across ~20B tokens (with an unlimited cache). These rates are achievable in practice with data-affinity-aware scheduling into the racks.
  • Median inter-request time: 7 seconds (tool loops); mean: 4+ minutes (human pauses). As long-running agents arrive, human pauses will matter less and less.
  • Optimal cache Time-To-Live (TTL): ~1 hour for coding, requiring ~1M tokens cached per session. With long-running agents, expect TTLs to stretch from hours to days. A rough footprint estimate follows this list.
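
To put "~1M tokens cached per session" in perspective, here is a back-of-the-envelope footprint sketch. It assumes the Llama 3.1 405B geometry from the reference list (126 layers, 8 KV heads, 128 head dim) and FP8 KV entries; real deployments (different models, MLA, heavier quantization) will differ.

```python
# Back-of-the-envelope KV-cache footprint for one agentic coding session.
# Assumptions: Llama 3.1 405B geometry (126 layers, 8 KV heads, 128 head dim)
# and FP8 (1 byte) KV entries; these are assumptions, not measured values.

LAYERS    = 126   # transformer layers
KV_HEADS  = 8     # grouped-query attention KV heads
HEAD_DIM  = 128   # per-head dimension
BYTES_FP8 = 1     # bytes per element at FP8

# K and V each store LAYERS * KV_HEADS * HEAD_DIM values per token.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP8
print(f"KV bytes per token: ~{kv_bytes_per_token / 1e3:.0f} KB")     # ~258 KB

# Fox's coding data: ~1M tokens cached per session over a ~1 hour TTL.
session_tokens = 1_000_000
session_gb = session_tokens * kv_bytes_per_token / 1e9
print(f"Warm-tier footprint per session: ~{session_gb:.0f} GB")      # ~258 GB
```

At roughly a quarter terabyte per active session, that working set cannot live in HBM alongside the weights, which is exactly why the DRAM warm tier and its TTL matter.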

The NVIDIA Dynamo team reported complementary data: an 11.7× read/write ratio in Claude Code, meaning the system reads from KV cache roughly 12 times per write. Agent swarms achieved a 97.2% aggregate cache hit rate.

Manus AI reported a 100:1 input-to-output token ratio and called KV cache hit rate "the single most important metric for a production-stage AI agent."

Customer Support and Content Generation Workflows

The structural pattern is the same: iterative agent loops that replay growing context.

  • Customer support agents iterate through: receive ticket → search KB → check entitlement → draft reply → file ticket. Each step replays the system prompt (~2K tokens), tool schemas (~5K tokens), RAG policy context (~4K tokens), plus a growing conversation history and the relevant documents; a rough savings estimate follows this list. Cross-session sharing is higher than in coding: hundreds of agents share identical prefixes.
  • Content generation follows a similar pattern (research → draft → review → revise) where each revision replays all prior context. The cycle replays the original research context 3-4× and pulls a large number of external documents into multiple context windows. Fox's data on non-coding Claude use cases showed an optimal TTL of 8-24 hours, longer than coding, suggesting larger working sets.
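
To illustrate how much prefill cross-session sharing can eliminate, the sketch below reuses the token counts from the support-agent bullet above; the fleet size and turns-per-ticket are illustrative assumptions, not measured values.

```python
# Illustrative estimate of prefill tokens saved by sharing a common prefix
# (system prompt + tool schemas + RAG policy) across a fleet of support agents.
# Fleet size and turns-per-ticket below are assumptions, not measured data.

shared_prefix = 2_000 + 5_000 + 4_000   # system prompt + tool schemas + RAG policy
agents        = 300                     # hypothetical concurrent agents
turns         = 5                       # receive -> search KB -> check entitlement -> draft -> file

# Without reuse, every turn of every agent re-prefills the shared prefix.
naive_prefill  = shared_prefix * agents * turns
# With a shared cache, the prefix is prefilled once and read back thereafter.
cached_prefill = shared_prefix

print(f"Prefill tokens avoided: {naive_prefill - cached_prefill:,}")   # ~16.5M
print(f"Reduction on the shared-prefix portion: {1 - cached_prefill / naive_prefill:.1%}")
```

The growing per-ticket history still has to be prefilled once per session, but it too is replayed from cache on subsequent turns.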

These two workloads also converge on the same conclusion: cached prior context dominates, new compute per turn is marginal, and the working set must persist across minutes to hours.

With content-aware scheduling (by user, repo, company, etc.), cache hit rates can be pushed even higher.

So how much does this actually matter at the system level? Let's model it.


Modeling Prefill vs Decode

To make this concrete, here is a rough calculation:

  • Llama 3.1 405B (FP8, TP=8) on VR200 NVL72 (72 Rubin GPUs, 9 model instances).
  • Agentic coding session — 60 minutes, ~80 API calls, 64K average context, weighted average ~370 output tokens/call.
  • KV cache evicted between calls, reloaded from DRAM warm tier.
  • Cache hit rate: 94% (Fox/WEKA measured, max possible reuse).

KV-cache reuse frees roughly 32 GPUs from prefill duty. Moving them to decode yields up to ~2× more throughput on identical hardware.


[Figure] KV-cache reuse reduces prefill compute and shifts GPUs to decode, increasing AI factory throughput in tokens/s.

Note that Amdahl's law caps the gain. If we begin with 50% prefill and 50% decode, then even allocating 100% of GPUs to decode limits the maximum speedup to 2×. Same rack, same silicon, different memory architecture: starting from a 50:50 prefill/decode split, KV-caching techniques can lift AI factory throughput by up to 2×, bounded by Amdahl's law. A minimal sketch of the arithmetic follows.
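
The sketch below uses the 72-GPU rack, the 50:50 starting split, and the 94% hit rate from above. Treating every cache hit as eliminating its prefill compute one-for-one is a simplification, which is why it lands near, rather than exactly on, the ~32 freed GPUs quoted earlier.

```python
# Rack-level sketch: how many GPUs KV-cache reuse frees from prefill, and the
# Amdahl-bounded throughput gain from reassigning them to decode.
# The 50:50 split, 72 GPUs, and 94% hit rate come from the text above.

TOTAL_GPUS     = 72
PREFILL_SHARE  = 0.50    # fraction of GPU time on prefill before reuse (assumed)
CACHE_HIT_RATE = 0.94    # Fox/WEKA measured upper bound

prefill_gpus = TOTAL_GPUS * PREFILL_SHARE        # 36
decode_gpus  = TOTAL_GPUS - prefill_gpus         # 36

# Cache hits skip prefill compute, so prefill demand shrinks to (1 - hit rate).
prefill_needed = prefill_gpus * (1 - CACHE_HIT_RATE)    # ~2.2 GPUs
freed_gpus     = prefill_gpus - prefill_needed          # ~34 GPUs

decode_speedup = (decode_gpus + freed_gpus) / decode_gpus   # ~1.94x
amdahl_limit   = TOTAL_GPUS / decode_gpus                   # 2.0x if prefill were free

print(f"GPUs freed from prefill: ~{freed_gpus:.0f}")
print(f"Decode throughput gain: {decode_speedup:.2f}x (Amdahl bound: {amdahl_limit:.1f}x)")
```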

Not Everything Belongs in the Context Window

The numbers above assume the full conversation history is replayed into the context window. In practice, systems are starting to adapt.

The Claude Code source leak (March 2026) revealed a three-layer memory architecture designed to minimize context window usage:

  • Layer 1 — Index: A lightweight pointer file (~150 chars per entry), always loaded. Stores locations, not data. Tiny token cost.
  • Layer 2 — Topic files: Project knowledge, fetched on demand only when the index suggests relevance at runtime. Only a few relevant files are loaded per query.
  • Layer 3 — Transcripts: Raw session logs. Never read into context. Only grepped/searched for specific identifiers.

This is a bandwidth-aware tiering design.

  1. Layer 1 stays hot (always in context).
  2. Layer 2 is warm (loaded when needed, then evicted).
  3. Layer 3 is cold but searchable. The search operation (grep, regex, keyword lookup) is itself a form of near-memory compute: processing happens where the data lives, and only the search results are promoted into context. A hypothetical sketch of this pattern follows.
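
A hypothetical sketch of the three-layer pattern is below. The directory layout, file names, and helper functions are invented for illustration; this is not Claude Code's actual implementation.

```python
# Hypothetical three-layer agent memory: an always-loaded index, on-demand topic
# files, and transcripts that are only searched, never read into context.
# Paths and formats are illustrative assumptions.
import re
from pathlib import Path

MEMORY_DIR = Path("~/.agent-memory").expanduser()   # assumed layout

def load_index() -> list[str]:
    # Layer 1: tiny pointer file, always placed in context. Stores locations, not data.
    return (MEMORY_DIR / "index.md").read_text().splitlines()

def load_topic(topic_file: str) -> str:
    # Layer 2: fetched on demand, only when the index suggests it is relevant,
    # then evicted from context once the task moves on.
    return (MEMORY_DIR / "topics" / topic_file).read_text()

def search_transcripts(pattern: str, max_hits: int = 20) -> list[str]:
    # Layer 3: raw session logs stay cold. Only the grep-style matches are
    # promoted into the context window -- search runs where the data lives.
    hits: list[str] = []
    for log in sorted((MEMORY_DIR / "transcripts").glob("*.log")):
        for line in log.read_text().splitlines():
            if re.search(pattern, line):
                hits.append(f"{log.name}: {line}")
                if len(hits) >= max_hits:
                    return hits
    return hits
```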

DeepSeek's Engram architecture takes a complementary approach at the model level. Engram separates static pattern recall (O(1) hash-based lookup into parametric memory tables) from dynamic reasoning (MoE expert computation).

  • The memory tables are fixed-size, read-only during inference, and can be offloaded to host DRAM with low overhead even at 100B parameters.
  • This architectural decoupling means static knowledge doesn't need to consume context window capacity at all.
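
Below is a conceptual sketch of that split: O(1) hash-based lookup into a fixed, read-only table (which can sit in host DRAM) alongside dynamic compute. The table size, hashing scheme, and combine step are invented for illustration and are not DeepSeek's actual Engram design.

```python
# Conceptual sketch: static pattern recall via O(1) table lookup, separated from
# dynamic reasoning (stand-in for MoE expert compute). Sizes, hashing, and the
# combine step are illustrative assumptions, not the published Engram design.
import numpy as np

D_MODEL     = 1024
TABLE_SLOTS = 1 << 16   # fixed-size memory table; read-only, offloadable to host DRAM

memory_table = np.random.randn(TABLE_SLOTS, D_MODEL).astype(np.float16)

def engram_lookup(token_ngram: tuple[int, ...]) -> np.ndarray:
    # O(1) hash-based retrieval: no attention over long context, no KV-cache growth.
    slot = hash(token_ngram) % TABLE_SLOTS
    return memory_table[slot]

def moe_expert(hidden: np.ndarray) -> np.ndarray:
    # Stand-in for the dynamic-reasoning path that stays on the GPU.
    return np.tanh(hidden)

# A layer output can combine both paths: static recall plus dynamic compute,
# so static knowledge never has to occupy context-window capacity.
hidden = np.random.randn(D_MODEL).astype(np.float16)
out = moe_expert(hidden) + engram_lookup((17, 42, 7))
```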

[Figure] DeepSeek Engram and Claude Code's bandwidth-aware tiering design.

Both approaches reduce the active KV cache working set in HBM — but they don't eliminate the data, they relocate it. Topic files, Engram tables, and searchable transcripts all live in the warm tier now. The memory hierarchy problem isn't solved; it's reshaped, with more pressure pushed down one level into the warm tier. The access patterns shift from "reload everything into context" to "fetch on demand and search in place" — which makes warm-tier bandwidth, capacity, and tiering intelligence even more important.


What Drives Tokens/$

  • Intelligent memory/storage tiering is critical to driving KV-Cache reuse.
  • With KV-cache reuse, ~90% of GPU time is HBM-memory-bandwidth-constrained.
  • Tiering, data-affinity and workload-aware schedulers also play a major role.

In the agentic era, the primary drivers of factory throughput are KV-cache tiering, HBM bandwidth, and intelligent schedulers.

But the model architecture is also shifting beneath this.

  • Techniques like multi-head latent attention (DeepSeek MLA) and Google TurboQuant trade compute for data, reducing HBM bandwidth demand during decode.
  • Claude Code's 3-layer memory and Engram reduce the active context window the model needs to load.

While these trends shrink HBM pressure, they often redistribute the working set into lower memory tiers, increasing demand for warm-tier and cold-tier bandwidth and capacity. A rough per-token estimate below makes the decode-side effect concrete.
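
The sketch below reuses the Llama 3.1 405B FP8 assumptions from the modeling section; the batch size is an illustrative assumption, and the 6× KV compression factor is the TurboQuant figure from the reference list.

```python
# Rough HBM traffic per request for one decoded token: amortized weight read
# plus a full scan of that request's KV cache. Batch size is an assumption;
# the 6x KV compression factor is the TurboQuant figure from the references.

weight_bytes       = 405e9 * 1           # 405B params at FP8 (1 byte each)
kv_bytes_per_token = 2 * 126 * 8 * 128   # ~258 KB/token, as computed earlier
context_tokens     = 64_000              # average agentic coding context
batch              = 64                  # decode requests sharing each weight read

weights_per_req = weight_bytes / batch                 # amortized across the batch
kv_per_req      = context_tokens * kv_bytes_per_token  # read on every decode step
kv_compressed   = kv_per_req / 6                       # e.g. 6x KV compression

print(f"Weights per request: ~{weights_per_req / 1e9:.1f} GB")
print(f"KV read per request: ~{kv_per_req / 1e9:.1f} GB -> ~{kv_compressed / 1e9:.1f} GB compressed")
```

Even in this crude model, the long-context KV read rivals the amortized weight read, which is why KV compression and latent-attention techniques attack it first, and why the evicted or compressed remainder lands in the warm tier.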

The tip of the iceberg stays visible — and stays important. HBM bandwidth and decode throughput remain primary levers. But the bulk of new research and architecture is now reaching below the waterline, into the warm tier, and the cold tier. That's where the next round of gains will come from. The interesting work has moved underwater. 🧊

Reference Links

  1. Callan Fox, "Importance of Context Platform Engineering" — LinkedIn (~20B token dataset, GTC 2026 Dynamo session)
  2. NVIDIA Dynamo, "Full-Stack Optimizations for Agentic Inference" — Dynamo Docs (11.7× R/W ratio, Claude Code patterns)
  3. Manus AI, "Context Engineering for AI Agents" — manus.im (100:1 I/O ratio)
  4. Google Research, "TurboQuant: Redefining AI Efficiency" — Google Research Blog (ICLR 2026, 6× KV compression)
  5. DeepSeek, "Engram: Conditional Memory via Scalable Lookup" — GitHub / arXiv
  6. CachedAttention, "Cost-Efficient LLM Serving for Multi-Turn Conversations" — arXiv (TTFT -87%, cost -70%)
  7. Llama 3.1 405B architecture — Hugging Face Blog (126 layers, 8 KV heads, 128 head dim)
  8. OpenRouter, "State of AI 2025: 100T Token Usage Study" — openrouter.ai (sequence length trends, agentic shift)





