Week 7: Injecting Maximum Context Without Exploding the Context Window

(The difference between 60% and 95%+ autonomous success)

If prompt engineering is the highest-leverage activity, then context management is the silent killer of multi-agent systems. I’ve watched entire agentic workflows collapse because one agent got a 120k-token context dump and started hallucinating, looping, or simply timing out. Conversely, the biggest leaps we’ve ever seen in Vellox Reverser and Detect came from answering one question: How do we give every agent exactly the context it needs, no more and no less, at the exact moment it needs it?

Here are the battle-tested principles we live by at Booz Allen when building production-grade multi-agent systems:

  • Context Is Not a Dump, It’s a Precision Strike. Never send the entire session history. Instead, every hand-off between agents contains a surgically crafted Context Payload with four sections only:

    1. Goal Reminder (1-2 sentences)
    2. Relevant Prior Findings (structured JSON from previous agents)
    3. Fresh Tool Results (last 2-3 calls max)
    4. Action Request (what this specific agent must do right now)

Average size in Vellox Detect: 4-9k tokens. Never more than 15k, even on 128k-context models.
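A minimal sketch of such a payload as a Python dataclass (the class name, field names, and render format are illustrative, not the actual Vellox schema):

import json
from dataclasses import dataclass

@dataclass
class ContextPayload:
    goal_reminder: str               # 1-2 sentences restating the current goal
    prior_findings: list[dict]       # structured JSON from previous agents
    fresh_tool_results: list[dict]   # last 2-3 tool calls only
    action_request: str              # what this specific agent must do right now

    def render(self) -> str:
        # Serialize in a fixed order so every agent sees the same layout.
        return "\n\n".join([
            f"GOAL: {self.goal_reminder}",
            f"PRIOR FINDINGS:\n{json.dumps(self.prior_findings, indent=2)}",
            f"FRESH TOOL RESULTS:\n{json.dumps(self.fresh_tool_results, indent=2)}",
            f"ACTION REQUEST: {self.action_request}",
        ])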
  • Dynamic Summarization Engine (Mandatory). We run a tiny, dedicated “Summarizer-Agent” (think Claude-3-Haiku or Llama-3.1-8B) that watches the shared memory store. Every time an agent finishes, the Summarizer distills its output into a canonical 200-400 token summary that replaces the full raw output in long-term memory. Result: after 40+ turns in a complex malware investigation, total context stays under 18k tokens instead of ballooning to 120k+.
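A sketch of that replacement step, assuming a generic llm(prompt, max_tokens) completion function and a dict-backed long-term memory (both hypothetical):

def summarize_and_replace(memory: dict, agent_name: str, raw_output: str, llm) -> None:
    # Distill the raw output into a canonical 200-400 token summary.
    prompt = (
        "Distill the following agent output into a 200-400 token summary. "
        "Keep all IOCs, file hashes, and confidence scores verbatim.\n\n"
        f"AGENT: {agent_name}\nOUTPUT:\n{raw_output}"
    )
    summary = llm(prompt, max_tokens=400)
    # The summary replaces the raw output in long-term memory; the raw text
    # survives only as an archived artifact (e.g., in the vector store).
    memory[agent_name] = summary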
  • Hybrid RAG + Episodic Memory Architecture.
    Vector Store (Chroma / Pinecone): stores every artifact ever produced (disassembly snippets, YARA rules, IOCs, screenshots).
    Short-Term Working Memory: only the last 3-5 episodic summaries + the current goal.
    Retrieval Strategy: at the start of every agent turn we retrieve the top-7 most relevant chunks (using metadata tags like file_id, malware_family, agent_name) and inject them after the structured payload, not before.
This keeps recall high while total tokens stay low.
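One turn’s context assembly might look like this with Chroma (the collection name, filter values, and helper signature are all illustrative):

import chromadb

client = chromadb.Client()
artifacts = client.get_or_create_collection("vellox_artifacts")

def build_turn_context(payload_text: str, goal: str,
                       recent_summaries: list[str], sample_id: str) -> str:
    # Top-7 most relevant chunks, filtered by metadata tag.
    hits = artifacts.query(
        query_texts=[goal],
        n_results=7,
        where={"malware_sample": sample_id},
    )
    retrieved = "\n".join(hits["documents"][0])
    # Order matters: structured payload first, then the last 3-5 episodic
    # summaries, then the retrieved chunks injected after the payload, not before.
    return "\n\n".join([payload_text, *recent_summaries[-5:], retrieved])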
  • Chunk-and-Tag Discipline. Every piece of knowledge we store gets mandatory metadata tags:

{
  "source_agent": "Decompiler-Agent",
  "malware_sample": "e41d3b9f...",
  "confidence": 0.98,
  "timestamp": "2025-11-28T14:22:01Z",
  "type": "unpacker_findings | yara_rule | ioc_list"
}        

This lets us retrieve with surgical precision instead of “give me everything about this sample.”
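With the same (hypothetical) Chroma collection as above, storing a tagged artifact and then pulling back only what matches might look roughly like this:

import chromadb

artifacts = chromadb.Client().get_or_create_collection("vellox_artifacts")

# Store one artifact with the mandatory tags from the example above.
artifacts.add(
    ids=["decompiler-0042"],
    documents=["rule SuspiciousUnpacker { ... }"],
    metadatas=[{
        "source_agent": "Decompiler-Agent",
        "malware_sample": "e41d3b9f...",
        "confidence": 0.98,
        "timestamp": "2025-11-28T14:22:01Z",
        "type": "yara_rule",
    }],
)

# Retrieve only this sample's YARA rules, not "everything about this sample".
hits = artifacts.query(
    query_texts=["unpacker behavior"],
    n_results=7,
    where={"$and": [{"malware_sample": "e41d3b9f..."}, {"type": "yara_rule"}]},
)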

  • Pre-Retrieval Filtering via Judge. Before hitting the vector DB, we run a 0.2-second “Relevance-Judge” prompt: “Given the current goal […], which of these 7 candidate chunks are actually required right now? Return only indices.” This frequently cuts retrieval from 7 down to 2-3 chunks.
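A sketch of that judge, again assuming a generic llm(prompt, max_tokens) callable:

def relevance_judge(goal: str, candidates: list[str], llm) -> list[str]:
    # Number the candidates so the judge can answer with indices only.
    numbered = "\n".join(f"[{i}] {c[:200]}" for i, c in enumerate(candidates))
    prompt = (
        f"Given the current goal: {goal}\n"
        "Which of these candidate chunks are actually required right now? "
        f"Return only a comma-separated list of indices.\n{numbered}"
    )
    reply = llm(prompt, max_tokens=20)
    keep = sorted({int(i) for i in reply.split(",") if i.strip().isdigit()})
    return [candidates[i] for i in keep if i < len(candidates)]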
  • Sliding-Window + Summary Refresh Loop. When we do approach the context limit (e.g., 100k+ on Grok-4 or Claude-3.5), we trigger a full-context refresh: the Summarizer-Agent rewrites the entire working memory into a 3,000-token “Executive Summary v2”, old episodic details are archived to the vector store only, and the new turn starts clean. We’ve run 120+ step investigations this way without ever hitting the wall.
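A sketch of the refresh trigger (the ~4-chars-per-token estimate, the archive callable, and the thresholds are assumptions; use a real tokenizer in production):

def maybe_refresh(working_memory: list[str], llm, archive,
                  limit_tokens: int = 100_000) -> list[str]:
    # Rough token estimate: ~4 characters per token.
    est_tokens = sum(len(m) for m in working_memory) // 4
    if est_tokens < limit_tokens:
        return working_memory
    # Archive the old episodic detail to the vector store, then rewrite
    # everything into a single ~3,000-token executive summary.
    archive(working_memory)
    summary = llm(
        "Rewrite the following working memory into a single ~3,000-token "
        "executive summary. Preserve all IOCs, hashes, and open questions.\n\n"
        + "\n\n".join(working_memory),
        max_tokens=4000,
    )
    return [summary]  # the new turn starts clean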
  • Model-Specific Context Budgeting. We assign models by context tolerance:
    Claude-3.5-Sonnet or Grok-4: up to 100k for the final Judge/Reporting agent
    Llama-3.1-70B or Mixtral-8x22B: 32k max for mid-workflow agents
    Haiku or Gemma-2-9B: <8k for summarizers and routers
The prompt literally includes: “You are running in a 32k context budget. Be concise.”
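A sketch of how those budgets can be wired in (role names and exact model strings are illustrative):

# role -> (model, max context tokens), mirroring the tiers above
CONTEXT_BUDGETS = {
    "judge_reporting": ("claude-3.5-sonnet", 100_000),
    "mid_workflow":    ("llama-3.1-70b",      32_000),
    "summarizer":      ("claude-3-haiku",       8_000),
}

def budget_preamble(role: str) -> str:
    model, budget = CONTEXT_BUDGETS[role]
    # State the budget explicitly so the model self-polices its verbosity.
    return f"You are running in a {budget // 1000}k context budget. Be concise."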

 
