Week 7: Injecting Maximum Context Without Exploding the Context Window

(The difference between 60% and 95%+ autonomous success)

If prompt engineering is the highest-leverage activity, then context management is the silent killer of multi-agent systems. I’ve watched entire agentic workflows collapse because one agent got a 120k-token context dump and started hallucinating, looping, or simply timing out. Conversely, the biggest leaps we’ve ever seen in Vellox Reverser and Detect came from answering one question: How do we give every agent exactly the context it needs, no more and no less, at the exact moment it needs it?

Here are the battle-tested principles we live by at Booz Allen when building production-grade multi-agent systems:

  • Context Is Not a Dump, It’s a Precision Strike. Never send the entire session history. Instead, every hand-off between agents contains a surgically crafted Context Payload with four sections only:

    1. Goal Reminder (1-2 sentences)
    2. Relevant Prior Findings (structured JSON from previous agents)
    3. Fresh Tool Results (last 2-3 calls max)
    4. Action Request (what this specific agent must do right now)

Average size in Vellox Detect: 4-9k tokens. Never more than 15k, even on 128k-context models.
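A minimal sketch of such a payload as a Python dataclass (the class name, field names, and render format are illustrative, not the actual Vellox schema):

import json
from dataclasses import dataclass

@dataclass
class ContextPayload:
    goal_reminder: str               # 1-2 sentences restating the current goal
    prior_findings: list[dict]       # structured JSON from previous agents
    fresh_tool_results: list[dict]   # last 2-3 tool calls only
    action_request: str              # what this specific agent must do right now

    def render(self) -> str:
        # Serialize in a fixed order so every agent sees the same layout.
        return "\n\n".join([
            f"GOAL: {self.goal_reminder}",
            f"PRIOR FINDINGS:\n{json.dumps(self.prior_findings, indent=2)}",
            f"FRESH TOOL RESULTS:\n{json.dumps(self.fresh_tool_results, indent=2)}",
            f"ACTION REQUEST: {self.action_request}",
        ])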
  • Dynamic Summarization Engine (Mandatory). We run a tiny, dedicated “Summarizer-Agent” (think Claude-3-Haiku or Llama-3.1-8B) that watches the shared memory store. Every time an agent finishes, the Summarizer distills its output into a canonical 200-400 token summary that replaces the full raw output in long-term memory. Result: after 40+ turns in a complex malware investigation, total context stays under 18k tokens instead of ballooning to 120k+.
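A sketch of that replacement step, assuming a generic llm(prompt, max_tokens) completion function and a dict-backed long-term memory (both hypothetical):

def summarize_and_replace(memory: dict, agent_name: str, raw_output: str, llm) -> None:
    # Distill the raw output into a canonical 200-400 token summary.
    prompt = (
        "Distill the following agent output into a 200-400 token summary. "
        "Keep all IOCs, file hashes, and confidence scores verbatim.\n\n"
        f"AGENT: {agent_name}\nOUTPUT:\n{raw_output}"
    )
    summary = llm(prompt, max_tokens=400)
    # The summary replaces the raw output in long-term memory; the raw text
    # survives only as an archived artifact (e.g., in the vector store).
    memory[agent_name] = summary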
  • Hybrid RAG + Episodic Memory Architecture.
    Vector Store (Chroma / Pinecone): stores every artifact ever produced (disassembly snippets, YARA rules, IOCs, screenshots).
    Short-Term Working Memory: only the last 3-5 episodic summaries + the current goal.
    Retrieval Strategy: at the start of every agent turn we retrieve the top-7 most relevant chunks (using metadata tags like file_id, malware_family, agent_name) and inject them after the structured payload, not before.
This keeps recall high while total tokens stay low.
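One turn’s context assembly might look like this with Chroma (the collection name, filter values, and helper signature are all illustrative):

import chromadb

client = chromadb.Client()
artifacts = client.get_or_create_collection("vellox_artifacts")

def build_turn_context(payload_text: str, goal: str,
                       recent_summaries: list[str], sample_id: str) -> str:
    # Top-7 most relevant chunks, filtered by metadata tag.
    hits = artifacts.query(
        query_texts=[goal],
        n_results=7,
        where={"malware_sample": sample_id},
    )
    retrieved = "\n".join(hits["documents"][0])
    # Order matters: structured payload first, then the last 3-5 episodic
    # summaries, then the retrieved chunks injected after the payload, not before.
    return "\n\n".join([payload_text, *recent_summaries[-5:], retrieved])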
  • Chunk-and-Tag Discipline. Every piece of knowledge we store gets mandatory metadata tags:

{
  "source_agent": "Decompiler-Agent",
  "malware_sample": "e41d3b9f...",
  "confidence": 0.98,
  "timestamp": "2025-11-28T14:22:01Z",
  "type": "unpacker_findings | yara_rule | ioc_list"
}        

This lets us retrieve with surgical precision instead of “give me everything about this sample.”
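With the same (hypothetical) Chroma collection as above, storing a tagged artifact and then pulling back only what matches might look roughly like this:

import chromadb

artifacts = chromadb.Client().get_or_create_collection("vellox_artifacts")

# Store one artifact with the mandatory tags from the example above.
artifacts.add(
    ids=["decompiler-0042"],
    documents=["rule SuspiciousUnpacker { ... }"],
    metadatas=[{
        "source_agent": "Decompiler-Agent",
        "malware_sample": "e41d3b9f...",
        "confidence": 0.98,
        "timestamp": "2025-11-28T14:22:01Z",
        "type": "yara_rule",
    }],
)

# Retrieve only this sample's YARA rules, not "everything about this sample".
hits = artifacts.query(
    query_texts=["unpacker behavior"],
    n_results=7,
    where={"$and": [{"malware_sample": "e41d3b9f..."}, {"type": "yara_rule"}]},
)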

  • Pre-Retrieval Filtering via Judge. Before hitting the vector DB, we run a 0.2-second “Relevance-Judge” prompt: “Given the current goal […], which of these 7 candidate chunks are actually required right now? Return only indices.” This frequently cuts retrieval from 7 down to 2-3 chunks.
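A sketch of that judge, again assuming a generic llm(prompt, max_tokens) callable:

def relevance_judge(goal: str, candidates: list[str], llm) -> list[str]:
    # Number the candidates so the judge can answer with indices only.
    numbered = "\n".join(f"[{i}] {c[:200]}" for i, c in enumerate(candidates))
    prompt = (
        f"Given the current goal: {goal}\n"
        "Which of these candidate chunks are actually required right now? "
        f"Return only a comma-separated list of indices.\n{numbered}"
    )
    reply = llm(prompt, max_tokens=20)
    keep = sorted({int(i) for i in reply.split(",") if i.strip().isdigit()})
    return [candidates[i] for i in keep if i < len(candidates)]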
  • Sliding-Window + Summary Refresh Loop. When we do approach the context limit (e.g., 100k+ on Grok-4 or Claude-3.5), we trigger a full-context refresh: the Summarizer-Agent rewrites the entire working memory into a 3,000-token “Executive Summary v2”, old episodic details are archived to the vector store only, and the new turn starts clean. We’ve run 120+ step investigations this way without ever hitting the wall.
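A sketch of the refresh trigger (the ~4-chars-per-token estimate, the archive callable, and the thresholds are assumptions; use a real tokenizer in production):

def maybe_refresh(working_memory: list[str], llm, archive,
                  limit_tokens: int = 100_000) -> list[str]:
    # Rough token estimate: ~4 characters per token.
    est_tokens = sum(len(m) for m in working_memory) // 4
    if est_tokens < limit_tokens:
        return working_memory
    # Archive the old episodic detail to the vector store, then rewrite
    # everything into a single ~3,000-token executive summary.
    archive(working_memory)
    summary = llm(
        "Rewrite the following working memory into a single ~3,000-token "
        "executive summary. Preserve all IOCs, hashes, and open questions.\n\n"
        + "\n\n".join(working_memory),
        max_tokens=4000,
    )
    return [summary]  # the new turn starts clean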
  • Model-Specific Context Budgeting. We assign models by context tolerance:
    Claude-3.5-Sonnet or Grok-4: up to 100k for the final Judge/Reporting agent
    Llama-3.1-70B or Mixtral-8x22B: 32k max for mid-workflow agents
    Haiku or Gemma-2-9B: <8k for summarizers and routers
The prompt literally includes: “You are running in a 32k context budget. Be concise.”
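A sketch of how those budgets can be wired in (role names and exact model strings are illustrative):

# role -> (model, max context tokens), mirroring the tiers above
CONTEXT_BUDGETS = {
    "judge_reporting": ("claude-3.5-sonnet", 100_000),
    "mid_workflow":    ("llama-3.1-70b",      32_000),
    "summarizer":      ("claude-3-haiku",       8_000),
}

def budget_preamble(role: str) -> str:
    model, budget = CONTEXT_BUDGETS[role]
    # State the budget explicitly so the model self-polices its verbosity.
    return f"You are running in a {budget // 1000}k context budget. Be concise."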

 
