Optimizing Context Windows in Agentic Loops

Summary

Optimizing context windows in agentic loops means carefully managing what information an AI agent keeps “in mind” as it works through tasks, making sure it has just the right details at each step without overwhelming its memory or losing important knowledge. This involves structuring, updating, and refining the agent’s context so it can make smarter decisions and avoid mistakes during complex, ongoing processes.

  • Organize and filter: Structure the agent’s context into clear categories like task goals, evidence, operational state, and rules, then filter out irrelevant or outdated information as tasks evolve.
  • Summarize and compress: Regularly condense context details by using summaries and pruning strategies, keeping only what’s crucial for current and future steps while reducing unnecessary clutter.
  • Store and retrieve smartly: Save important facts and results outside the active context, and use memory systems to recall relevant information only when the agent truly needs it during its workflow.
Summarized by AI based on LinkedIn member posts
  • Kumaran Ponnambalam

    AI / ML Leader & Author

    21,456 followers

    Have you heard of 𝗗𝘆𝗻𝗮𝗺𝗶𝗰 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗔𝘀𝘀𝗲𝗺𝗯𝗹𝘆 (𝗗𝗖𝗔) in 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴? It is the practice of constructing the model’s input ON DEMAND from multiple context sources, based on the user goal, current task state, tool outputs, risk level, and token budget.

    Static prompts assume the world is stable. Real systems aren’t, especially in agentic AI:
    1. Users change their mind mid-flight
    2. Tools return surprises
    3. Policies differ by tenant/workflow
    4. Long-horizon tasks need stepwise context, not one giant dump

    DCA turns context into a living artifact that evolves across turns and phases. Think of your context as a bundle with explicit compartments:
    1. 𝗧𝗮𝘀𝗸 𝗙𝗿𝗮𝗺𝗲 (goal, scope, constraints, definition of done, what’s in/out, timeline, rules)
    2. 𝗚𝗿𝗼𝘂𝗻𝗱𝗶𝗻𝗴 𝗘𝘃𝗶𝗱𝗲𝗻𝗰𝗲 (retrieved docs, structured records, citations, tool outputs; ranked, deduped, freshness-aware)
    3. 𝗢𝗽𝗲𝗿𝗮𝘁𝗶𝗼𝗻𝗮𝗹 𝗦𝘁𝗮𝘁𝗲 (plan, progress, decisions made, open questions, scratch summaries, checkpoints, rollback points)
    4. 𝗣𝗼𝗹𝗶𝗰𝘆 + 𝗚𝘂𝗮𝗿𝗱𝗿𝗮𝗶𝗹𝘀 (safety rules, compliance requirements, PII handling, tenant policies, workflow-specific rules)
    5. 𝗜𝗻𝘀𝘁𝗿𝘂𝗰𝘁𝗶𝗼𝗻 𝗟𝗮𝘆𝗲𝗿 (system/developer instructions, style, format contracts, schemas, validators)

    Dynamic context assembly is the orchestrator that decides what to include, at which step, how much, and in what order.

    𝗗𝗖𝗔 𝗕𝗲𝘀𝘁 𝗣𝗿𝗮𝗰𝘁𝗶𝗰𝗲𝘀
    1. Context is a contract where each compartment has a purpose and a budget
    2. Prefer structured snippets over raw text
    3. Always include provenance (source, timestamp, confidence) for grounding content
    4. Separate evidence from instructions
    5. Checkpoint summaries every N steps, so state does not rot over long horizons
    6. Make trimming deterministic
    7. Treat tool outputs as first-class context, but sanitize and normalize them

    𝗖𝗼𝗺𝗺𝗼𝗻 𝗮𝗻𝘁𝗶-𝗽𝗮𝘁𝘁𝗲𝗿𝗻𝘀
    1. Shoving more docs in without reason
    2. Mixing different compartments into one single blob
    3. Not managing agent state
    4. No freshness/authority scoring for grounding evidence

    #ContextEngineering #AIAgents #RAG #LLMOps #GenAIOps #AgenticAI #EnterpriseAI
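
To make the assembly step concrete, here is a minimal Python sketch of a compartmentalized context builder. It is illustrative only: the 4-characters-per-token estimate, the compartment fields, and the budget mechanics are assumptions, and a real DCA orchestrator would also score freshness and provenance.

```python
from dataclasses import dataclass, field

def estimate_tokens(text: str) -> int:
    # Crude heuristic (~4 chars/token); swap in a real tokenizer in practice.
    return len(text) // 4

@dataclass
class Compartment:
    name: str                       # e.g. "Task Frame", "Grounding Evidence"
    budget: int                     # max tokens this compartment may spend
    items: list[tuple[int, str]] = field(default_factory=list)  # (priority, text)

    def render(self) -> str:
        # Deterministic trimming: highest priority first, skip what won't fit.
        spent, kept = 0, []
        for _, text in sorted(self.items, key=lambda it: -it[0]):
            cost = estimate_tokens(text)
            if spent + cost <= self.budget:
                spent += cost
                kept.append(text)
        return f"## {self.name}\n" + "\n".join(kept)

def assemble(compartments: list[Compartment], total_budget: int) -> str:
    # A fixed compartment order keeps assembly reproducible across turns.
    prompt = "\n\n".join(c.render() for c in compartments)
    if estimate_tokens(prompt) > total_budget:
        raise ValueError("over total token budget; shrink compartment budgets")
    return prompt
```

Each compartment owning its own budget is what makes the "context is a contract" practice enforceable: overruns fail loudly at assembly time instead of silently crowding out evidence.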

  • Om Nalinde

    Building & Teaching AI Agents to Devs | CS @IIIT

    158,319 followers

    I used this guide to build 10+ AI agents. Here are my 10 actionable items:

    1. Turn your agent into a note-taking machine
    → Dump plans, decisions, and results into state objects outside the context window
    → Use scratchpad files or runtime state that persists during sessions
    → Stop cramming everything into messages; treat state like external storage

    2. Be ridiculously picky about what gets into context
    → Use embeddings to grab only the memories that matter for the current task
    → Keep simple rules files (like CLAUDE.md) that always load
    → Filter tool descriptions with RAG so agents aren't confused by irrelevant tools

    3. Build a memory system that remembers useful stuff
    → Create semantic, episodic, and procedural memory buckets for facts, experiences, and instructions
    → Use knowledge graphs when embeddings fail for relationship-based retrieval
    → Avoid ChatGPT's mistake of pulling random location data into unrelated requests

    4. Compress like your context window costs $1000 per token
    → Set auto-summarization at 95% context capacity, with no exceptions
    → Trim old messages with simple heuristics: keep recent, dump the middle
    → Post-process heavy tool outputs immediately; search results don't live forever

    5. Split your agent into specialized mini-agents
    → Give each sub-agent one job and its own isolated context window
    → Hand off context with quick summaries, not full message histories
    → Run sub-agents in parallel when possible for isolated exploration

    6. Sandbox the heavy stuff away from your LLM
    → Execute code in environments that isolate objects from context
    → Store images, files, and complex data outside the context window
    → Only pull summary info back; full objects stay in the sandbox

    7. Make summarization smart, not just chronological
    → Train models specifically for agent context compression
    → Preserve critical decision points while compressing routine chatter
    → Use different strategies for conversations vs. tool outputs

    8. Prune context like you're editing a novel
    → Implement trained pruners that understand relevance, not just recency
    → Filter based on task relevance while maintaining conversational flow
    → Adjust pruning aggressiveness based on task complexity

    9. Monitor token usage like a hawk
    → Track exactly where tokens burn in your agent pipeline
    → Set real-time alerts when context utilization hits dangerous levels
    → Build dashboards correlating context management with success rates

    10. Test everything or admit you're just guessing
    → A/B test different context strategies and measure performance differences
    → Create evaluation frameworks testing before/after context-engineering changes
    → Set up continuous feedback loops that auto-adjust context parameters

    Last but not least, be open to new ideas and keep learning.

    Check out 50+ AI Agent Tutorials on my profile 👋
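
Item 4's trimming heuristic (keep recent, dump the middle, summarize at 95% capacity) is simple enough to sketch. This is an illustrative Python version: `summarize` is a placeholder for an LLM call, and the 4-characters-per-token estimate stands in for a real tokenizer.

```python
def summarize(msgs: list[dict]) -> str:
    # Placeholder: a real implementation would call a summarization model.
    return f"[{len(msgs)} earlier messages compressed]"

def compact_history(messages: list[dict], max_tokens: int,
                    keep_head: int = 2, keep_tail: int = 8) -> list[dict]:
    """Keep the opening messages (system prompt, task frame) and the most
    recent turns; collapse everything in between into one summary message."""
    tokens = sum(len(m["content"]) // 4 for m in messages)  # crude estimate
    if tokens < 0.95 * max_tokens or len(messages) <= keep_head + keep_tail:
        return messages  # under the 95% trigger, or nothing left to trim
    head = messages[:keep_head]
    middle = messages[keep_head:-keep_tail]
    tail = messages[-keep_tail:]
    summary = {"role": "system", "content": summarize(middle)}
    return head + [summary] + tail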

  • Elvis S.

    Founder at DAIR.AI | Angel Investor | Advisor | Prev: Meta AI, Galactica LLM, Elastic, Ph.D. | Serving 7M+ learners around the world

    85,587 followers

    Anthropic just posted another banger guide. This one is on building more efficient agents that handle more tools with efficient token usage. This is a must-read for AI devs (bookmark it).

    It helps with three major issues in AI agent tool calling: token costs, latency, and tool composition. How? It combines code execution with MCP, turning MCP servers into code APIs rather than direct tool calls. Here is all you need to know:

    1. Token Efficiency Problem: Loading all MCP tool definitions upfront and passing intermediate results through the context window creates massive token overhead, sometimes 150,000+ tokens for complex multi-tool workflows.
    2. Code-as-API Approach: Instead of direct tool calls, present MCP servers as code APIs (e.g., TypeScript modules) that agents can import and call programmatically, reducing the example workflow from 150k to 2k tokens (98.7% savings).
    3. Progressive Tool Discovery: Use filesystem exploration or search_tools functions to load only the tool definitions needed for the current task, rather than loading everything upfront into context. This solves so many context-rot and token-overload problems.
    4. In-Environment Data Processing: Filter, transform, and aggregate data within the code execution environment before passing results to the model, e.g., filter 10,000 spreadsheet rows down to the 5 relevant ones.
    5. Better Control Flow: Implement loops, conditionals, and error handling with native code constructs rather than chaining individual tool calls through the agent, reducing latency and token consumption.
    6. Privacy: Sensitive data can flow through workflows without entering the model's context; only explicitly logged/returned values are visible, with optional automatic PII tokenization.
    7. State Persistence: Agents can save intermediate results to files and resume work later, enabling long-running tasks and incremental progress tracking.
    8. Reusable Skills: Agents can save working code as reusable functions (with SKILL.md documentation), building a library of higher-level capabilities over time.

    This approach is complex and not perfect, but it should enhance the efficiency and accuracy of your AI agents across the board.

    anthropic.com/engineering/code-execution-with-mcp
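
A toy Python version of points 2 and 4, the code-as-API plus in-environment filtering idea (Anthropic's own examples use TypeScript; `get_rows`, `overdue_invoices`, and the column names here are invented stand-ins for a wrapped MCP server):

```python
import csv
import io

def get_rows(sheet_csv: str) -> list[dict]:
    # Stand-in for an MCP spreadsheet server exposed as an importable code API
    # instead of a direct tool call whose full output lands in context.
    return list(csv.DictReader(io.StringIO(sheet_csv)))

def overdue_invoices(sheet_csv: str, limit: int = 5) -> list[dict]:
    # All rows stay inside the code execution environment; only the filtered
    # handful is returned to the model (e.g. 10,000 rows in, 5 rows out).
    rows = get_rows(sheet_csv)
    return [r for r in rows if r.get("status") == "overdue"][:limit]
```

The point is that the intermediate 10,000-row result never transits the context window; only the function's return value does.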

  • Philipp Schmid

    AI Developer Experience at Google DeepMind 🔵 prev: Tech Lead at Hugging Face, AWS ML Hero 🤗 Sharing my own views and AI News

    165,274 followers

    Is ACE the next context engineering technique? ACE (Agentic Context Engineering) is a new framework that beats current state-of-the-art optimizers like GEPA by treating context as an evolving, structured space of accumulated knowledge.

    What is ACE?
    ACE treats context as an evolving space rather than a static prompt. Instead of rewriting the entire context, it manages it as a collection of discrete, structured items (strategies, code snippets, error handlers) that are incrementally accumulated, refined, and organized over time based on performance feedback.

    ACE vs. GEPA (current SOTA)
    GEPA (Genetic-Pareto) is a popular method that uses evolutionary algorithms to iteratively rewrite and optimize prompts for brevity and general performance, but it can suffer from "brevity bias" and "context collapse", erasing the specific, detailed heuristics needed for complex domain tasks. ACE builds a comprehensive context instead. It prioritizes retaining detailed domain insights and uses non-LLM logic to manage context growth, ensuring that hard-learned constraints and edge-case strategies are preserved rather than summarized away.

    How it works:
    1️⃣ Three components: a Generator (to solve tasks), a Reflector (to analyze outcomes), and a Curator (to manage the context).
    2️⃣ The Generator attempts a task using the current context, producing a reasoning trajectory and environment feedback (e.g., code execution results).
    3️⃣ The Reflector critiques the outcome to extract concrete insights, identifying successful tactics or root causes of errors.
    4️⃣ The Curator synthesizes these into structured, itemized "delta" entries (specific additions or edits to knowledge bullets).
    5️⃣ These delta updates are merged programmatically into the context, so the context grows and refines incrementally for the next task.

    Insights:
    💡 GEPA optimizes for concise prompts; ACE prioritizes comprehensive, detailed context.
    📈 ACE outperformed baselines by +10.6% on agentic benchmarks and +8.6% on complex financial reasoning.
    📚 ACE's incremental "delta" update approach reduced adaptation latency by an average of 86.9% compared to methods that rewrite full prompts.
    📝 The Generator, Reflector, and Curator prompts are in the paper appendix.

    Paper: https://lnkd.in/eBknvYcR
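
Purely as an illustration of step 5️⃣, here is a hedged sketch of programmatic delta merging. The delta shape ("add"/"edit"/"vote" operations on knowledge bullets) is my assumption for readability, not the paper's exact format; see the paper appendix for the real prompts.

```python
from dataclasses import dataclass

@dataclass
class Bullet:
    text: str          # a strategy, code snippet, or error handler
    helpful: int = 0   # feedback counters the Curator can act on later
    harmful: int = 0

def apply_deltas(context: dict[str, Bullet], deltas: list[dict]) -> None:
    """Merge Curator deltas in place instead of rewriting the whole context;
    localized updates are what avoid "context collapse" and keep latency low."""
    for d in deltas:
        op, key = d["op"], d["id"]
        if op == "add" and key not in context:
            context[key] = Bullet(d["text"])
        elif op == "edit" and key in context:
            context[key].text = d["text"]
        elif op == "vote" and key in context:
            if d["helpful"]:
                context[key].helpful += 1
            else:
                context[key].harmful += 1
```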

  • Adam Chan

    Bringing developers together to build epic projects with epic tools!

    10,324 followers

    Stop worshipping prompts. Start engineering the CONTEXT.

    If the LLM sounds smart but generates nonsense, that’s not really “hallucination” anymore. It’s the incomplete context you feed it, which is (most of the time) unstructured, stale, or missing the things that mattered. Context isn’t just the icing anymore; it’s the whole damn CAKE that makes or breaks modern AI apps.

    We’re seeing a shift: RAG gave models a library card, and now context engineering teaches them what to pull, when to pull it, and how to use it without polluting the context window. The most effective systems today are modular, with retrieval, memory, and tool use working together seamlessly.

    What a modern context-engineered system looks like:
    • Working memory: the last few turns and interim tool results needed right now.
    • Long-term memory: user preferences, prior outcomes, and facts stored in vector stores, referenced when useful.
    • Dynamic retrieval: query rewriting, reranking, and compression before anything hits the context window.
    • Tools as first-class citizens: APIs, search, MCP servers, etc., invoked when necessary.

    𝐄𝐱𝐚𝐦𝐩𝐥𝐞: In an AI coding agent, working memory holds the latest compiler errors and recent changes, while long-term memory stores project dependencies and indexed files. Tools fetch API documentation and run web searches when knowledge falls short. The result is faster, more accurate code without hallucinations.

    So, if you’re building smart agents today, do this:
    • Start with retrieval quality: query rewriting, rerankers, and context compression before the LLM sees anything.
    • Separate memories: working (short-term) vs. long-term, and write back only distilled facts (not entire transcripts) to long-term memory.
    • Treat tools like sensors: call them when evidence is missing. Never assume the model just “knows” everything.
    • Make the context contract explicit: schemas for tools/outputs and lightweight, enforceable system rules.

    The good news: your existing RAG stack isn’t made obsolete by these principles; it is the foundation. The difference now is orchestration: curating the smallest, sharpest slice of context the model needs to fulfill its job… no more, no less.

    So, if the model’s output is off, don’t just rewrite the prompt. Review and fix the context, then watch the model act like it finally understands the assignment!
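
The "dynamic retrieval" bullet compresses three steps into one line, so here is a minimal sketch of that pipeline order (rewrite, retrieve, rerank, compress). `store.search`, the naive term-overlap reranker, and the truncation step are stand-ins for a real vector store, a cross-encoder, and a summarizer.

```python
def rerank(query: str, docs: list[str]) -> list[str]:
    # Placeholder scoring by term overlap; use a cross-encoder in practice.
    terms = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(terms & set(d.lower().split())))

def build_context(query: str, history: list[str], store,
                  top_k: int = 20, keep: int = 5) -> str:
    # 1. Query rewriting: fold recent turns into a standalone query.
    rewritten = query + " | recent context: " + " ".join(history[-3:])
    # 2. Retrieve a wide candidate set, then rerank down to what matters.
    candidates = store.search(rewritten, top_k=top_k)
    best = rerank(rewritten, candidates)[:keep]
    # 3. Compress before anything hits the context window (crude truncation
    #    here; summarization would be the real choice).
    return "\n\n".join(doc[:500] for doc in best)
```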

  • Anthony Alcaraz

    GTM Agentic Engineering @AWS | Author of Agentic Graph RAG (O’Reilly) | Business Angel

    46,791 followers

    To build effective agents, you need sophisticated context engineering. But to achieve sophisticated context engineering at scale, you need agentic systems managing that context ⁉️

    Everyone assumes larger context windows solve the problem. They don't. Transformers have an n² attention problem: every token attends to every other token. As context grows, the model's ability to capture these pairwise relationships gets stretched thin.

    Why manual curation fails at scale
    Consider a real agent workflow: a multi-hour codebase migration, complex research synthesis, or financial analysis across hundreds of documents. Your agent generates:
    → Thousands of tool outputs
    → Multi-step reasoning chains
    → Execution traces with success/failure signals
    → Architectural decisions and dependencies
    → Domain-specific heuristics discovered through trial and error
    A human cannot process this velocity of information and make real-time decisions about what to compress, persist to memory, or discard. The cognitive load exceeds human reaction-time capabilities.

    The agentic context engineering solution
    Research from Stanford's ACE (Agentic Context Engineering) framework shows this approach works in production. It uses a three-agent architecture:
    Generator: produces reasoning trajectories and surfaces effective strategies
    Reflector: critiques execution traces to extract concrete lessons
    Curator: synthesizes updates into structured, itemized contexts
    Results: a 10.6% improvement on agent benchmarks and 8.6% on domain-specific tasks, matching IBM's production-level system while using smaller open-source models.

    The technical mechanisms that matter
    Three core techniques emerged across all the research:
    1️⃣ Incremental delta updates: instead of rewriting entire contexts (which causes "context collapse"), use structured bullets with metadata and update only the relevant sections. ACE reduced adaptation latency by 87% using this approach.
    2️⃣ Just-in-time retrieval: don't pre-load everything. Agents maintain lightweight identifiers (file paths, graph entity IDs) and dynamically load data using tools. Anthropic's Claude Code demonstrates this: it uses commands like head, tail, and grep to analyze large datasets without loading full objects into context.
    3️⃣ Grow-and-refine with de-duplication: let contexts expand adaptively while using semantic embeddings to prune redundancy. This prevents both information loss and context bloat.

    GEPA (Genetic-Pareto prompt evolution) demonstrates reflective optimization: an agent analyzes execution traces, identifies which context elements were useful or misleading, and autonomously proposes improvements. It achieved 10-19% better performance than reinforcement learning while using 35× fewer rollouts.

    Knowledge graphs are essentially pre-computed indexes of high-signal relationships. Instead of hoping an LLM extracts relationships from unstructured text in context, you make them explicit and queryable.
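
Mechanism 3️⃣ is the easiest to make concrete. A hedged Python sketch of grow-and-refine de-duplication, where `embed` is any sentence-embedding function and the 0.9 similarity threshold is illustrative rather than from the research:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.hypot(*a), math.hypot(*b)
    return dot / (na * nb) if na and nb else 0.0

def grow_and_refine(context: list[str], new_items: list[str], embed,
                    threshold: float = 0.9) -> list[str]:
    """Let the context expand, but drop incoming items that are near-duplicates
    of something already present: no information loss, no context bloat."""
    kept = list(context)
    vecs = [embed(c) for c in kept]
    for item in new_items:
        v = embed(item)
        if all(cosine(v, u) < threshold for u in vecs):
            kept.append(item)
            vecs.append(v)
    return kept
```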

  • Benedikt Stemmildt 👨🏼‍💻🧙🏼‍♂️

    Help Engineering Teams Thrive with AI | Faster Delivery. Better Code. Fulfilled Teams. | 20+ Years CTO/Architect | Transform Scattered Adoption into Systematic Practices | Speaker with 40+ Conference Talks

    6,720 followers

    Most engineers treat AI context windows like infinite RAM. Your agent fails not because the model is bad, but because you're flooding 200K tokens with noise and wondering why it hallucinates.

    After building agentic systems for production teams, I've learned: 𝗔 𝗳𝗼𝗰𝘂𝘀𝗲𝗱 𝗮𝗴𝗲𝗻𝘁 𝗶𝘀 𝗮 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝘁 𝗮𝗴𝗲𝗻𝘁. Context engineering isn't about cramming more information in. It's about systematic management of what goes in and what stays out.

    𝗧𝗵𝗲 𝗥𝗲𝗱𝘂𝗰𝗲 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝘆: 𝗦𝘁𝗼𝗽 𝗪𝗮𝘀𝘁𝗶𝗻𝗴 𝗧𝗼𝗸𝗲𝗻𝘀

    𝗧𝗵𝗲 𝗠𝗖𝗣 𝗦𝗲𝗿𝘃𝗲𝗿 𝗧𝗿𝗮𝗽: Most teams load every MCP server by default. I've seen 24,000+ tokens (12% of the context window) wasted on tools the agent never uses.

    𝗧𝗵𝗲 𝗙𝗶𝘅:
    • Delete your default MCP.json file
    • Load MCP servers explicitly per task
    • Measure token cost before adding anything permanent
    This one change saves 20,000+ tokens instantly.

    𝗧𝗵𝗲 𝗖𝗟𝗔𝗨𝗗𝗘.𝗺𝗱 𝗣𝗿𝗼𝗯𝗹𝗲𝗺: Teams build massive memory files that grow forever: 23,000 tokens of "always loaded" context that's 70% irrelevant to the current task.

    𝗧𝗵𝗲 𝗦𝗼𝗹𝘂𝘁𝗶𝗼𝗻:
    • Shrink CLAUDE.md to the absolute universal essentials only
    • Build `/prime` commands for different task types
    • Load context dynamically based on what you're actually doing

    𝗘𝘅𝗮𝗺𝗽𝗹𝗲:
```
/prime-bug      → Bug investigation context
/prime-feature  → Feature development context
/prime-refactor → Refactoring-specific context
```
    Dynamic context beats static memory every time.

    𝗧𝗵𝗲 𝗠𝗲𝗻𝘁𝗮𝗹 𝗠𝗼𝗱𝗲𝗹 𝗦𝗵𝗶𝗳𝘁
    Stop thinking: "How do I get more context in?"
    Start thinking: "How do I keep irrelevant context out?"

    𝗪𝗵𝗮𝘁 𝗦𝗲𝗽𝗮𝗿𝗮𝘁𝗲𝘀 𝗪𝗶𝗻𝗻𝗲𝗿𝘀 𝗳𝗿𝗼𝗺 𝗟𝗼𝘀𝗲𝗿𝘀:
    ✓ Winners: Measure token usage per agent operation
    ✗ Losers: "Just throw everything in the context"
    ✓ Winners: Design context architecture before writing prompts
    ✗ Losers: Keep adding to CLAUDE.md when agents fail

    Your agent's intelligence ceiling is your context management ceiling.

    ---

    What's the biggest waste of tokens in your AI setup right now?

    #ContextEngineering #AgenticEngineering #AIAgents #DeveloperProductivity #SoftwareArchitecture

    [Human Generated, Human Approved]
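
"Measure token cost before adding anything permanent" can be as simple as an audit script. A sketch, assuming the usual rough 4-characters-per-token estimate and a 200K window:

```python
def audit_static_context(components: dict[str, str], window: int = 200_000) -> None:
    """Print what each always-loaded piece (CLAUDE.md, MCP tool definitions,
    rules files) costs as a share of the context window."""
    costs = {name: len(text) // 4 for name, text in components.items()}
    for name, tok in sorted(costs.items(), key=lambda kv: -kv[1]):
        print(f"{name:<28} {tok:>7,} tokens  {100 * tok / window:5.1f}% of window")
    total = sum(costs.values())
    print(f"{'TOTAL':<28} {total:>7,} tokens  {100 * total / window:5.1f}% of window")
```

Run it over everything that loads on every session; anything costing whole percentage points of the window should have to justify itself per task.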

  • Sohrab Rahimi

    Director, AI/ML Lead @ Google

    23,609 followers

    In most of today’s agentic systems, the real performance ceiling comes from memory. Agents repeatedly process the same context, treat each user turn as an isolated event, and attempt to update long histories while still generating responses. This constant rewriting increases latency, inflates token use, and often produces contradictory or incomplete recollections across sessions.

    A recent paper named LightMem addresses this problem directly. The authors argue that the issue is not the capacity of the memory but the timing of its operations. They draw inspiration from how human memory functions, separating immediate perception, short-term processing, and long-term consolidation into distinct stages. The architecture includes three layers:

    1. Sensory memory compresses and filters incoming tokens before reasoning. Only high-value information is retained, and utterances are grouped by topic using a hybrid of attention-based segmentation and semantic similarity.
    2. Short-term memory captures these topic segments as structured entries containing summaries, embeddings, and the original turns. This organization preserves coherence and prevents unrelated discussions from blending together.
    3. Long-term memory performs consolidation later, when the system is idle. It merges related entries in parallel and applies temporal constraints so that later interactions cannot overwrite earlier ones.

    This design decouples memory management from inference. The agent no longer spends computation cycles updating its knowledge while responding to the user. Instead, it records new information efficiently and reconciles it later in a controlled, asynchronous phase.

    On the LONGMEMEVAL benchmark, LightMem improves question-answering accuracy by up to 9.6 percentage points while reducing token usage by more than thirtyfold and API calls by an order of magnitude. Runtime is reduced by nearly twelve times.

    The implementation steps are pragmatic:
    1. Add a lightweight pre-compression stage before any retrieval or summarization to filter low-value tokens and reduce input size.
    2. Segment conversations by topic boundaries rather than by fixed windows so that related turns remain grouped and context stays coherent.
    3. Store compact summaries and corresponding embeddings for each topic instead of keeping full transcripts.
    4. Move consolidation to a background process that merges similar entries in batches, using timestamps or semantic similarity to maintain consistency and prevent overwriting earlier information.

    As memory becomes structured and asynchronous, the need for ever-longer context windows may diminish. The priority may shift from expanding capacity to improving precision in how information is filtered, stored, and reconciled over time. Perhaps the next frontier in performance will come not from larger models but from better memory architecture.
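
A hedged sketch of layers 1 and 3 in Python, under stated assumptions: the function names, the entry shape, and the similarity threshold are mine, not the paper's; `similar`, `should_merge`, and `combine` stand in for its attention/embedding-based components.

```python
def segment_by_topic(turns: list[str], similar,
                     threshold: float = 0.5) -> list[list[str]]:
    """Group consecutive turns into topic segments: start a new segment when
    similarity to the previous turn drops below the threshold."""
    if not turns:
        return []
    segments, current = [], [turns[0]]
    for prev, turn in zip(turns, turns[1:]):
        if similar(prev, turn) >= threshold:
            current.append(turn)
        else:
            segments.append(current)
            current = [turn]
    segments.append(current)
    return segments

def consolidate(entries: list[dict], should_merge, combine) -> list[dict]:
    """Idle-time batch consolidation: process entries oldest-first so later
    interactions cannot overwrite earlier ones (the temporal constraint)."""
    merged: list[dict] = []
    for entry in sorted(entries, key=lambda e: e["timestamp"]):
        if merged and should_merge(merged[-1], entry):
            merged[-1] = combine(merged[-1], entry)
        else:
            merged.append(entry)
    return merged
```

The key design choice is that `consolidate` runs in a background phase, never on the inference path.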

  • Bijit Ghosh

    CTO | CAIO | Leading AI/ML, Data & Digital Transformation

    10,438 followers

    If you’ve ever tried pushing an agent to execute multi-hour, multi-day tasks (full-stack app builds, scientific workflows, multi-step reasoning pipelines), you’ve likely run into the same paradox: even frontier models collapse across context windows, because the system around them isn’t engineered for continuity. It’s an architectural issue more than a token-limit issue.

    Long-running agents operate as discrete, stateless episodes. Every new session begins with zero latent state, zero gradient memory, and zero structured priors. We’re essentially reinitializing the cognitive graph on every iteration and expecting stable policy continuation. That never works. The two canonical failure modes show up instantly:
    1. Over-projection: the agent tries to “one-shot” an entire task graph, exhausting the window and leaving an incoherent intermediate state.
    2. False convergence: a fresh session perceives partial structure and prematurely emits a terminal policy (“job complete”) without validating task coverage.

    The solution isn’t larger context windows; it’s rethinking the persistent substrate that agents operate on. The architecture that actually works, and one I've validated in production:
    1. An initializer agent that expands the prompt into a fully enumerated requirement graph (JSON > Markdown), seeds the workspace, normalizes the directory structure, and writes canonical artifacts (progress logs, feature DAGs, init scripts, baseline tests).
    2. A coding agent constrained to single-feature deltas, with strict invariants: deterministic diffs, merge-safe end states, self-verification, and atomic E2E validation using MCP-backed tools (browser automation, execution sandboxes, environment introspection).
    3. Externalized memory via immutable artifacts: git history as temporal state, JSON features as state machines, test harnesses as behavioral oracles.
    4. A “bootstrap ritual” for every session: inspect state → evaluate divergence → reproduce environment → validate invariants → only then execute policy.
    5. Mandatory environment hygiene to prevent compounding state entropy.

    This is the only pattern that consistently turns agents from episodic pattern generators into long-horizon, state-aligned autonomous workers.
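
Step 4's "bootstrap ritual" maps naturally onto a session preamble. A sketch under stated assumptions: the artifact name features.json and the pytest harness are illustrative choices, not prescribed by the post.

```python
import json
import subprocess

def bootstrap_session(workspace: str) -> dict:
    """Inspect state -> evaluate divergence -> validate invariants -> only
    then hand work to the coding agent."""
    # 1. Inspect persisted state: the feature list is the cross-session memory.
    with open(f"{workspace}/features.json") as f:
        features = json.load(f)

    # 2. Evaluate divergence: git history serves as the temporal state record.
    head = subprocess.run(["git", "-C", workspace, "rev-parse", "HEAD"],
                          capture_output=True, text=True).stdout.strip()

    # 3. Validate invariants: the test harness is the behavioral oracle;
    #    refuse to start new work from a broken baseline.
    if subprocess.run(["pytest", "-q"], cwd=workspace).returncode != 0:
        raise RuntimeError("baseline tests failing; repair before new work")

    # 4. Constrain scope: exactly one incomplete feature per episode.
    todo = [f for f in features if f.get("status") != "done"]
    return {"head": head, "next_feature": todo[0] if todo else None}
```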

  • Animesh Koratana

    PlayerZero CEO

    7,739 followers

    The time horizon of agentic tasks is doubling every ~7 months. Here's what's actually moving the needle for us:

    1. Context engineering: Ship a tight task contract (goal, constraints, checks, schema), blend retrieval (semantic + keyword + recency), and compact ruthlessly with ~20% token headroom.
    2. Governance & approvals: Use risk-tiered gates (self-approve low, escalate high), keep auditability with replayable diffs, and encode guardrails (scopes, rate limits, sandboxes) with instant rollback.
    3. Keeping agent focus: Inject a tiny per-loop <system-reminder> (goal, schema, tools, cite-or-unknown, stop) and checkpoint/verify on cadence so small errors don’t compound into mission drift.

    Why this works: Long tasks rarely fail from “not smart enough”; they fail from context drift and error compounding. A crisp contract plus lightweight governance keeps the system honest, and a relentless focus cue keeps the model on mission at minute 47 the same way it was at minute 1.

    Build these three, and you don’t just stretch the time horizon; you make it reliable.
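
A minimal sketch of item 3 combined with item 1's ~20% headroom rule. The reminder wording, the message shape, and the `compact` placeholder are illustrative assumptions, not the author's implementation.

```python
REMINDER = ("<system-reminder>Goal: {goal}. Output must match schema: {schema}. "
            "Use only the listed tools. Cite sources or answer 'unknown'. "
            "Stop when the checks pass.</system-reminder>")

def compact(messages: list[dict]) -> list[dict]:
    # Placeholder: keep the opening message and the most recent turns.
    return messages[:1] + messages[-10:]

def next_turn(messages: list[dict], goal: str, schema: str,
              window: int = 200_000) -> list[dict]:
    """Re-inject a tiny focus cue every loop and compact before the window
    fills, so minute 47 looks like minute 1."""
    used = sum(len(m["content"]) // 4 for m in messages)  # crude token estimate
    if used > 0.8 * window:           # enforce ~20% token headroom
        messages = compact(messages)
    reminder = {"role": "system",
                "content": REMINDER.format(goal=goal, schema=schema)}
    return messages + [reminder]
```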
