I've been building RAG systems wrong this whole time. Turns out we're processing 90% irrelevant information, and REFRAG fixes this massive inefficiency.

Most retrieved passages in your RAG are irrelevant, containing sparse or redundant information. You search the user's query and dump all the results into the LLM's long context. Good LLMs can still get the answer, but you're wasting money and time.

REFRAG, a new approach from a Meta research team, proposes a different solution:
1. Create a single compressed embedding for each chunk
2. Use the compressed chunk embeddings as inputs to your LLM, in place of token embeddings
3. Let a lightweight RL policy select the most relevant chunks to keep whole

So instead of ~100 token embeddings per chunk across 50 chunks, most chunks are represented by a single vector each.

This approach requires training both the embedding model and the LLM to adjust to the new input format. The authors use 𝘤𝘶𝘳𝘳𝘪𝘤𝘶𝘭𝘶𝘮 𝘭𝘦𝘢𝘳𝘯𝘪𝘯𝘨, starting with single-chunk reconstruction and gradually increasing complexity.

𝗞𝗲𝘆 𝗥𝗲𝘀𝘂𝗹𝘁𝘀
The numbers are staggering:
• 30.85× 𝘧𝘢𝘴𝘵𝘦𝘳 time-to-first-token (3.75× better than the previous SOTA)
• 16× 𝘭𝘢𝘳𝘨𝘦𝘳 context windows
• 𝘡𝘦𝘳𝘰 𝘢𝘤𝘤𝘶𝘳𝘢𝘤𝘺 𝘭𝘰𝘴𝘴 across RAG, multi-turn conversations, and summarisation
• Outperforms LLaMA on 16 RAG tasks while using 2-4× fewer decoder tokens

Imagine processing 16× more context at 30× the speed! I'll definitely be trying this once the open-source code gets released. Until then, read the paper here: https://lnkd.in/eW64kQf8
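To make the core trick concrete, here is a minimal sketch of the shape reduction REFRAG relies on: a chunk of many token embeddings becomes a single chunk vector. The paper trains a dedicated chunk encoder for this; the mean pooling below is only a stand-in to illustrate the idea, not Meta's actual method.

```python
import numpy as np

def compress_chunk(token_embeddings: np.ndarray) -> np.ndarray:
    """Collapse a (num_tokens, dim) matrix into one (dim,) chunk vector."""
    # Stand-in for REFRAG's trained chunk encoder.
    return token_embeddings.mean(axis=0)

# 50 chunks x ~100 tokens x 512 dims -> 50 vectors instead of ~5,000
chunks = [np.random.rand(100, 512) for _ in range(50)]
compressed = np.stack([compress_chunk(c) for c in chunks])
print(compressed.shape)  # (50, 512): one embedding per chunk instead of one per token
```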
Context Laundering in Large Language Model Workflows
Summary
Context laundering in large language model workflows refers to the process of unintentionally introducing irrelevant, outdated, or misleading information into an AI’s memory and reasoning chain, which can degrade performance and reliability. As large language models handle longer conversations or task histories, managing what information gets carried forward becomes crucial to avoid confusion and maintain trustworthy results.
- Streamline information: Only carry forward the most relevant and recent details, dropping outdated or redundant data to prevent confusion in AI reasoning.
- Build structured context: Use clear documentation and task breakdowns so each AI step accesses precise, purposeful information rather than an overloaded history.
- Apply deliberate cleanup: Regularly remove or refresh stored context to prevent memory rot and minimize the risk of errors spreading through workflows.
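As a rough illustration of these three habits (not tied to any particular framework), a cleanup pass before each model call might look like the sketch below; the entry fields `text`, `timestamp`, and `relevance` are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone

def streamline_context(entries, max_age_hours=24, min_relevance=0.5):
    """Keep only recent, relevant, non-duplicate entries for the next model call."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)
    seen_texts, kept = set(), []
    for entry in entries:  # entries: dicts with timezone-aware "timestamp" values
        if entry["timestamp"] < cutoff:          # drop outdated details
            continue
        if entry["relevance"] < min_relevance:   # drop low-signal details
            continue
        if entry["text"] in seen_texts:          # drop redundant duplicates
            continue
        seen_texts.add(entry["text"])
        kept.append(entry)
    return kept
```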
-
Something I learnt the hard way: a bigger context window ≠ a smarter AI agent. In fact, it often makes it worse.

I've seen this mistake in multiple builds. The team upgrades the model, gets a massive context window, then starts dumping everything into it. Logs. History. Documents. Old prompts. Feels powerful. Until the agent starts behaving... strangely.

Let's look at what's actually happening. This is something we don't talk about enough: Context Rot. The model isn't thinking more. It just gets distracted. It spends its attention on irrelevant noise and falls back to repeating past patterns instead of reasoning fresh.

Context Poisoning is even worse. One slightly wrong piece of information sneaks in, and now every downstream step builds on it. Silently. That's when production systems start giving confidently wrong outputs.

Here's the ground reality: memory is not a model problem. It's a systems design problem. If you're building agentic workflows, this is where things usually break, and how to fix them:

1. RAG is NOT memory: RAG is like a reference. Static. Read-only. But agent memory needs to evolve, write back, and remember user-specific context over time. If your system can't do that, it's not memory. It's lookup.

2. If you don't forget, you will fail: Most teams design memory like a warehouse. Store everything. Bad idea. What you need is TTL (time-based eviction), recency weighting, and active cleanup. Otherwise your agent starts reasoning on outdated information.

3. Similarity alone is not enough: Most pipelines rely only on semantic similarity. But in real systems, relevance = similarity + recency + importance. Without this, similar-sounding but unimportant details surface over critical facts, and the agent drifts. (See the sketch after this post.)

4. The weird one (but real): This surprised even me. When context is perfectly structured, models sometimes get distracted. But when information is slightly shuffled, they actually retrieve facts better. Not intuitive, but it shows how fragile attention really is.

We treat the context window like storage. It's not. It's a scarce cognitive resource. Don't let your agents hoard data. Give them space to think.

I'd like to know how you are solving this. Are you building custom memory systems or relying on frameworks like LangGraph / Redis and adapting them?
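A minimal sketch of points 2 and 3 above: time-based eviction plus a relevance score that blends similarity, recency, and importance. The weights, decay constant, and field names are illustrative assumptions, not any framework's defaults.

```python
import math
import time

TTL_SECONDS = 7 * 24 * 3600  # time-based eviction: forget memories older than a week

def still_alive(memory: dict, now: float) -> bool:
    return (now - memory["created_at"]) < TTL_SECONDS

def relevance(memory: dict, similarity: float, now: float,
              w_sim: float = 0.6, w_rec: float = 0.25, w_imp: float = 0.15) -> float:
    """relevance = similarity + recency + importance, as a weighted blend."""
    age_hours = (now - memory["created_at"]) / 3600
    recency = math.exp(-age_hours / 72.0)  # smooth decay over a few days
    return w_sim * similarity + w_rec * recency + w_imp * memory["importance"]

def retrieve(memories: list, similarities: list, k: int = 5) -> list:
    now = time.time()
    alive = [(m, s) for m, s in zip(memories, similarities) if still_alive(m, now)]
    ranked = sorted(alive, key=lambda pair: relevance(pair[0], pair[1], now), reverse=True)
    return [m for m, _ in ranked[:k]]
```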
-
One of the biggest challenges I see with scaling LLM agents isn't the model itself. It's context. Agents break down not because they "can't think" but because they lose track of what's happened, what's been decided, and why.

Here's the pattern I notice:
👉 For short tasks, things work fine. The agent remembers the conversation so far, does its subtasks, and pulls everything together reliably.
👉 But the moment the task gets longer, the context window fills up, and the agent starts forgetting key decisions. That's when results become inconsistent, and trust breaks down.

That's where Context Engineering comes in.

🔑 Principle 1: Share Full Context, Not Just Results
Reliability starts with transparency. If an agent only shares the final outputs of subtasks, the decision-making trail is lost. That makes it impossible to debug or reproduce. You need the full trace, not just the answer.

🔑 Principle 2: Every Action Is an Implicit Decision
Every step in a workflow isn't just "doing the work", it's making a decision. And if those decisions conflict because context was lost along the way, you end up with unreliable results.

✨ The solution to this is to "Engineer Smarter Context"
It's not about dumping more history into the next step. It's about carrying forward the right pieces of context:
→ Summarize the messy details into something digestible.
→ Keep the key decisions and turning points visible.
→ Drop the noise that doesn't matter.

When you do this well, agents can finally handle longer, more complex workflows without falling apart. Reliability doesn't come from bigger context windows. It comes from smarter context windows.

〰️〰️〰️
Follow me (Aishwarya Srinivasan) for more AI insight and subscribe to my Substack to find more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg
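One way to picture this in code: each subtask could hand its successor a compact record carrying a summary, the key decisions, and a pointer to the full trace, rather than the whole history. The structure and field names below are assumptions for illustration, not a framework API.

```python
from dataclasses import dataclass, field

@dataclass
class StepHandoff:
    summary: str      # messy details distilled for the next step
    decisions: list   # key decisions and turning points, kept visible
    trace_id: str     # pointer to the full trace for debugging/reproduction
    artifacts: dict = field(default_factory=dict)

def build_handoff(step_output: dict) -> StepHandoff:
    """Carry forward the right pieces of context instead of the whole history."""
    return StepHandoff(
        summary=step_output["summary"],
        decisions=step_output.get("decisions", []),
        trace_id=step_output["trace_id"],
        artifacts={k: v for k, v in step_output.items()
                   if k not in {"summary", "decisions", "trace_id", "raw_log"}},
    )
```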
-
The last week was full of learning and discussions with the AI research community at NeurIPS, where Prof Shivani Shukla and I presented two papers that challenge how we think about deploying generative and agentic AI systems in a secure and safe manner. After months of rigorous research and experimentation, our research group was delighted to have shared findings that bridge critical gaps in our understanding of LLM behavior and human-AI collaboration, in the following two papers/posters:

1. Security Knowledge Dilution in Large Language Models
Paper: https://lnkd.in/dPkPtCRD, for the workshop Deep Learning for Code in Agentic Era (https://lnkd.in/eMpGGAwg)
Our controlled study of 400 experiments revealed a striking finding: LLMs experience a 47% degradation in security expertise when exposed to large volumes of irrelevant context. This has profound implications for deploying AI systems in security-critical environments where context windows are flooded with operational data.

2. A Stochastic Differential Equation Framework for Multi-Objective LLM Interactions
Paper: https://lnkd.in/dQEPpGmV, for the workshop DynaFront: Dynamics at the Frontiers of Optimization, Sampling, and Games (https://lnkd.in/eAJK52Bb)
Presenting our mathematical framework for understanding how language models navigate competing objectives in real-time interactions, essential for building robust agentic AI systems that can balance multiple constraints simultaneously.

These aren't just academic exercises. As we deploy increasingly autonomous AI agents in enterprise environments, understanding how context affects domain expertise and how models reconcile competing objectives becomes mission-critical for responsible AI deployment. The conversations at NeurIPS pushed us to think harder about building systems that are not just powerful, but reliably safe and effective at scale.

Grateful to everyone who engaged with our work and challenged our assumptions; that's where the real breakthroughs happen. For those building agentic AI solutions: how are you addressing context management and multi-objective optimization in your deployments? These challenges are only growing as we scale.

#NeurIPS2025 #AIResearch #AgenticAI #AIGovernance #MachineLearning #ResponsibleAI
-
𝗔𝗜-𝗔𝘀𝘀𝗶𝘀𝘁𝗲𝗱 𝗖𝗠: 𝗙𝗿𝗼𝗺 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗥𝗼𝘁 𝘁𝗼 𝗥𝗶𝗴𝗼𝗿𝗼𝘂𝘀 𝗦𝗰𝗮𝗳𝗳𝗼𝗹𝗱𝗶𝗻𝗴

Your AI-assisted product change starts brilliantly. The first analysis is excellent, and the second builds reasonably well. By the fourth interaction, the AI contradicts earlier decisions and forgets critical constraints. This isn't AI failure; it's context degradation. Large language models have fixed context windows. As conversation accumulates, earlier exchanges compress or disappear. The scaffolding pattern, as demonstrated by Benedict Smith, addresses this through structured techniques mapping directly to CM governance.

𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 maintains structured project files providing consistent information to each AI interaction. Effective implementations use explicit configuration state documents capturing scope, affected components, constraints, and design intent. This is standard change control documentation. Organizations maintaining rigorous CM baselines already have this discipline.

𝗧𝗮𝘀𝗸 𝗱𝗲𝗰𝗼𝗺𝗽𝗼𝘀𝗶𝘁𝗶𝗼𝗻 breaks workflows into atomic, verifiable units. Instead of "complete this change," decompose it into discrete tasks: generate CAD modifications, run FMEA, and validate BOM consistency, each as a separate interaction with clear acceptance criteria.

𝗦𝘂𝗯-𝗮𝗴𝗲𝗻𝘁 𝗲𝘅𝗲𝗰𝘂𝘁𝗶𝗼𝗻 deliberately discards context between tasks. Each discrete task executes in a fresh AI instance with only relevant context, preventing error propagation.

According to research on context degradation, effective context windows are "much smaller than advertised token limits." This phenomenon, "context rot," means LLM performance degrades as the context window fills, making scaffolding essential.

Scaffolding aligns with governance requirements. Organizations maintaining rigorous CM2 baselines, clear change processes, and structured documentation already have what scaffolding requires. PLM systems should become an infrastructure that scaffolded workflows interact with, not monolithic interfaces that engineers navigate manually. Context files maintained in version control capture design intent. Validation agents enforce constraints automatically. Human approval gates preserve accountability.

One could start with scaffolding for specific, bounded workflows where governance requirements are well understood, such as engineering change orders affecting well-characterized part families. Build expertise where failure is recoverable before extending to safety-critical applications.

If your team can't explicitly articulate CM requirements for structured prompting, do they lack the discipline needed to manage CM effectively, even without automation?

What's your experience with sustained AI workflows? Have you encountered context degradation in multi-day configuration management tasks?

#AI #CM2 #ConfigurationManagement #PLM #ProductLifecycleManagement #CM #IpX #MDUX
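A minimal sketch of the sub-agent execution idea, under assumed names: a persistent context document is re-injected into each atomic task, and every task runs in a fresh model call so errors don't accumulate across interactions. `call_llm` is a placeholder for whatever client is in use, not a real API.

```python
from pathlib import Path

def call_llm(prompt: str) -> str:
    """Placeholder for your model client; not a real API."""
    raise NotImplementedError("wire up your model client here")

def run_change_workflow(context_file: str, tasks: list) -> dict:
    context = Path(context_file).read_text()  # scope, constraints, design intent
    results = {}
    for task in tasks:
        # Fresh instance per task: only the shared context doc plus this atomic task.
        prompt = f"{context}\n\nTask (atomic, verifiable, with clear acceptance criteria):\n{task}"
        results[task] = call_llm(prompt)
    return results

# Example decomposition, mirroring the post (file name is hypothetical):
# run_change_workflow("change_context.md",
#                     ["Generate CAD modification list",
#                      "Run FMEA on affected components",
#                      "Validate BOM consistency"])
```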
-
Stop blaming the model. Fix the context. Most “broken” agents and RAG stacks aren’t model problems. They’re context problems. Wrong chunks. No memory hygiene. Too many tools in the prompt. Zero control over what actually reaches the model. In agentic systems, context is the job: → Agents decide what to load, what to drop, what to re-query. → RAG decides which 3–5 chunks get a golden ticket into the window. Blow that, and your “smart” agent is just an expensive guess machine. When we tightened context for a service workflow, hallucinations dropped, answers stopped drifting, ops finally trusted the system. The only “feature” we shipped was better context engineering and a clean write-back into sessions.context_state. Sharing this Weaviate context engineering guide because it nails the difference between “prompting a model” and “designing the world it thinks inside.” Save this if you’re serious about agents, not just chatbots. Link to the full report: https://lnkd.in/gXX4e2PM
-
Up until now, much of domain-specific knowledge injection for LLMs has answered the question "How do we get the right context INTO the window?", but with the success of coding agents and recursive language models, that question has changed to "How do we let the model NAVIGATE context itself?"

Large language models have a limited context window, a maximum number of tokens that can be input as their entire context. This is a hard constraint resulting from the transformer architecture itself, and while modern models have pushed context windows into the hundreds of thousands (even millions) of tokens, more context doesn't always mean better results. Research has shown that model performance actually degrades as input length increases, a phenomenon known as context rot, where models struggle to reliably use information buried deep in long sequences, especially when surrounded by similar but irrelevant content.

The solution up until now has been Retrieval Augmented Generation (RAG): chunking and embedding documents into vector databases, then retrieving the most relevant pieces via semantic similarity. This works, but it frames context management purely as a search problem, and scaling it starts to feel more like building a search engine than an AI system.

What coding agents like Claude Code, Cursor, and Codex stumbled into was a different approach entirely: give the LLM a terminal and let it explore. Filesystem-based context navigation lets models directly explore, preview, and selectively load content using tools they already understand. Instead of engineering a pipeline to deliver the right context, the model finds it itself.

Recursive Language Models (RLMs) formalize this further, with a slight distinction: in a coding agent, opening a file or running a tool dumps results back into the context window. RLMs instead store the prompt and all sub-call results as variables in a code environment, only interacting with them programmatically. Recursion happens during code execution, meaning the model can spawn arbitrarily many sub-LLM calls without polluting its own context, orchestrating understanding of 10M+ tokens without ever having to look at all of it at once.

This gives us two differently motivated options: RAG gives you fast, narrow retrieval, great for latency-sensitive apps like chatbots. RLM-style frameworks trade speed for deeper exploration, better suited when thorough analysis matters more than response time.

To learn more about context rot, how coding agents changed context delivery, and how recursive language models are formalizing it all, check out my latest video here: https://lnkd.in/ehszSKV7
From Retrieval to Navigation: The New RAG Paradigm
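A rough sketch of the RLM distinction described above, under assumed names: the long input lives in a variable, each sub-LLM call sees only a slice, and only the short results reach the orchestrating call. `call_llm` is a placeholder rather than a real API, and actual RLM frameworks interleave this with code the model writes and executes itself.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a model client; not a real API."""
    raise NotImplementedError("wire up your model client here")

def answer_over_long_context(question: str, document: str, slice_chars: int = 20_000) -> str:
    # The huge document lives in a variable; the parent call never loads it whole.
    slices = [document[i:i + slice_chars] for i in range(0, len(document), slice_chars)]
    # Sub-calls each see one slice; only their short notes come back.
    notes = [call_llm(f"Extract anything relevant to: {question}\n\n{chunk}")
             for chunk in slices]
    # The orchestrating call reasons over the notes, not the original tokens.
    return call_llm(f"Question: {question}\n\nNotes from sub-calls:\n" + "\n".join(notes))
```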
-
Context Engineering: quietly becoming one of the most important design problems in AI-native development.

We're watching something interesting happen in real time: LLMs are getting smarter, with faster reasoning, better planning, and more fluid code generation. But they're still not remembering well. So instead of waiting for "memory" to mature, we're building systems around that gap. That's where Context Engineering comes in.

A common approach from teams is to start with AI-assisted coding or AI PR reviews, codify rules and guidance into the monorepo or somewhere highly accessible to the LLM, then achieve decent results with AI pair programming. Along the way, prune dead code and docs and improve your system architecture where you can. That's essentially level 1 context engineering.

⸻

So what is Context Engineering? It's shaping, curating, and delivering relevant information to a language model at inference time, so it can behave more intelligently, without actually learning or remembering anything long-term.

We see this everywhere in dev tools now:
• Cursor: thread-local + repo-aware context stacking
• Claude Code: claude.md as a persistent summary of dev history
• Prompt-engineered PRDs living next to source files
• Custom eval + test suites piped into the session as scaffolding
• Vector stores / RAG / MCP servers acting like external memory prosthetics

I use all of these in my developer workflow day to day. It's a constant time and effort commitment to optimize the context for my AI coding assistants; currently I use Claude Code primarily, with Cursor as backup. Agents running in GitHub are Copilot, with some autonomous troubleshooting with GPT-4.1.

We're basically tricking the LLM, with a lot of effort, into feeling like it remembers, by embedding the right context at the right time without overwhelming its token window.

⸻

Why this matters
-> LLMs today are like brilliant interns with no long-term memory: you get great results if you prep them with your wisdom but constrain the thinking boundaries.
-> Context Engineering becomes the new "prompt discipline", but across system design, repo architecture, and real-time tooling.

We're not teaching models to remember (yet). This is a major AI gap, and something we're working on at momentiq. We're teaching ourselves how to communicate, in a relatively inefficient manner, with high-leverage, stateless minds. Context Engineering is working for now and absolutely should be a focus for teams on the AI-native journey.

⸻

Question → How are you engineering context into your LLM workflows today? What's your best practice for context management for your AI code assistants? And where does it still break down?

#AIEngineering #ContextEngineering #SoftwareDevelopment #DevTools
-
Researchers from Meta built a new RAG approach, with 2-4x less token usage and a 16x larger context window.

Most of what we retrieve in RAG setups never actually helps the LLM. In classic RAG, when a query arrives:
- You encode it into a vector.
- Fetch similar chunks from the vector DB.
- Dump the retrieved context into the LLM.

It typically works, but at a huge cost:
- Most chunks contain irrelevant text.
- The LLM has to process far more tokens.
- You pay for compute, latency, and context.

That's the exact problem Meta AI's new method REFRAG solves. It fundamentally rethinks retrieval. Essentially, instead of feeding the LLM every chunk and every token, REFRAG compresses and filters context at a vector level:
- Chunk compression: Each chunk is encoded into a single compressed embedding, rather than hundreds of token embeddings.
- Relevance policy: A lightweight RL-trained policy evaluates the compressed embeddings and keeps only the most relevant chunks.
- Selective expansion: Only the chunks chosen by the RL policy are expanded back into their full embeddings and passed to the LLM.

This way, the model processes just what matters and ignores the rest. Here's the step-by-step walkthrough (see the sketch after this post):
- Steps 1-2) Encode the docs and store them in a vector database.
- Steps 3-5) Encode the full user query and find relevant chunks. Also compute the token-level embeddings for both the query (step 7) and the matching chunks.
- Step 6) Use a relevance policy (trained via RL) to select which chunks to keep.
- Step 8) Concatenate the token-level representation of the input query with the token-level embeddings of the selected chunks and a compressed single-vector representation of each rejected chunk.
- Steps 9-10) Send all of that to the LLM.

The RL step makes REFRAG a more relevance-aware RAG pipeline. Based on the research paper, this approach:
- has 30.85x faster time-to-first-token (3.75x better than the previous SOTA)
- provides 16x larger context windows
- outperforms LLaMA on 16 RAG benchmarks while using 2-4x fewer decoder tokens
- leads to no accuracy loss across RAG, summarization, and multi-turn conversation tasks

That means you can process 16x more context at 30x the speed, with the same accuracy. The code has not been released yet by Meta; they intend to do that soon.

____
Find me → Avi Chawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
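Here is a simplified sketch of that flow, with plain stand-ins for the trained components: `compress` for the chunk encoder, `policy_scores` for the RL-trained relevance policy, and mean-pooled vectors in place of real embeddings. None of this matches Meta's actual implementation (the code isn't released yet); it only shows how selected chunks stay token-level while rejected chunks collapse to one vector each.

```python
import numpy as np

def compress(chunk_token_embs: np.ndarray) -> np.ndarray:
    """Stand-in for the trained chunk encoder: (tokens, dim) -> (dim,)."""
    return chunk_token_embs.mean(axis=0)

def policy_scores(query_vec: np.ndarray, chunk_vecs: np.ndarray) -> np.ndarray:
    """Stand-in for the RL-trained relevance policy: score each compressed chunk."""
    return chunk_vecs @ query_vec

def build_decoder_input(query_token_embs, chunk_token_embs_list, keep_k=2):
    chunk_vecs = np.stack([compress(c) for c in chunk_token_embs_list])
    query_vec = query_token_embs.mean(axis=0)
    keep = np.argsort(-policy_scores(query_vec, chunk_vecs))[:keep_k]
    parts = [query_token_embs]                    # query stays token-level
    for i, chunk in enumerate(chunk_token_embs_list):
        if i in keep:
            parts.append(chunk)                   # selected chunks: full token embeddings
        else:
            parts.append(chunk_vecs[i][None, :])  # rejected chunks: one vector each
    return np.concatenate(parts, axis=0)          # sequence handed to the LLM

dim = 64
query = np.random.rand(12, dim)
chunks = [np.random.rand(100, dim) for _ in range(10)]
print(build_decoder_input(query, chunks).shape)   # far fewer rows than 12 + 10*100
```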
-
Most people think agents fail because the model "didn't understand the request." 🙈

In reality, many agents fail because their context window gets flooded with old tool results: JSON blobs, search results, code edits, logs... all quietly accumulating in the background 💥📚

Anthropic introduced a very clever solution: Context Editing. It automatically clears older tool results (and thinking blocks) once the prompt grows too large. Super elegant, but limited to Claude. So we brought that idea into LangChain... and made it model-agnostic 🔓⚙️ That means you now get Anthropic-style context management with OpenAI, Google Gemini, xAI, local models, etc.

The new Context Editing Middleware lets you:
🧹 Automatically clear older tool outputs
🔧 Keep only the recent results you care about
🚫 Exclude specific tools from being cleared
📉 Reduce token usage dramatically
🧠 Extend long-running agent workflows without blowing up context windows

I walk through the concept, show how Anthropic's strategy works under the hood, and explain how we adapted it for LangChainJS in today's video 👇
🎥 https://lnkd.in/gx66R5YE
📚 https://lnkd.in/gejh5GFP

If you're building tool-heavy agents, this is a big quality-of-life improvement, and it works across your entire model stack.

#LangChain #AI #agents #openai #anthropic #contextengineering
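For intuition only, here is a model-agnostic sketch of the underlying idea over a generic message list; it is not the LangChain middleware API or Anthropic's implementation, just the pruning strategy they describe: once the conversation exceeds a budget, the oldest tool results are replaced with short placeholders while the most recent ones stay intact.

```python
def clear_old_tool_results(messages: list, keep_recent: int = 3,
                           max_chars: int = 50_000) -> list:
    """Replace older tool results with placeholders once the prompt grows too large."""
    if sum(len(m.get("content", "")) for m in messages) <= max_chars:
        return messages  # still under budget: leave everything untouched
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    # Clear everything except the last `keep_recent` tool results.
    to_clear = set(tool_indices[:-keep_recent]) if keep_recent else set(tool_indices)
    cleaned = []
    for i, m in enumerate(messages):
        if i in to_clear:
            cleaned.append({**m, "content": "[tool result cleared to save context]"})
        else:
            cleaned.append(m)
    return cleaned
```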