The interview is for a Generative AI Engineer role at Cohere.

Interviewer: "Your client complains that the LLM keeps losing track of earlier details in a long chat. What's happening?"

You: "That's a classic context window problem. Every LLM has a fixed memory limit - say 8k, 32k, or 200k tokens. Once that's exceeded, earlier tokens get dropped or compressed, and the model literally forgets."

Interviewer: "So you just buy a bigger model?"

You: "You can, but that's like using a megaphone when you need a microphone. A larger context window costs more, runs slower, and doesn't always reason better."

Interviewer: "Then how do you manage long-term memory?"

You:
1. Summarization memory - periodically condense earlier chat segments into concise summaries.
2. Vector memory - store older context as embeddings; retrieve only the relevant pieces later.
3. Hybrid memory - combine summaries for continuity and retrieval for precision (a sketch of this pattern follows below).

Interviewer: "So you’re basically simulating memory?"

You: "Yep. LLMs are stateless by design. You build memory on top of them - a retrieval layer that acts like long-term memory. Otherwise, your chatbot becomes a goldfish."

Interviewer: "And how do you know if the memory strategy works?"

You: "When the system recalls context correctly without bloating cost or latency. If a user says, 'Remind me what I told you last week,' and it answers from stored embeddings - that’s memory done right."

Interviewer: "So context management isn’t a model issue - it’s an architecture issue?"

You: "Exactly. Most think 'context length' equals intelligence. But true intelligence is recall with relevance - not recall with redundancy."

#ai #genai #llms #rag #memory
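A minimal sketch of the hybrid pattern from strategy 3, in plain Python. `call_llm` and `embed` are hypothetical stand-ins for whatever model and embedding API you actually use; everything else is standard library.

```python
import math

# Hypothetical helpers -- replace with calls to your real model / embedding API.
def call_llm(prompt: str) -> str:
    raise NotImplementedError

def embed(text: str) -> list[float]:
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-9)

class HybridMemory:
    """Running summary for continuity + embedded old turns for precise recall."""

    def __init__(self, max_recent: int = 10):
        self.summary = ""                                # summarization memory
        self.recent: list[str] = []                      # verbatim recent turns
        self.store: list[tuple[list[float], str]] = []   # vector memory
        self.max_recent = max_recent

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        if len(self.recent) > self.max_recent:
            evicted = self.recent.pop(0)
            # Keep the exact wording retrievable before it leaves the window.
            self.store.append((embed(evicted), evicted))
            # And fold it into the running summary for continuity.
            self.summary = call_llm(
                f"Update this running summary with one more turn.\n"
                f"Summary: {self.summary}\nNew turn: {evicted}"
            )

    def build_context(self, query: str, k: int = 3) -> str:
        q = embed(query)
        hits = sorted(self.store, key=lambda e: cosine(q, e[0]), reverse=True)[:k]
        recalled = "\n".join(text for _, text in hits)
        return (
            f"Summary so far:\n{self.summary}\n\n"
            f"Relevant earlier details:\n{recalled}\n\n"
            "Recent turns:\n" + "\n".join(self.recent)
        )
```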
Understanding Large Language Model Context Limits
-
🚨 The biggest misunderstanding about LLM limits I see every week

Today someone asked for advice on analyzing a 500,000-character file. They thought: “Easy, I’ll just convert the PDF into .txt and paste it into the model.”

Except… that’s not how large-context models work. What actually happened was:
1) The model accepted the file.
2) It looked like it processed the whole thing.
3) It even responded confidently that the analysis was done.

But when the user asked what it really did, it finally admitted: it only analysed ~30% of the text. The rest never even made it into memory. And honestly? This happens all the time.

Why this happens: GPT-5-class models can handle roughly 272k tokens of input (≈ 200k words) and ~128k tokens of output. A 500k-character document goes far beyond what the model will actually hold, so it quietly samples, truncates, or drops earlier context as it processes.

This isn’t an error but an intrinsic limitation of the model. A limitation by design, even: imagine ChatGPT's 800 million weekly users uploading huge documents to OpenAI's servers all at once... not even all the data centers on Earth would be enough. But most people don’t realize it.

⚠️ The hidden risk
When context goes over the limit:
- The model won’t throw an error
- It won’t warn you
- It will reply with confidence anyway

And you’re left assuming it processed everything correctly. Which is exactly how bad analysis, missed insights, and false certainty happen.

✔️ What to do instead
If you’re working with very large documents:
- Chunk the text intentionally
- Use multi-pass or hierarchical summaries
- Feed sections in controlled sequences
- Or use external retrieval rather than raw uploads

In other words: if the file is bigger than the model’s brain, upgrade the workflow, not the file format.

Final thoughts
AI can be useful for certain tasks, but it’s not magic. And it’s definitely not reading half-million-character documents in one go. Know your tools. Know their limits. And don’t let confidence trick you into thinking you got a full analysis when you only got 30%.

----
Follow me Chiara Gallese, Ph.D. for an honest analysis of AI limitations and risks
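To make the "upgrade the workflow" advice concrete, here is a rough sketch of deliberate chunking plus a two-pass (hierarchical) summary. `call_llm` is a placeholder for whatever completion call you use, and the chunk size is an arbitrary assumption, not a recommendation.

```python
# Sketch: chunk the text yourself instead of letting the model truncate silently,
# then summarise in two passes so every section actually gets read.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model API")

def chunk_text(text: str, chunk_chars: int = 12_000, overlap: int = 500) -> list[str]:
    """Fixed-size character chunks with a little overlap between neighbours."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_chars])
        start += chunk_chars - overlap
    return chunks

def hierarchical_summary(text: str) -> str:
    # Pass 1: summarise each chunk on its own, in a controlled sequence.
    partials = [
        call_llm(f"Summarise the key facts in this section:\n{chunk}")
        for chunk in chunk_text(text)
    ]
    # Pass 2: merge the partial summaries into one analysis the model can hold at once.
    return call_llm(
        "Combine these section summaries into one overall analysis:\n"
        + "\n---\n".join(partials)
    )
```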
-
Mid-conversation with Claude yesterday, I got this message: "Compacting our conversation so we can keep chatting. This takes about 1-2 minutes."

At 62% capacity, I watched it reorganize its thoughts. And I realized: most AI users have no idea this is happening. Here's what you need to understand.

Context windows are AI's working memory: the total text the model can "see" at once—your prompts, its responses, uploaded documents, everything. Claude offers 200,000 tokens (roughly 150,000 words) for paid users. Sounds massive until you're deep into a complex project. When you hit that ceiling, something has to give.

Claude's approach: auto-compaction kicks in around 95% capacity. Earlier messages get summarized, keeping what the AI thinks matters most. Your full history is preserved—you can scroll back—but the AI's "working memory" gets compressed. Each compression cycle loses granularity.

Manus AI takes a different path. Rather than compacting in place, it externalizes memory to the file system—creating todo.md files to maintain focus, saving intermediate results externally, spinning up sub-agents with their own context windows for discrete tasks. When context fills up, it uses "recoverable compression"—dropping content but keeping URLs and file paths so it can retrieve information later if needed.

Neither is perfect. Both involve tradeoffs.

The takeaway: context limits are real constraints on how much complexity AI can handle in a single session. If your team uses AI for research, strategy, or extended projects, you need to understand this.

Three practical tips:
→ Checkpoint manually at 70% rather than waiting for auto-compaction at 95%. You control what's preserved. (This only works in Claude Code, via the /compact command.)
→ Summarize at natural breakpoints. Ask the AI to capture key decisions before moving on. You can carry this over to another chat manually or ask it to save it to memory.
→ For complex projects, externalize documentation (e.g. use Projects in Claude or ChatGPT). Don't rely solely on conversation memory.

As context windows expand—Claude is testing 1 million tokens for some API users—this will matter less. But for now, understanding your AI's memory limits is the difference between productive collaboration and frustrating repetition.
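As a rough illustration of the first tip (checkpointing before auto-compaction kicks in), here is a sketch in Python. The token count is a crude characters-divided-by-four heuristic, `summarise` is a caller-supplied model call, and none of this is any vendor's actual API.

```python
# Sketch: checkpoint at ~70% of the window instead of waiting for auto-compaction.
CONTEXT_LIMIT = 200_000   # e.g. the 200k-token window mentioned above
CHECKPOINT_AT = 0.70

def estimate_tokens(messages: list[str]) -> int:
    # Very rough: real tokenizers differ, but chars/4 is a usable ceiling check.
    return sum(len(m) for m in messages) // 4

def maybe_checkpoint(messages: list[str], summarise) -> None:
    used = estimate_tokens(messages)
    if used / CONTEXT_LIMIT >= CHECKPOINT_AT:
        notes = summarise(messages)                # ask the model to capture key decisions
        with open("project_notes.md", "a") as f:   # externalise; don't rely on chat memory
            f.write(notes + "\n")
```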
-
Up until now, much of domain-specific knowledge injection for LLMs has answered the question: "How do we get the right context INTO the window?" With the success of coding agents and recursive language models, that question has changed to: "How do we let the model NAVIGATE context itself?"

Large language models have a limited context window: a maximum number of tokens that can be fed in as the entire input. This is a hard constraint of the transformer architecture itself, and while modern models have pushed context windows into the hundreds of thousands (even millions) of tokens, more context doesn't always mean better results. Research has shown that model performance actually degrades as input length increases, a phenomenon known as context rot, where models struggle to reliably use information buried deep in long sequences, especially when surrounded by similar but irrelevant content.

The solution up until now has been Retrieval Augmented Generation (RAG): chunking and embedding documents into vector databases, then retrieving the most relevant pieces via semantic similarity. This works, but it frames context management purely as a search problem, and scaling it starts to feel more like building a search engine than an AI system.

What coding agents like Claude Code, Cursor, and Codex stumbled into was a different approach entirely: give the LLM a terminal and let it explore. Filesystem-based context navigation lets models directly explore, preview, and selectively load content using tools they already understand. Instead of engineering a pipeline to deliver the right context, the model finds it itself.

Recursive Language Models (RLMs) formalize this further, with a slight distinction: in a coding agent, opening a file or running a tool dumps results back into the context window. RLMs instead store the prompt and all sub-call results as variables in a code environment, interacting with them only programmatically. Recursion happens during code execution, meaning the model can spawn arbitrarily many sub-LLM calls without polluting its own context, orchestrating an understanding of 10M+ tokens without ever having to look at all of it at once.

This gives us two differently motivated options: RAG gives you fast, narrow retrieval, great for latency-sensitive apps like chatbots. RLM-style frameworks trade speed for deeper exploration, better suited when thorough analysis matters more than response time.

To learn more about context rot, how coding agents changed context delivery, and how recursive language models are formalizing it all, check out my latest video here: https://lnkd.in/ehszSKV7
From Retrieval to Navigation: The New RAG Paradigm
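To make the "give the LLM a terminal and let it explore" idea concrete, here is a minimal sketch of filesystem tools an agent loop might expose. The tool names and the dispatch approach are illustrative assumptions, not how Claude Code, Cursor, or Codex actually implement it.

```python
import os

# Sketch: instead of pre-retrieving chunks, expose small filesystem tools the model
# can call to list, preview, and selectively load text. Only the slices it asks for
# ever enter its context window.

def list_files(root: str) -> list[str]:
    return [os.path.join(d, f) for d, _, files in os.walk(root) for f in files]

def preview(path: str, n_chars: int = 2_000) -> str:
    with open(path, encoding="utf-8", errors="ignore") as fh:
        return fh.read(n_chars)

def read_slice(path: str, start: int, end: int) -> str:
    with open(path, encoding="utf-8", errors="ignore") as fh:
        return fh.read()[start:end]

# A hypothetical agent loop would hand this registry to the model and execute
# whichever tool call it requests, feeding back only that result.
TOOLS = {"list_files": list_files, "preview": preview, "read_slice": read_slice}
```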
-
MIT researchers found a way around context limitations. (Here's why it changes everything)

For months, everyone accepted this idea: LLMs can only reliably use around 100,000 tokens of context. End of story. MIT researchers just broke that assumption. They did not make the context window bigger. They changed how models use context.

Their new approach, Recursive Language Models, can work with millions of tokens - up to 100x more than what was considered realistic. Accuracy stays strong. Costs are comparable or even go down.

So what changed? LLMs use LLMs recursively. Instead of pushing huge documents into the model, they treat the document like a system the model can query. The model does not read everything at once. The document lives outside the context window as a variable that the model can inspect with code.

Think about how you use Google or a book. You do not memorize everything. You search for what you need. Same idea here.

Why this matters:
- Context limits were shaping how we built AI tools.
- We summarized data.
- We filtered information before sending it to models.
All of that was just a temporary fix. Now models can work with full, messy, real-world data.

What developers can do now:
- Work with massive codebases.
- Scan years of git history.
- Query huge documentation sets.
- Build tools that use data that was impossible to handle before.

This points to something deeper. What we call “hard limits” in tech are often just design choices. MIT didn’t remove a limit. They changed how the problem is framed. And that shift is what creates real breakthroughs.
-
91.3% accuracy vs 0%. Same model. Same task. The only difference: treating your prompt as code instead of text.

Recursive Language Models (RLMs) from MIT have completely changed how I think about handling long context in LLMs. Instead of cramming everything into the context window, RLMs treat your prompt as part of the 𝘦𝘯𝘷𝘪𝘳𝘰𝘯𝘮𝘦𝘯𝘵 that the model can programmatically explore.

𝗧𝗵𝗲 𝗖𝗼𝗿𝗲 𝗜𝗻𝘀𝗶𝗴𝗵𝘁
Once you hit the context limit in an LLM, you're done. But LLMs are trained on code as well, right? Why not use their coding skills for more than just coding?
1. Load your prompt as a 𝘷𝘢𝘳𝘪𝘢𝘣𝘭𝘦 in a REPL programming environment
2. Give the model tools to peek into, decompose, and recursively process parts of that variable
3. Let the model write 𝘤𝘰𝘥𝘦 that calls itself on programmatic slices of the input

This enables the model to handle prompts that are literally 100x longer than its context window. The 𝗿𝗲𝗰𝘂𝗿𝘀𝗶𝘃𝗲 element is the key insight here - the LLM can call itself (or a smaller subagent) for smaller tasks, allowing it to batch and concatenate results to answer complex questions.

𝗘𝘅𝗮𝗺𝗽𝗹𝗲
I tested it in Python (via DSPy), input the full Alice in Wonderland book, and asked it to give a sentiment analysis of the opening of each chapter. The LLM:
1. Explored the prompt (book) to see how the chapter headings were formatted
2. Implemented regex to split the full string into chunks before/after each chapter heading
3. Invoked the LLM sub-agent on each opening to analyse its sentiment
(A rough sketch of this pattern in plain Python follows below.)

Even when a full prompt does fit into context, LLMs notoriously suffer from context rot. This approach let each chunk be analysed separately by the sub-agent, with each call having no knowledge of the greater task.

𝗥𝗲𝘀𝘂𝗹𝘁𝘀
• RLMs successfully process inputs up to 𝘁𝘄𝗼 𝗼𝗿𝗱𝗲𝗿𝘀 𝗼𝗳 𝗺𝗮𝗴𝗻𝗶𝘁𝘂𝗱𝗲 beyond model context windows
• On BrowseComp-Plus (6-11M tokens), RLM(GPT-5) achieved 91.3% accuracy vs 0% for the base model

RLMs aren't perfect. The inference cost has high variance - median costs are comparable to base models, but some trajectories explode to 3x+ the cost due to long recursive chains. I also found, as the authors note in the appendix, that the models continue analysing well past the point where they have already found an answer. My hunch is that each LLM invocation always wants to do 𝘴𝘰𝘮𝘦𝘵𝘩𝘪𝘯𝘨, even if that something has already been done. It always wants to check its answer. Because of how they're trained, LLMs never just say "Okay, done!".

The paper demonstrates that with better training (especially on-policy rollouts at scale), native RLMs could become far more efficient than current implementations suggest. I'll be extremely excited if this becomes a core part of model training, building custom models that excel at managing their prompt with code.

Read the paper: https://lnkd.in/eq_xUJvJ
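Here is that rough sketch in plain Python, not the paper's actual implementation and not DSPy's API: the book stays in a variable, the parent code inspects its structure, and only small slices are handed to sub-calls. The chapter-heading regex and the 1,500-character "opening" are assumptions about the text's formatting.

```python
import re

# Sketch of the recursive pattern: the full book never enters the parent model's
# context; it lives in a Python variable, and only small slices go to sub-calls.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for the actual sub-agent call")

def chapter_sentiments(book_text: str) -> list[str]:
    # 1. Inspect the variable programmatically to find its structure
    #    (assumes Gutenberg-style headings like "CHAPTER I.").
    chapters = re.split(r"\nCHAPTER [IVXLC]+\.", book_text)[1:]
    results = []
    for i, chapter in enumerate(chapters, start=1):
        opening = chapter[:1_500]  # only the opening ever reaches the sub-call
        # 2. Recursive sub-call: each invocation sees a tiny slice, not the whole prompt.
        results.append(
            call_llm(f"Chapter {i}: describe the sentiment of this opening:\n{opening}")
        )
    return results
```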
-
What if this is the next big step for LLMs? A new inference technique from the Massachusetts Institute of Technology called Recursive Language Models (RLMs) is rethinking context windows entirely.

The core problem is well-known: even frontier models like GPT-5 suffer from "context rot" - performance degrades quickly as prompts get longer, regardless of the technical context limit. Summarization helps but loses critical details. Retrieval misses complex reasoning patterns. Have we been trying to make models see more, when perhaps the answer is to make them see differently?

RLMs treat the prompt not as direct neural network input, but as an external object the model interacts with programmatically. The prompt is loaded as a variable in a Python REPL environment, and the LLM writes code to peek into it, decompose it, and recursively call itself over smaller snippets. Same interface as a regular LLM, radically different execution.

On information-dense tasks where GPT-5 scores below 0.1%, RLMs achieve 58%. On multi-hop research questions spanning 6-11M tokens, RLMs hit 91% accuracy while costing less than feeding in the full context would. Crucially, performance degrades far more gracefully as complexity scales - the approach handles inputs two orders of magnitude beyond native context windows.

This suggests that scaling context is not just an architecture problem but also an inference problem. RLMs demonstrate that letting models reason about their input symbolically rather than processing it neurally could be a promising new direction. If this approach generalizes, we may be looking at an entirely new axis for scaling language model capabilities.

↓ 𝐖𝐚𝐧𝐭 𝐭𝐨 𝐤𝐞𝐞𝐩 𝐮𝐩? Join my newsletter with 50k+ readers and be the first to learn about the latest AI research: llmwatch.com 💡
-
You're in a Senior AI Interview at Google DeepMind. The interviewer sets a trap: "Gemini 1.5 Pro has a 2M token context window. Why on earth are you still building complex RAG pipelines? Why not just stuff the whole documentation into the prompt?"

90% of candidates walk right into it. They say: "We don't need RAG anymore! With 1M+ tokens, we can just feed the entire codebase or legal library into the context. RAG is legacy tech for small context windows."

The reality is that if they say this, they just failed the system design round. Strong candidates aren't optimizing for 𝘤𝘢𝘱𝘢𝘤𝘪𝘵𝘺. They are optimizing for 𝘚𝘪𝘨𝘯𝘢𝘭 𝘋𝘦𝘯𝘴𝘪𝘵𝘺.

-----
𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧:
The interviewer isn't testing your knowledge of the API spec. They are testing your understanding of the 𝘈𝘵𝘵𝘦𝘯𝘵𝘪𝘰𝘯 𝘔𝘦𝘤𝘩𝘢𝘯𝘪𝘴𝘮. Here is why "Stuffing the Prompt" fails in production:

1️⃣ 𝘛𝘩𝘦 "𝘓𝘰𝘴𝘵 𝘪𝘯 𝘵𝘩𝘦 𝘔𝘪𝘥𝘥𝘭𝘦" problem: LLMs are not databases. They have a "U-shaped" attention curve. They are great at recalling the beginning and the end of the prompt, but performance degrades significantly in the middle.
- Result: Your model "forgets" the critical clause buried on page 342 of the 1000-page input.

2️⃣ 𝘛𝘩𝘦 𝘓𝘢𝘵𝘦𝘯𝘤𝘺 𝘌𝘤𝘰𝘯𝘰𝘮𝘪𝘤𝘴: Attention is roughly quadratic, O(n^2) (or linear with optimizations, but still heavy).
- Scenario: A user asks a simple question.
- RAG: Retrieves 5 relevant chunks. Input = 2k tokens. Latency = 400ms.
- Context Stuffing: Inputs 500k tokens. Latency = 15 seconds. Cost = 100x.
- Result: You just built the world's most expensive, slowest grep tool.

3️⃣ 𝘛𝘩𝘦 𝘚𝘪𝘨𝘯𝘢𝘭-𝘵𝘰-𝘕𝘰𝘪𝘴𝘦 𝘙𝘢𝘵𝘪𝘰: More context does not equal more intelligence. It often equals more distraction. Feeding irrelevant tokens increases the probability of hallucination because the model tries to connect dots that shouldn't be connected.

You need to explain that huge context windows and RAG are orthogonal, not competitive.
- RAG is the Librarian: it curates the exact 10 pages you need.
- The Context Window is the Desk: it lets you spread those 10 pages out and reason across them deeply.

𝐓𝐡𝐞 𝐀𝐧𝐬𝐰𝐞𝐫 𝐓𝐡𝐚𝐭 𝐆𝐞𝐭𝐬 𝐘𝐨𝐮 𝐇𝐢𝐫𝐞𝐝: "I use RAG to maximize the Signal-to-Noise Ratio of the input. I treat the 1M window as a reasoning buffer, not a storage layer. If you treat the Context Window like a database, you get database latency with probabilistic accuracy."

#AI #ArtificialIntelligence #GenerativeAI #LLM #LargeLanguageModels #GoogleDeepMind #RAG #SystemDesign
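A toy sketch of the "librarian vs. desk" split: retrieve a handful of chunks and build a compact prompt instead of stuffing the corpus. The scoring here is naive term overlap, standing in for whatever retriever (BM25, embeddings, a vector database) you would actually use; `build_prompt` and the chunk count are illustrative choices, not anyone's production setup.

```python
# Sketch: a few thousand tokens of signal instead of 500k tokens of noise.
def score(query: str, chunk: str) -> int:
    # Naive term-overlap stand-in for a real retriever.
    terms = set(query.lower().split())
    return sum(1 for word in chunk.lower().split() if word in terms)

def build_prompt(query: str, chunks: list[str], k: int = 5) -> str:
    top = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]
    context = "\n\n".join(top)   # the "10 pages" laid out on the desk
    return f"Using only this context:\n{context}\n\nQuestion: {query}"
```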
-
Clawdbot's memory system has a clever trick most AI developers miss.

LLM summarization is lossy. Every compaction destroys information you might need later.

Here's the trap: your AI agent hits 180K tokens in a 200K context window. The obvious solution is compaction: summarize older turns into a compact entry and keep recent messages intact. Turns 1-140 become a paragraph. Problem solved.

Except it isn't. That paragraph lost the exact error message from Turn 23. The specific config value mentioned in Turn 67. The nuance in the user's preference from Turn 12. Summarization optimizes for gist. Memory retrieval needs specifics.

Clawdbot handles this with a "pre-compaction memory flush." When context approaches the limit, before any summarization happens, the agent silently writes important facts to persistent storage. Then compaction proceeds. The principle: compaction can't destroy what's already saved elsewhere.

This inverts the typical mental model. Most developers treat context limits as a compression problem. The better frame: context limits are a persistence trigger. When you're approaching the wall, don't ask "what can I summarize?" Ask "what must I save before I summarize?"
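A minimal sketch of that "persist before you summarize" order of operations; this is not Clawdbot's actual code. `extract_facts` and `summarise` stand in for model calls, the token estimate is a rough heuristic, and the JSONL file stands in for whatever persistent store you use.

```python
import json

# Sketch: flush specifics to durable storage first, then let compaction be lossy.
CONTEXT_LIMIT = 200_000
FLUSH_AT = 0.90

def flush_then_compact(turns: list[str], extract_facts, summarise) -> list[str]:
    used = sum(len(t) for t in turns) // 4          # rough token estimate
    if used / CONTEXT_LIMIT < FLUSH_AT:
        return turns                                 # nowhere near the wall yet

    old, recent = turns[:-20], turns[-20:]
    # 1. Persistence first: exact error messages, config values, stated preferences.
    with open("memory.jsonl", "a") as f:
        for fact in extract_facts(old):
            f.write(json.dumps({"fact": fact}) + "\n")
    # 2. Only then compact: the summary can be lossy because the facts are saved.
    return [summarise(old)] + recent
```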