Challenges Faced by LLMs in Multi-Turn Conversations


Summary

Large language models (LLMs) struggle to stay consistent and reliable in multi-turn conversations, where tasks unfold over several messages rather than all at once. This challenge, often called the "lost in conversation" effect, highlights how LLMs tend to forget context, make early assumptions, and produce confusing answers as conversations progress.

  • Share clear context: Provide the model with as much relevant information as possible upfront, so it understands the conversation’s purpose from the start.
  • Reset when needed: If responses start to drift or become unreliable, restart the conversation or summarize previous points to reestablish clarity.
  • Summarize key decisions: Periodically recap important steps or choices so the model stays focused on what matters and avoids carrying forward mistakes or irrelevant details. A minimal sketch of this reset-and-summarize tactic follows this list.
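To make the reset tactic concrete, here is a minimal sketch in Python. It assumes a generic chat-completion client behind a hypothetical `call_llm(messages)` function, and the summarization prompt is illustrative rather than a prescribed recipe.

```python
def call_llm(messages: list[dict]) -> str:
    """Hypothetical stand-in for your chat-completion client."""
    raise NotImplementedError("wire this to your provider's chat API")

def reset_with_summary(history: list[dict]) -> list[dict]:
    """Collapse a drifting conversation into a fresh, compact context."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    summary = call_llm([
        {"role": "system", "content": "Summarize the task requirements and "
                                      "key decisions so far in under 150 words."},
        {"role": "user", "content": transcript},
    ])
    # Start a new conversation that carries only the distilled context.
    return [{"role": "system", "content": f"Context so far:\n{summary}"}]
```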
  • Pragyan Tripathi

    Clojure Developer @ Amperity | Building Chuck Data

    4,048 followers

    Ever noticed that your AI starts strong, but after a few back-and-forths, it spirals into nonsense? It turns out that it’s not your imagination; it’s science. New research from Microsoft + Salesforce tested 15 leading LLMs (including #GPT-4, #Claude, #Gemini) across multi-turn tasks.

    𝐓𝐡𝐞 𝐫𝐞𝐬𝐮𝐥𝐭𝐬?
    1. Performance dropped by an average of 39%.
    2. Same task. Same info. Just given step by step instead of all at once.
    3. Every single model got worse.

    𝐇𝐞𝐫𝐞’𝐬 𝐰𝐡𝐲 𝐋𝐋𝐌𝐬 𝐠𝐞𝐭 𝐥𝐨𝐬𝐭 𝐢𝐧 𝐜𝐨𝐧𝐯𝐞𝐫𝐬𝐚𝐭𝐢𝐨𝐧:
    -> Premature answers → They guess before they have full context.
    -> Answer bloat → Responses get longer, carrying over flawed logic.
    -> Loss of middle context → They remember the start and end, but forget what’s in between.
    -> Verbal drift → More words = more assumptions = more confusion.

    The scary part isn’t just the decline. 𝐈𝐭’𝐬 𝐭𝐡𝐞 𝐮𝐧𝐫𝐞𝐥𝐢𝐚𝐛𝐢𝐥𝐢𝐭𝐲:
    – In single-turn tasks, results are fairly consistent.
    – In multi-turn, the same prompt can succeed one time and fail the next.

    𝐓𝐡𝐢𝐬 𝐞𝐱𝐩𝐥𝐚𝐢𝐧𝐬 𝐰𝐡𝐲:
    -> Your AI-generated UI starts strong but drifts into chaos.
    -> Conversations often end with users restarting from scratch.
    -> Temperature settings don’t fix the problem — it’s deeper than randomness.

    𝐖𝐡𝐚𝐭 𝐜𝐚𝐧 𝐰𝐞 𝐝𝐨 (𝐟𝐨𝐫 𝐧𝐨𝐰)?
    – Give more context upfront instead of “drip-feeding” instructions (a sketch of this consolidation follows below).
    – Reset conversations when quality drops.
    – Use summaries to re-establish shared context.

    𝐓𝐡𝐞 𝐛𝐢𝐠 𝐭𝐚𝐤𝐞𝐚𝐰𝐚𝐲: The next frontier of AI isn’t just “smarter models.” It’s models that can stay coherent and consistent across extended interactions.
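    As a concrete illustration of the “give more context upfront” advice, here is a minimal sketch that gathers requirements first and issues one consolidated prompt. The `consolidated_prompt` helper and the example requirements are hypothetical.

    ```python
    def consolidated_prompt(task: str, requirements: list[str]) -> str:
        """Fold every known requirement into one fully specified prompt."""
        bullets = "\n".join(f"- {r}" for r in requirements)
        return (f"{task}\n\nRequirements (complete list, nothing will be "
                f"added later):\n{bullets}")

    # Instead of revealing these one turn at a time:
    reqs = [
        "output must be valid SQL",
        "target dialect is PostgreSQL",
        "limit the result to 10 rows",
    ]
    print(consolidated_prompt("Write a query for top customers by revenue.", reqs))
    ```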

  • Eduardo Ordax

    🤖 Generative AI Lead @ AWS ☁️ (200k+) | Startup Advisor | Public Speaker | AI Outsider | Founder Thinkfluencer AI

    225,812 followers

    🧠 LLMs still get lost in conversation. You should pay attention to this, especially when building AI agents!

    A new paper just dropped, and it uncovers something many of us suspected: LLMs perform way worse when instructions are revealed gradually in multi-turn conversations.

    💬 While LLMs excel when you give them everything up front (single-turn), performance drops by an average of 39% when the same task is spread across several conversational turns. Even GPT-4 and Gemini 2.5 stumble.

    Why? Because in multi-turn chats, models:
    ❌ Make premature assumptions
    ❌ Try to “wrap up” too soon
    ❌ Get stuck on their own past mistakes
    ❌ Struggle to recover when they go off-track

    The authors call this the “𝗟𝗼𝘀𝘁 𝗶𝗻 𝗖𝗼𝗻𝘃𝗲𝗿𝘀𝗮𝘁𝗶𝗼𝗻” effect, and it explains why LLMs sometimes seem great in demos but frustrating in real-world use.

    🔍 If you’re building agentic AI products, this is a wake-up call. Most evaluation benchmarks don’t reflect how users actually interact: with messy, evolving, often underspecified prompts.

    📄 Paper link in comments.

  • Akash Sharma

    CEO at Vellum

    16,078 followers

    🧠 If you're building apps with LLMs, this paper is a must-read.

    Researchers at Microsoft and Salesforce recently released LLMs Get Lost in Multi-Turn Conversation — and the findings resonate with our experience at Vellum. They ran 200,000+ simulations across 15 top models, comparing performance on the same task in two modes:
    - Single-turn (user provides a well-specified prompt upfront)
    - Multi-turn (user reveals task requirements gradually — like real users do)

    The result?
    ✅ 90% avg accuracy in single-turn
    💬 65% avg accuracy in multi-turn
    🔻 -39% performance drop across the board
    😬 Unreliability more than doubled

    Even the best models get lost when the task unfolds over multiple messages. They latch onto early assumptions, generate bloated answers, and fail to adapt when more info arrives.

    For application builders, this changes how we think about evaluation and reliability:
    - One-shot prompt benchmarks ≠ user reality
    - Multi-turn behavior needs to be a first-class test case (a test sketch follows below)
    - Agents and wrappers won’t fix everything — the underlying model still gets confused

    This paper validates something we've seen in the wild: the moment users interact conversationally, reliability tanks — unless you're deliberate about managing context, fallback strategies, and prompt structure.

    📌 If you’re building on LLMs, read this. Test differently. Optimize for the real-world path, not the happy path.
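    One way to make multi-turn behavior a first-class test case is to replay a scripted conversation and assert on the final answer rather than on a one-shot prompt. A rough sketch, assuming a hypothetical `call_llm(messages)` chat client:

    ```python
    def call_llm(messages: list[dict]) -> str:
        """Hypothetical chat-completion client; returns the assistant reply."""
        raise NotImplementedError

    def run_conversation(user_turns: list[str]) -> str:
        """Feed user turns one at a time; return the model's final answer."""
        messages: list[dict] = []
        reply = ""
        for turn in user_turns:
            messages.append({"role": "user", "content": turn})
            reply = call_llm(messages)
            messages.append({"role": "assistant", "content": reply})
        return reply

    def test_multi_turn_sql():
        # Same task a single-turn benchmark would test, revealed gradually.
        final = run_conversation([
            "I need a query over the orders table.",
            "Actually, only orders from 2024.",
            "Return total revenue per customer, highest first.",
        ])
        assert "GROUP BY" in final.upper()
    ```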

  • Brooke Hopkins

    Founder @ Coval | ex-Waymo

    11,147 followers

    LLMs Get Lost in Multi-Turn Conversations: New Research Reveals Major Reliability Gap

    Just read a fascinating new paper from Microsoft and Salesforce Research revealing a critical flaw in today's LLMs: they dramatically underperform in multi-turn conversations compared to single-turn interactions.

    📊 Key findings:
    🔗 LLMs suffer an average 39% performance drop in multi-turn settings across six generation tasks
    🔗 This occurs even in conversations with as few as two turns
    🔗 The problem affects ALL tested models, including the most advanced ones (Claude 3.7, GPT-4.1, Gemini 2.5)

    🔍 The researchers call this the "lost in conversation" phenomenon - when LLMs take a wrong turn in conversation, they get lost and don't recover. This is caused by:
    🔗 Making assumptions too early
    🔗 Prematurely generating final solutions
    🔗 Relying too heavily on previous (incorrect) answers
    🔗 Producing overly verbose responses

    💬 Why conversation-level evaluation matters: Traditional LLM benchmarks focus on single-turn performance, creating a dangerous blind spot. Real-world AI interactions are conversational by nature, and this research shows that even the most capable models struggle with maintaining context and adapting to new information over multiple turns. Without robust conversation-level evaluation, we risk deploying systems that perform brilliantly in lab tests but frustrate users in practice.

    🔎 At Coval, this is exactly what we focus on: evaluating LLMs in realistic conversational scenarios rather than isolated prompts. By measuring how models handle the natural flow of information across turns, we can identify reliability issues before they impact users and guide development toward truly conversational AI.

    This research highlights a critical gap between how we evaluate LLMs (single-turn) versus how we use them in practice (multi-turn). As we build AI assistants and agents, addressing this reliability issue becomes essential.

  • Aishwarya Srinivasan
    627,986 followers

    One of the biggest challenges I see with scaling LLM agents isn’t the model itself. It’s context. Agents break down not because they “can’t think” but because they lose track of what’s happened, what’s been decided, and why.

    Here’s the pattern I notice:
    👉 For short tasks, things work fine. The agent remembers the conversation so far, does its subtasks, and pulls everything together reliably.
    👉 But the moment the task gets longer, the context window fills up, and the agent starts forgetting key decisions. That’s when results become inconsistent, and trust breaks down.

    That’s where Context Engineering comes in.

    🔑 Principle 1: Share Full Context, Not Just Results
    Reliability starts with transparency. If an agent only shares the final outputs of subtasks, the decision-making trail is lost. That makes it impossible to debug or reproduce. You need the full trace, not just the answer.

    🔑 Principle 2: Every Action Is an Implicit Decision
    Every step in a workflow isn’t just “doing the work”, it’s making a decision. And if those decisions conflict because context was lost along the way, you end up with unreliable results.

    ✨ The solution to this is to "Engineer Smarter Context"
    It’s not about dumping more history into the next step. It’s about carrying forward the right pieces of context:
    → Summarize the messy details into something digestible.
    → Keep the key decisions and turning points visible.
    → Drop the noise that doesn’t matter.

    When you do this well, agents can finally handle longer, more complex workflows without falling apart (see the sketch after this post). Reliability doesn’t come from bigger context windows. It comes from smarter context windows.

    〰️〰️〰️
    Follow me (Aishwarya Srinivasan) for more AI insight and subscribe to my Substack to find more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg
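    A minimal sketch of the “smarter context” idea: keep decisions verbatim, condense details, drop noise. The `kind` tags and the toy `summarize` helper are assumptions for illustration; in practice the summarizer would be another model call.

    ```python
    def summarize(texts: list[str]) -> str:
        """Toy condenser; in practice, an LLM call or extractive summarizer."""
        return " / ".join(t[:60] for t in texts)

    def compact_context(events: list[dict]) -> str:
        """Carry forward decisions verbatim, condense details, drop noise."""
        decisions = [e["text"] for e in events if e["kind"] == "decision"]
        details = [e["text"] for e in events if e["kind"] == "detail"]
        # Events tagged "noise" are dropped entirely.
        return ("Decisions so far:\n"
                + "\n".join(f"- {d}" for d in decisions)
                + "\n\nBackground (condensed): " + summarize(details))

    events = [
        {"kind": "decision", "text": "Use PostgreSQL, not MySQL."},
        {"kind": "detail", "text": "Orders table has 40M rows, partitioned by month."},
        {"kind": "noise", "text": "Thanks, that looks good!"},
    ]
    print(compact_context(events))
    ```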

  • Llewyn Paine, Ph.D.

    📊 Outcomes over output: Validated AI research guidance for product leaders | Training workshops | Speaking | Consulting

    3,020 followers

    In my AI+UXR workshops, I recommend starting a fresh chat each time you ask the LLM to do a significant task. Why? Because #UXresearch tools need to be reliable, and the more you talk to the LLM, the more that reliability takes a hit. This can introduce unknown errors.

    This happens for several reasons, but here are a few big (albeit interrelated) ones:

    1️⃣ LLMs can get lost even in short multi-turn conversations
    According to recent research from Microsoft and Salesforce, providing instructions over multiple turns (vs. all at once upfront) can dramatically degrade the output of LLMs. This is true even for reasoning models like o3 and Deepseek-R1, which “deteriorate in similar ways.”

    2️⃣ Past turns influence how the LLM weights different concepts
    In the workshop, I show a conversation that continuously, subtly references safaris, until the LLM takes a hard turn and generates content with a giraffe in it. Every token influences future tokens, and repeated concepts (even inadvertent ones) can “prime” the model to produce unexpected output.

    3️⃣ Every turn is an opportunity for “context poisoning”
    “Context poisoning” is when inaccurate, irrelevant, or hallucinated information gets into the LLM context, causing misleading results or deviation from instructions. This is sometimes exploited to jailbreak LLMs, but it can happen unintentionally as well. In simple terms, bad assumptions early on are hard to recover from.

    To avoid these issues, I recommend (see the sketch after this post):
    🧩 Starting the conversation from scratch any time you’re doing an important research task (including turning off memory and custom instructions)
    🧩 Using a single well-structured prompt when possible
    🧩 And always, testing carefully and being alert to errors in LLM output

    I talk about these issues (and a lot more) in my workshops, and I’m writing about this today because the question was asked by some of my amazing workshop participants. Sign up in my profile to get notified about my next public workshop, or if you’re looking for private, in-house training for your team, drop me a note!

    #AI #UX #LLM #userresearch
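    In code, the “fresh chat per task” recommendation amounts to a stateless call with no prior turns attached. A small sketch, with `call_llm` as a hypothetical chat client:

    ```python
    def call_llm(messages: list[dict]) -> str:
        """Hypothetical chat-completion client."""
        raise NotImplementedError

    def run_isolated_task(system_prompt: str, task: str) -> str:
        # No prior turns, no memory, no custom instructions: the model sees
        # only this task, so earlier chats cannot prime or poison the context.
        return call_llm([
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": task},
        ])
    ```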

  • Lalit Kundu

    US Healthcare | Wharton MBA | ex-Google Leader

    37,820 followers

    Much has been said about the “𝗹𝗼𝘀𝘁 𝗶𝗻 𝘁𝗵𝗲 𝗺𝗶𝗱𝗱𝗹𝗲” problem with LLM context. Here’s what we’ve learned putting an AI agent into production.

    For context: at Delty (YC X25), our agent helps enterprise engineering teams with tasks like building technical roadmaps across millions of lines of code, mapping microservices, or prepping exec reviews from dozens of docs at once. These conversations often span millions of tokens. LLMs, however, notoriously ignore the details that live in the middle of the context. Those details often just vanish from the model’s working memory.

    𝗦𝗼 𝗵𝗼𝘄 𝗱𝗼 𝘆𝗼𝘂 𝗳𝗶𝘅 𝗶𝘁? You don’t rely on raw context. You equip the LLM with lookup tools so it can retrieve the most relevant snippets on demand, and then insert them at the end of the conversation, where the model pays more attention (a sketch of this pattern follows below).

    𝗕𝘂𝘁 𝘄𝗮𝗶𝘁: doesn’t that just make the context grow endlessly with repetition? Yes. It’s inevitable. Which is why summarization becomes essential. You need to aggressively retain what matters most and collapse the rest. Otherwise, you’ll end up with a bloated context that weakens reasoning instead of strengthening it.

    ⚙️ 𝗧𝗮𝗸𝗲𝗮𝘄𝗮𝘆
    Solving “lost in the middle” isn’t just about bigger context windows. It’s about retrieval + summarization, working hand-in-hand to keep the right details alive.

    𝗙𝗼𝗼𝗱 𝗳𝗼𝗿 𝘁𝗵𝗼𝘂𝗴𝗵𝘁: Reading is easy. Building is hard. If this post got you thinking, my advice: don’t just follow blog posts — try building an agent yourself. You’ll hit walls faster than you expect, and that’s where the learning happens.

    I’ll keep sharing what we’re discovering at Delty as we push our AI staff engineer forward. 🚀 Want those raw lessons before I jot them down into posts here? Drop your email in comments and I'll send you my brain dump.
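    A rough sketch of the retrieval-plus-summarization pattern described above. `search_index` and `call_llm` are hypothetical placeholders; the point is where each piece lands in the context.

    ```python
    def search_index(query: str, k: int = 3) -> list[str]:
        """Hypothetical lookup tool over your code/document index."""
        raise NotImplementedError

    def call_llm(messages: list[dict]) -> str:
        """Hypothetical chat-completion client."""
        raise NotImplementedError

    def answer_with_lookup(summary_so_far: str, question: str) -> str:
        snippets = search_index(question)
        return call_llm([
            # Early context: only a compact summary, never the raw history.
            {"role": "system", "content": f"Conversation summary: {summary_so_far}"},
            {"role": "user", "content": question},
            # Tail of the context: freshly retrieved snippets, where the
            # model pays the most attention.
            {"role": "user", "content": "Relevant snippets:\n" + "\n---\n".join(snippets)},
        ])
    ```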

  • Woojin Kim

    LinkedIn Top Voice · Chief Strategy Officer & CMIO at HOPPR · CMO at ACR DSI · MSK Radiologist · Serial Entrepreneur · Keynote Speaker · Advisor/Consultant · Transforming Radiology Through Innovation

    11,016 followers

    🚨 Why do we need to move beyond single-turn task evaluation of large language models (LLMs)?

    🤔 I have long advocated for evaluation methods of LLMs and other GenAI applications in healthcare that reflect real clinical scenarios, rather than multiple-choice questions or clinical vignettes with medical jargon. For example, interactions between clinicians and patients typically involve multi-turn conversations.

    🔬 A study by Microsoft and Salesforce tested 200,000 AI conversations, using large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. They selected a total of 15 LLMs from eight model families: OpenAI (GPT-4o-mini, GPT-4o, o3, and GPT-4.1), Anthropic (Claude 3 Haiku, Claude 3.7 Sonnet), Google’s Gemini (Gemini 2.5 Flash, Gemini 2.5 Pro), Meta’s Llama (Llama3.1-8B-Instruct, Llama3.3-70B-Instruct, Llama 4 Scout), AI2 OLMo-2-13B, Microsoft Phi-4, Deepseek-R1, and Cohere Command-A.

    ❓ The results?
    ❌ Multi-turn conversations resulted in an average 39% drop in performance across six generation tasks.
    ❌ Their analysis of conversations revealed a minor decline in aptitude and a significant increase in unreliability.

    📉 Here's why LLMs stumble:
    • 🚧 Premature assumptions derail conversations.
    • 🗣️ Overly verbose replies confuse rather than clarify.
    • 🔄 Difficulty adapting after initial mistakes.

    😵‍💫 Simply put: When an AI goes off track early, it gets lost and does not recover.

    ✅ The authors advocate:
    • Multi-turn conversations must become a priority.
    • Better multi-turn testing is crucial. Single-turn tests just aren’t realistic.
    • Users should be aware of these limitations.

    🔗 Link to the original paper is in the first comment 👇

    #AI #ConversationalAI #LargeLanguageModels #LLMs

  • Elvis S.

    Founder at DAIR.AI | Angel Investor | Advisor | Prev: Meta AI, Galactica LLM, Elastic, Ph.D. | Serving 7M+ learners around the world

    85,577 followers

    LLMs Get Lost in Multi-turn Conversation

    The cat is out of the bag. Pay attention, devs. This is one of the most common issues when building with LLMs today. Glad there is now a paper to share insights. Here are my notes:

    The paper investigates how LLMs perform in realistic, multi-turn conversational settings where user instructions are often underspecified and clarified over several turns. I keep telling devs to spend time preparing those initial instructions. Prompt engineering is important.

    The authors conduct large-scale simulations across 15 top LLMs (including GPT-4.1, Gemini 2.5 Pro, Claude 3.7 Sonnet, DeepSeek-R1, and others) over six generation tasks (code, math, SQL, API calls, data-to-text, and document summarization).

    Severe Performance Drop in Multi-Turn Settings
    All tested LLMs show significantly worse performance in multi-turn, underspecified conversations compared to single-turn, fully specified instructions. The average performance drop is 39% across six tasks, even for SoTA models. For example, models with >90% accuracy in single-turn settings often drop to ~60% in multi-turn settings.

    Degradation Is Due to Unreliability, Not Just Aptitude
    The performance loss decomposes into a modest decrease in best-case capability (aptitude, -15%) and a dramatic increase in unreliability (+112%); see the sketch after this post. In multi-turn settings, the gap between the best and worst response widens substantially, meaning LLMs become much less consistent and predictable. High-performing models in single-turn settings are just as unreliable as smaller models in multi-turn dialogues. Don't ignore testing and evaluating in multi-turn settings.

    Main reasons LLMs get "lost":
    - Make premature and often incorrect assumptions early in the conversation.
    - Attempt full solutions before having all necessary information, leading to “bloated” or off-target answers.
    - Over-rely on their previous (possibly incorrect) answers, compounding errors as the conversation progresses.
    - Produce overly verbose outputs, which can further muddle context and confuse subsequent turns.
    - Pay disproportionate attention to the first and last turns, neglecting information revealed in the middle turns (“loss-in-the-middle” effect).

    Practical Recommendations:
    - Users are better off consolidating all requirements into a single prompt rather than clarifying over multiple turns.
    - If a conversation goes off-track, starting a new session with a consolidated summary leads to better outcomes.
    - System builders and model developers are urged to prioritize reliability in multi-turn contexts, not just raw capability. This is especially true if you are building complex agentic systems, where the impact of these issues is more prevalent.
    - LLMs are really weird. And all this weirdness is creeping into the latest models too, but in more subtle ways. Be careful out there, devs.
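    To make the aptitude/unreliability split concrete, here is a small sketch that scores the same task over repeated runs and separates best-case capability from run-to-run spread. The percentile choices and the sample scores are illustrative, not the paper's exact estimators or data.

    ```python
    def percentile(scores: list[float], p: float) -> float:
        """Nearest-rank percentile over a list of scores."""
        s = sorted(scores)
        return s[int(p * (len(s) - 1))]

    def aptitude(scores: list[float]) -> float:
        """Best-case capability: a high percentile of repeated-run scores."""
        return percentile(scores, 0.9)

    def unreliability(scores: list[float]) -> float:
        """Spread between best- and worst-case runs of the same task."""
        return percentile(scores, 0.9) - percentile(scores, 0.1)

    single_turn = [92, 95, 90, 93, 94, 91, 96, 92, 95, 93]  # tight cluster
    multi_turn = [85, 40, 72, 90, 35, 66, 88, 41, 79, 55]   # wild spread
    print(aptitude(single_turn), unreliability(single_turn))  # high peak, small gap
    print(aptitude(multi_turn), unreliability(multi_turn))    # similar peak, huge gap
    ```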

  • Pascal Biese

    AI Lead at PwC </> Daily AI highlights for 80k+ experts 📲🤗

    85,064 followers

    New study from Microsoft finds LLM accuracy drops very quickly in conversation. LLMs ace single-shot prompts - but get lost in two-turn conversations.

    Whether you’re asking an AI to write code or summarize documents, you’ll naturally refine instructions over several messages. Yet benchmarks still focus on one-and-done prompts.

    The paper from Microsoft & Salesforce introduces a “sharded conversation” simulator that feeds one requirement per turn across six generation tasks - from coding and math to SQL, API calls, table captions and summaries - and tests 15 leading models (a sketch of the sharding idea follows below). By measuring best-case capability (“aptitude”) and stability (“reliability”), the authors show an average 39% performance drop, driven by doubled unreliability. Even agent-style recaps or deterministic decoding recoup only a fraction of the loss.

    By spotlighting multi-turn fragility, they highlight a new path for LLM evaluation: conversation consistency - not just raw accuracy.

    ↓
    𝐖𝐚𝐧𝐭 𝐭𝐨 𝐤𝐞𝐞𝐩 𝐮𝐩? Join my newsletter with 50k+ readers and be the first to learn about the latest AI research: llmwatch.com 💡
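    To illustrate what “sharding” means here: one fully specified instruction becomes a list of requirement shards, revealed one per turn. The sentence-splitting heuristic below is a stand-in; the paper's shards were authored per task rather than auto-split.

    ```python
    def shard_instruction(full_instruction: str) -> list[str]:
        """Naive stand-in: treat each sentence as one requirement shard."""
        parts = [s.strip() for s in full_instruction.split(".") if s.strip()]
        return [p + "." for p in parts]

    full = ("Write a SQL query over the orders table. "
            "Only include orders from 2024. "
            "Return total revenue per customer, sorted descending.")

    # In a sharded simulation, each shard becomes one user turn.
    for i, shard in enumerate(shard_instruction(full), start=1):
        print(f"turn {i}: {shard}")
    ```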
