LLM Accuracy in Complex Context Scenarios

Explore top LinkedIn content from expert professionals.

Summary

LLM accuracy in complex context scenarios refers to how reliably large language models (LLMs) can interpret and respond to tasks that involve multiple steps, evolving information, or nuanced decision-making—especially when interacting with real users. These scenarios often cause LLMs to lose track of important details or provide inconsistent answers, highlighting the challenges of maintaining context and clarity in conversational environments.

  • Prioritize human interaction: Test your models with real users to uncover gaps in accuracy and reliability that simulations and benchmarks may overlook.
  • Engineer smarter context: Carry forward key decisions and relevant information while filtering out unnecessary details to improve long-term consistency in complex workflows.
  • Adopt iterative retrieval: Use step-by-step approaches for gathering data, such as knowledge graph-based retrieval, to help LLMs navigate multi-step reasoning and avoid information overload.
Summarized by AI based on LinkedIn member posts
  • View profile for Akash Sharma

    CEO at Vellum

    16,076 followers

    🧠 If you're building apps with LLMs, this paper is a must-read. Researchers at Microsoft and Salesforce recently released LLMs Get Lost in Multi-Turn Conversation — and the findings resonate with our experience at Vellum.

    They ran 200,000+ simulations across 15 top models, comparing performance on the same task in two modes:
    - Single-turn (user provides a well-specified prompt upfront)
    - Multi-turn (user reveals task requirements gradually — like real users do)

    The result?
    ✅ 90% avg accuracy in single-turn
    💬 65% avg accuracy in multi-turn
    🔻 -39% performance drop across the board
    😬 Unreliability more than doubled

    Even the best models get lost when the task unfolds over multiple messages. They latch onto early assumptions, generate bloated answers, and fail to adapt when more info arrives.

    For application builders, this changes how we think about evaluation and reliability:
    - One-shot prompt benchmarks ≠ user reality
    - Multi-turn behavior needs to be a first-class test case
    - Agents and wrappers won’t fix everything — the underlying model still gets confused

    This paper validates something we've seen in the wild: the moment users interact conversationally, reliability tanks — unless you're deliberate about managing context, fallback strategies, and prompt structure.

    📌 If you’re building on LLMs, read this. Test differently. Optimize for the real-world path, not the happy path.
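    For teams who want to test this themselves, here is a minimal sketch of a single-turn versus multi-turn ("sharded") evaluation in the spirit of the setup described above, where the same task is either given upfront or revealed one piece per turn. The `call_model` and `score_answer` helpers are hypothetical stand-ins for your own model client and task-specific grader, not anything from the paper.

```python
# Minimal sketch: compare single-turn vs. multi-turn performance on the same task.
# `call_model` and `score_answer` are hypothetical placeholders.

def call_model(messages: list[dict]) -> str:
    """Hypothetical chat-completion call; swap in your provider's SDK."""
    raise NotImplementedError

def score_answer(answer: str, reference: str) -> float:
    """Hypothetical task-specific grader returning accuracy in [0, 1]."""
    raise NotImplementedError

def single_turn_eval(full_prompt: str, reference: str) -> float:
    # The fully specified task is given up front in one message.
    answer = call_model([{"role": "user", "content": full_prompt}])
    return score_answer(answer, reference)

def multi_turn_eval(shards: list[str], reference: str) -> float:
    # The same task, but requirements are revealed one shard per turn,
    # the way real users drip-feed constraints.
    messages: list[dict] = []
    answer = ""
    for shard in shards:
        messages.append({"role": "user", "content": shard})
        answer = call_model(messages)
        messages.append({"role": "assistant", "content": answer})
    return score_answer(answer, reference)  # grade only the final answer
```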

  • View profile for David Riedman

    🤖PhD in AI (variance in LLMs), 🎓Professor, 📊Founder of K-12 School Shooting Database, 🥋BJJ Coach, 🎙️Riedman Report Podcast

    11,816 followers

    After 6 years and 6,000 miles between the starting point and the end point, this morning I successfully defended my PhD dissertation (LLMs Versus Human Experts: Mixed Methods Analysis Measuring Variance in School Shooting Threat Assessments).

    Abstract: This study measures the unwanted variability in expert judgments by testing six frontier large language models (LLMs) on fictitious but realistic school shooting scenarios derived from 2,000 real threats made in the United States. Using prior decision science research, this dissertation measures the accuracy and consistency of assessments by LLMs compared to a prior study of 245 human law enforcement officers who rated the severity of the same six threat scenarios. Quantitative results demonstrate that the LLMs produced severity ratings within one point of the mean of ratings by human experts (ΔM ≤ 1) with no statistically significant differences (p > .05), supporting the primary hypothesis that LLMs can approximate the accuracy of human threat assessments. Aggregate LLM scores displayed lower variance than human ratings, showing both a wisdom-of-crowds effect and reduced judgment noise. Qualitative thematic analysis of narrative explanations revealed that LLMs consistently focused on specific aspects of the threats, while humans were influenced by lived experiences, adherence to formal procedures, and personal assumptions. The findings suggest that LLMs can enhance the reliability of school-based threat assessments as part of a human-LLM hybrid team, or as the sole assessor in under-resourced or rural schools that lack trained human experts. This study contributes to the fields of behavioral economics, decision science, and violence prevention, and serves as a framework for comparing the assessment abilities of LLMs to those of humans.
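    As a rough illustration of the comparison the abstract describes (mean difference, a significance test, and a variance check between LLM and human severity ratings), here is a small sketch; the numbers and helper choices are made up for illustration and are not the dissertation's data or code.

```python
# Illustrative sketch (not the dissertation's data): compare LLM vs. human ratings.
import numpy as np
from scipy import stats

# Hypothetical 1-10 severity ratings for a single threat scenario.
human_ratings = np.array([6, 7, 5, 8, 6, 7, 4, 6, 7, 5])  # e.g. officer ratings
llm_ratings = np.array([6, 6, 7, 6, 7, 6])                 # e.g. six frontier models

delta_mean = abs(llm_ratings.mean() - human_ratings.mean())           # ΔM ≤ 1 criterion
t_stat, p_value = stats.ttest_ind(llm_ratings, human_ratings, equal_var=False)

print(f"ΔM = {delta_mean:.2f}, p = {p_value:.3f}")  # p > .05 suggests no significant gap
print(f"variance: humans {human_ratings.var(ddof=1):.2f} vs LLMs {llm_ratings.var(ddof=1):.2f}")
```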

  • View profile for Bhargav Patel, MD, MBA

    AI x Healthcare | Bridging Medicine & AI for Clinicians, Founders, Engineers & Health Systems | Physician-Innovator | Medical AI Research | Psychiatrist | Upcoming Books: Trauma Transformed & Future of AI in Healthcare

    10,503 followers

    LLMs scored 95% on identifying medical conditions when tested alone. When real people used them for medical advice, accuracy dropped to 35%.

    A new randomized study in Nature Medicine tested whether large language models actually help the public make better medical decisions. 1,298 participants were given medical scenarios and asked to identify conditions and recommend next steps.

    GPT-4o, Llama 3, and Command R+ all performed well when directly prompted. They identified relevant conditions in 94.9% of cases and recommended correct disposition in 56.3% on average. But when participants used these same models for assistance, condition identification dropped below 34.5% and disposition accuracy fell to 44.2% (no better than the control group using search engines).

    The gap wasn't medical knowledge. It was interaction. Researchers analyzed conversation transcripts and found users provided incomplete information to models. Models sometimes misinterpreted context or gave inconsistent advice. Even when models suggested correct conditions, users didn't consistently follow recommendations.

    Standard medical benchmarks didn't predict this. Models achieved passing scores (>60%) on MedQA questions matched to scenarios but still failed in interactive testing. Performance on structured exams was largely uncorrelated to performance with real users.

    Simulated patient interactions didn't predict it either. When researchers replaced humans with LLM-simulated users, simulated users performed better (57.3% vs 44.2%) and showed less variation. Simulations were only weakly predictive of human behavior.

    Here’s what this means: Benchmark performance is necessary but insufficient. A model scoring 80% on medical licensing exams can produce 20% accuracy when paired with real users. The constraint isn't algorithmic capability. It's human-AI interaction design. Users don't know what information to provide. Models don't ask the right clarifying questions. Correct suggestions get lost in conversation.

    For clinicians: expect patients to arrive with AI-informed conclusions that may not be accurate. Patients using LLMs were no better at assessing clinical acuity than those using traditional methods.

    For developers: user testing with real humans must precede deployment. Simulations and benchmarks don't capture interaction failures.

    AI excels at medical exams. But medicine isn't a multiple-choice test. It's a conversation under uncertainty.

    — Source: Nature Medicine - "Reliability of LLMs as medical assistants for the general public"
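    For readers curious what an "LLM-simulated user" setup can look like in practice, here is a hedged sketch of one possible loop: one model role-plays the patient who knows the scenario, another plays the assistant, and the transcript is graded afterwards. The `ask_assistant` and `ask_patient` helpers are hypothetical placeholders, not the study's code, and, as the post notes, such simulations only weakly predicted real user behavior.

```python
# Hedged sketch of an LLM-simulated user loop; all helpers are hypothetical.

def ask_assistant(history: list[tuple[str, str]], patient_msg: str) -> str:
    """Hypothetical: the assistant model replies given the conversation so far."""
    raise NotImplementedError

def ask_patient(scenario: str, history: list[tuple[str, str]], assistant_msg: str) -> str:
    """Hypothetical: a second LLM role-plays a layperson living the scenario."""
    raise NotImplementedError

def simulate_consultation(scenario: str, max_turns: int = 6) -> list[tuple[str, str]]:
    history: list[tuple[str, str]] = []
    patient_msg = "I'm not feeling well and would like some advice."
    for _ in range(max_turns):
        history.append(("patient", patient_msg))
        assistant_msg = ask_assistant(history, patient_msg)
        history.append(("assistant", assistant_msg))
        patient_msg = ask_patient(scenario, history, assistant_msg)
    return history  # grade afterwards: did the assistant name the right condition and disposition?
```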

  • View profile for Aishwarya Srinivasan
    627,957 followers

    One of the biggest challenges I see with scaling LLM agents isn’t the model itself. It’s context. Agents break down not because they “can’t think” but because they lose track of what’s happened, what’s been decided, and why.

    Here’s the pattern I notice:
    👉 For short tasks, things work fine. The agent remembers the conversation so far, does its subtasks, and pulls everything together reliably.
    👉 But the moment the task gets longer, the context window fills up, and the agent starts forgetting key decisions. That’s when results become inconsistent, and trust breaks down.

    That’s where Context Engineering comes in.

    🔑 Principle 1: Share Full Context, Not Just Results
    Reliability starts with transparency. If an agent only shares the final outputs of subtasks, the decision-making trail is lost. That makes it impossible to debug or reproduce. You need the full trace, not just the answer.

    🔑 Principle 2: Every Action Is an Implicit Decision
    Every step in a workflow isn’t just “doing the work”, it’s making a decision. And if those decisions conflict because context was lost along the way, you end up with unreliable results.

    ✨ The Solution to this is "Engineer Smarter Context"
    It’s not about dumping more history into the next step. It’s about carrying forward the right pieces of context:
    → Summarize the messy details into something digestible.
    → Keep the key decisions and turning points visible.
    → Drop the noise that doesn’t matter.

    When you do this well, agents can finally handle longer, more complex workflows without falling apart. Reliability doesn’t come from bigger context windows. It comes from smarter context windows.

    〰️〰️〰️
    Follow me (Aishwarya Srinivasan) for more AI insight and subscribe to my Substack to find more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg
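    One possible shape for that "smarter context" idea, sketched under the assumption of a single hypothetical `llm_summarize` helper: keep key decisions verbatim, roll everything else into a short digest, and hand only that to the next step.

```python
# Minimal sketch of context compaction for an agent workflow.
# `llm_summarize` is a hypothetical helper, not a specific library call.
from dataclasses import dataclass, field

def llm_summarize(text: str) -> str:
    """Hypothetical call that asks an LLM for a tight summary of prior work."""
    raise NotImplementedError

@dataclass
class AgentContext:
    decisions: list[str] = field(default_factory=list)  # turning points, kept verbatim
    summary: str = ""                                   # rolling digest of everything else
    scratch: list[str] = field(default_factory=list)    # recent raw step outputs

    def record(self, step_output: str, decision: str | None = None) -> None:
        self.scratch.append(step_output)
        if decision:
            self.decisions.append(decision)
        if len(self.scratch) > 5:  # compaction threshold (arbitrary for the sketch)
            self.summary = llm_summarize(self.summary + "\n" + "\n".join(self.scratch))
            self.scratch.clear()

    def for_next_step(self) -> str:
        # What the next subtask actually sees: decisions first, digest second, recent detail last.
        return (
            "Key decisions so far:\n- " + "\n- ".join(self.decisions)
            + "\n\nSummary of prior work:\n" + self.summary
            + "\n\nRecent steps:\n" + "\n".join(self.scratch)
        )
```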

  • View profile for Kuldeep Singh Sidhu

    Senior Data Scientist @ Walmart | BITS Pilani

    16,023 followers

    Excited to share a groundbreaking advancement in AI: KG-IRAG! I've been diving into this fascinating paper on Knowledge Graph-Based Iterative Retrieval-Augmented Generation (KG-IRAG), a novel framework that significantly enhances how Large Language Models (LLMs) handle complex temporal reasoning tasks. >> What makes KG-IRAG special? Unlike traditional RAG methods that often struggle with multi-step reasoning, KG-IRAG introduces an iterative approach that progressively gathers relevant data from knowledge graphs. This is particularly powerful for queries involving temporal dependencies - like planning trips based on weather conditions or traffic patterns. The framework employs two collaborative LLMs: - LLM1 identifies the initial exploration plan and generates a reasoning prompt - LLM2 evaluates retrieved data and determines if further retrieval steps are needed The magic happens in the iterative retrieval process, where the system incrementally explores the knowledge graph, retrieving only what's needed when it's needed. This prevents information overload while ensuring comprehensive data collection. >> Technical implementation details: The researchers constructed knowledge graphs treating time, location, and event status as key entities, with relationships capturing temporal, spatial, and event-based correlations. Their approach models time as an entity for easier retrieval and reasoning. The system follows a sophisticated algorithm where: 1. Initial time/location parameters are identified 2. Relevant triplets are retrieved from the KG 3. LLM2 evaluates if current data is sufficient 4. If insufficient, search criteria are adjusted based on detected "abnormal events" 5. This continues until enough information is gathered to generate an accurate answer >> Impressive results: The researchers at UNSW, MBZUAI and TII evaluated KG-IRAG on three custom datasets (weatherQA-Irish, weatherQA-Sydney, and trafficQA-TFNSW), demonstrating significant improvements in accuracy for complex temporal reasoning tasks compared to standard RAG methods. What's particularly impressive is how KG-IRAG outperforms other approaches on questions requiring dynamic temporal reasoning - like determining the latest time to leave early or earliest time to leave late to avoid adverse conditions. This work represents a significant step forward in making LLMs more capable of handling real-world temporal reasoning tasks. Excited to see how this technology evolves!
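    Here is a rough sketch of that iterative loop as described in the post (not the authors' implementation); every helper function is a hypothetical stand-in for one of the two LLM roles or the knowledge-graph store.

```python
# Rough sketch of a KG-IRAG-style iterative retrieval loop; helpers are hypothetical.

def plan_query(question: str) -> dict:
    """Hypothetical LLM1 call: extract initial time/location parameters and a plan."""
    raise NotImplementedError

def retrieve_triplets(query: dict) -> list[tuple[str, str, str]]:
    """Hypothetical KG lookup returning (head, relation, tail) triplets."""
    raise NotImplementedError

def is_sufficient(question: str, evidence: list) -> bool:
    """Hypothetical LLM2 call: is the retrieved evidence enough to answer?"""
    raise NotImplementedError

def refine_query(question: str, evidence: list) -> dict:
    """Hypothetical LLM2 call: adjust search criteria around detected abnormal events."""
    raise NotImplementedError

def answer(question: str, evidence: list) -> str:
    """Hypothetical final generation grounded in the collected triplets."""
    raise NotImplementedError

def kg_irag(question: str, max_iters: int = 5) -> str:
    query = plan_query(question)                  # step 1: initial parameters
    evidence: list[tuple[str, str, str]] = []
    for _ in range(max_iters):
        evidence += retrieve_triplets(query)      # step 2: pull only what this step needs
        if is_sufficient(question, evidence):     # step 3: evaluate sufficiency
            break
        query = refine_query(question, evidence)  # step 4: adjust and iterate
    return answer(question, evidence)             # step 5: generate the grounded answer
```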

  • View profile for Sohrab Rahimi

    Director, AI/ML Lead @ Google

    23,608 followers

    A debate is quietly reshaping how we think about reasoning in LLMs, and it has real implications for how we build AI systems today.

    In 𝗧𝗵𝗲 𝗜𝗹𝗹𝘂𝘀𝗶𝗼𝗻 𝗼𝗳 𝗧𝗵𝗶𝗻𝗸𝗶𝗻𝗴, recently published by Apple, researchers tested reasoning-augmented LLMs on structured problems like Tower of Hanoi, River Crossing, and Blocks World. The results were sharp. As task complexity increased, even models trained for reasoning began to fail. Performance dropped, not just in output quality, but in the effort models applied to thinking. The conclusion: reasoning in LLMs may appear to exist on the surface, but collapses when deeper, compositional logic is required. They argue that we should not mistake verbal fluency for true reasoning capability.

    A recent response, 𝗧𝗵𝗲 𝗜𝗹𝗹𝘂𝘀𝗶𝗼𝗻 𝗼𝗳 𝘁𝗵𝗲 𝗜𝗹𝗹𝘂𝘀𝗶𝗼𝗻 𝗼𝗳 𝗧𝗵𝗶𝗻𝗸𝗶𝗻𝗴, offers a different angle. The authors do not dispute that models fail in some of these tasks. But they show that many of those failures are a result of poor task design. Some models were asked to generate outputs that exceeded their token limits. Others were penalized for correctly stating that a task had no solution. When tasks were reframed more realistically, such as asking the model to generate an algorithm instead of every step, models performed well. Their conclusion is that what looks like reasoning failure is often a mismatch between evaluation expectations and what the model is actually being asked to do.

    Taken together, these papers provide a much-needed framework for thinking about when LLMs and reasoning-focused models (LRMs) are useful and where they are not.

    For simple tasks like summarization, retrieval, or classification, classic LLMs work well. They are fast, general, and effective. Adding reasoning often adds cost and confusion without improving performance.

    For medium-complexity tasks like applying policy logic, referencing context, or handling multi-turn interactions, LRMs offer clear value. Their planning ability, when structured well, improves accuracy and consistency.

    For complex tasks like symbolic reasoning, recursive planning, or solving puzzles with deep constraints, both LLMs and LRMs fail more often than they succeed. They either give up early, apply shallow logic, or lose coherence midway. These tasks require additional architecture: modular agents, memory-aware execution, or fallback control.

    Take contact center automation as an example. For routine account questions, classic LLMs may suffice. For dynamic policy explanation or billing disputes, LRMs can help. For high-stakes calls involving eligibility, compliance, or contract negotiation, more structure is required. But this is just one example.

    The bigger lesson is this. We should stop assuming reasoning scales cleanly with model size or prompt complexity. It does not. Reasoning has limits, and those limits depend on how we frame the task, what we ask the model to output, and how we measure success.
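    A simple way to act on that three-tier framing is to route requests by task complexity before choosing a model or pipeline. The sketch below is one illustration of the post's recommendation, with hypothetical `call_fast_llm`, `call_reasoning_model`, and `run_with_validation_and_fallback` helpers.

```python
# Illustrative complexity-based routing; all model calls are hypothetical stubs.
from enum import Enum

def call_fast_llm(task: str) -> str:
    """Hypothetical call to a classic, non-reasoning LLM."""
    raise NotImplementedError

def call_reasoning_model(task: str) -> str:
    """Hypothetical call to a reasoning-augmented model (LRM)."""
    raise NotImplementedError

def run_with_validation_and_fallback(plan: str) -> str:
    """Hypothetical structured pipeline: modular agents, checks, human fallback."""
    raise NotImplementedError

class Complexity(Enum):
    SIMPLE = "simple"    # summarization, retrieval, classification
    MEDIUM = "medium"    # policy logic, context-heavy or multi-turn work
    COMPLEX = "complex"  # deep constraints, recursion, high-stakes decisions

def route(task: str, complexity: Complexity) -> str:
    if complexity is Complexity.SIMPLE:
        return call_fast_llm(task)         # fast and cheap; extra reasoning adds little
    if complexity is Complexity.MEDIUM:
        return call_reasoning_model(task)  # planning ability pays off here
    # Complex: don't trust a single model call; decompose and add structure.
    plan = call_reasoning_model("Break this into small, verifiable steps:\n" + task)
    return run_with_validation_and_fallback(plan)
```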

  • View profile for Brooke Hopkins

    Founder @ Coval | ex-Waymo

    11,146 followers

    LLMs Get Lost in Multi-Turn Conversations: New Research Reveals Major Reliability Gap

    Just read a fascinating new paper from Microsoft and Salesforce Research revealing a critical flaw in today's LLMs: they dramatically underperform in multi-turn conversations compared to single-turn interactions.

    📊 Key findings:
    🔗 LLMs suffer an average 39% performance drop in multi-turn settings across six generation tasks
    🔗 This occurs even in conversations with as few as two turns
    🔗 The problem affects ALL tested models, including the most advanced ones (Claude 3.7, GPT-4.1, Gemini 2.5)

    🔍 The researchers call this the "lost in conversation" phenomenon - when LLMs take a wrong turn in conversation, they get lost and don't recover. This is caused by:
    🔗 Making assumptions too early
    🔗 Prematurely generating final solutions
    🔗 Relying too heavily on previous (incorrect) answers
    🔗 Producing overly verbose responses

    💬 Why conversation-level evaluation matters: Traditional LLM benchmarks focus on single-turn performance, creating a dangerous blind spot. Real-world AI interactions are conversational by nature, and this research shows that even the most capable models struggle with maintaining context and adapting to new information over multiple turns. Without robust conversation-level evaluation, we risk deploying systems that perform brilliantly in lab tests but frustrate users in practice.

    🔎 At Coval, this is exactly what we focus on: evaluating LLMs in realistic conversational scenarios rather than isolated prompts. By measuring how models handle the natural flow of information across turns, we can identify reliability issues before they impact users and guide development toward truly conversational AI.

    This research highlights a critical gap between how we evaluate LLMs (single-turn) versus how we use them in practice (multi-turn). As we build AI assistants and agents, addressing this reliability issue becomes essential.
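    One practical way to surface this gap in your own evals is to run the same conversational task many times and report the spread between good and bad runs, not just the mean. The sketch below is a simplified stand-in for the percentile-style aptitude/unreliability framing used in the paper; the function and key names are illustrative, not the paper's.

```python
# Illustrative conversation-level reliability report over repeated simulations.
import statistics

def reliability_report(scores: list[float]) -> dict:
    ordered = sorted(scores)
    p10 = ordered[int(0.10 * (len(ordered) - 1))]
    p90 = ordered[int(0.90 * (len(ordered) - 1))]
    return {
        "mean": statistics.mean(ordered),
        "aptitude_p90": p90,                       # what the model achieves on a good run
        "unreliability_p90_minus_p10": p90 - p10,  # how far bad runs fall behind good ones
    }

# e.g. reliability_report(scores) over N repeated multi-turn simulations of one task
```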

  • View profile for Neda Nasiriani

    GenAI Principal Architect @ The Friedkin Group | Leading Agentic AI Strategy and Value Creation

    2,195 followers

    Lesson 1: When 100% Accuracy Is Non-Negotiable, LLMs Are Not the Answer

    Let’s start with the most important truth about using LLMs in enterprise applications: If your business use case demands 100% accuracy, you should not rely on LLMs to make final decisions.

    This isn’t a limitation you can prompt-engineer away. You can ask the model nicely. You can beg it not to hallucinate. You can say “please be accurate” or “don’t make things up.” It won’t matter.

    LLMs are probabilistic — not deterministic. They generate outputs based on patterns in their training data, not guaranteed truths. That means:
    • They can hallucinate.
    • They can contradict themselves.
    • They often sound confident… even when wrong.

    In high-stakes environments like finance, healthcare, legal, or compliance, “close enough” is not enough. You need real guarantees — not statistical guesses.

    That doesn’t mean LLMs are useless. It means they need to be used responsibly:
    • As co-pilots, not pilots
    • Paired with rule-based systems
    • Wrapped in validations and guardrails
    • Reviewed by humans or checked against authoritative data

    In this series, I’ll be sharing real-world lessons from building LLM-powered enterprise applications — starting with the foundational one: know the limits before scaling the hype.

    #LLM #EnterpriseAI #GenAI #AccuracyMatters #AIinBusiness #ResponsibleAI #LLMApplications #AIProduct
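    A minimal sketch of that "co-pilot, not pilot" pattern, assuming a hypothetical `draft_with_llm` call and purely illustrative rules: the LLM drafts, deterministic checks validate, and anything that fails is routed to a human rather than auto-executed.

```python
# Minimal sketch: LLM drafts, rule-based checks validate, humans review failures.

def draft_with_llm(case: dict) -> dict:
    """Hypothetical LLM call returning a structured draft decision."""
    raise NotImplementedError

def passes_rules(case: dict, draft: dict) -> bool:
    # Deterministic, auditable checks that encode the non-negotiable policy
    # (the specific fields here are illustrative only).
    if draft.get("amount", 0) > case.get("approved_limit", 0):
        return False
    if draft.get("cited_policy") not in case.get("valid_policies", []):
        return False
    return True

def decide(case: dict) -> dict:
    draft = draft_with_llm(case)
    if passes_rules(case, draft):
        return {"status": "auto_approved", "decision": draft}
    return {"status": "needs_human_review", "decision": draft}  # never auto-act on a failed check
```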

  • View profile for George Hurn-Maloney

    Co-Founder @ Fastino

    8,080 followers

    We published a case study on LLM inadequacy in healthcare last week. This week, a Nature Medicine article reinforced our findings.

    Luc Rocher and colleagues from the Oxford Internet Institute, University of Oxford published an article in Nature Medicine testing GPT-4o, Llama 3, and Command R+ with 1,298 people across 10 medical scenarios. The results reveal what the authors call a “translation gap.”

    When the researchers fed the models clean, structured data in the form of Standardized Medical Scenarios (SMS), they identified medical conditions with an average of 94.9% accuracy. However, when they used the same models to identify medical conditions in a chatbot scenario (with less structured data and more "noise"), they were only 34.9% accurate. Participants who used a chatbot identified conditions in less than 34.5% of cases, and the right course of action in less than 44.2%. This demonstrates that LLMs are excellent at encoding medical knowledge but quite poor at generating it.

    The researchers found that the LLMs were highly sensitive to user bias and tended to agree with the user’s assessment of the situation significantly more often than they should. This is unsurprising, given recent findings about LLM sycophancy. They also found that in chatbot scenarios, the LLMs were sensitive to even very slight variations in how users phrased questions, demonstrating overall brittleness and unreliability in medical language generation.

    The Nature study shows exactly why this matters: LLMs are excellent encoders of medical knowledge but poor generators in practice. This paper underscores one of the most critical success patterns we're seeing in AI right now: model architectures must be matched to their downstream tasks. Fastino Labs's GLiNER2 excels at encoding and extracting information, not generating erroneous advice.

    Links to the Nature Medicine paper and our blog post below.
    🔗 Nature Medicine paper: https://lnkd.in/gesYWrVw
    🔗 Blog: https://lnkd.in/gcNmnA8T
