Fascinating new research comparing Long Context LLMs vs RAG approaches! A comprehensive study by researchers from Nanyang Technological University Singapore and Fudan University reveals key insights into how these technologies perform across different scenarios. After analyzing 12 QA datasets with over 19,000 questions, here's what they discovered.

Key Technical Findings:
- Long Context (LC) models excel at processing Wikipedia articles and stories, achieving 56.3% accuracy compared to RAG's 49.0%
- RAG shows superior performance in dialogue-based contexts and with fragmented information
- RAPTOR, a hierarchical tree-based retrieval system, outperformed traditional chunk-based and index-based retrievers with 38.5% accuracy

Under the Hood: The study implements a novel three-phase evaluation framework:
1. Empirical retriever assessment across multiple architectures
2. Direct LC vs RAG comparison using filtered datasets
3. Granular analysis of performance patterns across different question types and knowledge sources

Most interesting finding: RAG exclusively answered 10% of questions that LC couldn't handle, suggesting these approaches are complementary rather than competitive. The research team also introduced an innovative question filtering methodology to ensure fair comparison by removing queries answerable through parametric knowledge alone. This work significantly advances our understanding of when to use each approach in production systems. A must-read for anyone working with LLMs or building RAG systems!
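To make that filtering step concrete, here is a minimal, hypothetical sketch of the idea: ask the model each question with no retrieved context, and drop any question it already answers correctly from parametric memory, so LC and RAG are only compared on questions that genuinely need the provided evidence. The function names, the containment-based grading, and the data layout below are illustrative assumptions, not the paper's actual prompts or grading procedure.

```python
# Hypothetical sketch of context-free question filtering, not the authors' code.
# `answer_without_context` is a placeholder for any chat-LLM call; the simple
# string-containment check stands in for whatever answer grading the paper used.

def answer_without_context(question: str) -> str:
    """Placeholder: query an LLM with the bare question and no retrieved passages."""
    raise NotImplementedError("wire this to your own LLM client")

def answerable_from_parametric_knowledge(question: str, gold_answer: str) -> bool:
    """True if the model gets the answer right without seeing any context."""
    prediction = answer_without_context(question)
    return gold_answer.lower() in prediction.lower()  # crude containment grading

def filter_questions(dataset: list[dict]) -> list[dict]:
    """Keep only questions the model cannot answer from memory alone."""
    return [
        ex for ex in dataset
        if not answerable_from_parametric_knowledge(ex["question"], ex["answer"])
    ]
```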
Evaluating LLM Accuracy on Familiar and Rare Information
Summary
Evaluating LLM accuracy on familiar and rare information means testing how well large language models answer questions about common facts versus less-known or new topics. This research is crucial for understanding both the strengths and limitations of AI, especially when people rely on these models for important decisions like healthcare or technical advice.
- Compare approaches: When choosing between different AI methods, examine how each performs with both common and obscure information to see where strengths and weaknesses lie.
- Design for interaction: Make sure your system allows users to provide clear information and encourages models to ask clarifying questions when needed.
- Prioritize real testing: Test systems with actual users, not just benchmarks or simulations, since real-world performance often reveals gaps that aren't obvious in controlled conditions.
LLMs scored 95% on identifying medical conditions when tested alone. When real people used them for medical advice, accuracy dropped to 35%.

A new randomized study in Nature Medicine tested whether large language models actually help the public make better medical decisions. 1,298 participants were given medical scenarios and asked to identify conditions and recommend next steps. GPT-4o, Llama 3, and Command R+ all performed well when directly prompted: they identified relevant conditions in 94.9% of cases and recommended the correct disposition in 56.3% of cases on average. But when participants used these same models for assistance, condition identification dropped below 34.5% and disposition accuracy fell to 44.2% (no better than the control group using search engines).

The gap wasn't medical knowledge. It was interaction. Researchers analyzed conversation transcripts and found users provided incomplete information to models. Models sometimes misinterpreted context or gave inconsistent advice. Even when models suggested correct conditions, users didn't consistently follow recommendations.

Standard medical benchmarks didn't predict this. Models achieved passing scores (>60%) on MedQA questions matched to scenarios but still failed in interactive testing. Performance on structured exams was largely uncorrelated with performance with real users. Simulated patient interactions didn't predict it either: when researchers replaced humans with LLM-simulated users, the simulated users performed better (57.3% vs 44.2%) and showed less variation. Simulations were only weakly predictive of human behavior.

Here's what this means: benchmark performance is necessary but insufficient. A model scoring 80% on medical licensing exams can produce 20% accuracy when paired with real users. The constraint isn't algorithmic capability. It's human-AI interaction design. Users don't know what information to provide. Models don't ask the right clarifying questions. Correct suggestions get lost in conversation.

For clinicians: expect patients to arrive with AI-informed conclusions that may not be accurate. Patients using LLMs were no better at assessing clinical acuity than those using traditional methods.

For developers: user testing with real humans must precede deployment. Simulations and benchmarks don't capture interaction failures.

AI excels at medical exams. But medicine isn't a multiple-choice test. It's a conversation under uncertainty.

Source: Nature Medicine - "Reliability of LLMs as medical assistants for the general public"
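To illustrate the difference between the two evaluation modes the post describes, here is a minimal, hypothetical harness: direct prompting with the full scenario versus a conversation in which another LLM plays the user and drips out partial information. The client, model name, prompts, and turn structure are assumptions for illustration only; the study itself used real human participants, not this kind of simulated loop.

```python
# Hypothetical sketch contrasting one-shot evaluation with a multi-turn,
# simulated-user evaluation. Not the paper's protocol; prompts and model
# names are placeholders. Assumes the OpenAI Python SDK (>=1.0).
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o"    # placeholder assistant model

def ask(messages: list[dict]) -> str:
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content

def direct_eval(scenario: str) -> str:
    """Mode 1: the assistant sees the complete, clean scenario in one shot."""
    return ask([{"role": "user",
                 "content": f"Given this case, name the most likely condition "
                            f"and the next step:\n{scenario}"}])

def interactive_eval(scenario: str, turns: int = 3) -> str:
    """Mode 2: an LLM-simulated user drip-feeds the scenario over several
    turns, so the assistant only ever sees partial, noisier information."""
    chat: list[dict] = []                 # conversation as seen by the assistant
    assistant_msg = "What brings you in today?"
    for _ in range(turns):
        # Simulated user answers the assistant's last message in lay terms.
        user_msg = ask([{"role": "user",
                         "content": f"You are a layperson whose situation is:\n{scenario}\n"
                                    f"The assistant asked: '{assistant_msg}'. "
                                    f"Reply briefly and informally; you may omit details."}])
        chat.append({"role": "user", "content": user_msg})
        assistant_msg = ask(chat)
        chat.append({"role": "assistant", "content": assistant_msg})
    return assistant_msg
```

Scoring both outputs against the same gold answer is what exposes the gap: the same model, graded the same way, can look far weaker once the information arrives through conversation instead of a clean prompt.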
We published a case study on LLM inadequacy in healthcare last week. This week, a Nature Medicine article reinforced our findings.

Luc Rocher and colleagues from the Oxford Internet Institute, University of Oxford published an article in Nature Medicine testing GPT-4o, Llama 3, and Command R+ with 1,298 people across 10 medical scenarios. The results reveal what the authors call a “translation gap.”

When the researchers fed the models clean, structured data in the form of Standardized Medical Scenarios (SMS), the models identified medical conditions with an average of 94.9% accuracy. However, when the same models had to identify medical conditions in a chatbot scenario (with less structured data and more "noise"), they were only 34.9% accurate. Participants who used a chatbot identified conditions in less than 34.5% of cases, and the right course of action in less than 44.2%. This demonstrates that LLMs are excellent at encoding medical knowledge but quite poor at generating reliable medical guidance.

The researchers found that the LLMs were highly sensitive to user bias and tended to agree with the user’s assessment of the situation significantly more often than they should. This is unsurprising, given recent findings about LLM sycophancy. They also found that in chatbot scenarios, the LLMs were sensitive to even very slight variations in how users phrased questions, demonstrating overall brittleness and unreliability in medical language generation.

The Nature study shows exactly why this matters: LLMs are excellent encoders of medical knowledge but poor generators in practice. This paper underscores one of the most critical success patterns we're seeing in AI right now: model architectures must be matched to their downstream tasks. Fastino Labs' GLiNER2 excels at encoding and extracting information, not generating erroneous advice.

Links to the Nature Medicine paper and our blog post below.
🔗 Nature Medicine paper: https://lnkd.in/gesYWrVw
🔗 Blog: https://lnkd.in/gcNmnA8T
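As an illustration of the extraction-first pattern the post points to, here is a small sketch using the open-source GLiNER package. Note the hedges: GLiNER2's own interface may differ from this, and the labels, model checkpoint, and threshold below are arbitrary choices for illustration, not anything recommended by Fastino. The point is architectural: the model only tags spans that appear in the user's message; it never composes advice.

```python
# Sketch of extraction rather than generation, using the original GLiNER
# package (pip install gliner). GLiNER2's API may differ; labels, checkpoint,
# and threshold are illustrative assumptions.
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")

message = ("I've had a pounding headache for three days, some blurry vision, "
           "and my blood pressure meds ran out last week.")
labels = ["symptom", "duration", "medication"]

# The model can only label spans that literally occur in the text;
# it cannot fabricate a diagnosis or recommend a course of action.
for ent in model.predict_entities(message, labels, threshold=0.4):
    print(f'{ent["label"]:>10}: {ent["text"]} (score={ent["score"]:.2f})')
```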
🤔 Ever wonder why, after billions of dollars spent on LLMs, even the most advanced models still hallucinate?

A recent OpenAI paper (kudos to them for transparency) formalizes what genAI researchers have been grappling with from the start: hallucinations aren't random glitches or bad data ("garbage in, garbage out"). They are inherent to the technology. When an LLM hits missing or rare information, it has little reason to say "I'm not sure." In training, it gets rewarded for producing confident answers, even made-up ones.

Sounds like an easy fix then... why not just train them differently? The paper touches on potential post-training fixes, but my view is that these are band-aids that don't address a deeper issue: mathematical tractability at pre-training. To optimize billions of weights, we need algorithms that work efficiently at scale. Yet our best algorithms only work on "nice" mathematical functions, which limits the types of loss functions we can use for training. Loosely speaking, this translates to a bias where models prioritize bluffing over restraint.

What to do? While research is advancing on this, I advocate for human intuition using an in-sample vs out-of-sample lens. Well-known, stable answers like "What's the capital of France?" You can be confident. Rare, time-sensitive, or recent information? Be wary and verify! 🔍

What's the most confidently wrong answer you've gotten from an AI? Share below 👇 Paper link in comments.
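To see why "rewarded for producing confident answers" falls out of the scoring itself, here is a tiny back-of-the-envelope sketch. The numbers are made up for illustration and this is not the OpenAI paper's formalism; it simply shows that under accuracy-only grading an abstention scores zero, so even a low-probability guess has a higher expected score than admitting uncertainty.

```python
# Toy arithmetic, not the paper's formal model: under accuracy-only grading,
# guessing beats abstaining whenever the guess has any chance of being right.
def expected_score(p_correct: float, abstain: bool) -> float:
    """Expected grade for one question: 1 point if right, 0 if wrong or abstained."""
    return 0.0 if abstain else p_correct

for p in (0.9, 0.3, 0.05):  # well-known fact, fuzzy memory, essentially unknown
    guess = expected_score(p, abstain=False)
    idk = expected_score(p, abstain=True)
    print(f"p(correct)={p:.2f}  guess={guess:.2f}  say 'I'm not sure'={idk:.2f}")

# Even at p=0.05 the guess scores higher in expectation, so a model optimized
# against this kind of metric learns to bluff rather than hedge.
```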