Case Studies on AI System Accuracy


Summary

Case studies on AI system accuracy examine how artificial intelligence performs in real-world tasks, especially focusing on how often it gets things right or wrong. This research helps us understand the reliability and risks of AI, particularly when used in fields like healthcare, science, and customer service.

  • Review real-world errors: Take time to analyze how AI systems make mistakes, including omissions and costly actions, so you can spot hidden risks before deploying them widely.
  • Keep humans involved: Always include human oversight in critical workflows, as studies show that relying solely on AI can lead to more unnoticed errors and safety concerns.
  • Use model ensembles: Consider combining multiple AI models for important tasks, since consensus frameworks and multi-agent systems have been shown to boost accuracy and reliability without needing extensive human supervision.
Summarized by AI based on LinkedIn member posts
  • Bhargav Patel, MD, MBA

    AI x Healthcare | Bridging Medicine & AI for Clinicians, Founders, Engineers & Health Systems | Physician-Innovator | Medical AI Research | Psychiatrist | Upcoming Books: Trauma Transformed & Future of AI in Healthcare

    10,507 followers

    Even the best AI models produce severe clinical harm in 12-15% of cases. A new Stanford-Harvard study just evaluated 31 AI models on real physician-to-specialist cases across 10 specialties. The results should concern anyone using AI for clinical decisions.

    Here's what they found: The best-performing models (Gemini 2.5 Flash, Claude Sonnet 4.5, and AMBOSS's LiSA) produced about 12-15 severe errors per 100 cases. The worst models? Over 40 severe errors per 100 cases.

    And here's what makes this dangerous: 77% of severe harm came from omission… not recommending something overtly dangerous, but failing to order a critical test or missing an essential follow-up. These are the kinds of errors that look like nothing happened. Until something goes very wrong.

    The study used 100 real outpatient eConsult cases with 12,747 expert annotations. They measured safety, completeness, and restraint across diagnostic tests, medications, follow-up, and referrals.

    Two findings stood out: First, there's no correlation between clinical safety and model size, recency, or performance on popular benchmarks like MedQA. A model that scores well on multiple-choice medical exams doesn't necessarily make safe clinical recommendations. Second, multi-agent systems and RAG reduced harm substantially. Three-model heterogeneous ensembles had 6× greater odds of top-quartile safety performance compared to solo models (a minimal sketch of this ensemble pattern follows below).

    Here's why this matters: As LLMs improve, their errors become harder to spot. When AI is right 85% of the time, clinicians start trusting it more… and that's exactly when automation bias kicks in.

    This is why AI should not replace clinicians. Humans need to be the safeguard. AI can do some things better… like reading thousands of papers in seconds. But clinicians are better at catching what AI misses, especially errors of omission.

    First, understand the limitations of the specific AI tool you're using. Then use it in a way where humans review critical recommendations before they're acted on. The goal is to find synergies: where AI does better, where physicians do better, and where both can do better together.

    Check out their MAST: Medical AI Superintelligence Test: https://lnkd.in/gkfTQcbw

    *** Are you using AI for clinical decisions? How are you catching errors of omission? — Study: Wu et al., Stanford-Harvard evaluation of 31 AI models on clinical harm
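
    To make the ensemble finding concrete, here is a minimal sketch of a three-model heterogeneous ensemble with majority-vote consensus and escalation on disagreement. The model names and the `ask_model` helper are hypothetical placeholders, not the study's implementation.

    ```python
    # Sketch: three-model heterogeneous ensemble with consensus voting.
    # `ask_model` is a hypothetical stub; wire it to real model clients.
    from collections import Counter

    def ask_model(model: str, case: str) -> str:
        """Placeholder: query one model and return its recommendation."""
        raise NotImplementedError("connect your own model APIs here")

    def ensemble_recommendation(case: str, models=("model-a", "model-b", "model-c")):
        answers = [ask_model(m, case) for m in models]
        top, votes = Counter(answers).most_common(1)[0]
        if votes >= 2:                          # majority agreement
            return top, "consensus"
        return None, "escalate-to-clinician"    # disagreement -> human review
    ```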

  • Terezija Semenski, MSc

    Helping 300,000+ people master AI and Math fundamentals faster | LinkedIn [in]structor 15 courses | Author @ Math Mindset newsletter

    31,116 followers

    Everyone's playing with AI agents, from coding to customer service. Meanwhile, 31 researchers at Princeton have been working hard this year on infrastructure for fair agent evaluations on challenging benchmarks. This paper, "Holistic agent leaderboard: The missing infrastructure for AI agent evaluation", summarizes insights from 20,000+ agent rollouts on 9 challenging benchmarks spanning web, coding, science, and customer service tasks.

    The team evaluated 9 models across 4 areas (9 benchmarks), with 1-2 scaffolds per benchmark, totaling over 20,000 rollouts. This includes:
    1️⃣ coding (USACO, SWE-Bench Verified Mini)
    2️⃣ web (Online Mind2Web, AssistantBench, GAIA)
    3️⃣ science (CORE-Bench, ScienceAgentBench, SciCode)
    4️⃣ customer service (TauBench)

    This analysis uncovered many interesting insights:

    1) Higher reasoning effort does not lead to better accuracy in the majority of cases: where the authors ran the same model at different reasoning efforts (o4-mini, Claude 3.7, Claude 4.1), higher effort did not improve accuracy in 21 of 36 cases.

    2) Agents often take shortcuts rather than solving the task correctly: for example, to solve scientific reproduction tasks, agents would grep the Jupyter notebook and hard-code their guesses rather than reproducing the work. When they needed to solve web tasks, web agents would look up the benchmark on Hugging Face.

    3) Agents take actions that are extremely costly in deployment: on flight-booking tasks in TauBench, agents booked flights from the incorrect airport, refunded users more than necessary, and charged the incorrect credit card.

    4) Researchers analyzed the tradeoff between cost and accuracy: in the paper's plots, the red dotted line represents the Pareto frontier, capturing the models with the best accuracy at a given budget (see the sketch after this post). The 3 models most commonly on the frontier are Gemini 2.0 Flash (7 of 9 benchmarks), GPT-5 (4 of 9), and o4-mini Low (4 of 9). Surprisingly, the most expensive model (Opus 4.1) makes the frontier on only 1 of 9 benchmarks.

    5) The most token-efficient models are not the cheapest: comparing token usage vs. accuracy, Opus 4.1 is on the Pareto frontier for 3 benchmarks. This matters because providers change model prices frequently.

    6) They used TransluceAI Docent to log all the agent behaviors and analyze them: it uses LLMs to uncover specific actions the agent took. The team then conducted a systematic analysis of agent logs on 3 benchmarks: AssistantBench, SciCode, and CORE-Bench. This analysis allowed the research team to spot agents taking shortcuts and costly reliability failures.

    Forget the hype. Read this paper before building. Paper link in comments. What's stopping you from building agents that actually ship? ♻️ Repost to help someone skip the expensive mistakes
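
    The Pareto frontier in item 4 is easy to reproduce for your own runs. Here is a minimal sketch, assuming each run is recorded as a (cost, accuracy) pair; the example data points are illustrative, not the paper's numbers.

    ```python
    # Sketch: keep only runs not dominated by a cheaper-or-equal run with
    # higher accuracy. Input: list of (cost, accuracy) pairs.
    def pareto_frontier(points):
        frontier = []
        best_acc = float("-inf")
        # Sort by ascending cost; at equal cost, higher accuracy first.
        for cost, acc in sorted(points, key=lambda p: (p[0], -p[1])):
            if acc > best_acc:              # strictly better than anything cheaper
                frontier.append((cost, acc))
                best_acc = acc
        return frontier

    # Illustrative (cost-in-dollars, accuracy) points, not real benchmark data:
    runs = [(0.10, 0.41), (0.35, 0.57), (0.40, 0.52), (1.20, 0.61), (3.00, 0.58)]
    print(pareto_frontier(runs))  # -> [(0.1, 0.41), (0.35, 0.57), (1.2, 0.61)]
    ```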

  • Paolo Sironi

    Author | Podcaster | IBM Research Leader | International speaker

    46,771 followers

    💥 AI agents are becoming common in deployment. Yet they currently succeed through deliberate simplicity, not sophisticated autonomy. A group of researchers at Cornell University published a survey with 306 practitioners and 20 in-depth case studies about real production usage of AI agents. Here are the core findings:

    1️⃣ PATTERNS AND RELIABILITY
    ⦿ 68% of agents execute at most 10 steps before human intervention, with 47% completing fewer than 5 steps
    ⦿ 70% use off-the-shelf models with zero fine-tuning
    ⦿ 74% rely primarily on human evaluation
    ⦿ 80% of production cases use predefined, tightly scoped workflows
    👉 Scope is deliberately limited because reliability is the No. 1 goal: it's not easy to verify agents' correctness at scale.

    2️⃣ MODEL SELECTION
    ⦿ 17 out of 20 case studies use closed-source frontier models.
    👉 Open-source is only adopted when forced by extreme inference volume/cost or regulatory bans on sending data to external providers. Overall, model cost is trivial compared to the human expert time saved. Claude is the most selected option.

    3️⃣ AGENT vs HUMAN
    ⦿ 73% deploy agents to make humans 10× faster on manual tasks.
    ⦿ 66% tolerate response times of minutes (or longer) because it still crushes human baseline speed.
    👉 Productivity is the main adoption driver.

    4️⃣ AGENT VALIDATION
    ⦿ 75% of teams have no formal benchmarks at all — just user feedback.
    ⦿ Building internal benchmarks took one team 6 months for 100 examples.
    👉 Validation is still hard: non-determinism breaks traditional testing.

    5️⃣ AGENT RISK CONTROL
    ⦿ 74% use humans-in-the-loop for output evaluation
    ⦿ 52% use LLM-as-judge, but every single one also layers human verification on top
    👉 Typically, an LLM scores output confidence, auto-accepting high-confidence results. Everything else is routed to humans, together with a random 5% sample of auto-accepted outputs (a sketch of this routing pattern follows below).

    🔥 KEY CONSIDERATIONS
    Production AI agents that actually work in the real world are:
    ⦿ deliberately simple
    ⦿ narrowly scoped
    ⦿ deeply dependent on human oversight
    Teams willingly trade away autonomy because reliability is still the unsolved bottleneck. Truly robust risk management and governance for AI agents remain extremely hard to build, but they are the only way to scale AI agents enterprise-wide.

    📖 LEARN MORE
    📥 This empirical research can be found here: https://lnkd.in/dnUMpfrf
    📥 Here you can also find an IBM study that identifies 15 key risk management considerations for implementing agentic AI in banking: https://lnkd.in/dmqEKxZA
    #ArtificialIntelligence #AI #IBM
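
    A minimal sketch of the risk-control routing described in section 5️⃣, assuming a judge model that returns a confidence score in [0, 1]; the threshold, audit rate, and `judge_confidence` stub are illustrative assumptions, not the survey's prescription.

    ```python
    # Sketch: LLM-as-judge routing with a random audit sample of
    # auto-accepted outputs. `judge_confidence` is a hypothetical stub.
    import random

    CONFIDENCE_THRESHOLD = 0.9   # assumed cutoff for auto-acceptance
    AUDIT_RATE = 0.05            # random 5% of auto-accepted outputs

    def judge_confidence(output: str) -> float:
        """Placeholder: ask a judge LLM to score the output in [0, 1]."""
        raise NotImplementedError

    def route(output: str) -> str:
        if judge_confidence(output) >= CONFIDENCE_THRESHOLD:
            if random.random() < AUDIT_RATE:
                return "auto-accept+human-audit"   # spot-check sample
            return "auto-accept"
        return "human-review"                      # low confidence -> human
    ```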

  • José Manuel de la Chica

    Head of Global AI Lab at Santander | AI Research Leader

    15,835 followers

    AI meets Consensus: A New Consensus Framework that Makes Models More Reliable and Collaborative. This paper addresses the challenge of ensuring the reliability of LLMs in high-stakes domains such as healthcare, law, and finance. Traditional methods often depend on external knowledge bases or human oversight, which can limit scalability. To overcome this, the author proposes a novel framework that repurposes ensemble methods for content validation through model consensus.

    Key Findings:
    - Improved Precision: In tests involving 78 complex cases requiring factual accuracy and causal consistency, the framework increased precision from 73.1% to 93.9% with two models (95% CI: 83.5%-97.9%) and to 95.6% with three models (95% CI: 85.2%-98.8%).
    - Inter-Model Agreement: Statistical analysis showed strong inter-model agreement (κ > 0.76), indicating that while models often concurred, their independent errors could be identified through disagreements.
    - Scalability: The framework offers a clear pathway to further enhance precision with additional validators and refinements, suggesting its potential for scalable deployment.

    Relevance to Multi-Agent and Collaborative AI Architectures: This framework is particularly pertinent to multi-agent systems and collaborative AI architectures for several reasons:
    - Enhanced Reliability: By leveraging consensus among multiple models, the system can achieve higher reliability, which is crucial in collaborative environments where decisions are based on aggregated outputs.
    - Error Detection: The ability to detect errors through model disagreement allows for more robust systems where agents can cross-verify information, reducing the likelihood of propagating incorrect data.
    - Scalability Without Human Oversight: The framework's design minimizes the need for human intervention, enabling scalable multi-agent systems capable of operating autonomously in complex, high-stakes domains.

    In summary, the proposed ensemble validation framework offers a promising approach to improving the reliability of LLMs, with significant implications for the development of dependable multi-agent AI systems. https://lnkd.in/d8is44jk
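
    A minimal sketch of the consensus-validation idea, assuming each validator model returns a boolean verdict per claim; `validate_with_model` is a hypothetical stub, not the paper's code. Disagreement is the signal: independent errors surface when validators split.

    ```python
    # Sketch: N validator models each judge a claim; unanimity validates,
    # any split flags the claim for inspection.
    def validate_with_model(model: str, claim: str) -> bool:
        """Placeholder: ask one validator model whether the claim holds."""
        raise NotImplementedError

    def consensus_validate(claim: str, validators=("v1", "v2", "v3")) -> str:
        votes = [validate_with_model(v, claim) for v in validators]
        if all(votes):
            return "validated"
        if not any(votes):
            return "rejected"
        return "flagged"   # disagreement: likely an independent model error
    ```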

  • Dominic King, MD PhD

    VP Health, Microsoft AI

    14,197 followers

    We’re excited to share research from our health team at Microsoft AI: a proof-of-concept showing that AI can master medicine’s most intricate diagnostic challenges by following the same step-by-step reasoning expert physicians use. There's more detail in our preprint paper & blog.
    Paper -> https://lnkd.in/egDiNsqR
    Blog -> https://lnkd.in/esGFhSeB
    Sharing what I'm most excited about from this work.

    1. Benchmarks
    Traditional medical benchmarks like the USMLEs condense clinical cases into neat multiple-choice questions—far from the real clinical workflow. We’ve approached things in a different way: the Sequential Diagnosis Benchmark (SDBench) deconstructs 304 of the most diagnostically complex and demanding cases in medicine published in the New England Journal of Medicine. SDBench requires models—and physicians—to begin with an initial presentation, ask follow-up questions, order tests, and converge on the confirmed diagnosis—just as in routine clinical practice. You can see how this works in a video with Xiao Liu on our blog.

    2. Performance
    With this new benchmark we tested a suite of the best-known generative AI models against the 304 NEJM cases, with impressive out-of-the-box performance. Beyond this we developed the Microsoft AI Diagnostic Orchestrator (MAI-DxO). By emulating a virtual panel of physicians with diverse thinking styles, MAI-DxO boosts raw model accuracy and solves a remarkable 85.5% of NEJM cases. For comparison, we evaluated 21 practicing UK/US physicians on the same tasks; these experts achieved a mean accuracy of 20%.

    3. Costs
    One of our concerns was that AI would default to ordering every investigation to arrive at the correct diagnosis. So we set the system up such that each requested investigation also incurred a cost (a sketch of this cost-accounting idea follows below). This allowed us to evaluate performance against both diagnostic accuracy and resource expenditure. As MAI-DxO is configurable, it is seen to operate along a Pareto frontier of accuracy versus resource use.

    What’s next: While for now exciting research, we believe this kind of superhuman clinical reasoning will in future reshape medicine. A particular focus for our group is on consumer health. Today, Bing and Copilot answer over 50 million health queries daily—from a first-time knee-pain search to finding a late-night pharmacy. We’re committed to bringing rigorous and reliable AI support into these journeys, backed by clinical evidence and robust commitments to quality, safety and trust.

    A huge shout-out to everyone on our new team who contributed, our partners across Microsoft, and particularly to Mustafa Suleyman for his vision. He saw the opportunity for AI to improve healthcare more than a decade ago and it now feels like this is the right time to deliver on the opportunity. Harsha Nori Mayank Daswani Christopher Kelly Scott Lundberg Marco Túlio Ribeiro Marc Wilson Xiao Liu Viknesh Sounderajah Bay Gross Peter Hames Eric Horvitz Charlotte Cooper Simpson, PhD
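
    A minimal sketch of the cost-accounting loop referenced in section 3, under the assumption that the agent emits either "order:&lt;test&gt;" or "diagnose:&lt;dx&gt;" actions; the test prices and case format are illustrative assumptions, not MAI-DxO's actual setup.

    ```python
    # Sketch: sequential diagnosis where every ordered test adds to a
    # running bill, so agents are scored on accuracy AND spend.
    TEST_COSTS = {"cbc": 30, "mri": 1200, "biopsy": 800}  # hypothetical prices

    def run_sequential_case(agent_step, case, budget=5000):
        """case: {'presentation': str, 'results': {test_name: str}}"""
        spent, history = 0, [case["presentation"]]
        while True:
            action = agent_step(history)        # "order:<test>" or "diagnose:<dx>"
            kind, _, arg = action.partition(":")
            if kind == "diagnose":
                return arg, spent               # final diagnosis and total cost
            spent += TEST_COSTS.get(arg, 100)   # default price for unlisted tests
            if spent > budget:
                return None, spent              # budget exhausted, no diagnosis
            history.append(case["results"].get(arg, "no result available"))
    ```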

  • 🧠 Half of radiology AI models lose performance on external datasets, a study shows. And most institutions don’t know it until it’s too late.

    A systematic review in "Radiology: AI" (Yu et al., 2022, https://lnkd.in/dppFimQ8) analyzed 86 studies of AI models for radiologic diagnosis. In 42 of them, performance dropped when tested outside the original development setting:
    🔍 21 showed a substantial decrease
    🔍 21 showed a modest one
    It is not just about accuracy - it’s about generalizability. Yet many AI decisions are still based on slide decks and regulatory approvals, not contextual performance on local data.

    That’s why we built Evaluation Flows in deepcOS®:
    ✅ Run head-to-head comparisons
    ✅ Use real clinical cases
    ✅ Benchmark commercial and research AI models
    ✅ Choose what works best for your population
    ➡️ See the demo in my previous post: https://lnkd.in/d-ZCK9Jm

    📄 And it’s not just academic studies. A recent NHS evaluation by GE HealthCare, Aival, and a UK hospital showed real-world AI sensitivity varied widely depending on patient age, sex, and scan parameters, even for CE-cleared models.
    📊 Examples: On the target-population data, sensitivity and specificity dropped to 76.8% and 54%, respectively - a vast deviation from the vendor-reported figures. Also, when assessing fairness, sensitivity ranged from 56.5% to 84.8% across age groups (a sketch of this kind of subgroup check follows below). Find the full case study here: https://lnkd.in/dwXc7Sat

    💡 We know that poor generalizability is a known shortcoming of many CNN-based models - and that next-gen foundation models are already showing much better cross-site robustness. That’s great news. But even the best models still need to prove themselves in local context before they reach clinical impact at scale. We believe: If it can’t be tested, it can’t be trusted.

    #AIinHealthcare #MedicalAI #deepcOS #AIevaluation #TrustInAI #LocalValidation #ExternalValidation #Generalizability #ResponsibleAI #RadiologyInnovation Kicky van Leeuwen Stephan Romeijn Steven Kean David Bowen Amna Kashgari Alex King Dr. Boj Friedrich Hoppe Prof. Dr. Clemens Cyran Anna Martina Bröhan Mansour Aleisa Sebastien Ourselin FREng FMedSci Bryce Travers Professor Sultan Mahmud Dr Amir H. Bigdeli, MD Dr Geraldine Dean Kanwal Bhatia Sarim Ather Jan Beger Jonny McDaniell Dr Hugh Harvey
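
    A minimal sketch of the subgroup check mentioned above: computing sensitivity and specificity per subgroup (e.g. age band) from labeled predictions, so spreads like 56.5%-84.8% surface before deployment. The record format is an assumption for illustration.

    ```python
    # Sketch: disaggregated validation on local data.
    from collections import defaultdict

    def subgroup_metrics(records):
        """records: iterable of dicts with 'group', 'label' (0/1), 'pred' (0/1)."""
        tallies = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
        for r in records:
            t = tallies[r["group"]]
            if r["label"] == 1:
                t["tp" if r["pred"] == 1 else "fn"] += 1
            else:
                t["tn" if r["pred"] == 0 else "fp"] += 1
        return {
            g: {
                "sensitivity": t["tp"] / max(t["tp"] + t["fn"], 1),
                "specificity": t["tn"] / max(t["tn"] + t["fp"], 1),
            }
            for g, t in tallies.items()
        }
    ```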

  • Dr. Sara Al Dallal

    President of Emirates Health Economics Society at Emirates Medical Association

    31,652 followers

    World Health Organization PAHO just published one of the most operationally useful AI governance documents to come out of a multilateral health body: "Bias-free Artificial Intelligence in Health: Dos and Don'ts for Developing and Implementing Algorithms" — and it reads like a policy checklist, not a tech manual.

    🎯 The framing is deliberate
    Algorithmic bias is not positioned as a fairness problem. It's positioned as a performance problem — one that degrades diagnostic accuracy, misallocates resources, and erodes institutional trust. That's the argument that moves ministries.

    🔬 The case studies are concrete
    Pulse oximeters calibrated on lighter skin, systematically overestimating oxygen saturation in darker-skinned patients. Sepsis prediction models trained in tertiary hospitals, failing in community settings. A care-management algorithm that used historical expenditure as a proxy for risk — and systematically underestimated need in patients whose care had historically been underfunded. Not hypotheticals. Documented failures with patient consequences.

    📋 What the document actually delivers
    Role-specific dos and don'ts for policymakers, clinicians, developers, and hospital directors. A model card template — essentially a nutritional label for an algorithm (a sketch follows below). Disaggregated validation requirements. A maturity scoring framework across six governance domains. And a principle that should be non-negotiable in any procurement contract: human accountability is nondelegable.

    ⚖️ The policy ask is clear
    Don't approve AI systems based on overall accuracy alone. Require subgroup validation. Mandate model cards. Build sunset plans.

    Worth reading and circulating to anyone involved in digital health policy, regulation, or AI procurement.
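
    A minimal sketch of what a machine-readable model card could look like, in the spirit of PAHO's "nutritional label for an algorithm"; the field names are illustrative assumptions, not the document's exact schema.

    ```python
    # Sketch: a model card as structured data (hypothetical fields).
    from dataclasses import dataclass, field

    @dataclass
    class ModelCard:
        name: str
        intended_use: str
        training_population: str                 # who the model was trained on
        excluded_populations: list = field(default_factory=list)
        subgroup_metrics: dict = field(default_factory=dict)   # group -> metrics
        known_limitations: list = field(default_factory=list)
        accountable_owner: str = ""              # accountability is nondelegable
        sunset_review_date: str = ""             # when to revalidate or retire

    card = ModelCard(
        name="sepsis-risk-v2",
        intended_use="inpatient sepsis risk flagging, tertiary care",
        training_population="2018-2022 admissions, single tertiary hospital",
        known_limitations=["not validated in community settings"],
        accountable_owner="clinical-governance@example.org",
        sunset_review_date="2026-06-30",
    )
    ```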

  • Stuart Winter-Tear

    Author of UNHYPED | AI as Capital Discipline | Advisor on what to fund, test, scale, or stop

    53,648 followers

    AI factual accuracy is a core concern in high-stakes domains, not just theoretically, but in real-world conversations I have. This paper proposes atomic fact-checking: a precision method that breaks long-form LLM outputs into the smallest verifiable claims, and checks each one against an authoritative corpus before reconstructing a reliable, traceable answer.

    The study focuses on medical Q&A, and shows this method outperforms standard RAG systems across multiple benchmarks:
    - Up to 40% improvement in real-world clinical responses.
    - 50% hallucination detection, with 0% false positives in test sets.
    - Statistically significant gains across 11 LLMs on the AMEGA benchmark - with the greatest uplift in smaller models like Llama 3.2 3B.

    The 5-step pipeline (sketched below):
    - Generate an initial RAG-based answer.
    - Decompose it into atomic facts.
    - Verify each fact independently against a vetted vector DB.
    - Rewrite incorrect facts in a correction loop.
    - Reconstruct the final answer with fact-level traceability.

    While the results are promising, the limitations are worth noting: The system can only verify against what’s in the corpus; it doesn't assess general world knowledge or perform independent reasoning. Every step depends on LLM output, introducing the risk of error propagation across the pipeline. In some cases (up to 6%), fact-checking slightly degraded answer quality due to retrieval noise or correction-side hallucinations. It improves factual accuracy, but not reasoning, insight generation, or conceptual abstraction.

    While this study was rooted in oncology, the method is domain-agnostic and applicable wherever trust and traceability are non-negotiable:
    - Legal (case law, regulations)
    - Finance (audit standards, compliance)
    - Cybersecurity (NIST, MITRE)
    - Engineering (ISO, safety manuals)
    - Scientific R&D (citations, reproducibility)
    - Governance & risk (internal policy, external standards)

    This represents a modular trust layer - part of an architectural shift away from monolithic, all-knowing models toward composable systems where credibility is constructed, not assumed. It’s especially powerful for smaller, domain-specific models - the kind you can run on-prem, fine-tune to specialised corpora, and trust to stay within scope. In that architecture, the model doesn’t have to know everything. It just needs to say what it knows - and prove it. The direction of travel feels right to me.
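
    A minimal skeleton of the 5-step pipeline's control flow, with each stage left as a hypothetical stub for an LLM or vector-DB call; this sketches the loop structure, not the paper's implementation.

    ```python
    # Sketch: atomic fact-checking control flow. All five stages are
    # hypothetical stubs to be backed by LLM / vector-DB calls.
    def generate_rag_answer(q): raise NotImplementedError   # 1. initial RAG answer
    def decompose(answer): raise NotImplementedError        # 2. -> atomic facts
    def verify(fact): raise NotImplementedError             # 3. check vs vetted corpus
    def rewrite(fact): raise NotImplementedError            # 4. correction loop
    def reconstruct(facts): raise NotImplementedError       # 5. traceable final answer

    def atomic_fact_check(question: str, max_rounds: int = 2) -> str:
        answer = generate_rag_answer(question)
        facts = decompose(answer)
        for _ in range(max_rounds):
            bad = [f for f in facts if not verify(f)]
            if not bad:
                break                                       # everything checks out
            facts = [rewrite(f) if f in bad else f for f in facts]
        return reconstruct(facts)
    ```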

  • Clint Gibler

    Sharing the latest cybersecurity research at tldrsec.com | Head of Research at Semgrep

    33,775 followers

    A number of companies are using AI to autotriage SAST findings. But... does it actually work? How well? ➡️ A data-driven approach with a 96% true-positive accuracy rate.

    I often see AI + security posts that are very "vibes based." Like, "I tried it a few times, and it seems to work on my machine #shipit🤘" So it was nice seeing this post by Jack Moxon and Seth Jaksik that walks through how Semgrep's AI Assistant autotriage works, and the iterative improvements they made.

    1️⃣ How does Assistant autotriage work?
    Given a new pull request (PR), if there are Semgrep findings, prompt an LLM with:
    * The relevant code (including a taint trace)
    * The Semgrep rule and finding
    And later:
    * Historical findings
    * Better prompting
    * RAG
    * ...

    2️⃣ How do you verify Assistant autotriage works?
    Human security researcher experts reviewed >2,000 Semgrep findings and manually labeled them as true or false positives. This then became a benchmark using promptfoo that could be iterated on (a sketch of this kind of scoring follows below).

    3️⃣ Performance improvements
    After a series of improvements (see the post for details), Assistant now has:
    * 96% accuracy on true positives
    * 41% accuracy on false positives
    That is: Assistant is very unlikely to incorrectly say that a true vulnerability is safe, but may say a false positive is a real issue. Seth and Jack argue that this is the right trade-off, as it's better not to miss true issues.

    For lots of charts and more methodology description, see the post! https://lnkd.in/gpA8U9aj
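
    A minimal sketch of the scoring described in step 2️⃣: comparing an autotriage verdict against expert labels and reporting accuracy separately on true and false positives, since the two error directions carry very different costs. `autotriage` is a hypothetical stub, not Semgrep's implementation.

    ```python
    # Sketch: evaluate an autotriage function against human-labeled findings.
    def autotriage(finding) -> bool:
        """Placeholder: LLM verdict, True = real vulnerability."""
        raise NotImplementedError

    def score(labeled_findings):
        """labeled_findings: list of (finding, human_label: bool)."""
        tps = [f for f, label in labeled_findings if label]
        fps = [f for f, label in labeled_findings if not label]
        # Accuracy on true positives: real vulns correctly kept.
        tp_acc = sum(autotriage(f) for f in tps) / max(len(tps), 1)
        # Accuracy on false positives: noise correctly dismissed.
        fp_acc = sum(not autotriage(f) for f in fps) / max(len(fps), 1)
        return {"true_positive_accuracy": tp_acc, "false_positive_accuracy": fp_acc}
    ```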

  • I ran 849 tests on AI context files. The results surprised me. If you use Claude Code with reference documentation—security playbooks, compliance libraries, runbooks—how you organize those files directly impacts answer quality.

    I assumed a well-organized folder hierarchy would outperform a messy flat directory. Clean categories, logical nesting, maybe an index file. That's just good practice, right? The data said otherwise.

    Key findings from 849 controlled tests:
    → Flat structure (all files in one folder) hit 100% accuracy at 302K words
    → Each level of folder nesting costs 1-2% accuracy
    → Index files and summaries actually hurt at scale—dropping accuracy by 4.6% at 600K+ words
    → Your filenames are the index. Claude selects files based on names before reading content (a sketch of this idea follows below).

    The simplest approach won at every scale I tested. If you're building AI-assisted workflows (for security or development), the highest-performing structure is also the easiest to maintain: descriptive filenames in a flat folder. No elaborate hierarchies needed.

    Full methodology, data tables, and an open-source test harness in the article. https://lnkd.in/gYzS_uBJ

    #CyberSecurity #AI #ClaudeCode #IncidentResponse #CISO #AIEngineering #ProductivityTips
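
    A minimal sketch of the "filenames are the index" idea: a naive name-based file selector that works well precisely when names are descriptive and the folder is flat. The paths and scoring are illustrative, not the article's open-source harness.

    ```python
    # Sketch: pick the context file whose name best matches the query,
    # without reading file contents (name-first selection).
    from pathlib import Path

    def pick_file(query: str, root: str) -> Path | None:
        """Return the file whose name shares the most words with the query."""
        terms = set(query.lower().split())
        best, best_score = None, 0
        for p in Path(root).rglob("*.md"):
            score = len(terms & set(p.stem.lower().replace("-", " ").split()))
            if score > best_score:
                best, best_score = p, score
        return best

    # e.g. pick_file("rotate compromised api keys", "playbooks/")
    # -> playbooks/rotate-compromised-api-keys.md (flat, descriptive names win)
    ```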
