Testing LLM Performance Against Malformed Data


Summary

Testing LLM performance against malformed data means evaluating how well large language models (LLMs) handle inputs that are messy, incomplete, misspelled, or contain irrelevant information. This process helps uncover how often these AI systems "hallucinate"—that is, generate inaccurate or made-up responses—when faced with confusing or unexpected queries.

  • Simulate real-world errors: Challenge your language model by feeding it misspelled, incomplete, or contextually misleading prompts to see how reliably it responds.
  • Monitor for hallucinations: Track when the AI produces confident but incorrect answers, especially in cases where it should express uncertainty or request clarification.
  • Design targeted safeguards: Build error detection and handling systems that can flag or prevent unreliable outputs, instead of relying solely on the model to self-correct (a minimal sketch follows after this list).
Summarized by AI based on LinkedIn member posts
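
A minimal illustration of the last point: wrap the model call in cheap pre- and post-checks rather than trusting it to self-police. Everything below is a hypothetical Python sketch; `call_llm`, the length and character-ratio thresholds, and the refusal phrases are stand-ins for whatever client and policies an application actually uses.

```python
# Sketch of a guarded LLM call: cheap input checks before the model,
# and a hallucination flag after it. call_llm is a hypothetical stand-in.

REFUSAL_MARKERS = (
    "i don't know", "i do not know", "cannot answer",
    "not enough information", "please clarify",
)

def looks_malformed(query: str, min_len: int = 8) -> bool:
    """Flag empty, truncated, or heavily garbled input (e.g. OCR noise)."""
    stripped = query.strip()
    if len(stripped) < min_len:
        return True
    readable = sum(ch.isalpha() or ch.isspace() for ch in stripped)
    return readable / len(stripped) < 0.6  # threshold is illustrative only

def expresses_uncertainty(answer: str) -> bool:
    """Did the model hedge or refuse instead of answering confidently?"""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def guarded_answer(query: str, context: str, call_llm) -> dict:
    """Route obviously bad inputs to clarification; flag ungrounded confidence."""
    if looks_malformed(query):
        return {"status": "needs_clarification", "answer": None}
    answer = call_llm(query=query, context=context)
    if not context.strip() and not expresses_uncertainty(answer):
        # A confident answer with no supporting context deserves review.
        return {"status": "flag_possible_hallucination", "answer": answer}
    return {"status": "ok", "answer": answer}
```
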
  • View profile for Sahar Mor

    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

    41,883 followers

    A new study shows that even the best financial LLMs hallucinate 41% of the time when faced with unexpected inputs.

    FailSafeQA, a new benchmark from Writer, tests LLM robustness in finance by simulating real-world mishaps, including misspelled queries, incomplete questions, irrelevant documents, and OCR-induced errors. Evaluating 24 top models revealed that:
    * OpenAI’s o3-mini, the most robust, hallucinated in 41% of perturbed cases
    * Palmyra-Fin-128k-Instruct, the model best at refusing irrelevant queries, still struggled 17% of the time

    FailSafeQA uniquely measures:
    (1) Robustness - performance across query perturbations (e.g., misspelled, incomplete)
    (2) Context Grounding - the ability to avoid hallucinations when context is missing or irrelevant
    (3) Compliance - balancing robustness and grounding to minimize false responses

    Developers building financial applications should implement explicit error handling that gracefully addresses context issues, rather than relying solely on model robustness. Developing systems to proactively detect and respond to problematic queries can significantly reduce costly hallucinations and enhance trust in LLM-powered financial apps.

    Benchmark details: https://lnkd.in/gq-mijcD
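
To make the perturbation idea concrete, here is a small sketch of how one might stress-test a QA pipeline with misspelled, incomplete, and context-free variants of the same question. It only illustrates the general approach; the perturbation rules and the `answer_fn` callable are assumptions, not FailSafeQA's actual implementation.

```python
import random

def misspell(query: str, rate: float = 0.08, seed: int = 0) -> str:
    """Randomly drop characters to mimic typos or OCR damage."""
    rng = random.Random(seed)
    return "".join(ch for ch in query if rng.random() > rate)

def truncate(query: str, keep: float = 0.6) -> str:
    """Cut the question short to mimic an incomplete query."""
    words = query.split()
    return " ".join(words[: max(1, int(len(words) * keep))])

def perturbations(query: str, context: str):
    """Yield (label, query, context) variants probing robustness and grounding."""
    yield "baseline", query, context
    yield "misspelled", misspell(query), context
    yield "incomplete", truncate(query), context
    yield "missing_context", query, ""   # the model should refuse, not guess

def stress_test(answer_fn, query: str, context: str) -> dict:
    """Run one question through every variant; answer_fn(query, context) -> str."""
    return {label: answer_fn(q, c) for label, q, c in perturbations(query, context)}
```

A run of `stress_test` then shows at a glance which variants still get confident answers when they should instead trigger a refusal or clarification.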

  • View profile for Gary Ang, PhD

    Former AI Risk Lead & Investment Risk Head at MAS | AI × Risk × Finance | Accidental Computer Scientist & Artist | Developed Singapore’s AI Risk Management Guidelines (AIRG) for Finance | Art above by me, not AI.

    5,081 followers

    Gen AI hallucinations are dead. Just kidding. Of course they ain’t. Hallucinations are linked to uncertainty, and uncertainty is a feature, not a bug, in AI: it arises naturally from a model learning to generalize.

    In previous diptychs, we explored hierarchy, representations, and attention in AI. Today's diptych focuses on uncertainty (one of the 3 ‘U’s in my Thinking in Risks article, alongside unexpectedness and unexplainability). There’s been a lot of chatter about OpenAI’s paper “Why Language Models Hallucinate”. I found a paper by Leon Chlon, PhD on how to estimate the likelihood of an LLM hallucinating more interesting, and it somehow made me think of an older, orthogonal paper on conformal prediction. So here’s a diptych of the two papers. They are quite different, but the common theme is uncertainty.

    Left Panel: “Distribution-Free Predictive Inference for Regression” - use past behavior to assess uncertainty. Train your model, then calibrate it on a separate set of data points where you know the correct answers. Check how wrong the predictions are on this calibration set, then use those errors to create confidence intervals for subsequent new predictions, allowing you to say: based on past performance, I'm 90% confident the true answer falls in this range.

    Right Panel: “Compression Failure in LLMs: Bayesian in Expectation, Not in Realization” - check uncertainty by messing up the question and seeing how much the answer changes. Ask "Who was the President of the USA in 2019?" Then ask corrupted versions: "Who was the President of [MASK] in 2019?" and "Who was the [MASK] of the USA in 2019?" If confidence appropriately plummets, the model probably knows the answer to the original, uncorrupted question. If the model's confidence barely drops despite missing key information, it's more likely to be hallucinating or bullshitting you.

    Both approaches tackle the same fundamental problem: distinguishing how confident a prediction or an answer really is. Different paths to the same critical goal: knowing how likely your AI is bullshitting you. How do you decide when to trust advice that sounds confident?

    #Uncertainty #ConformalPrediction #Hallucination #AIRiskManagement #AI
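
The left-panel recipe (split conformal prediction) is simple enough to sketch with NumPy: measure absolute errors on a held-out calibration set and use their quantile as the half-width of the prediction interval. This is a generic illustration of the idea, not the cited paper's exact procedure; the `model` in the usage note is a hypothetical fitted regressor.

```python
import numpy as np

def conformal_interval(predict, X_calib, y_calib, X_new, alpha=0.1):
    """Split conformal prediction: (1 - alpha)-coverage intervals from calibration errors.

    predict          -- a fitted model's prediction function (e.g. model.predict)
    X_calib, y_calib -- held-out calibration data the model was NOT trained on
    """
    residuals = np.abs(y_calib - predict(X_calib))   # how wrong were past predictions?
    n = len(residuals)
    # Finite-sample-corrected quantile of the calibration errors.
    q = np.quantile(residuals, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    preds = predict(X_new)
    return preds - q, preds + q                      # lower and upper bounds

# Usage sketch:
# lo, hi = conformal_interval(model.predict, X_calib, y_calib, X_new, alpha=0.1)
# Read as: based on past errors, ~90% of new true values should land in [lo, hi].
```
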

  • View profile for Aishwarya Naresh Reganti

    Founder & CEO @ LevelUp Labs | Ex-AWS | Consulting, Training & Investing in AI

    123,782 followers

    😯 HALOGEN is a new benchmark for measuring LLM hallucinations and shows that even top-performing models can hallucinate at rates of up to 86%. Hallucination remains a persistent issue in LLMs, and as models continue to improve, harder benchmarks like HALOGEN are necessary for identifying new patterns and understanding the evolving challenges.

    Insights:
    👉 HALOGEN consists of over 10,000 prompts across nine domains, designed to elicit hallucinations. Tasks include programming, scientific attribution, summarization, text simplification, biographies, historical events, false presuppositions, and rationalization.
    👉 It introduces a system to detect hallucinations by breaking model outputs into atomic units and verifying these against trusted knowledge sources. For instance, in code generation, imports are checked against the PyPI index.
    👉 The benchmark tested 14 models and revealed significant hallucination rates, ranging from 4% to 86% depending on the domain.

    The study categorized errors into three types:
    ⛳ Type A: The correct fact is in the training data, but the model hallucinates.
    ⛳ Type B: Incorrect facts are in the training data or taken out of context.
    ⛳ Type C: Facts are fabricated, with no basis in the training data.

    The paper emphasizes that addressing hallucinations will require multiple strategies tailored to error types. Retrieval-based approaches may help when relevant knowledge exists, while models should be trained to express uncertainty when encountering unfamiliar scenarios.

    Link: https://lnkd.in/eSJSwx3u
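
The atomic-unit verification idea can be illustrated for the code-generation case: extract the packages a generated snippet imports and check each one against the PyPI JSON API (https://pypi.org/pypi/<name>/json returns 404 for packages that don't exist). This is a simplified sketch, not HALOGEN's verifier, and it ignores subtleties such as import-name versus distribution-name mismatches.

```python
import ast
import sys
import urllib.error
import urllib.request

def imported_packages(code: str) -> set:
    """Top-level package names imported by a generated Python snippet."""
    names = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return names

def exists_on_pypi(package: str) -> bool:
    """True if PyPI's JSON endpoint knows the package (HTTP 200)."""
    url = f"https://pypi.org/pypi/{package}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False

def suspicious_imports(code: str) -> set:
    """Imports that are neither stdlib modules nor on PyPI, i.e. likely fabricated."""
    stdlib = getattr(sys, "stdlib_module_names", frozenset())  # Python 3.10+
    return {pkg for pkg in imported_packages(code)
            if pkg not in stdlib and not exists_on_pypi(pkg)}
```
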
