LLM hallucinations aren't bugs, they're compression artefacts. And we just figured out how to predict them before they happen.

400 stars in one week; the reception has been unreal. Our toolkit is open source and anyone can use it: https://lnkd.in/e4s3X8GK

When your LLM confidently states that "Napoleon won the Battle of Waterloo," it's not broken. It's doing exactly what it was trained to do: compress the entire internet into model weights, then decompress on demand. Sometimes there isn't enough information to perfectly reconstruct rare facts, so it fills the gaps with statistically plausible but wrong content. Think of it like a ZIP file corrupted during compression: the decompression algorithm still runs, but it outputs garbage where data was lost.

The breakthrough: we proved hallucinations occur when information budgets fall below mathematical thresholds. Using our Expectation-level Decompression Law (EDFL), we can calculate exactly how many bits of information are needed to prevent a specific hallucination, before generation even starts.

This resolves a fundamental paradox: LLMs achieve near-perfect Bayesian performance on average, yet systematically fail on specific inputs. We proved they're "Bayesian in expectation, not in realisation", optimising average-case compression rather than worst-case reliability.

Why this changes everything: instead of treating hallucinations as inevitable, we can now
- calculate risk scores before generating any text,
- set guaranteed error bounds (e.g. 95%), and
- know precisely when to gather more context vs. abstain.

The full preprint is being released on arXiv this week. Until then, read the preprint PDF we uploaded here: https://lnkd.in/eRf_ecu3

The toolkit works with any OpenAI-compatible API, requires zero retraining, and provides mathematical SLA guarantees for compliance. Perfect for healthcare, finance, legal, anywhere errors aren't acceptable.

The era of "trust me, bro" AI is ending. Welcome to bounded, predictable AI reliability.

Big thanks to Ahmed K. and Maggie C. for all the help putting this and the repo together!

#AI #MachineLearning #ResponsibleAI #OpenSource #LLM #Innovation
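The post doesn't spell out the toolkit's API, so here is a minimal, hypothetical sketch of the decision rule it describes: estimate how many bits of evidence the context actually supplies, compare that budget against an EDFL-style requirement for the target error rate, and abstain or fetch more context when the budget falls short. Every name and formula below (`required_bits`, `information_budget`, the 5% target) is an illustrative assumption, not the toolkit's real interface or the law stated in the preprint.

```python
import math

def required_bits(target_error_rate: float, prior_error_rate: float = 0.5) -> float:
    """Hypothetical EDFL-style bound: extra bits of evidence needed to push the
    expected error rate from a prior level down to the target level.
    (Illustrative only; the real law is stated in the preprint.)"""
    return math.log2(prior_error_rate / target_error_rate)

def information_budget(nll_without_context: float, nll_with_context: float) -> float:
    """Toy estimator: bits of information the context supplies about the answer,
    measured as the drop in the answer's negative log-likelihood (nats -> bits)."""
    return max(nll_without_context - nll_with_context, 0.0) / math.log(2)

def decide(nll_without: float, nll_with: float, target_error_rate: float = 0.05) -> str:
    budget = information_budget(nll_without, nll_with)
    if budget >= required_bits(target_error_rate):
        return "generate"            # enough evidence to answer within the target error bound
    return "abstain_or_retrieve"     # too little evidence: gather more context or decline

# Example: context that barely moves the answer's likelihood triggers abstention.
print(decide(nll_without=6.0, nll_with=5.2))  # ~1.2 bits < ~3.3 bits needed -> abstain_or_retrieve
print(decide(nll_without=6.0, nll_with=1.0))  # ~7.2 bits >= ~3.3 bits needed -> generate
```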
Ensuring LLM Accuracy in Predictive Analytics
Summary
Ensuring LLM accuracy in predictive analytics means making sure that large language models (LLMs) generate reliable and precise predictions or analyses, especially in scenarios where mistakes can have serious consequences. This involves understanding why these models sometimes produce incorrect or misleading outputs—often called "hallucinations"—and taking steps to reduce those errors for better decision-making.
- Clarify output requirements: Clearly define what accurate results look like for your predictive analytics task to avoid ambiguity and minimize errors from your LLM.
- Ground with verified data: Anchor your LLM’s predictions in trusted sources or use retrieval methods to ensure responses are based on real, up-to-date information rather than guesswork.
- Monitor and test regularly: Set up version tracking and regression testing to catch unexpected changes and maintain reliability as your model evolves.
LLMs are great for data processing, but using new techniques doesn't mean you get to abandon old best practices. The precision and accuracy of LLMs still need to be monitored and maintained, just like with any other AI model.

Tips for maintaining accuracy and precision with LLMs:
- Define within your team EXACTLY what the desired output looks like. Any area of ambiguity should be resolved with a concrete answer. Even if the business "doesn't care," you should define a behavior. Letting the LLM make these decisions for you leads to high-variance, low-precision models that are difficult to monitor.
- Understand that the most gorgeously written, seemingly clear and concise prompts can still produce trash. LLMs are not people and don't follow directions like people do. You have to test your prompts over and over and over, no matter how good they look.
- Make small prompt changes and carefully monitor each change. Changes should be version tracked and vetted by other developers.
- A small change in one part of the prompt can cause seemingly unrelated regressions (again, LLMs are not people). Regression tests are essential for EVERY change. Organize a list of test-case inputs, including those that demonstrate previously fixed bugs, and test your prompt against them.
- Test cases should include "controls" where the prompt has historically performed well. Any change to the control output should be studied, and any incorrect change is a test failure.
- Regression tests should have a single documented bug and clearly defined success/failure metrics: "If the output contains A, then pass. If the output contains B, then fail." This makes it easy to quickly mark regression tests as pass/fail (ideally, automating this process). If a different failure/bug is noted, it should still be fixed, but separately, and pulled out into its own test.

Any other tips for working with LLMs and data processing?
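To make the contains-A/contains-B rule concrete, here is a minimal sketch of a prompt regression harness in the spirit of the post. The test-case fields, the placeholder model call, and the example cases are assumptions; a real setup would plug in its own client and data.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    name: str
    input_text: str
    must_contain: str       # "If the output contains A, then pass."
    must_not_contain: str   # "If the output contains B, then fail."

def call_llm(prompt: str, input_text: str) -> str:
    """Placeholder for your OpenAI-compatible client call."""
    raise NotImplementedError

def run_regression(prompt: str, cases: list[TestCase]) -> bool:
    """Run every case against the candidate prompt and report pass/fail per case."""
    all_passed = True
    for case in cases:
        output = call_llm(prompt, case.input_text)
        passed = case.must_contain in output and case.must_not_contain not in output
        print(f"{case.name}: {'PASS' if passed else 'FAIL'}")
        all_passed &= passed
    return all_passed

# Example cases: a "control" that has historically worked, plus a previously fixed bug.
cases = [
    TestCase("control_invoice_total", "Invoice #123 ...",
             must_contain='"total"', must_not_contain='"total": null'),
    TestCase("bug_missing_address", "Customer record with no address ...",
             must_contain='"address": null', must_not_contain="123 Main St"),
]
```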
-
How can we further increase the accuracy of LLM-powered question answering systems? Ontologies to the rescue! That is the conclusion of the latest research coming from the data.world AI Lab with Dean Allemang.

Based on our previous Knowledge Graph LLM Accuracy benchmark research, our intuition is that accuracy can be further increased by 1) leveraging the ontology of the knowledge graph to check for errors in the generated queries and 2) using the LLM to repair incorrect queries. We ask ourselves the following two research questions:
1️⃣ To what extent can accuracy increase by leveraging the ontology of a knowledge graph to detect errors in a SPARQL query and an LLM to repair the errors?
2️⃣ What types of errors are most commonly present in SPARQL queries generated by an LLM?

🧪 Our hypothesis: an ontology can increase the accuracy of an LLM-powered question answering system that answers a natural language question over a knowledge graph.

📏 Our approach consists of:
- Ontology-based Query Check (OBQC): checks deterministically whether the query is valid by applying rules based on the semantics of the ontology. The rules check the body of the query (i.e. the WHERE clause) and the head of the query (i.e. the SELECT clause). If a check does not pass, it returns an explanation.
- LLM Repair: repairs the SPARQL query generated by the LLM. It takes as input the incorrect query and the explanation, and sends a zero-shot prompt to the LLM. The result is a new query, which can then be passed back to the OBQC.

🏅 Results, using our chat-with-the-data benchmark and GPT-4:
- Our OBQC and LLM Repair approach increased accuracy to 72.55%. If the repairs were not successful after three iterations, an unknown result was returned, which occurred 8% of the time. Thus the final error rate is 19.44%. "I don't know" is a valid answer, which reduces the error rate.
- Low-complexity questions on low-complexity schemas achieve an error rate of 10.46%, which is now arguably at levels deemed acceptable by users.
- All questions on high-complexity schemas saw substantially increased accuracy.
- 70% of the repairs were done by rules checking the body of the query. The majority were rules related to the domain of a property.

Putting this all together with our previous work, LLM question answering accuracy that leverages Knowledge Graphs and Ontologies is over 4x the SQL accuracy! These results support the main conclusion of our research: investment in metadata, semantics, ontologies and Knowledge Graphs is a precondition for achieving higher accuracy in LLM-powered question answering systems.

Link to paper in comments. We are honored that we get to work with strategic customers to push the barrier of the data catalog and knowledge graph industry, and the data.world product. We are proud that our research results are a core part of the data.world AI Context Engine. Thanks for all the valuable feedback we have received from colleagues across industry and academia.
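A minimal sketch of the check-and-repair loop described above, assuming you already have an ontology-based validator and an LLM client. The function names, signatures, and three-iteration cap follow the post's description, but they are hypothetical, not the paper's actual code.

```python
MAX_REPAIR_ITERATIONS = 3  # per the post: return "unknown" after three failed repairs

def validate_against_ontology(sparql_query: str, ontology) -> str | None:
    """Hypothetical OBQC step: apply ontology rules to the query body (WHERE clause)
    and head (SELECT clause). Return an explanation string if a rule fails,
    or None if the query passes all checks."""
    raise NotImplementedError

def llm_repair_query(sparql_query: str, explanation: str) -> str:
    """Hypothetical LLM Repair step: send a zero-shot prompt containing the broken
    query and the explanation, and return the repaired query."""
    raise NotImplementedError

def answer_with_obqc(sparql_query: str, ontology) -> str:
    query = sparql_query
    for _ in range(MAX_REPAIR_ITERATIONS):
        explanation = validate_against_ontology(query, ontology)
        if explanation is None:
            return query           # query passed all ontology checks; safe to execute
        query = llm_repair_query(query, explanation)
    return "I don't know"          # a valid answer that reduces the error rate
```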
-
Are your LLM apps still hallucinating? Zep used to as well, a lot. Here's how we worked to solve Zep's hallucinations.

We've spent a lot of cycles diving into why LLMs hallucinate and experimenting with the most effective techniques to prevent it. Some might sound familiar, but it's the combined approach that really moves the needle.

First, why do hallucinations happen? A few core reasons:
🔍 LLMs rely on statistical patterns, not true understanding.
🎲 Responses are based on probabilities, not verified facts.
🤔 No innate ability to differentiate truth from plausible fiction.
📚 Training datasets often include biases, outdated info, or errors.

Put simply: LLMs predict the next likely word; they don't actually "understand" or verify what's accurate. When prompted beyond their knowledge, they creatively fill gaps with plausible (but incorrect) info. ⚠️ Funny if you're casually chatting, problematic if you're building enterprise apps.

So, how do you reduce hallucinations effectively? The #1 technique: grounding the LLM in data.
- Use Retrieval-Augmented Generation (RAG) to anchor responses in verified data.
- Use long-term memory systems like Zep to ensure the model is always grounded in personalization data: user context, preferences, traits, etc.
- Fine-tune models on domain-specific datasets to improve response consistency and style, although fine-tuning alone typically doesn't add substantial new factual knowledge.
- Use explicit, clear prompting; avoid ambiguity or unnecessary complexity.
- Encourage models to self-verify conclusions when accuracy is essential.
- Structure complex tasks with chain-of-thought (CoT) prompting to improve outputs, or force "none"/unknown responses when necessary.
- Strategically tweak model parameters (e.g., temperature, top-p) to limit overly creative outputs.
- Add post-processing verification for mission-critical outputs, for example, matching to known business states.

One technique alone rarely solves hallucinations. For maximum ROI, we've found combining RAG with a robust long-term memory solution (like ours at Zep) is the sweet spot. Systems that ground responses in factual, evolving knowledge significantly outperform.

Did I miss any good techniques? What are you doing in your apps?
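As one concrete illustration of the grounding advice above, here is a minimal RAG-style sketch: retrieve verified context, pass it explicitly in the prompt, keep the temperature low, and allow an "unknown" answer. It uses the OpenAI Python client's chat completions API; the retriever and the model name are assumptions, not Zep's actual implementation.

```python
from openai import OpenAI

client = OpenAI()  # works against any OpenAI-compatible endpoint

def retrieve_verified_context(question: str) -> list[str]:
    """Placeholder for your retriever / long-term memory lookup (vector store, Zep, etc.)."""
    raise NotImplementedError

def grounded_answer(question: str) -> str:
    context = "\n".join(retrieve_verified_context(question))
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # assumed model name; use whatever your stack provides
        temperature=0.1,       # limit overly creative outputs
        messages=[
            {"role": "system", "content": (
                "Answer ONLY from the provided context. "
                "If the context does not contain the answer, reply exactly: unknown"
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```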
-
Despite the impressive capabilities of LLMs, developers still face challenges in getting the most out of these systems. LLMs often need a lot of fine-tuning and prompt adjustments to produce the best results. First, LLMs currently lack the ability to refine and improve their own responses autonomously, and second, they have limited research capabilities. It would be highly beneficial if LLMs could conduct their own research, equipped with a powerful search engine to access and integrate a broader range of resources. In the past couple of weeks, several studies have taken on these challenges:

1. Recursive Introspection (RISE): RISE introduces a novel fine-tuning approach where LLMs are trained to introspect on and correct their responses iteratively. By framing the process as a multi-turn Markov decision process (MDP) and employing strategies from online imitation learning and reinforcement learning, RISE has shown significant performance improvements in models like LLaMa2 and Mistral. RISE enhanced LLaMa3-8B's performance by 8.2% and Mistral-7B's by 6.6% on specific reasoning tasks.

2. Self-Reasoning Framework: This framework enhances the reliability and traceability of RALMs by introducing a three-stage self-reasoning process encompassing relevance-aware processing, evidence-aware selective processing, and trajectory analysis. Evaluations across multiple datasets demonstrated that this framework outperforms existing state-of-the-art models, achieving 83.9% accuracy on the FEVER fact verification dataset and improving the model's ability to evaluate the necessity of external knowledge augmentation.

3. Meta-Rewarding with LLM-as-a-Meta-Judge: The Meta-Rewarding approach incorporates a meta-judge role into the LLM's self-rewarding mechanism, allowing the model to critique its judgments as well as evaluate its responses. This self-supervised approach mitigates rapid saturation in self-improvement processes, as evidenced by an 8.5% improvement in the length-controlled win rate for models like LLaMa2-7B over multiple iterations, surpassing traditional self-rewarding methods.

4. Multi-Agent Framework for Complex Queries: This framework mimics human cognitive processes by decomposing complex queries into sub-tasks using dynamic graph construction. It employs multiple agents, WebPlanner and WebSearcher, that work in parallel to retrieve and integrate information from large-scale web sources. This approach led to significant improvements in response quality compared to existing solutions like ChatGPT-Web and Perplexity.ai.

The combination of these four studies would create a highly powerful system: it would self-improve through recursive introspection, continuously refining its responses; accurately assess its performance and learn from evaluations to prevent saturation; and efficiently acquire additional information as needed through dynamic and strategic search planning. How do you think a system with these capabilities would reshape the future?
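RISE is a fine-tuning method, but the underlying introspection loop is easy to picture at inference time. Below is a simplified, hypothetical sketch of that loop (generate, critique, revise for a fixed number of turns). It is not the RISE training procedure itself, and `call_llm` is an assumed stand-in for whatever chat client you use.

```python
def call_llm(messages: list[dict]) -> str:
    """Stand-in for any chat-completion client."""
    raise NotImplementedError

def introspect_and_revise(question: str, turns: int = 3) -> str:
    """Iteratively critique and revise an answer, stopping early if the critique passes."""
    answer = call_llm([{"role": "user", "content": question}])
    for _ in range(turns):
        critique = call_llm([{
            "role": "user",
            "content": (f"Question: {question}\nAnswer: {answer}\n"
                        "List any mistakes in this answer, or reply 'OK' if it is correct."),
        }])
        if critique.strip() == "OK":
            break
        answer = call_llm([{
            "role": "user",
            "content": (f"Question: {question}\nPrevious answer: {answer}\n"
                        f"Critique: {critique}\nWrite a corrected answer."),
        }])
    return answer
```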
-
Changing a single field name in our LLM response schema improved accuracy from 4.5% to 95% on GSM8k. The fix was simple: going from final_choice to final_answer. Turns out our model was returning a multiple-choice index instead of the actual answer.

If you're working with structured outputs:
1. Look closely at your field names. They fundamentally alter model behavior: same prompt, drastically different results.
2. JSON mode isn't a free lunch for better performance. It showed 50% more performance variance than Function Calling across 200 test cases.
3. A model needs room to think too, just like you. Chain of Thought remains critical, with up to 60% accuracy improvements.

With LLMs, it's trivial to generate schema variations, and with structured outputs, it's easy to validate the results early on. Look at your data.
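Here is a minimal sketch of a structured-output schema that applies both lessons: a descriptive final_answer field (rather than final_choice) and a reasoning field that gives the model room for chain of thought before it commits. The Pydantic model, the field descriptions, and the OpenAI parse helper are one way to do this, offered as an assumption about your stack rather than the post's exact setup.

```python
from pydantic import BaseModel, Field
from openai import OpenAI

class MathResponse(BaseModel):
    # Room to think: chain-of-thought working comes before the answer field.
    reasoning: str = Field(description="Step-by-step working for the problem")
    # Descriptive name matters: 'final_answer' elicits the value itself,
    # while 'final_choice' nudged the model toward a multiple-choice index.
    final_answer: str = Field(description="The final numeric answer, as a string")

client = OpenAI()

def solve(problem: str) -> MathResponse:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",   # assumed model name
        messages=[{"role": "user", "content": problem}],
        response_format=MathResponse,
    )
    return completion.choices[0].message.parsed
```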
-
Evaluations, or "Evals", are the backbone for creating production-ready GenAI applications. Over the past year, we've built LLM-powered solutions for our customers and connected with AI leaders, uncovering a common struggle: the lack of clear, pluggable evaluation frameworks. If you've ever been stuck wondering how to evaluate your LLM effectively, today's post is for you. Here's what I've learned about creating impactful Evals:

𝗪𝗵𝗮𝘁 𝗠𝗮𝗸𝗲𝘀 𝗮 𝗚𝗿𝗲𝗮𝘁 𝗘𝘃𝗮𝗹?
- Clarity and Focus: Prioritize a few interpretable metrics that align closely with your application's most important outcomes.
- Efficiency: Opt for automated, fast-to-compute metrics to streamline iterative testing.
- Representation Matters: Use datasets that reflect real-world diversity to ensure reliability and scalability.

𝗧𝗵𝗲 𝗘𝘃𝗼𝗹𝘂𝘁𝗶𝗼𝗻 𝗼𝗳 𝗠𝗲𝘁𝗿𝗶𝗰𝘀: 𝗙𝗿𝗼𝗺 𝗕𝗟𝗘𝗨 𝘁𝗼 𝗟𝗟𝗠-𝗔𝘀𝘀𝗶𝘀𝘁𝗲𝗱 𝗘𝘃𝗮𝗹𝘀
Traditional metrics like BLEU and ROUGE paved the way but often miss nuances like tone or semantics. LLM-assisted Evals (e.g., GPTScore, LLM-Eval) now leverage AI to evaluate itself, achieving up to 80% agreement with human judgments. Combining machine feedback with human evaluators provides a balanced and effective assessment framework.

𝗙𝗿𝗼𝗺 𝗧𝗵𝗲𝗼𝗿𝘆 𝘁𝗼 𝗣𝗿𝗮𝗰𝘁𝗶𝗰𝗲: 𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗬𝗼𝘂𝗿 𝗘𝘃𝗮𝗹 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲
- Create a Golden Test Set: Use tools like Langchain or RAGAS to simulate real-world conditions.
- Grade Effectively: Leverage libraries like TruLens or Llama-Index for hybrid LLM+human feedback.
- Iterate and Optimize: Continuously refine metrics and evaluation flows to align with customer needs.

If you're working on LLM-powered applications, building high-quality Evals is one of the most impactful investments you can make. It's not just about metrics; it's about ensuring your app resonates with real-world users and delivers measurable value.
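A minimal sketch of the "grade effectively" step: score each golden-set example with an LLM judge and aggregate the scores. The judging prompt, the 1-5 scale, and the model name are assumptions; frameworks like TruLens or RAGAS wrap this same pattern with more structure.

```python
from openai import OpenAI

client = OpenAI()

def judge(question: str, reference: str, candidate: str) -> int:
    """LLM-as-judge: rate the candidate answer 1-5 against the reference answer."""
    prompt = (
        f"Question: {question}\nReference answer: {reference}\nCandidate answer: {candidate}\n"
        "Rate the candidate from 1 (wrong) to 5 (fully correct and faithful). "
        "Reply with the number only."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return int(reply.choices[0].message.content.strip())

def evaluate(golden_set: list[dict], generate) -> float:
    """golden_set items: {'question': ..., 'reference': ...}; generate(question) -> candidate."""
    scores = [judge(ex["question"], ex["reference"], generate(ex["question"])) for ex in golden_set]
    return sum(scores) / len(scores)
```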
-
Groundbreaking Research Alert: Rethinking Adaptive Retrieval in Large Language Models

A comprehensive study by researchers from the Skolkovo Institute of Science and Technology, AIRI, and other leading institutions has revealed fascinating insights about adaptive retrieval methods in LLMs. The study analyzed 35 different approaches, including 8 recent methods and 27 established uncertainty estimation techniques, across 6 diverse datasets.

Key Technical Insights:
- Simple uncertainty estimation methods often outperform complex retrieval pipelines while being significantly more compute-efficient.
- Internal-state-based uncertainty methods excel at simple tasks, while reflexive methods perform better on complex reasoning tasks.
- SeaKR demonstrates strong self-knowledge identification on single-hop datasets by inspecting LLM internal states.

Under the Hood:
- The study implements a hybrid approach combining multiple uncertainty features, including logit-based, consistency-based, and internal-based methods.
- Researchers used the LLaMA 3.1-8b-instruct model with a BM25 retriever and a Wikipedia corpus for evaluation.
- The analysis covered 10 different metrics across QA performance, self-knowledge capabilities, and computational efficiency.

Notable Findings:
- Uncertainty methods achieve performance comparable to recent adaptive retrieval approaches while requiring fewer compute resources.
- Consistency-based methods excel in downstream performance but lag in self-knowledge assessment.
- There is a significant gap between ideal and current uncertainty estimators, highlighting room for improvement.

This work represents a significant step forward in understanding how to balance LLMs' intrinsic knowledge against external information retrieval, potentially leading to more efficient and accurate AI systems.
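To make "adaptive retrieval via uncertainty estimation" concrete, here is a small sketch of one logit-based variant: answer directly when the model's average token log-probability is high, and fall back to retrieval-augmented generation when it is low. The threshold, helper names, and model name are illustrative assumptions, not the paper's exact setup.

```python
from openai import OpenAI

client = OpenAI()
UNCERTAINTY_THRESHOLD = -0.5  # assumed: average token log-prob below this triggers retrieval

def direct_answer_with_confidence(question: str) -> tuple[str, float]:
    """Answer without retrieval and return the average token log-probability as a confidence proxy."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user", "content": question}],
        logprobs=True,
    )
    choice = resp.choices[0]
    token_logprobs = [t.logprob for t in choice.logprobs.content]
    avg_logprob = sum(token_logprobs) / max(len(token_logprobs), 1)
    return choice.message.content, avg_logprob

def rag_answer(question: str) -> str:
    """Placeholder for the retrieval-augmented path (retrieve, then answer with context)."""
    raise NotImplementedError

def adaptive_answer(question: str) -> str:
    answer, avg_logprob = direct_answer_with_confidence(question)
    if avg_logprob >= UNCERTAINTY_THRESHOLD:
        return answer             # model is confident enough; skip retrieval and save compute
    return rag_answer(question)   # low confidence: ground the answer in retrieved evidence
```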
-
Over the last several months, I've been heads down with the Data Commons team, exploring potential pathways to improve the accuracy of large language models (#LLMs, the new #AI brains behind products like Gemini) when queried for numerical and statistical information.

Today, we released #DataGemma, the first open models designed to connect LLMs with the extensive, real-world data housed within Google's Data Commons. As outlined in our research paper, we've seen notable enhancements to LLM factuality (their ability to source facts and avoid #hallucinations) utilizing two distinct approaches: Retrieval Interleaved Generation (RIG) and Retrieval Augmented Generation (RAG). We are still in the early phases of this work, but our preliminary findings are very exciting.

Google is unique in its willingness to share our research to make this latest Gemma model variant "open". We hope to facilitate research and exploration across the industry on combining Knowledge Graph data with LLMs to improve reliability, factuality, and reasoning. LLMs and AI afford some of the biggest opportunities of our lifetime. Grounding them in real-world data can ensure we can actually use their output for all our imagined, and yet to be imagined, use cases.

Here's a link to our blog post: https://lnkd.in/es_nAFgR
And for those looking for a more technical primer, here's the link to the Google Research blog post: https://lnkd.in/eHfhCVFd

A big thanks to Jennifer Chen, Bo Xu, Hannah Pho, Adriana Olmos, our alum Prashanth R and R. Guha, the leadership of James Manyika, and the entire Data Commons team. I'd also like to thank organizations like the Statistics Division at United Nations DESA (specifically Luis Gonzalez Morales and Yongyi Min) and organizations like The ONE Campaign and TechSoup for helping make data AI ready.

#AI #ArtificialIntelligence #LLMs #Data #DataScience #Technology