Fascinating new research comparing Long Context LLMs vs RAG approaches! A comprehensive study by researchers from Nanyang Technological University Singapore and Fudan University reveals key insights into how these technologies perform across different scenarios. After analyzing 12 QA datasets with over 19,000 questions, here's what they discovered.

Key Technical Findings:
- Long Context (LC) models excel at processing Wikipedia articles and stories, achieving 56.3% accuracy compared to RAG's 49.0%.
- RAG shows superior performance in dialogue-based contexts and fragmented information.
- RAPTOR, a hierarchical tree-based retrieval system, outperformed traditional chunk-based and index-based retrievers with 38.5% accuracy.

Under the Hood: The study implements a novel three-phase evaluation framework:
1. Empirical retriever assessment across multiple architectures
2. Direct LC vs RAG comparison using filtered datasets
3. Granular analysis of performance patterns across different question types and knowledge sources

Most interesting finding: RAG exclusively answered 10% of questions that LC couldn't handle, suggesting these approaches are complementary rather than competitive. The research team also introduced a question filtering methodology to ensure fair comparison by removing queries answerable through parametric knowledge alone. This work significantly advances our understanding of when to use each approach in production systems. A must-read for anyone working with LLMs or building RAG systems!
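As a rough illustration of the filtering step mentioned above (removing questions a model can already answer from parametric memory), here is a minimal sketch. It is not the paper's exact protocol: the `ask_llm` callable and the substring-based scoring are assumptions for illustration only.

```python
from typing import Callable, Dict, List

def filter_parametric_questions(
    questions: List[Dict],            # each: {"question": str, "context": str, "answer": str}
    ask_llm: Callable[[str], str],    # placeholder: returns the model's free-text answer
) -> List[Dict]:
    """Keep only questions the model cannot answer without any context."""
    kept = []
    for item in questions:
        closed_book = ask_llm(item["question"])  # closed-book: no context provided
        # If the model already answers correctly from memory, the question cannot
        # separate long-context reading from retrieval, so drop it.
        if item["answer"].strip().lower() not in closed_book.strip().lower():
            kept.append(item)
    return kept
```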
LLM Assessment Methods for Knowledge Extraction Research
Summary
LLM assessment methods for knowledge extraction research involve evaluating how large language models (LLMs) pull useful information from texts, measuring both accuracy and reliability of their outputs. This includes specialized frameworks, human-in-the-loop processes, and advanced metrics to ensure extracted knowledge is trustworthy and relevant across different domains like healthcare and enterprise applications.
- Combine human review: Balance automated LLM evaluations with targeted expert oversight to catch nuanced errors and ensure the extracted information is accurate in complex scenarios.
- Use advanced metrics: Move beyond simple scoring by incorporating measures like faithfulness, helpfulness, and semantic alignment to better capture how well LLMs understand and answer questions.
- Iterate prompt structure: Continuously refine how you prompt LLMs and assess their outputs, using real-world examples and feedback to improve the consistency and quality of knowledge extraction.
-
Explaining the evaluation method LLM-as-a-Judge (LLMaaJ). Token-based metrics like BLEU or ROUGE are still useful for structured tasks like translation or summarization. But for open-ended answers, RAG copilots, or complex enterprise prompts, they often miss the bigger picture. That's where LLMaaJ changes the game.

What is it? You use a powerful LLM as an evaluator, not a generator. It's given:
- The original question
- The generated answer
- The retrieved context or gold answer

Then it assesses:
✅ Faithfulness to the source
✅ Factual accuracy
✅ Semantic alignment, even if phrased differently

Why this matters: LLMaaJ captures what traditional metrics can't. It understands paraphrasing. It flags hallucinations. It mirrors human judgment, which is critical when deploying GenAI systems in the enterprise.

Common LLMaaJ-based metrics:
- Answer correctness
- Answer faithfulness
- Coherence, tone, and even reasoning quality

📌 If you're building enterprise-grade copilots or RAG workflows, LLMaaJ is how you scale QA beyond manual reviews.

To put LLMaaJ into practice, check out EvalAssist, a new tool from IBM Research. It offers a web-based UI to streamline LLM evaluations:
- Refine your criteria iteratively using Unitxt
- Generate structured evaluations
- Export as Jupyter notebooks to scale effortlessly

A powerful way to bring LLM-as-a-Judge into your QA stack.
- Get Started guide: https://lnkd.in/g4QP3-Ue
- Demo Site: https://lnkd.in/gUSrV65s
- GitHub Repo: https://lnkd.in/gPVEQRtv
- Whitepapers: https://lnkd.in/gnHi6SeW
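For a concrete sense of the pattern (independent of EvalAssist or Unitxt), a minimal LLM-as-a-Judge call might look like the sketch below. The `judge_llm` callable, the rubric wording, and the 1-5 scale are assumptions, not a fixed standard.

```python
import json
from typing import Callable

JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Retrieved context: {context}
Candidate answer: {answer}

Rate the candidate answer on a 1-5 scale for:
- faithfulness: is every claim supported by the context?
- correctness: is the answer factually right?
- alignment: does it address the question, even if phrased differently?

Reply with JSON: {{"faithfulness": int, "correctness": int, "alignment": int, "rationale": str}}"""

def judge_answer(question: str, context: str, answer: str,
                 judge_llm: Callable[[str], str]) -> dict:
    """Score one (question, context, answer) triple with an evaluator LLM."""
    raw = judge_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    # In practice you would add retries and schema validation around this parse.
    return json.loads(raw)
```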
-
Building useful Knowledge Graphs will long be a Humans + AI endeavor. A recent paper lays out how best to implement automation, the specific human roles, and how these are combined. The paper, "From human experts to machines: An LLM supported approach to ontology and knowledge graph construction", provides clear lessons, including:

🔍 Automate KG construction with targeted human oversight: Use LLMs to automate repetitive tasks like entity extraction and relationship mapping. Human experts should step in at two key points: early, to define scope and competency questions (CQs), and later, to review and fine-tune LLM outputs, focusing on complex areas where LLMs may misinterpret data. Combining automation with human-in-the-loop review ensures accuracy while saving time.

❓ Guide ontology development with well-crafted Competency Questions (CQs): CQs define what the Knowledge Graph (KG) must answer, like "What preprocessing techniques were used?" Experts should create CQs to ensure domain relevance and review LLM-generated CQs for completeness. Once validated, these CQs guide the ontology's structure, reducing errors in later stages.

🧑‍⚖️ Use LLMs to evaluate outputs, with humans as quality gatekeepers: LLMs can assess KG accuracy by comparing answers to ground-truth data, with humans reviewing outputs that score below a set threshold (e.g., 6/10). This lets LLMs handle first-pass quality control while humans focus only on edge cases, improving efficiency and ensuring quality.

🌱 Leverage reusable ontologies and refine with human expertise: Start with pre-built ontologies like PROV-O to structure the KG, then refine it with domain-specific details. Humans should guide this refinement, ensuring the KG remains accurate and relevant to the domain's nuances, particularly for specialized terms and relationships.

⚙️ Optimize prompt engineering with iterative feedback: Prompts for LLMs should be carefully structured, starting simple and iterating based on feedback. Use in-context examples to reduce variability and improve consistency. Human experts should refine these prompts to ensure accurate entity and relationship extraction, combining automation with expert oversight for best results.

These lessons provide a solid foundation for optimally applying human and machine capabilities to the very important task of building robust and useful ontologies.
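To illustrate the quality-gate pattern described above (an LLM scores KG answers against ground truth, and humans review anything below a threshold), here is a minimal sketch. The `score_with_llm` callable and the record layout are assumptions; the 6/10 threshold simply mirrors the paper's example.

```python
from typing import Callable, Dict, List

REVIEW_THRESHOLD = 6  # out of 10, as in the paper's example; tune per domain

def triage_kg_answers(
    cq_results: List[Dict],                          # each: {"cq": str, "kg_answer": str, "ground_truth": str}
    score_with_llm: Callable[[str, str, str], int],  # placeholder: returns an integer score 0-10
) -> Dict[str, List[Dict]]:
    """Let an LLM do first-pass QC; route low-scoring items to human experts."""
    auto_accepted, needs_human_review = [], []
    for item in cq_results:
        score = score_with_llm(item["cq"], item["kg_answer"], item["ground_truth"])
        bucket = auto_accepted if score >= REVIEW_THRESHOLD else needs_human_review
        bucket.append({**item, "score": score})
    return {"auto_accepted": auto_accepted, "needs_human_review": needs_human_review}
```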
-
🎉 Pleased to share our paper published in Nature Portfolio's digital medicine journal. 🥳 We've developed a comprehensive framework called CREOLA (short for Clinical Review Of Large Language Models (LLMs) and AI). The framework was pioneered at TORTUS, taking a safety-first, scientific approach to LLMs in healthcare.

🔹 Key components of the CREOLA framework
- Error taxonomy
- Clinical safety assessment
- Iterative experimental structure

🔹 Error taxonomy
- Hallucinations: text in clinical documents unsupported by the transcript of the clinical encounter
- Omissions: clinically important text in the encounter that was not included in the clinical documentation

🔹 Clinical safety assessment
Our innovation incorporates accepted clinical hazard identification principles (based on NHS DCB0129 standards) to evaluate the potential harm of errors. We categorise errors as either 'major' or 'minor', where major errors can have downstream impact on the diagnosis or management of the patient if left uncorrected. Each error is further assessed in a risk matrix comparing risk severity (1 (minor) to 5 (catastrophic)) with a likelihood assessment (very low to very high).

🔹 Iterative experimental structure
We share a methodical approach to comparing different prompts, models, and workflows: label errors, consolidate review, evaluate clinical safety, and then make further adjustments and re-evaluate if necessary.

Method
To demonstrate how to apply CREOLA to any LLM / AVT, we used GPT-4 (early 2024) as a case study.
🔹 We conducted one of the largest manual evaluations of LLM-generated clinical notes to date, analyzing 49,590 transcript sentences and 12,999 clinical note sentences across 18 experimental configurations.
🔹 Transcript-clinical note pairs were broken down to the sentence level and annotated for errors by clinicians.

Results
🔹 Of 12,999 sentences in 450 clinical notes, 191 sentences contained hallucinations (1.47%), of which 84 (44%) were major. Of the 49,590 sentences from our consultation transcripts, 1,712 sentences were omitted (3.45%), of which 286 (16.7%) were classified as major and 1,426 (83.3%) as minor.
🔹 Hallucination types:
- Fabrication (43%): completely invented information
- Negation (30%): contradicting clinical facts
- Contextual (17%): mixing unrelated topics
- Causality (10%): speculating on causes without evidence
🔹 Hallucinations, while less common than omissions, carry significantly more clinical risk. Negation hallucinations were the most concerning.
🔹 We CAN reduce or even abolish hallucinations and omissions by making prompt or model changes. In one experiment with GPT-4, we reduced the incidence of major hallucinations by 75%, major omissions by 58%, and minor omissions by 35% through prompt iteration.

Links in comments.
Ellie Asgari Nina Montaña Brown Magda Dubois Saleh Khalil Jasmine Balloch Dr Dom Pimenta M.D.
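To make the risk-matrix idea concrete (severity 1-5 crossed with a likelihood level), here is a minimal sketch. The band boundaries below are illustrative assumptions only, not the NHS DCB0129 or CREOLA thresholds.

```python
LIKELIHOOD_LEVELS = {"very low": 1, "low": 2, "medium": 3, "high": 4, "very high": 5}

def risk_score(severity: int, likelihood: str) -> int:
    """Combine severity (1 = minor .. 5 = catastrophic) with likelihood into a single score."""
    assert 1 <= severity <= 5, "severity must be 1-5"
    return severity * LIKELIHOOD_LEVELS[likelihood]

def risk_band(score: int) -> str:
    """Illustrative banding only; real clinical-safety thresholds must come from the governing standard."""
    if score >= 15:
        return "unacceptable - fix before deployment"
    if score >= 8:
        return "major - requires mitigation and sign-off"
    return "minor - monitor"

# Example: a negation hallucination judged severity 4 with 'low' likelihood
print(risk_band(risk_score(4, "low")))
```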
-
If you were building a Q&A feature (or chatbot) based on very long documents (like books), what evals would you focus on?

1. Two metrics that come to mind
• Faithfulness: grounding of answers in the document's content. Not to be confused with correctness; an answer can be correct (based on updated information) but not faithful to the document. Sub-metric: precision of citations.
• Helpfulness: usefulness (directly addresses the question with enough detail and explanation) and completeness (does not omit important details); an answer can be faithful but not helpful if it is too brief or doesn't answer the question.
• Evaluate them separately: faithfulness = binary label -> LLM-evaluator; helpfulness = pairwise comparisons -> reward model.

2. How to build robust evals
• Use LLMs to generate questions from the text.
• Evals should test positional robustness (i.e., include questions whose evidence sits at the beginning, middle, and end of the text).

3. Potential challenges
• Open-ended questions may have no single correct answer, making reference-based evals tricky. For example: What is the theme of this novel?
• Questions should be representative of prod traffic, with a mix of factual, inferential, summarization, and definitional questions.

4. Benchmark datasets
• NarrativeQA: questions based on entire movie scripts or novels. Includes reference answers useful for LLM-eval comparisons.
• NovelQA: Q&A over full novels; includes both MCQ and free-form responses, with references.
• Qasper: similar to NarrativeQA, but with academic documents of 5-10k tokens; includes evaluation of answer spans.
• LongBench: average of 6.7k words across fiction and technical docs.
• LongBench v2: extension of LongBench, but evals are MCQ only.
• L-Eval: 20 tasks and >500 long documents (up to 200k tokens), with several QA-oriented tasks.
• HELMET: includes reference-based evaluation for long-context QA, with measures for positional robustness.
• MultiDoc2Dial: modeling dialogues grounded in multiple documents; evaluates the ability to integrate info across multiple docs.
• Frustratingly Hard Evidence Retrieval for QA Over Books: reframed NarrativeQA as an open-domain task where book text must be retrieved.

Links to resources, papers, tech blogs, etc. appreciated 🙏
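One way to operationalize the positional-robustness point above is sketched below: draw source passages from the beginning, middle, and end of the document, generate questions from each, and report faithfulness per position. The `generate_question`, `answer_fn`, and `judge_fn` callables are placeholders for whatever LLM calls you use.

```python
from typing import Callable, Dict, List

def build_positional_eval_set(
    document: str,
    generate_question: Callable[[str], Dict],  # placeholder: returns {"question": str, "reference_answer": str}
    n_per_position: int = 10,
) -> List[Dict]:
    """Sample question sources from the start, middle, and end of a long document."""
    third = len(document) // 3
    segments = {"beginning": document[:third],
                "middle": document[third:2 * third],
                "end": document[2 * third:]}
    eval_set = []
    for position, text in segments.items():
        for _ in range(n_per_position):
            qa = generate_question(text)
            eval_set.append({**qa, "position": position})
    return eval_set

def faithfulness_by_position(eval_set: List[Dict],
                             answer_fn: Callable[[str], str],
                             judge_fn: Callable[[str, str, str], bool]) -> Dict[str, float]:
    """answer_fn answers each question; judge_fn returns a binary faithful/unfaithful label."""
    totals: Dict[str, tuple] = {}
    for item in eval_set:
        label = judge_fn(item["question"], answer_fn(item["question"]), item["reference_answer"])
        n, k = totals.get(item["position"], (0, 0))
        totals[item["position"]] = (n + 1, k + int(label))
    return {pos: k / n for pos, (n, k) in totals.items()}
```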
-
What research tasks can LLMs beat expert humans at today? In a new paper I show that one area is structured extraction of methods information from academic papers. We trained highly skilled graduate students for hundreds of hours to identify the use of 30 methods in social science papers (regression, instrumental variables, matching, difference-in-differences, interpretivism, etc.) and tested their performance against GPT-5 mini. The result: GPT-5 mini uniformly beats them on sensitivity and equals human performance on specificity, and it does this at 1/1000th of the cost. This task is gruelling for humans and involves hours of carefully skimming footnotes and methods sections. This technology will allow research assistants to spend far more time on higher-level intellectual tasks. Link: https://lnkd.in/etbtnVWu Coauthored with Vincent Arel-Bundock and Ryan Briggs
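For readers less used to these terms: sensitivity is the share of papers that truly use a method which the annotator flags, and specificity is the share of papers that do not use it which the annotator correctly leaves unflagged. A minimal sketch of computing both per method, assuming simple boolean labels (the record layout is an assumption, not the paper's data format):

```python
from typing import Dict, List

def sensitivity_specificity(
    gold: List[Dict[str, bool]],       # per paper: {"regression": True, "matching": False, ...}
    predicted: List[Dict[str, bool]],  # same shape, from the human coder or the LLM, same paper order
    methods: List[str],
) -> Dict[str, Dict[str, float]]:
    """Per-method sensitivity (recall on positives) and specificity (recall on negatives)."""
    results = {}
    for m in methods:
        tp = fn = tn = fp = 0
        for g, p in zip(gold, predicted):
            if g[m] and p[m]:
                tp += 1
            elif g[m] and not p[m]:
                fn += 1
            elif not g[m] and not p[m]:
                tn += 1
            else:
                fp += 1
        results[m] = {
            "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
            "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        }
    return results
```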
-
As large language models (#LLMs) become central to extracting clinical information from electronic health records (#EHRs), #oncology researchers face the challenge of ensuring their reliability. A new paper by Melissa Estevez and colleagues introduces the Validation of Accuracy for LLM-/ML-Extracted Information and Data (#VALID) Framework, a comprehensive approach designed to evaluate the accuracy, consistency, and fairness of #AI-generated clinical variables.

The VALID framework is built on three key pillars:
1. Variable-level performance metrics: benchmarking LLM-extracted variables against expert human abstraction to quantify accuracy, completeness, and relative performance.
2. Verification checks: identifying patient-level and cohort-level inconsistencies to surface latent errors, enhance face validity, and guide targeted model refinement.
3. Replication and benchmarking analyses: assessing whether analyses performed with LLM-extracted data reproduce results from human-abstracted datasets or established external benchmarks.

As oncology increasingly adopts AI-powered curation pipelines, frameworks like VALID will be important for establishing quality standards, guiding model improvement, and fostering confidence among researchers, regulators, and clinicians. This paper marks a significant step toward a more transparent and reliable future for AI-enabled real-world evidence (#RWE).
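As a toy illustration of the first pillar (variable-level performance against expert abstraction), the sketch below computes per-variable agreement and completeness. The record layout, variable names, and string-matching comparison are assumptions for illustration, not the VALID framework's actual metrics code.

```python
from typing import Dict, List, Optional

def variable_level_metrics(
    human: List[Dict[str, Optional[str]]],  # expert-abstracted records, keyed by variable name
    llm: List[Dict[str, Optional[str]]],    # LLM-extracted records for the same patients, same order
    variables: List[str],
) -> Dict[str, Dict[str, float]]:
    """Agreement with human abstraction (where both are present) and completeness per variable."""
    out = {}
    for var in variables:
        both = agree = filled = 0
        for h, m in zip(human, llm):
            if m.get(var) is not None:
                filled += 1
            if h.get(var) is not None and m.get(var) is not None:
                both += 1
                agree += int(str(h[var]).strip().lower() == str(m[var]).strip().lower())
        out[var] = {
            "accuracy_vs_human": agree / both if both else float("nan"),
            "completeness": filled / len(llm) if llm else float("nan"),
        }
    return out
```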
-
Evaluation is an exciting space and critical for putting AI apps in production and keeping them performing as well or better as the environment or models change. I'm still looking for a great company in this space.

There are four primary evaluation methodologies, which can be broadly categorized as either benchmark-based or judgment-based:
1. Multiple-Choice Benchmarks: Quantify an LLM's knowledge recall through standardized tests like MMLU. They are reproducible and scalable but do not assess real-world utility or reasoning.
2. Verifiers: Assess free-form answers in domains like math and code by programmatically checking a final, extracted answer against a ground truth. This is crucial for evaluating reasoning but is limited to deterministically verifiable domains.
3. Leaderboards: Rank models based on aggregated human preferences, as exemplified by LM Arena. This method captures subjective qualities like style and helpfulness but is susceptible to bias and lacks the instant feedback needed for active development.
4. LLM-as-a-Judge: Employ a powerful LLM to score another model's output against a reference answer using a detailed rubric. This is a scalable and consistent alternative to human evaluation but is highly dependent on the judge model's capabilities and the rubric's design.

A strong score on multiple-choice benchmarks suggests solid general knowledge. High performance on verifier-based tasks indicates proficiency in technical domains. However, if that same model scores poorly on leaderboards or LLM-as-a-judge evaluations, it may indicate issues with articulation, style, or user helpfulness, suggesting a need for fine-tuning. https://lnkd.in/gSdFpScW
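The verifier category is the easiest to picture in code: extract a final answer from free-form model output and check it programmatically against ground truth. A minimal numeric-answer verifier, as a sketch only (the last-number extraction heuristic is an assumption, not a standard):

```python
import re
from typing import Optional

def extract_final_number(model_output: str) -> Optional[float]:
    """Take the last number in the model's free-form answer as its final answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", model_output.replace(",", ""))
    return float(matches[-1]) if matches else None

def verify(model_output: str, ground_truth: float, tol: float = 1e-6) -> bool:
    """Deterministic check: correct iff the extracted answer matches the reference."""
    answer = extract_final_number(model_output)
    return answer is not None and abs(answer - ground_truth) <= tol

# Example usage
print(verify("So the total is 1,234 apples.", 1234.0))  # True
```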
-
Apple Researchers Present KGLens: A Novel AI Method Tailored for Visualizing and Evaluating the Factual Knowledge Embedded in LLMs

Researchers from Apple introduced KGLENS, an innovative knowledge probing framework developed to measure knowledge alignment between knowledge graphs (KGs) and LLMs and to identify LLMs' knowledge blind spots. The framework employs a Thompson-sampling-inspired method with a parameterized knowledge graph (PKG) to probe LLMs efficiently. KGLENS features a graph-guided question generator that converts KG edges into natural language using GPT-4, producing two types of questions (fact-checking and fact-QA) to reduce answer ambiguity. Human evaluation shows that 97.7% of generated questions are sensible to annotators.

The framework initializes a PKG where each edge is augmented with a beta distribution indicating the LLM's potential deficiency on that edge. It then samples edges based on their probability, generates questions from these edges, and examines the LLM through a question-answering task. The PKG is updated based on the results, and the process iterates until convergence. The question generator creates yes/no questions for judgment and wh-questions for generation, with the question type controlled by the graph structure; entity aliases are included to reduce ambiguity.

Read our full take on KGLens: https://lnkd.in/gtJTifF6
Paper: https://lnkd.in/gxHShs7i
Apple He Bai Yizhe Zhang Yi Su Xiaochuan Niu Navdeep Jaitly
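To make the Thompson-sampling loop concrete, here is a minimal sketch of the edge-selection and update step. The question generation and answer checking are stubbed out as a placeholder callable, and the Beta-prior parameterization is an assumption rather than KGLENS's exact formulation.

```python
import random
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str, str]  # (subject, relation, object) triple from the KG

def probe_llm_with_pkg(
    edges: List[Edge],
    ask_and_check: Callable[[Edge], bool],  # placeholder: generate a question from the edge, query the LLM, grade it
    n_rounds: int = 1000,
) -> Dict[Edge, Tuple[int, int]]:
    """Thompson sampling over edges: each edge keeps a Beta(failures+1, successes+1) 'deficiency' belief."""
    state: Dict[Edge, Tuple[int, int]] = {e: (0, 0) for e in edges}  # (failures, successes)
    for _ in range(n_rounds):
        # Sample a deficiency estimate for every edge and probe the most suspicious one.
        sampled = {e: random.betavariate(f + 1, s + 1) for e, (f, s) in state.items()}
        edge = max(sampled, key=sampled.get)
        correct = ask_and_check(edge)
        f, s = state[edge]
        state[edge] = (f, s + 1) if correct else (f + 1, s)
    # Edges with many failures relative to successes mark the LLM's knowledge blind spots.
    return state
```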