Using powerful LLMs (GPT-4) as evaluators for smaller models is becoming the de facto standard. However, relying on closed-source models is suboptimal due to a lack of control, transparency, and versioning. 🤔 The recent paper "Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models" shows that open LLMs can match GPT-4's evaluation skills. 🚀

🔥𝗣𝗿𝗼𝗺𝗲𝘁𝗵𝗲𝘂𝘀 𝗜𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻
1️⃣ Created a new dataset with 1,000 scoring rubrics, 20K instructions (20 per rubric), and 100K responses with feedback and scores (1-5) generated by GPT-4 (5 per instruction) → 100K training samples
2️⃣ Fine-tuned Llama-2-Chat-13B on this dataset (1️⃣) to generate the feedback (Prometheus 🔥)
3️⃣ Evaluated Prometheus on seen and unseen rubrics (including MT Bench), comparing correlation with human scores and GPT-4 scores

✨𝗜𝗻𝘀𝗶𝗴𝗵𝘁𝘀
🥇 Scores a Pearson correlation of 0.897 with human evaluators, on par with GPT-4 (0.882) and outperforming GPT-3.5 (0.392)
🧑⚖️ Can be used as a replacement for GPT-4 in LLM-as-a-Judge setups
🧬 High correlation with GPT-4 → possibly due to imitation learning?
🔢 Requires 4 components in the input: the prompt, the generation to evaluate, a score rubric, and a reference generation
😍 Prometheus can be further improved by training on customized rubrics and feedback, e.g. for company-specific domains
🧠 Can be used as a Reward Model for RLHF, or to create preference pairs for DPO
🤗 Dataset and model available on Hugging Face

Paper: https://lnkd.in/eXx-n_tx
Dataset: https://lnkd.in/e8gVRGm4
Model: https://lnkd.in/eF9tKiTc

Kudos to the researchers for this contribution to making AI more explainable, reproducible, and open! 🤗
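The four input components listed above (prompt, generation to evaluate, rubric, reference) are assembled into a single evaluation prompt for the judge. A minimal sketch in Python, using an illustrative template rather than the exact Prometheus format:

```python
def build_eval_prompt(instruction, response, rubric, reference):
    """Assemble the four inputs a Prometheus-style judge expects:
    the original prompt, the generation to evaluate, a score rubric,
    and a reference (score-5) answer. Template is illustrative."""
    return (
        "###Task Description:\n"
        "Given an instruction, a response, a reference answer, and a score "
        "rubric, write feedback and assign an integer score from 1 to 5.\n\n"
        f"###Instruction:\n{instruction}\n\n"
        f"###Response to evaluate:\n{response}\n\n"
        f"###Reference answer (score 5):\n{reference}\n\n"
        f"###Score rubric:\n{rubric}\n\n"
        "###Feedback:"
    )

prompt = build_eval_prompt(
    instruction="Explain photosynthesis to a 10-year-old.",
    response="Plants eat sunlight to make food.",
    rubric="1: incorrect ... 5: accurate, age-appropriate, complete",
    reference="Plants use sunlight, water, and CO2 to make their own sugar...",
)
```

The completed prompt is then fed to the fine-tuned model, which generates feedback text followed by the final 1-5 score.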
Training Evaluation Models
Explore top LinkedIn content from expert professionals.
Summary
Training evaluation models are systems and tools used to assess how well training programs, tasks, or AI models perform against defined standards or real-world outcomes. These models help organizations measure progress, pinpoint areas for improvement, and ensure training meets both technical and business needs.
- Define clear criteria: Specify what you want to measure and outline detailed descriptions for each standard before starting your evaluation.
- Combine data sources: Use a mix of human-labeled and synthetic examples to build a robust dataset and avoid bias in your evaluation model.
- Track real-world impact: Test models on realistic tasks that reflect workplace demands to ensure your training adds practical business value.
Here is a step-by-step guide to successfully finetuning your own LLM judge on granular / domain-specific evaluation tasks…

Background: LLM-as-a-Judge is a reference-free evaluation technique that prompts an off-the-shelf / proprietary LLM to evaluate the output of another LLM. This approach is effective, but it has limitations:
- LLM APIs are not transparent and come with security concerns.
- Updates to the model (which we can’t control) impact evaluation results.
- Every call to the LLM judge costs money, so cost can become a concern.
Proprietary LLMs are also best at tasks that are aligned with their training data, tend to avoid providing strong scores / opinions, and may struggle with domain-specific evaluation. For these reasons, we may want to finetune our own LLM judge using the steps below.

(1) Solidify the evaluation criteria: The first step of evaluation is deciding what exactly we want to evaluate. We should:
- Outline a specific set of criteria that we care about.
- Write a detailed description for each of these criteria.
Over time, we must evolve, refine, and expand our criteria as we better understand our evaluation task.

(2) Prepare a dataset: Human-labeled data allows us to finetune and evaluate our LLM judge. Finetuning an LLM judge will require ~1K-100K evaluation examples, and collecting more / better data is always beneficial. Each example should contain an input instruction, a response, a description of the evaluation criteria, and a scoring rubric. Each input is paired with a scoring rationale and a final result (e.g., a 1-5 Likert score).

(2.5) Use synthetic data: Using purely synthetic training data can introduce bias by exposing the model to a narrow distribution of data, but combining human and synthetic data can be effective. For examples, check out Constitutional AI [1] or RLAIF [2].
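The dataset format described in step (2) can be sketched as a single record; every field name and value below is illustrative, not taken from a real dataset:

```python
# One hypothetical training example for an LLM judge, following the
# fields listed in step (2): instruction, response, criteria, and
# rubric as inputs; rationale and final 1-5 Likert score as targets.
example = {
    "instruction": "Summarize the attached contract clause.",
    "response": "The clause limits liability to direct damages.",
    "criteria": "Faithfulness: does the summary reflect the source?",
    "rubric": {
        1: "Contradicts the source.",
        3: "Partially faithful; omits key conditions.",
        5: "Fully faithful and complete.",
    },
    # Targets the judge learns to generate:
    "rationale": "The summary captures the liability cap but omits "
                 "the carve-out for gross negligence.",
    "score": 3,
}
```

Keeping inputs and targets in one flat record like this makes it easy to audit label quality and to regenerate rationales later (see step 3).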
(3) Focus on the rationales: We obviously want the scores over which the LLM judge is trained to be accurate, but we should also create high-quality rationales for each score. Tweaking the rationales over which the LLM judge is trained can make the model more helpful.

(4) Use reference answers: This step is optional, but we can prepare reference answers for each example in the dataset. Reference answers simplify evaluation by allowing the LLM judge to compare the response to a reference instead of having to score the response in an absolute manner.

(5) Train the model: Once all of our data (and, optionally, reference answers) has been collected, we can train our LLM judge using a basic SFT approach. Finetuning an LLM judge is technically no different from any other instruction tuning task!

For a full implementation of this process, check out the Prometheus papers [3, 4, 5]. This work shows that we can create highly accurate, domain-specific evaluation models, even surpassing the performance of LLM-as-a-Judge with proprietary LLMs, by simply finetuning an LLM on a small amount of data that is relevant to our evaluation task.
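Step (5)'s SFT preparation boils down to flattening each record into a prompt/target string. A minimal sketch, with a hypothetical template and `[SCORE]` delimiter (the real format is whatever your tokenizer/trainer expects):

```python
# Hypothetical example following step (2)'s fields.
ex = {
    "instruction": "Summarize the clause.",
    "response": "The clause caps liability at direct damages.",
    "criteria": "Faithfulness to the source text.",
    "rubric": {1: "Contradicts the source.", 5: "Fully faithful."},
    "rationale": "Captures the cap but omits the gross-negligence carve-out.",
    "score": 3,
}

def to_sft_text(example):
    """Flatten one judge-training example into a single instruction-tuning
    string for the basic SFT approach in step (5). The template and the
    [SCORE] delimiter are illustrative, not a standard format."""
    rubric = "\n".join(f"{k}: {v}" for k, v in sorted(example["rubric"].items()))
    prompt = (
        "Evaluate the response against the rubric.\n"
        f"Instruction: {example['instruction']}\n"
        f"Response: {example['response']}\n"
        f"Criteria: {example['criteria']}\n"
        f"Rubric:\n{rubric}\n"
        "Feedback:"
    )
    target = f" {example['rationale']} [SCORE] {example['score']}"
    return prompt + target

text = to_sft_text(ex)
```

From here, any standard instruction-tuning pipeline applies unchanged, which is exactly the point of step (5).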
-
The best part of my job is I get to learn something new every day. When I joined OpenAI, I started to understand how quickly the capabilities of our models were advancing, as measured by performance on structured evaluations. One example of an evaluation or eval would be a set of very hard math problems. Our models kept getting better and better at these kinds of problems over time and we recently achieved gold medal-level performance on the 2025 International Mathematical Olympiad. But as a social scientist who works on firms and other organizations, I also had this nagging concern that these kinds of evaluations on objective tasks were not necessarily the best indicator of how useful AI could be at work. For example, having a machine that can solve the hardest math problems in the world doesn’t necessarily create new revenue or lower costs for firms. So how do you build evaluations for tasks that are more subjective, more realistic and more valuable? The OpenAI Frontier Evals team just took a step in that direction today. Today they’re introducing GDPval-v0 — a new benchmark designed to measure how leading models perform on 1,300+ real-world tasks, across 44 occupations and 9 major industries. These are realistic work products like legal briefs, engineering diagrams, and nursing care plans developed by professionals with an average of 14 years of experience in the field. The goal is to create an evaluation that reflects where AI can generate real business value. As we keep training new models and improving them, we can use evaluations like this to make sure we are getting better at solving the most important problems. A few early findings: - Top models are already producing expert-level results in many tasks and doing so ~100× faster and cheaper. - Performance scales with larger models, more reasoning, and richer context. Reinforcement training on these tasks pushes it even further. 
Look at the steady progress in capabilities as we tested the performance of successive models of ChatGPT.
- Most interestingly, this eval demonstrates how models can free people up to focus on the creative, judgment-intensive parts of their work.
The team has open-sourced a subset of tasks and grading tools, and we’re inviting professionals to contribute new ones as we build what’s next.
Here’s the full paper: https://lnkd.in/eiMbmNnS
Great work from the team who led the charge on this: Tejal, Elizabeth, Grace, Rachel, and Phoebe.
-
Most people still think of LLMs as “just a model.” But if you’ve ever shipped one in production, you know it’s not that simple. Behind every performant LLM system, there’s a stack of decisions about pretraining, fine-tuning, inference, evaluation, and application-specific tradeoffs. This diagram captures it well: LLMs aren’t one-dimensional. They’re systems. And each dimension introduces new failure points or optimization levers. Let’s break it down:

🧠 Pre-Training
Start with modality.
→ Text-only models like LLaMA, UL2, and PaLM have predictable inductive biases.
→ Multimodal ones like GPT-4, Gemini, and LaVIN introduce more complex token fusion, grounding challenges, and cross-modal alignment issues.
Understanding the data diet matters just as much as parameter count.

🛠 Fine-Tuning
This is where most teams underestimate complexity:
→ PEFT strategies like LoRA and Prefix Tuning help with parameter efficiency, but can behave differently under distribution shift.
→ Alignment techniques (RLHF, DPO, RAFT) aren’t interchangeable. They encode different human preference priors.
→ Quantization and pruning decisions will directly impact latency, memory usage, and downstream behavior.

⚡️ Efficiency
Inference optimization is still underexplored. Techniques like dynamic prompt caching, paged attention, speculative decoding, and batch streaming make the difference between real-time and unusable. The infra layer is where GenAI products often break.

📏 Evaluation
One benchmark doesn’t cut it. You need a full matrix:
→ NLG (summarization, completion) and NLU (classification, reasoning),
→ alignment tests (honesty, helpfulness, safety),
→ dataset quality, and
→ cost breakdowns across training + inference + memory.
Evaluation isn’t just a model task; it’s a systems-level concern.

🧾 Inference & Prompting
Multi-turn prompts, CoT, ToT, and ICL all behave differently under different sampling strategies and context lengths. Prompting isn’t trivial anymore. It’s an orchestration layer in itself.
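Of the efficiency techniques named above, dynamic prompt caching is the easiest to illustrate. A toy sketch, where the cached "state" is a stand-in for a real KV cache and the expensive forward pass is mocked by a counter:

```python
class PrefixCache:
    """Toy sketch of dynamic prompt caching: reuse precomputed 'state'
    for repeated prompt prefixes (e.g. a shared system prompt) so only
    the novel suffix is processed on each call."""

    def __init__(self):
        self._cache = {}
        self.misses = 0  # how many times we paid for the prefix

    def _encode(self, text):
        self.misses += 1       # stands in for an expensive forward pass
        return text.split()    # stands in for KV-cache state

    def run(self, prefix, suffix):
        if prefix not in self._cache:
            self._cache[prefix] = self._encode(prefix)
        # Reuse cached prefix state; only the suffix is new work.
        return self._cache[prefix] + suffix.split()

cache = PrefixCache()
cache.run("You are a helpful assistant.", "What is 2+2?")
cache.run("You are a helpful assistant.", "Name a prime number.")
assert cache.misses == 1  # prefix encoded once, reused on the second call
```

Production systems (e.g. vLLM's prefix caching) do this at the attention-cache level, but the amortization logic is the same.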
Whether you’re building for legal, education, robotics, or finance, the “general-purpose” tag doesn’t hold. Every domain has its own retrieval, grounding, and reasoning constraints. ------- Follow me (Aishwarya Srinivasan) for more AI insight and subscribe to my Substack to find more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg
-
Exciting News in AI Research: LLM4Ranking Framework Released!

I'm thrilled to share a groundbreaking development in the field of information retrieval and large language models: the LLM4Ranking framework! Researchers from Renmin University of China, Shanghai Jiao Tong University, and Carnegie Mellon University have developed this unified, easy-to-use framework that enables seamless integration of large language models (LLMs) for document reranking tasks.

>> What is LLM4Ranking?
LLM4Ranking is a comprehensive toolkit that allows researchers and practitioners to leverage the power of LLMs for reranking documents in just a few lines of code. The framework supports various reranking paradigms:
- Pointwise: Evaluates relevance scores for individual query-document pairs
- Pairwise: Compares document pairs to determine relative relevance
- Listwise: Directly generates a ranking order for a list of documents
- Selection-based: Implements tournament-style selection mechanisms like TourRank

>> Technical Implementation Details
The architecture consists of three core modular components:
1. LLM Interface: Supports both open-source models via HuggingFace Transformers and proprietary LLMs through APIs (OpenAI, Anthropic Claude, DeepSeek, etc.). It includes quantization strategies using bitsandbytes and GPTQ for memory efficiency, with compatibility for vLLM acceleration.
2. Ranking Logic Abstraction: Decouples abstract ranking paradigms from concrete model implementations, making it easy to implement and evaluate new customized ranking methods.
3. Model Component: Provides three approaches for LLM interaction:
- Generation-based models (like RankGPT)
- Log-likelihood-based models (for query generation)
- Logits-based models (for relevance scoring)
The framework also includes robust training capabilities, with supervised fine-tuning pipelines and specialized training for logits-based models using various loss functions, including cross-entropy and learning-to-rank losses like RankNet.

>> Evaluation Capabilities
LLM4Ranking supports comprehensive evaluation across multiple popular academic datasets, including TREC DL, BEIR, MAIR, NevIR, and Bright. The evaluation system provides detailed metrics (MAP, NDCG, Recall) and performance analytics, including reranking latency and token usage.

This framework represents a significant contribution to both academic research and practical applications in search engines and retrieval-augmented generation systems. The code is publicly available, enabling the community to build upon this work and advance the field further.
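The pointwise paradigm described above can be sketched independently of LLM4Ranking's actual API (which this does not reproduce): score each query-document pair in isolation, then sort by score. Here a term-overlap function stands in for an LLM logits-based relevance scorer:

```python
def pointwise_rerank(query, docs, score_fn):
    """Generic pointwise reranking: score each (query, document) pair
    independently with `score_fn` (a stand-in for an LLM relevance
    scorer), then sort documents by descending score."""
    scored = [(score_fn(query, d), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored]

def overlap(query, doc):
    # Toy scorer: fraction of query terms appearing in the document.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

docs = ["cats chase mice", "stock prices fell", "mice fear cats"]
ranking = pointwise_rerank("cats and mice", docs, overlap)
```

Pairwise and listwise methods differ only in what the LLM is asked per call (compare two documents, or order a whole list), which is exactly the abstraction the framework's ranking-logic layer decouples from the model.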
-
Knowledge Graphs as Powerful Evaluation Tools for LLM Document Intelligence 📃

Organizations across industries are grappling with an unprecedented deluge of unstructured information contained in documents. From medical records and legal contracts to financial reports and technical manuals, these text-heavy resources hold valuable insights that, if properly harnessed, could revolutionize decision-making processes and operational efficiencies.

Document intelligence powered by LLMs represents a paradigm shift in how we approach unstructured data. These sophisticated AI models, trained on vast corpora of text, demonstrate remarkable abilities in understanding context, extracting relevant information, and even generating human-like responses. Unlike traditional rule-based systems or narrow AI models, LLMs offer unparalleled versatility in tackling diverse document processing tasks. They can adapt to new domains with minimal fine-tuning, understand complex relationships within text, and provide insights that were previously accessible only through human expertise.

The applications of LLM-driven document intelligence are vast and transformative. In healthcare, these models can analyze medical records to assist in diagnosis and treatment planning. In the legal sector, they can review contracts to identify potential risks or inconsistencies. The potential for increased efficiency, accuracy, and novel insights across industries is immense.

However, as we venture into this new frontier of AI-powered document processing, a critical question emerges: how do we effectively evaluate the performance of these sophisticated language models? This is where the importance of robust evaluation methodologies comes into sharp focus. Evaluation is not merely an academic exercise; it is the cornerstone of responsible AI deployment in real-world scenarios.
Traditional evaluation metrics for natural language processing tasks, such as BLEU or ROUGE scores, fall short when assessing the complex, multi-faceted nature of document intelligence. This is where Knowledge Graphs (KGs) emerge as a powerful and innovative evaluation tool. Knowledge graphs offer a structured representation of information, capturing entities, relationships, and complex hierarchies within documents. By leveraging KGs in the evaluation process, we can assess LLMs’ performance in a way that aligns more closely with human-like understanding of document content.

The KG evaluation tools by Zhang et al. (2024) offer a sophisticated approach to assessing document intelligence, especially for radiology reports:
- ReXKG-NSC measures entity capture. It compares nodes in AI-generated and human-written report graphs.
- ReXKG-AMS evaluates relationship accuracy. It compares edge structures between graphs.
- ReXKG-SCS assesses complex concept representation. It examines important subgraphs within the larger structure.
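The node-comparison idea behind ReXKG-NSC can be illustrated with plain set arithmetic. The exact metric definitions are in Zhang et al. (2024), so this is only a sketch of the underlying idea, with made-up entity names:

```python
def node_similarity(generated_nodes, reference_nodes):
    """Illustration of ReXKG-NSC-style node comparison: the fraction of
    entities in the reference (human-written) report graph that also
    appear in the AI-generated graph. The real metric's formula may
    differ; see the paper."""
    gen, ref = set(generated_nodes), set(reference_nodes)
    if not ref:
        return 0.0
    return len(gen & ref) / len(ref)

score = node_similarity(
    generated_nodes={"lung", "opacity", "effusion"},
    reference_nodes={"lung", "opacity", "pneumothorax", "effusion"},
)
# 3 of 4 reference entities were recovered by the generated graph
```

Edge-structure (ReXKG-AMS) and subgraph (ReXKG-SCS) comparisons extend the same set-overlap intuition from nodes to relations and to whole concept clusters.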
-
🔍 Ever wondered "Is it ok to use DeepSeek R1?" or which AI model best fits your company's specific needs? As enterprise AI adoption accelerates, organizations face increasingly complex decisions about which models to trust. Excited to share my recent work on 𝐌𝐨𝐝𝐞𝐥 𝐓𝐫𝐮𝐬𝐭 𝐒𝐜𝐨𝐫𝐞𝐬 at Credo AI that addresses this challenge head-on! We synthesized 60+ benchmarks across 95 use cases in 21 industries to create context-specific perspectives of leading models 📖 Read the full technical blog here: https://lnkd.in/ekQ_bC9w 𝐓𝐡𝐞 𝐌𝐨𝐝𝐞𝐥 𝐓𝐫𝐮𝐬𝐭 𝐒𝐜𝐨𝐫𝐞 𝐅𝐫𝐚𝐦𝐞𝐰𝐨𝐫𝐤 𝐡𝐞𝐥𝐩𝐬 𝐞𝐧𝐭𝐞𝐫𝐩𝐫𝐢𝐬𝐞𝐬: [1] First filter models based on non-negotiable requirements (security, infrastructure compatibility, legal compliance) [2] Then evaluate models across capability, safety, cost, and latency [3] Finally contextualize these evaluations for specific industry applications through our novel relevance scoring system 𝐊𝐞𝐲 𝐓𝐚𝐤𝐞𝐚𝐰𝐚𝐲𝐬 🟣 Context matters in model selection: Generic benchmarks provide limited insight into model performance for specific enterprise use cases. 🟣 The evaluation ecosystem has significant gaps: Most industries lack relevant benchmarks. 🟣 Non-negotiables are essential filters: Security, infrastructure compatibility, and legal requirements must be assessed before evaluating performance tradeoffs. 🟣 Safety evaluations are underdeveloped: The ecosystem needs more comprehensive safety benchmarks, especially for newer models. 🟣 Different models excel in different contexts: No single model dominates across all industries and dimensions. 🟣 Cost-capability tradeoffs are significant: DeepSeek R1 demonstrates impressive capability relative to its cost, highlighting the importance of multidimensional analysis. 🟣 Reasoning models show advantages: Models like OpenAI's o1/o3, DeepSeek R1, and Claude 3.7 demonstrate strong capabilities across industries. 𝐍𝐞𝐱𝐭 𝐒𝐭𝐞𝐩𝐬 𝐟𝐨𝐫 𝐭𝐡𝐞 𝐅𝐢𝐞𝐥𝐝 🔭 Develop industry-specific benchmarks: Create targeted evaluations for underserved industries to better assess model suitability. 
🏛️ Establish evaluation standards: Industry consortiums, research institutions, and AI providers should collaborate on standardized evaluation approaches. ✅ Move from relative to absolute trust measures: Develop certification frameworks with clear minimum thresholds for capability and safety. 🦺 Improve safety evaluation coverage: Prioritize comprehensive safety assessments for all models, especially newer releases. 🌉 Bridge governance and technical evaluation: Connect benchmark results to governance requirements and risk management frameworks. 🔍 Increase transparency in model reporting: Providers should proactively publish comprehensive evaluations to build trust.
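The filter-then-evaluate flow in steps [1] and [2] of the framework above can be sketched as follows. All model names, requirement flags, scores, and weights below are illustrative, not real benchmark results or the actual Credo AI scoring method:

```python
def rank_models(models, non_negotiables, weights):
    """Sketch of the Model Trust Score flow: (1) filter models on hard
    requirements, (2) score the survivors as a weighted sum over
    evaluation dimensions. In step (3) the weights themselves would be
    derived from use-case relevance; here they are fixed by hand."""
    passing = [m for m in models
               if all(m["requirements"].get(r) for r in non_negotiables)]

    def score(m):
        return sum(weights[d] * m["scores"][d] for d in weights)

    return sorted(passing, key=score, reverse=True)

models = [
    {"name": "model-a",
     "requirements": {"soc2": True},
     "scores": {"capability": 0.90, "safety": 0.70, "cost": 0.40}},
    {"name": "model-b",  # stronger scores, but fails a hard requirement
     "requirements": {"soc2": False},
     "scores": {"capability": 0.95, "safety": 0.80, "cost": 0.90}},
]
ranked = rank_models(models, ["soc2"],
                     {"capability": 0.5, "safety": 0.3, "cost": 0.2})
```

Note that model-b never reaches the scoring stage at all, which is the point of treating non-negotiables as filters rather than as weighted dimensions.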
-
Self-Taught Evaluators: improving LLM-as-a-Judge evaluators without human annotations

Training evaluators requires a large amount of human preference judgments over model responses, which is costly, and the data becomes stale as models improve. To address this, the authors of this paper present an approach that improves evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, the proposed iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions.

𝗠𝗲𝘁𝗵𝗼𝗱𝗼𝗹𝗼𝗴𝘆
An iterative training scheme bootstraps improvements by annotating the current model’s judgments using constructed synthetic data, so that the Self-Taught Evaluator is more performant on the next iteration. The pipeline is as follows:
i) Initialization - assume access to a large set of human-written user instructions and an initial seed LLM
ii) Instruction Selection - select a challenging, balanced distribution of user instructions from the uncurated set by categorizing them via an LLM
iii) Response Pair Construction - for each user instruction (example), create a preference pair of two model responses (chosen and rejected), generating them via prompting such that the rejected response is likely of lower quality than the chosen response
iv) Judgment Annotation - for each example, sample up to N LLM-as-a-Judge reasoning traces and judgments from the current model. If we find a correct judgment, we add that example to our training set; otherwise, we discard it.
v) Model Fine-tuning (Iterative Training) - fine-tune the model on the newly constructed training set, yielding an updated model for the next iteration

𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 𝗥𝗲𝘀𝘂𝗹𝘁𝘀
- The model is initialized from Llama3-70B-Instruct
- The Self-Taught Evaluator improves a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench
- It outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples

𝗟𝗶𝗺𝗶𝘁𝗮𝘁𝗶𝗼𝗻𝘀
- A large LLM (70B params) was used; whether smaller models work with this approach was not explored
- The approach requires a capable instruction-finetuned model that is already reasonably aligned to human preferences
- LLM-as-a-Judge models usually have longer outputs, and thus higher inference cost, because of the reasoning chain generation

𝗕𝗹𝗼𝗴: https://lnkd.in/e_yWKX4y
𝗣𝗮𝗽𝗲𝗿: https://lnkd.in/eN4hYZBb
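Steps ii)-iv) of the pipeline can be sketched as a single iteration of data construction. Here `generate_pair` and `judge` are stand-ins for the LLM calls described above, and the toy judge below is always correct purely for illustration:

```python
def self_taught_iteration(instructions, generate_pair, judge, n_samples=4):
    """One data-construction iteration of a Self-Taught-Evaluator-style
    loop: build a (chosen, rejected) pair per instruction, sample the
    current judge up to `n_samples` times, and keep only examples where
    some sampled judgment correctly prefers the chosen response."""
    train_set = []
    for inst in instructions:
        chosen, rejected = generate_pair(inst)   # step iii
        for _ in range(n_samples):               # step iv
            trace, verdict = judge(inst, chosen, rejected)
            if verdict == "chosen":              # correct judgment found
                train_set.append((inst, chosen, rejected, trace))
                break
        # if no correct judgment is sampled, the example is discarded
    return train_set

# Toy stand-ins: a pair generator and an always-correct judge.
pairs = lambda inst: (inst + " (detailed answer)", inst + " (vague answer)")
toy_judge = lambda inst, c, r: ("reasoning trace", "chosen")
data = self_taught_iteration(["Explain TCP.", "Define entropy."],
                             pairs, toy_judge)
```

Step v) then fine-tunes the judge on `train_set` and feeds the improved judge back into the next call, which is what makes the scheme iterative.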
-
😅 If not properly configured, LLM judges can cause more trouble than they solve.

LLM judges are quickly becoming a go-to for evaluating LLM outputs, cutting down on human effort. However, they must be carefully configured, whether through training, careful prompting, or human annotations. Here’s a nice paper from Meta that shows how to achieve this using only synthetic training data, without relying on human annotations.

Some Insights:
⛳ The paper uses unlabeled instructions and prompting to generate synthetic preference pairs, where one response is intentionally made inferior to the other.
⛳ An LLM is then used to generate reasoning traces and judgments for these pairs, creating labeled data from the synthetic examples.
⛳ This labeled data is used to retrain the LLM-as-a-Judge, with the process repeated in cycles to progressively improve the model’s evaluation capabilities.
⛳ On the Llama-3-70B-Instruct model, the approach improves accuracy on RewardBench from 75.4 to 88.7 (with majority vote) or 88.3 (without majority vote).

The method matches or even outperforms traditional reward models trained on human-annotated data, demonstrating the potential of using synthetic data for model evaluation without relying on human input.

Link: https://lnkd.in/eRhF4ykx
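The majority-vote variant mentioned above (88.3 to 88.7 on RewardBench) is simply taking the most common verdict across several sampled judgments:

```python
from collections import Counter

def majority_vote(judgments):
    """Majority voting over sampled judge outputs: sample the judge
    several times (e.g. at different temperatures) and return the most
    common verdict, smoothing out individual noisy judgments."""
    return Counter(judgments).most_common(1)[0][0]

verdict = majority_vote(["A", "B", "A", "A", "B"])  # → "A"
```

The tradeoff is cost: each vote is another full judge generation, compounding the longer reasoning-chain outputs these models already produce.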