Your model is trained. But is it actually good? Most ML engineers default to accuracy, then wonder why their model fails in production. Here are 20 evaluation metrics and when to actually use each one:

Classification:
- Accuracy → Balanced datasets only.
- Precision → When false positives are costly.
- Recall → When false negatives matter more.
- F1 Score → Imbalanced datasets; balances precision and recall.
- ROC-AUC → Threshold-independent ranking quality for binary classifiers.
- Log Loss → Probabilistic models. Penalizes confident wrong predictions.
- Confusion Matrix → Error analysis. See exactly where the model breaks.
- Specificity → When correctly detecting negatives matters.
- Balanced Accuracy → Uneven class distributions. Don't trust plain accuracy here.

Regression:
- MAE → Simple, interpretable average error.
- MSE → Penalizes larger errors more heavily.
- RMSE → Error in the original units. Often the most interpretable.
- R² Score → How much variance your model explains.
- Adjusted R² → Feature-heavy models. Penalizes added complexity.
- MAPE → Business forecasting. Error as a percentage.
- Explained Variance → How much of the target's variability the predictions capture.

Clustering:
- Silhouette Score → Cluster cohesion and separation. Good for cluster validation.
- Davies-Bouldin Index → Lower values mean better-separated clusters.

NLP:
- BLEU Score → Machine translation quality.
- ROUGE Score → Text summarization quality.

Accuracy is not a strategy. Picking the right metric for the right problem is. A model that looks great on accuracy can destroy real-world outcomes when the wrong metric guided its evaluation.

Save this. 📌 Which metric do most engineers misuse? 👇
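A minimal sketch of how many of these metrics can be computed with scikit-learn, assuming you already have ground-truth labels, predictions, and predicted probabilities in hand (the arrays below are illustrative toy data, not results):

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, log_loss, confusion_matrix,
    mean_absolute_error, mean_squared_error, r2_score,
    mean_absolute_percentage_error,
)

# --- Classification: labels, predicted probabilities, and hard predictions ---
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.2, 0.3, 0.9, 0.6, 0.7, 0.8])
y_pred = (y_prob >= 0.5).astype(int)

print("accuracy:         ", accuracy_score(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("precision:        ", precision_score(y_true, y_pred))
print("recall:           ", recall_score(y_true, y_pred))
print("f1:               ", f1_score(y_true, y_pred))
print("roc-auc:          ", roc_auc_score(y_true, y_prob))  # uses scores, not hard labels
print("log loss:         ", log_loss(y_true, y_prob))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))

# --- Regression: true values vs. predictions ---
y_true_r = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_r = np.array([2.8, 5.4, 2.9, 6.1])

mse = mean_squared_error(y_true_r, y_pred_r)
print("MAE: ", mean_absolute_error(y_true_r, y_pred_r))
print("MSE: ", mse)
print("RMSE:", np.sqrt(mse))  # back in the original units
print("R²:  ", r2_score(y_true_r, y_pred_r))
print("MAPE:", mean_absolute_percentage_error(y_true_r, y_pred_r))
```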
How to Evaluate Model Performance
Explore top LinkedIn content from expert professionals.
Summary
-
-
The most underestimated part of building LLM applications? Evaluation. Evaluation can take up to 80% of your development time (because it's HARD).

Most people obsess over prompts. They tweak models. Tune embeddings. But when it's time to test whether the whole system actually works? That's where it breaks. Especially in agentic RAG systems, where you're orchestrating retrieval, reasoning, memory, tools, and APIs into one seamless flow. Implementation might take a week. Evaluation takes longer. (And it's what makes or breaks the product.)

Let's clear up a common confusion: 𝗟𝗟𝗠 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 ≠ 𝗥𝗔𝗚 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻. LLM eval tests reasoning in isolation - useful, but incomplete. In production, your model isn't reasoning in a vacuum. It's pulling context from a vector DB, reacting to user input, and shaped by memory + tools. That's why RAG evaluation takes a system-level view. It asks: did this app respond correctly, given the user input and the retrieved context?

Here's how to break it down:

𝗦𝘁𝗲𝗽 𝟭: 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗲 𝗿𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹.
→ Are the retrieved docs relevant? Ranked correctly?
→ Use LLM judges to compute context precision and recall
→ If ranking matters, compute NDCG and MRR
→ Visualize embeddings (e.g. UMAP)

𝗦𝘁𝗲𝗽 𝟮: 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗲 𝗴𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻.
→ Did the LLM ground its answer in the right info?
→ Use heuristics, LLM-as-a-judge, and contextual scoring.

In practice, treat your app as a black box and log:
- User query
- Retrieved context
- Model output
- (Optional) Expected output

This lets you debug the whole system, not just the model.

𝘏𝘰𝘸 𝘮𝘢𝘯𝘺 𝘴𝘢𝘮𝘱𝘭𝘦𝘴 𝘢𝘳𝘦 𝘦𝘯𝘰𝘶𝘨𝘩? 5–10? Too few. 30–50? Good start. 400+? Now you're capturing real patterns and edge cases. Still, start with how many samples you have available, and keep expanding your evaluation split. It's better to have an imperfect evaluation layer than nothing.

Also track latency, cost, throughput, and business metrics (like conversion or retention).

Some battle-tested tools:
→ RAGAS (retrieval-grounding alignment)
→ ARES (factual grounding)
→ Opik by Comet (end-to-end open-source eval + monitoring)
→ Langsmith, Langfuse, Phoenix (observability + tracing)

TL;DR: Agentic systems are complex. Success = making evaluation part of your design from Day 0.

We unpack this in full in Lesson 5 of the PhiloAgents course. 🔗 Check it out here: https://lnkd.in/dA465E_J
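For the ranking metrics mentioned in Step 1, here is a minimal sketch of MRR and NDCG@k computed from per-query graded relevance judgments; it assumes you already have relevance labels for each retrieved document, in retrieval order (the toy numbers are illustrative):

```python
import numpy as np

def mrr(ranked_relevances: list[list[int]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant doc per query."""
    scores = []
    for rels in ranked_relevances:
        rr = 0.0
        for rank, rel in enumerate(rels, start=1):
            if rel > 0:
                rr = 1.0 / rank
                break
        scores.append(rr)
    return float(np.mean(scores))

def dcg_at_k(rels: list[int], k: int) -> float:
    """Discounted Cumulative Gain over the top-k results."""
    gains = np.asarray(rels[:k], dtype=float)
    ranks = np.arange(1, len(gains) + 1)
    return float(np.sum((2 ** gains - 1) / np.log2(ranks + 1)))

def ndcg_at_k(rels: list[int], k: int) -> float:
    """NDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of retrieved docs for two queries, in the order they were returned.
retrievals = [[0, 2, 1, 0, 0], [1, 0, 0, 0, 2]]
print("MRR:   ", mrr(retrievals))
print("NDCG@5:", np.mean([ndcg_at_k(r, 5) for r in retrievals]))
```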
-
I spoke recently with a team who'd been iterating on their AI model for months but were struggling to make progress. They'd explored a range of approaches, yet couldn't confidently say what was helping and what wasn't. It turned out the challenge wasn't in their modelling or engineering skills. It was their evaluation framework. Without a clear and consistent way to assess results, they were left guessing what to try next. This is something I see often. When evaluation isn't quick and easy, progress stalls.

Here are a few simple practices I've found make all the difference in getting models production-ready:

🔎 Snapshot your test sets: If you want to measure genuine progress over time, your comparisons need to be fair even as you're collecting more data. Shifting baselines obscure what's working. Snapshot your test sets so you can always compare like with like.

🔎 Prioritise fast feedback: Evaluation should be quick to run. Ideally in minutes, not hours. The shorter the gap between trying something and seeing how it performed, the more iterations you can make and the better your outcomes will be.

🔎 Invest in error analysis: While metrics give you the headline, error analysis reveals the story. Build tools that let you explore what went wrong - visualisations, dashboards or even simple logs. This is often where the real insight lies.

Evaluation isn't just a checkpoint at the end. It's a core part of building effective systems.

I work with AI leaders to embed sustainable, practical data practices. If you're looking to strengthen your team's approach, get in touch for a free 30-minute session.

#ArtificialIntelligence #MachineLearning #MLOps #Evaluation
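A minimal sketch of the "snapshot your test sets" and "error analysis" ideas, assuming your examples live in a pandas DataFrame; the file name and column names are illustrative placeholders:

```python
import hashlib
import pandas as pd

def snapshot_test_set(df: pd.DataFrame, path: str) -> str:
    """Freeze the current test split to disk and return a content hash,
    so every future run can verify it scores against the same examples."""
    frozen = df.sort_values("example_id").reset_index(drop=True)
    frozen.to_csv(path, index=False)
    return hashlib.sha256(frozen.to_csv(index=False).encode()).hexdigest()[:12]

def error_table(df: pd.DataFrame) -> pd.DataFrame:
    """Per-example error log: the rows you actually inspect during error analysis."""
    df = df.assign(correct=df["label"] == df["prediction"])
    return df[~df["correct"]].sort_values("confidence", ascending=False)

# Usage sketch with toy data
test_df = pd.DataFrame({
    "example_id": [1, 2, 3],
    "text": ["...", "...", "..."],
    "label": [1, 0, 1],
    "prediction": [1, 1, 0],
    "confidence": [0.9, 0.8, 0.55],
})
print("test set hash:", snapshot_test_set(test_df, "test_v1.csv"))
print(error_table(test_df))
```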
-
As we scale GenAI from demos to real-world deployment, one thing becomes clear: 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 𝗱𝗮𝘁𝗮𝘀𝗲𝘁𝘀 𝗰𝗮𝗻 𝗺𝗮𝗸𝗲 𝗼𝗿 𝗯𝗿𝗲𝗮𝗸 𝗮 𝗚𝗲𝗻𝗔𝗜 𝘀𝘆𝘀𝘁𝗲𝗺.

A model can be trained on massive amounts of data, but that doesn't guarantee it understands context, nuance, or intent at inference time. You can teach a student all the textbook theory in the world. But unless you ask the right questions, in the right setting, under realistic pressure, you'll never know what they truly grasp.

This snapshot outlines the 6 dataset types that AI teams use to rigorously evaluate systems at every stage of maturity:

The Evaluation Spectrum

1. 𝐐𝐮𝐚𝐥𝐢𝐟𝐢𝐞𝐝 𝐚𝐧𝐬𝐰𝐞𝐫𝐬
Meaning: Expert-reviewed responses
Use: Measure answer quality (groundedness, coherence, etc.)
Goal: High-quality, human-like responses

2. 𝐒𝐲𝐧𝐭𝐡𝐞𝐭𝐢𝐜
Meaning: AI-generated questions and answers
Use: Test scale and performance
Goal: Maximize response accuracy, retrieval quality, and tool use precision

3. 𝐀𝐝𝐯𝐞𝐫𝐬𝐚𝐫𝐢𝐚𝐥
Meaning: Malicious or risky prompts (e.g., jailbreaks)
Use: Ensure safety and resilience
Goal: Avoid unsafe outputs

4. 𝐎𝐎𝐃 (𝐎𝐮𝐭 𝐨𝐟 𝐃𝐨𝐦𝐚𝐢𝐧)
Meaning: Unusual or irrelevant topics
Use: See how well the model handles unfamiliar territory
Goal: Avoid giving irrelevant or misleading answers

5. 𝐓𝐡𝐮𝐦𝐛𝐬 𝐝𝐨𝐰𝐧
Meaning: Real examples where users rated answers poorly
Use: Identify failure modes
Goal: Internal review, error analysis

6. 𝐏𝐑𝐎𝐃
Meaning: Cleaned, real user queries from deployed systems
Use: Evaluate live performance
Goal: Ensure production response quality

This layered approach is essential for building:
• Trustworthy AI
• Measurable safety
• Meaningful user experience

Most organizations still rely on "accuracy-only" testing. But GenAI in production demands multi-dimensional evaluation spanning risk, relevance, and realism.

If you're deploying GenAI at scale, ask: are you testing the right things with the right datasets?

Let's sharpen the tools we use to measure intelligence. Because better testing = better AI.

👇 Would love to hear how you're designing your eval pipelines.

#genai #evaluation #llmops #promptengineering #aiarchitecture #openai
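One way to operationalize these six slices is to keep each as its own dataset and report per-slice scores rather than one blended number. A minimal sketch, where the slice names, file paths, and the `score_example` judge are hypothetical placeholders:

```python
import json
from collections import defaultdict

# Hypothetical slice -> file mapping; each JSONL file holds {"query": ..., "expected": ...} records.
EVAL_SLICES = {
    "qualified_answers": "evals/qualified.jsonl",
    "synthetic":         "evals/synthetic.jsonl",
    "adversarial":       "evals/adversarial.jsonl",
    "out_of_domain":     "evals/ood.jsonl",
    "thumbs_down":       "evals/thumbs_down.jsonl",
    "prod":              "evals/prod_sample.jsonl",
}

def score_example(example: dict) -> float:
    """Placeholder judge: in practice this would call your app plus an
    LLM judge or heuristic and return a score in [0, 1]."""
    raise NotImplementedError

def run_suite() -> dict[str, float]:
    results = defaultdict(list)
    for slice_name, path in EVAL_SLICES.items():
        with open(path) as f:
            for line in f:
                results[slice_name].append(score_example(json.loads(line)))
    # Report per-slice averages: a single blended number can hide regressions
    # on small but critical slices such as adversarial prompts.
    return {name: sum(scores) / len(scores) for name, scores in results.items()}
```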
-
AI models in medical imaging often boast high accuracy, but are we measuring what really matters?

1️⃣ Many AI models are judged using metrics that do not match clinical goals, like relying on AUROC (area under the receiver operating characteristic curve, which shows how well the model separates classes) in imbalanced datasets where rare but critical findings are overlooked.

2️⃣ A single metric such as accuracy or Dice can be misleading. Multiple, task-specific metrics are essential for a robust evaluation.

3️⃣ In classification, AUROC can stay high even if a model misses rare cases. AUPRC (area under the precision-recall curve, which focuses on the model's performance on the positive class) is more useful when positives are rare.

4️⃣ For regression, MAE (mean absolute error, the average size of prediction errors) and RMSE (root mean squared error, which gives more weight to large errors) do not reflect how serious the errors are in real clinical settings.

5️⃣ In survival analysis, the C-index (concordance index, which measures how well predicted risks match actual outcomes) and time-dependent AUCs (area under the curve at specific time points) each reflect different things. Using the wrong one can mislead.

6️⃣ Detection models need precision-recall metrics like mAP (mean average precision, which combines detection quality and location accuracy) or FROC (free-response receiver operating characteristic, which shows sensitivity versus false positives per image). Accuracy is not useful here.

7️⃣ Segmentation metrics like Dice (which measures the overlap between predicted and true regions) and IoU (intersection over union, the overlap divided by the total area) can miss small but important errors. Visual review is often needed.

8️⃣ Calibration means checking if predicted risks match observed outcomes. ECE (expected calibration error, the average gap between predicted and actual risks) and the Brier score (the mean squared difference between predicted probability and actual outcome) help assess this.

9️⃣ Foundation models need extra checks: generalization (how well they perform across tasks), label efficiency (how few labeled examples they need), and alignment across inputs and outputs. Zero-shot means no examples were given before testing. Few-shot means only a few examples were used.

🔟 Metrics must fit the clinical context. A small error in one use case may be acceptable, but the same error could be dangerous in another.

✍🏻 Burak Kocak, Michail Klontzas, MD, PhD, Arnaldo Stanzione, Aymen Meddeb MD, EBIR, Aydin Demircioglu, Christian Bluethgen, Keno Bressem, Lorenzo Ugga, Nate Mercaldo, Oliver Diaz, Renato Cuocolo. Evaluation metrics in medical imaging AI: fundamentals, pitfalls, misapplications, and recommendations. European Journal of Radiology Artificial Intelligence. 2025. DOI: 10.1016/j.ejrai.2025.100030
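To make points 3, 7, and 8 concrete, here is a small scikit-learn/NumPy sketch on toy, illustrative data: AUROC vs. AUPRC on an imbalanced label set, the Brier score as a calibration-style check, and Dice/IoU overlap for binary masks:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

# Imbalanced toy set: 2 positives (rare findings) among 20 cases, with model scores.
y_true = np.array([0] * 18 + [1, 1])
y_score = np.array([0.05, 0.1, 0.1, 0.15, 0.2, 0.2, 0.25, 0.3, 0.3, 0.35,
                    0.4, 0.4, 0.45, 0.5, 0.5, 0.55, 0.6, 0.65, 0.9, 0.35])

print("AUROC:", roc_auc_score(y_true, y_score))            # ranking over all pos/neg pairs
print("AUPRC:", average_precision_score(y_true, y_score))  # focuses on the rare positive class
print("Brier:", brier_score_loss(y_true, y_score))         # mean squared gap between prob and outcome

# Segmentation overlap on toy binary masks: Dice and IoU.
pred_mask = np.array([[1, 1, 0], [0, 1, 0]])
true_mask = np.array([[1, 0, 0], [0, 1, 1]])
inter = np.logical_and(pred_mask, true_mask).sum()
union = np.logical_or(pred_mask, true_mask).sum()
print("Dice:", 2 * inter / (pred_mask.sum() + true_mask.sum()))
print("IoU: ", inter / union)
```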
-
AI Evaluation Frameworks

As AI systems evolve, one major challenge remains: how do we measure their performance accurately? This is where the concept of "AI Judges" comes in, from LLMs to autonomous agents and even humans. Here is how each type of judge works:

1. LLM-as-a-Judge
- An LLM acts as an evaluator, comparing answers or outputs from different models and deciding which one is better.
- It focuses on text-based reasoning and correctness - great for language tasks, but limited in scope.
- Key Insight: LLMs cannot run code or verify real-world outcomes. They are best suited for conversational or reasoning-based evaluations.

2. Agent-as-a-Judge
- An autonomous agent takes evaluation to the next level.
- It can execute code, perform tasks, measure accuracy, and assess efficiency, just like a real user or system would.
- Key Insight: This allows for scalable, automated, and realistic testing, making it ideal for evaluating AI agents and workflows in action.

3. Human-as-a-Judge
- Humans manually test and observe agents to determine which performs better.
- They offer detailed and accurate assessments, but the process is slow and hard to scale.
- Key Insight: While humans remain the gold standard for nuanced judgment, agent-based evaluation is emerging as the scalable replacement for repetitive testing.

The future of AI evaluation is shifting from static text comparisons (LLM) to dynamic, real-world testing (Agent). Humans will still guide the process, but AI agents will soon take over most of the judging work.

If you are building or testing AI systems, start adopting Agent-as-a-Judge methods. They will help you evaluate performance faster, more accurately, and at scale.
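A minimal pairwise LLM-as-a-Judge sketch; `call_llm` is a hypothetical stand-in for whatever chat-completion client you use, and the prompt and verdict format are illustrative rather than a fixed standard:

```python
JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
decide which answer is more helpful, correct, and grounded.
Reply with exactly one token: "A", "B", or "TIE".

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}
"""

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your LLM provider's chat API."""
    raise NotImplementedError

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    # Run the comparison twice with the answers swapped to reduce position bias,
    # and only accept a verdict when both orderings agree.
    first = call_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b)).strip()
    second = call_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_b, answer_b=answer_a)).strip()
    swapped = {"A": "B", "B": "A", "TIE": "TIE"}.get(second, "TIE")
    return first if first == swapped else "TIE"
```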
-
One topic that continues to surface in my conversations with AI builders is how to determine whether your AI-powered product is doing the right thing. In the Reforge 𝗔𝗜 𝗙𝗼𝘂𝗻𝗱𝗮𝘁𝗶𝗼𝗻𝘀 course we cover this in the third part of the BUILD Framework created by Brian Balfour: how do we Improve our upgraded models? To dive deep on this I was joined by Laura Burkhauser, VP Product at Descript. We discussed:

𝐇𝐮𝐦𝐚𝐧 𝐅𝐞𝐞𝐝𝐛𝐚𝐜𝐤: How humans help improve our models. Useful when you need to determine what "good" looks like for a specific use case.

𝐅𝐢𝐧𝐞-𝐭𝐮𝐧𝐢𝐧𝐠: How we systematically adapt models. You'd leverage this when the base model's responses don't match your specific needs.

𝐑𝐞𝐢𝐧𝐟𝐨𝐫𝐜𝐞𝐦𝐞𝐧𝐭 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠: How models can learn autonomously. Useful when you want the model to improve through ongoing interactions.

But the BEST tool in the PM toolkit? 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧𝐬.

𝐓𝐡𝐞𝐫𝐞 𝐚𝐫𝐞 𝐟𝐢𝐯𝐞 𝐩𝐫𝐢𝐦𝐚𝐫𝐲 𝐬𝐭𝐞𝐩𝐬 𝐭𝐨 𝐞𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧𝐬:
1. Define your use case and goal - what problem does this AI feature solve? What makes a "good" output? What are the risks to avoid? What are my user expectations?
2. Build your evaluation dataset - what data do we need to collect to run our evaluations?
3. Choose your eval metrics - what metrics reflect user value? How do we measure quality? What's acceptable?
4. Choose a judging methodology - humans or AI judges (or both)?
5. Analyze your evals - do we create a synthetic score? Are there measures where we are willing to accept a lower-performing result?

In the beginning, the team at Descript leveraged an employee who also happened to be a semi-professional cellist as a human evaluator. They then had to codify his evaluation into a set of criteria and metrics, prioritize them, and train others on how to evaluate against those criteria.

Her case for why PMs should be writing evaluations: we often know our customers and our domain the best. Her process starts with writing a memo to an intern, then turning that into an eval template.

𝐇𝐞𝐫 𝐛𝐢𝐠𝐠𝐞𝐬𝐭 𝐭𝐚𝐤𝐞𝐚𝐰𝐚𝐲𝐬 𝐟𝐫𝐨𝐦 𝐰𝐨𝐫𝐤𝐢𝐧𝐠 𝐢𝐧 𝐀𝐈 𝐟𝐨𝐫 𝐬𝐞𝐯𝐞𝐫𝐚𝐥 𝐲𝐞𝐚𝐫𝐬:
1️⃣ Evals should have priority use cases and prioritized criteria
2️⃣ Data does *not* tell the complete story; don't discount 'vibes' in the beginning
3️⃣ PMs should take the first pass at the eval; we know our customers; don't delegate this
4️⃣ PMs should run the evals the first few times along with QA

Thanks again to Laura for joining me; I think this quote from an attendee summed up our conversation best: "Laura was a great speaker. The energy and excitement she brought were engaging. As someone who is very new to AI concepts, the way she spoke to the concepts through the lens of the customer/user and the user problem they were solving was much easier for me to follow and tie together."

If you want to level up your knowledge of how AI works, check out Reforge.
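For step 5, a tiny sketch of one way to combine prioritized criteria into a single synthetic score; the criteria, weights, and acceptance threshold below are made up for illustration, not Descript's or Reforge's rubric:

```python
# Hypothetical rubric: weights encode which criteria matter most for this use case.
RUBRIC = {
    "accuracy":     0.4,
    "groundedness": 0.3,
    "tone":         0.2,
    "formatting":   0.1,
}
ACCEPTANCE_THRESHOLD = 0.75  # illustrative bar for "shippable"

def synthetic_score(criterion_scores: dict[str, float]) -> float:
    """Weighted average of per-criterion scores, each in [0, 1]."""
    return sum(RUBRIC[name] * criterion_scores[name] for name in RUBRIC)

example = {"accuracy": 0.9, "groundedness": 0.8, "tone": 0.7, "formatting": 1.0}
score = synthetic_score(example)
print(round(score, 2), "pass" if score >= ACCEPTANCE_THRESHOLD else "fail")
```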
-
I've spent countless hours building and evaluating AI systems. This is the 3-part evaluation roadmap I wish I had on day one.

Evaluating an LLM system isn't one task. It's about measuring the performance of each component in the pipeline. You don't just test "the AI"; you test the retrieval, the generation, and the overall agentic workflow.

𝗣𝗮𝗿𝘁 𝟭: 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗻𝗴 𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 (𝗧𝗵𝗲 𝗥𝗔𝗚 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲)
Your system is only as good as the context it retrieves.
𝗞𝗲𝘆 𝗠𝗲𝘁𝗿𝗶𝗰𝘀:
↳ 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗣𝗿𝗲𝗰𝗶𝘀𝗶𝗼𝗻: How much of the retrieved context is actually relevant vs. noise?
↳ 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗥𝗲𝗰𝗮𝗹𝗹: Did you retrieve all the necessary information to answer the query?
↳ 𝗡𝗗𝗖𝗚: How high up in the retrieved list are the most relevant documents?
𝗞𝗲𝘆 𝗥𝗲𝘀𝗼𝘂𝗿𝗰𝗲𝘀:
↳ 𝗙𝗿𝗮𝗺𝗲𝘄𝗼𝗿𝗸: RAGAs Framework (Repo) https://lnkd.in/gAPdCRzh
↳ 𝗣𝗮𝗽𝗲𝗿: RAGAs Paper https://lnkd.in/gUKVe4ac

𝗣𝗮𝗿𝘁 𝟮: 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗻𝗴 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻 (𝗧𝗵𝗲 𝗟𝗟𝗠'𝘀 𝗥𝗲𝘀𝗽𝗼𝗻𝘀𝗲)
Once you have the context, how good is the model's actual output?
𝗞𝗲𝘆 𝗠𝗲𝘁𝗿𝗶𝗰𝘀:
↳ 𝗙𝗮𝗶𝘁𝗵𝗳𝘂𝗹𝗻𝗲𝘀𝘀: Does the answer stay grounded in the provided context, or does it start to hallucinate?
↳ 𝗥𝗲𝗹𝗲𝘃𝗮𝗻𝗰𝗲: Is the answer directly addressing the user's original prompt?
↳ 𝗜𝗻𝘀𝘁𝗿𝘂𝗰𝘁𝗶𝗼𝗻 𝗙𝗼𝗹𝗹𝗼𝘄𝗶𝗻𝗴: Did the model adhere to the output format you requested?
𝗞𝗲𝘆 𝗥𝗲𝘀𝗼𝘂𝗿𝗰𝗲𝘀:
↳ 𝗧𝗲𝗰𝗵𝗻𝗶𝗾𝘂𝗲: LLM-as-Judge Paper https://lnkd.in/gyhaU5CC
↳ 𝗙𝗿𝗮𝗺𝗲𝘄𝗼𝗿𝗸𝘀: OpenAI Evals & LangChain Evals https://lnkd.in/g9rjmfGS https://lnkd.in/gmJt7ZBa

𝗣𝗮𝗿𝘁 𝟯: 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗻𝗴 𝘁𝗵𝗲 𝗔𝗴𝗲𝗻𝘁 (𝗧𝗵𝗲 𝗘𝗻𝗱-𝘁𝗼-𝗘𝗻𝗱 𝗦𝘆𝘀𝘁𝗲𝗺)
Does the system actually accomplish the task from start to finish?
𝗞𝗲𝘆 𝗠𝗲𝘁𝗿𝗶𝗰𝘀:
↳ 𝗧𝗮𝘀𝗸 𝗖𝗼𝗺𝗽𝗹𝗲𝘁𝗶𝗼𝗻 𝗥𝗮𝘁𝗲: Did the agent successfully achieve its final goal? This is your north star.
↳ 𝗧𝗼𝗼𝗹 𝗨𝘀𝗮𝗴𝗲 𝗔𝗰𝗰𝘂𝗿𝗮𝗰𝘆: Did it call the correct tools with the correct arguments?
↳ 𝗖𝗼𝘀𝘁/𝗟𝗮𝘁𝗲𝗻𝗰𝘆 𝗽𝗲𝗿 𝗧𝗮𝘀𝗸: How many tokens and how much time did it take to complete the task?
𝗞𝗲𝘆 𝗥𝗲𝘀𝗼𝘂𝗿𝗰𝗲𝘀:
↳ 𝗚𝗼𝗼𝗴𝗹𝗲'𝘀 𝗔𝗗𝗞 𝗗𝗼𝗰𝘀: https://lnkd.in/g2TpCWsq
↳ 𝗗𝗲𝗲𝗽𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴(.)𝗔𝗜 𝗔𝗴𝗲𝗻𝘁𝘀 𝗘𝘃𝗮𝗹 𝗖𝗼𝘂𝗿𝘀𝗲: https://lnkd.in/gcY8WyjV

Stop testing your AI like a monolith. Start evaluating the components like a systems engineer. That's how you build systems that you can actually trust.

Save this roadmap. What's the hardest part of your current eval pipeline?

♻️ Repost this to help your network build better systems.
➕ Follow Shivani Virdi for more.
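For Part 3, here is a minimal sketch of computing task completion rate, tool-call accuracy, and cost/latency from logged agent runs; the trace schema is a made-up example, not any particular framework's format:

```python
from dataclasses import dataclass

@dataclass
class AgentTrace:
    """Hypothetical logged trace for one end-to-end agent run."""
    task_completed: bool
    expected_tool_calls: list[tuple]   # (tool_name, arguments) pairs
    actual_tool_calls: list[tuple]
    total_tokens: int
    latency_s: float

def summarize(traces: list[AgentTrace]) -> dict[str, float]:
    completion = sum(t.task_completed for t in traces) / len(traces)
    correct_calls = sum(
        call in t.expected_tool_calls
        for t in traces for call in t.actual_tool_calls
    )
    total_calls = sum(len(t.actual_tool_calls) for t in traces) or 1
    return {
        "task_completion_rate": completion,
        "tool_call_accuracy": correct_calls / total_calls,
        "avg_tokens_per_task": sum(t.total_tokens for t in traces) / len(traces),
        "avg_latency_s": sum(t.latency_s for t in traces) / len(traces),
    }

# Usage sketch with two toy traces
traces = [
    AgentTrace(True, [("search", {"q": "refund policy"})], [("search", {"q": "refund policy"})], 1200, 3.1),
    AgentTrace(False, [("lookup_order", {"id": 7})], [("search", {"q": "order 7"})], 2400, 5.8),
]
print(summarize(traces))
```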
-
4 ways to test ML models in production.

Despite rigorously testing an ML model locally (on validation and test sets), it could be a terrible idea to instantly replace the previous model with the new model. A more reliable strategy is to test the model in production (yes, on real-world incoming data). While this might sound risky, ML teams do it all the time, and it isn't that complicated. Here are 4 common strategies to do so.

Some terminology:
- The current model is called the legacy model.
- The new model is called the candidate model.

1) A/B testing
- Distribute the incoming requests non-uniformly between the legacy model and the candidate model.
- Intentionally limit the exposure of the candidate model to avoid any potential risks. Thus, the number of requests sent to the candidate model must be low.

2) Canary testing
- In A/B testing, since traffic is randomly redirected to either model irrespective of the user, it can potentially affect all users.
- In canary testing, the candidate model is released to a small subset of users in production and gradually rolled out to more users.

3) Interleaved testing
- This involves mixing the predictions of multiple models in the response.
- Consider Amazon's recommendation engine. In interleaved deployments, some product recommendations displayed on the homepage can come from the legacy model, while others are produced by the candidate model.

4) Shadow testing
- All of the above techniques affect some (or all) users.
- Shadow testing (or dark launches) lets us test a new model in a production environment without affecting the user experience.
- The candidate model is deployed alongside the existing legacy model and serves requests like the legacy model. However, its output is not sent back to the user. Instead, it is logged for later use to benchmark the candidate's performance against the legacy model.
- We explicitly deploy the candidate model instead of testing offline because the production environment is difficult to replicate offline.
- Shadow testing offers risk-free testing of the candidate model in a production environment.

👉 Over to you: What are some ways to test models in production?
____
Find me → Avi Chawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
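A minimal sketch of the shadow-testing pattern: the user always receives the legacy model's answer, while the candidate's answer is produced in the background and logged for offline comparison. The model functions and log format are illustrative placeholders:

```python
import json
import threading
import time

def legacy_predict(features: dict) -> float:
    """Placeholder for the currently deployed (legacy) model."""
    return 0.42

def candidate_predict(features: dict) -> float:
    """Placeholder for the new (candidate) model under evaluation."""
    return 0.37

def log_shadow(record: dict, path: str = "shadow_log.jsonl") -> None:
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def handle_request(features: dict) -> float:
    # The user-facing response always comes from the legacy model.
    start = time.time()
    legacy_out = legacy_predict(features)
    legacy_latency = time.time() - start

    def shadow():
        # Candidate runs in the background; its output never reaches the user.
        cand_start = time.time()
        cand_out = candidate_predict(features)
        log_shadow({
            "features": features,
            "legacy": legacy_out,
            "candidate": cand_out,
            "legacy_latency_s": round(legacy_latency, 4),
            "candidate_latency_s": round(time.time() - cand_start, 4),
        })

    threading.Thread(target=shadow, daemon=True).start()
    return legacy_out

print(handle_request({"user_id": 123, "basket_size": 3}))
```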
-
Day 11/30 of SLMs/LLMs - Evaluating LLMs

So how do we evaluate language models?

Perplexity is one of the oldest and most fundamental metrics for evaluating language models. It measures how surprised the model is by real text. Mathematically, it's the exponential of the model's average negative log-likelihood over the dataset, but in plain English, lower perplexity means the model is more confident and consistent in predicting words. For example: a model trained on English will have low perplexity on Wikipedia articles. Feed it a paragraph of Mandarin, and its perplexity will spike, because it's "perplexed." Perplexity is great for pretraining evaluation but doesn't always align perfectly with human judgments of text quality, especially for open-ended generation.

BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) are classic metrics for text generation tasks like translation and summarization. BLEU measures precision, i.e. how much of the generated text overlaps with the reference. ROUGE focuses on recall, i.e. how much of the reference text the model managed to capture. Imagine you ask two models to summarize a paragraph. Model A produces a summary using slightly different words but captures all the meaning. Model B copies exact phrases from the original. BLEU might prefer Model B, while ROUGE could favor Model A. That's why we often use both. They complement each other.

Now, as we step into LLM territory we have a few additional metrics, for instance using an LLM as judge to evaluate hallucinations. More on this in this week's interview questions.

Finally, when numbers fall short, we turn to humans. Human evaluation assesses outputs on qualities like coherence, relevance, factuality, and creativity. In practice, researchers often combine automated metrics (for scale) with human evals (for insight). For example, OpenAI's alignment work relies heavily on reinforcement learning from human feedback (RLHF), because no metric can fully capture "helpful" or "truthful" yet.

For more of an in-depth review, check out this blog from Evidently AI: https://lnkd.in/gbnYD8N6

Evaluating language models is like judging a musician:
👉 Perplexity tells you how well they stay on beat.
👉 BLEU and ROUGE check if they hit the right notes.
👉 Human evaluation decides if it moved the audience.

Tune in tomorrow for more SLM/LLMs deep dives.
--
🚶➡️ To learn more about LLMs/SLMs, follow me - Karun!
♻️ Share so others can learn, and you can build your LinkedIn presence!
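A minimal sketch of the perplexity definition (the exponential of the mean negative log-likelihood), computed from the probabilities a model assigned to the actual next tokens; the probabilities below are made-up illustrative values:

```python
import math

# Probabilities the model assigned to each actual next token in a held-out text.
token_probs = [0.25, 0.10, 0.60, 0.05, 0.30]

# Average negative log-likelihood per token, then exponentiate.
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)

print(f"avg NLL: {avg_nll:.3f}, perplexity: {perplexity:.2f}")
# Intuition: a perplexity of roughly k means the model is, on average,
# about as uncertain as choosing uniformly among k tokens.
```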