New Google paper challenges how we measure LLM reasoning. Token count is a poor proxy for actual reasoning quality. There might be a better way to measure this. This work introduces "deep-thinking tokens," a metric that identifies tokens where internal model predictions shift significantly across deeper layers before stabilizing. These tokens capture "genuine reasoning" effort rather than verbose output. Instead of measuring how much a model writes, measure how hard it's actually thinking at each step. Deep-thinking tokens are identified by tracking prediction instability across transformer layers during inference. The ratio of deep-thinking tokens correlates more reliably with accuracy than token count or confidence metrics across mathematical and scientific benchmarks (AIME 24/25, HMMT 25, GPQA-diamond), tested on DeepSeek-R1, Qwen3, and GPT-OSS. They also introduce Think@n, a test-time compute strategy that prioritizes samples with high deep-thinking ratios while early-rejecting low-quality partial outputs, reducing cost without sacrificing performance. Why does it matter? As inference-time scaling becomes a primary lever for improving model performance, we need better signals than token length to understand when a model is actually reasoning versus just rambling.
Improving LLM Reliability Using Internal Metrics
Explore top LinkedIn content from expert professionals.
Summary
Improving LLM reliability using internal metrics means using specific data and measurements inside large language models (LLMs) to judge how well they handle tasks, rather than relying on surface-level outputs. By tracking factors like deep-thinking tokens and prompt sensitivity, developers can spot issues early and build more trustworthy and consistent AI systems.
- Measure real reasoning: Use internal metrics such as deep-thinking tokens to detect when the model is genuinely processing information, instead of just generating lots of text.
- Track and monitor: Set up prompt monitoring and evaluation pipelines that capture key details about system behavior, helping you identify where and why errors happen.
- Test for robustness: Regularly run custom evaluations and try different prompt styles to ensure reliable outputs across varied tasks and real-world scenarios.
-
-
LLM systems don’t fail silently. They fail invisibly. No trace, no metrics, no alerts - just wrong answers and confused users. That’s why we architected a complete observability pipeline in the Second Brain AI Assistant course. Powered by Opik from Comet, it covers two key layers: 𝟭. 𝗣𝗿𝗼𝗺𝗽𝘁 𝗠𝗼𝗻𝗶𝘁𝗼𝗿𝗶𝗻𝗴 → Tracks full prompt traces (inputs, outputs, system prompts, latencies) → Visualizes chain execution flows and step-level timing → Captures metadata like model IDs, retrieval config, prompt templates, token count, and costs Latency metrics like: Time to First Token (TTFT) Tokens per Second (TPS) Total response time ...are logged and analyzed across stages (pre-gen, gen, post-gen). So when your agent misbehaves, you can see exactly where and why. 𝟮. 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 𝗳𝗼𝗿 𝗔𝗴𝗲𝗻𝘁𝗶𝗰 𝗥𝗔𝗚 → Runs automated tests on the agent’s responses → Uses LLM judges + custom heuristics (hallucination, relevance, structure) → Works offline (during dev) and post-deployment (on real prod samples) → Fully CI/CD-ready with performance alerts and eval dashboards It’s like integration testing, but for your RAG + agent stack. The best part? → You can compare multiple versions side-by-side → Run scheduled eval jobs on live data → Catch quality regressions before your users do This is Lesson 6 of the course (and it might be the most important one). Because if your system can’t measure itself, it can’t improve. 🔗 Full breakdown here: https://lnkd.in/dA465E_J
-
We've all shipped an LLM feature that "felt right" in dev, only to watch it break in production. Why? Because human "eyeballing" isn't a scalable evaluation strategy. The real challenge in building robust AI isn't just getting an LLM to generate an output. It’s ensuring the output is 𝐫𝐢𝐠𝐡𝐭, 𝐬𝐚𝐟𝐞, 𝐟𝐨𝐫𝐦𝐚𝐭𝐭𝐞𝐝, 𝐚𝐧𝐝 𝐮𝐬𝐞𝐟𝐮𝐥, consistently, across thousands of diverse user inputs. This is where 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧 𝐌𝐞𝐭𝐫𝐢𝐜𝐬 become non-negotiable. Think of them as the sophisticated unit tests and integration tests for your LLM's brain. You need to move beyond "does it work?" to "how well does it work, and why?" This is precisely what Comet's 𝐎𝐩𝐢𝐤 is designed for. It provides the framework to rigorously grade your LLM's performance, turning subjective feelings into objective data. Here's how we approach it, as shown in the cheat sheet below: 1./ Heuristic Metrics => the 'Linters' & 'Unit Tests' - These are your non-negotiable, deterministic sanity checks. - They are low-cost, fast, and catch objective failures. - Your pipeline should fail here first. ▫️Is it valid? → IsJson, RegexMatch ▫️Is it faithful? → Contains, Equals ▫️Is it close? → Levenshtein 2./ LLM-as-a-Judge => the 'Peer Review' - This is for everything that "looks right" but might be subtly wrong. - These metrics evaluate quality and nuance where statistical rules fail. - They answer the hard, subjective questions. ▫️Is it true? → Hallucination ▫️Is it relevant? → AnswerRelevance ▫️Is it helpful? → Usefulness 3./ G-Eval => the dynamic 'Judge-Builder' - G-Eval is a task-agnostic LLM-as-a-Judge. - You define custom evaluation criteria in plain English (e.g., "Is the tone professional but not robotic?"). - It then uses Chain-of-Thought reasoning internally to analyze the output and produce a human-aligned score for those criteria. - This allows you to test specific business logic without writing new code. 4./ Custom Metrics - For everything else. - This is where you write your own Python code to create a metric. - It’s for when you need to check an output against a live internal API, a proprietary database, or any other logic that only your system knows. Take a look at the cheat sheet for a quick breakdown. Which metric are you implementing first for your current LLM project? ♻️ Don't forget to repost.
-
🚨 Public Service Announcement: If you're building LLM-based applications for internal business use, especially for high-risk functions this is for you. Define Context Clearly ------------------------ 📋 Document the purpose, expected behavior, and users of the LLM system. 🚩 Note any undesirable or unacceptable behaviors upfront. Conduct a Risk Assessment ---------------------------- 🔍 Identify potential risks tied to the LLM (e.g., misinformation, bias, toxic outputs, etc), and be as specific as possible 📊 Categorize risks by impact on stakeholders or organizational goals. Implement a Test Suite ------------------------ 🧪 Ensure evaluations include relevant test cases for the expected use. ⚖️ Use benchmarks but complement them with tests tailored to your business needs. Monitor Risk Coverage ----------------------- 📈 Verify that test inputs reflect real-world usage and potential high-risk scenarios. 🚧 Address gaps in test coverage promptly. Test for Robustness --------------------- 🛡 Evaluate performance on varied inputs, ensuring consistent and accurate outputs. 🗣 Incorporate feedback from real users and subject matter experts. Document Everything ---------------------- 📑 Track risk assessments, test methods, thresholds, and results. ✅ Justify metrics and thresholds to enable accountability and traceability. #psa #llm #testingandevaluation #responsibleAI #AIGovernance Patrick Sullivan, Khoa Lam, Bryan Ilg, Jeffery Recker, Borhane Blili-Hamelin, PhD, Dr. Benjamin Lange, Dinah Rabe, Ali Hasan
-
Small variations in prompts can lead to very different LLM responses. Research that measures LLM prompt sensitivity uncovers what matters, and the strategies to get the best outcomes. A new framework for prompt sensitivity, ProSA, shows that response robustness increases with factors including higher model confidence, few-shot examples, and larger model size. Some strategies you should consider given these findings: 💡 Understand Prompt Sensitivity and Test Variability: LLMs can produce different responses with minor rephrasings of the same prompt. Testing multiple prompt versions is essential, as even small wording adjustments can significantly impact the outcome. Organizations may benefit from creating a library of proven prompts, noting which styles perform best for different types of queries. 🧩 Integrate Few-Shot Examples for Consistency: Including few-shot examples (demonstrative samples within prompts) enhances the stability of responses, especially in larger models. For complex or high-priority tasks, adding a few-shot structure can reduce prompt sensitivity. Standardizing few-shot examples in key prompts across the organization helps ensure consistent output. 🧠 Match Prompt Style to Task Complexity: Different tasks benefit from different prompt strategies. Knowledge-based tasks like basic Q&A are generally less sensitive to prompt variations than complex, reasoning-heavy tasks, such as coding or creative requests. For these complex tasks, using structured, example-rich prompts can improve response reliability. 📈 Use Decoding Confidence as a Quality Check: High decoding confidence—the model’s level of certainty in its responses—indicates robustness against prompt variations. Organizations can track confidence scores to flag low-confidence responses and identify prompts that might need adjustment, enhancing the overall quality of outputs. 📜 Standardize Prompt Templates for Reliability: Simple, standardized templates reduce prompt sensitivity across users and tasks. For frequent or critical applications, well-designed, straightforward prompt templates minimize variability in responses. Organizations should consider a “best-practices” prompt set that can be shared across teams to ensure reliable outcomes. 🔄 Regularly Review and Optimize Prompts: As LLMs evolve, so may prompt performance. Routine prompt evaluations help organizations adapt to model changes and maintain high-quality, reliable responses over time. Regularly revisiting and refining key prompts ensures they stay aligned with the latest LLM behavior. Link to paper in comments.
-
If you’ve ever shipped a GenAI model to production, you already know the real interview isn’t about transformers, it’s about everything that breaks the moment real users touch your system. 1) How would you evaluate an LLM powering a Q&A system? Approach: Don’t talk about accuracy alone. Break it down into: ✅ Functional metrics: exact match, F1, BLEU, ROUGE depending on task. ✅ Safety metrics: hallucination rate, refusal rate, PII leakage. ✅ User-facing metrics: latency, token cost, answer completeness. ✅ Human evaluation: rubric-based scoring from SMEs when answers aren’t deterministic. ✅ A/B tests: compare model variants on real user flows. 2) How do you handle hallucinations in production? Approach: ✅ Show you understand layered mitigation: ✅ Retrieval first (RAG) to ground the model. ✅ Constrain the prompt: citations, “answer only from provided context,” JSON schemas. ✅ Post-generation validation like fact-checking rules or context-overlap checks. ✅ Fall-back behaviors when confidence is low: ask for clarification, return source snippets, route to human. 3) You’re asked to improve retrieval quality in a RAG pipeline. What do you check first? Approach: Walk through a debugging flow: ✅ Check document chunking (size, overlap, boundaries). ✅ Evaluate embedding model suitability for domain. ✅ Inspect vector store configuration (HNSW params, top_k). ✅ Run retrieval diagnostics: is the top_k relevant to the question? ✅ Add metadata filters or rerankers (cross-encoder, ColBERT-style scoring). 4) How do you monitor a GenAI system after deployment? Approach: ✅ Make it clear that monitoring isn’t optional. ✅ Latency and cost per request. ✅ Token distribution shifts (prompt bloat). ✅ Hallucination drift from user conversations. ✅ Guardrail violations and safety triggers. ✅ Retrieval hit rate and query types. ✅ Feedback loops from thumbs up/down or human review. 5) How do you decide between fine-tuning and using RAG? Approach: ✅ Use a decision tree mentality: ✅ If the issue is knowledge freshness, go with RAG. ✅ If the issue is formatting/style, go with fine-tuning. ✅ If the model needs domain reasoning, consider fine-tuning or LoRA. ✅ If the data is large and structured, use RAG + reranking before touching training. Most interviews test what you know. GenAI interviews test what you’ve survived. Follow Sneha Vijaykumar for more... 😊 #genai #datascience #rag #production #interview #questions #careergrowth #prep
-
Evaluating LLMs is hard. Evaluating agents is even harder. This is one of the most common challenges I see when teams move from using LLMs in isolation to deploying agents that act over time, use tools, interact with APIs, and coordinate across roles. These systems make a series of decisions, not just a single prediction. As a result, success or failure depends on more than whether the final answer is correct. Despite this, many teams still rely on basic task success metrics or manual reviews. Some build internal evaluation dashboards, but most of these efforts are narrowly scoped and miss the bigger picture. Observability tools exist, but they are not enough on their own. Google’s ADK telemetry provides traces of tool use and reasoning chains. LangSmith gives structured logging for LangChain-based workflows. Frameworks like CrewAI, AutoGen, and OpenAgents expose role-specific actions and memory updates. These are helpful for debugging, but they do not tell you how well the agent performed across dimensions like coordination, learning, or adaptability. Two recent research directions offer much-needed structure. One proposes breaking down agent evaluation into behavioral components like plan quality, adaptability, and inter-agent coordination. Another argues for longitudinal tracking, focusing on how agents evolve over time, whether they drift or stabilize, and whether they generalize or forget. If you are evaluating agents today, here are the most important criteria to measure: • 𝗧𝗮𝘀𝗸 𝘀𝘂𝗰𝗰𝗲𝘀𝘀: Did the agent complete the task, and was the outcome verifiable? • 𝗣𝗹𝗮𝗻 𝗾𝘂𝗮𝗹𝗶𝘁𝘆: Was the initial strategy reasonable and efficient? • 𝗔𝗱𝗮𝗽𝘁𝗮𝘁𝗶𝗼𝗻: Did the agent handle tool failures, retry intelligently, or escalate when needed? • 𝗠𝗲𝗺𝗼𝗿𝘆 𝘂𝘀𝗮𝗴𝗲: Was memory referenced meaningfully, or ignored? • 𝗖𝗼𝗼𝗿𝗱𝗶𝗻𝗮𝘁𝗶𝗼𝗻 (𝗳𝗼𝗿 𝗺𝘂𝗹𝘁𝗶-𝗮𝗴𝗲𝗻𝘁 𝘀𝘆𝘀𝘁𝗲𝗺𝘀): Did agents delegate, share information, and avoid redundancy? • 𝗦𝘁𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗼𝘃𝗲𝗿 𝘁𝗶𝗺𝗲: Did behavior remain consistent across runs or drift unpredictably? For adaptive agents or those in production, this becomes even more critical. Evaluation systems should be time-aware, tracking changes in behavior, error rates, and success patterns over time. Static accuracy alone will not explain why an agent performs well one day and fails the next. Structured evaluation is not just about dashboards. It is the foundation for improving agent design. Without clear signals, you cannot diagnose whether failure came from the LLM, the plan, the tool, or the orchestration logic. If your agents are planning, adapting, or coordinating across steps or roles, now is the time to move past simple correctness checks and build a robust, multi-dimensional evaluation framework. It is the only way to scale intelligent behavior with confidence.
-
LLM “critics” that monitor an agent and intervene mid-task are gaining momentum as a way to improve accuracy. The WRITER research team just published a new paper testing a key assumption: if a critic is accurate offline, intervention should help in deployment. Our main finding: offline accuracy isn’t enough. In our experiments, a simple binary LLM critic achieved ~0.94 AUROC at predicting failures — yet intervention could still harm end-to-end task success. In one case, success collapsed by 26 percentage points. In another, intervention had negligible impact. Why? Because intervention creates a tradeoff: → you recover some trajectories that would have failed → but you also disrupt some trajectories that would have succeeded The real question isn’t “Is the critic accurate?” It’s “When does intervention produce net gains vs. net disruption?” Our paper proposes a quick pre-rollout check — often requiring only ~50 tasks — that signals whether interventions will improve outcomes or introduce regressions for your specific workload. At WRITER, our agentic system powers complex work for some of the world's largest enterprises. We’re committed to doing (and sharing) the research needed to make those systems more accurate and reliable at scale Read the paper on Hugging Face: https://lnkd.in/gr62vh_K
-
If you're evaluating LLMs with off-the-shelf metrics, you might be measuring the wrong things. Generic scores like "helpfulness" or "hallucination rate" can create a false sense of confidence. They don't tell you what's actually breaking in your system—or why your users are frustrated. 💡 There's a better approach: open coding. It's a technique from qualitative research. Rather than starting with predefined categories, you review your traces and write open-ended notes about what you observe. You let the failure modes emerge from your data. A note might be: "The agent missed a clear opportunity to re-engage after the user mentioned price concerns." Or: "Asked for confirmation twice in a row before booking." These observations are specific, actionable, and grounded in how your product actually behaves. From there, you group similar notes into themes, count the frequencies, and suddenly you have a prioritized list of what to fix—derived from evidence, not assumptions. This isn't new. It's been central to ML error analysis for decades. But it's especially valuable now, when so many teams are shipping LLM features without a clear framework for understanding quality. Gergely Orosz and Hamel Husain published an excellent deep-dive on this workflow. If you're building AI products, it's worth your time. Arize AI Phoenix now supports this with a Span Notes API—add notes to traces programmatically and build annotation workflows into your evaluation process. https://lnkd.in/g2Sdc8vz https://lnkd.in/gRx_2tjV
-
🚨 Your RAG system works… but how do you prove it works? Most people build Retrieval-Augmented Generation (RAG) pipelines and stop at: “It sounds correct.” But in production, “sounds correct” is not a metric. If you're serious about building reliable AI systems, you need to measure two things: 1️⃣ Retrieval-Level Metrics Is your retriever actually finding the right information? • Recall@K – Did we fetch the correct chunk at all? • Precision@K – Are we retrieving useful content or junk? • Context Relevancy – How strongly is the retrieved context related to the query? High recall → system doesn’t miss important info High precision → cleaner context, less confusion for the LLM 2️⃣ Generation-Level Metrics Did the LLM use the retrieved context properly? • Faithfulness – Are all claims supported by the context? • Groundedness – Is the answer truly tied to the source documents? • Answer Relevancy – Does it directly answer the user’s question? Because a retriever can work perfectly… and the model can still hallucinate. 🎯 Why this matters: • Reduces hallucinations • Improves search quality • Helps compare different RAG architectures • Moves you from “it sounds right” → to measurable AI systems 🛠 Tools that help automate evaluation: • Ragas • TruLens • DeepEval • LangChain evaluation chains • LlamaIndex evaluation modules If you’re building AI agents, enterprise copilots, or production RAG systems — evaluation is not optional. It’s infrastructure. Save this post if you're working on RAG. Repost to help others build better AI systems. 🎓 Want a structured plan for AI interviews? My AI Interview Mastery Bundle (13 courses) prepares you for ML, GenAI, LLM, and Agentic AI roles with real interview questions and system thinking. Start preparing the right way. Enroll now using link below. https://lnkd.in/g5e-E9Qp #AI #GenerativeAI #RAG #MachineLearning #LLM #AIEngineering #ArtificialIntelligence
Explore categories
- Hospitality & Tourism
- Productivity
- Finance
- Soft Skills & Emotional Intelligence
- Project Management
- Education
- Technology
- Leadership
- Ecommerce
- User Experience
- Recruitment & HR
- Customer Experience
- Real Estate
- Marketing
- Sales
- Retail & Merchandising
- Science
- Supply Chain Management
- Future Of Work
- Consulting
- Writing
- Economics
- Employee Experience
- Healthcare
- Workplace Trends
- Fundraising
- Networking
- Corporate Social Responsibility
- Negotiation
- Communication
- Engineering
- Career
- Business Strategy
- Change Management
- Organizational Culture
- Design
- Innovation
- Event Planning
- Training & Development