How to Evaluate LLM Reasoning Abilities

Explore top LinkedIn content from expert professionals.

Summary

Evaluating large language model (LLM) reasoning abilities means figuring out how well these AI systems actually “think” and solve problems, beyond just generating text that looks correct. This involves looking at accuracy, faithfulness, and whether the model truly understands and reasons through complex tasks, not just how many words it produces or how confident it sounds.

  • Assess reasoning depth: Use specialized metrics like “deep-thinking tokens” or LLM-as-a-Judge approaches to measure genuine reasoning efforts instead of relying on token counts or surface-level correctness.
  • Track behavioral evolution: Consider how LLMs and AI agents adapt, coordinate, and retain knowledge over time by monitoring changes in their decision-making, memory use, and teamwork.
  • Build structured frameworks: Develop multi-dimensional evaluation systems that break down tasks into components like plan quality, adaptation, and stability, providing clearer signals on where reasoning strengths and weaknesses lie.
Summarized by AI based on LinkedIn member posts
  • Armand Ruiz

    building AI systems @meta

    206,810 followers

    Explaining the evaluation method LLM-as-a-Judge (LLMaaJ). Token-based metrics like BLEU or ROUGE are still useful for structured tasks like translation or summarization. But for open-ended answers, RAG copilots, or complex enterprise prompts, they often miss the bigger picture. That’s where LLMaaJ changes the game.

    𝗪𝗵𝗮𝘁 𝗶𝘀 𝗶𝘁? You use a powerful LLM as an evaluator, not a generator. It’s given:
    - The original question
    - The generated answer
    - The retrieved context or gold answer

    𝗧𝗵𝗲𝗻 𝗶𝘁 𝗮𝘀𝘀𝗲𝘀𝘀𝗲𝘀:
    ✅ Faithfulness to the source
    ✅ Factual accuracy
    ✅ Semantic alignment, even if phrased differently

    𝗪𝗵𝘆 𝘁𝗵𝗶𝘀 𝗺𝗮𝘁𝘁𝗲𝗿𝘀: LLMaaJ captures what traditional metrics can’t. It understands paraphrasing. It flags hallucinations. It mirrors human judgment, which is critical when deploying GenAI systems in the enterprise.

    𝗖𝗼𝗺𝗺𝗼𝗻 𝗟𝗟𝗠𝗮𝗮𝗝-𝗯𝗮𝘀𝗲𝗱 𝗺𝗲𝘁𝗿𝗶𝗰𝘀:
    - Answer correctness
    - Answer faithfulness
    - Coherence, tone, and even reasoning quality

    📌 If you’re building enterprise-grade copilots or RAG workflows, LLMaaJ is how you scale QA beyond manual reviews.

    To put LLMaaJ into practice, check out EvalAssist, a new tool from IBM Research. It offers a web-based UI to streamline LLM evaluations:
    - Refine your criteria iteratively using Unitxt
    - Generate structured evaluations
    - Export as Jupyter notebooks to scale effortlessly

    A powerful way to bring LLM-as-a-Judge into your QA stack.
    - Get Started guide: https://lnkd.in/g4QP3-Ue
    - Demo Site: https://lnkd.in/gUSrV65s
    - Github Repo: https://lnkd.in/gPVEQRtv
    - Whitepapers: https://lnkd.in/gnHi6SeW
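
    To make the recipe concrete, here is a minimal, library-agnostic sketch of an LLMaaJ call. This is not EvalAssist or Unitxt code: `call_llm` is a hypothetical placeholder for whichever client you use, and the criteria, scale, and JSON schema are illustrative choices, not a fixed standard.

```python
# A minimal LLM-as-a-Judge sketch. `call_llm` is a hypothetical stand-in for
# whatever client you use (hosted API, local model, ...); swap in your own.
import json

JUDGE_PROMPT = """You are an impartial evaluator.

Question: {question}
Retrieved context: {context}
Candidate answer: {answer}

Rate the candidate answer on a 1-5 scale for each criterion and explain briefly:
- faithfulness: is every claim supported by the retrieved context?
- correctness: does it actually answer the question?
- coherence: is it clear and well organized?

Respond with JSON only, e.g.
{{"faithfulness": 4, "correctness": 5, "coherence": 5, "rationale": "..."}}"""


def judge(question: str, context: str, answer: str, call_llm) -> dict:
    """Ask a judge model to score one answer against its retrieved context."""
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    raw = call_llm(prompt)   # judge model should ideally differ from the generator
    return json.loads(raw)   # in practice, validate and retry on malformed JSON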

  • Aishwarya Srinivasan
    627,982 followers

    If you’re building LLM applications today, reasoning is where the real leverage lies. And yet, I see a lot of engineers still treating LLM outputs as a single-shot black box. LLMs can reason, but only if you give them the right scaffolding and the right post-training. Here’s a mental model I’ve been using to think about LLM reasoning methods (see chart below):

    ✅ Inference-time reasoning methods: techniques you can apply at inference time, without retraining your model:
    → Tree of Thoughts (ToT): search through reasoning paths
    → Chain of Thought (CoT) prompting: prompt models to generate intermediate reasoning steps
    → Reasoning + Acting (ReAct): use tools or function calls during reasoning
    → Self-feedback: prompt the model to critique and refine its own output
    → Episodic Memory Agents: maintain a memory buffer to improve multi-step reasoning
    → Self-consistency: sample multiple reasoning paths and select the most consistent answer

    ✅ Training-time enhancements: where things get really powerful is when you post-train your model to improve reasoning, using human annotation or policy optimization:
    → Use preference pairs and reward models to tune for better reasoning (RFT, Proximal PO, KL regularization)
    → Apply RLHF, PPO + KL, rejection sampling + SFT, advantage estimation, and other advanced techniques to guide the model’s policy
    → Leverage multiple paths, offline trajectories, and expert demonstrations to expose the model to rich reasoning signals during training

    Here are my 2 cents 🫰 If you want production-grade LLM reasoning, you’ll need both:
    → Smart inference-time scaffolds to boost reasoning without slowing latency too much
    → Carefully tuned post-training loops to align the model’s policy with high-quality reasoning patterns
    → We’re also seeing increasing use of Direct Preference Optimization (DPO) and reference-free grading to further improve reasoning quality and stability.

    I’m seeing more and more teams combine both strategies, and the gap between "vanilla prompting" and "optimized reasoning loops" is only getting wider.

    〰️〰️〰️ Follow me (Aishwarya Srinivasan) for more AI insight and subscribe to my Substack to find more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg
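
    Of the inference-time methods above, self-consistency is the simplest to wire up: sample several chain-of-thought completions at a higher temperature and majority-vote the final answers. A rough sketch, assuming a hypothetical `sample_completion(prompt, temperature)` client and an answer format you control:

```python
# Self-consistency sketch: sample several reasoning paths, majority-vote the answer.
# `sample_completion` is a hypothetical helper wrapping your LLM client.
from collections import Counter

COT_SUFFIX = "\nLet's think step by step, then state the final answer after 'Answer:'."


def extract_final_answer(completion: str) -> str:
    """Pull out whatever follows 'Answer:'; adapt to your task's answer format."""
    return completion.rsplit("Answer:", 1)[-1].strip()


def self_consistent_answer(prompt: str, sample_completion, n: int = 5) -> str:
    """Sample n chain-of-thought completions and return the most common final answer."""
    answers = []
    for _ in range(n):
        completion = sample_completion(prompt + COT_SUFFIX, temperature=0.8)
        answers.append(extract_final_answer(completion))
    return Counter(answers).most_common(1)[0][0]
```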

  • Elvis S.

    Founder at DAIR.AI | Angel Investor | Advisor | Prev: Meta AI, Galactica LLM, Elastic, Ph.D. | Serving 7M+ learners around the world

    85,574 followers

    New Google paper challenges how we measure LLM reasoning. Token count is a poor proxy for actual reasoning quality. There might be a better way to measure this.

    This work introduces "deep-thinking tokens," a metric that identifies tokens where internal model predictions shift significantly across deeper layers before stabilizing. These tokens capture "genuine reasoning" effort rather than verbose output. Instead of measuring how much a model writes, measure how hard it's actually thinking at each step.

    Deep-thinking tokens are identified by tracking prediction instability across transformer layers during inference. The ratio of deep-thinking tokens correlates more reliably with accuracy than token count or confidence metrics across mathematical and scientific benchmarks (AIME 24/25, HMMT 25, GPQA-diamond), tested on DeepSeek-R1, Qwen3, and GPT-OSS.

    They also introduce Think@n, a test-time compute strategy that prioritizes samples with high deep-thinking ratios while early-rejecting low-quality partial outputs, reducing cost without sacrificing performance.

    Why does it matter? As inference-time scaling becomes a primary lever for improving model performance, we need better signals than token length to understand when a model is actually reasoning versus just rambling.
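
    The paper defines its metric precisely; treat the following only as an intuition-level approximation, not the paper's method or code. It is a logit-lens-style pass over GPT-2 (via Hugging Face transformers) that scores each position by how much the next-token distribution shifts between consecutive layers; the model choice, distance measure, and interpretation are all assumptions for illustration.

```python
# Rough "how much does the prediction shift across layers?" sketch, in the spirit
# of deep-thinking tokens but NOT the paper's exact metric.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The sum of the first 10 positive integers is"
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

layers = out.hidden_states[1:]                      # drop the embedding output
instability = torch.zeros(inputs.input_ids.shape[1])
prev_probs = None
for i, hidden in enumerate(layers):
    # Intermediate states are projected through the final layer norm + LM head
    # ("logit lens"); the last entry is already normalized by the model.
    if i < len(layers) - 1:
        hidden = model.transformer.ln_f(hidden)
    probs = torch.softmax(model.lm_head(hidden), dim=-1)
    if prev_probs is not None:
        # Accumulate total-variation distance between consecutive layers'
        # next-token distributions at every position.
        instability += 0.5 * (probs - prev_probs).abs().sum(dim=-1).squeeze(0)
    prev_probs = probs

# Positions whose predictions keep shifting across layers are candidate
# "deep-thinking" tokens; stable positions settle early.
for token, score in zip(tok.convert_ids_to_tokens(inputs.input_ids[0]), instability):
    print(f"{token!r:>15}  {score.item():.3f}")
```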

  • Valerio Capraro

    Associate Professor at the University of Milan Bicocca

    9,993 followers

    Major preprint just out! We compare how humans and LLMs form judgments across seven epistemological stages. We highlight seven fault lines, points at which humans and LLMs fundamentally diverge:

    The Grounding fault: Humans anchor judgment in perceptual, embodied, and social experience, whereas LLMs begin from text alone, reconstructing meaning indirectly from symbols.

    The Parsing fault: Humans parse situations through integrated perceptual and conceptual processes; LLMs perform mechanical tokenization that yields a structurally convenient but semantically thin representation.

    The Experience fault: Humans rely on episodic memory, intuitive physics and psychology, and learned concepts; LLMs rely solely on statistical associations encoded in embeddings.

    The Motivation fault: Human judgment is guided by emotions, goals, values, and evolutionarily shaped motivations; LLMs have no intrinsic preferences, aims, or affective significance.

    The Causality fault: Humans reason using causal models, counterfactuals, and principled evaluation; LLMs integrate textual context without constructing causal explanations, depending instead on surface correlations.

    The Metacognitive fault: Humans monitor uncertainty, detect errors, and can suspend judgment; LLMs lack metacognition and must always produce an output, making hallucinations structurally unavoidable.

    The Value fault: Human judgments reflect identity, morality, and real-world stakes; LLM "judgments" are probabilistic next-token predictions without intrinsic valuation or accountability.

    Despite these fault lines, humans systematically over-believe LLM outputs, because fluent and confident language produces a credibility bias. We argue that this creates a structural condition, Epistemia: linguistic plausibility substitutes for epistemic evaluation, producing the feeling of knowing without actually knowing.

    To address Epistemia, we propose three complementary strategies: epistemic evaluation, epistemic governance, and epistemic literacy. Full paper in the first comment. Joint with Walter Quattrociocchi and Matjaz Perc.

  • Sohrab Rahimi

    Director, AI/ML Lead @ Google

    23,608 followers

    Evaluating LLMs is hard. Evaluating agents is even harder.

    This is one of the most common challenges I see when teams move from using LLMs in isolation to deploying agents that act over time, use tools, interact with APIs, and coordinate across roles. These systems make a series of decisions, not just a single prediction. As a result, success or failure depends on more than whether the final answer is correct. Despite this, many teams still rely on basic task success metrics or manual reviews. Some build internal evaluation dashboards, but most of these efforts are narrowly scoped and miss the bigger picture.

    Observability tools exist, but they are not enough on their own. Google’s ADK telemetry provides traces of tool use and reasoning chains. LangSmith gives structured logging for LangChain-based workflows. Frameworks like CrewAI, AutoGen, and OpenAgents expose role-specific actions and memory updates. These are helpful for debugging, but they do not tell you how well the agent performed across dimensions like coordination, learning, or adaptability.

    Two recent research directions offer much-needed structure. One proposes breaking down agent evaluation into behavioral components like plan quality, adaptability, and inter-agent coordination. Another argues for longitudinal tracking, focusing on how agents evolve over time, whether they drift or stabilize, and whether they generalize or forget.

    If you are evaluating agents today, here are the most important criteria to measure:
    • 𝗧𝗮𝘀𝗸 𝘀𝘂𝗰𝗰𝗲𝘀𝘀: Did the agent complete the task, and was the outcome verifiable?
    • 𝗣𝗹𝗮𝗻 𝗾𝘂𝗮𝗹𝗶𝘁𝘆: Was the initial strategy reasonable and efficient?
    • 𝗔𝗱𝗮𝗽𝘁𝗮𝘁𝗶𝗼𝗻: Did the agent handle tool failures, retry intelligently, or escalate when needed?
    • 𝗠𝗲𝗺𝗼𝗿𝘆 𝘂𝘀𝗮𝗴𝗲: Was memory referenced meaningfully, or ignored?
    • 𝗖𝗼𝗼𝗿𝗱𝗶𝗻𝗮𝘁𝗶𝗼𝗻 (𝗳𝗼𝗿 𝗺𝘂𝗹𝘁𝗶-𝗮𝗴𝗲𝗻𝘁 𝘀𝘆𝘀𝘁𝗲𝗺𝘀): Did agents delegate, share information, and avoid redundancy?
    • 𝗦𝘁𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝗼𝘃𝗲𝗿 𝘁𝗶𝗺𝗲: Did behavior remain consistent across runs or drift unpredictably?

    For adaptive agents or those in production, this becomes even more critical. Evaluation systems should be time-aware, tracking changes in behavior, error rates, and success patterns over time. Static accuracy alone will not explain why an agent performs well one day and fails the next.

    Structured evaluation is not just about dashboards. It is the foundation for improving agent design. Without clear signals, you cannot diagnose whether failure came from the LLM, the plan, the tool, or the orchestration logic. If your agents are planning, adapting, or coordinating across steps or roles, now is the time to move past simple correctness checks and build a robust, multi-dimensional evaluation framework. It is the only way to scale intelligent behavior with confidence.
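
    One lightweight way to operationalize the criteria above is to score every agent run along the same fixed dimensions and log the records over time, so drift becomes visible. A minimal sketch; the field names and 0-1 scale are my own, not taken from any particular framework:

```python
# Minimal multi-dimensional evaluation record for an agent run. Field names and
# the 0-1 scale are illustrative; use whatever scores your graders produce.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from statistics import mean


@dataclass
class AgentRunEval:
    run_id: str
    timestamp: str
    task_success: float      # was the outcome correct and verifiable?
    plan_quality: float      # was the initial strategy reasonable and efficient?
    adaptation: float        # retries, escalation, handling of tool failures
    memory_usage: float      # was memory referenced meaningfully?
    coordination: float      # delegation / info sharing in multi-agent runs
    stability: float         # consistency with previous runs of the same task


def summarize(history: list[AgentRunEval]) -> dict:
    """Average each dimension over a window of runs to spot drift over time."""
    dims = ["task_success", "plan_quality", "adaptation",
            "memory_usage", "coordination", "stability"]
    return {d: round(mean(getattr(r, d) for r in history), 3) for d in dims}


run = AgentRunEval(
    run_id="run-001",
    timestamp=datetime.now(timezone.utc).isoformat(),
    task_success=1.0, plan_quality=0.8, adaptation=0.7,
    memory_usage=0.5, coordination=0.9, stability=0.8,
)
print(asdict(run))
print(summarize([run]))
```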

  • Jeremy Arancio

    ML Engineer | Document AI Specialist | Turn enterprise-scale documents into profitable data products

    13,811 followers

    LLM-as-a-Judge is a dangerous way to do evaluation. In most cases, it just doesn't work.

    While it gives a false impression of having a grasp on your system's performance, it lures you with general metrics such as correctness, faithfulness, or completeness. They hide several complexities:
    - What does "completeness" mean for your application? In the case of a marketing AI assistant, what distinguishes a complete post from an incomplete one? If the score goes higher, does it mean that the post is better?
    - Often, these metrics are scores between 1-5. But what is the difference between a 3 and a 4? Does a 5 mean the output is perfect? What is perfect, then?
    - If you "calibrate" the LLM-as-a-judge with scores given by users during a test session, how do you ensure the LLM scoring matches user expectations? If I arbitrarily set all scores to 4, will I perform better than your model?

    However, if LLM-as-a-judge is limited, it doesn't mean it's impossible to evaluate your AI system. Here are some strategies you can adopt:

    - Online evaluation is the new king in the GenAI era
    Log and trace LLM outputs, retrieved chunks, routing… Each step of the process. Link it to user feedback as binary classification: was the final output good or bad? Then take a look at the data. Yourself, no LLM. Take notes and try to find patterns among good and bad traces. At this stage, you can use an LLM to help you find clusters within your notes, not the data. After taking this time, you'll already have some clues about how to improve the system.

    - Evaluate deterministic steps that come before the final output
    Was the retrieval system performant? Did the router correctly categorize the initial query? Those steps in the agentic system are deterministic, meaning you can evaluate them precisely:
    Retriever: Hit Rate@k, Mean Reciprocal Rank, Precision, Recall
    Router: Precision, Recall, F1-Score
    Create a small benchmark, synthetically or not, to evaluate those steps offline. It enables you to improve them later on individually (hybrid search instead of vector search, fine-tune a small classifier instead of relying on LLMs…).

    - Don't use tools that promise to externalize evaluation
    Your evaluation system is your moat. If you externalize it, not only will you have no control over your AI application, but you also won't be able to improve it. This is your problem, your users, your revenue. Evaluate your system. Not a generic one. All problems are different. Yours is unique as well.

    ...

    Those are some unequivocal ideas proposed by the AI community. Yet, I still see AI projects relying on LLM-as-a-judge and generic metrics among companies. Being able to evaluate your system gives you the power to improve it. So take the time to create the perfect evaluation for your use case.
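
    The retrieval metrics named above take only a few lines of plain Python, no LLM involved. A sketch of Hit Rate@k and Mean Reciprocal Rank over a small offline benchmark; the data layout (query mapped to ranked retrieved IDs plus the known relevant ID) is an assumption for illustration:

```python
# Hit Rate@k and Mean Reciprocal Rank for a retriever, computed offline.
# `results` maps each query to (ranked retrieved ids, ground-truth relevant id).
def hit_rate_at_k(results, k=5):
    hits = sum(1 for retrieved, relevant in results.values() if relevant in retrieved[:k])
    return hits / len(results)


def mean_reciprocal_rank(results):
    total = 0.0
    for retrieved, relevant in results.values():
        if relevant in retrieved:
            total += 1.0 / (retrieved.index(relevant) + 1)  # rank is 1-based
    return total / len(results)


# Toy benchmark: query -> (ids returned by the retriever, ground-truth relevant id)
results = {
    "refund policy": (["doc7", "doc2", "doc9"], "doc2"),
    "pricing tiers": (["doc4", "doc1", "doc8"], "doc5"),   # a miss
}
print(hit_rate_at_k(results, k=3))      # 0.5
print(mean_reciprocal_rank(results))    # 0.25
```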

  • Cameron R. Wolfe, Ph.D.

    Research @ Netflix

    23,760 followers

    My favorite paper from NeurIPS’24 shows us that frontier LLMs don’t pay very close attention to their context windows…

    Needle In A Haystack: The needle in a haystack test is the most common way to test LLMs with long context windows. The test is conducted via the following steps:
    1. Place a fact / statement within a corpus of text.
    2. Ask the LLM to generate the fact given the corpus as input.
    3. Repeat this test while increasing the size of the corpus and placing the fact at different locations.
    From this test, we see if an LLM “pays attention” to different regions of a long context window, but this test purely examines whether the LLM is able to recall information from its context.

    Where does this fall short? Most tasks being solved by LLMs require more than information recall. The LLM may need to perform inference, manipulate knowledge, or reason in order to solve a task. With this in mind, we might wonder if we could generalize the needle in a haystack test to analyze more complex LLM capabilities under different context lengths.

    BABILong generalizes the needle in a haystack test to perform long context reasoning. The LLM is tested based upon its ability to reason over facts that are distributed in very long text corpora. Reasoning tasks that are tested include fact chaining, induction, deduction, counting, list / set comprehension, and more. Such reasoning tasks are challenging, especially when necessary information is scattered in a large context window.

    “Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity.” - BABILong paper

    Can LLMs reason over long context? We see in the BABILong paper that most frontier LLMs struggle to solve long context reasoning problems. Even top LLMs like GPT-4 and Gemini-1.5 seem to consistently use only ~20% of their context window. In fact, most LLMs struggle to answer questions about facts in texts longer than 10,000 tokens!

    What can we do about this? First, we should just be aware of this finding! Be wary of using super long contexts, as they might deteriorate the LLM’s ability to solve more complex problems that require reasoning. However, we see in the BABILong paper that these issues can be mitigated with a few different approaches:
    - Using RAG is helpful. However, this approach only works up to a certain context length and has limitations (e.g., struggles to solve problems where the order of facts matters).
    - Recurrent transformers can answer questions about facts from very long contexts.
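
    The three steps of the basic needle-in-a-haystack test are easy to script. A rough sketch that builds the prompts for a sweep over context lengths and needle positions; the filler text, needle, question, and length grid are placeholders, and the call to the model under test is left to your own client:

```python
# Build needle-in-a-haystack prompts: drop a known fact ("needle") at varying
# depths inside increasingly long filler text, then ask the model to recall it.
FILLER_SENTENCE = "The sky was a pale shade of grey over the quiet harbor town. "
NEEDLE = "The secret passcode for the vault is 7-4-1-9. "
QUESTION = "\n\nWhat is the secret passcode for the vault?"


def build_prompt(context_len_chars: int, depth: float) -> str:
    """Place the needle at a relative depth (0.0 = start, 1.0 = end) of the haystack."""
    repeats = context_len_chars // len(FILLER_SENTENCE) + 1
    haystack = (FILLER_SENTENCE * repeats)[:context_len_chars]
    insert_at = int(len(haystack) * depth)
    return haystack[:insert_at] + NEEDLE + haystack[insert_at:] + QUESTION


prompts = []
for length in (2_000, 20_000, 100_000):
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompts.append((length, depth, build_prompt(length, depth)))

# Send each prompt to the model under test (client call not shown), check whether
# "7-4-1-9" appears in its answer, and chart pass/fail by length and depth.
```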

  • Sumeet Agrawal

    Vice President of Product Management

    9,696 followers

    AI Evaluation Frameworks

    As AI systems evolve, one major challenge remains: how do we measure their performance accurately? This is where the concept of “AI Judges” comes in, from LLMs to autonomous agents and even humans. Here is how each type of judge works:

    1. LLM-as-a-Judge
    - An LLM acts as an evaluator, comparing answers or outputs from different models and deciding which one is better.
    - It focuses on text-based reasoning and correctness - great for language tasks, but limited in scope.
    - Key Insight: LLMs cannot run code or verify real-world outcomes. They are best suited for conversational or reasoning-based evaluations.

    2. Agent-as-a-Judge
    - An autonomous agent takes evaluation to the next level.
    - It can execute code, perform tasks, measure accuracy, and assess efficiency, just like a real user or system would.
    - Key Insight: This allows for scalable, automated, and realistic testing, making it ideal for evaluating AI agents and workflows in action.

    3. Human-as-a-Judge
    - Humans manually test and observe agents to determine which performs better.
    - They offer detailed and accurate assessments, but the process is slow and hard to scale.
    - Key Insight: While humans remain the gold standard for nuanced judgment, agent-based evaluation is emerging as the scalable replacement for repetitive testing.

    The future of AI evaluation is shifting from static text comparisons (LLM) to dynamic, real-world testing (Agent). Humans will still guide the process, but AI agents will soon take over most of the judging work. If you are building or testing AI systems, start adopting Agent-as-a-Judge methods. They will help you evaluate performance faster, more accurately, and at scale.

  • Chad Coleman, Ph.D.

    VP, Engineering & AI Innovation @ ZoomInfo | Ex-Google/IBM | Professor @ Columbia/NYU | Leading AI Strategy & Emerging Technology Vision

    4,286 followers

    Just published: A comparative analysis of ethical reasoning across major LLMs, examining how different model architectures and training approaches influence moral decision-making capabilities. We put six leading models (including GPT-4, Claude, and LLaMA) through rigorous ethical reasoning tests, moving beyond traditional alignment metrics to explore their explicit moral logic frameworks. Using established ethical typologies, we analyzed how these systems articulate their decision-making process in classic moral dilemmas.

    Technical insight: Despite architectural differences, we found remarkable convergence in ethical reasoning patterns - suggesting that current training methodologies might be creating similar moral scaffolding across models. The variations we observed appear more linked to fine-tuning and post-training processes than base architecture.

    Critical for ML practitioners: All models demonstrated sophisticated reasoning comparable to graduate-level philosophy, with a strong bias toward consequentialist frameworks.

    Implications for model development? This convergence raises interesting questions about diversity in ethical reasoning capabilities and potential training modifications.

    Check out the full paper here: https://lnkd.in/gFamrRVc

    #LLMs #MachineLearning #AIAlignment #ModelDevelopment

  • Sinan Ozdemir

    AI & LLM Expert / Author / Check out my latest book!

    11,674 followers

    Your AI agent does all this work: calls tools, reasons through the problem, writes you a solid few paragraphs as a response. So... how do you know if it was right? Is it in the tool calls somewhere? Do you even know what "right" looks like? What if the same answer can be phrased a dozen ways? What if the answer is correct but the reasoning is wrong? Are you going to read every response yourself?

    You might need a grading rubric. And you might also need an LLM to apply it. This is from Chapter 4 of Building Agentic AI - and part of my free Substack series I'm co-writing with the amazing Julian Alvarado, based on my Pearson book of the same name available now on O'Reilly and Amazon (link in the comments).

    🎯 Use an LLM to Grade an LLM
    Once an LLM is done working, give a second LLM the necessary inputs alongside the agent's response, and (if you have one) the ground truth. Define a scoring rubric: pass/fail, a numeric scale, whatever fits your task. Add few-shot examples so it knows what each score looks like. Include reasoning in those examples so it knows exactly what you're looking for when giving answers. Add chain-of-thought so it reasons before scoring (and hopefully matches your reasoning from the examples).

    Two main rules:
    1. Try to use a different model family than your agent. Same-family models share biases. All the context is in the prompt, so a mid-tier LLM can work fine.
    2. You don't need the smartest grader. You need the one that agrees with you.

    📈 Where It Breaks
    For one of the case studies in my own book, after running the LLM grader against a gold set (a test explicitly made to check the rubric against my human judgment) I found that it agreed with me in the vast majority of cases (96%+). But edge cases were easy to spot. For example, the grader gave full marks when the agent said "approximately 42" instead of "42." It struggled when the agent got the right answer through wrong reasoning. Right number. Wrong path.

    🔍 Beyond Correctness
    Beyond "was the agent right", add dimensions for politeness, completeness, tool efficiency, format compliance. Let a rubric look over an agent's entire trajectory to make judgments about how it used tools, etc. LLMs are good enough now (with sufficient prompt engineering, iteration, and testing) to grade other LLMs on basic and even more nuanced tasks. That's huge when it comes to automated training pipelines (e.g. RL pipelines where LLMs can act as the reward-giver to another LLM) and off-the-shelf eval solutions.

    Full post + code linked in the comments. If your agent returns free text, you need a way to grade it at scale. This is one way you can do that.

    #AgenticAI #LLMs #AI #MachineLearning #RAG #AIEvaluation
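
    The gold-set check described above (does the grader agree with my human labels often enough, and where does it disagree?) is worth automating before you trust the grader. A small sketch, not the book's code: `grade_with_llm` is a hypothetical wrapper around your rubric prompt that returns "pass" or "fail", and the gold-set field names are illustrative.

```python
# Compare an LLM grader's pass/fail verdicts against human gold labels, report the
# agreement rate, and surface disagreements for manual review.
# `grade_with_llm(question, agent_response)` is a hypothetical wrapper around your
# rubric prompt that returns "pass" or "fail".
def check_grader_agreement(gold_set, grade_with_llm):
    disagreements = []
    agreed = 0
    for item in gold_set:
        verdict = grade_with_llm(item["question"], item["agent_response"])
        if verdict == item["human_label"]:
            agreed += 1
        else:
            disagreements.append({**item, "grader_label": verdict})
    return agreed / len(gold_set), disagreements


# Gold-set items look like:
# {"question": "...", "agent_response": "...", "human_label": "pass"}
# agreement, disagreements = check_grader_agreement(gold_set, grade_with_llm)
# print(f"agreement: {agreement:.1%}")   # then read every disagreement yourself
```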
