Lack of proper evaluation is one of the biggest factors limiting adoption of enterprise-scale LLM applications. Even major labs often report performance in non-transparent ways. A recent Anthropic paper provides excellent new recommendations for evaluation grounded in statistical theory and experimental design.

A common scenario across the internet, research papers, and companies: two LLMs, Model A and Model B. Model A achieves 67% on the primary benchmark, Model B achieves 62%. Many conclude Model A is better. In reality, we can't say much from this information alone. We need to know how many benchmark questions there were and whether they were related. If the benchmark had fifty related questions, Model A might just have been lucky. If it used thousands of unrelated questions, the difference might be significant.

Can we account for sample size and interdependence? Yes, rigorous science does it all the time. Interestingly, it is social science, not physics or biology, that provides most of the insights for these evaluations. Questions in leading benchmarks like MMLU share many properties with social or medical studies. The Anthropic paper shows how to incorporate these practices in LLM evaluation:

1. Compute standard errors using the Central Limit Theorem. For unrelated questions (for experts: i.i.d.), this shows whether differences between models are significant or just luck. Most papers omit these error bars.
2. For related question groups, compute clustered standard errors. Benchmarks that ignore this can report overconfident error bars, as shown in the Anthropic paper.
3. Reduce variance through resampling and next-token probability analysis. Individual samples are noisy; these strategies reduce that noise.
4. Compare models using question-level paired differences, not population-level statistics. If the questions are identical, analyze the score difference per question, then average.
5. Use power analysis to determine whether an evaluation can test a hypothesis at all.

These techniques are well known in science, and their adoption in LLM evaluation is really promising. It is also a nice revival of statistical theory in a field often focused on "if it works, it works; if it doesn't, add more data and parameters." I'm excited about these opportunities and about contributing to this effort. I'm really interested in learning from your evaluation experience and the frameworks you have found helpful. #llms #machinelearning #deeplearning
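To make points 1, 2, and 4 concrete, here is a minimal sketch of how those statistics could be computed from per-question scores. The helper names and toy data are my own illustrations, not code from the Anthropic paper, and the clustered estimator below omits small-sample corrections.

```python
import numpy as np

def mean_and_se(scores):
    """Mean accuracy and its standard error under the CLT (i.i.d. questions)."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores))

def clustered_se(scores, clusters):
    """Cluster-robust standard error of the mean, where questions sharing a
    cluster id (e.g. the same reading passage) may be correlated.
    Note: no small-sample correction is applied here."""
    scores = np.asarray(scores, dtype=float)
    clusters = np.asarray(clusters)
    resid = scores - scores.mean()
    # Sum residuals within each cluster, then combine across clusters.
    cluster_sums = np.array([resid[clusters == c].sum() for c in np.unique(clusters)])
    return np.sqrt((cluster_sums ** 2).sum()) / len(scores)

def paired_difference(scores_a, scores_b):
    """Question-level paired comparison of two models on the same benchmark."""
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    mean_diff, se_diff = mean_and_se(diff)
    z = mean_diff / se_diff if se_diff > 0 else float("inf")
    return mean_diff, se_diff, z

# Toy usage: 1 = correct, 0 = incorrect, one entry per question.
a = np.random.binomial(1, 0.67, size=500)
b = np.random.binomial(1, 0.62, size=500)
print(paired_difference(a, b))  # mean difference, its standard error, and a z-score
```

The paired version typically gives much tighter error bars than comparing two independently reported accuracies, because per-question difficulty cancels out in the difference.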
Technology Benchmarking Practices
Explore top LinkedIn content from expert professionals.
Summary
Technology benchmarking practices are systematic approaches used to compare the capabilities, performance, and reliability of tech solutions—such as AI models or infrastructure—against standards or competitors. In simple terms, benchmarking helps organizations evaluate if their technology meets quality, cost, and trust requirements for real-world use.
- Balance accuracy and cost: When comparing AI agents or systems, make sure to consider both the accuracy of their results and the cost required to achieve them, so you avoid inefficient solutions.
- Test in real-world environments: Evaluate technology not just in ideal lab settings, but also under real-world conditions to understand how it performs during daily operations and challenges.
- Choose relevant benchmarks: Select benchmarks that closely match your specific use cases and continuously update them as your system evolves, ensuring your technology stays reliable and trustworthy.
Reliable evaluation methods for Large Language Model (LLM)-based agents are crucial, yet rapidly evolving. In my view, traditional LLM benchmarks fall short because they don't fully capture the unique capabilities agents bring, such as planning, reasoning, tool use, self-reflection, memory, and interaction with dynamic environments. Here's how I see current evaluation methods shaping up across four key areas:

1. Agent Capabilities Evaluation:
- Planning & Multi-Step Reasoning: Benchmarks like GSM8K, HotpotQA, ARC, and frameworks such as ToolEmu and PlanBench effectively test an agent's ability to decompose tasks, reason causally, and correct its own errors.
- Function Calling & Tool Use: Evaluations are evolving from basic tasks (ToolBench) toward complex interactions (ComplexFuncBench, NESTFUL).
- Self-Reflection: Benchmarks like LLF-Bench and LLM-Evolve help gauge how agents learn from their past actions and feedback.
- Memory: Tools such as MemGPT and LoCoMo are valuable for assessing context retention, while StreamBench and Reflexion highlight continuous learning and effective action planning.

2. Application-Specific Agent Evaluation:
- Web Agents: Moving from simple simulations (MiniWob) to complex and realistic environments (WebArena, WorkArena++).
- Software Engineering Agents: Practical benchmarks like SWE-bench and SWELancer are critical for measuring an agent's capability in real-world software development scenarios.
- Scientific Agents: Platforms like ScienceWorld and CORE-Bench are effective for evaluating an agent's proficiency in research tasks such as literature synthesis and experiment design.
- Conversational Agents: Realistic dialogues and user simulations (MultiWOZ, ALMITA) provide a reliable way to test conversational skills.

3. Generalist Agent Evaluation:
- Tools such as GAIA, TheAgentCompany, and CRMArena effectively assess agents across diverse, real-world scenarios.

4. Evaluation Frameworks:
- LLMOps tools are indispensable for detailed evaluation, real-time monitoring, and identifying areas for improvement during agent development and deployment.

Trends & Future Directions:
- Evaluations are increasingly moving towards dynamic, realistic, and challenging environments.
- Regular updates to benchmarks are essential to keep up with rapid technological advancements.
- Future research should prioritize granular, step-by-step metrics, consider efficiency and cost implications, develop automated evaluation strategies, and rigorously assess safety, compliance, and bias.

Addressing these gaps is crucial for responsible and effective deployment of LLM-based agents in real-world applications.
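The trends above call for granular, step-by-step metrics that also track efficiency and cost. Below is a minimal sketch of what that bookkeeping could look like; the class names, fields, and per-token prices are hypothetical, not taken from any of the benchmarks or frameworks named in the post.

```python
from dataclasses import dataclass, field

@dataclass
class StepRecord:
    """One agent step: what it tried, whether it helped, and what it cost."""
    action: str
    correct: bool          # did this step advance toward the goal?
    tokens_in: int
    tokens_out: int

@dataclass
class EpisodeResult:
    task_id: str
    steps: list[StepRecord] = field(default_factory=list)
    success: bool = False

    def step_accuracy(self) -> float:
        return sum(s.correct for s in self.steps) / max(len(self.steps), 1)

    def cost_usd(self, in_price=3e-6, out_price=15e-6) -> float:
        # Illustrative per-token prices; substitute your provider's real rates.
        return sum(s.tokens_in * in_price + s.tokens_out * out_price for s in self.steps)

def summarize(results: list[EpisodeResult]) -> dict:
    """Report success rate, step-level accuracy, and cost side by side."""
    n = max(len(results), 1)
    return {
        "success_rate": sum(r.success for r in results) / n,
        "mean_step_accuracy": sum(r.step_accuracy() for r in results) / n,
        "mean_cost_usd": sum(r.cost_usd() for r in results) / n,
    }
```

Keeping per-step records like these makes it possible to report where agents fail in a trajectory and what that trajectory cost, rather than only a final pass/fail rate.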
-
𝔼𝕍𝔸𝕃 field note (2 of 3): Finding the benchmarks that matter for your own use cases is one of the biggest contributors to AI success. Let's dive in.

AI adoption hinges on two foundational pillars: quality and trust. Like the dual nature of a superhero, quality and trust play distinct but interconnected roles in ensuring the success of AI systems. This duality underscores the importance of rigorous evaluation. Benchmarks, whether automated or human-centric, are the tools that allow us to measure and enhance quality while systematically building trust. By identifying the benchmarks that matter for your specific use case, you can ensure your AI system not only performs at its peak but also inspires confidence in its users.

🦸♂️ Quality is the superpower, think Superman: capable of remarkable feats like reasoning and understanding across modalities that unlock innovative capabilities. Evaluating quality involves tools like controllability frameworks to ensure predictable behavior, performance metrics to set clear expectations, and methods like automated benchmarks and human evaluations to measure capabilities. Techniques such as red-teaming further stress-test the system to identify blind spots.

👓 But trust is the alter ego, Clark Kent: the steady, dependable force that puts the superpower in the right place at the right time and ensures these powers are used wisely and responsibly. Building trust requires measures that ensure systems are helpful (meeting user needs), harmless (avoiding unintended harm), and fair (mitigating bias). Transparency through explainability and robust verification processes further solidifies user confidence by revealing where a system excels and where it isn't ready yet.

For AI systems, one cannot thrive without the other. A system with exceptional quality but no trust risks indifference or rejection, a collective "shrug" from your users. Conversely, all the trust in the world without quality reduces the potential to deliver real value.

To ensure success, prioritize benchmarks that align with your use case, continuously measure both quality and trust, and adapt your evaluation as your system evolves. You can get started today: map use case requirements to benchmark types, identify critical metrics (accuracy, latency, bias), set minimum performance thresholds (aka exit criteria), and choose complementary benchmarks (for better coverage of failure modes, and to avoid over-fitting to a single number). By doing so, you can build AI systems that not only perform but also earn the trust of their users, unlocking long-term value.
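One way to make the "exit criteria" step concrete is a small, machine-checkable threshold spec, sketched below. The metric names and threshold values are illustrative assumptions, not recommendations and not any particular framework's schema.

```python
# Hypothetical exit criteria for a support-chat assistant; metric names and
# thresholds below are placeholders to be replaced with your own targets.
EXIT_CRITERIA = {
    "task_completion_rate": {"min": 0.90},
    "p95_latency_seconds":  {"max": 3.0},
    "bias_audit_flag_rate": {"max": 0.01},
}

def meets_exit_criteria(measured: dict, criteria: dict = EXIT_CRITERIA) -> list[str]:
    """Return the list of failed criteria; an empty list means the bar is met."""
    failures = []
    for name, bounds in criteria.items():
        value = measured.get(name)
        if value is None:
            failures.append(f"{name}: not measured")
        elif "min" in bounds and value < bounds["min"]:
            failures.append(f"{name}: {value} < {bounds['min']}")
        elif "max" in bounds and value > bounds["max"]:
            failures.append(f"{name}: {value} > {bounds['max']}")
    return failures

print(meets_exit_criteria({"task_completion_rate": 0.92, "p95_latency_seconds": 3.4}))
```

Encoding the thresholds this way also makes "choose complementary benchmarks" auditable: every release candidate is checked against the same list, and a missing measurement fails loudly instead of being forgotten.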
-
#AI benchmarking conversations I have often turn into a false choice: #Synthetic vs. #real_world. The truth? Both matter. At Dell Technologies, our TME labs not only analyze benchmarks but also work directly with customers' production environments to see real-world challenges.

Synthetic benchmarks show what the infrastructure can do under ideal conditions. They expose architectural limits, scaling characteristics, peak throughput, and theoretical efficiency. That's important. It tells you the ceiling.

But production environments don't operate at the ceiling. Real-world benchmarking shows what the system will do under concurrency, mixed workloads, thermal pressure, network contention, and operational overhead. That's where latency, sustained throughput, and cost per token actually determine business value.

One shows maximum capability. The other reveals operational truth. If you're evaluating AI infrastructure and only looking at peak tokens per second, you're missing half the story. The best enterprise AI strategies validate both:
• Lab performance to understand headroom
• Production performance to understand reality

Peak numbers sell slides. Sustained performance drives outcomes. #AI #AIBenchmarking #EnterpriseAI #Infrastructure #DataCenter #IWork4Dell
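As an illustration of the peak-versus-sustained distinction, here is a rough sketch of how both numbers, plus cost per token, might be derived from the same load-test log. The request data, hardware cost, and sample size are made-up assumptions for illustration, not Dell lab results.

```python
import numpy as np

# Hypothetical per-request log from a load test: (wall_seconds, tokens_generated).
requests = [(1.8, 410), (2.4, 512), (9.7, 505), (2.1, 498), (3.0, 470)]
secs = np.array([s for s, _ in requests])
toks = np.array([t for _, t in requests])

peak_tps = (toks / secs).max()           # best single-request throughput: the "ceiling"
sustained_tps = toks.sum() / secs.sum()  # aggregate throughput over the whole run
p95_latency = np.percentile(secs, 95)    # tail latency (use far more samples in practice)

hourly_cost_usd = 12.50                  # assumed fully loaded serving-hardware cost per hour
cost_per_million_tokens = hourly_cost_usd / (sustained_tps * 3600) * 1e6

print(f"peak {peak_tps:.0f} tok/s, sustained {sustained_tps:.0f} tok/s, "
      f"p95 latency {p95_latency:.1f} s, ${cost_per_million_tokens:.2f}/M tokens")
```

Even in this toy log, one slow request drags sustained throughput far below the peak, which is exactly the gap between the slide number and the operating number.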
-
𝐀𝐈 𝐀𝐠𝐞𝐧𝐭𝐬 𝐓𝐡𝐚𝐭 𝐌𝐚𝐭𝐭𝐞𝐫

The paper critically examines AI agent benchmarking and evaluation practices, identifying key shortcomings that hinder real-world applicability. It argues that current evaluations focus too much on accuracy without considering cost, leading to inefficient, expensive agents. The authors propose a more holistic benchmarking framework that incorporates cost control, distinct evaluations for model and downstream developers, proper holdout sets to prevent overfitting, and standardized evaluation practices to improve reproducibility.

𝐊𝐞𝐲 𝐇𝐢𝐠𝐡𝐥𝐢𝐠𝐡𝐭𝐬

𝟏. 𝐍𝐞𝐞𝐝 𝐟𝐨𝐫 𝐂𝐨𝐬𝐭-𝐂𝐨𝐧𝐭𝐫𝐨𝐥𝐥𝐞𝐝 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧𝐬
- Many AI agents achieve high accuracy by running repeated model calls, significantly increasing costs.
- The authors introduce simple baseline agents that outperform state-of-the-art (SOTA) complex architectures at lower costs.
- They advocate for evaluations that balance accuracy and cost to prevent inefficient agent designs.

𝟐. 𝐉𝐨𝐢𝐧𝐭 𝐎𝐩𝐭𝐢𝐦𝐢𝐳𝐚𝐭𝐢𝐨𝐧 𝐨𝐟 𝐀𝐜𝐜𝐮𝐫𝐚𝐜𝐲 𝐚𝐧𝐝 𝐂𝐨𝐬𝐭
- The authors propose a Pareto-curve approach to jointly optimize accuracy and inference cost (see the Pareto-frontier sketch after this post).
- By modifying the DSPy framework, they show that cost can be reduced while maintaining accuracy on benchmarks like HotPotQA.

𝟑. 𝐌𝐨𝐝𝐞𝐥 𝐯𝐬. 𝐃𝐨𝐰𝐧𝐬𝐭𝐫𝐞𝐚𝐦 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧 𝐍𝐞𝐞𝐝𝐬
- Model developers focus on scientific improvements (e.g., parameter count), whereas downstream developers need cost-efficient solutions.
- The study highlights misleading benchmarks like NovelQA, which fail to account for real-world cost implications.

𝟒. 𝐁𝐞𝐧𝐜𝐡𝐦𝐚𝐫𝐤𝐢𝐧𝐠 𝐒𝐡𝐨𝐫𝐭𝐜𝐮𝐭𝐬 𝐚𝐧𝐝 𝐎𝐯𝐞𝐫𝐟𝐢𝐭𝐭𝐢𝐧𝐠 𝐑𝐢𝐬𝐤𝐬
- Many agent benchmarks lack proper holdout sets, leading to overfitting. They classify benchmarks into four generality levels:
𝐃𝐢𝐬𝐭𝐫𝐢𝐛𝐮𝐭𝐢𝐨𝐧-𝐬𝐩𝐞𝐜𝐢𝐟𝐢𝐜 – Focus on a narrow task (e.g., U.S. grade school math).
𝐓𝐚𝐬𝐤-𝐬𝐩𝐞𝐜𝐢𝐟𝐢𝐜 – Account for distribution shifts (e.g., booking flights).
𝐃𝐨𝐦𝐚𝐢𝐧-𝐠𝐞𝐧𝐞𝐫𝐚𝐥 – Cover a broad domain (e.g., web browsing).
𝐅𝐮𝐥𝐥𝐲 𝐠𝐞𝐧𝐞𝐫𝐚𝐥 – Evaluate agents across domains (e.g., web + robotics).
- Most benchmarks fail to implement appropriate holdout strategies, making results unreliable.

𝟓. 𝐋𝐚𝐜𝐤 𝐨𝐟 𝐒𝐭𝐚𝐧𝐝𝐚𝐫𝐝𝐢𝐳𝐚𝐭𝐢𝐨𝐧 𝐚𝐧𝐝 𝐑𝐞𝐩𝐫𝐨𝐝𝐮𝐜𝐢𝐛𝐢𝐥𝐢𝐭𝐲
Current agent evaluations are inconsistent due to:
- Non-standardized evaluation scripts.
- Repurposing LLM benchmarks for agent testing.
- High costs making multiple evaluation runs impractical.
- External environment dependencies (e.g., web interactions).

𝟔. 𝐇𝐮𝐦𝐚𝐧-𝐢𝐧-𝐭𝐡𝐞-𝐋𝐨𝐨𝐩 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧 𝐢𝐬 𝐍𝐞𝐠𝐥𝐞𝐜𝐭𝐞𝐝
- Most agent benchmarks assume either full automation or no user interaction.
- Real-world AI agents often operate with human guidance, which current benchmarks do not capture.
- Future benchmarks should incorporate human feedback to better reflect real-world agent performance.

Keep learning 😊 !!
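To illustrate the cost-controlled evaluation idea, here is a minimal sketch of extracting a cost-accuracy Pareto frontier from benchmark runs. The agent names and numbers are invented for illustration; this is not the authors' code or their modified DSPy pipeline.

```python
# Hypothetical (cost_usd_per_task, accuracy) results for several agent designs.
runs = {
    "single_call_baseline": (0.002, 0.61),
    "retry_3x":             (0.006, 0.66),
    "debate_5_agents":      (0.030, 0.67),
    "plan_then_act":        (0.008, 0.70),
}

def pareto_frontier(results: dict) -> list[str]:
    """Keep only designs not dominated by a cheaper-and-at-least-as-accurate design."""
    frontier = []
    for name, (cost, acc) in results.items():
        dominated = any(
            other != name
            and o_cost <= cost and o_acc >= acc
            and (o_cost < cost or o_acc > acc)
            for other, (o_cost, o_acc) in results.items()
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: results[n][0])

print(pareto_frontier(runs))
# ['single_call_baseline', 'retry_3x', 'plan_then_act']  (debate_5_agents is dominated)
```

Reporting the whole frontier, rather than a single accuracy number, makes it visible when an expensive multi-call design is beaten by a cheaper baseline.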
-
Most teams claim to be building general agents, but they evaluate them inside benchmark sandboxes that quietly assume domain-specific integration. The bottleneck is not agent capability. It is evaluation architecture.

General Agent Evaluation from IBM Research reframes the problem by treating general-agent evaluation as a systems design challenge rather than a leaderboard exercise. The flawed assumption in today's practice is that benchmarks are neutral. They are not. They encode communication protocols, hidden environment semantics, and integration shortcuts that effectively test a partially adapted agent. What looks like generalization is often protocol alignment.

The correction they suggest is the Unified Protocol and Exgentic framework. Instead of forcing agents to speak each benchmark's native dialect, they introduce a mediation layer with three explicit fields: task, context, and actions. Every benchmark is decoupled from its bespoke API and expressed as a canonical contract. Every agent is adapted once to that contract rather than pairwise to every benchmark.

The key architectural move is the narrow waist. Benchmarks and agents run in separate processes. Adaptors translate between native protocols such as MCP, tool calling, or Python execution and the Unified Protocol. The orchestrator mediates observations and actions in a deterministic loop, with isolation, caching, and reproducibility guarantees. This is evaluation harness engineering, not prompt tuning.

Across six environments and ninety configurations, model choice explains 28.2 percent of success-rate variance while agent architecture explains 0.6 percent. The backbone dominates. Architectural differences matter, but largely through interaction effects and components such as schema guards and tool shortlisting. For example, GPT 5.2 configurations fail outright in tool-rich environments without shortlisting due to tool count limits, showing that harness design and tool exposure policies can determine viability.

There are constraints. Evaluation is expensive, limited to text-based interaction, and still requires adaptor work when integrating new agents or benchmarks. But the structural insight holds.

For builders, the takeaway is to stop benchmarking agents inside benchmark-specific wrappers and calling the result generalization. Build a mediation layer that standardizes task contracts, isolates execution, and preserves native agent behavior. Measure cross-environment stability under that constraint. As a rule of thumb, if your agent must be rewritten per benchmark, you are testing integration skill, not general capability. Generalization is an infrastructure property before it is a model property.

Paper: https://lnkd.in/eVscSkKe
Project: http://www.exgentic.ai
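For intuition only, here is a speculative sketch of what a canonical contract with task, context, and actions fields, plus a deterministic orchestration loop, could look like. The type and method names are my own assumptions and are not the Unified Protocol or Exgentic API.

```python
from dataclasses import dataclass
from typing import Protocol

# Speculative sketch of a "unified" observation passed from any benchmark adaptor
# to any agent adaptor. Field names mirror the three fields the post describes
# (task, context, actions); everything else here is an assumption.
@dataclass
class UnifiedObservation:
    task: str            # what the agent is being asked to accomplish
    context: str         # current environment state, rendered as text
    actions: list[str]   # actions currently available to the agent

class BenchmarkAdaptor(Protocol):
    def reset(self) -> UnifiedObservation: ...
    def step(self, action: str) -> tuple[UnifiedObservation, bool, float]: ...

class AgentAdaptor(Protocol):
    def act(self, obs: UnifiedObservation) -> str: ...

def run_episode(env: BenchmarkAdaptor, agent: AgentAdaptor, max_steps: int = 50) -> float:
    """Deterministic orchestration loop: the same code runs any (benchmark, agent) pair."""
    obs = env.reset()
    for _ in range(max_steps):
        obs, done, reward = env.step(agent.act(obs))
        if done:
            return reward
    return 0.0
```

The point of the narrow waist is visible in the signature of run_episode: neither side sees the other's native protocol, so adding a new benchmark or agent means writing one adaptor, not N pairwise integrations.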
-
The Fab Whisperer: Benchmarking — With Our Competitors

This week at the SEMI FOA Q1 Collaborative Forum, fabs will compare numbers. Benchmarking is healthy. But let's address a somewhat uncomfortable truth: the most powerful benchmarking happens when you're willing to compare yourself, honestly, with your competitors. Yes. Competitors.

Semiconductor manufacturing is not a zero-sum efficiency game. When one fab improves: suppliers improve, standards mature, tool performance baselines rise, reliability practices evolve. The entire ecosystem benefits. The automotive and aerospace industries did this. Even oil & gas learned this lesson decades ago. We still hesitate in semiconductors.

Fabs worry about IP leakage, cost exposure, revealing weaknesses, and competitive positioning. All of those are valid concerns, but structured benchmarking forums exist specifically to allow for normalized data sharing, aggregated comparisons, and anonymous performance quartiles that together drive standardized definitions. No one is sharing recipes or customer lists. We are sharing operational truth.

I've seen fabs enter benchmarking forums reluctantly. Then something interesting happens: they discover their "world-class" OEE is actually median. Their PM compliance is high, but PM effectiveness is bottom quartile. Their cycle time is competitive, but variability is extreme. Their staffing looks lean, but engineering load per tool group is unsustainable. Those realizations sharpen a fab.

The Best Way to Benchmark — Collaboratively
If you're going to benchmark with peers (and competitors), do it right.
1️⃣ Align definitions with SEMI standards first. No "creative math."
2️⃣ Normalize structurally: mask layers, tool intensity, technology node, automation level, and mix complexity. Without normalization, comparisons are noise.
3️⃣ Share loss mechanisms, not just surface metrics. The real learning happens when fabs discuss issues like PM-induced failures, scheduling logic, variability drivers, and staffing and capacity models. That's where breakthroughs happen.
4️⃣ Compete on improvement speed. It's not about who is best today; it's about who closes gaps fastest.

The fabs that refuse to benchmark collaboratively often overestimate their maturity, underestimate structural weaknesses, miss industry shifts, and ultimately improve more slowly. The fabs that engage openly (within proper boundaries) will build sharper diagnostics, improve faster, gain credibility with suppliers, and attract stronger engineering talent.

In the big picture, benchmarking is strategic intelligence. As we enter a period of massive CapEx expansions, regionalization, talent shortages, and tool cost inflation, no single fab can afford to operate in isolation anymore. Structured collaboration is essential for industry maturity. We can do it.

#TheFabWhisperer #SEMI #Semiconductor #FabOperations #Benchmarking #ManufacturingExcellence #OperationalExcellence
-
Benchmarking is one of the most direct ways to answer a question every UX team faces at some point: is the design meeting expectations or just looking good by chance?

A benchmark might be an industry standard like a System Usability Scale score of 68 or higher, an internal performance target such as a 90 percent task completion rate, or the performance of a previous product version that you are trying to improve upon. The way you compare your data to that benchmark depends on the type of metric you have and the size of your sample. Getting that match right matters because the wrong method can give you either false confidence or unwarranted doubt.

If your metric is binary, such as pass or fail, yes or no, completed or not completed, and your sample size is small, you should be using an exact binomial test. This calculates the exact probability of seeing a result at least as extreme as yours if the true rate were equal to your benchmark, without relying on large-sample assumptions. For example, if seven out of eight users succeed at a task and your benchmark is 70 percent, the exact binomial test will tell you whether that observed 87.5 percent is statistically above your target.

When you have binary data with a large sample, you can switch to a z-test for proportions. This uses the normal distribution to compare your observed proportion to the benchmark, and it works well when you expect at least five successes and five failures. In practice, you might have 820 completions out of 1000 attempts and want to know if that 82 percent is higher than an 80 percent target.

For continuous measures such as task times, SUS scores, or satisfaction ratings, the right approach is a one-sample t-test. This compares your sample mean to the benchmark mean while taking into account the variation in your data. For example, you might have a SUS score of 75 and want to see if it is significantly higher than the benchmark of 68.

Some continuous measures, like task times, come with their own challenge. Time data are often right-skewed: most people finish quickly but a few take much longer, pulling the average up. If you run a t-test on the raw times, these extreme values can distort your conclusion. One fix is to log-transform the times, run the t-test on the transformed data, and then exponentiate the mean to get the geometric mean. This gives a more realistic "typical" time. Another fix is to use the median instead of the mean and compare it to the benchmark using a confidence interval for the median, which is robust to extreme outliers.

There are also cases where you start with continuous data but really want to compare proportions. For example, you might collect ratings on a 5-point scale but your reporting goal is to know whether at least 75 percent of users agreed or strongly agreed with a statement. In this case, you set a cut-off score, recode the ratings into agree versus not agree, and then use an exact binomial or z-test for proportions.
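A minimal sketch of the tests described above, using SciPy. The sample sizes and benchmarks mirror the examples in the post; the SUS scores and task times are made-up data, and the one-sided alternatives assume you only care about beating the benchmark.

```python
import numpy as np
from scipy import stats

# 1) Small binary sample vs. a 70% benchmark: exact binomial test.
#    7 of 8 users completed the task; is the true completion rate above 0.70?
binom = stats.binomtest(k=7, n=8, p=0.70, alternative="greater")
print("exact binomial p =", binom.pvalue)

# 2) Large binary sample vs. an 80% benchmark: one-proportion z-test.
successes, n, p0 = 820, 1000, 0.80
p_hat = successes / n
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
print("z =", z, "p =", 1 - stats.norm.cdf(z))

# 3) Continuous measure vs. a benchmark mean: one-sample t-test.
#    Hypothetical SUS scores from 12 participants (mean 75), benchmark 68.
sus = np.array([72, 80, 65, 77, 85, 70, 74, 79, 68, 83, 76, 71])
t = stats.ttest_1samp(sus, popmean=68, alternative="greater")
print("t =", t.statistic, "p =", t.pvalue)

# 4) Right-skewed task times: t-test on the log scale, report the geometric mean.
times = np.array([34, 41, 29, 38, 120, 36, 44, 31])  # seconds, one slow outlier
log_t = stats.ttest_1samp(np.log(times), popmean=np.log(45), alternative="less")
print("geometric mean =", np.exp(np.log(times).mean()), "p =", log_t.pvalue)
```

The geometric mean in step 4 stays close to the typical completion time even with the 120-second outlier, which is exactly why the log-transform approach is preferred for skewed time data.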
-
🚀 𝐁𝐞𝐧𝐜𝐡𝐦𝐚𝐫𝐤𝐢𝐧𝐠 𝐌𝐋-𝐁𝐚𝐬𝐞𝐝 𝐏𝐃𝐄 𝐒𝐨𝐥𝐯𝐞𝐫𝐬: 𝐁𝐞𝐬𝐭 𝐏𝐫𝐚𝐜𝐭𝐢𝐜𝐞𝐬

Last week, I shared insights from the study 𝘞𝘦𝘢𝘬 𝘉𝘢𝘴𝘦𝘭𝘪𝘯𝘦𝘴 𝘢𝘯𝘥 𝘙𝘦𝘱𝘰𝘳𝘵𝘪𝘯𝘨 𝘉𝘪𝘢𝘴𝘦𝘴 𝘓𝘦𝘢𝘥 𝘵𝘰 𝘖𝘷𝘦𝘳𝘰𝘱𝘵𝘪𝘮𝘪𝘴𝘮 𝘪𝘯 𝘔𝘢𝘤𝘩𝘪𝘯𝘦 𝘓𝘦𝘢𝘳𝘯𝘪𝘯𝘨 𝘧𝘰𝘳 𝘍𝘭𝘶𝘪𝘥-𝘙𝘦𝘭𝘢𝘵𝘦𝘥 𝘗𝘢𝘳𝘵𝘪𝘢𝘭 𝘋𝘪𝘧𝘧𝘦𝘳𝘦𝘯𝘵𝘪𝘢𝘭 𝘌𝘲𝘶𝘢𝘵𝘪𝘰𝘯𝘴 by Nick McGreivy and Ammar Hakim (link in comments). The authors highlighted a crucial issue: many #ML -based #solvers aren't benchmarked against appropriate baselines, leading to misleading conclusions. ⚠️

𝐒𝐨, 𝐰𝐡𝐚𝐭’𝐬 𝐭𝐡𝐞 𝐫𝐢𝐠𝐡𝐭 𝐚𝐩𝐩𝐫𝐨𝐚𝐜𝐡? 🤔
The key lies in comparing the #Cost vs. #Accuracy of #algorithms, reflecting the inherent trade-off between efficiency and precision in numerical methods. While quick, low-accuracy approximations are common, highly accurate results typically require more computational time. ⏱️

📊 𝐀𝐜𝐜𝐮𝐫𝐚𝐜𝐲-𝐒𝐩𝐞𝐞𝐝 𝐏𝐚𝐫𝐞𝐭𝐨 𝐂𝐮𝐫𝐯𝐞𝐬: 𝐀 𝐁𝐞𝐧𝐜𝐡𝐦𝐚𝐫𝐤𝐢𝐧𝐠 𝐒𝐭𝐚𝐧𝐝𝐚𝐫𝐝
From my experience, the most effective way to benchmark is with Pareto curves of accuracy versus computational time. These curves offer a clear, visual comparison, showing how different methods perform under the same hardware conditions. They also mirror real-world engineering decisions, where finding a balance between speed and accuracy is critical. ⚖️ An example of this can be seen in Aditya Phopale's master's thesis, where the performance of a #NeuralNetwork-based solver was compared against the state-of-the-art general-purpose #Fenics solver.

🔍 𝐂𝐡𝐨𝐨𝐬𝐢𝐧𝐠 𝐭𝐡𝐞 𝐑𝐢𝐠𝐡𝐭 𝐁𝐚𝐬𝐞𝐥𝐢𝐧𝐞 𝐒𝐨𝐥𝐯𝐞𝐫
Nick McGreivy and Ammar Hakim also emphasize the importance of selecting an appropriate baseline. While Fenics might not be the top choice for computational efficiency on a specific problem (e.g., compared with spectral solvers), it is still highly relevant from an #engineering perspective. Both the investigated solver and Fenics share a similar philosophy: they are general-purpose, Python-based solvers built around equation formulations. 🧩 Additionally, unlike #FiniteElement solvers like Fenics, the investigated neural-network solvers don't require complex discretization. Thus, Fenics serves as a suitable baseline for practical engineering applications, despite its "limitations" in a more theoretical context.

💡 𝐖𝐡𝐚𝐭 𝐀𝐫𝐞 𝐘𝐨𝐮𝐫 𝐁𝐞𝐬𝐭 𝐏𝐫𝐚𝐜𝐭𝐢𝐜𝐞𝐬?
I'm curious to hear from others: what best practices do you follow when benchmarking ML-based PDE solvers? Let's discuss! 👇
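To make the accuracy-versus-time Pareto idea concrete, here is a small, self-contained work-precision sketch for a toy 1-D Poisson problem. The finite-difference baseline and the problem itself are illustrative stand-ins chosen by me, not the setup from the thesis or the paper; an ML surrogate would simply be timed and scored with the same protocol and added to the same plot.

```python
import time
import numpy as np
import matplotlib.pyplot as plt

# Toy work-precision study: solve u'' = -pi^2 sin(pi x), u(0)=u(1)=0
# (exact solution sin(pi x)) with a simple finite-difference baseline at several
# resolutions, recording wall time vs. RMS error.
def fd_solve(n):
    x = np.linspace(0.0, 1.0, n)
    h = x[1] - x[0]
    # Tridiagonal second-derivative operator on the interior points.
    A = (np.diag(-2.0 * np.ones(n - 2)) +
         np.diag(np.ones(n - 3), 1) + np.diag(np.ones(n - 3), -1)) / h**2
    f = -np.pi**2 * np.sin(np.pi * x[1:-1])
    u = np.zeros(n)
    u[1:-1] = np.linalg.solve(A, f)
    return x, u

points = []
for n in (16, 32, 64, 128, 256):
    t0 = time.perf_counter()
    x, u = fd_solve(n)
    runtime = time.perf_counter() - t0
    error = np.sqrt(np.mean((u - np.sin(np.pi * x)) ** 2))
    points.append((runtime, error))

plt.loglog(*zip(*points), "o-", label="finite differences (baseline)")
# plt.loglog(ml_runtimes, ml_errors, "s-", label="ML surrogate")  # same protocol for the ML solver
plt.xlabel("wall time [s]")
plt.ylabel("RMS error")
plt.legend()
plt.show()
```

Plotting every method's (time, error) points on the same log-log axes makes the baseline question unavoidable: an ML solver only "wins" if its curve sits below and to the left of a competently tuned classical method on the same hardware.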