Comparing LLM Variants for Model Performance


  • View profile for Chris Fregly

    Engineering and Product Leader, 3x O’Reilly Author (ex-AWS, Databricks, Netflix), Investor, Advisor, Friend

    42,192 followers

TL;DR
🧠 Smaller LLMs outperform giants: A 1B LLM can surpass a 405B LLM on reasoning tasks like MATH-500 using compute-optimal Test-Time Scaling (TTS).
🚀 Efficiency boost: Smaller models achieve higher accuracy with 14.1× faster inference and 256× fewer FLOPs compared to larger models.
🔍 Key insight: TTS strategies depend on policy model size, Process Reward Models (PRMs), and problem difficulty.

Problems & Solutions
🛑 Problem 1: Lack of systematic analysis of how policy models, PRMs, and problem difficulty affect TTS.
✅ Solution: Introduced reward-aware compute-optimal TTS to dynamically adapt strategies.
🛑 Problem 2: PRMs struggled with out-of-distribution (OOD) responses and token-length bias.
✅ Solution: Implemented absolute difficulty thresholds and PRM-Vote aggregation to improve robustness.

Experiments & Setup
📚 Tasks: MATH-500 (500 problems) and AIME24 (advanced math challenges).
🤖 Models: Llama 3 (1B-405B), Qwen2.5 (0.5B-72B), and DeepSeek-R1 variants.
⚖️ Metrics: Pass@k, token efficiency, FLOPs comparison.
🔧 Ablations: PRM scoring methods (Min/Last/Avg) and voting strategies (Majority/PRM-Max/PRM-Vote).
💻 Hardware: 8×A100 GPU clusters for TTS experiments with beam width = 4 and max tokens = 8192.

Novel Insights
🧩 Policy model size matters: Best-of-N (BoN) works well for large models, while Beam Search and DVTS excel for smaller ones.
📉 PRM limitations: Observed over-criticism, error neglect, and token-length bias in PRMs, impacting TTS performance.
⚖️ Trade-off: TTS gains diminish as policy model size increases (e.g., a 154.6% gain for 1B vs. 9.5% for 72B).

Improvements Over Prior Work
🚀 135× size gap: A 3B model outperforms a 405B model, improving on the prior 23× benchmark.
🔬 Enhanced PRMs: Qwen2.5-Math-PRM-72B enables 7B models to surpass o1 and DeepSeek-R1.
⏱️ Efficiency: A 1B model with TTS achieves 256× fewer FLOPs compared to 405B CoT models.

Key Implementation Details
🔄 Reward-aware TTS: Integrated PRM scores into a Markov Decision Process (MDP) framework for dynamic scaling.
🌳 DVTS: Parallel subtree exploration for diverse reasoning paths.
📉 Absolute difficulty bins: Replaced quantile-based thresholds with fixed Pass@1 ranges (easy: 50%-100%, medium: 10%-50%, hard: 0%-10%).

Resources
Paper: Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling (https://lnkd.in/g55ybikb)
🤖 Models: Llama-3.2-3B-Instruct (https://lnkd.in/gnQ3d87S), Qwen2.5-Math-PRM (https://lnkd.in/gk6gMqMw)
🔧 Framework: OpenR (https://lnkd.in/gCPxPR4H) for TTS pipelines
📊 Datasets: MATH-500 (https://lnkd.in/g4jvAzsp), PRM800K (https://lnkd.in/gEb6XE3A)
🌐 Project Page: Compute-Optimal TTS (https://lnkd.in/gVutpamZ)
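For intuition, here is a minimal sketch of the Best-of-N flavor of TTS with PRM-Vote answer selection. `generate` and `prm_score` are hypothetical stand-ins for a policy model and a Process Reward Model, not the paper's implementation (which also covers Beam Search and DVTS):

```python
# Minimal sketch of Best-of-N test-time scaling with PRM-based selection.
# `generate` and `prm_score` are hypothetical stand-ins for a policy model
# and a Process Reward Model.
from collections import defaultdict

def best_of_n(question, generate, prm_score, n=16):
    """Sample N candidate solutions, score each with a PRM, pick an answer."""
    candidates = [generate(question) for _ in range(n)]    # N full solutions
    scores = [prm_score(question, c) for c in candidates]  # per-solution reward

    # PRM-Vote: group candidates by final answer and sum PRM scores per answer,
    # so consistent answers backed by high rewards win (vs. PRM-Max, which
    # takes the single highest-scoring candidate).
    votes = defaultdict(float)
    for cand, score in zip(candidates, scores):
        votes[extract_answer(cand)] += score
    return max(votes, key=votes.get)

def extract_answer(solution: str) -> str:
    """Toy answer extractor: assumes the answer follows 'Answer:'."""
    return solution.rsplit("Answer:", 1)[-1].strip()
```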

  • View profile for Sebastian Raschka, PhD

    ML/AI research engineer. Author of Build a Large Language Model From Scratch (amzn.to/4fqvn0D) and Ahead of AI (magazine.sebastianraschka.com), on how LLMs work and the latest developments in the field.

    233,805 followers

Training LLMs for spam classification: I added 14 experiments comparing different approaches: https://lnkd.in/gTNVvGcj
- which token to train
- which layers to train
- different model sizes
- LoRA
- unmasking
- and more!

Any additional experiments you'd like to see?

And here are the takeaways for the table shown in the picture:

1. Training the Last vs. First Output Token (Row 1 vs. 2): Training the last output token results in substantially better performance compared to the first. This improvement is expected due to the causal self-attention mask.

2. Training the Last Transformer Block vs. Last Layer (Row 1 vs. 3): Training the entire last transformer block also results in substantially better results than training only the last layer.

3. Training All Layers vs. Last Transformer Block (Row 1 vs. 4): Training all layers shows a modest improvement of ~2% over just training the last transformer block, but it requires almost three times as long in terms of training duration.

4. Using Larger Pretrained Models (Row 1 vs. 5, and Row 1 vs. 6 and 7): Employing a 3x larger pretrained model leads to worse results. However, using a 5x larger model improves performance compared to the initial model, as was anticipated. Similarly, the 12x larger model improves the predictive performance even further. (The medium model was perhaps not well pretrained, or the particular finetuning configuration does not work as well for this model.)

5. Using a Model with Random Weights vs. Pretrained Weights (Row 1 vs. 8): Utilizing a model with random weights yields results that are only slightly worse (by 1.3%) compared to using pretrained weights.

6. Using LoRA (Low-Rank Adaptation) vs. Training All Layers (Row 9 vs. 4): Keeping the model frozen and adding trainable LoRA layers (see Appendix E for details) is a viable alternative to training all model parameters and even improves the performance by 1 percentage point. As can be seen from the roughly 1% smaller gap between the training and validation accuracy when using LoRA, this is likely due to less overfitting.

7. Padding Input to Full Context Length vs. Longest Training Example (Row 1 vs. 10): Padding the input to the full supported context length results in significantly worse performance.

8. Padding vs. No Padding (Row 1 vs. 11 and 12): The `--no_padding` option disables the padding in the dataset, which requires training the model with a batch size of 1 since the inputs have variable lengths. This results in better test accuracy but takes longer to train. In row 12, we additionally enable gradient accumulation with 8 steps to achieve the same batch size as in the other experiments.

9. Disabling the Causal Attention Mask (Row 1 vs. 13): Disables the causal attention mask used in the multi-head attention module. This means all tokens can attend to all other tokens. The model accuracy is slightly improved compared to the GPT model with causal mask.
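As a concrete illustration of takeaway 1, here is a minimal PyTorch sketch of the "train the last output token" setup, assuming a GPT-style backbone that returns per-token hidden states (the exact attribute names will differ by implementation):

```python
# Minimal sketch: classify spam from the LAST token's hidden state, since
# under a causal attention mask only the last position has seen the full input.
import torch
import torch.nn as nn

class SpamClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_dim: int, num_classes: int = 2):
        super().__init__()
        self.backbone = backbone                 # pretrained GPT-style model
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)        # (batch, seq_len, hidden_dim)
        last = hidden[:, -1, :]                  # last token sees the whole input
        return self.head(last)                   # (batch, num_classes)

# Row-3-style setup (train only the head + last transformer block);
# the `blocks` attribute name is hypothetical and varies by codebase:
# for p in model.backbone.parameters(): p.requires_grad = False
# for p in model.backbone.blocks[-1].parameters(): p.requires_grad = True
```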

  • View profile for Aishwarya Srinivasan
    628,042 followers

Most people still think of LLMs as “just a model.” But if you’ve ever shipped one in production, you know it’s not that simple. Behind every performant LLM system there’s a stack of decisions about pretraining, fine-tuning, inference, evaluation, and application-specific tradeoffs. This diagram captures it well: LLMs aren’t one-dimensional. They’re systems. And each dimension introduces new failure points or optimization levers. Let’s break it down:

🧠 Pre-Training
Start with modality.
→ Text-only models like LLaMA, UL2, PaLM have predictable inductive biases.
→ Multimodal ones like GPT-4, Gemini, and LaVIN introduce more complex token fusion, grounding challenges, and cross-modal alignment issues.
Understanding the data diet matters just as much as parameter count.

🛠 Fine-Tuning
This is where most teams underestimate complexity:
→ PEFT strategies like LoRA and Prefix Tuning help with parameter efficiency, but can behave differently under distribution shift.
→ Alignment techniques (RLHF, DPO, RAFT) aren’t interchangeable. They encode different human preference priors.
→ Quantization and pruning decisions directly impact latency, memory usage, and downstream behavior.

⚡️ Efficiency
Inference optimization is still underexplored. Techniques like dynamic prompt caching, paged attention, speculative decoding, and batch streaming make the difference between real-time and unusable (see the prompt-cache sketch after this post). The infra layer is where GenAI products often break.

📏 Evaluation
One benchmark doesn’t cut it. You need a full matrix:
→ NLG (summarization, completion), NLU (classification, reasoning),
→ alignment tests (honesty, helpfulness, safety),
→ dataset quality, and
→ cost breakdowns across training + inference + memory.
Evaluation isn’t just a model task, it’s a systems-level concern.

🧾 Inference & Prompting
Multi-turn prompts, CoT, ToT, ICL all behave differently under different sampling strategies and context lengths. Prompting isn’t trivial anymore. It’s an orchestration layer in itself.

Whether you’re building for legal, education, robotics, or finance, the “general-purpose” tag doesn’t hold. Every domain has its own retrieval, grounding, and reasoning constraints.

-------
Follow me (Aishwarya Srinivasan) for more AI insights and subscribe to my Substack for more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg
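To make one of the efficiency techniques concrete, here is a minimal sketch of dynamic prompt caching: reuse the prefill work for a repeated prompt prefix. `model.prefill` and `model.decode` are hypothetical stand-ins, not a real serving API:

```python
# Minimal sketch of dynamic prompt caching: pay the prefill cost for a shared
# system prompt only once, then decode each new request on top of the cached
# state. `prefill`/`decode` are hypothetical stand-ins for a serving runtime.
import hashlib

class PromptPrefixCache:
    def __init__(self, model):
        self.model = model
        self.cache = {}                                   # prefix hash -> KV state

    def generate(self, system_prompt: str, user_msg: str) -> str:
        key = hashlib.sha256(system_prompt.encode()).hexdigest()
        if key not in self.cache:
            # Prefill the shared prefix once; reuse for every later request.
            self.cache[key] = self.model.prefill(system_prompt)
        kv_state = self.cache[key]
        # Decode only the new tokens on top of the cached prefix state.
        return self.model.decode(kv_state, user_msg)
```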

  • View profile for Brij kishore Pandey

    AI Architect & Engineer | AI Strategist

    720,897 followers

Many people talk about LLMs like they’re all the same. But in reality, not all transformer-based models are built for the same job. If you truly want to design AI systems (not just use APIs), you must understand the 4 major LLM architecture families and what each one is optimized for. Here’s the simplified breakdown:

1) Decoder-Only Models (GPT, LLaMA)
Best for: text generation
- Chatbots
- Code generation
- Agent reasoning
- Next-token prediction
Key idea: Only predicts the next token based on prior context.

2) Encoder-Only Models (BERT, RoBERTa)
Best for: understanding + classification
- Semantic search embeddings
- Intent classification
- NER / sentiment analysis
- Feature extraction
Key idea: Reads the full input context at once and builds a deep representation.

3) Encoder–Decoder Models (T5, BART)
Best for: input → output transformations
- Summarization
- Translation
- Rewrite/paraphrase
- Structured generation tasks
Key idea: Encoder understands the input, decoder generates the output.

4) Mixture of Experts (MoE: Mixtral, GLaM)
Best for: scaling efficiently
- High performance at lower compute cost
- Activates only selected “experts” for each input
Key idea: Not all parameters are used every time → smarter compute usage.

Why does this matter? Because in real enterprise architecture:
- You don’t always need generation
- Sometimes you need embeddings
- Sometimes you need transformation
- Sometimes you need scale at lower cost

Understanding architecture = better AI design decisions. If you're building agentic workflows, RAG pipelines, or GenAI platforms… this is foundational knowledge. Which of these are you using most today — Decoder-only or MoE models?
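As a quick sketch, here is how the families map onto Hugging Face `transformers` entry points; the model names are just illustrative examples:

```python
# Each architecture family has its own transformers entry point.
from transformers import (
    AutoModelForCausalLM,    # decoder-only: generation
    AutoModel,               # encoder-only: embeddings / representations
    AutoModelForSeq2SeqLM,   # encoder-decoder: input -> output transforms
)

decoder_only    = AutoModelForCausalLM.from_pretrained("gpt2")
encoder_only    = AutoModel.from_pretrained("bert-base-uncased")
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# MoE models (e.g., Mixtral) are still loaded as causal LMs; the expert
# routing happens inside the architecture, not in the API:
# moe = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
```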

  • View profile for Greg Coquillo

    AI Infrastructure Product Leader | Scaling GPU Clusters for Frontier Models | Microsoft Azure AI & HPC | Former AWS, Amazon | Startup Investor | Linkedin Top Voice | I build the infrastructure that allows AI to scale

    229,005 followers

You need to check out the Agent Leaderboard on Hugging Face! One question that emerges amid the proliferation of AI agents is “which LLM actually delivers the most?” You’ve probably asked yourself this as well. That’s because LLMs are not one-size-fits-all: while some models thrive in structured environments, others don’t handle the unpredictable real world of tool calling well.

The team at Galileo🔭 evaluated 17 leading models on their ability to select, execute, and manage external tools, using 14 highly curated datasets. Today, AI researchers, ML engineers, and technology leaders can leverage insights from the Agent Leaderboard to build the best agentic workflows.

Some key insights that you can already benefit from:
- A model can rank well but still be inefficient at error handling, adaptability, or cost-effectiveness. Benchmarks matter, but qualitative performance gaps are real.
- Some LLMs excel in multi-step workflows, while others dominate single-call efficiency. Picking the right model depends on whether you need precision, speed, or robustness.
- While Mistral-Small-2501 leads OSS, closed-source models still dominate tool execution reliability. The gap is closing, but consistency remains a challenge.
- Some of the most expensive models barely outperform their cheaper competitors. Model pricing is still opaque, and performance per dollar varies significantly.
- Many models fail not in accuracy, but in how they handle missing parameters, ambiguous inputs, or tool misfires. These edge cases separate top-tier AI agents from unreliable ones.

Consider the guidance below to get going quickly:
1. For high-stakes automation, choose models with robust error recovery over just high accuracy.
2. For long-context applications, look for LLMs with stable multi-turn consistency, not just a good first response.
3. For cost-sensitive deployments, benchmark price-to-performance ratios carefully. Some “premium” models may not be worth the cost.

I expect this to evolve over time to highlight how models improve tool-calling effectiveness for real-world use cases.

Explore the Agent Leaderboard here: https://lnkd.in/dzxPMKrv

#genai #agents #technology #artificialintelligence
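A tiny sketch of the price-to-performance comparison suggested in point 3; the scores and prices below are made-up placeholders, not leaderboard data:

```python
# Rank models by tool-call accuracy per dollar. Numbers are hypothetical.
from dataclasses import dataclass

@dataclass
class ModelResult:
    name: str
    tool_call_accuracy: float   # fraction of successful tool calls
    usd_per_mtok: float         # blended $ per million tokens

def accuracy_per_dollar(r: ModelResult) -> float:
    return r.tool_call_accuracy / r.usd_per_mtok

results = [
    ModelResult("premium-model", 0.92, 15.0),   # placeholder numbers
    ModelResult("budget-model", 0.88, 0.6),
]
for r in sorted(results, key=accuracy_per_dollar, reverse=True):
    print(f"{r.name}: {accuracy_per_dollar(r):.3f} accuracy per $/Mtok")
```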

  • View profile for Kuldeep Singh Sidhu

    Senior Data Scientist @ Walmart | BITS Pilani

    16,027 followers

I just came across a groundbreaking paper titled "Benchmarking LLMs in Recommendation Tasks: A Comparative Evaluation with Conventional Recommenders" that provides comprehensive insights into how large language models (LLMs) perform in recommendation tasks. The researchers from The Hong Kong Polytechnic University, Huawei Noah's Ark Lab, Nanyang Technological University, and the National University of Singapore have developed RecBench, a systematic evaluation platform that thoroughly assesses the capabilities of LLMs in recommendation scenarios.

>> Key Technical Insights:

The benchmark evaluates various item representation forms:
- Unique identifiers (traditional approach)
- Text representations (using item descriptions)
- Semantic embeddings (leveraging pre-trained LLM knowledge)
- Semantic identifiers (using discrete encoding techniques like RQ-VAE)

The study covers two critical recommendation tasks:
- Click-through rate (CTR) prediction (pair-wise recommendation)
- Sequential recommendation (list-wise recommendation)

Their extensive experiments evaluated 17 different LLMs across five diverse datasets from the fashion, news, video, book, and music domains. The results are eye-opening:
- LLM-based recommenders outperform conventional recommenders by up to 5% AUC improvement in CTR prediction and a staggering 170% NDCG@10 improvement in sequential recommendation
- However, these performance gains come with significant computational costs, making real-time deployment challenging
- Conventional deep learning recommenders enhanced with LLM support can achieve 95% of standalone LLM performance while being thousands of times faster

Under the hood, the researchers implemented a conditional beam search technique for semantic identifier-based models to ensure valid item recommendations. They also employed low-rank adaptation (LoRA) for parameter-efficient fine-tuning of the large models.

Most interestingly, they found that while most LLMs have limited zero-shot recommendation abilities, models like Mistral, GLM, and Qwen-2 performed significantly better, likely due to exposure to more implicit recommendation signals during pre-training.

This research opens exciting avenues for recommendation system development while highlighting the need for inference acceleration techniques to make LLM-based recommenders practical for industrial applications.
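To illustrate the idea behind constraining generation to valid items, here is a minimal sketch of a prefix-trie filter over semantic identifiers. It is an illustration of the general technique, not the paper's implementation:

```python
# Mask next-token choices so the model can only emit prefixes of real item IDs.
def build_trie(item_ids):
    """item_ids: list of token-id tuples, one per catalog item."""
    trie = {}
    for ids in item_ids:
        node = trie
        for tok in ids:
            node = node.setdefault(tok, {})
    return trie

def allowed_next_tokens(trie, prefix):
    """Walk the generated prefix; return tokens that keep it a valid item."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return set()          # prefix is already invalid
    return set(node.keys())

# During beam search, logits outside allowed_next_tokens(...) would be set
# to -inf before selecting the top beams.
catalog = [(5, 17, 3), (5, 17, 9), (8, 2, 4)]   # toy semantic IDs
trie = build_trie(catalog)
print(allowed_next_tokens(trie, (5, 17)))        # {3, 9}
```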

  • View profile for Philipp Schmid

    AI Developer Experience at Google DeepMind 🔵 prev: Tech Lead at Hugging Face, AWS ML Hero 🤗 Sharing my own views and AI News

    165,277 followers

How good are LLMs at long-context RAG? Databricks Mosaic Research ran over 2,000 experiments with 13 open and closed LLMs on 4 curated RAG datasets. 👀

TL;DR:
📈 Retrieving more documents generally improves RAG performance, up to a point
🔬 Most models show decreased performance beyond a model-specific context size (e.g., 32k for Llama-3.1-405b, 64k for GPT-4)
🆚 Different models fail in distinct ways with long contexts (e.g., copyright concerns, summarization instead of answering)
🚫 Claude 3.5 copyright-related failures increased from 3.7% at 16k context to 49.5% at 64k context
👎🏻 DBRX failure to follow instructions jumped from 5.2% at 8k context to 50.4% at 32k
🔄 Mixtral started to generate repeated content ("梦梦梦梦梦梦")
🤷🏻‍♂️ LLMs still suffer from "lost in the middle", failing to effectively use information from the middle portions of long texts
📊 Optimal context size depends on both the model and the specific task
🧠 Lack of long-context post-training may be the reason for model failures

Blog: https://lnkd.in/eCXDuZPP

A good example where fine-tuning on domain/task-specific long-context synthetic datasets could significantly improve the performance of open models compared to closed models. 🚀
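A small sketch of the practical takeaway: cap retrieved context at a model-specific budget instead of filling the window. The budget values are illustrative, and `count_tokens` is a hypothetical tokenizer call:

```python
# Greedily pack retrieved docs (best first) up to a per-model token budget.
CONTEXT_BUDGET = {            # tokens where quality starts degrading (examples)
    "llama-3.1-405b": 32_000,
    "gpt-4": 64_000,
}

def pack_documents(docs, model_name, count_tokens, reserve=2_000):
    """docs assumed sorted by relevance; `reserve` leaves room for the answer."""
    budget = CONTEXT_BUDGET.get(model_name, 16_000) - reserve
    packed, used = [], 0
    for doc in docs:
        cost = count_tokens(doc)
        if used + cost > budget:
            break                 # stop before degrading the model's quality
        packed.append(doc)
        used += cost
    return packed
```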

  • View profile for Aditi Kulkarni

    Lead - Accenture Advanced Technology Centers - Global Network & India. | Passionate to help clients drive their enterprise transformation and innovation journey

    14,688 followers

I recently spent time getting more hands-on with LLM & Agentic AI engineering through Ed Donner's training. Instead of stopping at examples, I built a mini multi-agent logistics delivery optimization framework. Building real AI systems quickly makes one thing clear: 𝙏𝙝𝙚 𝙝𝙖𝙧𝙙 𝙥𝙖𝙧𝙩 𝙞𝙨𝙣’𝙩 𝙩𝙝𝙚 𝙢𝙤𝙙𝙚𝙡 — 𝙞𝙩’𝙨 𝙩𝙝𝙚 𝙖𝙧𝙘𝙝𝙞𝙩𝙚𝙘𝙩𝙪𝙧𝙚 𝙙𝙚𝙘𝙞𝙨𝙞𝙤𝙣𝙨 𝙖𝙧𝙤𝙪𝙣𝙙 𝙞𝙩.

A few practical lessons:

1. 𝗟𝗟𝗠 𝗺𝗼𝗱𝗲𝗹 𝘀𝗲𝗹𝗲𝗰𝘁𝗶𝗼𝗻 𝗶𝘀 𝗳𝗮𝗿 𝗺𝗼𝗿𝗲 𝗻𝘂𝗮𝗻𝗰𝗲𝗱 𝘁𝗵𝗮𝗻 𝗰𝗼𝘀𝘁 𝘃𝘀 𝗹𝗮𝘁𝗲𝗻𝗰𝘆.
Trade-offs:
• reasoning maturity for complex planning
• context window & memory strategy
• proprietary models vs smaller open models
• infra costs (GPU/hosting) vs token-based API costs
• tool-calling reliability & structured output adherence
• benchmark performance vs real task behavior
• model stability across releases
In practice, it becomes a hybrid strategy: 𝘀𝗺𝗮𝗹𝗹𝗲𝗿/𝗰𝗵𝗲𝗮𝗽𝗲𝗿 𝗺𝗼𝗱𝗲𝗹𝘀 𝗳𝗼𝗿 𝗿𝗼𝘂𝘁𝗶𝗻𝗲 𝘁𝗮𝘀𝗸𝘀 + 𝗦𝗟𝗠 𝘄𝗶𝘁𝗵 𝗳𝗶𝗻𝗲-𝘁𝘂𝗻𝗶𝗻𝗴 𝗳𝗼𝗿 𝗱𝗼𝗺𝗮𝗶𝗻 𝗽𝗿𝗼𝗯𝗹𝗲𝗺𝘀 + 𝘀𝘁𝗿𝗼𝗻𝗴𝗲𝗿 𝗿𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 𝗺𝗼𝗱𝗲𝗹𝘀 𝗳𝗼𝗿 𝗰𝗼𝗺𝗽𝗹𝗲𝘅 𝗱𝗲𝗰𝗶𝘀𝗶𝗼𝗻𝘀 (see the routing sketch after this post).

2. 𝗗𝗲𝘃𝗲𝗹𝗼𝗽𝗺𝗲𝗻𝘁 𝗮𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲 𝗺𝗮𝘁𝘁𝗲𝗿𝘀 𝗮𝘀 𝗺𝘂𝗰𝗵 𝗮𝘀 𝘁𝗵𝗲 𝗟𝗟𝗠:
Many AI demos over-engineer the stack. In reality, simplicity, latency, security and reliability matter more than novelty.
• Use orchestration frameworks only where coordination complexity exists
• Combine prompts with structured outputs to reduce ambiguity
• Watch serialization and tool-call overhead — they impact latency and UX
• Reduce unnecessary LLM calls when deterministic code can solve the task
Besides lowering token cost, this improves context efficiency, letting models focus on real reasoning. Sometimes the best architecture decision is 𝙣𝙤𝙩 𝙞𝙣𝙩𝙧𝙤𝙙𝙪𝙘𝙞𝙣𝙜 𝙖𝙣𝙤𝙩𝙝𝙚𝙧 𝙡𝙖𝙮𝙚𝙧.

3. 𝗕𝗶𝗴𝗴𝗲𝗿 𝗺𝗼𝗱𝗲𝗹𝘀 ≠ 𝗯𝗲𝘁𝘁𝗲𝗿 𝗼𝘂𝘁𝗰𝗼𝗺𝗲𝘀
Smaller models with fine-tuning on domain data can perform more consistently than larger ones. Fine-tuning helps when:
• tasks are repetitive but require precision
• domain vocabulary is specialized
• prompts become fragile
But 𝗳𝗶𝗻𝗲-𝘁𝘂𝗻𝗶𝗻𝗴 𝗮𝗹𝘀𝗼 𝗶𝗻𝘁𝗿𝗼𝗱𝘂𝗰𝗲𝘀 𝗹𝗶𝗳𝗲𝗰𝘆𝗰𝗹𝗲 𝗼𝘃𝗲𝗿𝗵𝗲𝗮𝗱. Base model upgrades trigger retesting and partial rewrites.

4. 𝗧𝗵𝗲 𝗿𝗲𝗮𝗹 𝗴𝗮𝗽: 𝗽𝗿𝗼𝘁𝗼𝘁𝘆𝗽𝗲 → 𝗽𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻
Demos are easy. Production requires 𝙚𝙫𝙖𝙡𝙪𝙖𝙩𝙞𝙤𝙣 𝙛𝙧𝙖𝙢𝙚𝙬𝙤𝙧𝙠𝙨, 𝙤𝙗𝙨𝙚𝙧𝙫𝙖𝙗𝙞𝙡𝙞𝙩𝙮, 𝙨𝙚𝙘𝙪𝙧𝙞𝙩𝙮, 𝙥𝙚𝙧𝙛𝙤𝙧𝙢𝙖𝙣𝙘𝙚, 𝙘𝙤𝙨𝙩 𝙜𝙤𝙫𝙚𝙧𝙣𝙖𝙣𝙘𝙚 & 𝙜𝙪𝙖𝙧𝙙𝙧𝙖𝙞𝙡𝙨. That’s where most engineering effort goes.

5. 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗳𝗼𝗿 𝗹𝗲𝗮𝗱𝗲𝗿𝘀 𝗿𝘂𝗻𝗻𝗶𝗻𝗴 𝗔𝗜 𝗽𝗿𝗼𝗴𝗿𝗮𝗺𝘀
Many AI conversations focus on SDLC productivity. Useful, but the bigger opportunity is 𝙧𝙚𝙞𝙢𝙖𝙜𝙞𝙣𝙞𝙣𝙜 𝙡𝙚𝙜𝙖𝙘𝙮 𝙗𝙪𝙨𝙞𝙣𝙚𝙨𝙨 𝙥𝙧𝙤𝙘𝙚𝙨𝙨𝙚𝙨 𝙪𝙨𝙞𝙣𝙜 𝘼𝙜𝙚𝙣𝙩𝙞𝙘 AI. By simply automating existing steps, we risk making inefficient tasks efficient and missing the real transformation.
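Here is a minimal sketch of the hybrid routing strategy from lesson 1, combined with lesson 2's "deterministic code first" rule. All callables are injected, hypothetical stand-ins, not a specific framework's API:

```python
# Route requests: plain code where possible, a small model for routine work,
# a stronger reasoning model only for complex planning.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Router:
    deterministic: dict[str, Callable[[str], str]]  # task kind -> plain-code handler
    small_model: Callable[[str], str]               # cheap, fast SLM
    reasoning_model: Callable[[str], str]           # stronger, slower model

    def handle(self, kind: str, prompt: str, complex_task: bool = False) -> str:
        if kind in self.deterministic:
            return self.deterministic[kind](prompt)  # no LLM call at all
        if not complex_task:
            return self.small_model(prompt)          # routine -> smaller model
        return self.reasoning_model(prompt)          # planning -> reasoning model
```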

  • View profile for Weili Xu

    Senior Research Engineer | Team Lead

    1,886 followers

I read a paper from NVIDIA Research last month that made a strong case for shifting from giant large language models (LLMs) to leaner, more specialized small language models (SLMs). I couldn’t agree more. https://lnkd.in/gbBNd_Bm

Here are my top three takeaways:
1. Efficiency First – Models under 10B parameters consume fewer tokens, run faster, and cost significantly less to operate. Lower latency, reduced infrastructure demands, and greener AI.
2. Specialized Power – While large models excel at general conversation, small models shine in narrowly scoped tasks. Fine-tuning for a specific job can often match or exceed the performance of much larger models.
3. Better Fit for Agentic Systems – Most AI agents repeat structured, tool-based actions. SLMs are easier to fine-tune, deploy on-device, and integrate into modular multi-agent workflows, resulting in faster, cheaper, and more aligned systems.

To test the theory, I built a specialized agent that generates a typical energy model based on building type and climate zone. I swapped between Qwen3:14B and Qwen3:4B on my local computer (M3, 18GB RAM). Running the same user query to generate results:
- Qwen3:14B – Input tokens: 3,052 | Output tokens: 2,070 | Duration: 164.24 s
- Qwen3:4B – Input tokens: 2,048 | Output tokens: 619 | Duration: 8.34 s

That’s about 30% fewer input tokens and 20× faster, achieving the same result. Sometimes the future of AI is not about going bigger, but about going smaller, smarter, and faster.

#AI #ArtificialIntelligence #MachineLearning #LLM #SLM #SmallLanguageModels #LargeLanguageModels #AgenticAI #MultiAgentSystems #EdgeAI #OnDeviceAI #NaturalLanguageProcessing #EnergyModeling #BuildingPerformance #EfficiencyInAI #TokenOptimization #ModelOptimization #AITesting #AIResearch
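A sketch of how such a local comparison can be run. This assumes the models are served by Ollama (which the qwen3:14b/qwen3:4b naming suggests, though the post does not state the runtime); the endpoint and response fields below are Ollama's:

```python
# Compare token counts and wall-clock duration for two local models.
import requests

def run(model: str, prompt: str) -> dict:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    ).json()
    return {
        "input_tokens": resp.get("prompt_eval_count"),
        "output_tokens": resp.get("eval_count"),
        "duration_s": resp.get("total_duration", 0) / 1e9,  # reported in ns
    }

prompt = "Generate a typical energy model for a mid-rise office in climate zone 4A."
for model in ("qwen3:14b", "qwen3:4b"):
    print(model, run(model, prompt))
```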

  • Benchmarking LLMs for voice agent use cases. New open-source repo, along with a deep dive into how we think about measuring LLM performance.

    The headline results:
    - The newest SOTA models are all *really* good, but too slow for production voice agents. GPT-4.1 and Gemini 2.5 Flash are still the most widely used models in production. The benchmark shows why.
    - Ultravox 0.7 shows that it's possible to close the "intelligence gap" between speech-to-speech models and text-mode LLMs. This is a big deal!
    - Open weights models are climbing up the capability curve. Nemotron 3 Nano is almost as capable as GPT-4o, and achieves this with only 30B parameters. GPT-4o was the most widely used model for voice agents until quite recently, so a small open weights model scoring this well is a strong indication that production use of open weights models will grow this year.

    Voice agents are a moderately "out of distribution" use case for all of our SOTA LLMs today. Literally, in the sense that there's not enough long, multi-turn conversation data in the training sets. Everyone who builds voice agents knows this intuitively, from doing lots of manual testing. (Vibes-based evals!) This benchmark scores LLMs quantitatively on instruction following, tool calling, and knowledge retrieval in long-context, multi-turn conversations.

    Blog post: https://lnkd.in/eBygqsTR
    Benchmark code: https://lnkd.in/eTaKZMwj

    Side note: we call this the aiwf_medium_context benchmark because it's a descendant of tooling we originally built to test the performance of the pre-release Gemini Live model that powered the @aidotengineer World's Fair voice concierge.
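A rough sketch of the latency dimension behind the "too slow for production" point: measuring time-to-first-token over an OpenAI-compatible streaming API, which is what makes a model feel usable (or not) in a voice agent. This is a generic illustration, not the aiwf_medium_context benchmark's code:

```python
# Measure time-to-first-token (TTFT) for a chat model via streaming.
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def time_to_first_token(model: str, messages: list[dict]) -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start   # first content token arrived
    return float("nan")

msgs = [{"role": "user", "content": "Hi, I'd like to change my reservation."}]
print(f"TTFT: {time_to_first_token('gpt-4.1', msgs):.3f}s")
```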
