Researchers from Oxford University achieved a 14% relative performance boost in mathematical reasoning by making LLMs work together like specialists in a company. In their new MALT (Multi-Agent LLM Training) paper, they introduce an approach in which three specialized LLMs - a generator, a verifier, and a refinement model - collaborate to solve complex problems, much as a programmer, a tester, and a supervisor work together.

The breakthrough lies in their training method:
(1) Tree-based exploration - generating thousands of reasoning trajectories by having the models interact
(2) Credit attribution - identifying which model is responsible for each success or failure
(3) Specialized training - using both correct and incorrect examples to train each model for its specific role

Using this approach on 8B-parameter models, MALT achieved relative improvements of 14% on the MATH dataset, 9% on CommonsenseQA, and 7% on GSM8K. This represents a significant step toward more efficient and capable AI systems, showing that well-coordinated smaller models can match the performance of much larger ones.

Paper: https://lnkd.in/g6ag9rP4
Improving LLM Performance for Algorithm Discovery
Explore top LinkedIn content from expert professionals.
Summary
Improving LLM performance for algorithm discovery means making large language models more skilled at finding, designing, or reasoning about algorithms—allowing them to solve complex problems more reliably and creatively. Researchers are now combining advanced training methods, real-time adaptation, and model collaboration to unlock new levels of AI problem-solving capabilities.
- Embrace collaborative models: Encourage specialized language models to work together, with each focusing on generating, verifying, or refining solutions, to boost performance on challenging reasoning tasks.
- Try real-time adaptation: Use techniques like test-time training, where models adjust their parameters while solving each new problem, to help them tackle unfamiliar tasks more successfully.
- Merge models strategically: Combine multiple language models using advanced merging methods, such as SLERP or branch-solve-merge, to tap into their unique strengths and create more robust and consistent results.
I discovered I was designing my AI tools backwards. Here's an example. This was my newsletter processing chain: reading emails, calling a newsletter processor, extracting companies, & then adding them to the CRM. This involved four different steps, costing $3.69 for every thousand newsletters processed.

Before: Newsletter Processing Chain (first image)

Then I created a unified newsletter tool which combined everything using the Google Agent Development Kit, Google's framework for building production-grade AI agent tools: (second image)

Why is the unified newsletter tool more complicated? It includes multiple actions in a single interface (process, search, extract, validate), implements state management that tracks usage patterns & caches results, has rate limiting built in, & produces structured JSON outputs with metadata instead of plain text.

But here's the counterintuitive part: despite being more complex internally, the unified tool is simpler for the LLM to use because it provides consistent, structured outputs that are easier to parse, even though those outputs are longer.

To understand the impact, we ran tests of 30 iterations per test scenario. The results show the impact of the new architecture: (third image)

We were able to reduce tokens by 41% (p=0.01, statistically significant), which translated linearly into cost savings. The success rate improved by 8% (p=0.03), & we were able to hit the cache 30% of the time, which is another cost savings. While individual tools produced shorter, "cleaner" responses, they forced the LLM to work harder parsing inconsistent formats. Structured, comprehensive outputs from unified tools enabled more efficient LLM processing, despite being longer.

My workflow relied on dozens of specialized Ruby tools for email, research, & task management. Each tool had its own interface, error handling, & output format. By rolling them up into meta tools, the ultimate performance is better, & there's tremendous cost savings.
You can find the complete architecture on GitHub.
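The pattern described above - one multi-action interface with caching, rate limiting, and structured JSON output - can be sketched roughly as follows. The author's tools are Ruby; this minimal illustration uses Python, and the class name, action set, and toy extraction logic are all hypothetical stand-ins, not the actual GitHub architecture:

```python
import json
import time

class UnifiedNewsletterTool:
    """Toy sketch of a unified multi-action tool: one entry point,
    result caching, rate limiting, and structured JSON-style output."""

    def __init__(self, rate_limit_per_sec=5):
        self._cache = {}
        self._min_interval = 1.0 / rate_limit_per_sec
        self._last_call = 0.0

    def call(self, action, payload):
        key = (action, json.dumps(payload, sort_keys=True))
        if key in self._cache:  # cache hit: skip the work entirely
            return {**self._cache[key], "cached": True}
        now = time.monotonic()  # built-in rate limiting
        wait = self._min_interval - (now - self._last_call)
        if wait > 0:
            time.sleep(wait)
        self._last_call = time.monotonic()
        result = {              # consistent, structured output with metadata
            "action": action,
            "data": self._dispatch(action, payload),
            "cached": False,
            "meta": {"version": 1},
        }
        self._cache[key] = result
        return result

    def _dispatch(self, action, payload):
        if action == "extract":
            # placeholder "company extraction": capitalized tokens only
            return [w for w in payload["text"].split() if w.istitle()]
        raise ValueError(f"unknown action: {action}")

tool = UnifiedNewsletterTool()
first = tool.call("extract", {"text": "Acme raised funding from Sequoia"})
second = tool.call("extract", {"text": "Acme raised funding from Sequoia"})
```

The key design choice mirrors the post: every action returns the same envelope (`action`, `data`, `cached`, `meta`), so the LLM parses one format instead of dozens.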
-
Exciting Research Alert: Enhancing Dense Retrieval with Deliberate Thinking

I just came across a fascinating new paper titled "Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search" that introduces DEBATER (Deliberate Thinking based Dense Retriever), a novel approach to improving information retrieval using large language models. The research team from Northeastern University and Tsinghua University has developed a method that significantly outperforms existing dense retrieval systems by enabling LLMs to "think deliberately" before generating document representations.

>> Technical Details

DEBATER enhances LLM-based retrievers through two key mechanisms:

1. Chain-of-Deliberation (CoD): This approach delays the computation of document embeddings by first performing several steps of reasoning. It incorporates a sequence of prompt tokens that stimulate the reasoning capability of LLMs, encouraging the model to think step by step before producing the final document embedding.

2. Self-Distillation (SD): This mechanism distills knowledge from different thinking steps into the final document representation. It identifies the most informative thinking steps and integrates them into a unified text embedding.

The implementation uses cosine similarity to measure the similarity between queries and documents. During training, DEBATER calculates similarity scores between the query representation and the document representations at each thinking step, then selects the most useful thinking step from the CoD.

>> Performance

What's particularly impressive is that DEBATER-4B outperforms larger 7B-scale LLM-based dense retrievers while using significantly fewer parameters. In experiments on the BEIR benchmark, DEBATER achieved more than a 2% improvement over baseline retrievers. The researchers found that an appropriate thinking depth (around 4-8 steps) effectively activates the reasoning capabilities of LLM-based retrievers. Interestingly, larger models benefit more from extended reasoning due to their stronger ability to integrate and retain detailed intermediate steps.

This work represents an important advance in dense retrieval technology, leveraging the reasoning capabilities of LLMs to generate more effective document representations. If you're interested in information retrieval or LLMs, this paper is definitely worth checking out!
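The training-time step selection can be illustrated with plain cosine similarity. Everything below, embeddings included, is a toy stand-in for DEBATER's learned representations:

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def best_thinking_step(query_emb, step_embs):
    # Score every intermediate thinking step against the query and keep
    # the most informative one, as DEBATER does when selecting from the
    # Chain-of-Deliberation during training.
    sims = [cosine(query_emb, s) for s in step_embs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims

query = [1.0, 0.0, 0.0]
steps = [
    [0.0, 1.0, 0.0],   # step 0: unrelated to the query
    [1.0, 0.2, 0.0],   # step 1: closest to the query
    [0.5, 0.9, 0.0],   # step 2: partially related
]
best, sims = best_thinking_step(query, steps)
```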
-
Given a single model, how do we improve an #LLM's reasoning performance with limited resources 💻 and inference time ⌛️? Can a smaller 1.5B model outperform a 7B model without incurring the long inference time of sequential queries? In the work of Gregory Lau, Wenyang Hu, See-Kiong Ng, Bryan Kian Hsiang Low, et al., presented at the #NeurIPS2024 Workshop on Foundation Model Interventions, we introduce a framework called Dipper that creates #LLM ensembles from an optimized set of diverse reasoning prompts to improve performance. Unlike sequential inference-time methods, Dipper runs queries in parallel, making it super fast ⏩️ and effective. Furthermore, Dipper can work with LLM APIs without model access 📦! With Dipper, we demonstrated how a small ensemble of just three 1.5B models can outperform a 7B model on MATH, while taking almost the same inference time and less than 3x the compute of a normal query, thanks to accelerated batch inference methods 😱! Paper: https://lnkd.in/gXvmh_9X
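The parallel diverse-prompt ensemble can be sketched as below. `query_model` is a canned stub standing in for a real LLM API call, and majority voting stands in for answer aggregation - a simplification of Dipper's actual pipeline, which optimizes the prompt set:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an LLM API call; the real framework needs no
# model access, only an API, so a plain function is a fair caricature.
def query_model(prompt, question):
    canned = {
        "Think step by step.": "42",
        "Work backwards from the answer.": "42",
        "Estimate first, then refine.": "41",
    }
    return canned[prompt]

def dipper_ensemble(question, prompts):
    # Fire all diverse-prompt queries in parallel (not sequentially),
    # then aggregate by majority vote.
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        answers = list(pool.map(lambda p: query_model(p, question), prompts))
    return Counter(answers).most_common(1)[0][0]

prompts = [
    "Think step by step.",
    "Work backwards from the answer.",
    "Estimate first, then refine.",
]
answer = dipper_ensemble("toy question", prompts)
```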
-
Reasoning Models 2.0: combine reasoning with tool use! ✨ START teaches LLMs to use tools, such as a code interpreter, to improve reasoning and problem-solving. Self-taught Reasoner with Tools (START) integrates tool usage with chain-of-thought reasoning by enabling tool calls, self-check, exploration, and self-debug while reasoning, using a self-learning framework.

👀 Implementation
1️⃣ Collect math problems (AIME, MATH) and coding tasks (Codeforces, LiveCodeBench)
2️⃣ Create context-specific hints like "Maybe using Python here is a good idea"
3️⃣ Generate tool-assisted reasoning data (insert hints after conjunctions like "Wait" and before stop tokens)
4️⃣ Score trajectories, remove repetitive patterns, and create a seed dataset of successful tool-assisted reasoning examples
5️⃣ Fine-tune the model on the seed dataset, then self-distill to generate more diverse reasoning trajectories
6️⃣ Fine-tune the base model using rejection sampling fine-tuning (RFT) on the extended dataset

Insights
💡 Improves math accuracy by +15% (AMC23: 95.0%) and coding by +38.6% on medium problems.
📈 Test-time scaling via sequential hints boosts AIME24 performance by 12%.
🐞 Code template modification reduces debug errors by 41% in training data.
💡 Adding tools (a Python interpreter) improves performance more than adding more training data.
🧠 Large models already possess latent tool-using abilities that can be activated through hints.
🛠️ Two-phase training (Hint-RFT, then RFT) lets the model learn effective tool usage.
📍 Hint placement matters: after a conjunction token and before the stop token.

Paper: https://lnkd.in/emF_m8Qz
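The hint-insertion step in the data construction can be illustrated with a toy string manipulation; the actual pipeline operates on model reasoning traces at token boundaries, so treat this as a caricature:

```python
def insert_hint(reasoning, hint, conjunction="Wait"):
    """Insert a tool-use hint right after a conjunction token, in the
    spirit of START's Hint-RFT data construction (simplified)."""
    idx = reasoning.find(conjunction)
    if idx == -1:
        # Fallback: append the hint at the end, before the stop token.
        return reasoning + " " + hint
    cut = idx + len(conjunction)
    return reasoning[:cut] + ", " + hint + reasoning[cut:]

trace = "The sum looks large. Wait let me recheck the arithmetic."
hinted = insert_hint(trace, "maybe using Python here is a good idea")
```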
-
#LLMs trained with multi-token prediction show improved #performance and faster inference, especially for #code generation tasks. Researchers at AI at Meta have developed a novel approach to training large language models (LLMs) that demonstrates significant improvements. By predicting multiple future tokens simultaneously during training, models achieve better sample #efficiency and downstream performance compared to traditional next-token prediction.

Key findings:
- Up to 17% improvement on coding benchmarks for 13B-parameter models
- 3x faster inference using self-speculative decoding
- Increasingly beneficial as model size grows (tested on models from 300M to 13B parameters)
- Promotes learning of longer-term patterns and algorithmic reasoning

This method addresses inefficiencies in how LLMs currently learn language and reasoning capabilities, potentially reducing the massive amounts of training data required. This research was introduced in April as a new training approach for better & faster LLMs using multi-token prediction; to enable further exploration by researchers, pretrained models for code completion using this approach are now available on Hugging Face.

Paper: https://lnkd.in/gKY8CDxi
Hugging Face: https://lnkd.in/gMeVxmBb
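To see how the training target changes, here is a small illustration of how next-token vs. multi-token training examples are constructed from a token stream. This is a data-level simplification of Meta's setup, which uses n parallel output heads over a shared trunk:

```python
def next_token_examples(tokens):
    # Standard next-token prediction: one target token per position.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def multi_token_examples(tokens, n=4):
    # Multi-token prediction: n future target tokens per position,
    # one per prediction head in the multi-head formulation.
    return [(tokens[:i], tokens[i:i + n])
            for i in range(1, len(tokens) - n + 1)]

toks = ["def", "add", "(", "a", ",", "b", ")", ":"]
single = next_token_examples(toks)
multi = multi_token_examples(toks, n=4)
```

Each context now supervises four futures instead of one, which is where the extra sample efficiency per training document comes from.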
-
🤖 New paper on improving #Agents and #LLMs through test-time learning (without updating parameters!) 🔍 We built a simple dynamic memory system that allows language models to track their best problem-solving strategies during inference. On "Game of 24", GPT-4o hit 99% accuracy after discovering (and remembering!) an efficient Python solution it could apply to all test cases. 🧠 You can reproduce this method with a few lines of code! See also: 📕 Paper: https://lnkd.in/dhqFVnVP 💻 Code: https://lnkd.in/drxHyDU9 Work led by Mirac Suzgun, with Mert Yuksekgonul, James Zou and Dan Jurafsky!
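A toy version of such a dynamic memory, using a hypothetical `discover_strategy` as a stand-in for the expensive LLM call (here it just returns a brute-force Game-of-24 solver); the point is that discovery happens once and the remembered strategy is reused on later instances without any parameter updates:

```python
memory = {}  # task type -> remembered strategy (a callable)

def discover_strategy(task_type):
    # Hypothetical stand-in for an LLM writing a solver; here, a
    # brute-force Game-of-24 checker over left-associated expressions.
    from itertools import permutations, product

    def game24(nums):
        for perm in permutations(nums):
            for ops in product("+-*/", repeat=3):
                expr = (f"(({perm[0]}{ops[0]}{perm[1]}){ops[1]}"
                        f"{perm[2]}){ops[2]}{perm[3]}")
                try:
                    if abs(eval(expr) - 24) < 1e-6:
                        return expr
                except ZeroDivisionError:
                    pass
        return None

    return game24

def solve(task_type, instance):
    if task_type not in memory:            # discover a strategy once...
        memory[task_type] = discover_strategy(task_type)
    return memory[task_type](instance)     # ...reuse it on every later case

first = solve("game24", [4, 6, 1, 1])   # triggers discovery
second = solve("game24", [2, 3, 4, 1])  # reuses the remembered solver
```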
-
New paper: https://lnkd.in/gYJ4_XTz A groundbreaking new RAG architecture: REFRAG: Rethinking RAG-based Decoding REFRAG tackles one of LLM systems’ biggest challenges: - balancing fast responses with deep retrieved context. Instead of overwhelming the model with full token sequences, it intelligently injects compressed chunk embeddings, reusing retrieval work and drastically speeding up inference without sacrificing accuracy. A lightweight RL policy determines when to decompress key chunks, offering dynamic, efficient flexibility. The results? Up to 30× faster time to first token, with no loss in RAG performance, and often better accuracy than baselines when operating within the same latency budget. This is a big leap toward making knowledge-intensive AI applications faster and more scalable. Perfect for anyone working on real-time retrieval, agentic systems, or multi-turn dialog.
-
Last week, I described four design patterns for AI agentic workflows that I believe will drive significant progress: Reflection, Tool use, Planning and Multi-agent collaboration. Instead of having an LLM generate its final output directly, an agentic workflow prompts the LLM multiple times, giving it opportunities to build step by step to higher-quality output.

Here, I'd like to discuss Reflection. It's relatively quick to implement, and I've seen it lead to surprising performance gains. You may have had the experience of prompting ChatGPT/Claude/Gemini, receiving unsatisfactory output, delivering critical feedback to help the LLM improve its response, and then getting a better response. What if you automate the step of delivering critical feedback, so the model automatically criticizes its own output and improves its response? This is the crux of Reflection.

Take the task of asking an LLM to write code. We can prompt it to generate the desired code directly to carry out some task X. Then, we can prompt it to reflect on its own output, perhaps as follows:

Here's code intended for task X: [previously generated code]
Check the code carefully for correctness, style, and efficiency, and give constructive criticism for how to improve it.

Sometimes this causes the LLM to spot problems and come up with constructive suggestions. Next, we can prompt the LLM with context including (i) the previously generated code and (ii) the constructive feedback, and ask it to use the feedback to rewrite the code. This can lead to a better response. Repeating the criticism/rewrite process might yield further improvements. This self-reflection process allows the LLM to spot gaps and improve its output on a variety of tasks including producing code, writing text, and answering questions.

And we can go beyond self-reflection by giving the LLM tools that help evaluate its output; for example, running its code through a few unit tests to check whether it generates correct results on test cases, or searching the web to double-check text output. Then it can reflect on any errors it found and come up with ideas for improvement.

Further, we can implement Reflection using a multi-agent framework. I've found it convenient to create two agents, one prompted to generate good outputs and the other prompted to give constructive criticism of the first agent's output. The resulting discussion between the two agents leads to improved responses.

Reflection is a relatively basic type of agentic workflow, but I've been delighted by how much it improved my applications' results. If you're interested in learning more about reflection, I recommend:
- Self-Refine: Iterative Refinement with Self-Feedback, by Madaan et al. (2023)
- Reflexion: Language Agents with Verbal Reinforcement Learning, by Shinn et al. (2023)
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, by Gou et al. (2024)

[Original text: https://lnkd.in/g4bTuWtU ]
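The generate/criticize/rewrite loop described above can be sketched in a few lines. `llm` here is a hypothetical stub standing in for a real model call, with canned responses so the loop is runnable end to end:

```python
def llm(prompt):
    # Stub model: returns a buggy draft, a canned critique, and a fix.
    if "constructive criticism" in prompt:
        return "The loop is off by one; use range(n) instead of range(n+1)."
    if "use the feedback" in prompt:
        return "def squares(n): return [i*i for i in range(n)]"
    return "def squares(n): return [i*i for i in range(n+1)]"

def reflect(task, rounds=1):
    # 1. Generate a first draft directly.
    draft = llm(f"Write code for: {task}")
    for _ in range(rounds):
        # 2. Ask the model to criticize its own output.
        critique = llm(f"Here is code for {task}:\n{draft}\n"
                       "Give constructive criticism for how to improve it.")
        # 3. Ask it to rewrite the code using that feedback.
        draft = llm(f"Here is code and feedback for {task}:\n"
                    f"{draft}\n{critique}\n"
                    "Please use the feedback to rewrite the code.")
    return draft

final = reflect("the first n squares")
```

Raising `rounds` repeats the criticism/rewrite cycle, which is the "repeating the process might yield further improvements" step above.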
-
Is this the next level of LLM "thinking"? We've been trying to make AI better at reasoning by forcing it to show more work - longer chains of thought, more attempts, more compute. But what if there was another way?

New research from CMU and Stanford introduces something fascinating: teaching LLMs to first generate their own "reasoning abstractions" - essentially, creating their own cheat sheets before solving problems. Think of it like a student who, before tackling a math problem, first writes down the key principles and potential pitfalls they should watch for.

The approach is surprisingly simple. They train two models that work together: one generates helpful hints and strategies (the abstraction generator), and another uses these hints to solve problems (the solution generator). The abstraction generator gets rewarded when its hints actually help solve problems, creating a virtuous cycle of improvement.

The results? A 44% improvement over previous state-of-the-art methods on challenging math competitions. One interesting thing to note: when given more compute budget, it's actually more effective to generate diverse strategies than to just try solving the problem more times. This suggests our models might have been stuck in local reasoning patterns, and abstractions help them explore genuinely different approaches. The same technique improved performance by 30% across legal reasoning, medical diagnosis, and other domains.

Are we watching LLMs develop something that looks increasingly like metacognition? The ability to think about how to think. That's a capability that could fundamentally change how we deploy these systems in education, research, and problem-solving.
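The two-model reward loop can be caricatured in a few lines. Every function below is a hypothetical stub; in the actual method the reward signal is used to fine-tune the abstraction generator, not merely to score hints:

```python
def abstraction_generator(problem):
    # Stub: propose candidate "cheat sheet" hints for the problem.
    return ["check for off-by-one errors", "try small cases first"]

def solution_generator(problem, hint):
    # Stub: attempt the problem conditioned on a hint; only one hint
    # happens to "work" on this toy problem.
    return 24 if hint == "try small cases first" else 25

def rewarded_hints(problem, target):
    # Reward = 1 when conditioning on the hint produced a correct answer.
    # These rewards would train the abstraction generator, closing the loop.
    hints = abstraction_generator(problem)
    return {h: int(solution_generator(problem, h) == target) for h in hints}

rewards = rewarded_hints("toy math problem", target=24)
```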