Recent Developments in LLM Models

Explore top LinkedIn content from expert professionals.

  • Sebastian Raschka, PhD

    ML/AI research engineer. Author of Build a Large Language Model From Scratch (amzn.to/4fqvn0D) and Ahead of AI (magazine.sebastianraschka.com), on how LLMs work and the latest developments in the field.

    233,891 followers

    2024 was a great year for building general-purpose LLMs, with some specialized fine-tuning for math and code. So far, 2025 seems to be the year of diverging into two key areas: (1) reasoning models (focused on math and code), and (2) agents (essentially LLM-based workflow automation). This year is going to be another eventful one for LLM research and development!

    I plan to write more about reasoning models soon. In the meantime, if you're looking for focused reading this weekend, I've re-compiled my take on the noteworthy LLM research papers of 2024. The topics include mixture-of-experts, new scaling laws for precision, scaling inference-time compute, and more. It's all packed into a PDF-friendly, 47-page article with a table of contents for easy navigation: https://lnkd.in/gFUT9cnk

    Topics:
    1. January: Mixtral's Mixture of Experts Approach
       1.1 Understanding MoE models
       1.2 The relevance of MoE models today
    2. February: Weight-decomposed LoRA
       2.1 LoRA Recap
       2.2 From LoRA to DoRA
       2.3 The future of LoRA and LoRA-like methods
    3. March: Tips for Continually Pretraining LLMs
       3.1 Simple techniques work
       3.2 Will these simple techniques continue to work?
    4. April: DPO or PPO for LLM alignment, or both?
       4.1 RLHF-PPO and DPO: What Are They?
       4.2 PPO Typically Outperforms DPO
       4.3 How are PPO and DPO used today?
    5. May: LoRA learns less and forgets less
       5.1 LoRA learns less
       5.2 LoRA forgets less
       5.3 The LoRA trade-off
       5.4 Future approaches to finetuning LLMs
    6. June: The 15 Trillion Token FineWeb Dataset
       6.1 Comparison to other datasets
       6.2 Principled dataset development
       6.3 The relevance of FineWeb today
    7. July: The Llama 3 Herd of Models
       7.1 Llama 3 architecture summary
       7.2 Llama 3 training
       7.3 Multimodal Llamas
       7.4 Llama 3 impact and usage
    8. August: Improving LLMs by scaling inference-time compute
       8.1 Improve outputs by using more test-time computation
       8.2 Optimizing test-time computation techniques
       8.3 Test-time computation versus pretraining a larger model
       8.4 Future relevance of test-time compute scaling
    9. September: Comparing multimodal LLM paradigms
       9.1 Multimodal LLM paradigms
       9.2 Nvidia's hybrid approach
       9.3 Multimodal LLMs in 2025
    10. October: Replicating OpenAI o1's reasoning capabilities
       10.1 Shortcut learning vs. journey learning
       10.2 Constructing long thoughts
       10.3 Distillation: the quick fix?
       10.4 The state of AI research
       10.5 The future of LLMs in the light of o1 (and o3)
    11. November: LLM scaling laws for precision
       11.1 Chinchilla scaling laws refresher
       11.2 Low-precision training
       11.3 Precision scaling laws takeaways
       11.4 Model scaling laws in 2025
    12. December: Phi-4 and learning from synthetic data
       12.1 Phi-4 performance
       12.2 Synthetic data learnings
       12.3 Future importance of synthetic data
    Conclusions and outlook: multimodal LLMs, computational efficiency, state space models, scaling, and what I am looking forward to
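Since LoRA and DoRA feature prominently in the list above, a minimal sketch of the core LoRA idea may help: instead of updating a full weight matrix W, one learns a low-rank update scaled by alpha/r. The plain-Python function below is an illustrative sketch under that definition, not code from any of the papers mentioned.

```python
# Illustrative sketch of the LoRA idea: the frozen weight W is combined with
# a low-rank update (alpha / r) * B @ A, where A is (r x d_in) and B is
# (d_out x r). B starts at zero, so training begins from the base model.

def lora_forward(W, A, B, x, alpha=16, r=4):
    """Compute (W + (alpha / r) * B A) x without merging the matrices."""
    base = [sum(w * xi for w, xi in zip(row, x)) for row in W]   # W x
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]     # A x, r-dimensional
    BAx = [sum(b * ai for b, ai in zip(row, Ax)) for row in B]   # B (A x)
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, BAx)]
```

With B initialized to zeros, the output reduces to the base model's W x, which is why LoRA fine-tuning starts exactly from the pretrained behavior.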

  • Aishwarya Srinivasan
    628,128 followers

    If you're an AI engineer trying to understand how reasoning actually works inside LLMs, this will help you connect the dots.

    Most large language models can generate. But reasoning models can decide. Traditional LLMs followed a straight line: Input → Predict → Output. No self-checking, no branching, no exploration. Reasoning models introduced structure: a way for models to explore multiple paths, score their own reasoning, and refine their answers.

    We started with Chain-of-Thought (CoT) reasoning, then extended to Tree-of-Thought (ToT) for branching, and now to graph-based reasoning, where models connect, merge, or revisit partial thoughts before concluding. This evolution changes how LLMs solve problems. Instead of guessing the next token, they learn to search the reasoning space: exploring alternatives, evaluating confidence, and adapting dynamically.

    Different reasoning topologies serve different goals:
    • Chains for simple sequential reasoning
    • Trees for exploring multiple hypotheses
    • Graphs for revising and merging partial solutions

    Modern architectures (like OpenAI's o-series reasoning models, Anthropic's Claude reasoning stack, DeepSeek's R series, and DeepMind's AlphaReasoning experiments) use this idea under the hood. They don't just generate answers; they navigate reasoning trajectories, using adaptive depth-first or breadth-first exploration depending on task uncertainty.

    Why does this matter?
    • It reduces hallucinations by verifying intermediate steps
    • It improves interpretability, since we can visualize reasoning paths
    • It boosts reliability for complex tasks like planning, coding, or tool orchestration

    The next phase of LLM development won't be about more parameters; it'll be about better reasoning architectures: topologies that can branch, score, and self-correct.

    I'll be doing a deep dive on reasoning models soon on my Substack, exploring architectures, training approaches, and practical applications for engineers. If you haven't subscribed yet, make sure you do: https://lnkd.in/dpBNr6Jg

    ♻️ Share this with your network 🔔 Follow along for more data science & AI insights
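The chain/tree/graph topologies described above can be sketched generically: treat partial thoughts as nodes, expand each frontier node, score the candidates, and keep the best few. In the sketch below, `expand` and `score` are toy stand-ins for LLM calls; this is illustrative, not any vendor's implementation.

```python
def tree_of_thought(root, expand, score, beam_width=2, depth=3):
    """Beam-style search over partial 'thoughts': expand every frontier node,
    score all candidates, and keep only the best beam_width at each level.
    In a real system, expand() and score() would be LLM calls."""
    frontier = [root]
    for _ in range(depth):
        candidates = [child for node in frontier for child in expand(node)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=score)
```

With beam_width=1 this degenerates to a greedy chain of thought; widening the beam gives the tree-shaped exploration the post describes.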

  • Kuldeep Singh Sidhu

    Senior Data Scientist @ Walmart | BITS Pilani

    16,028 followers

    Exciting news in the world of AI and information retrieval! Researchers have developed DRAMA (Dense Retriever from diverse LLM AugMentAtion), a groundbreaking framework that leverages large language models (LLMs) to create smaller, more efficient dense retrievers without sacrificing performance.

    Key innovations of DRAMA:
    1. Data Augmentation: utilizes LLMs to generate high-quality training data, including cropped sentences as queries, synthetic queries, and LLM-based reranking.
    2. Pruned LLM Backbones: starts with Llama3.2 1B and prunes it down to 0.1B and 0.3B models, preserving multilingual and long-context capabilities.
    3. Single-Stage Training: combines LLM-based data augmentation with pruned LLM backbones in a streamlined training process.
    4. Matryoshka Representation Learning: enables flexible dimensionality selection at inference time for various deployment scenarios.

    DRAMA achieves impressive results across multiple benchmarks:
    - Matches or outperforms larger models on the BEIR and MIRACL datasets
    - Demonstrates strong multilingual capabilities
    - Excels in long-context retrieval tasks

    This work, led by researchers from FAIR at Meta and the University of Waterloo, showcases the potential of aligning smaller retriever training with ongoing LLM advancements. It's a significant step towards more efficient and generalizable information retrieval systems.
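One of DRAMA's components, Matryoshka representation learning, trains embeddings whose prefixes remain usable on their own, so the dimensionality can be chosen at inference time. Below is a hedged sketch of only the inference-side trick (truncate, then renormalize); the training objective that makes prefixes meaningful is not shown.

```python
def truncate_embedding(vec, dim):
    """Matryoshka-style inference: keep only the first `dim` dimensions of a
    trained embedding and L2-renormalize, trading a little accuracy for a
    smaller index and faster similarity search."""
    prefix = vec[:dim]
    norm = sum(v * v for v in prefix) ** 0.5
    return [v / norm for v in prefix] if norm > 0 else prefix
```

The same stored vectors can then serve several deployment scenarios: a full-dimension index for quality and a truncated one for cheap first-pass retrieval.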

  • Sharada Yeluri

    Engineering Leader

    21,533 followers

    A lot has changed since my #LLM inference article last January; it's hard to believe a year has passed! The AI industry has pivoted from focusing solely on scaling model sizes to enhancing reasoning abilities during inference. This shift is driven by the recognition that simply increasing model parameters yields diminishing returns and that improving inference capabilities can lead to more efficient and intelligent AI systems.

    OpenAI's o1 and Google's Gemini 2.0 are examples of models that employ #InferenceTimeCompute. Some techniques include best-of-N sampling, which generates multiple outputs and selects the best one; iterative refinement, which allows the model to improve its initial answers; and speculative decoding. Self-verification lets the model check its own output, while adaptive inference-time computation dynamically allocates extra #GPU resources for challenging prompts. These methods represent a significant step toward more reasoning-driven inference.

    Another exciting trend is #AgenticWorkflows, where an AI agent (a software program running on an inference server) breaks the queried task into multiple small tasks without requiring complex user prompts (prompt engineering may reach end of life this year!). It then autonomously plans, executes, and monitors these tasks. In this process, it may run inference multiple times on the model while maintaining context across the runs.

    #TestTimeTraining takes things further by adapting models on the fly. This technique fine-tunes the model for new inputs, enhancing its performance.

    These advancements can complement each other. For example, an AI system may use an agentic workflow to break down a task, apply inference-time compute to generate high-quality outputs at each step, and employ test-time training to adapt to unexpected challenges. The result? Systems that are faster, smarter, and more adaptable.

    What does this mean for inference hardware and networking gear? Previously, most open-source models barely needed one GPU server, and inference was often done in front-end networks or by reusing the training networks. However, as the computational complexity of inference increases, more focus will be on building scale-up systems with hundreds of tightly interconnected GPUs or accelerators for inference flows. While Nvidia GPUs continue to dominate, other accelerators, especially from hyperscalers, will likely gain traction.

    Networking remains a critical piece of the puzzle. Can #Ethernet, with enhancements like compressed headers, link retries, and reduced latencies, rise to meet the demands of these scale-up systems? Or will we see a fragmented ecosystem of switches for non-Nvidia scale-up systems? My bet is on Ethernet. Its ubiquity makes it a strong contender for the job...

    Reflecting on the past year, it's clear that AI progress isn't just about making things bigger but smarter. The future looks more exciting as we rethink models, hardware, and networking. Here's to what 2025 will bring!
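Of the inference-time techniques listed above, best-of-N sampling is the simplest to sketch: draw N candidate outputs and keep the one a verifier scores highest. The generator and scorer below are toy stand-ins for an LLM and a reward model; this is an illustrative sketch, not any production system's code.

```python
import random

def best_of_n(generate, score, n=8, seed=0):
    """Best-of-N sampling: draw n candidates from a stochastic generator and
    return the one the scoring function (e.g., a verifier or reward model)
    rates highest. More samples cost more compute but raise output quality."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)
```

This is exactly the compute-for-quality trade the post describes: quality scales with n, and n can be chosen per prompt difficulty.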

  • Sohrab Rahimi

    Director, AI/ML Lead @ Google

    23,610 followers

    Recent research is advancing two critical areas in AI, autonomy and reasoning, building on LLMs' strengths to make them more autonomous and adaptable for real-world applications. Here is a summary of a few papers that I found interesting and rather transformative:

    • 𝐋𝐋𝐌-𝐁𝐫𝐚𝐢𝐧𝐞𝐝 𝐆𝐔𝐈 𝐀𝐠𝐞𝐧𝐭𝐬 (𝐌𝐢𝐜𝐫𝐨𝐬𝐨𝐟𝐭): These agents use LLMs to interact directly with graphical interfaces—screenshots, widget trees, and user inputs—bypassing the need for APIs or scripts. They can execute multi-step workflows through natural language, automating tasks across web, mobile, and desktop platforms.
    • 𝐀𝐅𝐋𝐎𝐖: By treating workflows as code-represented graphs, AFLOW dynamically optimizes processes using modular operators like "generate" and "review/revise." This framework demonstrates how smaller, specialized models can rival larger, general-purpose systems, making automation more accessible and cost-efficient for businesses of all sizes.
    • 𝐑𝐞𝐭𝐫𝐢𝐞𝐯𝐚𝐥-𝐀𝐮𝐠𝐦𝐞𝐧𝐭𝐞𝐝 𝐑𝐞𝐚𝐬𝐨𝐧𝐢𝐧𝐠 (𝐑𝐀𝐑𝐄): RARE integrates real-time knowledge retrieval with logical reasoning steps, enabling LLMs to adapt dynamically to fact-intensive tasks. This is critical in fields like healthcare and legal workflows, where accurate and up-to-date information is essential for decision-making.
    • 𝐇𝐢𝐀𝐑-𝐈𝐂𝐋: Leveraging Monte Carlo Tree Search (MCTS), this framework teaches LLMs to navigate abstract decision trees, allowing them to reason flexibly beyond linear steps. It excels in solving multi-step, structured problems like mathematical reasoning, achieving state-of-the-art results on challenging benchmarks.

    By removing the reliance on APIs and scripts, systems like GUI agents and AFLOW make automation far more flexible and scalable. Businesses can now automate across fragmented ecosystems, reducing development cycles and empowering non-technical users to design and execute workflows. Simultaneously, reasoning frameworks like RARE and HiAR-ICL enable LLMs to adapt to new information and solve open-ended problems, particularly in high-stakes domains like healthcare and law.

    These studies highlight key emerging trends in AI:
    1. Moving Beyond APIs and Simplifying Integration: A major trend is the move away from API dependencies, with AI systems integrating directly into existing software environments through natural language and GUI interaction. This addresses one of the largest barriers to AI adoption in organizations.
    2. Redefining User Interfaces: Traditional app interfaces with icons and menus are being reimagined. With conversational AI, users can simply ask for what they need, and the system executes it autonomously.
    3. Tackling More Complex Tasks Autonomously: As reasoning capabilities improve, AI systems are expanding their range of activities and elevating their ability to plan and adapt.

    As these trends unfold, we're witnessing the beginning of a new era in AI. Where do you see the next big research trends in AI heading?
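The retrieval half of a RARE-style pipeline can be illustrated with a toy lexical scorer. A real system would use a dense retriever and interleave retrieval with reasoning steps; the helper below is purely an illustrative sketch of the "fetch evidence before reasoning" step.

```python
def retrieve(query, corpus, k=2):
    """Toy retrieval for a retrieval-augmented reasoning loop: rank documents
    by word overlap with the query and return the top k. Production systems
    use dense embeddings, but the pipeline shape is the same: retrieve
    evidence, reason over it, and retrieve again if a step needs more facts."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda doc: len(q & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]
```

The retrieved passages would then be injected into the model's context before each fact-intensive reasoning step.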

  • Jason Eshraghian

    Assistant Professor, University of California, Santa Cruz

    6,997 followers

    Rui-Jie Zhu has done it again. While our previous LLMs targeted efficiency (SpikeGPT, MatMul-free LM), this work is straight up focused on hitting SoTA. Our latest models, "Ouro 1.4B" and "Ouro 2.6B", match or beat frontier models 2-3x larger (Qwen3, Llama3.2, Gemma3).

    The modern path to improving LLMs has been: 1) add parameters, 2) scale tokens, and 3) add "reasoning" by letting models talk to themselves via Chain-of-Thought. (1) More parameters require more data to remain compute-optimal. (2) Frontier models already train on most of the entire internet; we've run into a bit of a data wall. (3) These limits turned attention to reasoning models. Reasoning models are powerful, but they bloat your finite context window. They also under-utilize the trillions of tokens available in the pretraining process, as reasoning is usually accounted for post-hoc.

    So are parameters and tokens the only two scaling dimensions in pretraining? Our latest paper shows a third scaling dimension: the depth of computation. We scale latent-space reasoning via recurrent depth during pretraining. Inspired by Universal Transformers, we push this to an industrial scale, which introduced non-trivial challenges.

    - We pretrained on 7.7 trillion tokens. This is "looped computation" at an unprecedented industrial scale.
    - We trained the model to learn when to stop. This required an entropy-regularized objective to prevent reward-hacking, where the model would otherwise constantly maximize the number of loops. For simple tokens, the model might exit early at T=2, while for complex problems, it is more likely to max out the recursion steps.
    - Ouro 1.4B (up to 4 loops) matches the performance of SOTA 4B Transformers across the board. Ouro 2.6B (4 loops) matches 8B and 12B models.
    - We tried to lift the hood on why this is so powerful. On raw knowledge storage (bits per parameter), Ouro is similar to standard LLMs. On manipulation- and reasoning-heavy tasks, looped computation wins. This is consistent with the "physics of LLMs" view that more compute per token improves reasoning efficiency.

    Thank you to all collaborators for making this happen across ByteDance, Princeton, Conscium, MILA, Univ of Montreal, Peking, Carnegie Mellon, UPenn, Univ of Manchester, and M-A-P.

    Base models and Thinking variants have been open-sourced with vLLM/SGLang integration: https://lnkd.in/giBjDKYE
    Preprint: https://lnkd.in/gmEkUpNs

    Rui-Jie Zhu: in need of a well-deserved nap.
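The central idea of the post, scaling depth by reapplying the same block with a learned exit, can be sketched in a few lines. Here `step` and `confident` are toy stand-ins for Ouro's shared transformer block and its learned halting signal; this is an assumption-laden illustration of the mechanism, not the paper's code.

```python
def looped_forward(x, step, confident, max_loops=4):
    """Recurrent-depth sketch: apply the same computation block repeatedly,
    exiting early when a halting test fires. Easy inputs exit at small t;
    hard inputs use the full loop budget, so compute adapts per input."""
    for t in range(1, max_loops + 1):
        x = step(x)
        if confident(x):
            return x, t   # early exit for easy inputs
    return x, max_loops   # budget exhausted for hard inputs
```

This also shows why an entropy-style regularizer is needed in training: without pressure toward early exits, a model rewarded for accuracy alone would always run to max_loops.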

  • Mary Newhauser

    Member of Technical Staff @ Fastino Labs

    28,596 followers

    The biggest lie in AI is that every new LLM is a revolution. The core design is ancient.

    Sure, the performance and capabilities of LLMs have progressed quickly over the last several years. But when it comes to model architectures, are we seeing revolutionary changes or just spiced-up versions of older techniques?

    Sebastian Raschka, PhD answers this with beautiful visuals and the most thorough dive into modern transformer architectures I've seen in ages. In his recent article, he focuses on:
    • Specific architectural developments that lead to improved performance (e.g. RoPE, Grouped-Query Attention, MoE)
    • Head-to-head architectural comparisons of SOTA open models (e.g. Qwen3 vs. DeepSeek V3, Qwen3 vs. SmolLM3, Kimi K2 vs. DeepSeek V3)
    • Architectural summaries of the most defining LLMs

    Highly recommend reading the full post, with visuals. 🔗: https://lnkd.in/gk_z9Y_u
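One of the developments the article covers, Grouped-Query Attention, is easy to show in miniature: several query heads share one key/value head, shrinking the KV cache. The helper below is an illustrative sketch (not the article's code) that computes which KV head each query head uses.

```python
def gqa_head_map(num_q_heads, num_kv_heads):
    """Grouped-Query Attention: query heads are split into groups that share
    a key/value head, so the KV cache shrinks by num_q_heads / num_kv_heads
    relative to standard multi-head attention (where the map is identity)."""
    assert num_q_heads % num_kv_heads == 0, "query heads must divide evenly"
    group_size = num_q_heads // num_kv_heads
    return [q // group_size for q in range(num_q_heads)]
```

For example, 8 query heads over 2 KV heads gives a 4x smaller KV cache, which is most of why GQA became standard in the open models the article compares.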

  • Greg Coquillo

    AI Infrastructure Product Leader | Scaling GPU Clusters for Frontier Models | Microsoft Azure AI & HPC | Former AWS, Amazon | Startup Investor | Linkedin Top Voice | I build the infrastructure that allows AI to scale

    229,035 followers

    Large Language Model architectures have evolved from basic transformer blocks into smart systems that deliver massive intelligence without crushing your compute budget. Engineers stopped just making models bigger and started making them cleverer, solving the core problem of how to scale AI brains while keeping them practical to actually run. This shift shows how the industry learned to work smarter, not harder, when building the next generation of AI systems.

    Here's how modern LLM architectures tackle different needs and scales:

    🔹 Lightweight Speed (Llama 3.2 1B): This is built for edge devices and fast responses, packing standard attention mechanisms into a dense 8192-dimension architecture. It runs quickly on limited hardware while handling a 128k vocabulary efficiently, perfect when you need AI that works anywhere without massive servers.

    🔹 Smart Balance (Qwen3 4B, SmolLM3 3B): These models introduce clever tricks like SwiGLU activation and NoPE techniques to squeeze more performance from fewer resources. They support 128k-token contexts while training faster and running cheaper than traditional approaches, hitting the sweet spot between capability and practicality.

    🔹 Expert Networks (DeepSeek V3, Qwen3 235B): These architectures transform scaling by using Mixture-of-Experts routing; DeepSeek V3 has 671B total parameters but only about 37B active per token, and Qwen3 235B activates about 22B. The system dynamically picks which expert networks to use, giving you trillion-parameter-class intelligence at billion-parameter costs. It's like having a massive team where only the right specialists work on each problem.

    🔹 Extreme Intelligence (Kimi K2 1T): This represents the current peak with 1 trillion parameters but only 32B active per token. Advanced routing mechanisms let it handle complex reasoning tasks that smaller models can't touch, while still running efficiently enough for real-world deployment.

    Engineers moved from "throw more parameters at it" to "design smarter systems that activate intelligence selectively."
#llm #artificialintelligence
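The "huge total, small active" behavior described above comes from sparse expert routing: a small router scores all experts and activates only the top few per token. Below is a minimal sketch with toy router scores; it is illustrative and not DeepSeek's or Kimi's actual routing code.

```python
import math

def moe_route(router_scores, top_k=2):
    """Mixture-of-Experts routing: select the top_k experts by router score
    and softmax-normalize their weights. Only the chosen experts run, so a
    model with huge total parameters activates only a small slice per token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)[:top_k]
    exps = [math.exp(router_scores[i]) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]
```

The token's output is then the weighted sum of the selected experts' outputs; with, say, 8 of 256 experts active, most parameters sit idle on any given token.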

  • Pascal Biese

    AI Lead at PwC </> Daily AI highlights for 80k+ experts 📲🤗

    85,079 followers

    LLMs are becoming universal problem solvers, and it's not just hype.

    Researchers from MIT just published a paper showing how Large Language Models (LLMs) can be used as general-purpose planners, tackling a wide range of optimization problems. Most existing frameworks rely heavily on task-specific examples and pre-defined critics, which limits their ability to generalize across different types of problems. But what if we could leverage the reasoning and programming capabilities of LLMs to tackle planning problems in a more universal way?

    LLMFP is a new framework that formulates planning problems as optimization problems. By doing that, it can capture the key information and solve the problems from scratch, with no task-specific examples needed. The researchers put LLMFP to the test on 9 different planning problems, ranging from multi-constraint decision making to multi-step planning. These are the results:
    1. LLMFP boosted the performance of both GPT-4o and Claude 3.5 Sonnet
    2. Both significantly outperformed the best out-of-the-box baseline (OpenAI o1-preview) by over 35%
    3. This includes multi-constraint decision-making and multi-step planning problems

    So what does this mean for you? While we shouldn't think of LLMs as a replacement for human planners, they might become quite capable at basic planning very soon. Developing strong planning capabilities will be an important cornerstone of the next stage of AI agents.

    For more AI highlights, check out this week's LLM Watch: https://lnkd.in/d3MBfMzU
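The paper's key move, casting a planning task as an optimization problem and solving it from scratch, can be illustrated with a hypothetical toy formulation. In an LLMFP-style system the LLM would emit a formulation like this and a solver would handle it; the example below (actions with costs and values under a budget) is entirely an assumption for illustration, not the paper's code.

```python
from itertools import combinations

def solve_plan(actions, budget):
    """Toy optimization formulation of a planning problem: choose the subset
    of (name, cost, value) actions maximizing total value subject to a cost
    budget, solved by exhaustive search over subsets."""
    best_names, best_value = [], 0
    for r in range(len(actions) + 1):
        for combo in combinations(actions, r):
            cost = sum(c for _, c, _ in combo)
            value = sum(v for _, _, v in combo)
            if cost <= budget and value > best_value:
                best_names, best_value = [n for n, _, _ in combo], value
    return best_names, best_value
```

The generalization claim rests on this shape: many planning tasks reduce to "maximize an objective under constraints", so one formulation-plus-solver pipeline covers them without task-specific examples.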
