Thinking in loops, not layers: A look at Tiny Recursive Models (TRM) and Other Recursive Intelligence

Executive Summary

The artificial intelligence industry is currently navigating a period of profound turbulence and transformation. For the better part of a decade, the dominant heuristic driving progress in machine learning has been the "Scaling Hypothesis"—the empirical observation that increasing model parameter counts, training dataset sizes, and computational budgets yields predictable and continuous improvements in downstream performance. This "bigger is better" paradigm has culminated in the development of trillion-parameter behemoths, the massive capitalization of hyper-scale data centers, and a geopolitical race for semiconductor supremacy. However, as we approach the midpoint of the decade, emerging technical and economic signals suggest that this era of brute-force scaling is hitting a hard ceiling. We are witnessing the dawn of the Recursive Intelligence Era, a fundamental architectural shift from static, retrieval-based generation to dynamic, inference-time reasoning.

This report provides an exhaustive, multi-dimensional analysis of the transition from massive Transformer-based Large Language Models (LLMs) to highly efficient, recursive architectures such as the Hierarchical Reasoning Model (HRM) and the Tiny Recursion Model (TRM). We posit that the future of AI lies not in models that "know" more, but in models that "think" better. By shifting the computational burden from training-time parameter bloat to test-time recursive reasoning, next-generation architectures are demonstrating that small, agile models (ranging from 7 million to 1 billion parameters) can outperform massive incumbents on complex reasoning benchmarks.

This analysis explores the technical limitations of the current scaling paradigm, including the "Scaling Wall" and the diminishing returns of data; the "band-aid" nature of Mixture of Experts (MoE) architectures which fail to address fundamental memory bandwidth bottlenecks; and the mechanistic breakthroughs of latent-space reasoning. We detail the specific architectural innovations of HRM and TRM, and broaden the context to include other efficiency-first architectures like Liquid Neural Networks and RWKV. Finally, we articulate the economic implications of this shift, predicting a move from "cost per token" to "cost per thought" and the democratization of high-level intelligence at the network edge.

1. The Scaling Wall: The Limits of Brute-Force Scaling

The trajectory of AI development since the introduction of the Transformer architecture in 2017 has been defined by an exponential increase in model size. From the 117 million parameters of early GPT models to the multi-trillion parameter frontiers of GPT-4, Gemini, and Claude, the industry has operated under the assumption that scale is the primary, if not the only, driver of capability. This assumption, codified in the Kaplan et al. (2020) and Hoffmann et al. (2022) scaling laws, provided a roadmap that equated capital expenditure on GPUs directly with intelligence. However, as we move through 2024-25, the laws of diminishing returns are asserting themselves with undeniable force, creating what analysts are terming the "Scaling Wall."

1.1 The Plateau of Parameter Efficiency

The "Scaling Hypothesis"—championed by researchers like Ilya Sutskever—posited that simply increasing compute and data would indefinitely yield intelligence gains. While this held true for general knowledge retrieval, linguistic fluency, and few-shot adaptation, it is failing to generalize to complex reasoning, novel problem-solving, and long-horizon planning.

Recent evaluations indicate that while test loss (the metric measuring how well a model predicts the next token) continues to decrease marginally with scale, the economic value of that improvement is collapsing. We are observing a decoupling of "perplexity" (a measure of how surprised a model is by text) and "capability" (the ability to solve a real-world problem). A model that is 10x larger may be 1% better at predicting the next word in a Wikipedia article, but no better at solving a novel differential equation or debugging a complex software codebase.
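For reference, perplexity is simply the exponentiated average negative log-likelihood the model assigns to each next token:

$$\text{PPL}(x_{1:N}) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i})\right)$$

Nothing in this quantity measures multi-step logical correctness, which is why perplexity can keep improving while problem-solving capability stalls.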

The industry is encountering a "Scaling Wall" where the cost to train a model 10x larger yields only incremental improvements in practical capability.

1.2 The Energy and Economic Sustainability Crisis

The most immediate and tangible barrier to further scaling is physical and economic sustainability. The computational demand for training state-of-the-art (SOTA) models is approaching the energy output of small nations. Training a single 175B parameter model consumes roughly 1,287 MWh of energy—equivalent to the annual consumption of over 100 U.S. homes—and emits hundreds of tons of carbon dioxide.
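As a rough sanity check on that comparison (assuming the commonly cited U.S. average of roughly 10-11 MWh of residential electricity per household per year):

$$1{,}287 \text{ MWh} \div \approx 10.6 \text{ MWh per home-year} \approx 120 \text{ homes}$$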

However, training is merely the down payment. The true environmental cost lies in inference—the ongoing operation of the model.

Data centers are becoming the primary bottleneck in the AI supply chain, not due to a lack of silicon, but due to a lack of power.

Moreover, the "cost per token" logic of the cloud-based API economy is breaking down for complex agentic workflows. If an AI agent requires 50 iterative steps to solve a coding problem (planning, writing, reviewing, fixing, deploying), paying for a 700B parameter model for every intermediate step—regardless of the difficulty of that specific step—is economically non-viable.

2. Mixture of Experts (MoE): A Band-Aid, Not a Cure

Faced with the unsustainable computational costs of dense models (where every parameter is active for every token), the industry has largely pivoted to Mixture of Experts (MoE) architectures (e.g., Mixtral, GPT-4, Grok). While MoE is often touted as a solution to the scaling problem—allowing for trillion-parameter capacities with reduced inference costs—rigorous analysis suggests it is merely a stopgap measure, a "band-aid" rather than a fundamental cure for the limitations of the Transformer.

2.1 The Mechanics of MoE: Sparse Activation

MoE models differ from dense models by replacing the standard Feed-Forward Network (FFN) layers with a set of parallel "expert" sub-networks. A trainable gating network (or "router") determines which experts (usually the top-1 or top-2) are activated for a given token.

For example, in a model like Mixtral 8x7B, the total parameter count might be 47 billion, but for any specific token, only roughly 13 billion parameters are active (2 experts x 7B). This theoretically creates a model with the knowledge breadth of a giant (due to the large total parameter count) but the inference speed (FLOPs) of a dwarf. It allows labs to claim "Trillion Parameter" status while keeping inference latency manageable.
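A minimal sketch of top-2 sparse routing is shown below (illustrative only, with made-up dimensions; production MoE layers add load-balancing losses, capacity limits, and expert parallelism):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k Mixture-of-Experts FFN layer (illustrative sketch)."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # trainable gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                            # x: (tokens, d_model)
        logits = self.router(x)                      # score every expert per token
        weights, idx = logits.topk(self.k, dim=-1)   # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)         # renormalize their gate weights
        out = torch.zeros_like(x)
        for slot in range(self.k):                   # route tokens to their chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Note that all eight experts' weights must still be resident in memory even though only two run per token; this is the memory-bandwidth issue discussed in the next subsection.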

2.2 Limitations and Complexity

  • Memory bandwidth bottleneck: every expert's weights must still be stored and streamed to the accelerator, even though only a few are active per token, so memory and bandwidth requirements track total parameters rather than active ones.
  • Routing collapse: the gating network tends to over-select a few favored experts, starving the rest of training signal unless auxiliary load-balancing losses are added.
  • Training complexity: the discrete routing decision makes optimization noisier and more hyperparameter-sensitive than training a dense model.

MoE extends the life of the Transformer, but it does not eliminate its underlying limitations.

3. The Paradigm Shift: From Retrieval to Reasoning

The most significant development in AI research in the 2024-2025 window is the shift from Training-Time Compute to Test-Time Compute. This marks the transition from models that rely on memorized patterns (Retrieval) to models that actively process information at runtime (Reasoning). This is the core thesis of the "Recursive Intelligence" movement.

3.1 The Rise of System 2 AI

Cognitive science, popularized by Daniel Kahneman, distinguishes between two modes of thought:

  • System 1: Fast, automatic, intuitive, and unconscious. (e.g., answering "2+2", reading a stop sign).
  • System 2: Slow, effortful, logical, and calculating. (e.g., solving "17 x 24", planning a travel itinerary).

Standard Transformers are the ultimate System 1 engines. They predict the next token based on statistical likelihood derived from their training data. They do not "pause" to verify their logic; they merely flow. If the training data contains the answer, they retrieve it. If it requires a novel logical leap, they often hallucinate a statistically plausible but factually incorrect continuation.

The new wave of "Reasoning Models" exemplified by OpenAI's o1 and emerging recursive architectures, introduces System 2 capabilities. These models are designed to "think" before they speak. This thinking process is not magic; it is the allocation of computational resources during the inference phase to explore solution paths, verify intermediate steps, and backtrack if necessary.

3.2 Test-Time Compute: The New Scaling Law

Historical scaling laws focused on pre-training: more GPUs running for more months on more data equals a smarter model. The new scaling law operates at inference time: more time spent "thinking" on a specific prompt equals a smarter answer.

OpenAI's o1 model demonstrates this vividly. Its performance on complex benchmarks (like the AIME math competition) scales logarithmically with the amount of time it is allowed to "reason" before outputting a final answer. A visual analysis of o1's performance shows that accuracy increases predictably relative to log-scale compute during inference.

This suggests a profound architectural implication: A smaller model, given sufficient time to recurse, refine its thoughts, and verify its logic, can outperform a massive model that is forced to answer instantly. This "inference-time scaling" is the mechanism that breaks the parameter wall. We do not need trillion-parameter static weights; we need dynamic architectures that can effectively utilize test-time compute to generate depth of thought rather than just accessing a breadth of knowledge.

3.3 Mechanisms of Inference Reasoning

How does a model "think"? Currently, several mechanisms are being explored:

  1. Chain of Thought (CoT): Explicitly generating tokens that describe the steps. (e.g., "First I need to calculate X, then Y...").
  2. Best-of-N with Verification: Generating $N$ candidate solutions and using a separate "verifier" model to score them (a minimal sketch follows this list).
  3. Tree of Thoughts: Exploring multiple branching possibilities and pruning dead ends.
  4. Latent Recursion: (The focus of this report) Processing information in the hidden states without generating text, looping the context through the model multiple times.
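A minimal sketch of the Best-of-N mechanism referenced above, assuming hypothetical `generate()` and `verifier_score()` helpers rather than any particular API:

```python
def best_of_n(prompt, generate, verifier_score, n=8):
    """Sample n candidate answers and return the one the verifier scores highest.

    `generate` and `verifier_score` are placeholders for a sampling-enabled
    model call and a learned (or rule-based) verifier, respectively.
    """
    candidates = [generate(prompt) for _ in range(n)]            # spend extra test-time compute
    scored = [(verifier_score(prompt, c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]              # keep the best-scoring answer
```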

While CoT has been the standard, it is inefficient. It forces the model to serialize high-dimensional concepts into low-bandwidth natural language. The future lies in Latent Space Reasoning, where the thinking happens in vectors, not words.

4. Recursive Intelligence: The Architectural Solution

The implementation of System 2 thinking in current Large Language Models (LLMs) is often achieved through explicit "Chain of Thought" (CoT)—prompting the model to generate text tokens that represent its reasoning steps. However, this is computationally expensive and inefficient. It forces the model to "speak" to itself in English, losing the rich, high-dimensional nuance of its internal state. The true breakthrough lies in Recursive Intelligence implemented via Latent Space Reasoning.

4.1 Latent Space vs. Token Space

Traditional CoT forces the model to externalize its thought process. This is slow and constrained by the ambiguity of human language. Latent Space Reasoning moves this process inside the neural network's hidden layers.

In a Latent Reasoning architecture, the model performs iterative updates on its internal state vectors (embeddings) without generating an output token. It "thinks" in pure math—high-dimensional vectors that can represent complex, nuanced concepts that do not translate 1:1 to words. This allows for:

  • Higher Bandwidth: Concepts are processed in their full dense representation, retaining uncertainty and multiple meanings until the final collapse into an output token.
  • Privacy: The reasoning trace is hidden within the model weights, not exposed in the output text.
  • Efficiency: No need to run the massive "un-embedding" matrix to convert vectors to words and back again for every step of thought.

4.2 The Recursive Loop

Recursion in this context refers to the architecture's ability to loop its output state back as its input state multiple times to refine a representation. Unlike the single-pass feed-forward nature of a standard Transformer block, a recursive block can cycle N times to deepen the reasoning depth without adding parameters.

Consider a 12-layer model. In a standard run, the data passes through 12 layers. In a recursive run where the data loops 10 times, the effective computational depth is equivalent to a 120-layer model. However, the memory footprint remains that of a 12-layer model. This allows for massive effective depth on constrained hardware.
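A minimal sketch of this weight-tying idea, assuming a generic `block` (e.g., a small stack of Transformer layers) and a fixed loop count:

```python
import torch.nn as nn

class RecursiveDepth(nn.Module):
    """Reuse a single block n_loops times: effective depth grows, parameters do not."""
    def __init__(self, block: nn.Module, n_loops: int = 10):
        super().__init__()
        self.block = block          # one shared, weight-tied block
        self.n_loops = n_loops

    def forward(self, h):
        for _ in range(self.n_loops):   # effective depth = depth(block) * n_loops
            h = self.block(h)           # same parameters, progressively refined state
        return h
```

With a 12-layer `block` and `n_loops=10`, this reproduces the 120-layer effective depth described above while the parameter (weight) footprint stays at 12 layers.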

This approach draws on the recurrent neural network (RNN) lineage but is modernized with attention mechanisms and stable training techniques that avoid the vanishing-gradient problems of the past.

5. The Tiny Giants: HRM and TRM

The embodiment of this new paradigm is found in two cutting-edge architectures: the Hierarchical Reasoning Model (HRM) and the Tiny Recursion Model (TRM). These models challenge the industry obsession with size, proving that structural efficiency can trump raw scale.

5.1 Hierarchical Reasoning Model (HRM)

Developed by Sapient Intelligence, HRM is a "brain-inspired" architecture designed to mimic the biological interplay between slow planning and fast execution. It explicitly targets the limitations of CoT by internalizing the reasoning process.

  • Architecture: HRM utilizes a dual-module structure that fundamentally separates planning from execution:

  • H-Module (High-Level): A recurrent network that operates at a slow timescale. It handles abstract planning, long-horizon context, and global strategy. It updates its state only after a sequence of lower-level operations.
  • L-Module (Low-Level): A fast-updating module that executes rapid, concrete computations and token generation, guided by the state of the H-Module.

  • Mechanism: The H-Module effectively sets the "strategy" or "intent" in the latent space (a vector $z_H$), which guides the L-Module's token generation (a simplified sketch follows this list). This separation of concerns allows the model to maintain coherence over long tasks without the context drift that plagues standard Transformers. It mirrors the prefrontal cortex's role in guiding motor functions.
  • Performance: With only 27 million parameters, HRM has achieved shocking results on the ARC-AGI benchmark (a test of novel reasoning and pattern recognition). HRM scored 40.3%, a score comparable to models orders of magnitude larger (e.g., Claude 3.7 at 21.2% in some settings, though benchmarks vary).
  • Significance: HRM proves that reasoning capability is a function of architecture, not just parameter count. By dedicating specific structures to "planning" (System 2) versus "token generation" (System 1), it achieves high performance with minimal memory footprint (under 200MB of RAM). This allows it to run on standard CPUs, democratizing access to reasoning.
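A highly simplified sketch of the two-timescale interplay referenced in the Mechanism bullet (not Sapient's implementation; `h_module` and `l_module` stand in for small recurrent networks):

```python
def hierarchical_reasoning_step(x, h_module, l_module, z_H, z_L, inner_steps=4):
    """One outer planning cycle: the slow H-module updates its plan z_H only
    after a burst of fast L-module updates (illustrative sketch)."""
    for _ in range(inner_steps):
        z_L = l_module(x, z_L, z_H)   # fast, concrete computation guided by the plan z_H
    z_H = h_module(z_H, z_L)          # slow, abstract update after the inner burst
    return z_H, z_L
```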

5.2 Tiny Recursion Model (TRM)

Emerging from Samsung's Advanced Institute of Technology AI Lab in Montreal, and led by researcher Alexia Jolicoeur-Martineau, TRM pushes the limits of miniaturization even further.

  • Architecture: TRM employs a highly condensed recursive neural network structure. Instead of stacking hundreds of unique layers (which explodes parameter count), it reuses a single, highly optimized block recursively.
  • Mechanism: The model iterates on the problem in the latent space. When presented with a logic puzzle, it does not immediately predict the answer. It cycles the input through its recursive block, refining the internal representation until a confidence threshold is met, and only then generates the output. This is effectively "thinking time" implemented as a loop (a simplified sketch follows this list).
  • Performance: TRM (approx. 7 million parameters) has demonstrated the ability to outperform models 10,000x larger (like GPT-4 class models and Google's Gemini 2.5 Pro) on specific mathematical and logical reasoning benchmarks.
  • The "David vs. Goliath" Metric: The key metric here is Parameter Efficiency. TRM demonstrates that for defined reasoning tasks, massive knowledge bases (trillions of tokens of internet text) are unnecessary. Pure reasoning rules can be encoded in very small, recursive weights.
  • Deployment: Because it is so small, TRM is essentially free to run in terms of memory. Its weights can fit largely within the on-chip cache (L2/L3) of a modern processor, offering far faster access than fetching weights from VRAM.
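A simplified sketch of the confidence-gated latent loop described in the Mechanism bullet (illustrative; `block`, `confidence_head`, and `output_head` are hypothetical modules, not TRM's exact formulation):

```python
import torch

def recursive_infer(x, block, confidence_head, output_head, max_loops=16, threshold=0.95):
    """Refine a latent representation until a confidence threshold is met,
    then decode an output exactly once."""
    z = x                                        # start from the input embedding
    for _ in range(max_loops):
        z = block(z)                             # refine the representation in latent space
        if torch.sigmoid(confidence_head(z)).mean() >= threshold:
            break                                # confident enough: stop "thinking"
    return output_head(z)                        # only now generate the answer
```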

6. The Landscape of Efficiency: Beyond HRM and TRM

While HRM and TRM represent the bleeding edge of recursive reasoning, they sit within a broader movement of "Efficient Architectures" challenging the Transformer monopoly. Below is an examination of other non-Transformer approaches that prioritize efficiency, often using 50-100x fewer parameters than SOTA LLMs.

  • LFMs (Liquid Neural Networks): Liquid AI, an MIT spin-off, has introduced Liquid Foundation Models (LFMs), based on dynamic systems theory rather than discrete layers. LFMs use "Liquid" networks that adapt their underlying differential equations based on the input data stream. They utilize adaptive linear operators that allow the model's behavior to shift dynamically during inference. On benchmarks, a 1.3B parameter LFM outperforms Meta’s Llama 3.1-8B and Microsoft's Phi-3.5, effectively punching 6x above its weight class while using significantly less memory.
  • State Space Models (SSMs): A resurgence of recurrent ideas modernized for GPUs, developed by researchers at CMU and Princeton (most prominently Mamba). A selective state mechanism lets the model decide what to remember or forget in a continuous stream, similar to an RNN, yet training remains parallelizable via a parallel scan algorithm. The result is linear scaling in sequence length instead of the Transformer's quadratic attention cost (the core recurrence is shown after this list). RWKV (Receptance Weighted Key Value) takes a related approach: attention-free, recurrent at inference, parallelizable at training.
  • DisCIPL: A manager-worker framework rather than a new architecture; a large LLM plans the task and assigns small, well-scoped subtasks to smaller LLMs for efficiency.
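For intuition, the core linear recurrence behind state space models (in discretized form) is simply

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t$$

with selective variants such as Mamba making $\bar{B}$ and the discretization step input-dependent, which is what enables the remember-or-forget behavior described in the list above.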

7. Strategic Recommendations for Enterprise

The evidence suggests that 2025-27 will be the period in which the industry pivots from "Pre-training Scale" to "Inference Efficiency." Enterprises should adjust their AI strategies accordingly.

  1. Stop Chasing Parameters: Do not assume that the largest model is the best for every task. For logic, math, and specific reasoning workflows, test smaller, recursive models or reasoning-specialized models (like o1-mini or future TRM implementations).
  2. Adopt the "Router" Architecture: Do not use a sledgehammer to crack a nut. Deploy a routing layer that sends creative/knowledge tasks to a large LLM (Cloud) and logic/data tasks to a small, efficient model (Edge/Local). This "Hybrid AI" approach optimizes TCO (a minimal routing sketch follows this list).
  3. Invest in Edge AI: Prepare for a world where AI runs on the user's device. Investigate frameworks like DisCIPL or architectures like LFM that enable on-premise or on-device intelligence, securing data privacy and reducing latency.
  4. Value "Thinking Time": In user interfaces, move away from the expectation of instant text generation. Build UI paradigms that account for "reasoning pauses"—show the user that the model is thinking. The value is in the correctness of the answer, not the speed of the first token.
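A minimal sketch of the routing layer from recommendation 2, with hypothetical model endpoints and a naive keyword heuristic standing in for a real task classifier:

```python
def route_request(prompt: str, call_cloud_llm, call_local_model) -> str:
    """Send logic/data tasks to a small local model and open-ended
    creative/knowledge tasks to a large cloud LLM (illustrative sketch)."""
    reasoning_markers = ("calculate", "prove", "debug", "solve", "schedule", "reconcile")
    if any(marker in prompt.lower() for marker in reasoning_markers):
        return call_local_model(prompt)   # cheap, fast, can run at the edge
    return call_cloud_llm(prompt)         # broad knowledge, higher cost per call
```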

Conclusion: The Recursive Horizon

The "Scaling Laws" were not wrong, but they were incomplete. They described the scaling of knowledge retrieval via parameter count, but they failed to account for the scaling of reasoning via recursive compute. We have hit the wall of the former and are just discovering the vast potential of the latter.

Models like the Hierarchical Reasoning Model (HRM) and Tiny Recursion Model (TRM) are not merely interesting outliers; they are harbingers of a sustainable, efficient AI future. They demonstrate that intelligence is not solely an emergent property of massive size, but a result of structured, iterative processing. By moving reasoning into the latent space and leveraging recursion, we can break the energy curve, democratize access to high-level intelligence, and finally solve the "System 2" gap that has plagued deep learning.

The era of the Trillion-Parameter Dinosaur is ending. The era of the Recursive Agent has begun.

Key Takeaways

  1. Scaling Has Hit a Wall: Diminishing returns on parameter scaling and unsustainable energy costs (on the order of 1,287 MWh per large training run) render the "bigger is better" strategy obsolete for future growth.
  2. MoE is a Band-Aid: Mixture of Experts improves throughput but fails to solve memory bandwidth and routing collapse issues; it is a transitional technology, not a final solution.
  3. Recursion is the Key: Small models (HRM, TRM) using recursive loops and latent space reasoning can outperform massive models on logic tasks by "thinking" before generating.
  4. Efficiency Wins: Architectures like TRM (7M params) and LFM (1.3B params) offer 100x-1000x efficiency gains, enabling true edge AI and privacy.

  5. Shift to Inference: Value is moving from pre-training compute (building the brain) to test-time compute (using the brain). The new scaling law is "Compute per Thought".
