The Mathematics of World Models
While LLMs have made massive strides, they remain hindered by high training costs and a lack of grounded, experiential reasoning. This has sparked a surge in World Model research—specifically Abstract World Models—which leverage advanced geometric and mathematical frameworks to bridge the gap between statistical prediction and true autonomous reasoning.
- 🎯 Overview
- 🌐 World Models
  - Standard World Models
  - Probabilistic World Models
  - Latent Manifold World Models
- 🧮 Latent Manifold Models
  - Overview
  - Latent Representation
  - Group-structured Latent Space Model
  - Joint-Embedding Prediction Architecture
- 📐 Mathematics of Abstract World Models
  - 🔗 Graph Theory
  - 🍩 Algebraic Topology
  - 🌀 Differential Geometry
- 📘 References
👉 The full article is available on the Substack article The Mathematics of Abstract World Models
🎯 Overview
Although Large Language Models (LLMs) have advanced significantly in their reasoning and multimodal capabilities, they still face several fundamental limitations that prevent them from being fully autonomous or universally reliable.
Interest in world models has been growing in recent years to address LLMs' shortcomings: noisy, uneven, and in some cases scarce data; lack of interpretability; lack of experiential knowledge of the observed world; and costly training. This article delves into Abstract World Models and their reliance on advanced geometric and mathematical concepts to deliver true reasoning capabilities.
This article introduces the World Model principles, highlighting the shift toward Abstract architectures. We examine established models to illustrate how Differential Geometry and Topology provide the necessary geometric structure for sophisticated latent-space reasoning.
🌐 World Models
World model learning can capture meaningful representations by embedding complex, high-dimensional data into lower-dimensional abstract spaces that reside in the latent space [ref 1, 2, 3].
📌 Contrary to the popular narrative that World Models are a brand-new breakthrough, the field has actually been evolving since 2016—and arguably even earlier.
There are many ways to categorize world models as illustrated in the diagram below.
This review focuses on three primary architectural frameworks for world models:
⚠️ This classification of world models, though somewhat arbitrary, is intended to highlight the underlying relationship between these architectures and the principles of Differential Geometry, Graph Theory, and Topology. The following list of world models is far from exhaustive!
Standard World Models
Introduction
The standard (generative) World Model is a neural network designed to understand and simulate the dynamics of the observed world, including its physical and spatial properties:
We describe two examples of generative world model architectures:
Generative World Models
Introduced by Ha & Schmidhuber [ref 4] and often cited as the catalyst for the modern interest in this field, this model decomposes the problem into three distinct components managed/orchestrated by an agent: a Vision component (a variational autoencoder that compresses each frame into a latent vector), a Memory component (a recurrent network that predicts the next latent state), and a compact Controller that selects actions. A minimal sketch of this agent loop is shown below.
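As a rough illustration only, here is a minimal Python sketch of that Vision-Memory-Controller loop; the class names and placeholder computations are hypothetical stand-ins, not the authors' actual implementation.

```python
import numpy as np

class VisionVAE:
    """V: compresses a raw frame into a low-dimensional latent vector z (hypothetical stub)."""
    def encode(self, frame):
        return frame.reshape(-1)[:32]            # placeholder for a trained VAE encoder

class MemoryRNN:
    """M: recurrent model updating its hidden state from (z, action, h)."""
    def step(self, z, action, h):
        inp = np.concatenate([z, [action]])[:h.shape[0]]
        return 0.9 * h + 0.1 * inp               # placeholder recurrent dynamics

class Controller:
    """C: small policy mapping (z, h) to an action."""
    def act(self, z, h):
        return float(np.tanh(z.mean() + h.mean()))  # placeholder policy

# One step of the agent loop: observe -> encode (V) -> act (C) -> update memory (M)
vae, rnn, ctrl = VisionVAE(), MemoryRNN(), Controller()
h = np.zeros(32)
frame = np.random.rand(64, 64)                   # stand-in for an observed frame
z = vae.encode(frame)
a = ctrl.act(z, h)
h = rnn.step(z, a, h)
```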
OpenAI’s Sora
Although primarily discussed as a video generation tool, OpenAI describes Sora as a “world simulator” [ref 5].
Probabilistic World Models
This model was proposed by Yoshua Bengio [ref 6]. Unlike the JEPA approach, which focuses on latent consistency, or Fei-Fei Li's Spatial Intelligence, which focuses on 3D geometric realism, Bengio's model focuses on epistemic uncertainty: the model's ability to know what it does not know.
This is a Bayesian probabilistic model that forms hypotheses about the world and estimates the posterior probabilities of those hypotheses using prior probabilities and Bayesian inference from evidence. The world model can encode strong priors for the known laws of physics, geometry, and so on, as illustrated in the toy sketch below.
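To make the Bayesian update concrete, here is a minimal, self-contained Python sketch; it is a generic illustration of Bayes' rule over competing hypotheses, not Bengio's actual formulation, and all names in it are hypothetical.

```python
import numpy as np

# Competing hypotheses about a latent world parameter (here, a coin's bias)
hypotheses = np.array([0.3, 0.5, 0.7])   # P(heads) under each hypothesis
prior      = np.array([0.2, 0.6, 0.2])   # strong prior belief in a fair coin

def bayes_update(belief, hypotheses, observation):
    """One Bayesian update: P(h|e) is proportional to P(e|h) * P(h)."""
    likelihood = np.where(observation == 1, hypotheses, 1.0 - hypotheses)
    unnormalized = likelihood * belief
    return unnormalized / unnormalized.sum()

belief = prior
for obs in [1, 1, 0, 1, 1]:              # stream of evidence (1 = heads)
    belief = bayes_update(belief, hypotheses, obs)

print(belief)  # posterior mass shifts toward the 0.7-bias hypothesis
```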
📌 It is reasonable to categorize Bengio's Bayesian World Model as an abstract world model. I treat it as a separate category because of its unique focus on safety and causality.
Latent Manifold World Models
Introduction
Latent Manifold World Models, sometimes referred to as Abstract World Models, are structured representations that purposely omit unnecessary details to focus on key patterns, symmetries, and causal structures.
Let's introduce three well-known Abstract World Models:
Joint Embedding Prediction Architecture (JEPA)
Unlike generative models that rely on observation-level reconstruction, Joint Embedding Predictive Architectures (JEPAs), introduced by Yann LeCun, optimize for representation consistency across multiple data views in the latent manifold [ref 7]. This approach avoids the computational burden of density estimation, enabling the encoder to capture abstract, task-invariant features. By operating independently of raw input constraints, the architecture offers superior flexibility in feature encoding. JEPA embeds features that are useful for predicting the next state while discarding unpredictable features.
JEPAs are particularly relevant in processing geometric priors and manifolds because they treat the latent space as the primary arena for learning.
📌 The earliest and most common JEPA models are purely self-supervised and do not use Reinforcement Learning (RL) during pre-training. However, some of the latest variants explicitly integrate RL.
Group-structured Latent Spaces
Geometric priors ensure that an agent’s internal representation of states and actions respects the underlying symmetry of its environment. Because a Markov Decision Process typically consists of a mix of symmetric and non-symmetric elements, embedding these as structural priors within the latent space allows the agent to generalize more efficiently from limited experience.
Its two main characteristics are the invariance and equivariance of the latent representation, detailed in the Latent Representation section below.
Einsteinian World Model
This model explicitly learns globally consistent solutions using the Space-Time manifold.
Spatial Intelligence
Spatial Intelligence models, developed by Dr. Fei-Fei Li, focus on the understanding of 3D space and time [ref 8]. Unlike LLMs that process tokens, Spatial Intelligence models build persistent 3D worlds to interact with the physical world. These simulated worlds capture the geometric structure of the observable world, represented in the latent space. Therefore these spatial models are inherently multimodal.
These models incorporate geometry and physical laws to predict an action and the resulting next state. The latent space is often structured as a 3D/4D occupancy grid or a neural radiance field (NeRF). It isn't just a vector of numbers.
Conceptually, Spatial Intelligence is an abstract world model because it aims to provide the "scaffolding" for cognition. Like the JEPA family, it seeks to move beyond pixel-level correlations toward a deep understanding of physical laws, causality, and geometry.
👉 For the remainder of this article, we focus on Abstract World Models, and more specifically on JEPA and the Group-structured Latent Space Model, to illustrate the role of advanced mathematics.
🧮 Latent Manifold Models
Overview
There is currently no universal, strictly standardized definition of an “abstract world model” in the AI research community. While the term is widely used, its boundaries shift depending on whether the researcher focuses on generative reconstruction or latent prediction.
However, the community generally agrees on several core functional pillars that characterize these architectures:
Core Functional Pillars
To understand the shift toward Abstract World Models, we must contrast them with traditional deep learning and standard Transformer architectures. As the following table illustrates, the fundamental divergence lies in their geometric assumptions—moving from flat Euclidean spaces to smooth or discrete manifolds—and the underlying mathematical frameworks that govern their representations.
Latent Representation
Let's consider a bouncing ball. Each video frame consists of millions of pixel values changing every millisecond. However, the underlying attributes are its position (geometry) and its velocity and gravity (physical laws). The latent space is the mathematical space that maps the pixels onto these underlying attributes.
One established fact is that entities, and the transformations/operations on the latent space, are critical to abstract world models. The abstract world model replicates and compresses the observed world (data, physical laws, geometric constraints, prior knowledge, ...) into the latent space along with the target. Therefore the state of the 'world' is predicted as a latent variable, and the loss/objective is computed in a low-dimensional manifold.
📌 A significant number of world models enforce geometric priors in the latent space through a regularization term in the loss function during training; a sketch of two such penalty terms follows the list below.
- Invariance constraint: the latent representation should remain unchanged under nuisance transformations of the input (e.g., viewpoint or lighting changes that leave the underlying state intact).
- Equivariance constraint: a transformation of the input should induce a corresponding, predictable transformation of the latent state (e.g., rotating the scene rotates the latent pose).
The decoder reconstructs or renders the updated latent state back to the observed world.
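Purely as an illustration of how such priors can appear in a training objective, here is a generic Python sketch with toy transformations; it is not any specific model's loss, and the encoder and group actions are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4, 4))      # toy linear "encoder" weights (hypothetical)

def encode(x):
    return W @ x                        # stand-in for a learned encoder z = f(x)

def transform_input(x):
    return np.roll(x, 1)                # toy input transformation g . x

def transform_latent(z):
    return np.roll(z, 1)                # toy latent group action rho(g) . z

x = rng.normal(size=4)
z = encode(x)
z_aug = encode(x + 0.01 * rng.normal(size=4))   # augmented view of the same state

# Invariance penalty: two views of the same state should share one latent point
inv_loss = np.sum((z - z_aug) ** 2)

# Equivariance penalty: encoding a transformed input should match the
# corresponding transformation applied directly in the latent space
eqv_loss = np.sum((encode(transform_input(x)) - transform_latent(z)) ** 2)

total_loss = inv_loss + 0.5 * eqv_loss  # added to the main prediction loss
```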
How do we define the latent space in this context? It is best understood through three distinct lenses:
⚠️ When learning state representations, the most significant obstacle is the collapse of the latent space. This occurs when the encoder maps distinct data points ($x$) to an identical or nearly identical embedding ($z$), causing the model to lose the ability to differentiate between unique inputs. Furthermore, even if the embeddings are distinct, the model may fail to properly preserve the geometric distances between states during the prediction process, leading to inaccurate simulations of reality.
Group-structured Latent Space Models
World Models are advanced AI systems that learn the rules of reality (physics, cause and effect) from data such as videos to simulate and predict dynamic 3D environments, enabling more robust reasoning, planning, and creation beyond static content [ref 9].
World Models go beyond Large Language Models (LLMs) by understanding spatial relationships and physical interactions, allowing AI to generate immersive, interactive worlds for robotics, design, medicine, and complex problem-solving, representing a significant leap towards human-level intelligence.
🤖 For math-minded readers
Let’s consider the Markov Decision Process (MDP) fully defined by a state space S, an action space A, a reward function R and a state transition T.
A self-supervised world model is defined in a latent space Z with the following learnable components:
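A common formulation, stated here as a sketch rather than a canonical definition (the symbols are my notation, consistent with the MDP defined above), introduces an encoder, a latent transition model, and a latent reward model:

```latex
% Notation is a reconstruction consistent with the MDP (S, A, R, T) above
\begin{aligned}
&\text{Encoder:} && \phi_\theta : \mathcal{S} \rightarrow \mathcal{Z}, \quad z_t = \phi_\theta(s_t) \\
&\text{Latent transition:} && \hat{T}_\psi : \mathcal{Z} \times \mathcal{A} \rightarrow \mathcal{Z}, \quad \hat{z}_{t+1} = \hat{T}_\psi(z_t, a_t) \\
&\text{Latent reward:} && \hat{R}_\omega : \mathcal{Z} \times \mathcal{A} \rightarrow \mathbb{R}, \quad \hat{r}_t = \hat{R}_\omega(z_t, a_t) \\
&\text{Objective:} && \min_{\theta,\psi,\omega} \; \mathbb{E}\big[ \lVert \hat{z}_{t+1} - \phi_\theta(s_{t+1}) \rVert_2^2 \big]
\end{aligned}
```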
JEPA
JEPA’s approach avoids the computational burden of density estimation, enabling the encoder to capture abstract, task-invariant features [ref 10, 11].
📌 The original JEPA framework has evolved into a diverse ecosystem of specialized architectures. Notable variants include V-JEPA for video, LeJEPA, and Hierarchical JEPA, alongside targeted iterations such as Multi-Resolution, Cell-JEPA, and Lp-JEPA.
The following mathematical formalism reflects the original definition of JEPA.
🤖 For math-minded readers
In JEPA and its video-based extension (V-JEPA), the core mathematical framework is energy-based. The energy is minimized in the latent space and measures the L2 distance between the predicted latent representation and the target latent representation.
Let's look at the predictor model. It simply generates an estimate of the target latent state from the contextual features $s_x$ and the latent variable $z$:
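In notation close to the original JEPA papers (a paraphrase, not a verbatim reproduction):

```latex
% Paraphrase of the original JEPA formulation
\begin{aligned}
s_x &= f_\theta(x) && \text{context encoder} \\
s_y &= f_{\bar{\theta}}(y) && \text{target encoder} \\
\hat{s}_y &= g_\phi(s_x, z) && \text{predictor with latent variable } z \\
E(x, y, z) &= \lVert \hat{s}_y - s_y \rVert_2^2 && \text{energy minimized in the latent space}
\end{aligned}
```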
JEPA uses variance-based regularization as an architectural constraint to prevent representation collapse [ref 11, 12].
📌 Traditionally, self-supervised deep learning models would use contrastive learning.
The loss, known as the Variance-Invariance-Covariance Regularization (VICReg) component, is expressed as:
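Following the published VICReg definition, where $\gamma$ is a target standard deviation, $\epsilon$ a small constant, and $Z, Z'$ the two embedding batches:

```latex
% Standard VICReg terms over batches Z, Z' of n embeddings of dimension d
\begin{aligned}
\text{Invariance:} \quad & s(Z, Z') = \frac{1}{n} \sum_{i=1}^{n} \lVert z_i - z'_i \rVert_2^2 \\
\text{Variance:} \quad & v(Z) = \frac{1}{d} \sum_{j=1}^{d} \max\!\left(0,\; \gamma - \sqrt{\operatorname{Var}\!\big(z^{(j)}\big) + \epsilon}\right) \\
\text{Covariance:} \quad & c(Z) = \frac{1}{d} \sum_{i \neq j} \left[ C(Z) \right]_{i,j}^2 \\
\text{Regularizer:} \quad & \mathcal{L}_{\mathrm{VIC}} = \lambda\, s(Z, Z') + \mu \left[ v(Z) + v(Z') \right] + \nu \left[ c(Z) + c(Z') \right]
\end{aligned}
```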
The total loss is then computed as the sum of the prediction (energy) term and the regularization term:
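A plausible form, consistent with the energy defined above (exact weighting varies across JEPA variants):

```latex
% Reconstruction consistent with the energy and VICReg terms above
\mathcal{L}_{\mathrm{total}} = \lVert \hat{s}_y - s_y \rVert_2^2 + \mathcal{L}_{\mathrm{VIC}}
```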
The weights of the target encoder are computed using a simple exponential moving average (EMA) of the context-encoder weights $\theta$:
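With momentum coefficient $\tau$, typically close to 1:

```latex
% Standard EMA target-encoder update
\bar{\theta} \; \leftarrow \; \tau \, \bar{\theta} + (1 - \tau) \, \theta
```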
The following diagram illustrates the three top-level components of an abstract world model: encoder, latent space, and decoder, with the associated energy minimization and actuation policy.
📌 Beyond JEPA and Markovian frameworks, the landscape of abstract world models includes several specialized paradigms, such as Equivariant World Models [ref 12], Symplectic World Models [ref 13], and Spacetime World Manifolds [ref 14]. A technical overview of these alternatives is provided in the Appendix.
📐 Mathematics of Abstract World Models
The latent space serves as the functional arena for prediction and planning, relying on structures like graphs, manifolds, and simplicial complexes. Understanding the underlying mathematics—specifically Differential Geometry for local curvature, invariance and equivariance, Algebraic Topology for global connectivity, and Group Theory for symmetry—is vital for anyone looking to build or optimize models within these non-Euclidean spaces.
📌 This section examines abstract world models that utilize smooth or discrete manifolds to represent the world within a latent space. While standard world models typically employ reinforcement learning techniques—such as Policy Optimization or Q-Learning—directly on these latent representations, this framework explores the geometric properties of the underlying manifolds.
In world models, the shift from processing pixels to understanding/modeling world states relies heavily on specific structures from differential geometry, topology, and graph theory.
The following diagram illustrates the relation between latent-space functions and the underlying mathematical fields required for their implementation.
Differential geometry, topology, category theory, and graph theory are the underpinnings of geometric deep learning and are described in detail in a previous article, Demystifying the Math of Geometric Deep Learning.
Here is a summary of these fields.
🔗 Graph Theory
Graph theory concerns mathematical graphs used to encode pairwise relations. A graph comprises vertices and edges; a digraph differs from an undirected graph in that each edge is oriented [ref 15].
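As a toy illustration of how a digraph can encode latent-state transitions (a generic sketch using networkx; the states and actions are made up, not taken from any specific model):

```python
import networkx as nx

# Directed graph: vertices are abstract world states, edges are actions
G = nx.DiGraph()
G.add_edge("ball_high", "ball_low", action="fall")
G.add_edge("ball_low", "ball_high", action="bounce")
G.add_edge("ball_low", "ball_rest", action="dampen")

# Planning reduces to path queries over the transition structure
path = nx.shortest_path(G, "ball_high", "ball_rest")
print(path)                        # ['ball_high', 'ball_low', 'ball_rest']
print(G["ball_low"]["ball_rest"])  # {'action': 'dampen'}
```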
Here are basic concepts in graph theory that are applicable to Abstract World Models:
🍩 Algebraic Topology
General Topology and Algebraic Topology in particular are the foundation of Topological Deep Learning [ref 16].
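For a flavor of the algebraic-topology viewpoint, here is a small sketch computing the first Betti number (the number of independent cycles) of a state-transition graph, using only the standard graph identity b1 = |E| - |V| + #components; the graph itself is a made-up example:

```python
import networkx as nx

# Undirected skeleton of a latent state-transition structure (toy example)
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"),   # one loop
                  ("c", "d"), ("d", "e")])              # a tree branch

# First Betti number of a graph: b1 = |E| - |V| + number of connected components
b1 = G.number_of_edges() - G.number_of_nodes() + nx.number_connected_components(G)
print(b1)  # 1 -> the latent structure contains one independent cycle (e.g., a periodic orbit)
```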
Here are basic concepts in topology that are applicable to Abstract World Models:
🌀 Differential Geometry
Differential geometry studies smooth spaces (manifolds) by doing calculus on them, using tools like tangent spaces, vector fields, and differential forms. It formalizes intrinsic notions such as curvature, geodesics, and connections—foundations for modern physics and for algorithms in graphics, robotics, and machine learning on curved data [ref 17].
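To make the manifold intuition concrete, this small sketch (an illustration only, not drawn from any specific world model) compares the straight-line Euclidean distance between two latent points on a unit sphere with their geodesic distance along the manifold; models that ignore curvature implicitly use the former when the latter is the meaningful one:

```python
import numpy as np

def geodesic_distance(u, v):
    """Great-circle distance between two points on the unit sphere."""
    cos_angle = np.clip(np.dot(u, v), -1.0, 1.0)
    return np.arccos(cos_angle)

# Two latent states constrained to a spherical manifold (toy example)
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])

euclidean = np.linalg.norm(u - v)    # chordal distance: ~1.414
geodesic  = geodesic_distance(u, v)  # distance along the manifold: pi/2 ~ 1.571
print(euclidean, geodesic)
```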
Here are basic concepts in differential geometry that are applicable to Abstract World Models:
👉 Applicability, Key Takeaways, Q&A and additional world models are available in the original article The Mathematics of Abstract World Models
📘 References
👉 Share in the comments the next topic you’d like me to tackle.
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design, and end-to-end deployment and support, with extensive knowledge in machine learning. He has been director of data engineering at Aideo Technologies since 2017, and he is the author of "Scala for Machine Learning" (Packt Publishing, ISBN 978-1-78712-238-3) and of the Hands-on Geometric Deep Learning newsletter.
The original article can be found at https://tinyurl.com/3bx769nf