The Architecture of Intelligence: From Attention to Nested Learning
Eight years ago, a team at Google published "Attention Is All You Need." That paper sparked the AI revolution we're living through: ChatGPT, Claude, and every other major language model trace their lineage directly to it. Now, Google has published research on Nested Learning that may mark the next fundamental step. These aren't just technical milestones. They're windows into how machines are becoming better mirrors of human intelligence.
The Attention Revolution (2017)
Before transformers, AI language models processed text like reading through a straw: one word at a time, struggling to remember what had appeared a few sentences earlier. The attention paper introduced a mechanism that changed everything: self-attention.
Instead of processing words sequentially, what if a model could consider relationships between all words simultaneously? When you read "The cat sat on the mat because it was tired," your brain effortlessly knows "it" refers to the cat. Self-attention gives AI this same capability—each word can "attend to" every other word, weighing their relevance.
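To make that concrete, here is a minimal sketch of scaled dot-product self-attention in NumPy. The matrices and dimensions are toy values chosen purely for illustration; real transformers add multiple attention heads, masking, positional information, and projections learned from data at scale.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of word vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project each word into query/key/value space
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # how relevant is every word to every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: attention weights per word
    return weights @ V                           # each output is a relevance-weighted mix of all words

# toy example: a "sentence" of 7 words, each an 8-dimensional vector
rng = np.random.default_rng(0)
X = rng.normal(size=(7, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (7, 8): one context-aware vector per word
```

Every row of the output already "knows about" every other word in the sequence, which is exactly the capability the 2017 paper introduced.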
This mirrors how your brain processes language. You don't understand sentences word-by-word in isolation. Comprehension emerges from dynamic relationships—context, syntax, meaning interweaving at once. The transformer architecture enabled models to scale massively, producing the large language models that now write code, draft emails, and engage in nuanced conversation.
The Continual Learning Problem
But transformers, like the neural networks that came before them, have a fundamental limitation: catastrophic forgetting. When you learn to ride a bicycle, you don't forget how to walk. Human learning is additive: new knowledge integrates without destroying existing knowledge.
Current AI models don't work this way. Train a model on new data, and it "forgets" what it previously knew. This is why training requires presenting all data together in massive, expensive runs. The model can't simply add new knowledge—it must relearn everything.
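A toy demonstration of the effect, sketched in PyTorch with an invented pair of tasks: fit a small network to one function, fine-tune it on a second, and its error on the first climbs right back up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

x = torch.linspace(-3, 3, 200).unsqueeze(1)
task_a, task_b = torch.sin(x), torch.cos(x)   # two toy "skills"

def fit(target, steps=2000):
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(net(x), target).backward()
        opt.step()

fit(task_a)
err_a_before = loss_fn(net(x), task_a).item()   # low: task A learned
fit(task_b)                                     # plain fine-tuning on the new task
err_a_after = loss_fn(net(x), task_a).item()    # much higher: task A was overwritten
print(f"task A error: {err_a_before:.4f} -> {err_a_after:.4f}")
```

Nothing in ordinary gradient descent protects the old weights, so learning task B simply writes over whatever encoded task A.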
Nested Learning: A New Paradigm (2025)
Google's Nested Learning tackles this directly, introducing an architecture where learning happens at multiple nested timescales—remarkably similar to how human memory works. Instead of a single learning process, nested learning creates hierarchical loops operating at different frequencies. Fast, high-frequency learning adapts to immediate inputs. Slower, low-frequency learning consolidates stable patterns. These loops are nested—faster loops operate within the context of slower ones.
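As a rough illustration only, and not the paper's actual algorithm (the parameters FAST_LR, SLOW_LR, and CONSOLIDATE_EVERY are made up for this sketch), the nesting can be pictured as an inner loop that adapts on every input and an outer loop that consolidates at a tenth of that frequency:

```python
import numpy as np

rng = np.random.default_rng(0)
slow = np.zeros(4)                    # consolidated, low-frequency knowledge
fast = slow.copy()                    # rapidly adapting, high-frequency copy
FAST_LR, SLOW_LR, CONSOLIDATE_EVERY = 0.5, 0.1, 10

def grad(params, batch):
    return params - batch             # toy gradient: pull params toward the current batch

for step in range(1, 101):
    batch = rng.normal(loc=1.0, size=4)        # stream of incoming data
    fast -= FAST_LR * grad(fast, batch)        # inner loop: adapt immediately to this input
    if step % CONSOLIDATE_EVERY == 0:          # outer loop: runs at a fraction of the frequency
        slow += SLOW_LR * (fast - slow)        # fold the stable part of what fast learned into slow
        fast = slow.copy()                     # re-anchor the fast learner on consolidated memory

print("slow parameters drift toward the stable mean of the stream:", slow.round(2))
```

The fast variables chase each new input; the slow variables only absorb what persists across many of them, so transient noise never reaches long-term storage.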
This mirrors how your brain operates on multiple simultaneous rhythms. Your brain produces distinct oscillation patterns—gamma waves (30-100 Hz) handle rapid processing like attention. Beta waves (12-30 Hz) manage active thinking. Alpha waves (8-12 Hz) coordinate relaxed awareness. Theta waves (4-8 Hz) support memory encoding. Delta waves (0.5-4 Hz) govern deep restoration and long-term consolidation.
Critically, these aren't independent systems. They're nested—faster oscillations ride on slower ones, like ripples on waves on swells. No single centralized clock synchronizes everything. Each layer operates at its own timescale.
Nested Learning implements this architecture artificially. Different network levels update at different frequencies—earlier layers adapt quickly to capture immediate patterns, while later layers integrate information over slower cycles. This mirrors neuroplasticity, the brain's ability to rewire itself continuously.
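One hypothetical way to express "different levels update at different frequencies" in PyTorch (the layer sizes and schedule below are illustrative, not taken from the paper): gradients flow through every layer on every step, but each layer's weights are only stepped on its own clock.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layers = nn.ModuleList([nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 1)])
update_every = [1, 4, 16]   # earlier layers step every iteration; deeper layers on slower cycles
opts = [torch.optim.SGD(layer.parameters(), lr=0.05) for layer in layers]

def forward(x):
    for layer in layers[:-1]:
        x = torch.tanh(layer(x))
    return layers[-1](x)

for step in range(1, 65):
    x = torch.randn(32, 8)
    loss = (forward(x) - x.sum(dim=1, keepdim=True)).pow(2).mean()
    loss.backward()                                    # gradients accumulate in every layer
    for layer_opt, period in zip(opts, update_every):
        if step % period == 0:                         # ...but each layer steps on its own schedule
            layer_opt.step()
            layer_opt.zero_grad()
```

Between its updates, a slow layer simply accumulates gradients, which serves here as a stand-in for "integrating information over a slower cycle."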
From Lab to Laptop: The Democratization of AI
Perhaps most remarkable is how quickly breakthroughs move from expensive proprietary research to tools anyone can run on a laptop. The transformer architecture was published in 2017. Within two years, open implementations proliferated. Today, sophisticated transformer models run on consumer hardware. The pattern repeats with striking consistency.
OpenAI recently released GPT-OSS, a family of open-weight models anyone can download and run. What required a data center yesterday runs on a MacBook today. Implementations like nested_learning are already appearing on GitHub, with code that aims to reproduce the Google paper. This democratization cycle accelerates with each generation. The gap between "breakthrough" and "accessible" keeps shrinking. Today's proprietary advantage is tomorrow's "pip install".
Why This Matters
The trajectory from attention to nested learning reveals something profound: progress often comes from understanding biological intelligence. Attention succeeded because it captured something true about how brains process information. Nested learning may succeed because it captures something true about how brains learn—accumulating knowledge over time without starting over.
Stay awake.