🚀 Inside the GPU: Threads, Warps, Blocks & Memory

(What Actually Happens When a GPU Kernel Runs)

(Figure: GPU grid, blocks, and threads)

At this point in the series, we’ve discussed:

  • why modern simulation requires parallel computing
  • how GPUs accelerate large-scale workloads
  • and what makes a problem GPU-friendly

Now let’s go one layer deeper:

What actually happens inside the GPU when a kernel runs?

This is where GPU programming stops being abstract and becomes architecture-aware engineering.

🧠 The Mental Model Most People Miss

Many people think:

“A GPU is just a CPU with more cores.”

That’s not really accurate.

A modern GPU is designed around:

  • massive concurrency
  • throughput optimization
  • and latency hiding through warp scheduling

The architecture is fundamentally different from a CPU.

⚙️ The Hierarchy: Grid → Block → Warp → Thread

At a high level, CUDA-style execution works like this:

  • Grid → all blocks launched by a single kernel call
  • Block → a group of threads that can cooperate
  • Warp → a group of 32 threads executed in lockstep
  • Thread → the smallest execution unit visible to the programmer
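As a rough sketch of how this hierarchy shows up in code (the kernel name, sizes, and data are illustrative, not from a specific application), each thread derives a unique global index from its block and thread coordinates:

```cuda
// Hedged sketch: grid → block → thread in a minimal CUDA launch.
__global__ void scale(float *data, int n) {
    // Each thread computes its unique global index from the hierarchy.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // one thread per element
}

void launch(float *d_data, int n) {
    int threadsPerBlock = 256;    // 8 warps of 32 threads per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, n);   // grid = all blocks
}
```

The launch configuration (blocks × threads per block) is the grid; the hardware then carves each block into warps behind the scenes.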

🔁 Threads: The Basic Workers

Each thread:

  • executes the kernel code
  • operates on a small piece of data
  • has its own registers and local variables

Example:

  • one thread per pixel
  • one thread per grid cell
  • one thread per field sample

🧩 Blocks: Cooperative Thread Groups

Threads are grouped into blocks.

Why blocks matter:

  • threads within a block can cooperate
  • they can synchronize
  • and they can share fast on-chip memory

This shared memory is critical for performance optimization.
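A minimal sketch of block-level cooperation (a 1D moving average; the kernel name and halo handling are illustrative assumptions): threads stage data into shared memory, synchronize, then read their neighbors' values from the fast on-chip tile instead of global memory.

```cuda
#define BLOCK 256

// Hedged sketch: threads in a block stage data through shared memory.
__global__ void smooth(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK + 2];       // block-wide on-chip buffer (+halo)
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element; edge threads also load the halo cells.
    tile[threadIdx.x + 1] = (i < n) ? in[i] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (i > 0) ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[BLOCK + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;

    __syncthreads();                        // block-wide barrier

    if (i < n)
        out[i] = (tile[threadIdx.x] + tile[threadIdx.x + 1]
                  + tile[threadIdx.x + 2]) / 3.0f;
}
```

Without the tile, each output would trigger three global-memory reads; with it, each input element is fetched from global memory roughly once per block.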

⚡ Warps: The Real Execution Unit

Here’s the key insight:

GPUs do not actually execute individual threads independently.

Threads are grouped into: 👉 warps (typically 32 threads)

A warp executes:

  • the same instruction
  • at the same time
  • across multiple data elements

This is called: 👉 SIMT (Single Instruction, Multiple Threads)
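SIMT is easiest to see with a warp-level primitive. In this sketch (a standard warp-sum reduction pattern), all 32 lanes execute each `__shfl_down_sync` together, exchanging registers directly without touching memory:

```cuda
// Hedged sketch: SIMT in action — every lane of the warp executes the
// same shuffle instruction at the same time, halving the distance each step.
__device__ float warp_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;   // lane 0 now holds the sum of all 32 lanes
}
```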

⚠️ Warp Divergence (A Major Performance Issue)

Suppose half the threads take one branch:

    if (condition) {
        do_A();
    } else {
        do_B();
    }

Now the warp must:

  • execute path A with the B-side threads masked off
  • then execute path B with the A-side threads masked off

👉 Parallel efficiency drops: in the worst case, the warp pays for both paths.

This is called:

warp divergence

GPU-friendly kernels:

  • minimize branching
  • keep warp execution aligned
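One common fix, sketched below with illustrative kernels (the specific operation is an assumption): when both branches compute something simple, replace control flow with a per-thread select so the whole warp runs one instruction stream.

```cuda
// Divergent version: lanes that disagree serialize paths A and B.
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] > 0.0f) x[i] = x[i] * 2.0f;   // path A
        else             x[i] = x[i] * 0.5f;   // path B
    }
}

// Branch-free version: pick the scale factor arithmetically, so the
// warp executes a single instruction stream (typically a select).
__global__ void uniform(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float s = (x[i] > 0.0f) ? 2.0f : 0.5f;
        x[i] *= s;
    }
}
```

This only pays off when both sides are cheap; for heavyweight branches, sorting or partitioning the data so whole warps take the same path is usually the better strategy.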

💾 Memory Hierarchy (This Is Where Performance Is Won)

Modern GPUs have multiple memory layers:

Registers

  • fastest memory
  • private to each thread


Shared Memory

  • fast on-chip memory
  • shared within a block

Used for:

  • tiling
  • caching reused data
  • reducing global memory traffic

Global Memory

  • large but relatively slow
  • accessible by all threads

👉 Most GPU kernels are actually limited by:

memory bandwidth, not compute

🔥 Why Memory Access Patterns Matter

Best case:

  • neighboring threads access neighboring addresses
  • accesses become coalesced

Worst case:

  • random scattered access
  • inefficient transactions
  • bandwidth bottlenecks

This is why:

  • data layout
  • memory alignment
  • access ordering

matter so much in GPU performance engineering.
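The classic example of layout driving coalescing is array-of-structs versus struct-of-arrays. In this sketch (the `Particle` struct and update are illustrative), the same logical update produces very different memory traffic:

```cuda
// Array-of-structs: thread i's x-coordinate sits 16 bytes from thread
// (i+1)'s, so a warp's 32 reads are strided and poorly coalesced.
struct Particle { float x, y, z, m; };

__global__ void aos(Particle *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x += 1.0f;
}

// Struct-of-arrays: 32 neighboring threads read 32 adjacent floats,
// which the hardware can service as a few coalesced transactions.
__global__ void soa(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}
```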

🚀 Occupancy: Keeping the GPU Busy

GPUs hide latency by switching between warps rapidly.

The goal: 👉 keep enough warps active so computation never stalls.

But occupancy is limited by:

  • registers per thread
  • shared memory usage
  • thread count
  • warp limits per SM (Streaming Multiprocessor)

This is why:

Higher occupancy does not always mean higher performance.

Sometimes:

  • fewer threads
  • but better memory access

👉 runs faster.
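Rather than hand-tuning, the CUDA runtime can suggest a block size that balances occupancy against per-thread resources. A minimal sketch using `cudaOccupancyMaxPotentialBlockSize` (the kernel itself is illustrative):

```cuda
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Hedged sketch: let the runtime pick a block size for this kernel,
// given its register and shared-memory footprint.
void launch(float *d_x, int n) {
    int minGrid = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGrid, &blockSize, scale, 0, 0);
    int grid = (n + blockSize - 1) / blockSize;
    scale<<<grid, blockSize>>>(d_x, n);
}
```

Treat the suggestion as a starting point: profiling sometimes shows that a lower-occupancy configuration with better memory behavior wins.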

🧠 The Bigger Insight

At a deeper level, GPU optimization is not really about “coding.”

It’s about:

  • mapping computation onto hardware efficiently
  • minimizing memory bottlenecks
  • structuring execution to maximize throughput

🩺 From Weather Models to Medical Imaging

The same architecture principles appear everywhere:

  • weather simulation
  • medical image reconstruction
  • CFD
  • electromagnetic solvers
  • AI inference systems

Different applications.

Same underlying hardware realities.

🔭 Looking Ahead

In the next article, I’ll walk through:

➡️ why many GPU implementations underperform

➡️ common optimization mistakes

➡️ and how memory bandwidth often becomes the real bottleneck

💬 Final Thought

GPU acceleration is not just about launching more threads.

It’s about understanding how computation, memory, and execution interact at the hardware level.

That’s where real performance engineering begins.


💬 Working on GPU Acceleration or Large-Scale Simulation?

If you're dealing with:

  • GPU performance bottlenecks
  • parallelizing large-scale simulations
  • medical imaging or physics-based modeling challenges
  • or trying to determine whether a GPU approach is even the right fit

I work with teams to:

  • map complex problems onto parallel architectures
  • identify performance limits (memory, occupancy, data layout)
  • and design efficient, scalable solutions

Feel free to reach out if you’d like to discuss your project.

Dr. Frank Underdown Jr.

HPC | Electromagnetics | AI Systems

Founder, Keweenaw Nanoscience Center

Physics-Based Simulation | GPU Acceleration

www.keweenawnano.com
