Writing for the Hardware: How Cache-Friendly Data Structures Boost Performance
One of the biggest bottlenecks in modern computers is the gap between CPU speed and main-memory latency: a core can execute hundreds of instructions in the time it takes to complete a single read from RAM. Hardware designers bridge this gap with caches, but it's our job as engineers to write 'cache-friendly' code that uses them effectively. That means organizing data to match the access patterns the hardware rewards, a crucial technique in any high-performance domain.
The Two Types of Cache Locality
Caches operate on a simple principle called locality of reference, which comes in two flavors.
Temporal Locality: This is the principle that if a program accesses a particular memory location, it is highly likely to access that same location again in the near future. A loop counter or a hot function's instructions are classic examples.
Spatial Locality: This is the principle that if you access a piece of data, you're likely to access data located near it in memory. When the CPU fetches data, it doesn't just grab one byte; it pulls in a whole "cache line" (e.g., 64 bytes) of adjacent memory. If your data is laid out contiguously, subsequent data you need is often pre-fetched into the fast cache for free.
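To make spatial locality concrete, here is a small C++ sketch (the flat row-major layout and the function names are illustrative, not from any particular library). Both functions sum the same matrix, but the row-major walk touches memory sequentially, while the column-major walk jumps an entire row between accesses and wastes most of each fetched cache line.

#include <cstddef>
#include <vector>

// The same arithmetic, two traversal orders over a rows x cols matrix
// stored as one flat row-major buffer.
double sum_row_major(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];  // adjacent elements: good spatial locality
    return total;
}

double sum_col_major(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];  // jumps cols * sizeof(double) bytes each step: poor spatial locality
    return total;
}

On a matrix large enough to spill out of cache (say 4096 x 4096), the column-major version is typically several times slower, even though it performs exactly the same additions.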
The Modern CPU Cache Hierarchy: L1, L2, and L3
Modern CPUs feature a multi-level cache hierarchy to balance speed and size. Each level is progressively larger but slower than the one before it: L1 is typically tens of kilobytes and reachable in a few cycles, L2 a few hundred kilobytes to a megabyte, and L3 several to tens of megabytes shared among cores, with main memory another order of magnitude slower still.
When your program runs, the CPU always checks L1 first. If it's a miss, it checks L2, then L3, and finally falls back to main memory. Effective code keeps its most-used data in the L1 and L2 caches as much as possible.
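One way to observe the hierarchy directly is a pointer-chasing sketch like the one below; the working-set sizes, seed, and hop count are arbitrary choices for illustration. Because every load depends on the previous one, the hardware cannot prefetch ahead, and the measured time per access steps up as the working set outgrows L1, then L2, then L3.

#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    std::mt19937 rng(42);
    for (std::size_t bytes : {16u << 10, 256u << 10, 8u << 20, 64u << 20}) {
        std::size_t n = bytes / sizeof(std::size_t);
        std::vector<std::size_t> next(n);
        std::iota(next.begin(), next.end(), std::size_t{0});

        // Sattolo's algorithm: build a single random cycle over all n slots,
        // so the chase visits the whole working set in cache-hostile order.
        for (std::size_t k = n - 1; k > 0; --k) {
            std::uniform_int_distribution<std::size_t> d(0, k - 1);
            std::swap(next[k], next[d(rng)]);
        }

        const std::size_t hops = 10'000'000;
        std::size_t i = 0;
        auto start = std::chrono::steady_clock::now();
        for (std::size_t h = 0; h < hops; ++h) i = next[i];  // dependent loads
        auto stop = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(stop - start).count();
        std::printf("%6zu KiB working set: %.2f ns/access (i=%zu)\n",
                    bytes >> 10, ns / hops, i);  // printing i keeps the loop alive
    }
    return 0;
}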
Data Layout Matters: AoS vs. SoA
How you structure your data directly impacts spatial locality. Let's take a common example: managing a large number of particles, each with a position and velocity.
A typical approach is the Array of Structs (AoS):
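A minimal sketch of that layout in C++ (the field names and types are illustrative):

#include <vector>

// AoS: each particle's position and velocity sit together in memory.
struct Particle {
    float position[3];  // x, y, z
    float velocity[3];
};

std::vector<Particle> particles(1'000'000);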
In memory, this looks like: [pos1, vel1], [pos2, vel2], [pos3, vel3], ...
Now, imagine a function that only updates particle positions, as in the sketch below. When it accesses particles[0].position, the CPU loads a cache line containing that position data, but also the adjacent velocity data it never reads. This pollutes the cache with irrelevant bytes.
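A sketch of such a position-only update, building on the Particle definition above (the translation deltas are arbitrary):

// The loop never reads velocity, yet with 24-byte particles, half of
// every 64-byte cache line the CPU fetches is velocity data that goes unused.
void translate_aos(std::vector<Particle>& particles, float dx, float dy, float dz) {
    for (Particle& p : particles) {
        p.position[0] += dx;
        p.position[1] += dy;
        p.position[2] += dz;
    }
}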
A more cache-friendly approach is the Struct of Arrays (SoA):
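One way to express this in C++ (the type and field names are illustrative):

#include <array>
#include <vector>

// SoA: positions and velocities each live in their own contiguous block
// instead of being interleaved per particle.
struct ParticleSystem {
    std::vector<std::array<float, 3>> positions;
    std::vector<std::array<float, 3>> velocities;
};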
In memory, this is organized as two distinct, contiguous blocks: [pos1, pos2, pos3, ...] [vel1, vel2, vel3, ...]
Now when the position-update function runs, every cache line it loads is packed with exactly the position data it needs. Because the layout lines up with spatial locality, the CPU streams through the data far more efficiently; a sketch of the same update over the SoA layout follows.
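// Uses the ParticleSystem type defined above. Every byte of every cache
// line this loop fetches is position data it actually modifies.
void translate_soa(ParticleSystem& ps, float dx, float dy, float dz) {
    for (std::array<float, 3>& p : ps.positions) {
        p[0] += dx;
        p[1] += dy;
        p[2] += dz;
    }
}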
Evaluation
Let's test this. Save the following as cache_test.cpp.
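Here is a self-contained sketch of such a test; the particle count, pass count, and translation constants are illustrative assumptions, and it simply times the two translation loops shown earlier back to back.

#include <array>
#include <chrono>
#include <cstdio>
#include <vector>

// AoS: each particle's position and velocity are interleaved in memory.
struct Particle {
    float position[3];
    float velocity[3];
};

// SoA: one contiguous block per field.
struct ParticleSystem {
    std::vector<std::array<float, 3>> positions;
    std::vector<std::array<float, 3>> velocities;
};

int main() {
    const std::size_t n = 5'000'000;
    const int passes = 50;

    std::vector<Particle> aos(n, Particle{{0, 0, 0}, {1, 1, 1}});

    ParticleSystem soa;
    soa.positions.assign(n, {0, 0, 0});
    soa.velocities.assign(n, {1, 1, 1});

    // AoS pass: positions only, but velocities ride along in every cache line.
    auto t0 = std::chrono::steady_clock::now();
    for (int pass = 0; pass < passes; ++pass)
        for (Particle& p : aos) {
            p.position[0] += 0.1f;
            p.position[1] += 0.2f;
            p.position[2] += 0.3f;
        }
    auto t1 = std::chrono::steady_clock::now();

    // SoA pass: the same work, but every fetched byte is position data.
    for (int pass = 0; pass < passes; ++pass)
        for (std::array<float, 3>& p : soa.positions) {
            p[0] += 0.1f;
            p[1] += 0.2f;
            p[2] += 0.3f;
        }
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("AoS update: %8.2f ms\n", ms(t1 - t0).count());
    std::printf("SoA update: %8.2f ms\n", ms(t2 - t1).count());

    // Read back one element so the compiler can't discard the loops.
    std::printf("check: %f %f\n", aos[n / 2].position[0], soa.positions[n / 2][0]);
    return 0;
}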
Compile and run it from your command line with optimizations enabled:
g++ -O3 -o cache_test cache_test.cpp
./cache_test
On most machines, the SoA version runs significantly faster, a direct result of better cache utilization: it moves roughly half as many bytes through the cache hierarchy to do the same work.