Writing for the Hardware: How Cache-Friendly Data Structures Boost Performance

The biggest bottleneck in modern computers is the speed of RAM relative to the CPU: a trip to main memory can stall the processor for hundreds of cycles. Hardware designers soften this with caches, but it's our job as engineers to write 'cache-friendly' code that uses them effectively. That means organizing data so the caches can serve most memory accesses, a crucial technique in any high-performance domain.

The Two Types of Cache Locality

Caches operate on a simple principle called locality of reference, which comes in two flavors.

Temporal Locality: This is the principle that if a program accesses a particular memory location, it is highly likely to access that same location again in the near future. A loop counter that is read and updated on every iteration is a classic example.

Spatial Locality: This is the principle that if you access a piece of data, you're likely to access data located near it in memory. When the CPU fetches data, it doesn't just grab one byte; it pulls in a whole "cache line" (e.g., 64 bytes) of adjacent memory. If your data is laid out contiguously, subsequent data you need is often pre-fetched into the fast cache for free.
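You can feel the difference spatial locality makes by summing a two-dimensional array in both traversal orders. Here is a minimal C++ sketch (the matrix size is illustrative):

#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 4096;        // illustrative size
    std::vector<int> m(n * n, 1);      // row-major n x n matrix

    long long row_sum = 0, col_sum = 0;

    // Row-major walk: consecutive accesses are adjacent in memory,
    // so every byte of each fetched cache line gets used.
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            row_sum += m[i * n + j];

    // Column-major walk: consecutive accesses are n*4 bytes apart,
    // so nearly every access pulls in a fresh cache line.
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            col_sum += m[i * n + j];

    std::printf("%lld %lld\n", row_sum, col_sum);  // same sums, very different speed
}

On most machines the row-major walk is several times faster, even though the arithmetic is identical.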

The Modern CPU Cache Hierarchy: L1, L2, and L3

Modern CPUs feature a multi-level cache hierarchy to balance speed and size. Each level is progressively larger but slower than the one before it.

  • L1 (Level 1) Cache: The fastest and smallest cache (typically 32-64 KB per core), sitting directly on each CPU core. It's split into an instruction cache (for code) and a data cache (for variables), and a hit costs only a few cycles.
  • L2 (Level 2) Cache: Larger (typically 256 KB to a few MB per core) and slightly slower than L1. It serves as a backup: if the CPU can't find data in L1 (a "cache miss"), it checks L2 next.
  • L3 (Level 3) Cache: The largest on-chip cache (several MB to tens of MB), shared among all CPU cores. It's the last line of defense before the CPU must access the much slower main memory (RAM).

When your program runs, the CPU checks L1 first. If that misses, it checks L2, then L3, and finally pays the full cost of a trip to RAM. Effective code keeps its most-used data in the L1 and L2 caches as much as possible.
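You can observe these levels directly by timing loads that depend on one another across working sets of growing size. Below is a rough C++ sketch (buffer sizes and the hop count are illustrative; exact latencies vary by CPU):

#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

static volatile std::size_t sink;

// Follow a random cycle of indices through a buffer. Each load depends on
// the previous one, so the time per hop approximates the latency of
// whichever cache level the buffer fits in.
double ns_per_hop(std::size_t bytes) {
    const std::size_t n = bytes / sizeof(std::size_t);
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});

    // Sattolo's shuffle: yields a permutation that forms one full cycle.
    std::mt19937 rng{42};
    for (std::size_t i = n - 1; i > 0; --i) {
        std::size_t j = std::uniform_int_distribution<std::size_t>{0, i - 1}(rng);
        std::swap(next[i], next[j]);
    }

    const std::size_t hops = 20'000'000;
    std::size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t h = 0; h < hops; ++h)
        idx = next[idx];
    auto t1 = std::chrono::steady_clock::now();
    sink = idx;  // keep the chain observable so it isn't optimized away

    return std::chrono::duration<double, std::nano>(t1 - t0).count() / hops;
}

int main() {
    // Roughly: fits in L1, fits in L2, fits in L3, spills to RAM.
    for (std::size_t kb : {16, 256, 4096, 131072})
        std::printf("%8zu KiB: %6.2f ns per load\n", kb, ns_per_hop(kb * 1024));
}

The time per load climbs in distinct steps as the working set outgrows each cache level.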

Data Layout Matters: AoS vs. SoA

How you structure your data directly impacts spatial locality. Let's take a common example: managing a large number of particles, each with a position and velocity.

A typical approach is the Array of Structs (AoS):

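(a minimal sketch; the struct and field names are illustrative)

#include <vector>

// Array of Structs: each particle's fields are interleaved in memory.
struct Particle {
    float position[3];
    float velocity[3];
};

// A position-only update: it never reads velocity, yet every cache line
// it pulls in carries velocity bytes anyway.
void update_positions(std::vector<Particle>& ps, float dx) {
    for (auto& p : ps)
        p.position[0] += dx;
}

int main() {
    std::vector<Particle> particles(1'000'000);
    update_positions(particles, 0.5f);
}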

In memory, this looks like: [pos1, vel1], [pos2, vel2], [pos3, vel3], ...

Now, imagine a function that only updates particle positions, like update_positions above. When it accesses particles[0].position, the CPU loads a whole 64-byte cache line containing that position, but also the adjacent velocity data the function never reads. With position and velocity each taking 12 bytes in the sketch, half of every line the loop touches is wasted, polluting the cache with irrelevant data.

A more cache-friendly approach is the Struct of Arrays (SoA):

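(again a minimal sketch, mirroring the AoS version above)

#include <cstddef>
#include <vector>

// Struct of Arrays: every field gets its own contiguous array.
struct Particles {
    std::vector<float> position_x, position_y, position_z;
    std::vector<float> velocity_x, velocity_y, velocity_z;
};

// The same position-only update now streams through pure position data;
// velocity bytes never enter the cache.
void update_positions(Particles& ps, float dx) {
    for (std::size_t i = 0; i < ps.position_x.size(); ++i)
        ps.position_x[i] += dx;
}

int main() {
    Particles ps;
    ps.position_x.assign(1'000'000, 0.0f);
    update_positions(ps, 0.5f);
}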

In memory, this is organized as two distinct, contiguous blocks: [pos1, pos2, pos3, ...] [vel1, vel2, vel3, ...]

Now when the position-update function runs, every cache line it loads is packed with exactly the position data it needs. Because the layout matches the access pattern, spatial locality works in our favor and the CPU processes the data far more efficiently.

Evaluation

Let's test this. Save the following as cache_test.cpp.

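The sketch below is a minimal version of such a benchmark, combining the two layouts above (N, ITERS, and the field names are illustrative choices):

// cache_test.cpp
// Times a position-only update over N particles stored AoS vs. SoA.
#include <chrono>
#include <cstdio>
#include <vector>

constexpr std::size_t N = 5'000'000;
constexpr int ITERS = 10;

struct ParticleAoS {
    float px, py, pz;   // position
    float vx, vy, vz;   // velocity
};

struct ParticlesSoA {
    std::vector<float> px, py, pz;  // positions, each contiguous
    std::vector<float> vx, vy, vz;  // velocities live elsewhere
};

int main() {
    std::vector<ParticleAoS> aos(N, ParticleAoS{0.f, 0.f, 0.f, 1.f, 1.f, 1.f});

    ParticlesSoA soa;
    soa.px.assign(N, 0.f); soa.py.assign(N, 0.f); soa.pz.assign(N, 0.f);
    soa.vx.assign(N, 1.f); soa.vy.assign(N, 1.f); soa.vz.assign(N, 1.f);

    using clock = std::chrono::steady_clock;
    auto ms = [](clock::time_point a, clock::time_point b) {
        return std::chrono::duration<double, std::milli>(b - a).count();
    };

    // AoS pass: touches positions only, but every cache line loaded
    // also carries the interleaved (unused) velocity bytes.
    auto t0 = clock::now();
    for (int it = 0; it < ITERS; ++it)
        for (auto& p : aos) { p.px += 0.1f; p.py += 0.1f; p.pz += 0.1f; }
    auto t1 = clock::now();

    // SoA pass: streams through three pure position arrays; no velocity
    // data ever enters the cache.
    auto t2 = clock::now();
    for (int it = 0; it < ITERS; ++it)
        for (std::size_t i = 0; i < N; ++i) {
            soa.px[i] += 0.1f; soa.py[i] += 0.1f; soa.pz[i] += 0.1f;
        }
    auto t3 = clock::now();

    std::printf("AoS: %.1f ms\n", ms(t0, t1));
    std::printf("SoA: %.1f ms\n", ms(t2, t3));
    std::printf("check: %.1f %.1f\n", aos[0].px, soa.px[0]);  // defeat dead-code elimination
}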

Compile and run it from your command line with optimizations enabled:

g++ -O3 -o cache_test cache_test.cpp
./cache_test


You should see the SoA version run significantly faster, a direct result of better cache utilization.

