Writing for the Hardware: How Cache-Friendly Data Structures Boost Performance
One of the biggest bottlenecks in modern computers is the gap between CPU speed and main-memory latency: a core can execute hundreds of instructions in the time it takes to complete a single read from RAM. Hardware designers bridge this gap with caches, but it's our job as engineers to write 'cache-friendly' code that uses them effectively. That means organizing data to match the access patterns the hardware rewards, a crucial technique in any high-performance domain.
The Two Types of Cache Locality
Caches operate on a simple principle called locality of reference, which comes in two flavors.
Temporal Locality: This is the principle that if a program accesses a particular memory location, it is highly likely to access that same location again in the near future. A loop counter or a hot function's instructions are classic examples.
Spatial Locality: This is the principle that if you access a piece of data, you're likely to access data located near it in memory. When the CPU fetches data, it doesn't just grab one byte; it pulls in a whole "cache line" (e.g., 64 bytes) of adjacent memory. If your data is laid out contiguously, subsequent data you need is often pre-fetched into the fast cache for free.
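To make spatial locality concrete, here is a small C++ sketch (the flat row-major layout and the function names are illustrative, not from any particular library). Both functions sum the same matrix, but the row-major walk touches memory sequentially, while the column-major walk jumps an entire row between accesses and wastes most of each fetched cache line.

#include <cstddef>
#include <vector>

// The same arithmetic, two traversal orders over a rows x cols matrix
// stored as one flat row-major buffer.
double sum_row_major(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];  // adjacent elements: good spatial locality
    return total;
}

double sum_col_major(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];  // jumps cols * sizeof(double) bytes each step: poor spatial locality
    return total;
}

On a matrix large enough to spill out of cache (say 4096 x 4096), the column-major version is typically several times slower, even though it performs exactly the same additions.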
The Modern CPU Cache Hierarchy: L1, L2, and L3
Modern CPUs feature a multi-level cache hierarchy to balance speed and size. Each level is progressively larger but slower than the one before it: L1 is typically tens of kilobytes and reachable in a few cycles, L2 a few hundred kilobytes to a megabyte, and L3 several to tens of megabytes shared among cores, with main memory another order of magnitude slower still.
When your program runs, the CPU always checks L1 first. If it's a miss, it checks L2, then L3, and finally falls back to main memory. Effective code keeps its most-used data in the L1 and L2 caches as much as possible.
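One way to observe the hierarchy directly is a pointer-chasing sketch like the one below; the working-set sizes, seed, and hop count are arbitrary choices for illustration. Because every load depends on the previous one, the hardware cannot prefetch ahead, and the measured time per access steps up as the working set outgrows L1, then L2, then L3.

#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    std::mt19937 rng(42);
    for (std::size_t bytes : {16u << 10, 256u << 10, 8u << 20, 64u << 20}) {
        std::size_t n = bytes / sizeof(std::size_t);
        std::vector<std::size_t> next(n);
        std::iota(next.begin(), next.end(), std::size_t{0});

        // Sattolo's algorithm: build a single random cycle over all n slots,
        // so the chase visits the whole working set in cache-hostile order.
        for (std::size_t k = n - 1; k > 0; --k) {
            std::uniform_int_distribution<std::size_t> d(0, k - 1);
            std::swap(next[k], next[d(rng)]);
        }

        const std::size_t hops = 10'000'000;
        std::size_t i = 0;
        auto start = std::chrono::steady_clock::now();
        for (std::size_t h = 0; h < hops; ++h) i = next[i];  // dependent loads
        auto stop = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(stop - start).count();
        std::printf("%6zu KiB working set: %.2f ns/access (i=%zu)\n",
                    bytes >> 10, ns / hops, i);  // printing i keeps the loop alive
    }
    return 0;
}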
Data Layout Matters: AoS vs. SoA
How you structure your data directly impacts spatial locality. Let's take a common example: managing a large number of particles, each with a position and velocity.
A typical approach is the Array of Structs (AoS):
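A minimal sketch of that layout in C++ (the field names and types are illustrative):

#include <vector>

// AoS: each particle's position and velocity sit together in memory.
struct Particle {
    float position[3];  // x, y, z
    float velocity[3];
};

std::vector<Particle> particles(1'000'000);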
In memory, this looks like: [pos1, vel1], [pos2, vel2], [pos3, vel3], ...
Now, imagine a function that only updates particle positions, as in the sketch below. When it accesses particles[0].position, the CPU loads a cache line containing that position data, but also the adjacent velocity data it never reads. This pollutes the cache with irrelevant bytes.
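A sketch of such a position-only update, building on the Particle definition above (the translation deltas are arbitrary):

// The loop never reads velocity, yet with 24-byte particles, half of
// every 64-byte cache line the CPU fetches is velocity data that goes unused.
void translate_aos(std::vector<Particle>& particles, float dx, float dy, float dz) {
    for (Particle& p : particles) {
        p.position[0] += dx;
        p.position[1] += dy;
        p.position[2] += dz;
    }
}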
A more cache-friendly approach is the Struct of Arrays (SoA):
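One way to express this in C++ (the type and field names are illustrative):

#include <array>
#include <vector>

// SoA: positions and velocities each live in their own contiguous block
// instead of being interleaved per particle.
struct ParticleSystem {
    std::vector<std::array<float, 3>> positions;
    std::vector<std::array<float, 3>> velocities;
};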
In memory, this is organized as two distinct, contiguous blocks: [pos1, pos2, pos3, ...] [vel1, vel2, vel3, ...]
Now when the position-update function runs, every cache line it loads is packed with exactly the position data it needs. Because the layout lines up with spatial locality, the CPU streams through the data far more efficiently; a sketch of the same update over the SoA layout follows.
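// Uses the ParticleSystem type defined above. Every byte of every cache
// line this loop fetches is position data it actually modifies.
void translate_soa(ParticleSystem& ps, float dx, float dy, float dz) {
    for (std::array<float, 3>& p : ps.positions) {
        p[0] += dx;
        p[1] += dy;
        p[2] += dz;
    }
}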
Evaluation
Let's test this. Save the following as cache_test.cpp.
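Here is a self-contained sketch of such a test; the particle count, pass count, and translation constants are illustrative assumptions, and it simply times the two translation loops shown earlier back to back.

#include <array>
#include <chrono>
#include <cstdio>
#include <vector>

// AoS: each particle's position and velocity are interleaved in memory.
struct Particle {
    float position[3];
    float velocity[3];
};

// SoA: one contiguous block per field.
struct ParticleSystem {
    std::vector<std::array<float, 3>> positions;
    std::vector<std::array<float, 3>> velocities;
};

int main() {
    const std::size_t n = 5'000'000;
    const int passes = 50;

    std::vector<Particle> aos(n, Particle{{0, 0, 0}, {1, 1, 1}});

    ParticleSystem soa;
    soa.positions.assign(n, {0, 0, 0});
    soa.velocities.assign(n, {1, 1, 1});

    // AoS pass: positions only, but velocities ride along in every cache line.
    auto t0 = std::chrono::steady_clock::now();
    for (int pass = 0; pass < passes; ++pass)
        for (Particle& p : aos) {
            p.position[0] += 0.1f;
            p.position[1] += 0.2f;
            p.position[2] += 0.3f;
        }
    auto t1 = std::chrono::steady_clock::now();

    // SoA pass: the same work, but every fetched byte is position data.
    for (int pass = 0; pass < passes; ++pass)
        for (std::array<float, 3>& p : soa.positions) {
            p[0] += 0.1f;
            p[1] += 0.2f;
            p[2] += 0.3f;
        }
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("AoS update: %8.2f ms\n", ms(t1 - t0).count());
    std::printf("SoA update: %8.2f ms\n", ms(t2 - t1).count());

    // Read back one element so the compiler can't discard the loops.
    std::printf("check: %f %f\n", aos[n / 2].position[0], soa.positions[n / 2][0]);
    return 0;
}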
Compile and run it from your command line with optimizations enabled:
g++ -O3 -o cache_test cache_test.cpp
./cache_test
On most machines, the SoA version runs significantly faster, a direct result of better cache utilization: it moves roughly half as many bytes through the cache hierarchy to do the same work.