One of the most significant factors determining whether your code runs fast or crawls is something that many engineers never directly interact with: CPU cache locality.
While we write code thinking about algorithms and data structures, the CPU is quietly making millions of decisions about what data to keep close and what to evict from its precious cache memory.
Understanding cache locality is the difference between code that scales linearly and code that hits performance walls seemingly out of nowhere. A simple change in how you access memory can result in 10x, 50x, or even 100x performance differences. Let’s dig deeper…
Why Cache Exists
Speed versus capacity is the fundamental trade-off here. Fast memory is expensive and limited, while cheap memory is slow. This creates a hierarchy in which each level trades speed for capacity:
- CPU Registers: ~1 cycle access, 32-64 registers
- L1 Cache: ~1-3 cycles, 32-64 KB per core
- L2 Cache: ~10-20 cycles, 256 KB - 1 MB per core
- L3 Cache: ~40-75 cycles, 8-32 MB shared
- Main Memory (RAM): ~200-300 cycles, 8-128 GB
- SSD Storage: ~50,000-100,000 cycles, 500 GB - 4 TB
- Hard Drive: ~10,000,000 cycles, 1-20 TB
The gap between L1 cache and main memory is roughly 100-300x in latency. This isn’t just a minor inconvenience; it’s a performance cliff that can make or break your application.
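You can feel this cliff on your own machine by comparing a sequential walk over a large array with a walk over the same array in random order. The code below is a minimal sketch, not a rigorous benchmark: it assumes a POSIX system for clock_gettime, and the array size (64 MB of ints, larger than a typical L3) and the fixed seed are arbitrary choices of this example. The sequential pass streams whole cache lines and benefits from hardware prefetching; the shuffled pass defeats both.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)  /* 16M ints (~64 MB): well beyond a typical L3 */

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    int *data = malloc(N * sizeof *data);
    size_t *order = malloc(N * sizeof *order);
    for (size_t i = 0; i < N; i++) { data[i] = 1; order[i] = i; }

    /* Fisher-Yates shuffle to build a random visiting order
     * (rand() is crude, but good enough to kill locality). */
    srand(42);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }

    long long sum = 0;  /* accumulate so the loops aren't optimized away */
    double t0 = seconds();
    for (size_t i = 0; i < N; i++) sum += data[i];          /* sequential */
    double t1 = seconds();
    for (size_t i = 0; i < N; i++) sum += data[order[i]];   /* random */
    double t2 = seconds();

    printf("sequential: %.3fs  random: %.3fs  (sum=%lld)\n",
           t1 - t0, t2 - t1, sum);
    free(data); free(order);
    return 0;
}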
Cache Lines
CPUs don’t fetch individual bytes from memory. Instead, they work with cache lines, typically 64 bytes on modern x86 processors. When you access a single byte, the CPU fetches the entire 64-byte block containing that byte.
This design assumes spatial locality, i.e., if you access one memory location, you’ll likely access nearby locations soon. This assumption drives much of cache behavior and optimization strategies.
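If you'd rather check the line size on your own hardware than assume 64 bytes, Linux exposes it in a couple of places. The sketch below assumes a Linux system: it first asks sysconf (the _SC_LEVEL1_DCACHE_LINESIZE name is a glibc extension) and falls back to the value exported under /sys.

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension; it may return
     * 0 or -1 on other platforms. */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line > 0) {
        printf("L1 data cache line size: %ld bytes\n", line);
    } else {
        /* Fallback: the same value is exported via sysfs on Linux. */
        FILE *f = fopen("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size", "r");
        if (f && fscanf(f, "%ld", &line) == 1)
            printf("L1 data cache line size (sysfs): %ld bytes\n", line);
        if (f) fclose(f);
    }
    return 0;
}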
Types of Cache Locality
Temporal Locality
Temporal locality means that recently accessed data is likely to be accessed again soon. This is why keeping frequently used variables in scope and avoiding unnecessary memory allocations can dramatically improve performance.
cache = {}
cache_size = 8  # Small cache to demonstrate eviction

def fib(n):
    if n in cache:
        print(f"  Cache HIT:   fib({n}) = {cache[n]}")
        return cache[n]

    print(f"  Computing fib({n})...")
    # Base cases
    if n <= 1:
        result = n
    else:
        # Recursive calls - this creates temporal locality patterns:
        # fib(n-1) and fib(n-2) are requested again shortly after being computed
        result = fib(n - 1) + fib(n - 2)

    # Cache management - keep only recent values by evicting the oldest
    # entry (dicts preserve insertion order in Python 3.7+)
    if len(cache) >= cache_size:
        cache.pop(next(iter(cache)))

    # Store result in cache
    cache[n] = result
    print(f"  Cache STORE: fib({n}) = {result}")
    return result

fib(10)  # The hits on recently stored values illustrate temporal locality
Spatial Locality
Spatial locality means that accessing nearby memory locations is more efficient than accessing scattered locations. This is directly related to cache line behavior.
Let’s take a matrix of size n x n and increment every value by 1, in two flavours:
- row-major access
- column-major access
void poor_spatial_locality(int **matrix, int size) {
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            matrix[j][i] += 1;  // Column-major access in row-major storage
        }
    }
}

void good_spatial_locality(int **matrix, int size) {
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            matrix[i][j] += 1;  // Row-major access matches storage
        }
    }
}
The difference between these two approaches is substantial, and it grows as the matrix outgrows the caches. The following table shows the benchmark for different sizes.
Matrix Size | Poor Locality (s) | Good Locality (s) | Speedup
------------|-------------------|-------------------|---------
1000 | 0.015380 | 0.010350 | 1.49 x
1500 | 0.034624 | 0.021061 | 1.64 x
2000 | 0.054616 | 0.035029 | 1.56 x
2500 | 0.110248 | 0.066116 | 1.67 x
3000 | 0.238320 | 0.131807 | 1.81 x
3500 | 0.315481 | 0.124164 | 2.54 x
4000 | 0.447303 | 0.180625 | 2.48 x
CPUs fetch 64-byte cache lines, which hold 16 four-byte integers. With row-major access, all 16 integers in each fetched cache line are used. With column-major access, only 1 integer per cache line is used before the line is evicted, so 15 of the 16 are wasted.
Example: Redis Eviction Pool
Redis maintains its eviction pool as a static array rather than a linked list primarily to optimize for spatial locality during the critical eviction process.
Redis samples random keys and maintains a small pool (typically 16 entries) of the best eviction candidates sorted by their idle time or access frequency. Using a contiguous array ensures that all candidate entries reside within a few cache lines, allowing the CPU to efficiently compare and sort candidates without cache misses.
The array-based approach eliminates the pointer chasing inherent in linked lists, which would otherwise cause cache misses and memory stalls during time-sensitive eviction decisions.
Here’s the snippet from Redis’s source code (src/evict.c).
...
if (pool[EVPOOL_SIZE-1].key == NULL) {
/* Free space on the right? Insert at k shifting
* all the elements from k to end to the right. */
/* Save SDS before overwriting. */
sds cached = pool[EVPOOL_SIZE-1].cached;
memmove(pool+k+1,pool+k,
sizeof(pool[0])*(EVPOOL_SIZE-k-1));
pool[k].cached = cached;
} else {
/* No free space on right? Insert at k-1 */
k--;
/* Shift all elements on the left of k (included) to the
* left, so we discard the element with smaller idle time. */
sds cached = pool[0].cached; /* Save SDS before overwriting. */
if (pool[0].key != pool[0].cached) sdsfree(pool[0].key);
memmove(pool,pool+1,sizeof(pool[0])*k);
pool[k].cached = cached;
}
...
Measuring Cache Performance
CPUs provide hardware counters that let you measure cache behavior directly; on Linux you can read them with perf. Some key metrics to look for are:
- Cache miss rate: % of memory accesses that miss the cache
- L1/L2/L3 miss rates: Miss rates at each cache level
- Instructions per cycle (IPC): Overall CPU efficiency
$ perf stat -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads ./your_program
# 1,234,567 cache-misses
# 10,000,000 cache-references
# 2,345,678 L1-dcache-load-misses
# 10,000,000 L1-dcache-loads
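With these illustrative numbers, the overall cache miss rate would be 1,234,567 / 10,000,000 ≈ 12.3%, and the L1 data-cache load miss rate 2,345,678 / 10,000,000 ≈ 23.5%. Comparing these ratios before and after a change, such as swapping the loop order above, is usually more informative than the raw counts, which simply scale with how long the program runs.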
Optimization Strategies
- Loop Tiling/Blocking: Break large loops into smaller chunks that fit in cache to maximize data reuse before eviction.
- Cache-Oblivious Algorithms: Design algorithms that perform well across different cache sizes without knowing specific cache parameters.
- Prefetching: Provide hints to the CPU about future memory accesses to load data into cache before it’s needed.
- False Sharing Mitigation: Pad data structures to ensure different threads access separate cache lines and avoid unnecessary cache coherency traffic.
We will look at each of these in depth in a future post.
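In the meantime, here is a minimal sketch of the first strategy, loop tiling, applied to a matrix transpose (a transpose rather than the increment example above, because its naive version writes the destination column-wise and wastes almost every cache line it touches). The TILE size of 64 elements is an arbitrary illustrative choice, not a tuned value.

#define TILE 64  /* tile edge in elements; illustrative, tune per cache */

/* Naive transpose: dst is written column-wise, so every write
 * lands on a different cache line. */
void transpose_naive(const int *src, int *dst, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Tiled transpose: work on TILE x TILE blocks so the source block
 * and destination block stay resident in cache while in use. */
void transpose_tiled(const int *src, int *dst, int n) {
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int i = ii; i < ii + TILE && i < n; i++)
                for (int j = jj; j < jj + TILE && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}

Each 64 x 64 block of ints is 16 KB, so the source and destination blocks being worked on fit comfortably in cache, and every destination cache line is filled completely before it is evicted.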
Footnotes
Cache locality is one of the most impactful yet often overlooked aspects of performance optimization. Understanding how CPUs cache data and designing your algorithms and data structures accordingly can provide massive performance improvements.
As engineers, we often focus on algorithmic complexity (Big O notation), but cache performance can dominate real-world execution time, especially in in-memory data-processing workloads.
Interestingly, a linear algorithm with good cache locality can end up faster than a logarithmic algorithm with poor cache locality at realistic data sizes.