
Demystifying CPU Caches with Examples

Abhijit Mondal
11 min read · Nov 22, 2023


This post takes inspiration from one of the best-known blog posts on CPU cache effects:

Gallery of Processor Cache Effects (igoro.com)

Some notable factors that can affect the performance of a computer program utilizing the CPU cache are as follows:

  1. Impact of Cache Lines
  2. Cache Associativity
  3. False Cache Line Sharing

Before getting into each of the above cache performance parameters, let us go over some important pointers regarding CPU caches.

  • Reading from and writing to main memory (RAM) is expensive compared to the CPU caches.
  • According to the CS61 course website, the access time for RAM (main memory) is 62.9 ns, whereas it is 12.8 ns for the L3 cache, 3.3 ns for the L2 cache and 1.1 ns for the L1 cache.
  • During each read, the CPU first checks its L1 cache; if the data is found it is returned, otherwise it checks the L2 cache, and so on up to the L3 cache (or L4 cache, if present).
  • If the data is not found in any CPU cache, it is fetched from main memory and added to the L1, L2 and L3 caches.
  • When a byte is read from main memory, not just that byte but a whole block of bytes is brought in, known as a cache line.
  • The cache line is generally 64 bytes. Any subsequent read of a byte in the same cache line is served from the L1, L2 or L3 cache, because the 64-byte cache line is already cached in the CPU.
  • For example, suppose the cache lines are aligned such that bytes 0–63 form one cache line, bytes 64–127 the next one, and so on. If byte 10 is read, all bytes from 0 to 63 are transferred from RAM to the CPU and cached. If byte 57 is then requested, it is already present in the CPU cache since 0 <= 57 <= 63, and hence is read from the cache (see the short sketch after this list).
  • A cache line read from main memory can occupy one of N cache slots, i.e. if an existing cache line already occupies one of the N slots, then the incoming cache line can only occupy one of the remaining N-1 slots.
  • If all N slots are occupied, one of them has to be evicted to make room for the new cache line. Typically an LRU-like policy is used to decide which slot to evict.

Cache replacement policies — wikipedia
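
To make the cache-line mapping concrete, here is a minimal sketch (assuming 64-byte cache lines and a line-aligned start address, as in the bullet example above) that computes which cache line a given byte offset falls into. Bytes 10 and 57 land in line 0, so the second read is served from the cache.

#include <cstdint>
#include <iostream>

// Assuming 64-byte cache lines: byte offsets 0-63 map to line 0,
// offsets 64-127 to line 1, and so on.
constexpr std::uint64_t kCacheLineSize = 64;

std::uint64_t cache_line_index(std::uint64_t byte_offset) {
    return byte_offset / kCacheLineSize;   // equivalently byte_offset >> 6
}

int main() {
    // Reading byte 10 pulls in all of line 0 (bytes 0-63) ...
    std::cout << cache_line_index(10) << std::endl;  // 0
    // ... so a later read of byte 57 hits the same cached line.
    std::cout << cache_line_index(57) << std::endl;  // 0
    // Byte 64 starts the next cache line.
    std::cout << cache_line_index(64) << std::endl;  // 1
    return 0;
}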

Finding L1, L2 and L3 cache sizes

In Linux, we can issue the following commands to find the L1, L2 and L3 cache sizes:

getconf -a | grep CACHE

The output:

LEVEL1_ICACHE_SIZE                 32768
LEVEL1_ICACHE_ASSOC
LEVEL1_ICACHE_LINESIZE             64
LEVEL1_DCACHE_SIZE                 49152
LEVEL1_DCACHE_ASSOC                12
LEVEL1_DCACHE_LINESIZE             64
LEVEL2_CACHE_SIZE                  1310720
LEVEL2_CACHE_ASSOC                 20
LEVEL2_CACHE_LINESIZE              64
LEVEL3_CACHE_SIZE                  12582912
LEVEL3_CACHE_ASSOC                 12
LEVEL3_CACHE_LINESIZE              64
LEVEL4_CACHE_SIZE                  0
LEVEL4_CACHE_ASSOC
LEVEL4_CACHE_LINESIZE
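
On Linux the same cache parameters are also exposed under sysfs. Below is a minimal C++ sketch that reads them for cpu0; the directory layout (/sys/devices/system/cpu/cpu0/cache/index*/) is a standard kernel interface, but some attributes may be missing on some systems, so treat it as an optional cross-check of the getconf output rather than a portable tool.

#include <fstream>
#include <iostream>
#include <string>

// Read a single-line sysfs attribute; returns an empty string if missing.
std::string read_attr(const std::string& path) {
    std::ifstream f(path);
    std::string value;
    std::getline(f, value);
    return value;
}

int main() {
    // cpu0 has one index* directory per cache it can see (L1i, L1d, L2, L3).
    const std::string base = "/sys/devices/system/cpu/cpu0/cache/";
    for (int idx = 0; idx < 8; idx++) {
        std::string dir = base + "index" + std::to_string(idx) + "/";
        std::string level = read_attr(dir + "level");
        if (level.empty()) break;  // no more cache levels reported
        std::cout << "L" << level
                  << " " << read_attr(dir + "type")
                  << ": size=" << read_attr(dir + "size")
                  << ", line=" << read_attr(dir + "coherency_line_size")
                  << ", ways=" << read_attr(dir + "ways_of_associativity")
                  << std::endl;
    }
    return 0;
}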

Size of L1 instruction cache (cache for storing the program) = 32768 bytes or 32 KB.

Size of L1 data cache (cache for storing the data) = 49152 bytes or 48 KB.

Note that the L1 cache exists in each core of the CPU, i.e. if a CPU has 4 cores, then each core has 32 KB of L1 instruction cache and 48 KB of L1 data cache. The total L1 data cache is therefore 48*4 KB = 192 KB.

The L2 cache is also per-core. Each core has 1.25 MB of L2 cache, so the total L2 cache for a 4-core CPU is 1.25*4 = 5 MB.

The L3 cache, however, is shared across all cores of the CPU. Its total size is 12 MB.

On many CPUs the L3 cache is inclusive, i.e. every cache line present in a core's L1 or L2 cache is also present in the shared L3 cache.

The cache line size, i.e. the number of bytes fetched together for each read that misses in the caches, is 64 bytes.

The L1 data cache associativity is 12, i.e. each cache line can occupy one of 12 slots in the L1 cache. Similarly, in the L2 cache each cache line can occupy one of 20 slots, and in the L3 cache one of 12 slots.
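
From these numbers we can also compute how many groups of slots (sets) each cache has, using sets = cache size / (line size × associativity). A minimal sketch using the getconf values reported above; the 16384 figure for L3 reappears in the Cache Associativity section below.

#include <iostream>

// sets = cache_size / (line_size * associativity)
// The values below are the getconf numbers from this machine.
int num_sets(int cache_size, int line_size, int assoc) {
    return cache_size / (line_size * assoc);
}

int main() {
    std::cout << "L1d sets: " << num_sets(49152, 64, 12) << std::endl;     // 64
    std::cout << "L2  sets: " << num_sets(1310720, 64, 20) << std::endl;   // 1024
    std::cout << "L3  sets: " << num_sets(12582912, 64, 12) << std::endl;  // 16384
    return 0;
}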

Impact of Cache Lines

Assume you have an integer array of size N. In the first case, read M integers contiguously, from index 0 to index M-1.

Since the cache line size is 64 bytes, or 16 integers (each integer is 4 bytes), reading M integers contiguously requires accessing approximately M/16 cache lines.

More importantly, each cache line is fetched from main memory at most once.

#include <chrono>
#include <iostream>

int main(int argc, char* argv[]) {
    const int length = 512 * 1024 * 1024;   // number of ints in the array
    const int cache_line_size = 16;         // ints per 64-byte cache line
    const int m = length / cache_line_size;

    int *arr = new int[length];

    // contiguous access: indices 0 .. m-1
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < m; i++)
        arr[i]++;
    auto stop = std::chrono::high_resolution_clock::now();

    auto duration =
        std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);

    std::cout << (float) duration.count() << std::endl;

    delete[] arr;   // arr was allocated with new[], so use delete[]
    return 0;
}

Compiling the above program with the default g++ compiler (v11.3.0) options:

g++ cache_line.cpp -o cache_line
./cache_line

Time taken in this case is 296 ms.

But if the M integers are read with a step of 16, i.e. at indices 0, 16, 32, …, 16*(M-1), then although the number of elements scanned is still M, the number of cache lines accessed is now M, since each access touches a different cache line.

Thus in this case we are accessing 16 times more cache lines.

int *arr = new int[length];

// stepped access: one access per cache line
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < m * cache_line_size; i += cache_line_size)
    arr[i]++;
auto stop = std::chrono::high_resolution_clock::now();

auto duration =
    std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);

std::cout << (float) duration.count() << std::endl;

Time taken in this stepped access case is 1871 ms.

This clearly shows that although we are iterating over the same number of indices M, the amount of work done in fetching the additional cache lines in the 2nd case is significantly higher than in the 1st case.

In the 1st case, when we fetch index 0, all indices from 0 to 15 are also fetched and cached in the L1, L2 and L3 caches, hence updating indices 1 to 15 happens within the CPU cache, which is not the case with the stepped access.

In a 3rd scenario, we iterate with a step size of 32 but update the integers at indices i and i+16, i.e. we still access the same number of cache lines as in the 2nd case.

int *arr = new int[length];

// step of 32, but two updates per step: same number of cache lines as before
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < m * cache_line_size; i += 2 * cache_line_size) {
    arr[i]++;
    arr[i + cache_line_size]++;
}
auto stop = std::chrono::high_resolution_clock::now();

auto duration =
    std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);

std::cout << (float) duration.count() << std::endl;

Time taken in this case is around 1500 ms.

The purpose of the 3rd scenario is to show that although we iterate over half the number of indices, the running time remains almost the same, because we access the same number of cache lines.

Key Takeaway : Performance depends not only on how many elements we loop over but also on how many cache lines we access from main memory.
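
A classic way to see this takeaway in practice (not part of the benchmarks above, just an illustrative sketch) is traversing a 2D array row by row versus column by column: both loops perform the same number of updates, but the column-wise loop touches a new cache line on almost every access.

#include <chrono>
#include <iostream>

int main() {
    constexpr int n = 8192;               // 8192 x 8192 ints (~256 MB)
    int (*grid)[n] = new int[n][n]();     // zero-initialized 2D array

    // Row-major traversal: consecutive accesses fall in the same cache line.
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            grid[i][j]++;
    auto stop = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms (row by row)" << std::endl;

    // Column-major traversal: same number of updates, but nearly every
    // access lands on a different cache line.
    start = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            grid[i][j]++;
    stop = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms (column by column)" << std::endl;

    delete[] grid;
    return 0;
}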

False Cache Line Sharing

Let’s say there is an array of size N, and using 4 threads A, B, C and D we repeatedly update the integers at indices i, j, k and l.

Now if i, j, k and l belong to the same cache line, e.g. i=0, j=1, k=2 and l=3, then whenever a thread updates an integer, the copies of that cache line in the other cores' caches are invalidated and the cache line has to be fetched again.

#include <chrono>
#include <iostream>
#include <thread>

void update(int *inp, int j) {
    for (int i = 0; i < 1000000; i++) inp[j]++;
}

int main(int argc, char* argv[]) {
    const int length = 1024;
    int *arr = new int[length];

    long long b = 1000;
    float a = 0.0;

    // updates happen in the same cache line (indices 0, 1, 2, 3)
    for (int r = 0; r < b; r++) {
        auto start = std::chrono::high_resolution_clock::now();

        std::thread thread1(update, arr, 0);
        std::thread thread2(update, arr, 1);
        std::thread thread3(update, arr, 2);
        std::thread thread4(update, arr, 3);

        thread1.join();
        thread2.join();
        thread3.join();
        thread4.join();

        auto stop = std::chrono::high_resolution_clock::now();

        auto duration =
            std::chrono::duration_cast<std::chrono::milliseconds>
                (stop - start);

        a += (float) duration.count();
    }

    std::cout << float(a)/float(b) << std::endl << std::endl;

    delete[] arr;
    return 0;
}

Average time taken when the updates take place in the same cache line is around 9.0 ms.

Each thread runs on a different CPU core, and as we saw earlier, each core has its own L1 and L2 cache.

The advantage of per-core L1 and L2 caches is that in a single-threaded program, reads and writes are very fast, as the core where the program runs can work directly out of its own L1 or L2 cache.

The disadvantage is that with multiple threads, where each core updates its own L1 or L2 cache independently, the values need to be kept in sync across all cores, and syncing takes time.

For example, if thread A updates the integer at index i=0, then thread B cannot concurrently update the integer at index j=1 in its own L1/L2 cache, because the entire cache line is invalidated, and until the update is synced across cores, the next update to that cache line cannot happen.

But if i, j, k and l belong to different cache lines, e.g. i=0, j=16, k=32 and l=48, then an update made by one thread does not invalidate the cache lines used by the others, and hence each thread can update its integer in its own core's cache concurrently, without having to sync across all cores.

float a = 0.0;
int *arr = new int[length];

// updates happen in different cache lines (indices 0, 16, 32, 48)
for (int r = 0; r < b; r++) {
    auto start = std::chrono::high_resolution_clock::now();

    std::thread thread1(update, arr, 0);
    std::thread thread2(update, arr, 16);
    std::thread thread3(update, arr, 32);
    std::thread thread4(update, arr, 48);

    thread1.join();
    thread2.join();
    thread3.join();
    thread4.join();

    auto stop = std::chrono::high_resolution_clock::now();

    auto duration =
        std::chrono::duration_cast<std::chrono::milliseconds>
            (stop - start);

    a += (float) duration.count();
}

std::cout << float(a)/float(b) << std::endl << std::endl;

Average time taken when the updates take place in different cache lines is around 3.3 ms.

Thus it takes almost 3 times longer to update integers in the same cache line than in different cache lines when using multiple threads.

Key Takeaway : Care must be taken when using multithreading for CPU-bound tasks such as the above, as the additional costs due to context switching, thread creation and destruction, and false cache line sharing can significantly reduce the gains.
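
One common way to avoid false sharing, not used in the benchmarks above, is to pad or align each thread's data so that it occupies its own cache line. A minimal sketch assuming 64-byte lines (C++17 also provides std::hardware_destructive_interference_size in <new> as a portable hint, though not every compiler defines it):

#include <thread>
#include <vector>

// Each counter is aligned to (and padded out to) a 64-byte cache line, so
// increments by different threads do not invalidate each other's lines.
struct alignas(64) PaddedCounter {
    long value = 0;
};

int main() {
    PaddedCounter counters[4];

    std::vector<std::thread> workers;
    for (int t = 0; t < 4; t++) {
        // each thread hammers its own counter, which lives in its own cache line
        workers.emplace_back([&counters, t] {
            for (int i = 0; i < 1000000; i++) counters[t].value++;
        });
    }
    for (auto& w : workers) w.join();
    return 0;
}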

Cache Associativity

If we look only at the highest-level cache on my system, the L3 cache, it has a 12 MB capacity and 12-way associativity. This means each cache line can be placed in one of 12 slots in the L3 cache.

Number of cache lines that can be placed in the L3 cache = 12 MB / 64 bytes = 196608.

This is the total number of slots available in the L3 cache.

Since the L3 cache is 12-way associative, the number of distinct groups of 12 slots in the L3 cache = 196608/12 = 16384.

For brevity, let us call such a group of 12 slots a "group" (the standard term is a "set").

But how do we determine which cache line goes into which group?

Since the numbers 0 to 16383 can be represented using 14 bits, if we index the cache lines 0, 1, 2, … then all cache lines whose lower 14 bits are the same are put in the same group in the L3 cache. If more than 12 cache lines with the same lower 14 bits are in use, there will be evictions in the L3 cache.

Thus all integers separated by 16384*16 = 262144 positions in the array occupy the same group in the L3 cache.

If more than 262144*12 = 3145728 such integers are accessed, there will be evictions, and if we try to read the evicted values, they will have to be fetched again from main memory.
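
To make the mapping concrete, here is a minimal sketch that computes the group an array index falls into under these assumptions (16 integers per cache line, 16384 groups, array starting at a line-aligned address; in reality the L3 cache is indexed by physical addresses, so this is only an approximation). The benchmark below then measures what happens when every access maps to the same group.

#include <iostream>

// For a 4-byte int array starting at a line-aligned address:
// cache line index = array index / 16, group = cache line index mod 16384.
const int INTS_PER_LINE = 16;
const int NUM_GROUPS = 16384;   // L3 sets on this machine

int group_of(long long index) {
    return (index / INTS_PER_LINE) % NUM_GROUPS;
}

int main() {
    // Indices separated by 16384*16 = 262144 positions map to the same group.
    std::cout << group_of(0) << " "
              << group_of(262144) << " "
              << group_of(2 * 262144) << std::endl;   // all 0
    // A nearby non-multiple lands in a different group.
    std::cout << group_of(262160) << std::endl;       // 1
    return 0;
}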

#include <chrono>
#include <iostream>

int main(int argc, char* argv[]) {
    const int length = 512*1024*1024;
    const int m = 262144;   // step size in integers
    int *arr = new int[length];

    auto start = std::chrono::high_resolution_clock::now();
    for (int r = 0; r < 10000; r++) {
        // with m=262144, every access maps to the same group of 12 slots
        for (int i = 0; i < length; i += m)
            arr[i]++;
    }
    auto stop = std::chrono::high_resolution_clock::now();

    auto duration =
        std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);

    std::cout << (float) duration.count() << std::endl << std::endl;

    delete[] arr;   // allocated with new[], so use delete[]
    return 0;
}

If we run the above program with values of m ranging from 262100 to 262200, we can see that the time taken for m=262144 is almost 3x that of any other value of m in that range.

This is because for every m in that range except m=262144, the accessed integers get distributed across multiple groups, whereas for m=262144 they all map to the same group; since there are only 12 slots per group, the number of evictions in that group is very high.

In fact, there are 512*1024*1024/262144 = 2048 different elements that map to the same group (their cache line indices have the same lower 14 bits).

Values of m that are multiples of 262144 similarly place all accesses in a single group.

For values of m lower than 262144 but still powers of 2, such as 131072, 65536, 16384 and so on, the integers are distributed across multiple groups.

For example, with m=65536 the accessed cache line indices have the same lower 12 bits, so the accesses are distributed across 4 different groups (14-12 = 2 bits); similarly, with m=16384 they have the same lower 10 bits and are distributed across 16 different groups (14-10 = 4 bits).

Although with a smaller step the integers are distributed across more groups, the number of cache lines mapping to each group remains the same, because with a smaller step such as m=256 we also access proportionally more cache lines than with m=262144.

Thus for values of m that are powers of 2, there is an upward spike in running time compared to the neighboring values of m.
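
The number of distinct groups visited by a power-of-2 step can also be computed directly: a step of m integers advances the cache line index by m/16 each time, so only 16384 / gcd(m/16, 16384) distinct groups are ever touched. A minimal sketch (assuming m is a multiple of 16):

#include <iostream>
#include <numeric>   // std::gcd

// Number of distinct L3 groups visited when stepping through the array
// by m integers at a time (16 ints per cache line, 16384 groups),
// assuming m is a multiple of 16.
long long groups_visited(long long m) {
    long long line_step = m / 16;
    return 16384 / std::gcd(line_step, 16384LL);
}

int main() {
    for (long long m : {16384LL, 65536LL, 131072LL, 262144LL})
        std::cout << "m=" << m << " -> " << groups_visited(m)
                  << " groups" << std::endl;
    // m=16384 -> 16, m=65536 -> 4, m=131072 -> 2, m=262144 -> 1
    return 0;
}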

Key Takeaway : When accessing elements in an array in a stepped fashion, try not to use a step size that is a power of 2.
