I have the following C code that I am testing to understand perf and caching. It sequentially accesses an array of doubles.
// test.c
#include <stdio.h>
#include <stdlib.h>
#include <emmintrin.h>

int main(int argc, char **argv)
{
    size_t n = 10000000;
    double *arr;

    if (posix_memalign((void **)&arr, 64, n * sizeof(double)) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        exit(1);
    }

    /* flush every 64th element, i.e. one flush per 8 cache lines */
    for (size_t i = 0; i < n; i += 64) {
        _mm_clflush(&arr[i]);
    }

    /* write the array, then read it back and sum it */
    for (size_t i = 0; i < n; i++) arr[i] = (double)i;

    double sum = 0;
    for (size_t i = 0; i < n; i++) sum += arr[i];

    printf("Sum: %f\n", sum);
    free(arr);
    return 0;
}
I compile the code without optimization and run perf:
$ gcc -o test test.c
$ perf stat -e LLC-loads,LLC-loads-misses -- ./test
Sum: 49999995000000.000000
Performance counter stats for 'test':
94,765 LLC-loads:u
91,979 LLC-loads-misses:u # 97.06% of all LL-cache accesses
0.082974254 seconds time elapsed
0.047802000 seconds user
0.034893000 seconds sys
A few questions puzzle me.
First, LLC-loads is much smaller than I expected. Shouldn't it be close to 10000000/8 = 1,250,000 (one load per 64-byte cache line)? Or is it possible that perf doesn't profile the whole duration of the run?
Second, and more importantly, the miss rate is too high. I expected each read miss to bring in a cache line holding 8 doubles, so the overall miss rate should be close to 1/8. Actually it should be much lower than that, since prefetching should help a lot with this sequential access pattern, right?
BTW, below is the basic info.
$ gcc --version
gcc (GCC) 15.2.0
Copyright (C) 2025 Free Software Foundation, Inc.
$ perf --version
perf version 4.18.0-553.84.1.el8_10.x86_64
$ lscpu | grep "Model name"
Model name: Intel(R) Xeon(R) Gold 6430
$ lscpu | grep "L3 cache"
L3 cache: 61440K
While it's impossible to say for certain without knowing your exact CPU model and without seeing the actual machine code that runs on your CPU, your CPU likely doesn't have enough cache.
Given
size_t n = 10000000
and 8 bytes per double, that's 10,000,000 × 8 = 80,000,000 bytes of memory - close to 80 MB.
That almost certainly doesn't fit into whatever cache your CPU has.
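If you want a quick sanity check on your machine, here is a minimal sketch that compares the working-set size against the L3 size reported by glibc. It assumes Linux with glibc, where _SC_LEVEL3_CACHE_SIZE is a nonstandard sysconf name and may report 0 if the kernel doesn't expose the cache topology:

// cachecheck.c - sketch, assumes glibc's nonstandard _SC_LEVEL3_CACHE_SIZE
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    size_t n = 10000000;                                /* same n as in test.c */
    unsigned long long ws = (unsigned long long)n * sizeof(double);
    long l3 = sysconf(_SC_LEVEL3_CACHE_SIZE);           /* 0 or -1 if unknown */

    printf("working set: %llu bytes\n", ws);
    printf("L3 cache:    %ld bytes\n", l3);
    return 0;
}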
A top-end CPU, such as an AMD Ryzen Threadripper PRO 9995WX, has the following cache:
Cache per CPU Package:
L1 Instruction Cache: 96 x 32 KB
L1 Data Cache: 96 x 48 KB
L2 Cache: 96 x 1024 KB
L3 Cache: 384 MB
Most likely, you're not running on a CPU like that.
A more typical CPU, perhaps something like an Intel Xeon Gold 6230R from a few years ago, probably has cache somewhat like this:
Cache per CPU Package:
L1 Instruction Cache: 26 x 32 KB
L1 Data Cache: 26 x 32 KB
L2 Cache: 26 x 1024 KB
L3 Cache: 36 MB
80 MB isn't going to fit into that even if every element has its own cache line.
After this loop
for (size_t i = 0; i < n; i++) arr[i] = (double)i;
the elements at the beginning of the array have almost certainly been evicted from all of the L1, L2, and L3 caches, so when this runs
double sum = 0;
for (size_t i = 0; i < n; i++) sum += arr[i];
the data has to be retrieved all the way from RAM.
EDIT:
The CPU is an Intel(R) Xeon(R) Gold 6430, with the following cache:
Cache per CPU Package:
L1 Instruction Cache: 32 x 32 KB
L1 Data Cache: 32 x 48 KB
L2 Cache: 32 x 2048 KB
L3 Cache: 60 MB
It'd be interesting to see the cache hit rate for a smaller value of n, perhaps
size_t n = 5000000
which works out to about 40 MB of data and should fit in the 60 MB L3; see the sketch below for one way to compare the two sizes.
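Here is a minimal sketch of the same program with n taken from the command line (the file name and the default value are just illustrative), so both sizes can be measured with the same perf command:

// test_n.c - sketch: same access pattern as test.c, but n comes from argv[1]
#include <stdio.h>
#include <stdlib.h>
#include <emmintrin.h>

int main(int argc, char **argv)
{
    size_t n = (argc > 1) ? strtoull(argv[1], NULL, 10) : 10000000;
    double *arr;

    if (posix_memalign((void **)&arr, 64, n * sizeof(double)) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        exit(1);
    }

    for (size_t i = 0; i < n; i += 64)      /* flush as in the original */
        _mm_clflush(&arr[i]);

    for (size_t i = 0; i < n; i++) arr[i] = (double)i;

    double sum = 0;
    for (size_t i = 0; i < n; i++) sum += arr[i];

    printf("Sum: %f\n", sum);
    free(arr);
    return 0;
}

For example:
$ gcc -o test_n test_n.c
$ perf stat -e LLC-loads,LLC-loads-misses -- ./test_n 5000000
$ perf stat -e LLC-loads,LLC-loads-misses -- ./test_n 10000000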