Consider the following code segment:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define ARRAYSIZE(arr) (sizeof(arr)/sizeof(arr[0]))
inline void
clflush(volatile void *p)
{
    asm volatile ("clflush (%0)" :: "r"(p));
}

inline uint64_t
rdtsc()
{
    unsigned long a, d;
    asm volatile ("cpuid; rdtsc" : "=a" (a), "=d" (d) : : "ebx", "ecx");
    return a | ((uint64_t)d << 32);
}

inline int func() { return 5; }

inline void test()
{
    uint64_t start, end;
    char c;

    start = rdtsc();
    func();
    end = rdtsc();
    printf("%ld ticks\n", end - start);
}

void flushFuncCache()
{
    // Assuming function to be not greater than 320 bytes.
    char* fPtr = (char*)func;
    clflush(fPtr);
    clflush(fPtr+64);
    clflush(fPtr+128);
    clflush(fPtr+192);
    clflush(fPtr+256);
}

int main(int ac, char **av)
{
    test();
    printf("Function must be cached by now!\n");
    test();
    flushFuncCache();
    printf("Function flushed from cache.\n");
    test();
    printf("Function must be cached again by now!\n");
    test();
    return 0;
}
Here, I am trying to flush the instruction cache to remove the code for 'func', expecting a performance overhead on the next call to func, but my results don't match my expectations:
858 ticks
Function must be cached by now!
788 ticks
Function flushed from cache.
728 ticks
Function must be cached again by now!
710 ticks
I was expecting CLFLUSH to also flush the instruction cache, but apparently it is not doing so. Can someone explain this behavior or suggest how to achieve the desired behavior?
Your code does almost nothing in func, and the little it does gets inlined into test, and is probably optimized out entirely since you never use the return value.
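If you actually want a call to sit inside the timed region, one option (a sketch, keeping your rdtsc() and printf harness, and using the GCC/Clang noinline attribute plus a hypothetical volatile sink variable) is to forbid inlining and consume the return value:

__attribute__((noinline)) int func(void) { return 5; }

volatile int sink;   /* a volatile store the compiler cannot discard */

void test(void)
{
    uint64_t start, end;

    start = rdtsc();
    sink = func();   /* a real call now happens between the two reads */
    end = rdtsc();
    printf("%lu ticks\n", (unsigned long)(end - start));
}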
gcc -O3 gives me -
0000000000400620 <test>:
400620: 53 push %rbx
400621: 0f a2 cpuid
400623: 0f 31 rdtsc
400625: 48 89 d7 mov %rdx,%rdi
400628: 48 89 c6 mov %rax,%rsi
40062b: 0f a2 cpuid
40062d: 0f 31 rdtsc
40062f: 5b pop %rbx
...
So you're measuring the time for two moves that are very cheap in hardware; your measurement is probably dominated by the latency of cpuid, which is relatively expensive.
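As a sanity check (a sketch that reuses your rdtsc() unchanged; the exact numbers will vary by CPU), you can time an empty region to see how much of those ~700-850 ticks is just the cpuid/rdtsc bracketing itself:

uint64_t measurement_overhead(void)
{
    uint64_t best = UINT64_MAX;   /* from <stdint.h>, already included */
    for (int i = 0; i < 1000; i++) {
        uint64_t s = rdtsc();
        uint64_t e = rdtsc();     /* nothing measured in between */
        if (e - s < best)
            best = e - s;
    }
    return best;                  /* rough lower bound on the timing overhead */
}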
Worse, your clflush would actually flush test as well. That means you pay the re-fetch penalty the next time you access test, but that access is outside the rdtsc pair, so it isn't measured. The code you do measure follows sequentially, so fetching test would probably also fetch the flushed code you are measuring, meaning it could already be cached again by the time you measure it.
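One way to make the flush more targeted (a sketch, assuming GCC/Clang attributes; the 320-byte size is the same guess your original code makes, and the section name is arbitrary) is to keep func out of line, start it on a cache-line boundary in its own section away from test, flush only its lines, and fence before measuring:

__attribute__((noinline, aligned(64), section(".timed_text")))
int func(void) { return 5; }

void flushFuncCache(void)
{
    volatile char *p = (volatile char *)func;   /* func is out of line now */
    for (int off = 0; off < 320; off += 64)     /* same 320-byte assumption */
        clflush(p + off);
    asm volatile ("mfence" ::: "memory");       /* make sure the flushes have completed */
    unsigned int a = 0;
    asm volatile ("cpuid" : "+a"(a) :: "ebx", "ecx", "edx", "memory"); /* serialize before timing */
}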