Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

measuring code execution times in C using RDTSC instruction

Tags:

c

I wrote a simple program to measure the code execution times using RDTSC instruction. But I don't know whether my result is correct and anything wrong with my code...I have no idea how to verify it.

#include <stdio.h>
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define N (1024*4)

unsigned cycles_low, cycles_high, cycles_low1, cycles_high1;

static __inline__ unsigned long long rdtsc(void)
{
    __asm__ __volatile__ ("RDTSC\n\t"
            "mov %%edx, %0\n\t"
            "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
            "%rax", "rbx", "rcx", "rdx");
}

static __inline__ unsigned long long rdtsc1(void)
{
    __asm__ __volatile__ ("RDTSC\n\t"
            "mov %%edx, %0\n\t"
            "mov %%eax, %1\n\t": "=r" (cycles_high1), "=r" (cycles_low1)::
            "%rax", "rbx", "rcx", "rdx");
}

int main(int argc, char* argv[])
{
    uint64_t start, end;

    rdtsc();
    malloc(N);
    rdtsc1();

    start = ( ((uint64_t)cycles_high << 32) | cycles_low );
    end = ( ((uint64_t)cycles_high1 << 32) | cycles_low1 );

    printf("cycles spent in allocating %d bytes of memory: %llu\n",N, end - start);

    return 0;
}
like image 356
HuangJie Avatar asked Oct 14 '25 21:10

HuangJie


2 Answers

Problems that may effect the results you get are:

  • on most modern 80x86 CPUs TSC measures a fixed frequency clock and not cycles, and therefore the same piece of code can have wildly different "cycles" depending on power management, the load on other logical CPUs in the same core (hyper-threading), the load on other cores (turbo-boost), CPU temperature (thermal throttling), etc.

  • nothing prevents the OS's scheduler from pre-empting your thread immediately after the first rdtsc(); causing the resulting "cycles spent allocating" to include the time the CPU spent executing any number of completely different processes.

  • on some computers the TSC on different CPUs isn't synchronised; and nothing prevents the OS from pre-empting your thread immediately after the first rdtsc(); and then running your thread on a completely different CPU (with a completely different TSC). In this case it's possible for end - start to be negative (like time is going backwards).

  • nothing prevents an IRQ (from hardware) from interrupting your code immediately after the first rdtsc(); causing the resulting "cycles spent allocating" to include the time the OS spent handling any number of IRQs.

  • its impossible to prevent an SMI ("System Management Interrupt") causing the CPU to enter SMM ("System Management Mode") and executing hidden firmware code after the first rdtsc(); causing the resulting "cycles spent allocating" to include the time the CPU spent executing firmware code.

  • some (old) CPUs have a bug where rdtsc gives dodgy results when the lower 32 bits overflow (e.g. when the TSC goes from 0x00000000FFFFFFFF to 0x0000000100000000 you can use rdtsc at the exact wrong time and get 0x0000000000000000).

  • nothing prevents an "out-of-order" modern CPU from rearranging the order that most instructions are executed in, including your rdtsc instructions.

  • your measurement includes the overhead of measuring (e.g. if rdtsc takes 5 cycles and your malloc() costs 20 cycles, then you report 25 cycles and not 20 cycles).

  • with or without a virtual machine; it's possible that the rdtsc instruction is virtualised (e.g. nothing other than common sense prevents a kernel from making rdtsc report how much free disk space there is or anything else it likes). Ideally rdtsc should be virtualised to prevent most of the problems mentioned above and/or to prevent timing side-channels (but it almost never is).

  • on extremely old CPUs (80486 and older) the TSC and rdtsc instruction doesn't exist.


Note: I'm not an expert in GCC's inline assembly; but I strongly suspect your macros are buggy and that the compiler could chose to generate something like this:

    rdtsc
    mov %edx, %eax        ;Oops, trashed the low 32 bits
    mov %eax, %ebx

It should be possible to tell GCC that the value/s are returned in EDX:EAX and get rid of both mov instructions completely.

like image 198
Brendan Avatar answered Oct 17 '25 14:10

Brendan


There are some (non-obvious) issues that you should keep in mind when using RDTSC to time things:

  1. The frequency of the clock that it counts may be unpredictable. On older hardware, the frequency may actually change in between two RDTSC instructions, and even on newer hardware where it is fixed, it can be difficult to tell what frequency it runs at.

  2. Since RDTSC has no inputs, the CPU itself may reorder the RDTSC instruction to come before the code you are trying to measure. Note that this is a different problem from the compiler reordering the code, which you've avoided with __volatile__. To effectively avoid this, you have to execute a serializing instruction, which is an instruction which will prevent the CPU from moving an instruction before it. You can use either CPUID or RDTSCP (which is just a serializing form of RDTSC)

My suggestion: just use whatever high frequency timer API your OS has. On Windows this is QueryPerformanceCounter and on Unix you have gettimeofday or clock_gettime.

Aside from that, your RDTSC code has a few structural issues. The return type is "unsigned long long", but nothing is actually returned. If you fix that, you can avoid storing the result in global variables and you can avoid having to write multiple versions.

like image 29
Evelyn Parenteau Avatar answered Oct 17 '25 14:10

Evelyn Parenteau



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!