I'm trying to understand how to measure performance, so I wrote a very simple program:
section .text
    global _start
_start:
    mov rax, 60        ; syscall number 60 = exit on x86-64 Linux
    syscall            ; exit status is whatever is in rdi (not set here)
I ran the program with perf stat ./bin. What surprised me is that stalled-cycles-frontend was so high.
      0.038132      task-clock (msec)         #    0.148 CPUs utilized          
             0      context-switches          #    0.000 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
             2      page-faults               #    0.052 M/sec                  
       107,386      cycles                    #    2.816 GHz                    
        81,229      stalled-cycles-frontend   #   75.64% frontend cycles idle   
        47,654      instructions              #    0.44  insn per cycle         
                                              #    1.70  stalled cycles per insn
         8,601      branches                  #  225.559 M/sec                  
           929      branch-misses             #   10.80% of all branches        
   0.000256994 seconds time elapsed
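The percentages and ratios in the right-hand column are derived from the raw counts. A small sketch that recomputes them with awk, assuming the counts are copied by hand from the output above (the variable names are illustrative, not anything perf defines):

```shell
#!/bin/sh
# Raw counts copied from the perf stat output above.
cycles=107386
stalled=81229
insns=47654
branches=8601
misses=929

# Recompute the derived metrics that perf prints after the '#'.
summary=$(awk -v c="$cycles" -v s="$stalled" -v i="$insns" \
              -v b="$branches" -v m="$misses" 'BEGIN {
    printf "frontend cycles idle:    %.2f%%\n", 100 * s / c
    printf "insn per cycle:          %.2f\n", i / c
    printf "stalled cycles per insn: %.2f\n", s / i
    printf "branch miss rate:        %.2f%%\n", 100 * m / b
}')

printf '%s\n' "$summary"
```

This reproduces 75.64%, 0.44, 1.70 and 10.80%, so the stall percentage is simply stalled cycles divided by all counted cycles, whichever privilege level they were spent in.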
As I understand it, stalled-cycles-frontend means that the CPU front end had to wait for some operation (e.g. a bus transaction) to complete.
So what caused the CPU front end to wait most of the time in such a simple case?
And 2 page faults? Why? I read no memory pages.
Page faults include code pages.
perf stat includes startup overhead.
IDK the details of how perf starts counting, but presumably it has to program the performance counters in kernel mode, so they're counting while the CPU switches back to user mode (stalling for many cycles, especially on a kernel with Meltdown defenses, which invalidate the TLBs).
I guess most of the 47,654 instructions that were recorded were kernel code, perhaps including the page-fault handler!
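One way to check that guess is to split the counts by privilege level with perf's :u (user-only) and :k (kernel-only) event modifiers. A minimal sketch, assuming the binary is at ./bin as in the question and that the system's perf_event_paranoid setting allows counting kernel events (otherwise the :k lines may come back as not counted):

```shell
#!/bin/sh
# Path to the test binary; ./bin is taken from the question.
BIN=./bin

# :u counts only user-space events, :k only kernel-space events.
EVENTS="cycles:u,instructions:u,cycles:k,instructions:k"

if command -v perf >/dev/null 2>&1 && [ -x "$BIN" ]; then
    perf stat -e "$EVENTS" "$BIN" \
        || echo "perf stat failed (check perf_event_paranoid)"
else
    echo "perf or $BIN not found; nothing measured"
fi
```

If the guess is right, the :u counts should be tiny, since the program only executes two instructions in user space, and nearly everything should show up in the :k counts.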
I guess your process never goes user->kernel->user, the whole process is kernel->user->kernel (startup, syscall to invoke sys_exit, then never returns to user-space), so there's never a case where the TLBs would have been hot anyway, except maybe when running inside the kernel after the sys_exit system call.  And anyway, TLB misses aren't page faults, but this would explain lots of stalled cycles.
The user->kernel transition itself explains about 150 stalled cycles, BTW.  syscall is faster than a cache miss (except it's not pipelined, and in fact flushes the whole pipeline; i.e. the privilege level is not renamed.)