Unexpectedly slow performance on moving from register to frequently accessed variable

I'm learning about how caching works using the following example:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t data_t;
const int U = 10000000;   // size of the array. 10 million vals ~= 40MB
const int N = 100000000;  // number of searches to perform

int main() {
  data_t* data = (data_t*) malloc(U * sizeof(data_t));
  if (data == NULL) {
    fprintf(stderr, "Error: not enough memory\n");
    exit(1);
  }

  // fill up the array with sequential (sorted) values.
  int i;
  for (i = 0; i < U; i++) {
    data[i] = i;
  }

  printf("Allocated array of size %d\n", U);
  printf("Summing %d random values...\n", N);

  data_t val = 0;
  unsigned int seed = 42;  // rand_r() expects an unsigned int*
  for (i = 0; i < N; i++) {
    int l = rand_r(&seed) % U;
    val = (val + data[l]);
  }

  free(data);
  printf("Done. Value = %d\n", val);
  return 0;
}

The relevant annotation of the slow random-access loop, obtained with perf record ./sum followed by perf report, is:

  0.05 │       mov    -0x18(%rbp),%eax                                                                 ▒
  0.07 │       mov    -0x10(%rbp),%rcx                                                                 ▒
       │       movslq -0x20(%rbp),%rdx                                                                 ▒
  0.03 │       add    (%rcx,%rdx,4),%eax                                                               ▒
 95.39 │       mov    %eax,-0x18(%rbp)                                                                 ▒
  1.34 │       mov    -0x14(%rbp),%eax                                                                 ▒
       │       add    $0x1,%eax                                                                        ◆
       │       mov    %eax,-0x14(%rbp)

At this point, -0x18 holds val, -0x10 holds data, -0x14 holds i, and -0x20 holds l. The numbers in the left column show the % of time spent on each instruction.

I expected the line add (%rcx,%rdx,4),%eax to take the most time, since it has to perform a random-access load of data[l] (which is just (%rcx,%rdx,4)). That value should be in the L1 cache only about 16k/U = 0.16% of the time, since my L1 cache is 64k bytes, or 16k integers, so this operation should be slow. In contrast, the instruction that is apparently slow just moves from the register %eax into val, which is used so often that it is definitely in cache. Can anyone explain what's going on?

asked Sep 01 '25 by Steven Mai
1 Answer

HW performance counters usually "blame" the instruction waiting for the slow result (here the store), not the instruction that was slow to produce it (the memory-source add, which micro-fuses into a single load+add uop).

And/or they blame the next instruction after a cache-miss load regardless of data dependencies. This is called a "skid" or "skew". See for example https://easyperf.net/blog/2018/08/29/Understanding-performance-events-skid and https://www.brendangregg.com/perf.html

My hypothesis for the cause of this effect: Intel CPUs, I think, wait for the oldest instruction in the ROB to retire when a sampling interrupt is raised, perhaps to avoid starving the main thread in a high-interrupt situation. A cache-miss load that stalls out-of-order execution will be the oldest instruction in the ROB, unable to retire until the load data arrives (x86's strongly-ordered memory model doesn't let loads retire before then, even if they're known to be non-faulting, unlike ARM). So when the "cycles" event's counter ticks down to zero and triggers a sample, the cache-miss load retires and the next instruction in program order gets the blame for that sample.

For events intended to attach to a specific instruction, such as mem_load_retired.l3_miss, a skid is more problematic, but Intel PEBS avoids it. The previous paragraph was about the "cycles" event, which ticks every cycle while stalled; you could potentially see the same effect with mem_load_retired.l3_miss, since the miss can't be detected until the reply comes back from the L3 slice.

In code that doesn't stall, the first or second instruction of a group that all retires in the same cycle may get the blame. The CPU has to pick one instruction to blame out of everything in flight in the pipeline, either by where it raises the interrupt (non-PEBS) or by which instruction address goes into the PEBS buffer.


See also Inconsistent `perf annotate` memory load/store time reporting, a less simple/obvious case in which blaming the instruction waiting for the slow result is nonetheless a key part of the explanation.

answered Sep 05 '25 by Peter Cordes

