Unexpectedly slow performance on moving from register to frequently accessed variable

I'm learning about how caching works using the following example:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t data_t;
const int U = 10000000;   // size of the array. 10 million vals ~= 40MB
const int N = 100000000;  // number of searches to perform

int main() {
  data_t* data = (data_t*) malloc(U * sizeof(data_t));
  if (data == NULL) {
    fprintf(stderr, "Error: not enough memory\n");
    exit(1);
  }

  // fill up the array with sequential (sorted) values.
  int i;
  for (i = 0; i < U; i++) {
    data[i] = i;
  }

  printf("Allocated array of size %d\n", U);
  printf("Summing %d random values...\n", N);

  data_t val = 0;
  unsigned int seed = 42;  // rand_r() expects an unsigned int*
  for (i = 0; i < N; i++) {
    int l = rand_r(&seed) % U;
    val = (val + data[l]);
  }

  free(data);
  printf("Done. Value = %d\n", val);
  return 0;
}

The relevant annotation of the slow random-access loop, obtained with perf record ./sum followed by perf report, is:

  0.05 │       mov    -0x18(%rbp),%eax                                                                 ▒
  0.07 │       mov    -0x10(%rbp),%rcx                                                                 ▒
       │       movslq -0x20(%rbp),%rdx                                                                 ▒
  0.03 │       add    (%rcx,%rdx,4),%eax                                                               ▒
 95.39 │       mov    %eax,-0x18(%rbp)                                                                 ▒
  1.34 │       mov    -0x14(%rbp),%eax                                                                 ▒
       │       add    $0x1,%eax                                                                        ◆
       │       mov    %eax,-0x14(%rbp)

At this point, -0x18 holds val, -0x10 holds data, -0x14 holds i, and -0x20 holds l. The numbers in the left column show the % of time spent on each instruction.

I expected the line add (%rcx,%rdx,4),%eax to take the most time, since it has to perform a random-access load of data[l] (which is just (%rcx,%rdx,4)). That value should be in the L1 cache only about 16k/U = 0.16% of the time, since my L1 cache is 64k bytes, or 16k integers, so this operation should be slow. In contrast, the instruction that is apparently slow just moves from the register %eax into val, which is used so often that it is definitely in cache. Can anyone explain what's going on?

asked Sep 01 '25 by Steven Mai
1 Answer

HW performance counters usually "blame" the instruction waiting for the slow result (here the store), not the instruction that was slow to produce it (the memory-source add, which micro-fuses into a single load+add uop).

And/or they blame the next instruction after a cache-miss load regardless of data dependencies. This is called a "skid" or "skew". See for example https://easyperf.net/blog/2018/08/29/Understanding-performance-events-skid and https://www.brendangregg.com/perf.html

My hypothesis for the cause of this effect: Intel CPUs, I think, wait for the oldest instruction in the ROB to retire when a sampling interrupt is raised, perhaps to avoid starving the main thread in a high-interrupt situation. A cache-miss load that stalls out-of-order execution will be the oldest instruction in the ROB, unable to retire until the load data arrives (x86's strongly-ordered memory model doesn't let loads retire before then, even if they're known to be non-faulting, unlike ARM). So when the "cycles" event's counter ticks down to zero and triggers a sample, the cache-miss load retires and the next instruction in program order gets the blame for that sample.

For events intended to attach to a specific instruction, such as mem_load_retired.l3_miss, a skid is more problematic, but Intel PEBS avoids it. The previous paragraph was about the "cycles" event, which ticks every cycle while stalled; you could potentially see the same effect with mem_load_retired.l3_miss, since the miss can't be detected until the reply comes back from the L3 slice.

In code that doesn't stall, the first or second instruction of a group that all retires in the same cycle may get the blame. The CPU has to pick one instruction to blame out of everything in flight in the pipeline, either by where it raises the interrupt (non-PEBS) or by which instruction address goes into the PEBS buffer.


See also Inconsistent `perf annotate` memory load/store time reporting, a less simple/obvious case in which blaming the instruction waiting for the slow result is nonetheless a key part of the explanation.

answered Sep 05 '25 by Peter Cordes

