Intel's compiler has a pragma that can be used to generate non-temporal stores. For example, I can write
void square(const double* x, double* y, int n) {
#pragma vector nontemporal
    for (int i = 0; i < n; ++i) {
        y[i] = x[i] * x[i];
    }
}
and ICC will generate instructions like this (compiler-explorer)
...
vmovntpd %ymm1, (%rsi,%r9,8) #4.5
...
Do gcc and clang have anything similar? (other than intrinsics)
The non-temporal store makes the code much faster. Using this benchmark
#include <random>
#include <memory>
#include <benchmark/benchmark.h>

static void generate_random_numbers(double* x, int n) {
    std::mt19937 rng{0};
    std::uniform_real_distribution<double> dist{-1, 1};
    for (int i = 0; i < n; ++i) {
        x[i] = dist(rng);
    }
}

static void square(const double* x, double* y, int n) {
#ifdef __INTEL_COMPILER
#pragma vector nontemporal
#endif
    for (int i = 0; i < n; ++i) {
        y[i] = x[i] * x[i];
    }
}

static void BM_Square(benchmark::State& state) {
    const int n = state.range(0);
    std::unique_ptr<double[]> xptr{new double[n]};
    generate_random_numbers(xptr.get(), n);
    for (auto _ : state) {
        std::unique_ptr<double[]> yptr{new double[n]};
        square(xptr.get(), yptr.get(), n);
        benchmark::DoNotOptimize(yptr);
    }
}

BENCHMARK(BM_Square)->Arg(1000000);
BENCHMARK_MAIN();
the non-temporal code runs almost twice as fast on my machine. Here are the full results:
icc:
> icc -O3 -march=native -std=c++11 benchmark.cpp -lbenchmark -lbenchmark_main
> ./a.out
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_Square/1000000 430889 ns 430889 ns 1372
clang:
> clang++ -O3 -march=native -std=c++11 benchmark.cpp -lbenchmark -lbenchmark_main
> ./a.out
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_Square/1000000 781672 ns 781470 ns 820
gcc:
> g++-mp-10 -O3 -march=native -std=c++11 benchmark.cpp -lbenchmark -lbenchmark_main
> ./a.out
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_Square/1000000 681684 ns 681533 ns 782
Note: clang has __builtin_nontemporal_store, but when I try it, it doesn't generate non-temporal instructions (compiler-explorer)
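For reference, here is a hedged sketch (my own, not from the question) of how __builtin_nontemporal_store is meant to be used: it takes a vector-typed value (GNU vector extensions here), and whether it actually emits vmovntpd depends on target flags and the destination's alignment. The gcc fallback branch just uses a plain store, since gcc lacks this builtin.

```cpp
typedef double double4 __attribute__((vector_size(32))); // 4 doubles, 32 bytes

// Sketch: square n doubles, assuming n is a multiple of 4 and y is
// 32-byte aligned (required for a true non-temporal store on x86).
void square_nt(const double* x, double* y, int n) {
    for (int i = 0; i + 4 <= n; i += 4) {
        double4 xi;
        __builtin_memcpy(&xi, x + i, sizeof xi);  // unaligned vector load
        double4 sq = xi * xi;
#if defined(__clang__)
        __builtin_nontemporal_store(sq, (double4*)(y + i));
#else
        __builtin_memcpy(y + i, &sq, sizeof sq);  // gcc: plain store fallback
#endif
    }
}
```
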
I'm really surprised that ICC delivers that performance on such simple code. Back in the day, non-temporal stores improved bandwidth performance by only a few percent.
Maybe things have changed (in which case I'm surprised again that clang and gcc haven't done something about it).
Anyway, you can generate these instructions by using intrinsics.
Here is a sample (in which I didn't implement the scalar logic for the trailing elements past the last multiple of 8):
#include <immintrin.h>

void square_elements(const double* __restrict const x,
                     double* __restrict const y,
                     const int n)
{
    // alignment should be enforced earlier by a call to an aligned allocation
    const double* ax = (const double*) __builtin_assume_aligned(x, 64);
    double* ay = (double*) __builtin_assume_aligned(y, 64);
    for (int i = 0; i < n; i += 8) {
        __m512d xi = _mm512_load_pd(ax + i);
        __m512d mul = _mm512_mul_pd(xi, xi);
        _mm512_stream_pd(ay + i, mul);
    }
}
With a godbolt link to see assembly.
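The scalar tail the sample leaves out (my addition, sketched under the same assumption of an 8-wide vector loop) would simply be:

```cpp
// Handle the trailing n % 8 elements the vectorized loop doesn't touch.
void square_tail(const double* x, double* y, int n) {
    for (int i = n - (n % 8); i < n; ++i)
        y[i] = x[i] * x[i];
}
```
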
[EDIT] Memory alignment (__builtin_assume_aligned) is actually really important. In my experience, aligned moves (vmovapd) are noticeably faster than unaligned moves (vmovupd), even though the documentation says both instructions have the same latency and throughput.
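To actually obtain the 64-byte aligned buffers the code assumes, one option (an assumed setup step, not from the answer) is C++17's std::aligned_alloc; note it requires the size to be a multiple of the alignment:

```cpp
#include <cstdlib>

// Allocate n doubles aligned to 64 bytes, rounding the byte count up
// to a multiple of 64 as std::aligned_alloc requires. Free with std::free.
double* alloc_aligned_doubles(int n) {
    std::size_t bytes = ((n * sizeof(double) + 63) / 64) * 64;
    return static_cast<double*>(std::aligned_alloc(64, bytes));
}
```
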