Intel's compiler has a pragma that can be used to generate non-temporal stores. For example, I can write
void square(const double* x, double* y, int n) {
#pragma vector nontemporal
    for (int i = 0; i < n; ++i) {
        y[i] = x[i] * x[i];
    }
}
and ICC will generate instructions like this (compiler-explorer)
...
vmovntpd %ymm1, (%rsi,%r9,8) #4.5
...
Do gcc and clang have anything similar? (other than intrinsics)
The non-temporal store makes the code much faster. Using this benchmark
#include <random>
#include <memory>
#include <benchmark/benchmark.h>

static void generate_random_numbers(double* x, int n) {
    std::mt19937 rng{0};
    std::uniform_real_distribution<double> dist{-1, 1};
    for (int i = 0; i < n; ++i) {
        x[i] = dist(rng);
    }
}

static void square(const double* x, double* y, int n) {
#ifdef __INTEL_COMPILER
#pragma vector nontemporal
#endif
    for (int i = 0; i < n; ++i) {
        y[i] = x[i] * x[i];
    }
}

static void BM_Square(benchmark::State& state) {
    const int n = state.range(0);
    std::unique_ptr<double[]> xptr{new double[n]};
    generate_random_numbers(xptr.get(), n);
    for (auto _ : state) {
        std::unique_ptr<double[]> yptr{new double[n]};
        square(xptr.get(), yptr.get(), n);
        benchmark::DoNotOptimize(yptr);
    }
}

BENCHMARK(BM_Square)->Arg(1000000);
BENCHMARK_MAIN();
the non-temporal code runs almost twice as fast on my machine. Here are the full results:
icc:
> icc -O3 -march=native -std=c++11 benchmark.cpp -lbenchmark -lbenchmark_main
> ./a.out
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_Square/1000000 430889 ns 430889 ns 1372
clang:
> clang++ -O3 -march=native -std=c++11 benchmark.cpp -lbenchmark -lbenchmark_main
> ./a.out
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_Square/1000000 781672 ns 781470 ns 820
gcc:
> g++-mp-10 -O3 -march=native -std=c++11 benchmark.cpp -lbenchmark -lbenchmark_main
> ./a.out
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_Square/1000000 681684 ns 681533 ns 782
Note: clang has __builtin_nontemporal_store, but when I try it, it doesn't generate non-temporal instructions (compiler-explorer)
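For reference, here is a hedged sketch (my own, not from the question) of how __builtin_nontemporal_store is meant to be used: it takes a vector-typed value (GNU vector extensions here), and whether it actually emits vmovntpd depends on target flags and the destination's alignment. The gcc fallback branch just uses a plain store, since gcc lacks this builtin.

```cpp
typedef double double4 __attribute__((vector_size(32))); // 4 doubles, 32 bytes

// Sketch: square n doubles, assuming n is a multiple of 4 and y is
// 32-byte aligned (required for a true non-temporal store on x86).
void square_nt(const double* x, double* y, int n) {
    for (int i = 0; i + 4 <= n; i += 4) {
        double4 xi;
        __builtin_memcpy(&xi, x + i, sizeof xi);  // unaligned vector load
        double4 sq = xi * xi;
#if defined(__clang__)
        __builtin_nontemporal_store(sq, (double4*)(y + i));
#else
        __builtin_memcpy(y + i, &sq, sizeof sq);  // gcc: plain store fallback
#endif
    }
}
```
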
I'm really surprised that ICC delivers that performance on such simple code. Back in the day, non-temporal stores improved bandwidth performance by only a few percent.
Maybe things have changed (in which case I'm surprised again that clang and gcc haven't done something about it).
Anyway, you can generate these instructions by using intrinsics.
Here is a sample (in which I didn't implement the scalar logic for the trailing elements past the last multiple of 8):
#include <immintrin.h>

void square_elements(const double* __restrict const x,
                     double* __restrict const y,
                     const int n)
{
    // alignment should be enforced earlier by a call to an aligned allocation
    const double* ax = (const double*) __builtin_assume_aligned(x, 64);
    double* ay = (double*) __builtin_assume_aligned(y, 64);
    for (int i = 0; i < n; i += 8) {
        __m512d xi = _mm512_load_pd(ax + i);
        __m512d mul = _mm512_mul_pd(xi, xi);
        _mm512_stream_pd(ay + i, mul);
    }
}
With a godbolt link to see assembly.
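The scalar tail the sample leaves out (my addition, sketched under the same assumption of an 8-wide vector loop) would simply be:

```cpp
// Handle the trailing n % 8 elements the vectorized loop doesn't touch.
void square_tail(const double* x, double* y, int n) {
    for (int i = n - (n % 8); i < n; ++i)
        y[i] = x[i] * x[i];
}
```
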
[EDIT] Memory alignment (__builtin_assume_aligned) is actually really important. In my experience, aligned moves (vmovapd) are noticeably faster than unaligned moves (vmovupd), even though the documentation says both instructions have the same latency and throughput.
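To actually obtain the 64-byte aligned buffers the code assumes, one option (an assumed setup step, not from the answer) is C++17's std::aligned_alloc; note it requires the size to be a multiple of the alignment:

```cpp
#include <cstdlib>

// Allocate n doubles aligned to 64 bytes, rounding the byte count up
// to a multiple of 64 as std::aligned_alloc requires. Free with std::free.
double* alloc_aligned_doubles(int n) {
    std::size_t bytes = ((n * sizeof(double) + 63) / 64) * 64;
    return static_cast<double*>(std::aligned_alloc(64, bytes));
}
```
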