
Does using SIMD have an initialisation cost

Tags: x86-64, simd, arm64

Do any commonly used consumer devices have a power/frequency ramp-up period before the SIMD subsystem can either start at all or run at full frequency? Is the stall measured in clock cycles or microseconds?

Conversely, how many non-SIMD instructions can one typically execute before the SIMD performance is lost, or is such a condition detected by some other means?

I'm mostly interested in modern arm64 (Cortex-A53, A55, A75, and A77 implementations, and the Apple M1).

EDIT

The Intel case seems to be reasonably covered in "SIMD instructions lowering CPU frequency", which leads to further links stating a maximum period of 8.5 μs for a "hard transition", during which the execution units are halted (if I understood it correctly). It also contradicts my intuition: using AVX-512 instructions apparently requires the frequency to be ramped down.

Aki Suihkonen asked Mar 07 '26 22:03


1 Answer

This answer applies to PCs, not ARM64.

Do any commonly used consumer devices have a power/frequency ramp-up period before the SIMD subsystem can either start at all or run at full frequency?

“No” for starting at all. SSE was designed as a replacement for the x87 FPU. CPUs never power off the SIMD hardware as a whole, because most programs occasionally use floating-point math.

However, Intel CPUs do power off some of the hardware. The first time a program uses 32-byte or 64-byte vectors, those instructions run a lot slower until the CPU has transitioned to the proper power state.

For Intel Sandy Bridge, Ivy Bridge, and Haswell, the penalty applies to 32-byte vectors.

For Intel Skylake, the penalty applies to 32-byte and 64-byte vectors. The warm-up duration is 56,000 clock cycles, or 14 μs.

For Intel Ice Lake and Tiger Lake, the penalty applies only to 64-byte vectors. The warm-up duration is about 50,000 clock cycles.

During the warm-up period, throughput is halved and instructions have extra latency. Note that the warm-up is agnostic to the instruction set; it depends only on the size of the vectors. AVX1, AVX2, and AVX-512 instructions that handle 16-byte vectors always run at full speed.
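The per-microarchitecture rules above can be written down as a small lookup table. This is only an illustrative sketch: the names `WARMUP_WIDTHS` and `has_warmup_penalty` are made up for this example, and the table contains nothing beyond what this answer states. It is not a way to query real hardware.

```python
# Vector widths (in bytes) for which this answer lists a warm-up penalty.
# The table and the helper name are hypothetical, for illustration only.
WARMUP_WIDTHS = {
    "Sandy Bridge": {32},
    "Ivy Bridge": {32},
    "Haswell": {32},
    "Skylake": {32, 64},
    "Ice Lake": {64},
    "Tiger Lake": {64},
}


def has_warmup_penalty(microarchitecture: str, vector_bytes: int) -> bool:
    """Return True if the answer lists a warm-up penalty for this width.

    False only means "no penalty listed here"; it does not imply that the
    CPU actually supports vectors of that width.
    """
    return vector_bytes in WARMUP_WIDTHS[microarchitecture]


if __name__ == "__main__":
    for uarch in WARMUP_WIDTHS:
        penalized = [w for w in (16, 32, 64) if has_warmup_penalty(uarch, w)]
        print(f"{uarch}: penalized widths (bytes) = {penalized}")
```

Note that 16-byte vectors never appear in the table, which matches the statement that they always run at full speed.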

how many non-SIMD instructions can one typically execute before the SIMD performance is lost

Skylake CPUs revert to the idle state after 2.7 million clock cycles (675 μs) are spent running instructions with a SIMD width of ≤ 16 bytes.
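Both Skylake figures in this answer are given in cycles and in microseconds, so the clock frequency they assume can be worked out. The sketch below does that arithmetic. The helper names are invented for this example, and converting the Ice Lake cycle count at the same frequency is purely an assumption, since the answer gives no time for it.

```python
def implied_frequency_ghz(cycles: float, microseconds: float) -> float:
    """Clock frequency in GHz implied by a (cycle count, duration) pair."""
    # cycles per microsecond is MHz; divide by 1000 to get GHz.
    return cycles / microseconds / 1000.0


def cycles_to_microseconds(cycles: float, frequency_ghz: float) -> float:
    """Duration in microseconds of `cycles` at the given clock frequency."""
    return cycles / (frequency_ghz * 1000.0)


# Figures quoted in the answer for Skylake.
warmup_ghz = implied_frequency_ghz(56_000, 14)       # warm-up period
revert_ghz = implied_frequency_ghz(2_700_000, 675)   # revert-to-idle timeout

# ASSUMPTION: an Ice Lake core running at the same clock as above.
ice_lake_us = cycles_to_microseconds(50_000, warmup_ghz)

if __name__ == "__main__":
    print(f"warm-up figure implies {warmup_ghz} GHz")
    print(f"revert figure implies {revert_ghz} GHz")
    print(f"50,000 cycles at that clock: {ice_lake_us} us (assumed clock)")
    print(f"revert timeout / warm-up: {2_700_000 / 56_000:.1f}x")
```

Both pairs work out to 4 GHz, so on a core running at a different clock the cycle counts or the microsecond values will differ accordingly.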

For more information, see the microarchitecture guide by Agner Fog.

Soonts answered Mar 10 '26 10:03