Invalid Operation with Arm64 fcmp and simd

Consider the following snippet:

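ldr q0, [x0]              // load 16 bytes
cmeq v0.16b, v0.16b, #0   // 0xFF in every byte lane that was zero
shrn v0.8b, v0.8h, #4     // narrow to a 64-bit mask, 4 bits per input byte
fcmp d0, #0.0             // compare that mask, read as a double, against 0.0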

This is a common way to implement functions such as strlen with SIMD. According to the Arm64 Architecture Reference Manual (version L.b), fcmp can generate an Invalid Operation floating-point exception if d0 is a signaling NaN. A signaling NaN has all of bits 62-52 set, bit 51 clear, and at least one lower fraction bit set. Because this use of shrn produces bands of 4 equal bits, bit 52 and bit 51 can have different values, so d0 can hold a signaling NaN.
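To make the bit pattern concrete, here is a minimal sketch in portable C that classifies a 64-bit mask as a signaling NaN using integer operations only. The function name is hypothetical, and it assumes the usual binary64 encoding in which the top fraction bit (bit 51) is the quiet bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the 64-bit pattern, read as an IEEE 754 binary64, is a signaling
   NaN: exponent field (bits 62-52) all ones, quiet bit (bit 51) clear, and
   a non-zero fraction (otherwise the value is an infinity).
   Hypothetical helper for illustration only. */
bool is_signaling_nan_bits(uint64_t bits)
{
    const uint64_t exponent = (bits >> 52) & 0x7FF;
    const uint64_t fraction = bits & 0x000FFFFFFFFFFFFFULL;
    const bool quiet_bit = (bits >> 51) & 1;
    return exponent == 0x7FF && fraction != 0 && !quiet_bit;
}
```

For example, zero bytes at positions 0, 13, 14 and 15 of the vector give the mask 0xFFF000000000000F, which this check reports as a signaling NaN. Zero bytes at positions 13 to 15 alone give 0xFFF0000000000000, which is negative infinity rather than a NaN.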

I checked on Linux Arm64: the IOC bit of the FPSR register is set after performing fcmp with a signaling NaN (and it is clear before). However, it did not raise an exception or crash the program. Is this a characteristic of Linux? Could this code raise an exception on some other OS or distro? If so, is it really portable to use these instructions to implement functions such as strlen?

Asked by alexisrdt on Nov 08 '25 at 04:11


1 Answer

This is unsafe unless you have full control of the FP environment, for two reasons: exception masking, and the fact that it breaks in fast-math mode. FTZ mode treats subnormals (aka denormals) as 0.0 on input to all FP ops, including compare. The compare would therefore ignore zero bytes near the start of a vector whenever the last three bytes (the ones that land in the exponent field) are all non-zero, because the compare mask viewed as a binary64 then has an all-zero exponent field. You'd also have to check that no AArch64 CPUs are really slow with subnormal compares in non-FTZ mode; some really old Intel CPUs do have microcode assists for that (e.g. Core 2 from 2006).
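The FTZ failure case can be sketched with integer operations alone. The helper below is hypothetical: it takes the 64-bit mask (4 bits per input byte, lowest nibble for byte 0) and reports whether a real match would look like 0.0 once subnormal inputs are flushed.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a non-zero mask would still compare equal to 0.0 when subnormal
   inputs are flushed to zero, i.e. its exponent field (bits 62-52) is all
   zero. The sign bit is not checked because -0.0 also compares equal to 0.0.
   Hypothetical helper for illustration only. */
bool match_lost_under_ftz(uint64_t mask)
{
    const uint64_t exponent = (mask >> 52) & 0x7FF;
    return mask != 0 && exponent == 0;
}
```

A zero byte at position 0 (mask 0x000000000000000F) or position 12 (mask 0x000F000000000000) is lost, while a zero byte at position 13 (mask 0x00F0000000000000) sets exponent bits and is still seen.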

(Programs linked with gcc -ffast-math will set FTZ mode in their CRT startup code.)

It's potentially also slower than fmov x0, d0 / cbz integer compare; getting data forwarded from the FP execution units to integer should be about the same cost whether that's the flags result of a compare or a 64-bit register value. An FP compare takes more cycles than an integer compare/test, as you can see from looking at the latency of SIMD packed-compare instructions that produce a mask in a vector reg.

(Update: @fuz says this fcmp / branch is more efficient in general, especially on some CPUs which have slow FP->integer data transfer. But of course you need full control of the FP environment to make it safe. So not usable in a library strlen, but potentially in your own programs.)

If you want to know the exact length down to the byte rather than just somewhere in a 16-byte vector, you need fmov / rbit / clz (the latter two insns finding the position of the lowest non-zero bit). Doing that fmov as part of your loop condition saves code-size later.
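As a rough C model of the fmov / rbit / clz step: once the mask is in an integer register, counting trailing zero bits and dividing by 4 gives the byte index. The function name is hypothetical, and __builtin_ctzll is a GCC/Clang builtin used here as a stand-in for the rbit + clz pair.

```c
#include <assert.h>
#include <stdint.h>

/* Index (0-15) of the first zero byte, given the 64-bit mask with 4 bits per
   input byte (lowest nibble = byte 0).
   Hypothetical helper; __builtin_ctzll is a GCC/Clang extension. */
unsigned first_zero_index(uint64_t mask)
{
    /* Count-trailing-zeros of 0 is undefined, so the caller must already
       know that the chunk contains a zero byte. */
    assert(mask != 0);
    return (unsigned)__builtin_ctzll(mask) >> 2;  /* bit position / 4 */
}
```

With the mask 0x00F0000000000000 the lowest set bit is bit 52, so the first zero byte is at index 13.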


Your actual question about FP exceptions

FP exceptions are masked by default, so raising one only sets a bit in the FP environment as you found. Only if you unmask them is there an actual trap (branch to an exception handler in kernel code), which the kernel would handle and deliver SIGFPE. You can use glibc feenableexcept to unmask some or all exceptions.
(On AArch64, support for unmasked FP exceptions is optional; not all CPUs even support it. On x86 it's mandatory.)

For the same reason, dividing 1.0 / 0.0 silently produces +inf, and taking the square root of a negative number silently produces NaN, with the exceptions raised just setting bits in the FP status register.

Same for other FP exception types, like precision exception which happens any time the bits discarded by rounding weren't all zeros.

Some operations like comparison don't normally trigger FP exceptions even with quiet-NaN as an input. Signalling NaN can create an FP exception even in operations that are normally "quiet", but doesn't override the exception-mask.


Fun fact: Glibc strlen doesn't use fcmp; it always uses fmov to a GPR for rbit / clz, or cmeq v0.8b, v0.8b, 0 / fmov/cbz to keep looping after using 2x uminp to pack 2 vectors (32 bytes) down to 8 bytes. https://codebrowser.dev/glibc/glibc/sysdeps/aarch64/multiarch/strlen_asimd.S.html#158
(Startup for the first 32 bytes is done with scalar bithacks to make the short-string case fast.)


Anyway, if you care about not trapping in code that has unmasked some FP exceptions, and/or not raising spurious FP exceptions that pollute the FP environment, then yes, use fmov x0, d0 and cbz or whatever instead of fcmp / bne.

Reducing the compare mask down to 8 bytes means it can fit in an integer register, so that's a good option vs. treating it as an IEEE binary64 double FP value.
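A portable C model of that reduction may help show why the integer view is enough. This is only a sketch of what cmeq + shrn compute, not how you would write the loop itself; both function names are hypothetical, and it assumes byte 0 of the chunk is the lowest vector lane, as on little-endian AArch64.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of cmeq v0.16b, v0.16b, #0 followed by shrn v0.8b, v0.8h, #4:
   nibble k of the result is 0xF if chunk[k] == 0, otherwise 0x0.
   Hypothetical helper for illustration only. */
uint64_t zero_byte_nibble_mask(const unsigned char chunk[16])
{
    uint64_t mask = 0;
    for (int k = 0; k < 16; ++k) {
        if (chunk[k] == 0)
            mask |= 0xFULL << (4 * k);
    }
    return mask;
}

/* Integer test on the mask, the counterpart of fmov x0, d0 + cbnz x0.
   Every non-zero bit pattern counts as a match, with no NaN or subnormal
   special cases. */
bool chunk_has_zero_byte(const unsigned char chunk[16])
{
    return zero_byte_nibble_mask(chunk) != 0;
}
```

A chunk whose only zero byte is the last one produces 0xF000000000000000, and a chunk with no zero bytes produces 0, so a plain integer zero test is sufficient.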

Code size is equal for fmov/cb[n]z vs. fcmp/bne to branch on it. Replacing fcmp/csel or something costs 3 instructions like fmov/tst/csel, but that wouldn't be normal as part of strlen or memcmp or similar loops.


Semi-related:

  • How to exactly find the first matching zero in ARM using `shrn`, `fmov`, `rbit`, `clz`?
  • https://lemire.me/blog/2022/12/19/implementing-strlen-using-sve/ (SVE for wider vectors and fault-suppressing loads)
Answered by Peter Cordes on Nov 10 '25 at 22:11
