Consider this code:
extern "C" long _InterlockedCompareExchange(long volatile * _Destination, long _Exchange, long _Comparand);
#define MAGIC 1
// Unlike InterlockedIncrement, this function does not increment from 0 to 1; it returns false instead.
bool TryLock(long* pLock)
{
    long Value = *pLock, NewValue;
    for ( ; Value; Value = NewValue)
    {
        NewValue = _InterlockedCompareExchange(pLock, Value + 1, Value);
        if (
#if MAGIC
            NewValue == Value
#else
            Value == NewValue
#endif
        ) return true;
    }
    return false;
}
Does anything change if we set #define MAGIC 0? In theory it should not. But with the 64-bit CL.EXE compiler, changing NewValue == Value to Value == NewValue (a comparison of two plain long values) changes the generated code significantly!
I tried this with two versions of CL, the latest 19.00.24210.0 and 14.00.50727.762 (more than 10 years old, from December 2006), and got exactly the same code from both in all tests. I compiled with cl /c /FA /O1, so with /O1 optimization (the results are the same with /Oxs).
With MAGIC 1 (NewValue == Value):
TryLock PROC
mov eax, [pLock]
jmp @@
@@loop:
lea edx, [rax+1]
lock cmpxchg [pLock], edx
je @@exit
@@:
test eax, eax
jne @@loop
ret
@@exit:
mov al, 1
ret
TryLock ENDP
But with MAGIC 0 (Value == NewValue):
TryLock PROC
mov r8d, [pLock]
test r8d, r8d
je @@0
@@loop:
lea edx, [r8+1]
mov eax, r8d
lock cmpxchg [pLock], edx
cmp r8d, eax ; !!!!!!!!
je @@exit
test eax, eax
mov r8d, eax
jne @@loop
@@0:
xor al, al
ret
@@exit:
mov al, 1
ret
TryLock ENDP
The code becomes larger, but the most notable difference is the instruction
cmp Value, NewValue
after lock cmpxchg in the second variant. In fact, lock cmpxchg [p], NewValue itself sets or clears ZF, so the additional cmp Value, NewValue is redundant. We can omit it when writing assembly, but C/C++ gives us no way to branch on ZF directly.
There is no statement like ifzf { /* if ZF == 1 */ } else { /* if ZF == 0 */ }, so we have to write if (NewValue == Value) {} else {}
and, as a result, we would expect a cmp NewValue, Value in the generated assembly. But as I discovered, CL x64 (but not x86!) has been doing the following for 10+ years (I think in all versions).
This code
NewValue = _InterlockedCompareExchange(p, fn(OldValue), OldValue);
if (OldValue == NewValue) ...
is compiled to
mov eax, OldValue
lock cmpxchg [p], fn(OldValue)
mov NewValue, eax
cmp OldValue, eax ; !!!!
jne @@
....
But this code
NewValue = _InterlockedCompareExchange(p, fn(OldValue), OldValue);
if (NewValue == OldValue) ...
is compiled to
mov eax, OldValue
lock cmpxchg [p], fn(OldValue)
mov NewValue, eax
jne @@
...
So CL understands the semantics of cmpxchg and can perform this optimization, but only in some cases.
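One way to avoid depending on operand order at every call site is to keep the comparison in a single helper that always puts the returned value on the left. The sketch below is only an illustration: TryCompareExchange, CompareExchangeLong, TryLockViaHelper and HelperDemo are hypothetical names, the non-MSVC branch substitutes a GCC/Clang builtin purely so the code compiles elsewhere, and whether CL still drops the redundant cmp once the helper is inlined was not checked here.

```cpp
#include <cassert>

#if defined(_MSC_VER)
#include <intrin.h>
// MSVC: forward to the intrinsic discussed in the question.
static inline long CompareExchangeLong(long volatile* dest, long exchange, long comparand)
{
    return _InterlockedCompareExchange(dest, exchange, comparand);
}
#else
// Assumption: on GCC/Clang a builtin stands in for the MSVC intrinsic.
// Note the different argument order: (ptr, expected, desired).
static inline long CompareExchangeLong(long volatile* dest, long exchange, long comparand)
{
    return __sync_val_compare_and_swap(dest, comparand, exchange);
}
#endif

// Hypothetical helper: reports success as a bool and hands back the observed value.
static inline bool TryCompareExchange(long volatile* dest, long exchange,
                                      long comparand, long* observed)
{
    long NewValue = CompareExchangeLong(dest, exchange, comparand);
    *observed = NewValue;
    return NewValue == comparand;   // returned value on the left, as in the MAGIC 1 variant
}

// Same logic as TryLock from the question, written on top of the helper.
bool TryLockViaHelper(long* pLock)
{
    long Value = *pLock, NewValue;
    for ( ; Value; Value = NewValue)
    {
        if (TryCompareExchange(pLock, Value + 1, Value, &NewValue)) return true;
    }
    return false;
}

// Small single-threaded sanity check of the behaviour.
void HelperDemo()
{
    long zero = 0, one = 1;
    assert(!TryLockViaHelper(&zero) && zero == 0);  // never increments from 0
    assert(TryLockViaHelper(&one) && one == 2);     // 1 -> 2 succeeds
}
```

The only point of the helper is that the comparison is written once, so its operand order cannot drift between call sites.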
I tested this in several test functions and got the same result everywhere with both compilers (the very old CL and the new one):
extern "C" long _InterlockedCompareExchange(long volatile * _Destination, long _Exchange, long _Comparand);
typedef long (*FN)(long* pLock, long Value);
#define MAGIC 1
void TestZF1(long* pLock)
{
    long Value = *pLock, NewValue;
    do
    {
        Value++;
        NewValue = _InterlockedCompareExchange(pLock, Value ^ 1, Value);
    } while (
#if MAGIC
        NewValue != Value
#else
        Value != NewValue
#endif
    );
}
long TestZF2(long* pLock, FN fn1, FN fn2)
{
    long Value = *pLock, NewValue;
    NewValue = _InterlockedCompareExchange(pLock, Value ^ 1, Value);
    return (
#if MAGIC
        NewValue == Value
#else
        Value == NewValue
#endif
        ? fn1 : fn2) (pLock, NewValue);
}
And the generated assembly:
TestZF1 PROC
mov r8d, DWORD PTR [rcx]
@@loop:
add r8d, 1
mov edx, r8d
mov eax, r8d
xor edx, 1
lock cmpxchg [rcx], edx
IF !MAGIC
cmp r8d,eax ; ! in TestZF1 different exactly in this instruction
ENDIF
jne @@loop
ret 0
TestZF1 ENDP
IF MAGIC
TestZF2 PROC
mov r9d, [rcx]
mov eax, [rcx]
xor r9d, 1
lock cmpxchg [rcx], r9d
cmove r8, rdx
mov edx, eax
jmp r8
TestZF2 ENDP
ELSE
TestZF2 PROC
mov r10d, [rcx]
mov r9d, r10d
xor r9d, 1
mov eax, r10d
lock cmpxchg [rcx], r9d
cmp r10d, eax ; !!!!!!!!
cmove r8, rdx
mov edx, eax
jmp r8
TestZF2 ENDP
ENDIF
Several questions:
- Why does CL x64 optimize the case if (NewValue == Value) but not if (Value == NewValue)?
- Why does CL x86 not do this optimization? At least in all my tests, the cmp Value,NewValue instruction is present.
- Is it possible to write code in C/C++, without assembler, that achieves this on x86 with CL?
- Do other C/C++ compilers have this kind of optimization for _InterlockedCompareExchange[Pointer]?
- Why does CL x64 optimize the case if (NewValue == Value) but not if (Value == NewValue)?
- Is this a conscious, deliberate design, or is it accidental and unknown?
I'm convinced it is a bug, so I've reported it.
If they respond, we will know if this is a bug or not.
- Why does CL x86 not do this optimization? At least in all my tests, the cmp Value,NewValue instruction is present.
x86 code generation may not be optimized to the same level as x86-64, since it is of secondary importance. It could, though, be reported as another missed-optimization bug.
- Is it possible to write code in C/C++, without assembler, that achieves this on x86 with CL?
Apparently not. But clang-cl, which ships with Visual Studio 2019 and is supposed to emulate CL very closely, seems to do better. It is also affected by MAGIC, but with MAGIC enabled it produces better code on x86.
- Interesting: do other C/C++ compilers have this kind of optimization for _InterlockedCompareExchange[Pointer]?
Other compilers have separate __sync_bool_compare_and_swap and __sync_val_compare_and_swap builtins, and they implement this optimization for the bool version: https://godbolt.org/z/j97aEG5GY
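To make the difference between the two builtins concrete, here is a minimal sketch of the same try-lock written both ways. It assumes GCC or Clang (these builtins are not available in CL), and the names TryLockBool, TryLockVal and SyncDemo are made up for this illustration.

```cpp
#include <cassert>

// bool flavour: the builtin reports success directly, so no explicit
// comparison of values appears in the source. The current value has to
// be re-read on each retry because the builtin does not return it.
bool TryLockBool(long* pLock)
{
    long Value;
    while ((Value = *static_cast<long volatile*>(pLock)) != 0)
    {
        // Argument order is (ptr, expected, desired).
        if (__sync_bool_compare_and_swap(pLock, Value, Value + 1))
            return true;
    }
    return false;
}

// val flavour: the builtin returns the observed value, and the caller
// compares it with the expected one, as with _InterlockedCompareExchange.
bool TryLockVal(long* pLock)
{
    long Value = *pLock, NewValue;
    for ( ; Value; Value = NewValue)
    {
        NewValue = __sync_val_compare_and_swap(pLock, Value, Value + 1);
        if (NewValue == Value) return true;
    }
    return false;
}

// Small single-threaded sanity check of both variants.
void SyncDemo()
{
    long zero = 0, three = 3;
    assert(!TryLockBool(&zero) && zero == 0);
    assert(TryLockVal(&three) && three == 4);
}
```

With the bool flavour the compiler has no source-level comparison to recognize, which is presumably why the flag produced by cmpxchg can be used directly.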
Note that _InterlockedCompareExchange and __sync_bool_compare_and_swap are both non-standard, and there are C standard and C++ standard alternatives.
Each corresponding standard function returns a boolean directly and the observed value indirectly:
- atomic_compare_exchange_strong returns _Bool, and the observed value is returned through the pointer where you pass expected
- std::atomic<T>::compare_exchange_strong returns bool, and the observed value is returned by reference, overwriting expected
These standard alternatives are likely to have _InterlockedCompareExchange or __sync_bool_compare_and_swap under the hood, though.
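As a rough sketch of how the try-lock looks with the standard C++ interface: the function below takes a std::atomic<long> instead of a raw long*, which is a change from the original signature, and the names TryLockStd and StdDemo are invented for the example.

```cpp
#include <atomic>
#include <cassert>

// Increments the counter unless it is 0; returns false when it is 0.
bool TryLockStd(std::atomic<long>& lock)
{
    long Value = lock.load();
    while (Value != 0)
    {
        // Returns true on success. On failure it stores the value it
        // observed into Value, so no separate reload or compare is needed.
        if (lock.compare_exchange_strong(Value, Value + 1))
            return true;
    }
    return false;
}

// Small single-threaded sanity check.
void StdDemo()
{
    std::atomic<long> zero{0}, one{1};
    assert(!TryLockStd(zero) && zero.load() == 0);
    assert(TryLockStd(one) && one.load() == 2);
}
```

Because success comes back as a bool, the source contains no NewValue == Value comparison whose operand order could matter.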
Regarding optimizations:
<stdatomic.h> C header yet