_InterlockedCompareExchange optimization

Question

look at this code

extern "C" long _InterlockedCompareExchange(long volatile * _Destination, long _Exchange, long _Comparand);

#define MAGIC 1

// Unlike InterlockedIncrement this function not increment from 0 to 1, but return FALSE

bool TryLock(long* pLock)
{
    long Value = *pLock, NewValue;

    for ( ; Value; Value = NewValue)
    {
        NewValue = _InterlockedCompareExchange(pLock, Value + 1, Value);

        if (
#if MAGIC
            NewValue == Value
#else
            Value == NewValue
#endif
            ) return true;
    }

    return false;
}

are if set #define MAGIC 0 something changed ? by idea must not be. but if using CL.EXE 64bit compiler if we change NewValue == Value to Value == NewValue (simply long values) - generated code serious changed!

I try this with two version of CL - latest 19.00.24210.0 and with 14.00.50727.762 (more than 10 years old - December 2006) I got absolute equal code from both in all tests. compile with flags cl /c /FA /O1 - so /O1 optimization (same results with /Oxs )

with MAGIC 1 (NewValue == Value)

TryLock PROC
    mov eax, [pLock]
    jmp @@
@@loop:
    lea edx, [rax+1]
    lock cmpxchg [pLock], edx
    je  @@exit
@@:
    test    eax, eax
    jne @@loop
    ret
@@exit:
    mov al, 1
    ret
TryLock ENDP

but with MAGIC 0 (Value == NewValue)

TryLock PROC
    mov r8d, [pLock]
    test    r8d, r8d
    je  @@0
@@loop:
    lea edx, [r8+1]
    mov eax, r8d
    lock cmpxchg [pLock], edx
    cmp r8d, eax        ; !!!!!!!!
    je  @@exit
    test    eax, eax
    mov r8d, eax
    jne @@loop
@@0:
    xor al, al
    ret
@@exit:
    mov al, 1
    ret
TryLock ENDP

code become large, but main notable difference in instruction

cmp Value, NewValue

after lock cmpxchg in second variant. really lock cmpxchg [p], NewValue yourself set or reset ZF flag and additional cmp Value, NewValue become excess. we can omit it if we write in assembly, but on c/c++ we have no way use ZF for condition branch. no statements like ifzf { /* if ZF == 1 */ } else { /* if ZF == 0 */ } as result we need write if (NewValue == Value) {} else {} and as result must be cmp NewValue, Value in generated assembly. but how i discovered for CL x64 (but not for x86 !) already 10+ years (think all versions) do next

this code

NewValue = _InterlockedCompareExchange(p, fn(OldValue), OldValue);
if (OldValue == NewValue) ...

converted to

mov eax, OldValue
lock cmpxchg [p], fn(OldValue)
mov NewValue, eax
cmp OldValue, eax ; !!!!
jne @@
....

but this code

NewValue = _InterlockedCompareExchange(p, fn(OldValue), OldValue);
if (NewValue == OldValue) ...

converted to

mov eax, OldValue
lock cmpxchg [p], fn(OldValue)
mov NewValue, eax
jne @@
...

so CL understand cmpxchg semantic and can do optimization, but only in some case.

i test this feature in several test functions and everywhere got the same result for both (very old and new CL )

extern "C" long _InterlockedCompareExchange(long volatile * _Destination, long _Exchange, long _Comparand);

typedef long (*FN)(long* pLock, long Value);

#define MAGIC 1

void TestZF1(long* pLock)
{
    long Value = *pLock, NewValue;

    do 
    {
        Value++;
        NewValue = _InterlockedCompareExchange(pLock, Value ^ 1, Value);
    } while (
#if MAGIC
        NewValue != Value
#else
        Value != NewValue
#endif
        );
}

long TestZF2(long* pLock, FN fn1, FN fn2)
{
    long Value = *pLock, NewValue;

    NewValue = _InterlockedCompareExchange(pLock, Value ^ 1, Value);

    return (
#if MAGIC
        NewValue == Value
#else
        Value == NewValue
#endif
        ? fn1 : fn2) (pLock, NewValue);
}

and generated assembly:

TestZF1 PROC
    mov r8d, DWORD PTR [rcx]
@@loop:
    add r8d, 1
    mov edx, r8d
    mov eax, r8d
    xor edx, 1
    lock cmpxchg [rcx], edx
IF !MAGIC
    cmp r8d,eax     ; ! in TestZF1 different exactly in this instruction
ENDIF
    jne @@loop
    ret 0
TestZF1 ENDP

IF MAGIC

TestZF2 PROC
    mov r9d, [rcx]
    mov eax, [rcx]
    xor r9d, 1
    lock cmpxchg [rcx], r9d
    cmove   r8, rdx
    mov edx, eax
    jmp r8
TestZF2 ENDP

ELSE

TestZF2 PROC
    mov r10d, [rcx]
    mov r9d, r10d
    xor r9d, 1
    mov eax, r10d
    lock cmpxchg [rcx], r9d
    cmp r10d, eax   ; !!!!!!!!
    cmove   r8, rdx     
    mov edx, eax
    jmp r8
TestZF2 ENDP

ENDIF

several questions:

why CL x64 optimize case if (NewValue == Value) but not optimize if (Value == NewValue) ?
this is consciously, specially designed, or it was suddenly and unknown ?
why CL x86 not do this optimization ? how minimum in all my tests cmp Value,NewValue instruction exist
are possible write code on c/c++ ,without assembler, for implement this on x86 with CL ?
interesting - are another c/c++ compilers have this kind of optimization for _InterlockedCompareExchange[Pointer] ?

Alex Guteniev · Accepted Answer

why CL x64 optimize case if (NewValue == Value) but not optimize if (Value == NewValue) ?

this is consciously, specially designed, or it was suddenly and unknown ?

I'm convinced it is a bug, so I've reported it.

If they respond, we will know if this is a bug or not.

why CL x86 not do this optimization ? how minimum in all my tests cmp Value,NewValue instruction exist

x86 performance may be not optimized to the same level as x86-64, as it is of secondary importance. Though may be reported as another missed optimization bug.

are possible write code on c/c++ ,without assembler, for implement this on x86 with CL ?

Apparently not. But clang-cl that ships with Visual Studio 2019 and supposed to emulate CL very closely seem to do better. It is also affected by MAGIC, but when it is enabled, it produces better code on x86.

interesting - are another c/c++ compilers have this kind of optimization for _InterlockedCompareExchange[Pointer] ?

Other compilers have separate __sync_bool_compare_and_swap and __sync_val_compare_and_swap, and they implement this optimization for bool version https://godbolt.org/z/j97aEG5GY

Note that _InterlockedCompareExchange as well as __sync_bool_compare_and_swap are non-standard, and there are C standard and C++ standard alternatives.

The corresponding standard function returns both boolean directly, and observed value indirectly:

In C, it is atomic_compare_exchange_strong, returns _Bool, and observed value is returned by pointer where you pass expected
In C++ it is std::atomic<T>::compare_exchange_strong, returns bool, observed value is returned by reference instead of expected

These standard alternatives are likely to have _InterlockedCompareExchange or __sync_bool_compare_and_swap under the hood though.

Regarding optimizations:

GCC and clang optimize this in C: https://godbolt.org/z/hsnMzE689. Unfortunately MSVC does not have <stdatomic.h> C header yet
GCC and clang and MSVC optimize this in C++: https://godbolt.org/z/WfjaajE9W

_InterlockedCompareExchange optimization

Tags:

c

optimization

x86

assembly

x86-64

intel

RbMm

1 Answers

Alex Guteniev

Recent Activity

Donate For Us

_InterlockedCompareExchange optimization

Tags:

c

optimization

x86

assembly

x86-64

intel

RbMm

1 Answers

Alex Guteniev

Related questions

Recent Activity

Donate For Us