I was writing some code to clear the screen to a particular color. C++ code:
void clear_screen(unsigned int color, void *memory, int height, int width) {
    unsigned int *pixel = (unsigned int *)memory;
    for (auto y = 0; y < height; y++)
        for (auto x = 0; x < width; x++)
            *pixel++ = color;
}
I used g++ and objconv to generate the corresponding assembly. This is what I got, and I've commented what I think some of the lines do too.
renderer_clear_screen:
push r13
push r12
push rbp
push rdi
push rsi
push rbx
mov r11d, ecx ; move the color into r11d
mov ebx, r8d ; move the height into ebx
mov rcx, rdx ; 000E _ 48: 89. D1
test r8d, r8d ;
jle _cls_return ; basically, return if width or height is 0
test r9d, r9d ; ( window minimized )
jle _cls_return ;
mov r8d, r9d ; height = width
mov esi, r9d ; esi = width
mov edi, r9d ; edi = width
xor r10d, r10d ; r10d = 0
shr esi, 2 ; esi = width / 4 (number of whole 4-pixel vectors)
movd xmm1, r11d ; move the lower 32-bits of the color into xmm1
lea r12d, [r9-1] ; r12d = width - 1
shl rsi, 4 ; 003F _ 48: C1. E6, 04
mov ebp, r8d ; 0043 _ 44: 89. C5
shl rdi, 2 ; 0046 _ 48: C1. E7, 02
pshufd xmm0, xmm1, 0 ; 004A _ 66: 0F 70. C1, 00
shl rbp, 2 ; 004F _ 48: C1. E5, 02
ALIGN 8
?_001: cmp r12d, 2
jbe ?_006 ; if (width - 1 <= 2) { ?_006 }
mov rax, rcx ; 005E _ 48: 89. C8
lea rdx, [rcx+rsi] ; 0061 _ 48: 8D. 14 31
ALIGN 8
?_002: movups oword [rax], xmm0 ; 0068 _ 0F 11. 00
add rax, 16 ; 006B _ 48: 83. C0, 10
cmp rdx, rax ; 006F _ 48: 39. C2
jnz ?_002 ; 0072 _ 75, F4
lea rdx, [rcx+rbp] ; 0074 _ 48: 8D. 14 29
mov eax, r8d ; 0078 _ 44: 89. C0
cmp r9d, r8d ; 007B _ 45: 39. C1
jz ?_004 ; 007E _ 74, 1C
?_003: lea r13d, [rax+1H] ; 0080 _ 44: 8D. 68, 01
mov dword [rdx], r11d ; 0084 _ 44: 89. 1A
cmp r13d, r9d ; 0087 _ 45: 39. CD
jge ?_004 ; 008A _ 7D, 10
add eax, 2 ; 008C _ 83. C0, 02
mov dword [rdx+4H], r11d ; 008F _ 44: 89. 5A, 04
cmp r9d, eax ; 0093 _ 41: 39. C1
jle ?_004 ; 0096 _ 7E, 04
mov dword [rdx+8H], r11d ; 0098 _ 44: 89. 5A, 08
?_004: add r10d, 1 ; 009C _ 41: 83. C2, 01
add rcx, rdi ; 00A0 _ 48: 01. F9
cmp ebx, r10d ; 00A3 _ 44: 39. D3
jnz ?_001 ; 00A6 _ 75, B0
_cls_return:
pop rbx ;
pop rsi ;
pop rdi ;
pop rbp ;
pop r12 ;
pop r13 ; pop all the saved registers
ret ;
?_006: ; Local function
mov rdx, rcx ; 00B1 _ 48: 89. CA
xor eax, eax ; 00B4 _ 31. C0
jmp ?_003 ; 00B6 _ EB, C8
Now, in ?_001, the compiler compares width - 1 to 2, which is the same thing as comparing the width to 3. My question is: with -O3, why did the compiler choose two instead of three, and waste a lea (to move width - 1 into r12d)?
The only thing which makes sense to me is that powers of two are somehow faster to compare. Or maybe it's a compiler quirk?
The usual reason for GCC tweaking compare constants is to create smaller immediates, which makes the constant more likely to fit in whatever immediate width the instruction encoding offers. See "Understanding gcc output for if (a>=3)" and "GCC seems to prefer small immediate values in comparisons. Is there a way to avoid that?". (GCC always does this, instead of checking whether it's actually useful with this constant on the target ISA.) This heuristic works well for most ISAs, but sometimes not for AArch64 or ARM Thumb, which can encode some immediates as a bit-range / bit-pattern, so a smaller-magnitude number is not always better.
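As a minimal illustration of that constant tweaking (the function names are hypothetical, and the exact instructions GCC emits depend on version and flags): the two predicates below agree for every int, so the compiler is free to pick whichever constant it prefers for the cmp.

```cpp
#include <cassert>
#include <climits>

// Hypothetical helpers, written only to show that the two forms are equivalent.
constexpr bool at_least_three(int a)   { return a >= 3; }  // as written in the source
constexpr bool greater_than_two(int a) { return a > 2; }   // same test, smaller immediate

// The predicates agree even at the extremes of int.
static_assert(at_least_three(INT_MIN) == greater_than_two(INT_MIN), "INT_MIN");
static_assert(at_least_three(INT_MAX) == greater_than_two(INT_MAX), "INT_MAX");
```

Because the rewrite never changes the result, it is purely an encoding choice, not a speed difference in the compare itself.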
The width-1 is not part of that. The -1 is part of a range check to skip the auto-vectorized loop (16 bytes at a time with movups) and go straight to the cleanup, 1..3 scalar stores.
It seems to be checking width >= 1 && width <= 3, i.e. cleanup needed but total size is less than a full vector width. It's not equivalent to signed or unsigned width <= 3 for width=0. Note the unsigned compare (jbe): 0 - 1 is above 2U, because -1U is UINT_MAX.
But it already excluded width <= 0 with test r9d, r9d / jle _cls_return, so it would have been better for GCC to just check width <= 3U instead of doing extra work to exclude zero from the range-check. (An lea, and save/restore of R12 which isn't otherwise used!)
(The cleanup also looks over-complicated, with some weird branching around for various cases; it could e.g. use movq [rdx], xmm0 if more than 1 uint is needed. And even better, if the total size is >= 4 uints, just do another movups that ends at the end of the range, possibly overlapping with previous stores.)
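A minimal sketch of the overlapping-final-store idea, written by hand rather than taken from any compiler output. The function name fill_row is hypothetical, and each 16-byte memcpy stands in for one movups (compilers typically turn a fixed-size 16-byte memcpy into a single unaligned vector store, but that is an assumption about code generation, not a guarantee).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: store `color` into `count` consecutive 32-bit pixels.
void fill_row(std::uint32_t *dst, std::size_t count, std::uint32_t color) {
    const std::uint32_t vec[4] = {color, color, color, color};

    if (count < 4) {
        // Fewer pixels than one vector: 0..3 plain scalar stores.
        for (std::size_t i = 0; i < count; ++i) dst[i] = color;
        return;
    }

    // Main loop: whole 16-byte blocks, like the movups loop at ?_002.
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4)
        std::memcpy(dst + i, vec, sizeof vec);

    // Tail: one more 16-byte store that ends exactly at the end of the row.
    // It may overlap pixels already written, which is harmless because
    // every pixel receives the same value.
    if (i < count)
        std::memcpy(dst + count - 4, vec, sizeof vec);
}
```

This replaces the chain of compare-and-branch scalar stores with at most one extra store whenever the row holds at least 4 pixels.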
Yes, this is a missed optimization; you can report it at https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc (now that you know it's a missed optimization; it's good that you asked here first instead of filing a bug without first figuring out whether the instruction could be avoided).
"The only thing which makes sense to me is that powers of two are somehow faster to compare."
No, it's not faster; cmp performance is not data-dependent at all. (No integer instructions are, except sometimes [i]div. And on AMD CPUs before Zen3, pext / pdep. But anyway, not simple integer add/compare/shift stuff. See https://uops.info/.)
And BTW, we can reproduce your GCC asm output on Godbolt by telling it this function is __attribute__((ms_abi)), or there's a command-line option to set the calling convention default. (It's really only useful for looking at the asm; it's still using GNU/Linux headers and x86-64 System V type widths like 64-bit long. Only a proper MinGW (cross-)compiler could show you what GCC would really do when targeting Windows.)
Godbolt's Intel-syntax output is GAS .intel_syntax noprefix, which is MASM-like, not NASM, but the difference would only be obvious with addressing modes involving global variables.
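For reference, a sketch of how the question's function could be annotated for that Godbolt experiment. The MS_ABI macro is a hypothetical convenience introduced here; it assumes GCC or Clang targeting x86-64 and expands to nothing elsewhere, so the snippet still compiles on other targets (without changing the calling convention there).

```cpp
// Assumption: __attribute__((ms_abi)) is only meaningful with GCC/Clang on x86-64.
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define MS_ABI __attribute__((ms_abi))  // hypothetical macro name
#else
#define MS_ABI
#endif

// Same loop as in the question. With the attribute, arguments arrive in
// ECX, RDX, R8D, R9D (Windows x64 convention), matching the asm shown there.
MS_ABI void clear_screen(unsigned int color, void *memory, int height, int width) {
    unsigned int *pixel = static_cast<unsigned int *>(memory);
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            *pixel++ = color;
}
```

Compiling this with -O3 on a Linux-hosted GCC should give a register assignment comparable to the question's listing, though the exact instructions can differ between GCC versions.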