I wanted to run some code through the IACA analyzer to see how many uops it uses, and I started with a simple function to check that it was working.
Unfortunately, when I insert the macros IACA says to use, the resulting assembly is very different, so any analysis of it is not helpful.
Here is the assembly produced without IACA:
00007FF9CD590580  vaddps      ymm1,ymm5,ymmword ptr [rax]  
00007FF9CD590584  vaddps      ymm2,ymm6,ymmword ptr [rax+20h]  
00007FF9CD590589  vaddps      ymm3,ymm7,ymmword ptr [rax+40h]  
00007FF9CD59058E  vmulps      ymm4,ymm1,ymm1  
00007FF9CD590592  vfmadd231ps ymm4,ymm2,ymm2  
00007FF9CD590597  vfmadd231ps ymm4,ymm3,ymm3  
00007FF9CD59059C  vcmpgt_oqps ymm1,ymm4,ymm9  
00007FF9CD5905A2  vrsqrtps    ymm0,ymm4  
00007FF9CD5905A6  vandps      ymm2,ymm1,ymm0  
00007FF9CD5905AA  vmovups     ymm3,ymm8  
00007FF9CD5905AF  vfmsub231ps ymm3,ymm2,ymm4  
00007FF9CD5905B4  vmovups     ymmword ptr [r9+rax],ymm3  
00007FF9CD5905BA  add         rax,rcx  
00007FF9CD5905BD  sub         r8d,1  
00007FF9CD5905C1  jne         fm::EvlOp::applyLoop<`RegisterShapeOps<fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> > >'::`2'::doDISTANCE_SPHERE_11,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::DataWrapper,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::RegisterBlock,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::instruction_input>+0B0h (07FF9CD590580h)  
And here is what it produces once I add the IACA macros. (I'm testing an MSVC-produced binary, so I'm using IACA_VC64_START and IACA_VC64_END, as the manual says to.)
00007FF9CD59058B  vmovups     ymm2,ymmword ptr [rax+40h]  
00007FF9CD590590  vmovups     ymm0,ymmword ptr [rax]  
00007FF9CD590594  vmovups     ymm1,ymmword ptr [rax+20h]  
00007FF9CD590599  vaddps      ymm3,ymm2,ymm8  
00007FF9CD59059E  vmovups     ymmword ptr [rbp+20h],ymm0  
00007FF9CD5905A3  vaddps      ymm0,ymm0,ymm6  
00007FF9CD5905A7  vmovups     ymmword ptr [rbp+40h],ymm1  
00007FF9CD5905AC  vmulps      ymm4,ymm0,ymm0  
00007FF9CD5905B0  vaddps      ymm1,ymm1,ymm7  
00007FF9CD5905B4  vfmadd231ps ymm4,ymm1,ymm1  
00007FF9CD5905B9  vfmadd231ps ymm4,ymm3,ymm3  
00007FF9CD5905BE  vcmpgt_oqps ymm1,ymm4,ymm5  
00007FF9CD5905C3  vrsqrtps    ymm0,ymm4  
00007FF9CD5905C7  vmovups     ymmword ptr [rbp+60h],ymm2  
00007FF9CD5905CC  vandps      ymm2,ymm1,ymm0  
00007FF9CD5905D0  vmovups     ymm3,ymm9  
00007FF9CD5905D5  vfmsub231ps ymm3,ymm2,ymm4  
00007FF9CD5905DA  vmovups     ymmword ptr [rcx+rax],ymm3  
00007FF9CD5905DF  add         rax,rdx  
00007FF9CD5905E2  mov         qword ptr [rbp+18h],rax  
00007FF9CD5905E6  vmovups     ymmword ptr [rbp+80h],ymm3  
00007FF9CD5905EE  sub         r8d,1  
00007FF9CD5905F2  jne         fm::EvlOp::applyLoop<`RegisterShapeOps<fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> > >'::`2'::doDISTANCE_SPHERE_11,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::DataWrapper,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::RegisterBlock,fm::interpeter<fm::interpreter_settings<math::v8float,4,float,fm::Instruction,math::v8f2d,math::v8float> >::instruction_input>+0B2h (07FF9CD590582h)  
So it has inserted lots of moves, and my (hopefully) fused add is no longer fused.
I was hoping it would be able to tell me whether
00007FF9CD590584  vaddps      ymm2,ymm6,ymmword ptr [rax+20h] 
stayed fused, but it removed this code altogether.
Is this a known issue, or perhaps because I'm using MSVC which may not be very common?
Is there perhaps a way to fix this, or a better tool that is compatible with MSVC?
The IACA marker macros are just inline asm (or, for 64-bit MSVC, intrinsics: the start marker is __writegsbyte(111, 111); and the end marker is the same with 222). They can potentially disturb the optimizer, or end up in the wrong place (e.g. not the last instruction before falling into a loop, so the block includes some loop setup).
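To make that concrete, here is a minimal sketch of source-level markers behind a build switch. MARK_START, MARK_END, the IACA_MARKS flag and sum_of_squares are names made up for this illustration; only the __writegsbyte calls come from the description above. The marker placement (start at the top of the loop body, end right after the loop) is also an assumption, so check it against the IACA user guide for your version.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical wrapper macros (not part of IACA). They expand to the
// 64-bit MSVC marker intrinsics only when IACA_MARKS is defined;
// everywhere else they expand to nothing.
#if defined(IACA_MARKS) && defined(_MSC_VER) && defined(_M_X64)
  #include <intrin.h>                         // declares __writegsbyte
  #define MARK_START __writegsbyte(111, 111);
  #define MARK_END   __writegsbyte(222, 222);
#else
  #define MARK_START
  #define MARK_END
#endif

// Hypothetical kernel standing in for the real loop under analysis.
float sum_of_squares(const float* v, std::size_t n) {
    assert(v != nullptr || n == 0);
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        MARK_START                            // assumed placement: top of the loop body
        acc += v[i] * v[i];
    }
    MARK_END                                  // assumed placement: right after the loop
    return acc;
}
```

A build with IACA_MARKS defined is meant to be handed to the analyzer, and a build without it should be the one you compare against. Because the intrinsic is a real store that the compiler has to keep, it may change register allocation or scheduling around the marked region.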
If that happens, like in your case, your best bet is to ask the compiler to produce asm (not machine code) output, and manually insert the markers into the asm you want to analyze.
In NASM syntax, I use this %ifdef / %else block so I can build with or without nasm -DIACA_MARKS.  I know this isn't the right syntax for MASM, but the IACA start/end markers are pretty simple: a mov of 111 or 222 into EBX, followed by fs addr32 nop (a NOP with FS and address-size prefixes).
%ifdef IACA_MARKS
%macro  IACA_start 0             ; NASM macro with 0 args, defines IACA_start
     mov ebx, 111
     db 0x64, 0x67, 0x90
%endmacro
%macro  IACA_end 0
     mov ebx, 222
     db 0x64, 0x67, 0x90
%endmacro
%else
%define IACA_start
%define IACA_end
%endif
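Since the markers are fixed byte sequences, one way to double-check that hand-inserted markers landed where you intended is to search the assembled code bytes for them. The following is only a sketch under stated assumptions: find_marks is a hypothetical helper, the input is assumed to be the raw bytes of the code section already loaded into memory, and the patterns assume 32- or 64-bit code, where mov ebx, imm32 encodes as BB followed by a 4-byte immediate. A match could in principle also occur inside unrelated data, so treat hits as hints.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Byte patterns for the markers emitted by the NASM macros above:
// mov ebx, 111 / mov ebx, 222 (BB imm32) followed by db 0x64, 0x67, 0x90.
const std::vector<std::uint8_t> kStartMark = {0xBB, 0x6F, 0x00, 0x00, 0x00, 0x64, 0x67, 0x90};
const std::vector<std::uint8_t> kEndMark   = {0xBB, 0xDE, 0x00, 0x00, 0x00, 0x64, 0x67, 0x90};

// Hypothetical helper: returns every offset in `code` at which `mark` occurs.
std::vector<std::size_t> find_marks(const std::vector<std::uint8_t>& code,
                                    const std::vector<std::uint8_t>& mark) {
    std::vector<std::size_t> offsets;
    auto it = code.begin();
    while ((it = std::search(it, code.end(), mark.begin(), mark.end())) != code.end()) {
        offsets.push_back(static_cast<std::size_t>(it - code.begin()));
        ++it;  // resume the search one byte past the start of this match
    }
    return offsets;
}
```

The bytes between the end of the start marker and the beginning of the end marker are the block the analyzer would look at, so comparing those offsets against a disassembly shows whether any loop setup ended up inside the marked region.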