I need to convert an array of bytes to an array of floats. I receive the bytes over a network connection and then need to parse them into floats; the size of the array is not pre-defined. This is the code I have so far, using unions. Do you have any suggestions on how to make it run faster?
int offset = DATA_OFFSET - 1;
UStuff bb;
// Convert every 4 bytes to a float using a union
for (int i = 0; i < NUM_OF_POINTS; i++) {
    // Going backwards - due to endianness
    for (int j = offset + BYTE_FLOAT*i + BYTE_FLOAT; j > offset + BYTE_FLOAT*i; --j) {
        bb.c[(offset + BYTE_FLOAT*i + BYTE_FLOAT) - j] = sample[j];
    }
    res.append(bb.f);
}
return res;
This is the union I use:
union UStuff {
    float f;
    unsigned char c[4];
};
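For reference, the same conversion can be written without a union: assemble the four bytes into a uint32_t with shifts and copy the bits into a float with memcpy, which is well-defined C++ and compiles to the same instructions under optimization. A minimal sketch, assuming the incoming stream is big-endian (network order) and 4-byte IEEE-754 floats; the name BytesToFloats and the std::vector return type are illustrative, not from the original code:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Convert a big-endian byte stream into host floats.
// Each group of 4 bytes becomes one float.
std::vector<float> BytesToFloats(const unsigned char* bytes, std::size_t count) {
  std::vector<float> out;
  out.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    const unsigned char* p = bytes + 4 * i;
    // Reassemble with shifts, reversing the byte order (big-endian input).
    uint32_t bits = (uint32_t)p[3] | ((uint32_t)p[2] << 8) |
                    ((uint32_t)p[1] << 16) | ((uint32_t)p[0] << 24);
    float f;
    std::memcpy(&f, &bits, sizeof f);  // well-defined type pun
    out.push_back(f);
  }
  return out;
}
```

On optimizing compilers the memcpy disappears entirely; the bytes are assembled in an integer register and stored once.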
#include <cstdint>

#define NOT_STUPID 1
#define ENDIAN NOT_STUPID

namespace _ {

inline uint32_t UI4Set(char byte_0, char byte_1, char byte_2, char byte_3) {
#if ENDIAN == NOT_STUPID
  return byte_0 | ((uint32_t)byte_1) << 8 | ((uint32_t)byte_2) << 16 |
         ((uint32_t)byte_3) << 24;
#else
  return byte_3 | ((uint32_t)byte_2) << 8 | ((uint32_t)byte_1) << 16 |
         ((uint32_t)byte_0) << 24;
#endif
}

inline float FLTSet(char byte_0, char byte_1, char byte_2, char byte_3) {
  uint32_t flt = UI4Set(byte_0, byte_1, byte_2, byte_3);
  return *reinterpret_cast<float*>(&flt);
}

/* Use this function to write directly to RAM and avoid the xmm
   registers. */
inline uint32_t FLTSet(char byte_0, char byte_1, char byte_2, char byte_3,
                       float* destination) {
  uint32_t value = UI4Set(byte_0, byte_1, byte_2, byte_3);
  *reinterpret_cast<uint32_t*>(destination) = value;
  return value;
}

}  //< namespace _

using namespace _;  //< @see Kabuki Toolkit

static float flt = FLTSet(0, 1, 2, 3);

int main() {
  uint32_t flt_init = FLTSet(4, 5, 6, 7, &flt);
  return 0;
}
//< This uses 4 extra bytes and doesn't use the xmm register
It is GENERALLY NOT recommended to use a union to convert between floating-point and integer types, because unions to this day do not always generate optimal assembly code, and the other techniques are more explicit and may require less typing. Rather than repeat my opinion or other StackOverflow posts on unions, we shall prove it with disassembly from a modern compiler: Visual C++ 2017 (MSVC 19.12).
The first thing we must know about optimizing floating-point algorithms is how the registers work. The core of the CPU is exclusively an integer processing unit, with coprocessors (i.e. extensions) for handling floating-point numbers. These load-store machines (LSM) can only work with integers, and they must use a separate set of registers for interacting with the floating-point coprocessors. On x86_64 these are the xmm registers, which are 128 bits wide and can process Single Instruction Multiple Data (SIMD). In C++, one way to load and store a floating-point register is:
int Foo(double foo) { return foo + *reinterpret_cast<double*>(&foo); }

int main() {
  double foo = 1.0;
  uint64_t bar = *reinterpret_cast<uint64_t*>(&foo);
  return Foo(bar);
}
Now let's inspect the disassembly with Visual C++ /O2 optimizations enabled, because without them you will get a bunch of debug stack-frame variables. I had to add the function Foo to the example to keep the code from being optimized away.
double foo = 1.0;
uint64_t bar = *reinterpret_cast<uint64_t*>(&foo);
00007FF7482E16A0 mov rax,3FF0000000000000h
00007FF7482E16AA xorps xmm0,xmm0
00007FF7482E16AD cvtsi2sd xmm0,rax
return Foo(bar);
00007FF7482E16B2 addsd xmm0,xmm0
00007FF7482E16B6 cvttsd2si eax,xmm0
}
00007FF7482E16BA ret
And as described, we can see that the LSM first moves the double value into an integer register, then zeroes out the xmm0 register with an xorps instruction (the register is 128 bits wide and we're only loading a 64-bit integer), then loads the contents of the integer register into the floating-point register with the cvtsi2sd instruction. This is finally followed by the cvttsd2si instruction, which moves the value back from the xmm0 register into the return register before returning.
So now let's address the concern about generating optimal assembly code, using this test program and Visual C++ 2017:
#include "stdafx.h"
#include <cstdint>
#include <cstdio>
static float foo = 0.0f;
void SetFooUnion(char byte_0, char byte_1, char byte_2, char byte_3) {
union {
float flt;
char bytes[4];
} u = {foo};
u.bytes[0] = byte_0;
u.bytes[1] = byte_1;
u.bytes[2] = byte_2;
u.bytes[3] = byte_3;
foo = u.flt;
}
void SetFooManually(char byte_0, char byte_1, char byte_2, char byte_3) {
uint32_t faster_method = byte_0 | ((uint32_t)byte_1) << 8 |
((uint32_t)byte_2) << 16 | ((uint32_t)byte_3) << 24;
*reinterpret_cast<uint32_t*>(&foo) = faster_method;
}
namespace _ {
inline uint32_t UI4Set(char byte_0, char byte_1, char byte_2, char byte_3) {
return byte_0 | ((uint32_t)byte_1) << 8 | ((uint32_t)byte_2) << 16 |
((uint32_t)byte_3) << 24;
}
inline float FLTSet(char byte_0, char byte_1, char byte_2, char byte_3) {
uint32_t flt = UI4Set(byte_0, byte_1, byte_2, byte_3);
return *reinterpret_cast<float*>(&flt);
}
inline void FLTSet(char byte_0, char byte_1, char byte_2, char byte_3,
float* destination) {
uint32_t value = byte_0 | ((uint32_t)byte_1) << 8 | ((uint32_t)byte_2) << 16 |
((uint32_t)byte_3) << 24;
*reinterpret_cast<uint32_t*>(destination) = value;
}
} // namespace _
int main() {
SetFooUnion(0, 1, 2, 3);
union {
float flt;
char bytes[4];
} u = {foo};
// Start union read tests
putchar(u.bytes[0]);
putchar(u.bytes[1]);
putchar(u.bytes[2]);
putchar(u.bytes[3]);
// Start union write tests
u.bytes[0] = 4;
u.bytes[2] = 5;
foo = u.flt;
// Start hand-coded tests
SetFooManually(6, 7, 8, 9);
uint32_t bar = *reinterpret_cast<uint32_t*>(&foo);
putchar((char)(bar));
putchar((char)(bar >> 8));
putchar((char)(bar >> 16));
putchar((char)(bar >> 24));
_::FLTSet (0, 1, 2, 3, &foo);
return 0;
}
Now, after inspecting the O2-optimized disassembly, we have proof that the compiler DOES NOT produce optimal code:
int main() {
00007FF6DB4A1000 sub rsp,28h
SetFooUnion(0, 1, 2, 3);
00007FF6DB4A1004 mov dword ptr [rsp+30h],3020100h
00007FF6DB4A100C movss xmm0,dword ptr [rsp+30h]
union {
float flt;
char bytes[4];
} u = {foo};
00007FF6DB4A1012 movss dword ptr [rsp+30h],xmm0
// Start union read tests
putchar(u.bytes[0]);
00007FF6DB4A1018 movsx ecx,byte ptr [u]
SetFooUnion(0, 1, 2, 3);
00007FF6DB4A101D movss dword ptr [foo (07FF6DB4A3628h)],xmm0
// Start union read tests
putchar(u.bytes[0]);
00007FF6DB4A1025 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar(u.bytes[1]);
00007FF6DB4A102B movsx ecx,byte ptr [rsp+31h]
00007FF6DB4A1030 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar(u.bytes[2]);
00007FF6DB4A1036 movsx ecx,byte ptr [rsp+32h]
00007FF6DB4A103B call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar(u.bytes[3]);
00007FF6DB4A1041 movsx ecx,byte ptr [rsp+33h]
00007FF6DB4A1046 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
uint32_t bar = *reinterpret_cast<uint32_t*>(&foo);
putchar((char)(bar));
00007FF6DB4A104C mov ecx,6
// Start union write tests
u.bytes[0] = 4;
u.bytes[2] = 5;
foo = u.flt;
// Start hand-coded tests
SetFooManually(6, 7, 8, 9);
00007FF6DB4A1051 mov dword ptr [foo (07FF6DB4A3628h)],9080706h
uint32_t bar = *reinterpret_cast<uint32_t*>(&foo);
putchar((char)(bar));
00007FF6DB4A105B call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar((char)(bar >> 8));
00007FF6DB4A1061 mov ecx,7
00007FF6DB4A1066 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar((char)(bar >> 16));
00007FF6DB4A106C mov ecx,8
00007FF6DB4A1071 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar((char)(bar >> 24));
00007FF6DB4A1077 mov ecx,9
00007FF6DB4A107C call qword ptr [__imp_putchar (07FF6DB4A2160h)]
return 0;
00007FF6DB4A1082 xor eax,eax
_::FLTSet(0, 1, 2, 3, &foo);
00007FF6DB4A1084 mov dword ptr [foo (07FF6DB4A3628h)],3020100h
}
00007FF6DB4A108E add rsp,28h
00007FF6DB4A1092 ret
Here is the raw assembly listing of main, since the inlined functions do not show up in the debugger's disassembly view:
; Listing generated by Microsoft (R) Optimizing Compiler Version 19.12.25831.0
include listing.inc
INCLUDELIB OLDNAMES
EXTRN __imp_putchar:PROC
EXTRN __security_check_cookie:PROC
?foo@@3MA DD 01H DUP (?) ; foo
_BSS ENDS
PUBLIC main
PUBLIC ?SetFooManually@@YAXDDDD@Z ; SetFooManually
PUBLIC ?SetFooUnion@@YAXDDDD@Z ; SetFooUnion
EXTRN _fltused:DWORD
; COMDAT pdata
pdata SEGMENT
$pdata$main DD imagerel $LN8
DD imagerel $LN8+137
DD imagerel $unwind$main
pdata ENDS
; COMDAT xdata
xdata SEGMENT
$unwind$main DD 010401H
DD 04204H
xdata ENDS
; Function compile flags: /Ogtpy
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; COMDAT ?SetFooManually@@YAXDDDD@Z
_TEXT SEGMENT
byte_0$dead$ = 8
byte_1$dead$ = 16
byte_2$dead$ = 24
byte_3$dead$ = 32
?SetFooManually@@YAXDDDD@Z PROC ; SetFooManually, COMDAT
00000 c7 05 00 00 00
00 06 07 08 09 mov DWORD PTR ?foo@@3MA, 151521030 ; 09080706H
0000a c3 ret 0
?SetFooManually@@YAXDDDD@Z ENDP ; SetFooManually
_TEXT ENDS
; Function compile flags: /Ogtpy
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; COMDAT main
_TEXT SEGMENT
u$1 = 48
u$ = 48
main PROC ; COMDAT
$LN8:
00000 48 83 ec 28 sub rsp, 40 ; 00000028H
00004 c7 44 24 30 00
01 02 03 mov DWORD PTR u$1[rsp], 50462976 ; 03020100H
0000c f3 0f 10 44 24
30 movss xmm0, DWORD PTR u$1[rsp]
00012 f3 0f 11 44 24
30 movss DWORD PTR u$[rsp], xmm0
00018 0f be 4c 24 30 movsx ecx, BYTE PTR u$[rsp]
0001d f3 0f 11 05 00
00 00 00 movss DWORD PTR ?foo@@3MA, xmm0
00025 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
0002b 0f be 4c 24 31 movsx ecx, BYTE PTR u$[rsp+1]
00030 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00036 0f be 4c 24 32 movsx ecx, BYTE PTR u$[rsp+2]
0003b ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00041 0f be 4c 24 33 movsx ecx, BYTE PTR u$[rsp+3]
00046 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
0004c b9 06 00 00 00 mov ecx, 6
00051 c7 05 00 00 00
00 06 07 08 09 mov DWORD PTR ?foo@@3MA, 151521030 ; 09080706H
0005b ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00061 b9 07 00 00 00 mov ecx, 7
00066 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
0006c b9 08 00 00 00 mov ecx, 8
00071 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00077 b9 09 00 00 00 mov ecx, 9
0007c ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00082 33 c0 xor eax, eax
00084 48 83 c4 28 add rsp, 40 ; 00000028H
00088 c3 ret 0
main ENDP
_TEXT ENDS
END
So what is the difference?
?SetFooUnion@@YAXDDDD@Z PROC ; SetFooUnion, COMDAT
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; Line 7
mov BYTE PTR [rsp+32], r9b
; Line 14
mov DWORD PTR u$[rsp], 50462976 ; 03020100H
; Line 18
movss xmm0, DWORD PTR u$[rsp]
movss DWORD PTR ?foo@@3MA, xmm0
; Line 19
ret 0
?SetFooUnion@@YAXDDDD@Z ENDP ; SetFooUnion
versus:
?SetFooManually@@YAXDDDD@Z PROC ; SetFooManually, COMDAT
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; Line 34
mov DWORD PTR ?foo@@3MA, 151521030 ; 09080706H
; Line 35
ret 0
?SetFooManually@@YAXDDDD@Z ENDP ; SetFooManually
The first thing to notice is the effect of the union on inline memory optimizations. Unions are designed to multiplex RAM for different purposes over different time periods in order to reduce RAM usage, which means the memory must remain coherent in RAM and is thus less inlinable. The union code forced the compiler to write the union out to RAM, while the non-union method just throws out your code and replaces it with a single mov DWORD PTR ?foo@@3MA, 151521030 instruction, without using the xmm0 register! The O2 optimizations automatically inlined both the SetFooUnion and SetFooManually functions, but the non-union method inlined more code and used fewer RAM reads, as evidenced by the difference between the union method's line of code:
movsx ecx,byte ptr [rsp+31h]
versus the non-Union method's version:
mov ecx,7
The union version loads ecx from A POINTER TO RAM, while the other uses a single-cycle mov-immediate instruction. THAT IS A HUGE PERFORMANCE DIFFERENCE! However, the union's behavior may actually be desirable when working with real-time systems and multi-threaded applications, because the compiler optimizations may be unwanted and may disturb your timing; or you may want to use a mix of the two methods.
Aside from the potentially suboptimal RAM usage, I tried for several hours to get the compiler to generate genuinely bad assembly from the union, and I was not able to for most of my toy problems, so the forced memory coherency does seem like a rather nifty feature of unions rather than a reason to shun them. My favorite C++ metaphor is that C++ is like a kitchen full of sharp knives: you need to pick the correct knife for the job, and just because there are a lot of sharp knives in a kitchen doesn't mean you pull them all out at once, or that you leave them lying out. As long as you keep the kitchen tidy, you're not going to cut yourself. A union is a sharp knife: it may help ensure greater RAM coherency, but it requires more typing and can slow the program down.
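To be clear, both techniques produce the same bit pattern; the difference lies entirely in what the optimizer is allowed to do with each. A minimal sketch comparing the two, assuming a little-endian host (so the union's byte order matches the shifts) and using memcpy for the final pun rather than reinterpret_cast; ViaUnion and ViaShifts are illustrative names, not from the listings above:

```cpp
#include <cstdint>
#include <cstring>

// Union route: bytes -> float via type punning (works in practice on
// MSVC/GCC, formally undefined behavior in C++).
float ViaUnion(unsigned char b0, unsigned char b1,
               unsigned char b2, unsigned char b3) {
  union { float f; unsigned char c[4]; } u;
  u.c[0] = b0; u.c[1] = b1; u.c[2] = b2; u.c[3] = b3;
  return u.f;
}

// Shift route: assemble an integer, then reinterpret the bits as a float
// through memcpy, which optimizers reliably fold into a single store.
float ViaShifts(unsigned char b0, unsigned char b1,
                unsigned char b2, unsigned char b3) {
  uint32_t bits = b0 | ((uint32_t)b1 << 8) |
                  ((uint32_t)b2 << 16) | ((uint32_t)b3 << 24);
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}
```

On a little-endian IEEE-754 machine both functions return the same float for the same four bytes; only the generated assembly differs.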