I need to convert an array of bytes to an array of floats. I receive the bytes over a network connection and then need to parse them into floats; the size of the array is not pre-defined. This is the code I have so far, using unions. Do you have any suggestions on how to make it run faster?
int offset = DATA_OFFSET - 1;
UStuff bb;
// Convert every 4 bytes to a float using a union
for (int i = 0; i < NUM_OF_POINTS; i++) {
    // Going backwards - due to endianness
    for (int j = offset + BYTE_FLOAT*i + BYTE_FLOAT; j > offset + BYTE_FLOAT*i; --j) {
        bb.c[(offset + BYTE_FLOAT*i + BYTE_FLOAT) - j] = sample[j];
    }
    res.append(bb.f);
}
return res;
This is the union I use:
union UStuff {
    float f;
    unsigned char c[4];
};
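For reference, the same conversion can be written without a union: assemble the four bytes into a uint32_t with shifts and copy the bits into a float with memcpy, which is well-defined C++ and compiles to the same instructions under optimization. A minimal sketch, assuming the incoming stream is big-endian (network order) and 4-byte IEEE-754 floats; the name BytesToFloats and the std::vector return type are illustrative, not from the original code:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Convert a big-endian byte stream into host floats.
// Each group of 4 bytes becomes one float.
std::vector<float> BytesToFloats(const unsigned char* bytes, std::size_t count) {
  std::vector<float> out;
  out.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    const unsigned char* p = bytes + 4 * i;
    // Reassemble with shifts, reversing the byte order (big-endian input).
    uint32_t bits = (uint32_t)p[3] | ((uint32_t)p[2] << 8) |
                    ((uint32_t)p[1] << 16) | ((uint32_t)p[0] << 24);
    float f;
    std::memcpy(&f, &bits, sizeof f);  // well-defined type pun
    out.push_back(f);
  }
  return out;
}
```

On optimizing compilers the memcpy disappears entirely; the bytes are assembled in an integer register and stored once.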
#include <cstdint>

#define NOT_STUPID 1
#define ENDIAN NOT_STUPID

namespace _ {

inline uint32_t UI4Set(char byte_0, char byte_1, char byte_2, char byte_3) {
#if ENDIAN == NOT_STUPID
  return byte_0 | ((uint32_t)byte_1) << 8 | ((uint32_t)byte_2) << 16 |
         ((uint32_t)byte_3) << 24;
#else
  return byte_3 | ((uint32_t)byte_2) << 8 | ((uint32_t)byte_1) << 16 |
         ((uint32_t)byte_0) << 24;
#endif
}

inline float FLTSet(char byte_0, char byte_1, char byte_2, char byte_3) {
  uint32_t flt = UI4Set(byte_0, byte_1, byte_2, byte_3);
  return *reinterpret_cast<float*>(&flt);
}

/* Use this function to write directly to RAM and avoid the xmm
   registers. */
inline uint32_t FLTSet(char byte_0, char byte_1, char byte_2, char byte_3,
                       float* destination) {
  uint32_t value = UI4Set(byte_0, byte_1, byte_2, byte_3);
  *reinterpret_cast<uint32_t*>(destination) = value;
  return value;
}

}  //< namespace _

using namespace _;  //< @see Kabuki Toolkit

static float flt = FLTSet(0, 1, 2, 3);

int main() {
  uint32_t flt_init = FLTSet(4, 5, 6, 7, &flt);
  return 0;
}
//< This uses 4 extra bytes and doesn't use the xmm register
It is GENERALLY NOT recommended to use a union to convert between floating-point and integer types, because unions to this day do not always generate optimal assembly code, and the other techniques are more explicit and may require less typing. Rather than repeat my opinion or other StackOverflow posts on unions, we shall prove it with disassembly from a modern compiler: Visual C++ 2017 (MSVC 19.12).
The first thing we must know about optimizing floating-point algorithms is how the registers work. The core of the CPU is exclusively an integer processing unit, with coprocessors (i.e. extensions) for handling floating-point numbers. These load-store machines (LSM) can only work with integers, and they must use a separate set of registers for interacting with the floating-point coprocessors. On x86_64 these are the xmm registers, which are 128 bits wide and can process Single Instruction Multiple Data (SIMD). In C++, one way to load and store a floating-point register is:
int Foo(double foo) { return foo + *reinterpret_cast<double*>(&foo); }

int main() {
  double foo = 1.0;
  uint64_t bar = *reinterpret_cast<uint64_t*>(&foo);
  return Foo(bar);
}
Now let's inspect the disassembly with Visual C++ /O2 optimizations enabled, because without them you will get a bunch of debug stack-frame variables. I had to add the function Foo to the example to keep the code from being optimized away.
double foo = 1.0;
uint64_t bar = *reinterpret_cast<uint64_t*>(&foo);
00007FF7482E16A0 mov rax,3FF0000000000000h
00007FF7482E16AA xorps xmm0,xmm0
00007FF7482E16AD cvtsi2sd xmm0,rax
return Foo(bar);
00007FF7482E16B2 addsd xmm0,xmm0
00007FF7482E16B6 cvttsd2si eax,xmm0
}
00007FF7482E16BA ret
And as described, we can see that the LSM first moves the double value into an integer register, then zeroes out the xmm0 register with an xorps instruction (the register is 128 bits wide and we're only loading a 64-bit integer), then loads the contents of the integer register into the floating-point register with the cvtsi2sd instruction. This is finally followed by the cvttsd2si instruction, which moves the value back from the xmm0 register into the return register before returning.
So now let's address the concern about generating optimal assembly code, using this test program and Visual C++ 2017:
#include "stdafx.h"
#include <cstdint>
#include <cstdio>
static float foo = 0.0f;
void SetFooUnion(char byte_0, char byte_1, char byte_2, char byte_3) {
union {
float flt;
char bytes[4];
} u = {foo};
u.bytes[0] = byte_0;
u.bytes[1] = byte_1;
u.bytes[2] = byte_2;
u.bytes[3] = byte_3;
foo = u.flt;
}
void SetFooManually(char byte_0, char byte_1, char byte_2, char byte_3) {
uint32_t faster_method = byte_0 | ((uint32_t)byte_1) << 8 |
((uint32_t)byte_2) << 16 | ((uint32_t)byte_3) << 24;
*reinterpret_cast<uint32_t*>(&foo) = faster_method;
}
namespace _ {
inline uint32_t UI4Set(char byte_0, char byte_1, char byte_2, char byte_3) {
return byte_0 | ((uint32_t)byte_1) << 8 | ((uint32_t)byte_2) << 16 |
((uint32_t)byte_3) << 24;
}
inline float FLTSet(char byte_0, char byte_1, char byte_2, char byte_3) {
uint32_t flt = UI4Set(byte_0, byte_1, byte_2, byte_3);
return *reinterpret_cast<float*>(&flt);
}
inline void FLTSet(char byte_0, char byte_1, char byte_2, char byte_3,
float* destination) {
uint32_t value = byte_0 | ((uint32_t)byte_1) << 8 | ((uint32_t)byte_2) << 16 |
((uint32_t)byte_3) << 24;
*reinterpret_cast<uint32_t*>(destination) = value;
}
} // namespace _
int main() {
SetFooUnion(0, 1, 2, 3);
union {
float flt;
char bytes[4];
} u = {foo};
// Start union read tests
putchar(u.bytes[0]);
putchar(u.bytes[1]);
putchar(u.bytes[2]);
putchar(u.bytes[3]);
// Start union write tests
u.bytes[0] = 4;
u.bytes[2] = 5;
foo = u.flt;
// Start hand-coded tests
SetFooManually(6, 7, 8, 9);
uint32_t bar = *reinterpret_cast<uint32_t*>(&foo);
putchar((char)(bar));
putchar((char)(bar >> 8));
putchar((char)(bar >> 16));
putchar((char)(bar >> 24));
_::FLTSet (0, 1, 2, 3, &foo);
return 0;
}
Now, after inspecting the O2-optimized disassembly, we have proof that the compiler DOES NOT produce optimal code:
int main() {
00007FF6DB4A1000 sub rsp,28h
SetFooUnion(0, 1, 2, 3);
00007FF6DB4A1004 mov dword ptr [rsp+30h],3020100h
00007FF6DB4A100C movss xmm0,dword ptr [rsp+30h]
union {
float flt;
char bytes[4];
} u = {foo};
00007FF6DB4A1012 movss dword ptr [rsp+30h],xmm0
// Start union read tests
putchar(u.bytes[0]);
00007FF6DB4A1018 movsx ecx,byte ptr [u]
SetFooUnion(0, 1, 2, 3);
00007FF6DB4A101D movss dword ptr [foo (07FF6DB4A3628h)],xmm0
// Start union read tests
putchar(u.bytes[0]);
00007FF6DB4A1025 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar(u.bytes[1]);
00007FF6DB4A102B movsx ecx,byte ptr [rsp+31h]
00007FF6DB4A1030 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar(u.bytes[2]);
00007FF6DB4A1036 movsx ecx,byte ptr [rsp+32h]
00007FF6DB4A103B call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar(u.bytes[3]);
00007FF6DB4A1041 movsx ecx,byte ptr [rsp+33h]
00007FF6DB4A1046 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
uint32_t bar = *reinterpret_cast<uint32_t*>(&foo);
putchar((char)(bar));
00007FF6DB4A104C mov ecx,6
// Start union write tests
u.bytes[0] = 4;
u.bytes[2] = 5;
foo = u.flt;
// Start hand-coded tests
SetFooManually(6, 7, 8, 9);
00007FF6DB4A1051 mov dword ptr [foo (07FF6DB4A3628h)],9080706h
uint32_t bar = *reinterpret_cast<uint32_t*>(&foo);
putchar((char)(bar));
00007FF6DB4A105B call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar((char)(bar >> 8));
00007FF6DB4A1061 mov ecx,7
00007FF6DB4A1066 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar((char)(bar >> 16));
00007FF6DB4A106C mov ecx,8
00007FF6DB4A1071 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar((char)(bar >> 24));
00007FF6DB4A1077 mov ecx,9
00007FF6DB4A107C call qword ptr [__imp_putchar (07FF6DB4A2160h)]
return 0;
00007FF6DB4A1082 xor eax,eax
_::FLTSet(0, 1, 2, 3, &foo);
00007FF6DB4A1084 mov dword ptr [foo (07FF6DB4A3628h)],3020100h
}
00007FF6DB4A108E add rsp,28h
00007FF6DB4A1092 ret
Here is the raw assembly listing of main, since the inlined functions do not show up in the debugger's disassembly view:
; Listing generated by Microsoft (R) Optimizing Compiler Version 19.12.25831.0
include listing.inc
INCLUDELIB OLDNAMES
EXTRN __imp_putchar:PROC
EXTRN __security_check_cookie:PROC
?foo@@3MA DD 01H DUP (?) ; foo
_BSS ENDS
PUBLIC main
PUBLIC ?SetFooManually@@YAXDDDD@Z ; SetFooManually
PUBLIC ?SetFooUnion@@YAXDDDD@Z ; SetFooUnion
EXTRN _fltused:DWORD
; COMDAT pdata
pdata SEGMENT
$pdata$main DD imagerel $LN8
DD imagerel $LN8+137
DD imagerel $unwind$main
pdata ENDS
; COMDAT xdata
xdata SEGMENT
$unwind$main DD 010401H
DD 04204H
xdata ENDS
; Function compile flags: /Ogtpy
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; COMDAT ?SetFooManually@@YAXDDDD@Z
_TEXT SEGMENT
byte_0$dead$ = 8
byte_1$dead$ = 16
byte_2$dead$ = 24
byte_3$dead$ = 32
?SetFooManually@@YAXDDDD@Z PROC ; SetFooManually, COMDAT
00000 c7 05 00 00 00
00 06 07 08 09 mov DWORD PTR ?foo@@3MA, 151521030 ; 09080706H
0000a c3 ret 0
?SetFooManually@@YAXDDDD@Z ENDP ; SetFooManually
_TEXT ENDS
; Function compile flags: /Ogtpy
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; COMDAT main
_TEXT SEGMENT
u$1 = 48
u$ = 48
main PROC ; COMDAT
$LN8:
00000 48 83 ec 28 sub rsp, 40 ; 00000028H
00004 c7 44 24 30 00
01 02 03 mov DWORD PTR u$1[rsp], 50462976 ; 03020100H
0000c f3 0f 10 44 24
30 movss xmm0, DWORD PTR u$1[rsp]
00012 f3 0f 11 44 24
30 movss DWORD PTR u$[rsp], xmm0
00018 0f be 4c 24 30 movsx ecx, BYTE PTR u$[rsp]
0001d f3 0f 11 05 00
00 00 00 movss DWORD PTR ?foo@@3MA, xmm0
00025 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
0002b 0f be 4c 24 31 movsx ecx, BYTE PTR u$[rsp+1]
00030 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00036 0f be 4c 24 32 movsx ecx, BYTE PTR u$[rsp+2]
0003b ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00041 0f be 4c 24 33 movsx ecx, BYTE PTR u$[rsp+3]
00046 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
0004c b9 06 00 00 00 mov ecx, 6
00051 c7 05 00 00 00
00 06 07 08 09 mov DWORD PTR ?foo@@3MA, 151521030 ; 09080706H
0005b ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00061 b9 07 00 00 00 mov ecx, 7
00066 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
0006c b9 08 00 00 00 mov ecx, 8
00071 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00077 b9 09 00 00 00 mov ecx, 9
0007c ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00082 33 c0 xor eax, eax
00084 48 83 c4 28 add rsp, 40 ; 00000028H
00088 c3 ret 0
main ENDP
_TEXT ENDS
END
So what is the difference?
?SetFooUnion@@YAXDDDD@Z PROC ; SetFooUnion, COMDAT
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; Line 7
mov BYTE PTR [rsp+32], r9b
; Line 14
mov DWORD PTR u$[rsp], 50462976 ; 03020100H
; Line 18
movss xmm0, DWORD PTR u$[rsp]
movss DWORD PTR ?foo@@3MA, xmm0
; Line 19
ret 0
?SetFooUnion@@YAXDDDD@Z ENDP ; SetFooUnion
versus:
?SetFooManually@@YAXDDDD@Z PROC ; SetFooManually, COMDAT
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; Line 34
mov DWORD PTR ?foo@@3MA, 151521030 ; 09080706H
; Line 35
ret 0
?SetFooManually@@YAXDDDD@Z ENDP ; SetFooManually
The first thing to notice is the effect of the union on inline memory optimizations. Unions are designed to multiplex RAM for different purposes over different time periods in order to reduce RAM usage, which means the memory must remain coherent in RAM and is thus less inlinable. The union code forced the compiler to write the union out to RAM, while the non-union method just throws out your code and replaces it with a single mov DWORD PTR ?foo@@3MA, 151521030 instruction, without using the xmm0 register! The O2 optimizations automatically inlined both the SetFooUnion and SetFooManually functions, but the non-union method inlined more code and used fewer RAM reads, as evidenced by the difference between the union method's line of code:
movsx ecx,byte ptr [rsp+31h]
versus the non-Union method's version:
mov ecx,7
The union version loads ecx from A POINTER TO RAM, while the other uses a single-cycle mov-immediate instruction. THAT IS A HUGE PERFORMANCE DIFFERENCE! However, the union's behavior may actually be desirable when working with real-time systems and multi-threaded applications, because the compiler optimizations may be unwanted and may disturb your timing; or you may want to use a mix of the two methods.
Aside from the potentially suboptimal RAM usage, I tried for several hours to get the compiler to generate genuinely bad assembly from the union, and I was not able to for most of my toy problems, so the forced memory coherency does seem like a rather nifty feature of unions rather than a reason to shun them. My favorite C++ metaphor is that C++ is like a kitchen full of sharp knives: you need to pick the correct knife for the job, and just because there are a lot of sharp knives in a kitchen doesn't mean you pull them all out at once, or that you leave them lying out. As long as you keep the kitchen tidy, you're not going to cut yourself. A union is a sharp knife: it may help ensure greater RAM coherency, but it requires more typing and can slow the program down.
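To be clear, both techniques produce the same bit pattern; the difference lies entirely in what the optimizer is allowed to do with each. A minimal sketch comparing the two, assuming a little-endian host (so the union's byte order matches the shifts) and using memcpy for the final pun rather than reinterpret_cast; ViaUnion and ViaShifts are illustrative names, not from the listings above:

```cpp
#include <cstdint>
#include <cstring>

// Union route: bytes -> float via type punning (works in practice on
// MSVC/GCC, formally undefined behavior in C++).
float ViaUnion(unsigned char b0, unsigned char b1,
               unsigned char b2, unsigned char b3) {
  union { float f; unsigned char c[4]; } u;
  u.c[0] = b0; u.c[1] = b1; u.c[2] = b2; u.c[3] = b3;
  return u.f;
}

// Shift route: assemble an integer, then reinterpret the bits as a float
// through memcpy, which optimizers reliably fold into a single store.
float ViaShifts(unsigned char b0, unsigned char b1,
                unsigned char b2, unsigned char b3) {
  uint32_t bits = b0 | ((uint32_t)b1 << 8) |
                  ((uint32_t)b2 << 16) | ((uint32_t)b3 << 24);
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}
```

On a little-endian IEEE-754 machine both functions return the same float for the same four bytes; only the generated assembly differs.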