A readonly struct containing a single primitive should be more or less as fast for any simple operation as the primitive itself.
All the tests below are running .NET Core 2.2 on Windows 7 x64, code optimised. I also get similar results when testing on .NET 4.7.2.
Testing this premise with the long type, it seems that this holds:
// =============== SETUP ===================
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

public readonly struct LongStruct
{
    public readonly long Primitive;
    public LongStruct(long value) => Primitive = value;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static LongStruct Add(in LongStruct lhs, in LongStruct rhs)
        => new LongStruct(lhs.Primitive + rhs.Primitive);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static long LongAdd(long lhs, long rhs) => lhs + rhs;
// =============== TESTS ===================
public static long TestLong(long a, long b, out long result)
{
    var sw = Stopwatch.StartNew();
    for (var i = 1000000000; i > 0; --i)
    {
        a = LongAdd(a, b);
    }
    sw.Stop();
    result = a;
    return sw.ElapsedMilliseconds;
}
public static long TestLongStruct(LongStruct a, LongStruct b, out LongStruct result)
{
    var sw = Stopwatch.StartNew();
    for (var i = 1000000000; i > 0; --i)
    {
        a = LongStruct.Add(a, b);
    }
    sw.Stop();
    result = a;
    return sw.ElapsedMilliseconds;
}
// ============= TEST LOOP =================
public static void RunTests()
{
    var longStruct = new LongStruct(1);
    var count = 0;
    var longTime = 0L;
    var longStructTime = 0L;
    while (true)
    {
        count++;
        Console.WriteLine("Test #" + count);

        longTime += TestLong(1, 1, out var longResult);
        var longMean = longTime / count;
        Console.WriteLine($"Long: value={longResult}, Mean Time elapsed: {longMean} ms");

        longStructTime += TestLongStruct(longStruct, longStruct, out var longStructResult);
        var longStructMean = longStructTime / count;
        Console.WriteLine($"LongStruct: value={longStructResult.Primitive}, Mean Time elapsed: {longStructMean} ms");
        Console.WriteLine();
    }
}
LongAdd is used so that the two test loops match: each loop calls out to a method which does some adding, rather than inlining the + directly in the primitive case.
On my machine, the two times have settled down to within 2% of each other, close enough that I'm convinced they've been optimised to pretty much the same code.
The difference in IL (LongAdd vs LongStruct.Add) is fairly small. LongStruct.Add has a few extra instructions:
- ldfld instructions to load Primitive from the struct
- a newobj instruction to pack the new long back into a LongStruct
So either the jitter is optimising away these instructions, or they're basically free.
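For concreteness, here's roughly what those two bodies look like in IL (a sketch of typical Roslyn release output from memory, so treat the exact sequence as illustrative rather than authoritative):
// LongAdd(long, long)
ldarg.0
ldarg.1
add
ret
// LongStruct.Add(in LongStruct, in LongStruct) - the in args arrive by reference
ldarg.0
ldfld    int64 LongStruct::Primitive
ldarg.1
ldfld    int64 LongStruct::Primitive
add
newobj   instance void LongStruct::.ctor(int64)
ret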
If I take the code above and replace every long with a double, I'd expect the same sort of result (slower in absolute terms, as the add instruction will be slightly slower, but both by the same margin).
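That is, the struct becomes (mirroring the LongStruct above field-for-field):
public readonly struct DoubleStruct
{
    public readonly double Primitive;
    public DoubleStruct(double value) => Primitive = value;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static DoubleStruct Add(in DoubleStruct lhs, in DoubleStruct rhs)
        => new DoubleStruct(lhs.Primitive + rhs.Primitive);
}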
What I actually see is that the DoubleStruct version takes about 4.8 times as long (i.e. ~480% of the time) as the double version.
The IL is identical to the long case (other than swapping int64 and LongStruct for float64 and DoubleStruct), but somehow the runtime is doing a load of extra work for the DoubleStruct case that isn't present in the LongStruct case or the double case.
Testing a few other primitive types, I see that float (465%) behaves the same way as double, and short and int behave the same way as long, so it seems it's something about floating point that is causing some optimisation not to be taken.
Why are DoubleStruct and FloatStruct so much slower than double and float, where the long, int and short equivalents suffer no such slowdown?
This isn't an answer on its own, but it's a bit of a more rigorous benchmark, on both x86 and x64, so hopefully it provides some more information to someone else who can explain this.
I tried to replicate this with BenchmarkDotNet. I also wanted to see what difference removing the in would do. I ran it separately as x86 and x64.
x86 (LegacyJIT)
| Method | Mean | Error | StdDev |
|----------------------- |---------:|---------:|---------:|
| TestLong | 257.9 ms | 2.099 ms | 1.964 ms |
| TestLongStruct | 529.3 ms | 4.977 ms | 4.412 ms |
| TestLongStructWithIn | 526.2 ms | 6.722 ms | 6.288 ms |
| TestDouble | 256.7 ms | 1.466 ms | 1.300 ms |
| TestDoubleStruct | 342.5 ms | 5.189 ms | 4.600 ms |
| TestDoubleStructWithIn | 338.7 ms | 3.808 ms | 3.376 ms |
x64 (RyuJIT)
| Method | Mean | Error | StdDev |
|----------------------- |-----------:|----------:|----------:|
| TestLong | 269.8 ms | 5.359 ms | 9.099 ms |
| TestLongStruct | 266.2 ms | 6.706 ms | 8.236 ms |
| TestLongStructWithIn | 270.4 ms | 4.150 ms | 3.465 ms |
| TestDouble | 270.4 ms | 5.336 ms | 6.748 ms |
| TestDoubleStruct | 1,250.9 ms | 24.702 ms | 25.367 ms |
| TestDoubleStructWithIn | 577.1 ms | 12.159 ms | 16.644 ms |
I can replicate this on x64 with RyuJIT, but not on x86 with LegacyJIT. This seems to be an artifact of RyuJIT managing to optimize the long case but not the double case - LegacyJIT doesn't manage to optimize either.
I've no idea why TestDoubleStruct is such an outlier on RyuJIT.
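(Back-of-the-envelope: 1e9 iterations in ~266 ms is ~0.27 ns per iteration, i.e. about one iteration per clock on a core running around 4 GHz, which is what an essentially empty countdown loop should cost. The 1,250 ms outlier works out to roughly 4-5 cycles per iteration, so something real is surviving inside that loop.)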
Code:
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public readonly struct LongStruct
{
    public readonly long Primitive;
    public LongStruct(long value) => Primitive = value;

    public static LongStruct Add(LongStruct lhs, LongStruct rhs)
        => new LongStruct(lhs.Primitive + rhs.Primitive);

    public static LongStruct AddWithIn(in LongStruct lhs, in LongStruct rhs)
        => new LongStruct(lhs.Primitive + rhs.Primitive);
}

public readonly struct DoubleStruct
{
    public readonly double Primitive;
    public DoubleStruct(double value) => Primitive = value;

    public static DoubleStruct Add(DoubleStruct lhs, DoubleStruct rhs)
        => new DoubleStruct(lhs.Primitive + rhs.Primitive);

    public static DoubleStruct AddWithIn(in DoubleStruct lhs, in DoubleStruct rhs)
        => new DoubleStruct(lhs.Primitive + rhs.Primitive);
}
public class Benchmark
{
    [Benchmark]
    public void TestLong()
    {
        for (var i = 1000000000; i > 0; --i)
        {
            LongAdd(1, 2);
        }
    }

    [Benchmark]
    public void TestLongStruct()
    {
        var a = new LongStruct(1);
        var b = new LongStruct(2);
        for (var i = 1000000000; i > 0; --i)
        {
            LongStruct.Add(a, b);
        }
    }

    [Benchmark]
    public void TestLongStructWithIn()
    {
        var a = new LongStruct(1);
        var b = new LongStruct(2);
        for (var i = 1000000000; i > 0; --i)
        {
            LongStruct.AddWithIn(a, b);
        }
    }

    [Benchmark]
    public void TestDouble()
    {
        for (var i = 1000000000; i > 0; --i)
        {
            DoubleAdd(1, 2);
        }
    }

    [Benchmark]
    public void TestDoubleStruct()
    {
        var a = new DoubleStruct(1);
        var b = new DoubleStruct(2);
        for (var i = 1000000000; i > 0; --i)
        {
            DoubleStruct.Add(a, b);
        }
    }

    [Benchmark]
    public void TestDoubleStructWithIn()
    {
        var a = new DoubleStruct(1);
        var b = new DoubleStruct(2);
        for (var i = 1000000000; i > 0; --i)
        {
            DoubleStruct.AddWithIn(a, b);
        }
    }

    public static long LongAdd(long lhs, long rhs) => lhs + rhs;
    public static double DoubleAdd(double lhs, double rhs) => lhs + rhs;
}
class Program
{
    static void Main(string[] args)
    {
        var summary = BenchmarkRunner.Run<Benchmark>();
        Console.ReadLine();
    }
}
For fun, here's the x64 assembly for both cases:
Code
using System;

public class C {
    public long AddLongs(long a, long b) {
        return a + b;
    }

    public LongStruct AddLongStructs(LongStruct a, LongStruct b) {
        return LongStruct.Add(a, b);
    }

    public LongStruct AddLongStructsWithIn(LongStruct a, LongStruct b) {
        return LongStruct.AddWithIn(a, b);
    }

    public double AddDoubles(double a, double b) {
        return a + b;
    }

    public DoubleStruct AddDoubleStructs(DoubleStruct a, DoubleStruct b) {
        return DoubleStruct.Add(a, b);
    }

    public DoubleStruct AddDoubleStructsWithIn(DoubleStruct a, DoubleStruct b) {
        return DoubleStruct.AddWithIn(a, b);
    }
}

public readonly struct LongStruct
{
    public readonly long Primitive;
    public LongStruct(long value) => Primitive = value;

    public static LongStruct Add(LongStruct lhs, LongStruct rhs)
        => new LongStruct(lhs.Primitive + rhs.Primitive);

    public static LongStruct AddWithIn(in LongStruct lhs, in LongStruct rhs)
        => new LongStruct(lhs.Primitive + rhs.Primitive);
}

public readonly struct DoubleStruct
{
    public readonly double Primitive;
    public DoubleStruct(double value) => Primitive = value;

    public static DoubleStruct Add(DoubleStruct lhs, DoubleStruct rhs)
        => new DoubleStruct(lhs.Primitive + rhs.Primitive);

    public static DoubleStruct AddWithIn(in DoubleStruct lhs, in DoubleStruct rhs)
        => new DoubleStruct(lhs.Primitive + rhs.Primitive);
}
x86 Assembly
C.AddLongs(Int64, Int64)
L0000: mov eax, [esp+0xc]
L0004: mov edx, [esp+0x10]
L0008: add eax, [esp+0x4]
L000c: adc edx, [esp+0x8]
L0010: ret 0x10
C.AddLongStructs(LongStruct, LongStruct)
L0000: push esi
L0001: mov eax, [esp+0x10]
L0005: mov esi, [esp+0x14]
L0009: add eax, [esp+0x8]
L000d: adc esi, [esp+0xc]
L0011: mov [edx], eax
L0013: mov [edx+0x4], esi
L0016: pop esi
L0017: ret 0x10
C.AddLongStructsWithIn(LongStruct, LongStruct)
L0000: push esi
L0001: mov eax, [esp+0x10]
L0005: mov esi, [esp+0x14]
L0009: add eax, [esp+0x8]
L000d: adc esi, [esp+0xc]
L0011: mov [edx], eax
L0013: mov [edx+0x4], esi
L0016: pop esi
L0017: ret 0x10
C.AddDoubles(Double, Double)
L0000: fld qword [esp+0xc]
L0004: fadd qword [esp+0x4]
L0008: ret 0x10
C.AddDoubleStructs(DoubleStruct, DoubleStruct)
L0000: fld qword [esp+0xc]
L0004: fld qword [esp+0x4]
L0008: faddp st1, st0
L000a: fstp qword [edx]
L000c: ret 0x10
C.AddDoubleStructsWithIn(DoubleStruct, DoubleStruct)
L0000: fld qword [esp+0xc]
L0004: fadd qword [esp+0x4]
L0008: fstp qword [edx]
L000a: ret 0x10
x64 Assembly
C..ctor()
L0000: ret
C.AddLongs(Int64, Int64)
L0000: lea rax, [rdx+r8]
L0004: ret
C.AddLongStructs(LongStruct, LongStruct)
L0000: lea rax, [rdx+r8]
L0004: ret
C.AddLongStructsWithIn(LongStruct, LongStruct)
L0000: lea rax, [rdx+r8]
L0004: ret
C.AddDoubles(Double, Double)
L0000: vzeroupper
L0003: vmovaps xmm0, xmm1
L0008: vaddsd xmm0, xmm0, xmm2
L000d: ret
C.AddDoubleStructs(DoubleStruct, DoubleStruct)
L0000: sub rsp, 0x18
L0004: vzeroupper
L0007: mov [rsp+0x28], rdx
L000c: mov [rsp+0x30], r8
L0011: mov rax, [rsp+0x28]
L0016: mov [rsp+0x10], rax
L001b: mov rax, [rsp+0x30]
L0020: mov [rsp+0x8], rax
L0025: vmovsd xmm0, qword [rsp+0x10]
L002c: vaddsd xmm0, xmm0, [rsp+0x8]
L0033: vmovsd [rsp], xmm0
L0039: mov rax, [rsp]
L003d: add rsp, 0x18
L0041: ret
C.AddDoubleStructsWithIn(DoubleStruct, DoubleStruct)
L0000: push rax
L0001: vzeroupper
L0004: mov [rsp+0x18], rdx
L0009: mov [rsp+0x20], r8
L000e: vmovsd xmm0, qword [rsp+0x18]
L0015: vaddsd xmm0, xmm0, [rsp+0x20]
L001c: vmovsd [rsp], xmm0
L0022: mov rax, [rsp]
L0026: add rsp, 0x8
L002a: ret
SharpLab
If you add in the loops:
Code
public class C {
    public void AddLongs(long a, long b) {
        for (var i = 1000000000; i > 0; --i) {
            long c = a + b;
        }
    }

    public void AddLongStructs(LongStruct a, LongStruct b) {
        for (var i = 1000000000; i > 0; --i) {
            a = LongStruct.Add(a, b);
        }
    }

    public void AddLongStructsWithIn(LongStruct a, LongStruct b) {
        for (var i = 1000000000; i > 0; --i) {
            a = LongStruct.AddWithIn(a, b);
        }
    }

    public void AddDoubles(double a, double b) {
        for (var i = 1000000000; i > 0; --i) {
            a = a + b;
        }
    }

    public void AddDoubleStructs(DoubleStruct a, DoubleStruct b) {
        for (var i = 1000000000; i > 0; --i) {
            a = DoubleStruct.Add(a, b);
        }
    }

    public void AddDoubleStructsWithIn(DoubleStruct a, DoubleStruct b) {
        for (var i = 1000000000; i > 0; --i) {
            a = DoubleStruct.AddWithIn(a, b);
        }
    }
}

public readonly struct LongStruct
{
    public readonly long Primitive;
    public LongStruct(long value) => Primitive = value;

    public static LongStruct Add(LongStruct lhs, LongStruct rhs)
        => new LongStruct(lhs.Primitive + rhs.Primitive);

    public static LongStruct AddWithIn(in LongStruct lhs, in LongStruct rhs)
        => new LongStruct(lhs.Primitive + rhs.Primitive);
}

public readonly struct DoubleStruct
{
    public readonly double Primitive;
    public DoubleStruct(double value) => Primitive = value;

    public static DoubleStruct Add(DoubleStruct lhs, DoubleStruct rhs)
        => new DoubleStruct(lhs.Primitive + rhs.Primitive);

    public static DoubleStruct AddWithIn(in DoubleStruct lhs, in DoubleStruct rhs)
        => new DoubleStruct(lhs.Primitive + rhs.Primitive);
}
x86
C.AddLongs(Int64, Int64)
L0000: push ebp
L0001: mov ebp, esp
L0003: mov eax, 0x3b9aca00
L0008: dec eax
L0009: test eax, eax
L000b: jg L0008
L000d: pop ebp
L000e: ret 0x10
C.AddLongStructs(LongStruct, LongStruct)
L0000: push ebp
L0001: mov ebp, esp
L0003: push esi
L0004: mov esi, 0x3b9aca00
L0009: mov eax, [ebp+0x10]
L000c: mov edx, [ebp+0x14]
L000f: add eax, [ebp+0x8]
L0012: adc edx, [ebp+0xc]
L0015: mov [ebp+0x10], eax
L0018: mov [ebp+0x14], edx
L001b: dec esi
L001c: test esi, esi
L001e: jg L0009
L0020: pop esi
L0021: pop ebp
L0022: ret 0x10
C.AddLongStructsWithIn(LongStruct, LongStruct)
L0000: push ebp
L0001: mov ebp, esp
L0003: push esi
L0004: mov esi, 0x3b9aca00
L0009: mov eax, [ebp+0x10]
L000c: mov edx, [ebp+0x14]
L000f: add eax, [ebp+0x8]
L0012: adc edx, [ebp+0xc]
L0015: mov [ebp+0x10], eax
L0018: mov [ebp+0x14], edx
L001b: dec esi
L001c: test esi, esi
L001e: jg L0009
L0020: pop esi
L0021: pop ebp
L0022: ret 0x10
C.AddDoubles(Double, Double)
L0000: push ebp
L0001: mov ebp, esp
L0003: mov eax, 0x3b9aca00
L0008: dec eax
L0009: test eax, eax
L000b: jg L0008
L000d: pop ebp
L000e: ret 0x10
C.AddDoubleStructs(DoubleStruct, DoubleStruct)
L0000: push ebp
L0001: mov ebp, esp
L0003: mov eax, 0x3b9aca00
L0008: fld qword [ebp+0x10]
L000b: fld qword [ebp+0x8]
L000e: faddp st1, st0
L0010: fstp qword [ebp+0x10]
L0013: dec eax
L0014: test eax, eax
L0016: jg L0008
L0018: pop ebp
L0019: ret 0x10
C.AddDoubleStructsWithIn(DoubleStruct, DoubleStruct)
L0000: push ebp
L0001: mov ebp, esp
L0003: mov eax, 0x3b9aca00
L0008: fld qword [ebp+0x10]
L000b: fadd qword [ebp+0x8]
L000e: fstp qword [ebp+0x10]
L0011: dec eax
L0012: test eax, eax
L0014: jg L0008
L0016: pop ebp
L0017: ret 0x10
x64
C.AddLongs(Int64, Int64)
L0000: mov eax, 0x3b9aca00
L0005: dec eax
L0007: test eax, eax
L0009: jg L0005
L000b: ret
C.AddLongStructs(LongStruct, LongStruct)
L0000: mov eax, 0x3b9aca00
L0005: add rdx, r8
L0008: dec eax
L000a: test eax, eax
L000c: jg L0005
L000e: ret
C.AddLongStructsWithIn(LongStruct, LongStruct)
L0000: mov eax, 0x3b9aca00
L0005: add rdx, r8
L0008: dec eax
L000a: test eax, eax
L000c: jg L0005
L000e: ret
C.AddDoubles(Double, Double)
L0000: vzeroupper
L0003: mov eax, 0x3b9aca00
L0008: vaddsd xmm1, xmm1, xmm2
L000d: dec eax
L000f: test eax, eax
L0011: jg L0008
L0013: ret
C.AddDoubleStructs(DoubleStruct, DoubleStruct)
L0000: sub rsp, 0x18
L0004: vzeroupper
L0007: mov [rsp+0x28], rdx
L000c: mov [rsp+0x30], r8
L0011: mov eax, 0x3b9aca00
L0016: mov rdx, [rsp+0x28]
L001b: mov [rsp+0x10], rdx
L0020: mov rdx, [rsp+0x30]
L0025: mov [rsp+0x8], rdx
L002a: vmovsd xmm0, qword [rsp+0x10]
L0031: vaddsd xmm0, xmm0, [rsp+0x8]
L0038: vmovsd [rsp], xmm0
L003e: mov rdx, [rsp]
L0042: mov [rsp+0x28], rdx
L0047: dec eax
L0049: test eax, eax
L004b: jg L0016
L004d: add rsp, 0x18
L0051: ret
C.AddDoubleStructsWithIn(DoubleStruct, DoubleStruct)
L0000: push rax
L0001: vzeroupper
L0004: mov [rsp+0x18], rdx
L0009: mov [rsp+0x20], r8
L000e: mov eax, 0x3b9aca00
L0013: vmovsd xmm0, qword [rsp+0x20]
L001a: vmovaps xmm1, xmm0
L001f: vaddsd xmm1, xmm1, [rsp+0x18]
L0026: vmovsd [rsp], xmm1
L002c: mov rdx, [rsp]
L0030: mov [rsp+0x18], rdx
L0035: dec eax
L0037: test eax, eax
L0039: jg L001a
L003b: add rsp, 0x8
L003f: ret
SharpLab
I'm not familiar enough with assembly to explain what exactly it's doing, but it's clear that more work is going on in AddDoubleStructs than in AddLongStructs.
See @canton7's answer for some timing results and x86 asm output that I based my conclusions on. (I don't have Windows or a C# compiler).
Anomalies: the "release" asm for the loops on SharpLab doesn't match @canton7's BenchmarkDotNet performance numbers for any Intel or AMD CPU. The asm shows TestDouble really does a += b inside the loop, but the timing shows it running as fast as the 1/clock integer loop. (FP add latency is 3 to 5 cycles on all of AMD K8/K10/Bulldozer-family/Ryzen, and Intel P6 through Skylake.)
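(Quick check on that: a serial dependent vaddsd per iteration at 4-cycle latency would be 1e9 × 4 / 4.2e9 Hz ≈ 950 ms, yet TestDouble measured ~270 ms, the same as the empty countdown loop. So the add can't actually have been executing as a serial chain in the measured runs.)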
Maybe that's only a first-pass optimization, and after running longer the JIT will optimize away the FP add entirely (since the value is not returned). So I think unfortunately we still don't really have the asm that's actually running, but we can see the kind of mess the JIT optimizer makes.
I don't understand how TestDoubleStructWithIn could be slower than an integer loop but only twice as slow (not 3x), unless maybe the long loops aren't running at 1 iteration per clock. With such high counts, startup overhead should be negligible. A loop counter being kept in memory could explain it (imposing a ~6 cycle per iteration bottleneck on everything, hiding the latency of anything except the very slow FP versions), but @canton7 says they tested with a Release build. Their i7-8650U might also not maintain max-turbo = 4.20 GHz for all the loops, due to power/thermal limits (all-core minimum sustained frequency = 1.90 GHz), so looking at time in seconds instead of cycles could be throwing us off for the loops without a bottleneck. That still doesn't explain primitive double being the same speed as long; those adds must have been optimized away.
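(For scale, at 1 iteration per clock: 1e9 iterations is 1e9 / 4.2e9 ≈ 238 ms at max turbo but 1e9 / 1.9e9 ≈ 526 ms at base clock, so frequency variation alone can stretch wall-clock numbers by more than 2x with no change in cycles per iteration.)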
It's reasonable to expect this struct to inline and optimize away, the way you're using it. A good compiler would do that. But a JIT has to compile quickly, so it isn't always good, and clearly in this case it is not for double.
For the integer loops, 64-bit integer add on x86-64 has 1 cycle latency, and modern superscalar CPUs have enough throughput to run a loop containing an add at the same speed as an otherwise-empty loop that just counts down a counter. So we can't tell from the timings whether the compiler did a + b * 1000000000 outside the loop (but still ran an empty loop), or what.
@canton7 used SharpLab to look at the JIT x86-64 asm for a stand-alone version of AddDoubleStructs, and for the loop that calls it (standalone and loops, x86-64, release mode).
We can see that for primitive long c = a + b it optimized away the add entirely (but kept an empty countdown loop)! If we use a = a+b; we get an actual add instruction, even though a isn't returned from the function.
loops.AddLongs(Int64, Int64)
L0000: mov eax, 0x3b9aca00 # i = init
# do {
# long c = a+b optimized out
L0005: dec eax # --i;
L0007: test eax, eax
L0009: jg L0005 # }while(i>0);
L000b: ret
But the struct version has an actual add instruction, from a = LongStruct.Add(a, b);. (We do get the same with a = a+b; with primitive long.)
loops.AddLongStructs(LongStruct a, LongStruct b)
L0000: mov eax, 0x3b9aca00
L0005: add rdx, r8 # a += b; other insns are identical
L0008: dec eax
L000a: test eax, eax
L000c: jg L0005
L000e: ret
But if we change it to LongStruct.Add(a, b); (not assigning the result anywhere), we get L0006: add rdx, r8 outside the loop (hoisting a+b), and then L0009: mov rcx, rdx / L000c: mov [rsp], rcx inside the loop. (register copy and then store to a dead scratch space, totally insane.) In C# (unlike C / C++), writing a+b; on its own as a statement is an error, so we can't see if the primitive equivalent would still result in stupid wasted instructions. Only assignment, call, increment, decrement, await, and new object expressions can be used as a statement.
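To make the statement rule concrete (the variable names here are just for illustration):
long a = 1, b = 2;
// a + b;        // error CS0201: not a legal expression statement in C#
LongAdd(a, b);   // fine: a method call is a valid statement, result discarded
_ = a + b;       // also fine (C# 7+): assigning to the discard makes it a statement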
I don't think we can blame any of these missed optimizations on the struct per se. But even if you benchmark this with/without an add in the loop, it doesn't result in an actual slowdown in this loop on modern x86. The empty loop hits the 1/clock loop throughput bottleneck with only 2 uops in the loop (dec and macro-fused test/jg), leaving room for 2 more uops with no slowdown as long as they don't introduce any bottleneck worse than 1/clock. (https://agner.org/optimize/) e.g. imul edx, r8d with 3 cycle latency would slow the loop down by a factor of 3. The "4 uops" front-end throughput is assuming a recent Intel; Bulldozer-family is narrower, Ryzen is 5-wide.
These are non-static member functions of a class (for no reason, but I didn't notice right away so not changing it now). In the asm calling convention, the first arg (RCX) is a this pointer, and args 2 and 3 are the explicit args to the member function (RDX and R8).
The JIT code-gen puts an extra test eax,eax after dec eax which already sets FLAGS (other than CF which we don't test) according to i - 1. The starting point is a positive compile time constant; any C compiler would have optimized this to dec eax / jnz. I think dec eax / jg would also work, falling through when dec produced zero, because 1 > 1 is false.
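That is, the countdown could have been just:
L0005: dec eax      # sets ZF (and SF/OF) from eax - 1
L0007: jnz L0005    # }while(--i != 0); no separate test needed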
The calling convention used by C# on x86-64 passes 8-byte structs in integer registers, which sucks for a struct that contains a double (because it has to get bounced to XMM registers for vaddsd or other FP operations). So there is an unavoidable downside for your struct for non-inline function calls.
### stand-alone versions of functions: not inlined into a loop
# with primitive double, args are passed in XMM regs
standalone.AddDoubles(Double, Double)
L0000: vzeroupper
L0003: vmovaps xmm0, xmm1 # stupid missed optimization defeating the purpose of AVX 3-operand instructions
L0008: vaddsd xmm0, xmm0, xmm2 # vaddsd xmm0, xmm1, xmm2 would do retval = a + b
L000d: ret
# without `in`. Significantly less bad with `in`, see the link.
standalone.AddDoubleStructs(DoubleStruct a, DoubleStruct b)
L0000: sub rsp, 0x18 # reserve 24 bytes of stack space
L0004: vzeroupper # Weird to use this in a function that doesn't have any YMM vectors...
L0007: mov [rsp+0x28], rdx # spill args 2 (rdx=double a) and 3 (r8=double b) to the stack.
L000c: mov [rsp+0x30], r8 # (first arg = rcx = unused this pointer)
L0011: mov rax, [rsp+0x28]
L0016: mov [rsp+0x10], rax # copy a to another place on the stack!
L001b: mov rax, [rsp+0x30]
L0020: mov [rsp+0x8], rax # copy b to another place on the stack!
L0025: vmovsd xmm0, qword [rsp+0x10]
L002c: vaddsd xmm0, xmm0, [rsp+0x8] # add a and b in the SSE/AVX FPU
L0033: vmovsd [rsp], xmm0 # store the result to yet another stack location
L0039: mov rax, [rsp] # reload it into RAX, the return value
L003d: add rsp, 0x18
L0041: ret
This is totally batshit insane. This is release-mode code-gen, but the compiler stores the structs to memory, then reloads+stores them again before actually loading them into the FPU. (I'm guessing the int->int copy might be a constructor, but I have no idea. I normally look at C/C++ compiler output which isn't usually this dumb in optimized builds).
Using in on the function arg avoids that extra copy of each input to a 2nd stack location, but it does still transfer them from integer to XMM with store/reload.
That's what gcc does for int->xmm with default tuning, but it's a missed optimization. Agner Fog says (in his microarch guide) that AMD's optimization manual suggests store/reload when tuning for Bulldozer, but he found it's not faster even on AMD. (Where ALU int->xmm has ~10 cycle latency, vs. 2 to 3 cycles on Intel or on Ryzen, with 1/clock throughput same as stores.)
A good implementation of this function (if we're stuck with the calling convention) would be vmovq xmm0, rdx / vmovq xmm1, r8, then vaddsd then vmovq rax, xmm0 / ret.
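Spelled out (a hand-written sketch, not compiler output):
vmovq   xmm0, rdx          # move a from integer arg reg straight to XMM
vmovq   xmm1, r8           # same for b
vaddsd  xmm0, xmm0, xmm1   # retval = a + b
vmovq   rax, xmm0          # 8-byte struct return goes back in RAX
ret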
Primitive double optimizes similarly to long:
- double c = a + b; optimizes away completely
- a = a + b (like @canton7 used) still does not, even though the result is still unused. This will bottleneck on vaddsd latency (3 to 5 cycles depending on Bulldozer vs. Ryzen vs. Intel pre-Skylake vs. Skylake), but it does stay in registers.
loops.AddDoubles(Double, Double)
L0000: vzeroupper
L0003: mov eax, 0x3b9aca00
# do {
L0008: vaddsd xmm1, xmm1, xmm2 # a += b
L000d: dec eax # --i
L000f: test eax, eax
L0011: jg L0008 # }while(i>0);
L0013: ret
All that store/reload overhead should go away after inlining the function into the loop; that's a large part of the point of inlining. Well surprise, it doesn't optimize away. 2x store/reload is on the critical path of the loop-carried data-dependency chain (of FP adds)!!! This is a huge missed optimization.
Store/reload latency is about 5 or 6 cycles on modern Intel, slower than an FP add. a is being loaded/stored on the way into XMM0, and then again on the way back out.
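(Rough math, assuming the above: two store-forwards at ~5-6 cycles each plus ~4 cycles of vaddsd puts the loop-carried chain at ~14-16 cycles per iteration, which would be ~3.5 s or more for 1e9 iterations at ~4 GHz. The measured 1,250 ms is well under that, which fits the earlier point that the asm BenchmarkDotNet actually timed probably isn't exactly this SharpLab output.)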
loops.AddDoubleStructs(DoubleStruct, DoubleStruct)
L0000: sub rsp, 0x18
L0004: vzeroupper
L0007: mov [rsp+0x28], rdx # spill function args: a
L000c: mov [rsp+0x30], r8 # and b
L0011: mov eax, 0x3b9aca00 # i= init
# do {
L0016: mov rdx, [rsp+0x28]
L001b: mov [rsp+0x10], rdx # tmp_a = copy a to another local
L0020: mov rdx, [rsp+0x30]
L0025: mov [rsp+0x8], rdx # tmp_b = copy b
L002a: vmovsd xmm0, qword [rsp+0x10] # tmp_a
L0031: vaddsd xmm0, xmm0, [rsp+0x8] # + tmp_b
L0038: vmovsd [rsp], xmm0 # tmp_a = sum
L003e: mov rdx, [rsp]
L0042: mov [rsp+0x28], rdx # a = copy tmp_a
L0047: dec eax # --i;
L0049: test eax, eax
L004b: jg L0016 # }while(i>0)
L004d: add rsp, 0x18
L0051: ret
The primitive double loop optimizes to a simple loop keeping everything in registers, no clever optimization that would violate strict FP. i.e. no turning it into a multiply, or using multiple accumulators to hide FP add latency. (But we know from the long version that the compiler wouldn't do anything better regardless.) It does all the additions as one long dependency chain, so one addsd per 3 (Broadwell or earlier, Ryzen) or 4 cycles (Skylake).
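For reference, "multiple accumulators" is the transformation a human is allowed to make but a strict-FP compiler isn't, because it changes rounding. A hypothetical hand-unrolled version of the a = a + b loop:
double acc0 = a, acc1 = 0, acc2 = 0, acc3 = 0;
for (var i = 250000000; i > 0; --i)   // 4 adds per iteration = 1e9 adds total
{
    acc0 += b;   // four independent dependency chains, so the CPU can overlap
    acc1 += b;   // them and run at vaddsd *throughput* (1-2 per clock)
    acc2 += b;   // instead of serialising on vaddsd *latency* (3-5 cycles)
    acc3 += b;
}
a = acc0 + acc1 + acc2 + acc3;   // same value as 1e9 serial adds, up to FP rounding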