I ran some benchmarks to compare the performance of doubles and floats. I was very surprised to see that doubles are much faster than floats.
I have seen some discussion about this, for example:
Is using double faster than float?
Are doubles faster than floats in c#?
Most of the answers there say that double and float performance can be similar because of double-precision optimizations and so on. But I saw a 2x performance improvement when using doubles! How is that possible? What makes it worse is that I'm using a 32-bit machine, which according to some posts is expected to perform better with floats...
I used C# to measure it precisely, but I see that a similar C++ implementation behaves the same way.
Code I used to check it:
static void Main(string[] args)
{
  double[,] doubles = new double[64, 64];
  float[,] floats = new float[64, 64];
  System.Diagnostics.Stopwatch s = new System.Diagnostics.Stopwatch();
  s.Restart();
  CalcDoubles(doubles);
  s.Stop();
  long doubleTime = s.ElapsedMilliseconds;
  s.Restart();
  CalcFloats(floats);
  s.Stop();
  long floatTime = s.ElapsedMilliseconds;
  Console.WriteLine("Doubles time: " + doubleTime + " ms");
  Console.WriteLine("Floats time: " + floatTime + " ms");
}
private static void CalcDoubles(double[,] arr)
{
  unsafe
  {
    fixed (double* p = arr)
    {
      for (int b = 0; b < 192 * 12; ++b)
      {
        for (int i = 0; i < 64; ++i)
        {
          for (int j = 0; j < 64; ++j)
          {
            double* addr = (p + i * 64 + j);
            double arrij = *addr;
            arrij = arrij == 0 ? 1.0f / (i * j) : arrij * (double)i / j;
            *addr = arrij;
          }
        }
      }
    }
  }
}
private static void CalcFloats(float[,] arr)
{
  unsafe
  {
    fixed (float* p = arr)
    {
      for (int b = 0; b < 192 * 12; ++b)
      {
        for (int i = 0; i < 64; ++i)
        {
          for (int j = 0; j < 64; ++j)
          {
            float* addr = (p + i * 64 + j);
            float arrij = *addr;
            arrij = arrij == 0 ? 1.0f / (i * j) : arrij * (float)i / j;
            *addr = arrij;
          }
        }
      }
    }
  }
}
I'm using a very weak notebook: Intel Atom N455 processor (dual core, 1.67GHz, 32bit) with 2GB RAM.
It looks like the jitter optimizer drops the ball here: it doesn't suppress a redundant store in the float case. The hot code is the 1.0f / (i * j) calculation, since all array values are 0. The x86 jitter generates:
01062928  mov         eax,edx                     ; eax = i
0106292A  imul        eax,esi                     ; eax = i * j
0106292D  mov         dword ptr [ebp-10h],eax     ; store to mem
01062930  fild        dword ptr [ebp-10h]         ; convert to double 
01062933  fstp        dword ptr [ebp-10h]         ; redundant store, convert to float
01062936  fld         dword ptr [ebp-10h]         ; redundant load
01062939  fld1                                    ; 1.0f
0106293B  fdivrp      st(1),st                    ; 1.0f / (i * j)
0106293D  fstp        dword ptr [ecx]             ; arrij = result
The x64 jitter:
00007FFCFD6440B0  cvtsi2ss    xmm0,r10d           ; (float)(i * j)
00007FFCFD6440B5  movss       xmm1,dword ptr [7FFCFD644118h]  ; 1.0f
00007FFCFD6440BD  divss       xmm1,xmm0           ; 1.0f / (i * j)
00007FFCFD6440C1  cvtss2sd    xmm0,xmm1           ; redundant store 
00007FFCFD6440C5  cvtsd2ss    xmm0,xmm0           ; redundant load
00007FFCFD6440C9  movss       dword ptr [rax+r11],xmm0  ; arrij = result
I marked the superfluous instructions with "redundant". The optimizer did manage to eliminate them in the double version so that code runs faster.
The redundant stores are actually present in the IL generated by the C# compiler; it is the job of the optimizer to detect and remove them. Notably, both the x86 and the x64 jitter have this flaw, so it looks like a general oversight in the optimizer algorithm.
The x64 code is especially noteworthy for converting the float result to double and then back to float again, which suggests that the underlying problem is a data type conversion that the optimizer doesn't know how to suppress. You also see it in the x86 code, where the redundant store actually performs a double to float conversion. Eliminating the conversion looks difficult in the x86 case, so this may well have leaked into the x64 jitter.
Do note that the x64 code runs significantly faster than the x86 code, so be sure to set the Platform target to AnyCPU for a simple win. At least part of that speedup comes from the optimizer's smarts at hoisting the integer multiplication.
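A quick way to confirm which jitter is actually in play is to print the bitness of the running process. This is only a minimal sketch, and the class name BitnessCheck is made up for illustration. Keep in mind that an AnyCPU build still runs as a 32-bit process on a 32-bit operating system, and newer Visual Studio project templates may also enable a "Prefer 32-bit" option, so the Platform target alone does not guarantee the x64 code path.

```csharp
using System;

static class BitnessCheck
{
    // Describes whether the current process runs as 32-bit or 64-bit.
    public static string Describe()
    {
        string bitness = Environment.Is64BitProcess ? "64-bit" : "32-bit";
        return bitness + " process, pointer size " + IntPtr.Size + " bytes";
    }

    public static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```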
And do make sure to test with realistic data; your measurement is fundamentally invalid because the array content is never initialized. The difference is much less pronounced with non-zero data in the elements, since that makes the division much more expensive.
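A benchmark along these lines could look like the sketch below. It is an assumption-laden variant rather than your original code: it uses plain array indexing instead of pointers, reads from a pre-filled source array and writes to a separate destination so the values stay bounded across passes, starts i and j at 1 to avoid dividing by zero, and does one untimed warm-up call so JIT compilation is not measured. The names FloatVsDoubleBench, MakeDoubles, MakeFloats and TimeMs are made up for illustration.

```csharp
using System;
using System.Diagnostics;

static class FloatVsDoubleBench
{
    const int N = 64;
    const int Passes = 192 * 12;

    // Fill with non-zero values in [1, 2) so every element is "realistic" data.
    public static double[,] MakeDoubles()
    {
        var a = new double[N, N];
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                a[i, j] = 1.0 + (i * N + j) / (double)(N * N);
        return a;
    }

    public static float[,] MakeFloats()
    {
        var a = new float[N, N];
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                a[i, j] = 1.0f + (i * N + j) / (float)(N * N);
        return a;
    }

    // Same multiply/divide as the question, but src is never overwritten.
    public static void CalcDoubles(double[,] src, double[,] dst)
    {
        for (int b = 0; b < Passes; ++b)
            for (int i = 1; i < N; ++i)
                for (int j = 1; j < N; ++j)
                    dst[i, j] = src[i, j] * i / j;
    }

    public static void CalcFloats(float[,] src, float[,] dst)
    {
        for (int b = 0; b < Passes; ++b)
            for (int i = 1; i < N; ++i)
                for (int j = 1; j < N; ++j)
                    dst[i, j] = src[i, j] * i / j;
    }

    // Runs the action once untimed (warm-up), then times a second run.
    static double TimeMs(Action action)
    {
        action();
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    public static void Main()
    {
        double[,] dSrc = MakeDoubles(), dDst = new double[N, N];
        float[,] fSrc = MakeFloats(), fDst = new float[N, N];
        Console.WriteLine("Doubles time: " + TimeMs(() => CalcDoubles(dSrc, dDst)) + " ms");
        Console.WriteLine("Floats time: " + TimeMs(() => CalcFloats(fSrc, fDst)) + " ms");
    }
}
```

Build it in Release mode and run it without the debugger attached, otherwise the optimizer is disabled and the timings say little about the generated code discussed above.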
Also note the bug in your double case: you should not use 1.0f there.
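To see why the suffix matters, here is a small sketch (the class and method names are made up for illustration). With 1.0f the division is carried out as a float division and only the result is widened to double, so the double version silently loses precision; with 1.0 the whole expression is evaluated in double. The example uses constants, which the C# compiler folds at compile time.

```csharp
using System;

static class LiteralSuffixDemo
{
    // float / int is a float division; the float result is then widened to double.
    public static double ViaFloatLiteral() { return 1.0f / 3; }

    // double / int is a double division.
    public static double ViaDoubleLiteral() { return 1.0 / 3; }

    public static void Main()
    {
        Console.WriteLine(ViaFloatLiteral().ToString("R"));
        Console.WriteLine(ViaDoubleLiteral().ToString("R"));
        Console.WriteLine(ViaFloatLiteral() == ViaDoubleLiteral()); // prints False
    }
}
```

In CalcDoubles the fix is simply to write 1.0 / (i * j) instead of 1.0f / (i * j).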