I've already profiled and am now looking to squeeze every last bit of performance out of my hot spot.
I know about [MethodImplOptions.AggressiveInlining] and the ProfileOptimization class. Are there any others?
[Edit] I just discovered [TargetedPatchingOptOut] as well. Never mind, apparently that one is not needed.
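For reference, here is a minimal sketch of how the two options named in the question are typically applied. The class names, the Clamp helper and the profile file name "Startup.profile" are made up for illustration; only MethodImplOptions.AggressiveInlining and the ProfileOptimization calls come from the framework.

```csharp
using System;
using System.Runtime;                  // ProfileOptimization
using System.Runtime.CompilerServices; // MethodImplOptions

static class HotSpot
{
    // A hint only: the JIT may still decline to inline this method.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int Clamp(int value, int min, int max)
    {
        return value < min ? min : (value > max ? max : value);
    }

    // Hypothetical hot loop that benefits if Clamp is inlined.
    public static long SumClamped(int[] data, int min, int max)
    {
        long sum = 0;
        for (int i = 0; i < data.Length; i++)
        {
            sum += Clamp(data[i], min, max);
        }
        return sum;
    }
}

static class Program
{
    static void Main()
    {
        // Multicore JIT: record which methods get jitted during this run so
        // later runs can compile them ahead of use on a background thread.
        // The folder and file name are placeholders for this sketch.
        ProfileOptimization.SetProfileRoot(AppDomain.CurrentDomain.BaseDirectory);
        ProfileOptimization.StartProfile("Startup.profile");

        int[] data = { -5, 3, 42, 100 };
        Console.WriteLine(HotSpot.SumClamped(data, 0, 50)); // prints 95
    }
}
```

Note that ProfileOptimization mainly helps startup time; it does not change the quality of the machine code generated for the hot spot itself.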
To help the JIT compiler analyze the method, its bytecodes are first reformulated in an internal representation called trees, which resembles machine code more closely than bytecodes. Analysis and optimizations are then performed on the trees of the method. At the end, the trees are translated into native code.
In theory, a Just-in-Time (JIT) compiler has an advantage over an Ahead-of-Time (AOT) compiler if it has enough time and computational resources available. JIT-compiled code can be faster because the machine code is generated on the exact machine it will execute on, so the compiler can target that specific CPU.
Compiler optimization is generally implemented using a sequence of optimizing transformations, algorithms which take a program and transform it to produce a semantically equivalent output program that uses fewer resources or executes faster.
Optimization is a program transformation technique that tries to improve code by making it consume fewer resources (e.g. CPU, memory) and run faster. During optimization, high-level general programming constructs are replaced by more efficient low-level code.
Yes there are more tricks :-)
I've actually done quite a bit of research on optimizing C# code. So far, these are the most significant results:
Not implementing IEquatable<T> is usually a bad plan, so if you use e.g. a hash, be sure to implement the right overloads and interfaces, because it'll save you a ton of performance.
Foo[], even Foo[][], is normally faster than Foo[,].
There also used to be a guide called "optimization for the intel pentium processor" with a large number of tricks (like shifting or multiplying instead of dividing). While the compiler makes a fine effort nowadays, this still sometimes helps a bit.
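The hashing advice can be sketched as follows: a struct used as a hash key that implements IEquatable<T> together with matching Equals and GetHashCode overrides, so hash-based collections don't fall back to the slower default comparison with boxing. The Point type and the hash formula are illustrative choices, not taken from the original research.

```csharp
using System;
using System.Collections.Generic;

// A small value type used as a hash key (hypothetical example type).
struct Point : IEquatable<Point>
{
    public readonly int X;
    public readonly int Y;

    public Point(int x, int y) { X = x; Y = y; }

    // Strongly typed comparison: picked up by EqualityComparer<Point>.Default,
    // so lookups avoid boxing the struct.
    public bool Equals(Point other) { return X == other.X && Y == other.Y; }

    // Keep the object overload consistent with the typed one.
    public override bool Equals(object obj) { return obj is Point && Equals((Point)obj); }

    // Must agree with Equals: equal points yield equal hash codes.
    public override int GetHashCode() { unchecked { return (X * 397) ^ Y; } }
}

static class Program
{
    static void Main()
    {
        var visited = new HashSet<Point>();
        visited.Add(new Point(1, 2));

        Console.WriteLine(visited.Contains(new Point(1, 2))); // True
        Console.WriteLine(visited.Contains(new Point(2, 1))); // False
    }
}
```

How much this saves depends on the type and the workload, so it is worth measuring in your own hot spot.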
Of course these are just micro-optimizations; the biggest performance gains usually come from changing the algorithm and/or data structure. Be sure to check which options are available to you and don't restrict yourself too much to what the .NET framework offers. I also have a natural tendency to distrust the .NET implementation until I've checked the decompiled code myself: there's a ton of stuff that could have been implemented much faster (and most of the time it is the way it is for good reasons).
HTH
Alex pointed out to me that Array.Copy is actually faster according to some people. And since I really don't know what has changed over the years, I decided that the only proper course of action is to create a fresh new benchmark and put it to the test.
If you're just interested in the results, scroll down. In most cases the call to Buffer.BlockCopy clearly outperforms Array.Copy. Tested on an Intel Skylake with 16 GB memory (>10 GB free) on .NET 4.5.2.
Code:
    using System;
    using System.Diagnostics;

    static class Program
    {
        // Each test copies roughly 1 GB in total, in blocks of K bytes.

        // Array.Copy between two separate buffers.
        static void TestNonOverlapped1(int K)
        {
            long total = 1000000000;
            long iter = total / K;
            byte[] tmp = new byte[K];
            byte[] tmp2 = new byte[K];
            for (long i = 0; i < iter; ++i)
            {
                Array.Copy(tmp, tmp2, K);
            }
        }

        // Buffer.BlockCopy between two separate buffers.
        static void TestNonOverlapped2(int K)
        {
            long total = 1000000000;
            long iter = total / K;
            byte[] tmp = new byte[K];
            byte[] tmp2 = new byte[K];
            for (long i = 0; i < iter; ++i)
            {
                Buffer.BlockCopy(tmp, 0, tmp2, 0, K);
            }
        }

        // Array.Copy within one buffer, shifted by 16 bytes (overlapping ranges).
        static void TestOverlapped1(int K)
        {
            long total = 1000000000;
            long iter = total / K;
            byte[] tmp = new byte[K + 16];
            for (long i = 0; i < iter; ++i)
            {
                Array.Copy(tmp, 0, tmp, 16, K);
            }
        }

        // Buffer.BlockCopy within one buffer, shifted by 16 bytes (overlapping ranges).
        static void TestOverlapped2(int K)
        {
            long total = 1000000000;
            long iter = total / K;
            byte[] tmp = new byte[K + 16];
            for (long i = 0; i < iter; ++i)
            {
                Buffer.BlockCopy(tmp, 0, tmp, 16, K);
            }
        }

        static void Main(string[] args)
        {
            // Block sizes from 16 bytes up to 8192 bytes, doubling each round.
            for (int i = 0; i < 10; ++i)
            {
                int N = 16 << i;

                Console.WriteLine("Block size: {0} bytes", N);

                Stopwatch sw = Stopwatch.StartNew();

                {
                    sw.Restart();
                    TestNonOverlapped1(N);

                    Console.WriteLine("Non-overlapped Array.Copy: {0:0.00} ms", sw.Elapsed.TotalMilliseconds);
                    GC.Collect(GC.MaxGeneration);
                    GC.WaitForFullGCComplete();
                }

                {
                    sw.Restart();
                    TestNonOverlapped2(N);

                    Console.WriteLine("Non-overlapped Buffer.BlockCopy: {0:0.00} ms", sw.Elapsed.TotalMilliseconds);
                    GC.Collect(GC.MaxGeneration);
                    GC.WaitForFullGCComplete();
                }

                {
                    sw.Restart();
                    TestOverlapped1(N);

                    Console.WriteLine("Overlapped Array.Copy: {0:0.00} ms", sw.Elapsed.TotalMilliseconds);
                    GC.Collect(GC.MaxGeneration);
                    GC.WaitForFullGCComplete();
                }

                {
                    sw.Restart();
                    TestOverlapped2(N);

                    Console.WriteLine("Overlapped Buffer.BlockCopy: {0:0.00} ms", sw.Elapsed.TotalMilliseconds);
                    GC.Collect(GC.MaxGeneration);
                    GC.WaitForFullGCComplete();
                }

                Console.WriteLine("-------------------------");
            }

            Console.ReadLine();
        }
    }

Results on x86 JIT:
    Block size: 16 bytes
    Non-overlapped Array.Copy: 4267.52 ms
    Non-overlapped Buffer.BlockCopy: 2887.05 ms
    Overlapped Array.Copy: 3305.01 ms
    Overlapped Buffer.BlockCopy: 2670.18 ms
    -------------------------
    Block size: 32 bytes
    Non-overlapped Array.Copy: 1327.55 ms
    Non-overlapped Buffer.BlockCopy: 763.89 ms
    Overlapped Array.Copy: 2334.91 ms
    Overlapped Buffer.BlockCopy: 2158.49 ms
    -------------------------
    Block size: 64 bytes
    Non-overlapped Array.Copy: 705.76 ms
    Non-overlapped Buffer.BlockCopy: 390.63 ms
    Overlapped Array.Copy: 1303.00 ms
    Overlapped Buffer.BlockCopy: 1103.89 ms
    -------------------------
    Block size: 128 bytes
    Non-overlapped Array.Copy: 361.18 ms
    Non-overlapped Buffer.BlockCopy: 219.77 ms
    Overlapped Array.Copy: 620.21 ms
    Overlapped Buffer.BlockCopy: 577.20 ms
    -------------------------
    Block size: 256 bytes
    Non-overlapped Array.Copy: 192.92 ms
    Non-overlapped Buffer.BlockCopy: 108.71 ms
    Overlapped Array.Copy: 347.63 ms
    Overlapped Buffer.BlockCopy: 353.40 ms
    -------------------------
    Block size: 512 bytes
    Non-overlapped Array.Copy: 104.69 ms
    Non-overlapped Buffer.BlockCopy: 65.65 ms
    Overlapped Array.Copy: 211.77 ms
    Overlapped Buffer.BlockCopy: 202.94 ms
    -------------------------
    Block size: 1024 bytes
    Non-overlapped Array.Copy: 52.93 ms
    Non-overlapped Buffer.BlockCopy: 38.84 ms
    Overlapped Array.Copy: 144.39 ms
    Overlapped Buffer.BlockCopy: 154.09 ms
    -------------------------
    Block size: 2048 bytes
    Non-overlapped Array.Copy: 45.64 ms
    Non-overlapped Buffer.BlockCopy: 30.11 ms
    Overlapped Array.Copy: 118.33 ms
    Overlapped Buffer.BlockCopy: 109.16 ms
    -------------------------
    Block size: 4096 bytes
    Non-overlapped Array.Copy: 30.93 ms
    Non-overlapped Buffer.BlockCopy: 30.72 ms
    Overlapped Array.Copy: 119.73 ms
    Overlapped Buffer.BlockCopy: 104.66 ms
    -------------------------
    Block size: 8192 bytes
    Non-overlapped Array.Copy: 30.37 ms
    Non-overlapped Buffer.BlockCopy: 26.63 ms
    Overlapped Array.Copy: 90.46 ms
    Overlapped Buffer.BlockCopy: 87.40 ms
    -------------------------
Results on x64 JIT:
    Block size: 16 bytes
    Non-overlapped Array.Copy: 1252.71 ms
    Non-overlapped Buffer.BlockCopy: 694.34 ms
    Overlapped Array.Copy: 701.27 ms
    Overlapped Buffer.BlockCopy: 573.34 ms
    -------------------------
    Block size: 32 bytes
    Non-overlapped Array.Copy: 995.47 ms
    Non-overlapped Buffer.BlockCopy: 654.70 ms
    Overlapped Array.Copy: 398.48 ms
    Overlapped Buffer.BlockCopy: 336.86 ms
    -------------------------
    Block size: 64 bytes
    Non-overlapped Array.Copy: 498.86 ms
    Non-overlapped Buffer.BlockCopy: 329.15 ms
    Overlapped Array.Copy: 218.43 ms
    Overlapped Buffer.BlockCopy: 179.95 ms
    -------------------------
    Block size: 128 bytes
    Non-overlapped Array.Copy: 263.00 ms
    Non-overlapped Buffer.BlockCopy: 196.71 ms
    Overlapped Array.Copy: 137.21 ms
    Overlapped Buffer.BlockCopy: 107.02 ms
    -------------------------
    Block size: 256 bytes
    Non-overlapped Array.Copy: 144.31 ms
    Non-overlapped Buffer.BlockCopy: 101.23 ms
    Overlapped Array.Copy: 85.49 ms
    Overlapped Buffer.BlockCopy: 69.30 ms
    -------------------------
    Block size: 512 bytes
    Non-overlapped Array.Copy: 76.76 ms
    Non-overlapped Buffer.BlockCopy: 55.31 ms
    Overlapped Array.Copy: 61.99 ms
    Overlapped Buffer.BlockCopy: 54.06 ms
    -------------------------
    Block size: 1024 bytes
    Non-overlapped Array.Copy: 44.01 ms
    Non-overlapped Buffer.BlockCopy: 33.30 ms
    Overlapped Array.Copy: 53.13 ms
    Overlapped Buffer.BlockCopy: 51.36 ms
    -------------------------
    Block size: 2048 bytes
    Non-overlapped Array.Copy: 27.05 ms
    Non-overlapped Buffer.BlockCopy: 25.57 ms
    Overlapped Array.Copy: 46.86 ms
    Overlapped Buffer.BlockCopy: 47.83 ms
    -------------------------
    Block size: 4096 bytes
    Non-overlapped Array.Copy: 29.11 ms
    Non-overlapped Buffer.BlockCopy: 25.12 ms
    Overlapped Array.Copy: 45.05 ms
    Overlapped Buffer.BlockCopy: 47.84 ms
    -------------------------
    Block size: 8192 bytes
    Non-overlapped Array.Copy: 24.95 ms
    Non-overlapped Buffer.BlockCopy: 21.52 ms
    Overlapped Array.Copy: 43.81 ms
    Overlapped Buffer.BlockCopy: 43.22 ms
    -------------------------

You've exhausted the options added in .NET 4.5 to affect the jitted code directly. The next step is to look at the generated machine code to spot any obvious inefficiencies. Do so with the debugger, but first prevent it from disabling the optimizer: Tools + Options, Debugging, General, untick the "Suppress JIT optimization on module load" option. Then set a breakpoint on the hot code and use Debug + Windows + Disassembly to look at it.
There are not that many inefficiencies to look for; the jitter optimizer in general does an excellent job. One thing to look for is a failed attempt at eliminating an array bounds check; the fixed keyword is an unsafe workaround for that. A corner case is a failed attempt at inlining a method that leaves the jitter not using CPU registers effectively. This is an issue with the x86 jitter and is fixed with MethodImplOptions.NoInlining. The optimizer is not terribly efficient at hoisting invariant code out of a loop, but that's something you'd almost always consider first anyway when staring at the C# code looking for ways to optimize it.
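Two of the points above can be sketched in safe code. This is a minimal illustration under assumptions: the Loops class and its method names are hypothetical, and whether the bounds check is actually removed depends on the jitter version, so check the disassembly rather than trusting the pattern.

```csharp
using System;

static class Loops
{
    // Looping up to data.Length on the array itself is the pattern the jitter
    // recognizes most reliably, so it can usually drop the per-access bounds check.
    public static long Sum(int[] data)
    {
        long sum = 0;
        for (int i = 0; i < data.Length; i++)
        {
            sum += data[i];
        }
        return sum;
    }

    // Manual hoisting: the square root does not depend on i, so compute it
    // once before the loop instead of relying on the optimizer to move it.
    public static void AddDistance(double[] values, double a, double b)
    {
        double offset = Math.Sqrt(a * a + b * b); // loop-invariant
        for (int i = 0; i < values.Length; i++)
        {
            values[i] += offset;
        }
    }
}

static class Program
{
    static void Main()
    {
        Console.WriteLine(Loops.Sum(new[] { 1, 2, 3, 4 })); // prints 10

        double[] values = { 0.0, 1.0 };
        Loops.AddDistance(values, 3.0, 4.0);
        Console.WriteLine(values[1]); // prints 6
    }
}
```

The fixed keyword mentioned above goes one step further by using raw pointers, which needs an unsafe context; it is left out here to keep the sketch verifiable.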
The most important thing to know is when you are done and just can't hope to make it any faster. You can only really get there by comparing apples and oranges: write the hot code in native code using C++/CLI and compare. Make sure that this code is compiled with #pragma unmanaged in effect so it gets the full optimizer love. There's a cost associated with switching from managed code to native code execution, so do make sure the execution time of the native code is substantial enough. This is otherwise not necessarily easy to do and you certainly won't have a guarantee of success. Still, knowing you are done can save you a lot of time stumbling into blind alleys.