A while ago I read somewhere that SSE intrinsic functions compile into efficient machine code because compilers treat them differently from ordinary functions. I am wondering how compilers actually do this and what C programmers can do to facilitate the process. Are there any guidelines on how to use intrinsic functions in a manner that makes the compiler's job of generating efficient machine code easier?
Thanks.
Contrary to what Necrolis wrote, the intrinsics may or may not compile down to the instructions they represent. This is especially true for copy or load instructions such as _mm_load_pd, since the compiler is still responsible for register allocation and assignment when using intrinsics. This means that copying a value from one location to another may not be necessary at all, if the two locations can be represented by the same register. In that case the compiler may choose to remove the copy. It may also choose to remove other instructions if the result is never used.
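To make that concrete, here is a minimal sketch, assuming an x86 target with SSE2 and a compiler that provides `<emmintrin.h>`. The function name `add_pairs` and the deliberately redundant steps are made up for illustration. The store/reload through `tmp` and the unused multiply are written out as intrinsics, but an optimizing compiler is free to keep `sum` in a register and to drop the multiply entirely, so they may not show up in the generated code at all.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: __m128d, _mm_loadu_pd, _mm_add_pd, ... */
#include <stddef.h>

/* Hypothetical example: out[i] = a[i] + b[i], two doubles per iteration.
 * n must be even. Unaligned loads/stores are used so that the caller
 * does not have to guarantee 16-byte alignment. */
void add_pairs(const double *a, const double *b, double *out, size_t n)
{
    assert(n % 2 == 0);

    for (size_t i = 0; i < n; i += 2) {
        __m128d va  = _mm_loadu_pd(a + i);
        __m128d vb  = _mm_loadu_pd(b + i);
        __m128d sum = _mm_add_pd(va, vb);

        /* Redundant round trip through memory. The compiler may keep
         * 'sum' in a register and remove both the store and the load. */
        double tmp[2];
        _mm_storeu_pd(tmp, sum);
        __m128d again = _mm_loadu_pd(tmp);

        /* The result is never used, so the compiler may remove the
         * multiply altogether. */
        __m128d unused = _mm_mul_pd(va, vb);
        (void)unused;

        _mm_storeu_pd(out + i, again);
    }
}
```

To see what your own compiler does with it, compile with optimization enabled and inspect the assembly (for example `gcc -O2 -S`), then compare against an unoptimized build.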
Check out this blog post where the behavior of different compilers is compared in practice. It's from 2009, so the details may no longer apply. However, newer compilers are likely to optimize your code more, not less.
As for actually using intrinsics efficiently, the answer is the same as for all other performance optimization: measure, measure and measure. Make sure that you are actually dealing with a hot piece of code, find out why it's slow, and then improve it. You are very likely to find that improving your memory access patterns is more important than using intrinsics.
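As a rough illustration of measuring before optimizing, here is a small sketch that times two traversals of the same array using the standard `clock()` function. The names `grid`, `fill_grid`, `sum_rows_first`, `sum_cols_first` and `seconds_for` are invented for this example, and the array size is an arbitrary assumption. Both functions compute the same sum; only the order in which memory is touched differs.

```c
#include <stddef.h>
#include <time.h>

#define N 512

/* Hypothetical test data: N x N doubles, stored row by row as C requires. */
double grid[N][N];

void fill_grid(void)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            grid[i][j] = 1.0;
}

/* Inner loop walks consecutive addresses. */
double sum_rows_first(void)
{
    double total = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            total += grid[i][j];
    return total;
}

/* Inner loop jumps N doubles between accesses. */
double sum_cols_first(void)
{
    double total = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            total += grid[i][j];
    return total;
}

/* Runs fn once, stores its result, and returns the CPU time used in seconds. */
double seconds_for(double (*fn)(void), double *result)
{
    clock_t start = clock();
    *result = fn();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Call `fill_grid()` once, then pass each summing function to `seconds_for` and compare the times. A single run at this size is noisy and `clock()` is coarse, so repeat the measurement several times, or use a profiler, before drawing conclusions.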