Does rewriting memcpy/memcmp/... with SIMD instructions make sense in large-scale software?
If so, why doesn't GCC generate SIMD instructions for these library functions by default?
Also, are there any other functions that could be improved with SIMD?
memcpy is likely to be the fastest way you can copy bytes around in memory. If you need something faster - try figuring out a way of not copying things around, e.g. swap pointers only, not the data itself.
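To illustrate the "swap pointers, not the data" idea, here is a minimal sketch in C. The `frame_buffers` struct and `swap_buffers` function are hypothetical names made up for this example; the point is only that exchanging two pointers costs the same no matter how large the buffers are.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical double-buffer holder, used only for illustration. */
struct frame_buffers {
    unsigned char *front;  /* buffer currently being read */
    unsigned char *back;   /* buffer currently being written */
    size_t size;           /* size of each buffer in bytes */
};

/* Exchange the roles of the two buffers in O(1); no payload bytes are copied. */
static void swap_buffers(struct frame_buffers *fb)
{
    assert(fb != NULL);
    unsigned char *tmp = fb->front;
    fb->front = fb->back;
    fb->back = tmp;
}
```

This only works when the consumer can follow the pointer; if it needs the bytes at a fixed address, a copy is still required.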
memmove() is similar to memcpy() as it also copies data from a source to destination.
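The practical difference is that memmove() is defined for overlapping source and destination, while memcpy() is not. A small sketch, with `erase_bytes` being a hypothetical helper written for this example, that shifts the tail of a buffer left over itself:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Remove `count` bytes starting at `pos` from a buffer holding `len` bytes
   by shifting the tail left. Source and destination overlap, so memmove()
   is required here; memcpy() would be undefined behavior.
   Returns the new length. */
static size_t erase_bytes(char *buf, size_t len, size_t pos, size_t count)
{
    assert(buf != NULL && pos <= len && count <= len - pos);
    memmove(buf + pos, buf + pos + count, len - pos - count);
    return len - count;
}
```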
For example, some implementations of the memset, memcpy, or memmove standard C library routines use SSE2 instructions for better throughput.
Yes, these functions are much faster with SSE instructions. It would be nice if your runtime library or compiler intrinsics included optimized versions, but that doesn't seem to be pervasive.
I have a custom SIMD memchr which is a hell of a lot faster than the library version, especially when I'm finding the first of 2 or 3 characters (for example, to find out whether there's an equation in a line of text, I search for the first of =, \n, \r).
On the other hand, the library functions are well tested, so it's only worth writing your own if you call them a lot and a profiler shows they're a significant fraction of your CPU time.
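The answer does not show its memchr, so the following is not that code, just a rough sketch of how a "first of three characters" search could look with SSE2 intrinsics. The function name `find_first_of3` is made up for this example, and `__builtin_ctz` assumes GCC or Clang on an x86 target.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Return a pointer to the first byte in buf[0..len) equal to c0, c1 or c2,
   or NULL if none is found. Hypothetical helper, not a library function. */
static const char *find_first_of3(const char *buf, size_t len,
                                  char c0, char c1, char c2)
{
    assert(buf != NULL || len == 0);
    const __m128i v0 = _mm_set1_epi8(c0);
    const __m128i v1 = _mm_set1_epi8(c1);
    const __m128i v2 = _mm_set1_epi8(c2);
    size_t i = 0;

    /* Compare 16 bytes at a time against all three needles. */
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i hit = _mm_or_si128(
            _mm_cmpeq_epi8(chunk, v0),
            _mm_or_si128(_mm_cmpeq_epi8(chunk, v1),
                         _mm_cmpeq_epi8(chunk, v2)));
        int mask = _mm_movemask_epi8(hit);  /* one bit per matching byte */
        if (mask != 0)
            return buf + i + (size_t)__builtin_ctz((unsigned)mask);
    }

    /* Scalar loop for the remaining tail of fewer than 16 bytes. */
    for (; i < len; i++) {
        if (buf[i] == c0 || buf[i] == c1 || buf[i] == c2)
            return buf + i;
    }
    return NULL;
}
```

Calling `find_first_of3(line, len, '=', '\n', '\r')` covers the equation check described above. Whether this actually beats your library's memchr is something only a profiler can tell you.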
It does not make sense. Your compiler ought to be emitting these instructions implicitly for memcpy/memcmp/similar intrinsics, if it is able to emit SIMD at all.
You may need to explicitly instruct GCC to emit SSE opcodes with e.g. -msse -msse2; some GCCs do not enable them by default. Also, if you do not tell GCC to optimize (i.e. -O2, with a capital O; a lowercase -o names the output file), it won't even try to emit fast code.
The use of SIMD opcodes for memory work like this can have a massive performance impact, because the SSE instruction sets also include cache prefetch hints and non-temporal (cache-bypassing) stores that are important for optimizing bus access. But that doesn't mean that you need to emit them manually; even though most compilers stink at emitting SIMD ops generally, every one I've used at least handles them for the basic CRT memory functions.
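One way to check this on your own toolchain is to compile a small fixed-size copy with optimization and read the assembly (for GCC, something like `gcc -O2 -S`). The `packet` struct and `copy_packet` function below are hypothetical, picked only to give the compiler a constant size to work with; the exact instructions you get depend on compiler version and target.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 64-byte record, chosen only for illustration. */
struct packet {
    unsigned char payload[64];
};

/* The size is a compile-time constant, so an optimizing compiler is free to
   replace this call with a few inline wide loads and stores instead of
   calling the library routine. */
static void copy_packet(struct packet *dst, const struct packet *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof *dst);
}
```

On x86-64 with optimization enabled, you will typically see SSE register moves in the output rather than a call to memcpy, without writing any intrinsics yourself.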
Basic math functions can also benefit a lot from setting the compiler to SSE mode. You can easily get an 8x speedup on basic sqrt() just by telling the compiler to use the SSE opcode instead of the terrible old x87 FPU (with GCC on 32-bit x86, that means -mfpmath=sse together with -msse2).