I have implemented an SSE4.2 version of memcpy, but I cannot seem to beat _intel_fast_memcpy on a Xeon v3. I use my routine in a gather routine in which the data varies between 4 and 15 bytes at each location. I've looked at many posts here and on Intel's website with no luck. What is a good source I should look at?
Can you do your gathers with a 16B load and store, and then just overlap however many garbage bytes were at the end?
// pseudocode: src1 / src2 and their sizes stand in for however your gather
// tracks each element's location and length (4..15 bytes each).
char *dst = something;
__m128 tmp = _mm_loadu_ps((const float*)src1);  // always load a full 16B
_mm_storeu_ps((float*)dst, tmp);                // always store a full 16B
dst += src1_size;                               // but only advance by the element's real size

tmp = _mm_loadu_ps((const float*)src2);
_mm_storeu_ps((float*)dst, tmp);
dst += src2_size;
...
Overlapping stores are efficient (the L1 cache soaks them up just fine), and modern CPUs should handle this well. Unaligned loads/stores are cheap enough that I don't think you can beat this. (That assumes a typical number of page-split loads; even a higher-than-average number of cache-line-split loads probably won't be a problem.)
This means no conditional branches inside the inner loop to decide on a copying strategy, and no mask generation or anything. All you need is up to 12B of extra padding at the end of your gather buffer, in case the last copy was only supposed to be 4B. (You also need the elements you're gathering to not be within 16B of the end of a page where the following page is unmapped or not readable.)
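As a sketch of what that loop might look like (gather_overlapping, elem_ptrs and elem_sizes are names I'm assuming here, not anything from your code; dst needs the padding described above):

#include <immintrin.h>
#include <stddef.h>

// Branchless gather loop: always copy 16B, but only advance dst by the element's real length.
// dst must have at least 12 spare bytes after where the last element ends.
static char *gather_overlapping(char *dst, const char *const *elem_ptrs,
                                const size_t *elem_sizes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        __m128 v = _mm_loadu_ps((const float*)elem_ptrs[i]);  // unconditional 16B load
        _mm_storeu_ps((float*)dst, v);                        // unconditional 16B store
        dst += elem_sizes[i];                                 // keep only the bytes we wanted
    }
    return dst;   // one past the last gathered byte
}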
If reading past the end of the elements you're gathering is a problem, then maybe vpmaskmov for the loads will actually be a good idea. If your elements are 4B-aligned, then it's always fine to read up to 3 bytes beyond the end. You can still use a normal 16B vector store into your dst buffer.
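A rough sketch of that masked-load variant (this assumes AVX is available, since vmaskmov / vpmaskmov are AVX/AVX2 instructions; copy_elem_maskload is just a name I made up):

#include <immintrin.h>
#include <stddef.h>

// Masked 16B load that rounds the element up to whole 4B dwords (reading at most
// 3 bytes past its end, which is fine for 4B-aligned elements), followed by a
// plain 16B store, so dst still needs padding at the end of the buffer.
static void copy_elem_maskload(char *dst, const float *src, size_t size)
{
    int dwords = (int)((size + 3) / 4);                            // 1..4 dwords to load
    __m128i lane = _mm_setr_epi32(0, 1, 2, 3);
    __m128i mask = _mm_cmpgt_epi32(_mm_set1_epi32(dwords), lane);  // lane < dwords => all-ones
    __m128  v    = _mm_maskload_ps(src, mask);                     // masked-off lanes read as 0
    _mm_storeu_ps((float*)dst, v);                                 // normal full-width store
}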
I used _ps loads because movups is 1 byte shorter than movupd or movdqu, but performs the same (see Agner Fog's microarch pdf, and other links in the x86 tag wiki). (clang will even use movaps / movups for _mm_store_si128 sometimes.)
re: your comment: Don't use legacy SSE maskmovdqu. The biggest problem is that it only works as a store, so it can't help you avoid reading outside the objects you're gathering. It's slow, and it bypasses the cache (it's an NT store), making it extremely slow when you come to reload this data.
The AVX versions (vmaskmov and vpmaskmov) don't have that problem, but converting your code to use maskmovdqu would probably be a big slowdown.
Related: I posted a Q&A about using vmovmaskps for the end of unaligned buffers a while ago. I got some interesting responses. Apparently it's not usually the best way to solve any problem, even though my (clever IMO) strategy for generating a mask was pretty efficient.
MASKMOVPS is very much one of those "it seemed like a good idea at the time" things AFAICT. I've never used it. – Stephen Canon
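For reference, a variable mask like the one mentioned above can be generated branchlessly with an unaligned load from a sliding window into a constant array; this is only a sketch of that general idea (mask_table and mask_for_dwords are illustrative names), not necessarily the exact code from that Q&A:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// A 16B window starting at offset (16 - 4*ndwords) contains ndwords lanes of
// all-ones followed by zeros, usable as a mask for _mm_maskload_ps / _mm_maskstore_ps.
static const int8_t mask_table[32] = {
    -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
     0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0 };

static __m128i mask_for_dwords(size_t ndwords)   // 0..4
{
    return _mm_loadu_si128((const __m128i*)(mask_table + 16 - 4*ndwords));
}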