For a hobby project I'm working on, I need to emulate certain 64-bit integer operations on a x86 CPU, and it needs to be fast.
Currently, I'm doing this via MMX instructions, but that's really a pain to work with, because I have to flush the fp register state all the time (and because most MMX instructions deal with signed integers, and I need unsigned behavior).
So I'm wondering if the SSE/optimization gurus here on SO can come up with a better implementation using SSE.
The operations I need are the following (quite specific) ones:
uint64_t X, Y;
X = 0;
X = 1;
X << 1;
X != Y;
X + 1;
X & 0x1 // get lsb
X | 0x1 // set lsb
X > Y;
Specifically, I don't need general-purpose addition or shifting, for example, just add one and left-shift one. Really, just the exact operations shown here.
Except, of course, on x86, uint64_t is emulated by using two 32-bit scalars, which is slow (and, in my case, simply doesn't work, because I need loads/stores to be atomic, which they won't be when loading/storing two separate registers).
Hence, I need a SIMD solution.
Some of these operations are trivial, supported by SSE2 already. Others (!= and <) require a bit more work.
Suggestions? SSE and SSE2 are fine. It'd take some persuasion to permit SSE3, and SSE4 is probably out of the question (A CPU which supports SSE4 is likely to run 64-bit anyway, and so I don't need these workarounds)
SSE2 has direct support for some 64-bit integer operations:
Set both elements to 0:
__m128i z = _mm_setzero_si128();
Set both elements to 1:
__m128i z = _mm_set1_epi64x(1);      // also works for variables.
__m128i z = _mm_set_epi64x(hi, lo);  // elements can be different
__m128i z = _mm_set_epi32(0,1,0,1);  // if any compilers refuse int64_t in 32-bit mode.  (None of the major ones do.)
Set/load the low 64 bits, zero-extending to __m128i
// supported even in 32-bit mode, and listed as an intrinsic for MOVQ
// so it should be atomic on aligned integers.
_mm_loadl_epi64((const __m128i*)p);     // movq or movsd 64-bit load
_mm_cvtsi64x_si128(a);      // only ICC, others refuse in 32-bit mode
_mm_loadl_epi64((const __m128i*)&a);  // portable for a value instead of pointer
Things based on _mm_set_epi32 can get compiled into a mess by some compilers, so _mm_loadl_epi64 appears to be the best bet across MSVC and ICC as well as gcc/clang, and should actually be safe for your requirement of atomic 64-bit loads in 32-bit mode.  See it on the Godbolt compiler explorer
Vertically add/subtract each 64-bit integer:
__m128i z = _mm_add_epi64(x,y)
__m128i z = _mm_sub_epi64(x,y)
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_integer_arithmetic.htm#intref_sse2_integer_arithmetic
Left Shift:
__m128i z = _mm_slli_epi64(x,i)   // i must be an immediate
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_int_shift.htm
Bitwise operators:
__m128i z = _mm_and_si128(x,y)
__m128i z = _mm_or_si128(x,y)
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_integer_logical.htm
SSE doesn't have increments, so you'll have to use a constant with 1.
Comparisons are harder since there's no 64-bit support until SSE4.1 pcmpeqq and SSE4.2 pcmpgtq
Here's the one for equality:
__m128i t = _mm_cmpeq_epi32(a,b);
__m128i z = _mm_and_si128(t,_mm_shuffle_epi32(t,177));
This will set the each 64-bit element to 0xffffffffffff (aka -1) if they are equal. If you want it as a 0 or 1 in an int, you can pull it out using _mm_cvtsi32_si128() and add 1.  (But sometimes you can do total -= cmp_result; instead of converting and adding.)
And Less-Than: (not fully tested)
a = _mm_xor_si128(a,_mm_set1_epi32(0x80000000));
b = _mm_xor_si128(b,_mm_set1_epi32(0x80000000));
__m128i t = _mm_cmplt_epi32(a,b);
__m128i u = _mm_cmpgt_epi32(a,b);
__m128i z = _mm_or_si128(t,_mm_shuffle_epi32(t,177));
z = _mm_andnot_si128(_mm_shuffle_epi32(u,245),z);
This will set the each 64-bit element to 0xffffffffffff if the corresponding element in a is less than b.
Here's are versions of "equals" and "less-than" that return a bool. They return the result of the comparison for the bottom 64-bit integer.
inline bool equals(__m128i a,__m128i b){
    __m128i t = _mm_cmpeq_epi32(a,b);
    __m128i z = _mm_and_si128(t,_mm_shuffle_epi32(t,177));
    return _mm_cvtsi128_si32(z) & 1;
}
inline bool lessthan(__m128i a,__m128i b){
    a = _mm_xor_si128(a,_mm_set1_epi32(0x80000000));
    b = _mm_xor_si128(b,_mm_set1_epi32(0x80000000));
    __m128i t = _mm_cmplt_epi32(a,b);
    __m128i u = _mm_cmpgt_epi32(a,b);
    __m128i z = _mm_or_si128(t,_mm_shuffle_epi32(t,177));
    z = _mm_andnot_si128(_mm_shuffle_epi32(u,245),z);
    return _mm_cvtsi128_si32(z) & 1;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With