Efficient AVX2 implementation of a 17x17-bit squaring operation with result truncation

I am trying to create an efficient AVX2 implementation for a 17x17-bit truncated squarer that returns the 15 most significant bits of the result. This operation appears as a building block in transcendental function approximation, as described in the following publication:

Stuart F. Oberman and Michael Y. Siu, "A High-Performance Area-Efficient Multifunction Interpolator." In 17th IEEE Symposium on Computer Arithmetic, June 2005, pp. 272-279.

My assumption is that a truncated squarer is used to guarantee single-cycle execution, such that squaring proceeds in parallel with the necessary table lookup. Various implementations of truncated multipliers are described in the literature. Based on clues in the paper, I picked a minimal one that I find fulfils the functional requirements in the given context. For more accurate (but potentially slower) implementations, see

Theo Drane, et al., "On the Systematic Creation of Faithfully-Rounded Commutative Truncated Booth Multipliers." In 31st IEEE Symposium on Computer Arithmetic, June 2024, pp. 108-115.

For the details of how the truncated squarer operates, I refer to the diagram in the comment above the reference implementation truncated_squarer_ref(). There, the letters 'a' through 'q' denote the individual bits of the source operand x, x<i> denotes the i-th bit of x, '0' marks bits that are ignored due to the truncating nature of the squarer, 'R' denotes the "round" bit, 'C' denotes a potential carry-out, and 'r' denotes a bit of the result.

For the reference implementation, I essentially used binary long-hand multiplication with certain source bits and partial-product bits disregarded as indicated by the truncating nature of the squarer. I did not exploit the fact that this is a squaring operation rather than a general multiplication, because no optimization along those lines readily occurred to me.

An AVX2 implementation that fully exploits all available parallelism seemed trivial, given that the task basically comprises conditional summation of fifteen 15-bit partial products, all of which fit into a single 256-bit register. Straightforward translation thus resulted in truncated_squarer_avx2() as shown below. Because I usually compile with /QxHOST (meaning: target the architecture of the build machine) and my build machine implements the skylake-avx512 architecture, I did not notice until considerably later that, through the wonders of non-orthogonal ISA design, _mm256_srlv_epi16() is not actually a thing in AVX2.

Emulating _mm256_srlv_epi16() via _mm256_srlv_epi32() would seem inefficient and I do not use AVX2 frequently enough to readily think of an alternative way of addressing the problem. What is an efficient way of implementing the functionality of the truncated squarer using AVX2? Any fast and concise intrinsic-based implementation that passes the exhaustive functional test is welcome.
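For concreteness, the emulation I have in mind would look something like the helper below (the name is mine; it handles the shift counts of 15 and below used here, with larger counts simply yielding zero, by processing even and odd 16-bit words separately so that bits shifted out of one word cannot leak into its neighbor):

/* hypothetical emulation of _mm256_srlv_epi16() using AVX2 32-bit variable shifts */
static inline __m256i emu_mm256_srlv_epi16 (__m256i a, __m256i count)
{
    const __m256i lo_mask = _mm256_set1_epi32 (0x0000ffff);
    /* even words: isolate the low halves, then shift by the even counts */
    __m256i even = _mm256_srlv_epi32 (_mm256_and_si256 (a, lo_mask),
                                      _mm256_and_si256 (count, lo_mask));
    /* odd words: shift by the odd counts, then drop bits that crossed below bit 16 */
    __m256i odd  = _mm256_andnot_si256 (lo_mask,
                       _mm256_srlv_epi32 (a, _mm256_srli_epi32 (count, 16)));
    return _mm256_or_si256 (even, odd);
}

Substituting such a helper for the two _mm256_srlv_epi16() calls in truncated_squarer_avx2() would make it functional on plain AVX2, but at the cost of roughly a dozen extra instructions, which is what prompts this question.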

Addendum: It was asked in comments whether it is possible to perform a full multiplication first and then subtract from it the sum of the discarded partial products. This is indeed possible, as demonstrated by truncated_squarer_mul_sub() below.

Addendum 2: I found through further experimentation at Compiler Explorer that the auto-SIMD-ization of the Intel compiler icx 2025.1.1 can generate a 25-instruction solution for a code variant I include as truncated_squarer_alt() below. This came as a pleasant surprise.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include "immintrin.h"

/* 17x17-bit truncated squarer that returns the most significant 15 bits
   of the result. This uses a "round" bit to reduce the error introduced
   by truncation.

                    17 source bits
                   _______^_______
                  /               \
                  q0000000000000000  (  (x >> 16)* x<2>
                 qp000000000000000    + (x >> 15)* x<3>
                qpo00000000000000     + (x >> 14)* x<4>
               qpon0000000000000      + (x >> 13)* x<5>
              qponm000000000000       + (x >> 12)* x<6>
             qponml00000000000        + (x >> 11)* x<7>
            qponmlk0000000000         + (x >> 10)* x<8>
           qponmlkj000000000          + (x >> 9) * x<9>
          qponmlkji00000000           + (x >> 8) * x<10>
         qponmlkjih0000000            + (x >> 7) * x<11>
        qponmlkjihg000000             + (x >> 6) * x<12>
       qponmlkjihgf00000              + (x >> 5) * x<13>
      qponmlkjihgfe0000               + (x >> 4) * x<14>
     qponmlkjihgfed000                + (x >> 3) * x<15>
    qponmlkjihgfedc00                 + (x >> 2) * x<16>) >> 1
   C              R
   rrrrrrrrrrrrrrr
   \______ ______/
          v
   15 result bits
*/
uint32_t truncated_squarer_ref (uint32_t x)
{
    uint32_t r = 0;
    for (int i = 16; i >= 2; i--) {
        r += (0 - ((x >> i) & 1)) & (x >> (18 - i));
    }
    return r >> 1;
}

/* As is, this does NOT actually work on AVX2 but runs fine on AVX512! */
uint32_t truncated_squarer_avx2 (uint32_t x)
{
    __m256i a, b, lsb, shift_count1_v, shift_count2_v;
    uint16_t shift_count1[16] = { 0, 1, 2, 3, 4,5,6,7,8,9,10,11,12,13,14,15};
    uint16_t shift_count2[16] = {14,13,12,11,10,9,8,7,6,5, 4, 3, 2, 1, 0,15};
    uint16_t xs = x >> 2;
    memcpy (&shift_count1_v, shift_count1, sizeof shift_count1_v);
    memcpy (&shift_count2_v, shift_count2, sizeof shift_count2_v);
    a = _mm256_set1_epi16 (xs);
    lsb = _mm256_set1_epi16 (1);
    b = _mm256_srlv_epi16 (a, shift_count2_v); // doesn't exist in AVX2!
    a = _mm256_srlv_epi16 (a, shift_count1_v); // doesn't exist in AVX2!
    b = _mm256_and_si256 (b, lsb);
    a = _mm256_mullo_epi16 (a, b);
    a = _mm256_hadd_epi16 (a, a);
    a = _mm256_hadd_epi16 (a, a);
    a = _mm256_hadd_epi16 (a, a);
    return (uint32_t)(_mm256_extract_epi16(a,0)+_mm256_extract_epi16(a,8)) >> 1;
}

// perform full multiply, then remove the sum of the discarded partial products
uint32_t truncated_squarer_mul_sub (uint32_t x)
{
    uint32_t r = 0;
    x = x >> 2;
    for (int i = 14; i >= 0; i--) {
        r += (0 - ((x >> i) & 1)) & ((x << i) & 0x3fff);
    }
    return (x * x - r) >> 15;
}

// compiles to 25 instructions with icx 2025.1.1 through auto-SIMD-ization
uint32_t truncated_squarer_alt (uint32_t x)
{
    uint32_t r = 0;
    x = x >> 2;
    for (int i = 0; i < 15; i++) {
        if ((x >> i) & 1) r = r + ((x << i) & ~0x3fff);
    }
    return r >> 15;
}

int main (void)
{
    uint32_t x, res, ref;

    for (x = 0; x < 0x20000; x++) {
        res = truncated_squarer_avx2 (x);
        ref = truncated_squarer_ref (x);
        if (res != ref) {
            printf ("x=%05x res=%04x ref=%04x\n", x, res, ref);
            return EXIT_FAILURE;
        }
    }
    return EXIT_SUCCESS;
}
asked Dec 11 '25 by njuffa

2 Answers

This passed the test:

uint32_t truncated_squarer_avx2(uint32_t x)
{
    x >>= 2;
    __m256i sh1 = _mm256_setr_epi16(0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80, 0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0, 0);
    __m256i sh2 = _mm256_setr_epi16(0x8000, 0x4000, 0x2000, 0x1000, 0x800, 0x400, 0x200, 0x100, 0x80, 0x40, 0x20, 0x10, 0x8, 0x4, 0, 0);
    __m256i vx = _mm256_set1_epi16(x);
    __m256i b = _mm256_mullo_epi16(vx, sh2);
    vx = _mm256_and_si256(_mm256_mullo_epi16(vx, sh1), _mm256_set1_epi16(0x3fff));
    b = _mm256_srli_epi16(b, 15);
    __m256i hsum_256 = _mm256_madd_epi16(vx, b);
    __m128i hsum = _mm_add_epi32(_mm256_castsi256_si128(hsum_256), _mm256_extracti128_si256(hsum_256, 1));
    hsum = _mm_add_epi32(hsum, _mm_srli_epi64(hsum, 32));
    hsum = _mm_add_epi32(hsum, _mm_srli_si128(hsum, 8));
    return (x * x - _mm_cvtsi128_si32(hsum)) >> 15;
}

It's based on truncated_squarer_mul_sub. The "bit extraction" part ((x >> i) & 1) is implemented by multiplying by a power of two that puts the target bit in the sign bit of the corresponding word, then shifting all words right by 15 (an arithmetic shift would turn that bit into a mask to AND with, but here it is simply multiplied in). The "left shift and mask" part ((x << i) & 0x3fff) is implemented by multiplying by powers of two and then masking with 0x3fff. The extracted bit and the left-shifted value are combined by _mm256_madd_epi16, which also performs the first round of the horizontal reduction.
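To make the lane-level view concrete, here is a hypothetical scalar model (not part of the vector code itself) of what one 16-bit lane computes; xs is the input already shifted right by 2. Lane 14's partial product vanishes under the 0x3fff mask and lane 15 is unused, which is why both constant vectors hold zero there.

/* hypothetical scalar model of one 16-bit lane i, 0 <= i <= 13 */
static uint32_t lane_term (uint32_t xs, int i)   /* xs = x >> 2 */
{
    uint16_t bit_i  = (uint16_t)(xs * (0x8000u >> i)) >> 15; /* sh2 multiply: (xs >> i) & 1     */
    uint16_t part_i = (uint16_t)(xs * (1u << i)) & 0x3fff;   /* sh1 multiply: (xs << i) & 0x3fff */
    return (uint32_t)bit_i * part_i; /* _mm256_madd_epi16 forms these products and adds adjacent pairs */
}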

Here's an alternative without _mm256_madd_epi16 and with a horizontal reduction based mostly on _mm_sad_epu8; the low bytes and high bytes of the partial products are summed (mostly) separately (except in the first reduction step, which goes from 256 bits to 128 bits).

uint32_t truncated_squarer_avx2_2(uint32_t x)
{
    x >>= 2;
    __m256i sh1 = _mm256_setr_epi16(0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80, 0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0, 0);
    __m256i sh2 = _mm256_setr_epi16(0x8000, 0x4000, 0x2000, 0x1000, 0x800, 0x400, 0x200, 0x100, 0x80, 0x40, 0x20, 0x10, 0x8, 0x4, 0, 0);
    __m256i vx = _mm256_set1_epi16(x);
    __m256i b = _mm256_mullo_epi16(vx, sh2);
    vx = _mm256_and_si256(_mm256_mullo_epi16(vx, sh1), _mm256_set1_epi16(0x3fff));
    b = _mm256_srai_epi16(b, 15);
    __m256i pp = _mm256_and_si256(vx, b);
    __m128i hsum = _mm_add_epi16(_mm256_castsi256_si128(pp), _mm256_extracti128_si256(pp, 1));
    hsum = _mm_shuffle_epi8(hsum, _mm_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15));
    hsum = _mm_sad_epu8(hsum, _mm_setzero_si128());
    return (x * x - _mm_cvtsi128_si32(hsum) - (_mm_extract_epi16(hsum, 4) << 8)) >> 15;
}
answered Dec 14 '25 by harold

Here's one with 3.2KB of extra data, which should fit comfortably in L1d:

uint8_t lut2[1190] = {0x41, 0x10, 0x04, 0x41, 0x10, 0x04, 0x49, 0x92, 0x24, 0x49, 0x92, 0x24, 0x41, 0x10, 0x04, 0x41, 0x10, 0x28, 0x41, 0x10, 0x28, 0x41, 0x10, 0x28, 0x41, 0x10, 0x04, 0x81, 0x12, 0x28, 0x41, 0x10, 0x28, 0x81, 0x12, 0x28, 0x81, 0x12, 0x28, 0x81, 0x12, 0x28, 0x41, 0x10, 0x04, 0x41, 0xa0, 0x28, 0x41, 0x10, 0x28, 0x41, 0xa0, 0x28, 0x41, 0xa0, 0x28, 0x41, 0xa0, 0x28, 0x41, 0x10, 0x04, 0x81, 0xa2, 0x28, 0x41, 0x10, 0x28, 0x81, 0xa2, 0x28, 0x81, 0x12, 0x28, 0x81, 0xa2, 0x28, 0x41, 0xa0, 0x28, 0x81, 0xa2, 0x28, 0x81, 0xa2, 0x28, 0x81, 0xa2, 0x28, 0x41, 0x10, 0x04, 0x8a, 0xa2, 0x28, 0x41, 0x10, 0x28, 0x8a, 0xa2, 0x28, 0x81, 0x12, 0x28, 0x8a, 0xa2, 0x28, 0x41, 0xa0, 0x28, 0x8a, 0xa2, 0x28, 0x81, 0xa2, 0x28, 0x8a, 0xa2, 0x28, 0x8a, 0xa2, 0x28, 0x8a, 0xa2, 0x28, 0x49, 0x92, 0x24, 0x49, 0x92, 0x48, 0x49, 0x92, 0x48, 0x49, 0x92, 0x48, 0x49, 0x92, 0x24, 0x89, 0x94, 0x48, 0x49, 0x92, 0x48, 0x89, 0x94, 0x48, 0x89, 0x94, 0x48, 0x89, 0x94, 0x48, 0x49, 0x92, 0x24, 0x49, 0x22, 0x49, 0x49, 0x92, 0x48, 0x49, 0x22, 0x49, 0x49, 0x22, 0x49, 0x49, 0x22, 0x49, 0x49, 0x92, 0x24, 0x89, 0x24, 0x49, 0x49, 0x92, 0x48, 0x89, 0x24, 0x49, 0x89, 0x94, 0x48, 0x89, 0x24, 0x49, 0x49, 0x22, 0x49, 0x89, 0x24, 0x49, 0x89, 0x24, 0x49, 0x89, 0x24, 0x49, 0x49, 0x92, 0x24, 0x92, 0x24, 0x49, 0x49, 0x92, 0x48, 0x92, 0x24, 0x49, 0x89, 0x94, 0x48, 0x92, 0x24, 0x49, 0x49, 0x22, 0x49, 0x92, 0x24, 0x49, 0x89, 0x24, 0x49, 0x92, 0x24, 0x49, 0x92, 0x24, 0x49, 0x92, 0x24, 0x49, 0x41, 0x10, 0x28, 0x81, 0xa2, 0x4c, 0x81, 0x12, 0x28, 0x81, 0xa2, 0x4c, 0x41, 0xa0, 0x28, 0x81, 0xa2, 0x4c, 0x81, 0xa2, 0x28, 0x81, 0xa2, 0x4c, 0x81, 0xa2, 0x4c, 0x81, 0xa2, 0x4c, 0x41, 0x10, 0x28, 0x8a, 0xa2, 0x4c, 0x81, 0x12, 0x28, 0x8a, 0xa2, 0x4c, 0x41, 0xa0, 0x28, 0x8a, 0xa2, 0x4c, 0x81, 0xa2, 0x28, 0x8a, 0xa2, 0x4c, 0x8a, 0xa2, 0x28, 0x8a, 0xa2, 0x4c, 0x81, 0xa2, 0x4c, 0x8a, 0xa2, 0x4c, 0x8a, 0xa2, 0x4c, 0x8a, 0xa2, 0x4c, 0x81, 0x12, 0x28, 0xca, 0xa4, 0x4c, 0x81, 0xa2, 0x28, 0xca, 0xa4, 0x4c, 0x8a, 0xa2, 0x28, 0xca, 0xa4, 0x4c, 0x81, 0xa2, 0x4c, 0xca, 0xa4, 0x4c, 0x8a, 0xa2, 0x4c, 0xca, 0xa4, 0x4c, 0xca, 0xa4, 0x4c, 0xca, 0xa4, 0x4c, 0x41, 0xa0, 0x28, 0x8a, 0x32, 0x4d, 0x81, 0xa2, 0x28, 0x8a, 0x32, 0x4d, 0x8a, 0xa2, 0x28, 0x8a, 0x32, 0x4d, 0x81, 0xa2, 0x4c, 0x8a, 0x32, 0x4d, 0x8a, 0xa2, 0x4c, 0x8a, 0x32, 0x4d, 0x8a, 0x32, 0x4d, 0x8a, 0x32, 0x4d, 0x81, 0xa2, 0x28, 0xca, 0x34, 0x4d, 0x8a, 0xa2, 0x28, 0xca, 0x34, 0x4d, 0x81, 0xa2, 0x4c, 0xca, 0x34, 0x4d, 0x8a, 0xa2, 0x4c, 0xca, 0x34, 0x4d, 0xca, 0xa4, 0x4c, 0xca, 0x34, 0x4d, 0x8a, 0x32, 0x4d, 0xca, 0x34, 0x4d, 0xca, 0x34, 0x4d, 0xca, 0x34, 0x4d, 0x8a, 0xa2, 0x28, 0xd3, 0x34, 0x4d, 0x8a, 0xa2, 0x4c, 0xd3, 0x34, 0x4d, 0xca, 0xa4, 0x4c, 0xd3, 0x34, 0x4d, 0x8a, 0x32, 0x4d, 0xd3, 0x34, 0x4d, 0xd3, 0x34, 0x4d, 0xd3, 0x34, 0x4d, 0x49, 0x92, 0x48, 0x89, 0x24, 0x6d, 0x89, 0x94, 0x48, 0x89, 0x24, 0x6d, 0x49, 0x22, 0x49, 0x89, 0x24, 0x6d, 0x89, 0x24, 0x49, 0x89, 0x24, 0x6d, 0x89, 0x24, 0x6d, 0x89, 0x24, 0x6d, 0x49, 0x92, 0x48, 0x92, 0x24, 0x6d, 0x89, 0x94, 0x48, 0x92, 0x24, 0x6d, 0x49, 0x22, 0x49, 0x92, 0x24, 0x6d, 0x89, 0x24, 0x49, 0x92, 0x24, 0x6d, 0x92, 0x24, 0x49, 0x92, 0x24, 0x6d, 0x89, 0x24, 0x6d, 0x92, 0x24, 0x6d, 0x92, 0x24, 0x6d, 0x92, 0x24, 0x6d, 0x89, 0x94, 0x48, 0xd2, 0x26, 0x6d, 0x89, 0x24, 0x49, 0xd2, 0x26, 0x6d, 0x92, 0x24, 0x49, 0xd2, 0x26, 0x6d, 0x89, 0x24, 0x6d, 0xd2, 0x26, 0x6d, 0x92, 0x24, 0x6d, 0xd2, 0x26, 0x6d, 0xd2, 0x26, 0x6d, 0xd2, 0x26, 0x6d, 0x49, 0x22, 0x49, 0x92, 0xb4, 0x6d, 0x89, 0x24, 0x49, 0x92, 0xb4, 0x6d, 0x92, 0x24, 0x49, 0x92, 0xb4, 0x6d, 0x89, 0x24, 0x6d, 0x92, 0xb4, 0x6d, 
0x92, 0x24, 0x6d, 0x92, 0xb4, 0x6d, 0x92, 0xb4, 0x6d, 0x92, 0xb4, 0x6d, 0x89, 0x24, 0x49, 0xd2, 0xb6, 0x6d, 0x92, 0x24, 0x49, 0xd2, 0xb6, 0x6d, 0x89, 0x24, 0x6d, 0xd2, 0xb6, 0x6d, 0x92, 0x24, 0x6d, 0xd2, 0xb6, 0x6d, 0xd2, 0x26, 0x6d, 0xd2, 0xb6, 0x6d, 0x92, 0xb4, 0x6d, 0xd2, 0xb6, 0x6d, 0xd2, 0xb6, 0x6d, 0xd2, 0xb6, 0x6d, 0x92, 0x24, 0x49, 0xdb, 0xb6, 0x6d, 0x92, 0x24, 0x6d, 0xdb, 0xb6, 0x6d, 0xd2, 0x26, 0x6d, 0xdb, 0xb6, 0x6d, 0x92, 0xb4, 0x6d, 0xdb, 0xb6, 0x6d, 0xdb, 0xb6, 0x6d, 0xdb, 0xb6, 0x6d, 0x81, 0xa2, 0x4c, 0xca, 0x34, 0x71, 0x8a, 0xa2, 0x4c, 0xca, 0x34, 0x71, 0xca, 0xa4, 0x4c, 0xca, 0x34, 0x71, 0x8a, 0x32, 0x4d, 0xca, 0x34, 0x71, 0xca, 0x34, 0x4d, 0xca, 0x34, 0x71, 0xca, 0x34, 0x71, 0xca, 0x34, 0x71, 0x8a, 0xa2, 0x4c, 0xd3, 0x34, 0x71, 0xca, 0xa4, 0x4c, 0xd3, 0x34, 0x71, 0x8a, 0x32, 0x4d, 0xd3, 0x34, 0x71, 0xca, 0x34, 0x4d, 0xd3, 0x34, 0x71, 0xd3, 0x34, 0x4d, 0xd3, 0x34, 0x71, 0xca, 0x34, 0x71, 0xd3, 0x34, 0x71, 0xca, 0xa4, 0x4c, 0x13, 0x37, 0x71, 0xca, 0x34, 0x4d, 0x13, 0x37, 0x71, 0xd3, 0x34, 0x4d, 0x13, 0x37, 0x71, 0xca, 0x34, 0x71, 0x13, 0x37, 0x71, 0xd3, 0x34, 0x71, 0x13, 0x37, 0x71, 0x13, 0x37, 0x71, 0x13, 0x37, 0x71, 0x8a, 0x32, 0x4d, 0xd3, 0xc4, 0x71, 0xca, 0x34, 0x4d, 0xd3, 0xc4, 0x71, 0xd3, 0x34, 0x4d, 0xd3, 0xc4, 0x71, 0xd3, 0x34, 0x71, 0xd3, 0xc4, 0x71, 0xca, 0x34, 0x4d, 0x13, 0xc7, 0x71, 0xd3, 0x34, 0x71, 0x13, 0xc7, 0x71, 0xd3, 0xc4, 0x71, 0x13, 0xc7, 0x71, 0xd3, 0x34, 0x4d, 0x1c, 0xc7, 0x71, 0xd3, 0x34, 0x71, 0x1c, 0xc7, 0x71, 0x1c, 0xc7, 0x71, 0x1c, 0xc7, 0x71, 0x89, 0x24, 0x6d, 0xd2, 0xb6, 0x91, 0x92, 0x24, 0x6d, 0xd2, 0xb6, 0x91, 0xd2, 0x26, 0x6d, 0xd2, 0xb6, 0x91, 0x92, 0xb4, 0x6d, 0xd2, 0xb6, 0x91, 0xd2, 0xb6, 0x6d, 0xd2, 0xb6, 0x91, 0xd2, 0xb6, 0x91, 0xd2, 0xb6, 0x91, 0x92, 0x24, 0x6d, 0xdb, 0xb6, 0x91, 0xd2, 0x26, 0x6d, 0xdb, 0xb6, 0x91, 0x92, 0xb4, 0x6d, 0xdb, 0xb6, 0x91, 0xd2, 0xb6, 0x6d, 0xdb, 0xb6, 0x91, 0xdb, 0xb6, 0x6d, 0xdb, 0xb6, 0x91, 0xd2, 0xb6, 0x91, 0xdb, 0xb6, 0x91, 0xd2, 0x26, 0x6d, 0x1b, 0xb9, 0x91, 0xd2, 0xb6, 0x6d, 0x1b, 0xb9, 0x91, 0xdb, 0xb6, 0x6d, 0x1b, 0xb9, 0x91, 0xd2, 0xb6, 0x91, 0x1b, 0xb9, 0x91, 0xdb, 0xb6, 0x91, 0x1b, 0xb9, 0x91, 0x1b, 0xb9, 0x91, 0x1b, 0xb9, 0x91, 0x92, 0xb4, 0x6d, 0xdb, 0x46, 0x92, 0xd2, 0xb6, 0x6d, 0xdb, 0x46, 0x92, 0xdb, 0xb6, 0x6d, 0xdb, 0x46, 0x92, 0xdb, 0xb6, 0x91, 0xdb, 0x46, 0x92, 0xd2, 0xb6, 0x6d, 0x1b, 0x49, 0x92, 0xdb, 0xb6, 0x91, 0x1b, 0x49, 0x92, 0xdb, 0x46, 0x92, 0x1b, 0x49, 0x92, 0xdb, 0xb6, 0x6d, 0x24, 0x49, 0x92, 0xdb, 0xb6, 0x91, 0x24, 0x49, 0x92, 0x24, 0x49, 0x92, 0x24, 0x49, 0x92, 0xca, 0x34, 0x71, 0x13, 0xc7, 0x95, 0xd3, 0x34, 0x71, 0x13, 0xc7, 0x95, 0x13, 0x37, 0x71, 0x13, 0xc7, 0x95, 0x13, 0xc7, 0x71, 0x13, 0xc7, 0x95, 0xd3, 0x34, 0x71, 0x1c, 0xc7, 0x95, 0x13, 0x37, 0x71, 0x1c, 0xc7, 0x95, 0x1c, 0xc7, 0x71, 0x5c, 0xc9, 0x95, 0xd3, 0xc4, 0x71, 0x1c, 0x57, 0x96, 0x1c, 0xc7, 0x95, 0x1c, 0x57, 0x96, 0x1c, 0xc7, 0x71, 0x65, 0x59, 0x96, 0xd2, 0xb6, 0x91, 0x1b, 0x49, 0xb6, 0xdb, 0xb6, 0x91, 0x1b, 0x49, 0xb6, 0x1b, 0xb9, 0x91, 0x1b, 0x49, 0xb6, 0x1b, 0x49, 0x92, 0x1b, 0x49, 0xb6, 0xdb, 0xb6, 0x91, 0x24, 0x49, 0xb6, 0x1b, 0xb9, 0x91, 0x24, 0x49, 0xb6, 0x24, 0x49, 0x92, 0x64, 0x4b, 0xb6, 0xdb, 0x46, 0x92, 0x24, 0xd9, 0xb6, 0x24, 0x49, 0xb6, 0x24, 0xd9, 0xb6, 0x24, 0x49, 0x92, 0x6d, 0xdb, 0xb6, 0x13, 0xc7, 0x95, 0x5c, 0x59, 0xba, 0x1c, 0xc7, 0x95, 0x5c, 0x59, 0xba, 0x1c, 0xc7, 0x95, 0x65, 0x59, 0xba, 0x1c, 0x57, 0x96, 0x65, 0xe9, 0xba, 0x1b, 0x49, 0xb6, 0x64, 0xdb, 0xda, 0x24, 0x49, 0xb6, 0x64, 0xdb, 0xda, 0x24, 0x49, 0xb6, 0x6d, 0xdb, 0xda, 0x24, 0xd9, 0xb6, 0x6d, 0x6b, 0xdb, 0x5c, 0x59, 0xba, 0xa5, 
0xeb, 0xde, 0x64, 0xdb, 0xda, 0xad, 0x6d, 0xff, 0x00, 0x00};
uint8_t lut1[2048] = {0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x23, 0x27, 0x27, 0x27, 0x27, 0x1, 0x1, 0x1, 0x1d, 0x1, 0x1, 0x1a, 0x27, 0x1, 0x23, 0x27, 0x60, 0x27, 0x27, 0x6f, 0x6f, 0x1, 0x1, 0x1, 0x15, 0x1, 0x1d, 0x1, 0x25, 0x15, 0x27, 0x23, 0x55, 0x27, 0x6c, 0x60, 0x6f, 0x1, 0x23, 0x1d, 0x55, 0x1, 0x26, 0x23, 0x6b, 0x1d, 0x60, 0x55, 0x96, 0x27, 0x6f, 0x6f, 0xa7, 0x1, 0x1, 0x1, 0x1, 0x15, 0x1a, 0x1a, 0x1d, 0x22, 0x23, 0x25, 0x26, 0x60, 0x65, 0x6b, 0x6e, 0x15, 0x1d, 0x23, 0x25, 0x1a, 0x22, 0x25, 0x60, 0x23, 0x54, 0x65, 0x6e, 0x65, 0x6c, 0xa0, 0xa6, 0x1a, 0x23, 0x1a, 0x23, 0x1d, 0x53, 0x22, 0x5f, 0x23, 0x65, 0x5e, 0x6c, 0x65, 0x9f, 0x92, 0xa5, 0x1d, 0x5e, 0x51, 0x6b, 0x22, 0x64, 0x5f, 0x9e, 0x51, 0x92, 0x6b, 0xa5, 0x6b, 0xa5, 0xa5, 0xbb, 0x1, 0x1, 0x1, 0x1, 0x1, 0x15, 0x16, 0x16, 0x1c, 0x1c, 0x21, 0x26, 0x55, 0x57, 0x63, 0x6a, 0x1, 0x16, 0x1c, 0x21, 0x15, 0x1c, 0x21, 0x57, 0x1c, 0x26, 0x62, 0x69, 0x57, 0x69, 0x96, 0xa4, 0x15, 0x20, 0x15, 0x20, 0x1b, 0x21, 0x1b, 0x54, 0x20, 0x62, 0x54, 0x69, 0x62, 0x95, 0x69, 0xa1, 0x1b, 0x54, 0x20, 0x69, 0x1b, 0x61, 0x54, 0x94, 0x4e, 0x69, 0x69, 0xa3, 0x62, 0xa1, 0xa1, 0xba, 0x1b, 0x1b, 0x1b, 0x1b, 0x1e, 0x1e, 0x20, 0x4e, 0x53, 0x54, 0x5f, 0x61, 0x67, 0x8f, 0x94, 0x9f, 0x1e, 0x4e, 0x53, 0x61, 0x4c, 0x53, 0x61, 0x8d, 0x5e, 0x66, 0x8f, 0x9f, 0x8d, 0x9f, 0xb3, 0xb9, 0x4c, 0x5e, 0x4c, 0x5f, 0x4e, 0x61, 0x53, 0x8c, 0x5e, 0x8f, 0x66, 0x9e, 0x8f, 0xb2, 0x9e, 0xb9, 0x4c, 0x8c, 0x5f, 0x9e, 0x53, 0x8c, 0x8c, 0xb2, 0x5e, 0x9e, 0x9e, 0xb9, 0x9e, 0xb9, 0xb9, 0xc3, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x19, 0x19, 0x19, 0x24, 0x27, 0x55, 0x5d, 0x5d, 0x1, 0x1, 0x19, 0x1f, 0x1, 0x18, 0x1f, 0x27, 0x19, 0x24, 0x5a, 0x68, 0x55, 0x5d, 0x6f, 0x9d, 0x1, 0x19, 0x15, 0x19, 0x15, 0x1f, 0x18, 0x26, 0x19, 0x5a, 0x24, 0x5d, 0x5a, 0x6d, 0x68, 0x9c, 0x15, 0x24, 0x1f, 0x5c, 0x17, 0x54, 0x52, 0x6d, 0x1f, 0x67, 0x5c, 0x9c, 0x5a, 0x9a, 0x9a, 0xb8, 0x17, 0x17, 0x17, 0x17, 0x18, 0x1e, 0x1e, 0x1f, 0x24, 0x52, 0x59, 0x59, 0x67, 0x67, 0x93, 0x99, 0x17, 0x1e, 0x52, 0x59, 0x1e, 0x52, 0x59, 0x67, 0x52, 0x59, 0x8d, 0x99, 0x67, 0x93, 0xa3, 0xb7, 0x1e, 0x52, 0x1e, 0x58, 0x1e, 0x59, 0x51, 0x64, 0x58, 0x8d, 0x64, 0x93, 0x8d, 0xa2, 0x98, 0xb6, 0x4c, 0x64, 0x58, 0x93, 0x51, 0x8c, 0x64, 0xa2, 0x58, 0x98, 0x93, 0xb6, 0x92, 0xb6, 0xb6, 0xc2, 0x16, 0x16, 0x16, 0x16, 0x16, 0x16, 0x19, 0x19, 0x21, 0x4f, 0x50, 0x56, 0x57, 0x5d, 0x6a, 0x91, 0x16, 0x19, 0x4f, 0x50, 0x18, 0x21, 0x50, 0x5c, 0x4f, 0x56, 0x68, 0x91, 0x5c, 0x90, 0x9c, 0xb5, 0x18, 0x4f, 0x18, 0x50, 0x1e, 0x50, 0x4d, 0x5b, 0x4f, 0x67, 0x56, 0x90, 0x67, 0x9b, 0x91, 0xb4, 0x1e, 0x56, 0x50, 0x90, 0x4c, 0x66, 0x5b, 0x97, 0x50, 0x90, 0x90, 0xb4, 0x8d, 0xb3, 0xb3, 0xc1, 0x4c, 0x4c, 0x4c, 0x4c, 0x4d, 0x4d, 0x4f, 0x50, 0x56, 0x5b, 0x8c, 0x8c, 0x8e, 0x91, 0x9b, 0xb2, 0x4c, 0x4f, 0x5b, 0x8c, 0x4d, 0x5b, 0x8c, 0x8e, 0x66, 0x8c, 0x91, 0xb2, 0x8e, 0xb2, 0xb3, 0xc0, 0x4c, 0x8c, 0x4c, 0x8c, 0x4f, 0x8c, 0x5b, 0x8c, 0x8c, 0x90, 0x8c, 0xb2, 0x91, 0xb2, 0xb2, 0xc0, 0x4c, 0x8c, 0x8c, 0xb2, 0x5b, 0x8c, 0x8c, 0xb2, 0x8c, 0xb2, 0xb2, 0xc0, 0xb2, 0xc0, 0xc0, 0xc5, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x23, 0x27, 0x27, 0x27, 0x27, 0x1, 0x1, 0x1, 0x1d, 0x1, 0x1, 0x1a, 0x27, 0x1, 0x23, 0x27, 0x60, 0x27, 0x27, 0x6f, 0x6f, 0x1, 0x1, 0x1, 0x15, 0x1, 0x1d, 0x1, 0x25, 0x15, 0x27, 0x23, 0x55, 0x27, 0x6c, 0x60, 0x6f, 0x1, 0x23, 0x1d, 0x55, 0x1, 0x26, 0x23, 0x6b, 0x1d, 0x60, 0x55, 0x96, 0x27, 0x6f, 0x6f, 0xa7, 0x1, 0x1, 0x1, 0x1, 0x15, 0x1a, 0x1a, 0x1d, 0x22, 0x23, 0x25, 0x26, 0x60, 0x65, 0x6b, 0x6e, 0x15, 0x1d, 0x23, 0x25, 0x1a, 0x22, 0x25, 0x60, 0x23, 
0x54, 0x65, 0x6e, 0x65, 0x6c, 0xa0, 0xa6, 0x1a, 0x23, 0x1a, 0x23, 0x1d, 0x53, 0x22, 0x5f, 0x23, 0x65, 0x5e, 0x6c, 0x65, 0x9f, 0x92, 0xa5, 0x1d, 0x5e, 0x51, 0x6b, 0x22, 0x64, 0x5f, 0x9e, 0x51, 0x92, 0x6b, 0xa5, 0x6b, 0xa5, 0xa5, 0xbb, 0x1, 0x1, 0x1, 0x1, 0x1, 0x15, 0x16, 0x16, 0x1c, 0x1c, 0x21, 0x26, 0x55, 0x57, 0x63, 0x6a, 0x1, 0x16, 0x1c, 0x21, 0x15, 0x1c, 0x21, 0x57, 0x1c, 0x26, 0x62, 0x69, 0x57, 0x69, 0x96, 0xa4, 0x15, 0x20, 0x15, 0x20, 0x1b, 0x21, 0x1b, 0x54, 0x20, 0x62, 0x54, 0x69, 0x62, 0x95, 0x69, 0xa1, 0x1b, 0x54, 0x20, 0x69, 0x1b, 0x61, 0x54, 0x94, 0x4e, 0x69, 0x69, 0xa3, 0x62, 0xa1, 0xa1, 0xba, 0x1b, 0x1b, 0x1b, 0x1b, 0x1e, 0x1e, 0x20, 0x4e, 0x53, 0x54, 0x5f, 0x61, 0x67, 0x8f, 0x94, 0x9f, 0x1e, 0x4e, 0x53, 0x61, 0x4c, 0x53, 0x61, 0x8d, 0x5e, 0x66, 0x8f, 0x9f, 0x8d, 0x9f, 0xb3, 0xb9, 0x4c, 0x5e, 0x4c, 0x5f, 0x4e, 0x61, 0x53, 0x8c, 0x5e, 0x8f, 0x66, 0x9e, 0x8f, 0xb2, 0x9e, 0xb9, 0x4c, 0x8c, 0x5f, 0x9e, 0x53, 0x8c, 0x8c, 0xb2, 0x5e, 0x9e, 0x9e, 0xb9, 0x9e, 0xb9, 0xb9, 0xc3, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x19, 0x19, 0x19, 0x24, 0x27, 0x55, 0x5d, 0x5d, 0x1, 0x1, 0x19, 0x1f, 0x1, 0x18, 0x1f, 0x27, 0x19, 0x24, 0x5a, 0x68, 0x55, 0x5d, 0x6f, 0x9d, 0x1, 0x19, 0x15, 0x19, 0x15, 0x1f, 0x18, 0x26, 0x19, 0x5a, 0x24, 0x5d, 0x5a, 0x6d, 0x68, 0x9c, 0x15, 0x24, 0x1f, 0x5c, 0x17, 0x54, 0x52, 0x6d, 0x1f, 0x67, 0x5c, 0x9c, 0x5a, 0x9a, 0x9a, 0xb8, 0x17, 0x17, 0x17, 0x17, 0x18, 0x1e, 0x1e, 0x1f, 0x24, 0x52, 0x59, 0x59, 0x67, 0x67, 0x93, 0x99, 0x17, 0x1e, 0x52, 0x59, 0x1e, 0x52, 0x59, 0x67, 0x52, 0x59, 0x8d, 0x99, 0x67, 0x93, 0xa3, 0xb7, 0x1e, 0x52, 0x1e, 0x58, 0x1e, 0x59, 0x51, 0x64, 0x58, 0x8d, 0x64, 0x93, 0x8d, 0xa2, 0x98, 0xb6, 0x4c, 0x64, 0x58, 0x93, 0x51, 0x8c, 0x64, 0xa2, 0x58, 0x98, 0x93, 0xb6, 0x92, 0xb6, 0xb6, 0xc2, 0x16, 0x16, 0x16, 0x16, 0x16, 0x16, 0x19, 0x19, 0x21, 0x4f, 0x50, 0x56, 0x57, 0x5d, 0x6a, 0x91, 0x16, 0x19, 0x4f, 0x50, 0x18, 0x21, 0x50, 0x5c, 0x4f, 0x56, 0x68, 0x91, 0x5c, 0x90, 0x9c, 0xb5, 0x18, 0x4f, 0x18, 0x50, 0x1e, 0x50, 0x4d, 0x5b, 0x4f, 0x67, 0x56, 0x90, 0x67, 0x9b, 0x91, 0xb4, 0x1e, 0x56, 0x50, 0x90, 0x4c, 0x66, 0x5b, 0x97, 0x50, 0x90, 0x90, 0xb4, 0x8d, 0xb3, 0xb3, 0xc1, 0x4c, 0x4c, 0x4c, 0x4c, 0x4d, 0x4d, 0x4f, 0x50, 0x56, 0x5b, 0x8c, 0x8c, 0x8e, 0x91, 0x9b, 0xb2, 0x4c, 0x4f, 0x5b, 0x8c, 0x4d, 0x5b, 0x8c, 0x8e, 0x66, 0x8c, 0x91, 0xb2, 0x8e, 0xb2, 0xb3, 0xc0, 0x4c, 0x8c, 0x4c, 0x8c, 0x4f, 0x8c, 0x5b, 0x8c, 0x8c, 0x90, 0x8c, 0xb2, 0x91, 0xb2, 0xb2, 0xc0, 0x4c, 0x8c, 0x8c, 0xb2, 0x5b, 0x8c, 0x8c, 0xb2, 0x8c, 0xb2, 0xb2, 0xc0, 0xb2, 0xc0, 0xc0, 0xc5, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x10, 0x14, 0x14, 0x14, 0x14, 0x0, 0x0, 0x0, 0xa, 0x0, 0x0, 0x7, 0x14, 0x0, 0x10, 0x14, 0x3c, 0x14, 0x14, 0x4b, 0x4b, 0x0, 0x0, 0x0, 0x2, 0x0, 0xa, 0x0, 0x12, 0x2, 0x14, 0x10, 0x31, 0x14, 0x48, 0x3c, 0x4b, 0x0, 0x10, 0xa, 0x31, 0x0, 0x13, 0x10, 0x47, 0xa, 0x3c, 0x31, 0x7a, 0x14, 0x4b, 0x4b, 0x8b, 0x0, 0x0, 0x0, 0x0, 0x2, 0x7, 0x7, 0xa, 0xf, 0x10, 0x12, 0x13, 0x3c, 0x41, 0x47, 0x4a, 0x2, 0xa, 0x10, 0x12, 0x7, 0xf, 0x12, 0x3c, 0x10, 0x30, 0x41, 0x4a, 0x41, 0x48, 0x84, 0x8a, 0x7, 0x10, 0x7, 0x10, 0xa, 0x2f, 0xf, 0x3b, 0x10, 0x41, 0x3a, 0x48, 0x41, 0x83, 0x76, 0x89, 0xa, 0x3a, 0x2d, 0x47, 0xf, 0x40, 0x3b, 0x82, 0x2d, 0x76, 0x47, 0x89, 0x47, 0x89, 0x89, 0xb1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x3, 0x3, 0x9, 0x9, 0xe, 0x13, 0x31, 0x33, 0x3f, 0x46, 0x0, 0x3, 0x9, 0xe, 0x2, 0x9, 0xe, 0x33, 0x9, 0x13, 0x3e, 0x45, 0x33, 0x45, 0x7a, 0x88, 0x2, 0xd, 0x2, 0xd, 0x8, 0xe, 0x8, 0x30, 0xd, 0x3e, 0x30, 0x45, 0x3e, 0x79, 0x45, 0x85, 0x8, 0x30, 0xd, 0x45, 0x8, 0x3d, 0x30, 0x78, 0x2a, 
0x45, 0x45, 0x87, 0x3e, 0x85, 0x85, 0xb0, 0x8, 0x8, 0x8, 0x8, 0xb, 0xb, 0xd, 0x2a, 0x2f, 0x30, 0x3b, 0x3d, 0x43, 0x73, 0x78, 0x83, 0xb, 0x2a, 0x2f, 0x3d, 0x28, 0x2f, 0x3d, 0x71, 0x3a, 0x42, 0x73, 0x83, 0x71, 0x83, 0xa9, 0xaf, 0x28, 0x3a, 0x28, 0x3b, 0x2a, 0x3d, 0x2f, 0x70, 0x3a, 0x73, 0x42, 0x82, 0x73, 0xa8, 0x82, 0xaf, 0x28, 0x70, 0x3b, 0x82, 0x2f, 0x70, 0x70, 0xa8, 0x3a, 0x82, 0x82, 0xaf, 0x82, 0xaf, 0xaf, 0xbf, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x6, 0x6, 0x6, 0x11, 0x14, 0x31, 0x39, 0x39, 0x0, 0x0, 0x6, 0xc, 0x0, 0x5, 0xc, 0x14, 0x6, 0x11, 0x36, 0x44, 0x31, 0x39, 0x4b, 0x81, 0x0, 0x6, 0x2, 0x6, 0x2, 0xc, 0x5, 0x13, 0x6, 0x36, 0x11, 0x39, 0x36, 0x49, 0x44, 0x80, 0x2, 0x11, 0xc, 0x38, 0x4, 0x30, 0x2e, 0x49, 0xc, 0x43, 0x38, 0x80, 0x36, 0x7e, 0x7e, 0xae, 0x4, 0x4, 0x4, 0x4, 0x5, 0xb, 0xb, 0xc, 0x11, 0x2e, 0x35, 0x35, 0x43, 0x43, 0x77, 0x7d, 0x4, 0xb, 0x2e, 0x35, 0xb, 0x2e, 0x35, 0x43, 0x2e, 0x35, 0x71, 0x7d, 0x43, 0x77, 0x87, 0xad, 0xb, 0x2e, 0xb, 0x34, 0xb, 0x35, 0x2d, 0x40, 0x34, 0x71, 0x40, 0x77, 0x71, 0x86, 0x7c, 0xac, 0x28, 0x40, 0x34, 0x77, 0x2d, 0x70, 0x40, 0x86, 0x34, 0x7c, 0x77, 0xac, 0x76, 0xac, 0xac, 0xbe, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x6, 0x6, 0xe, 0x2b, 0x2c, 0x32, 0x33, 0x39, 0x46, 0x75, 0x3, 0x6, 0x2b, 0x2c, 0x5, 0xe, 0x2c, 0x38, 0x2b, 0x32, 0x44, 0x75, 0x38, 0x74, 0x80, 0xab, 0x5, 0x2b, 0x5, 0x2c, 0xb, 0x2c, 0x29, 0x37, 0x2b, 0x43, 0x32, 0x74, 0x43, 0x7f, 0x75, 0xaa, 0xb, 0x32, 0x2c, 0x74, 0x28, 0x42, 0x37, 0x7b, 0x2c, 0x74, 0x74, 0xaa, 0x71, 0xa9, 0xa9, 0xbd, 0x28, 0x28, 0x28, 0x28, 0x29, 0x29, 0x2b, 0x2c, 0x32, 0x37, 0x70, 0x70, 0x72, 0x75, 0x7f, 0xa8, 0x28, 0x2b, 0x37, 0x70, 0x29, 0x37, 0x70, 0x72, 0x42, 0x70, 0x75, 0xa8, 0x72, 0xa8, 0xa9, 0xbc, 0x28, 0x70, 0x28, 0x70, 0x2b, 0x70, 0x37, 0x70, 0x70, 0x74, 0x70, 0xa8, 0x75, 0xa8, 0xa8, 0xbc, 0x28, 0x70, 0x70, 0xa8, 0x37, 0x70, 0x70, 0xa8, 0x70, 0xa8, 0xa8, 0xbc, 0xa8, 0xbc, 0xbc, 0xc4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x10, 0x14, 0x14, 0x14, 0x14, 0x0, 0x0, 0x0, 0xa, 0x0, 0x0, 0x7, 0x14, 0x0, 0x10, 0x14, 0x3c, 0x14, 0x14, 0x4b, 0x4b, 0x0, 0x0, 0x0, 0x2, 0x0, 0xa, 0x0, 0x12, 0x2, 0x14, 0x10, 0x31, 0x14, 0x48, 0x3c, 0x4b, 0x0, 0x10, 0xa, 0x31, 0x0, 0x13, 0x10, 0x47, 0xa, 0x3c, 0x31, 0x7a, 0x14, 0x4b, 0x4b, 0x8b, 0x0, 0x0, 0x0, 0x0, 0x2, 0x7, 0x7, 0xa, 0xf, 0x10, 0x12, 0x13, 0x3c, 0x41, 0x47, 0x4a, 0x2, 0xa, 0x10, 0x12, 0x7, 0xf, 0x12, 0x3c, 0x10, 0x30, 0x41, 0x4a, 0x41, 0x48, 0x84, 0x8a, 0x7, 0x10, 0x7, 0x10, 0xa, 0x2f, 0xf, 0x3b, 0x10, 0x41, 0x3a, 0x48, 0x41, 0x83, 0x76, 0x89, 0xa, 0x3a, 0x2d, 0x47, 0xf, 0x40, 0x3b, 0x82, 0x2d, 0x76, 0x47, 0x89, 0x47, 0x89, 0x89, 0xb1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x3, 0x3, 0x9, 0x9, 0xe, 0x13, 0x31, 0x33, 0x3f, 0x46, 0x0, 0x3, 0x9, 0xe, 0x2, 0x9, 0xe, 0x33, 0x9, 0x13, 0x3e, 0x45, 0x33, 0x45, 0x7a, 0x88, 0x2, 0xd, 0x2, 0xd, 0x8, 0xe, 0x8, 0x30, 0xd, 0x3e, 0x30, 0x45, 0x3e, 0x79, 0x45, 0x85, 0x8, 0x30, 0xd, 0x45, 0x8, 0x3d, 0x30, 0x78, 0x2a, 0x45, 0x45, 0x87, 0x3e, 0x85, 0x85, 0xb0, 0x8, 0x8, 0x8, 0x8, 0xb, 0xb, 0xd, 0x2a, 0x2f, 0x30, 0x3b, 0x3d, 0x43, 0x73, 0x78, 0x83, 0xb, 0x2a, 0x2f, 0x3d, 0x28, 0x2f, 0x3d, 0x71, 0x3a, 0x42, 0x73, 0x83, 0x71, 0x83, 0xa9, 0xaf, 0x28, 0x3a, 0x28, 0x3b, 0x2a, 0x3d, 0x2f, 0x70, 0x3a, 0x73, 0x42, 0x82, 0x73, 0xa8, 0x82, 0xaf, 0x28, 0x70, 0x3b, 0x82, 0x2f, 0x70, 0x70, 0xa8, 0x3a, 0x82, 0x82, 0xaf, 0x82, 0xaf, 0xaf, 0xbf, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x6, 0x6, 0x6, 0x11, 0x14, 0x31, 0x39, 0x39, 0x0, 0x0, 0x6, 0xc, 0x0, 0x5, 0xc, 0x14, 0x6, 0x11, 0x36, 0x44, 0x31, 0x39, 0x4b, 0x81, 0x0, 0x6, 0x2, 0x6, 0x2, 0xc, 0x5, 0x13, 
0x6, 0x36, 0x11, 0x39, 0x36, 0x49, 0x44, 0x80, 0x2, 0x11, 0xc, 0x38, 0x4, 0x30, 0x2e, 0x49, 0xc, 0x43, 0x38, 0x80, 0x36, 0x7e, 0x7e, 0xae, 0x4, 0x4, 0x4, 0x4, 0x5, 0xb, 0xb, 0xc, 0x11, 0x2e, 0x35, 0x35, 0x43, 0x43, 0x77, 0x7d, 0x4, 0xb, 0x2e, 0x35, 0xb, 0x2e, 0x35, 0x43, 0x2e, 0x35, 0x71, 0x7d, 0x43, 0x77, 0x87, 0xad, 0xb, 0x2e, 0xb, 0x34, 0xb, 0x35, 0x2d, 0x40, 0x34, 0x71, 0x40, 0x77, 0x71, 0x86, 0x7c, 0xac, 0x28, 0x40, 0x34, 0x77, 0x2d, 0x70, 0x40, 0x86, 0x34, 0x7c, 0x77, 0xac, 0x76, 0xac, 0xac, 0xbe, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x6, 0x6, 0xe, 0x2b, 0x2c, 0x32, 0x33, 0x39, 0x46, 0x75, 0x3, 0x6, 0x2b, 0x2c, 0x5, 0xe, 0x2c, 0x38, 0x2b, 0x32, 0x44, 0x75, 0x38, 0x74, 0x80, 0xab, 0x5, 0x2b, 0x5, 0x2c, 0xb, 0x2c, 0x29, 0x37, 0x2b, 0x43, 0x32, 0x74, 0x43, 0x7f, 0x75, 0xaa, 0xb, 0x32, 0x2c, 0x74, 0x28, 0x42, 0x37, 0x7b, 0x2c, 0x74, 0x74, 0xaa, 0x71, 0xa9, 0xa9, 0xbd, 0x28, 0x28, 0x28, 0x28, 0x29, 0x29, 0x2b, 0x2c, 0x32, 0x37, 0x70, 0x70, 0x72, 0x75, 0x7f, 0xa8, 0x28, 0x2b, 0x37, 0x70, 0x29, 0x37, 0x70, 0x72, 0x42, 0x70, 0x75, 0xa8, 0x72, 0xa8, 0xa9, 0xbc, 0x28, 0x70, 0x28, 0x70, 0x2b, 0x70, 0x37, 0x70, 0x70, 0x74, 0x70, 0xa8, 0x75, 0xa8, 0xa8, 0xbc, 0x28, 0x70, 0x70, 0xa8, 0x37, 0x70, 0x70, 0xa8, 0x70, 0xa8, 0xa8, 0xbc, 0xa8, 0xbc, 0xbc, 0xc4};

uint64_t result(uint32_t x) {
    x >>= 2;
    uint32_t d = x >> 1;
    uint64_t data;
    /* fetch the packed 16 x 3-bit error pattern for this group of 16 inputs */
    memcpy(&data, &lut2[6 * lut1[x >> 4]], sizeof(uint64_t));
    /* (x & ~1)^2 >> 15 minus the stored error (biased by +1) */
    return ((d * d) >> 13) - ((data >> (3 * (x & 0xf))) & 0x7) + 1;
}

I suspect it'll be slower than the other algorithms though, and it's not so elegant.

As chtz mentioned in a comment, the result only ever differs from the full-width multiplication by an integer between 0 and 7. A packed lookup table of those 3-bit differences would be 12KB, which is a bit large (but maybe worth a shot if the rest of the application isn't memory intensive). To make it smaller, we use the fact that the error is highly repetitive.

Consider consecutive groups of sixteen 3-bit errors. There are a bit more than 255 distinct patterns if we compare with x*x>>15, so a pattern index would not fit in a byte. (I don't remember exactly how many there are.) But there are only 198 distinct patterns, with errors between -1 and 6, if we compare with (x&~1)*(x&~1)>>15; intuitively, with the last bit removed, the product is less noisy. That fits in a byte. Thus we have one 2048-byte lookup table that maps each group of 16 inputs to a pattern index, and a second lookup table (made up of 198 six-byte patterns) containing the patterns themselves.

I should also note that the code assumes the machine is little endian; beyond that, it should be cross-platform.
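For anyone who wants to regenerate the tables, here is a rough sketch of a generator. It is my reconstruction rather than the exact program used to produce the bytes above, so the pattern ordering (and therefore the exact table contents) may come out different; it assumes the stored 3-bit value is the error plus 1, with the error measured against ((xs >> 1) * (xs >> 1)) >> 13 where xs = x >> 2, as implied by result() above, and it uses truncated_squarer_ref() from the question.

#include <stdint.h>
#include <string.h>
#include <stdio.h>

uint32_t truncated_squarer_ref (uint32_t x);  /* defined in the question */

int main (void)
{
    static uint8_t patterns[2048][6];  /* 16 x 3-bit biased errors, packed little-endian */
    static uint8_t index_tab[2048];    /* one pattern index per group of 16 inputs */
    int npat = 0;

    for (uint32_t grp = 0; grp < 2048; grp++) {
        uint8_t packed[6] = {0};
        for (uint32_t j = 0; j < 16; j++) {
            uint32_t xs = (grp << 4) | j;              /* xs = x >> 2 */
            uint32_t d = xs >> 1;
            uint32_t base = (d * d) >> 13;
            uint32_t err = base - truncated_squarer_ref (xs << 2) + 1; /* expected to be in [0,7] */
            int pos = 3 * (int)j;                      /* bit offset within the 48-bit pattern */
            packed[pos / 8] |= (uint8_t)(err << (pos % 8));
            if ((pos % 8) > 5)                         /* field straddles a byte boundary */
                packed[pos / 8 + 1] |= (uint8_t)(err >> (8 - pos % 8));
        }
        int k;
        for (k = 0; k < npat; k++)
            if (memcmp (patterns[k], packed, 6) == 0) break;
        if (k == npat) memcpy (patterns[npat++], packed, 6);
        index_tab[grp] = (uint8_t)k;                   /* 198 distinct patterns fit in a byte */
    }
    printf ("%d distinct patterns\n", npat);
    return 0;
}

The packed patterns would then be dumped as lut2 (padded with a couple of zero bytes so the 8-byte memcpy in result() never reads past the end) and index_tab as lut1.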

answered Dec 14 '25 by Ovinus Real