How can I further optimize this C++ function that takes integer square roots?

Question

In an algorithm I am using, most of the time is spent in the evaluation of square roots using the sqrt function in cmath. In my application I only need the integer part of the square root of integers, for which sqrt seems overly complex, and I tried to do a fast implementation following the manual algorithm for calculating square roots, which becomes very easy in binary, to get

const int BITLEN = sizeof(unsigned int) * 8;
bool DUMMYBOOL;

// digit in base 4 (two bits) starting at bit position i in n
inline int digit(unsigned int n, int i)
{
    return (n >> (BITLEN - i - 2)) & 3;
}

inline int isqrt(unsigned int n, bool* isSquare = &DUMMYBOOL)
{
    int answer = 0;
    int remaining = 0;
    
    for (int i = 0; i < BITLEN; i += 2)
    {
        remaining = (remaining << 2) + digit(n,i);

        answer <<= 1;
        int correction = (answer << 1) + 1;
        if (correction <= remaining)
        {
            remaining -= correction;
            answer++;
        }
    }
    *isSquare = remaining == 0;
    return answer;
}

(Note that it also tells me whether the number was a perfect square). This is fast, but still not as fast as the builtin sqrt. From what I understand, the sqrt is likely directly implemented in hardware (x86_64 with gcc on Linux), and will be hard to beat, and I'm OK with that.

I am still curious though if this function could be optimized further. One thing I tried was to adapt the BITLEN to the size of n using __builtin_clz, but that actually seems to slow it down, even if on average the numbers only have 24 bits or so.

abcdefg · Accepted Answer

A few words about lookup tables with pre-computed values of interest. In this use case, one defines the following

template<int NBITS>
struct ISqrt_LUT
{
    static_assert (NBITS<=32);

    static const auto Nbits = NBITS;
    static const auto Nmax  = 1UL<<NBITS;

    ISqrt_LUT() : lut(Nmax) {
        for (std::size_t i=0; i<lut.size(); i++)  { lut[i] = sqrt(i); }
    }

    auto operator() (std::uint32_t n) const {
        return lut[n];
    }
    std::vector<std::uint16_t> lut;
};

where all values up to 1<<NBITS are computed only once and then can be retrieved as many times as required through the operator() method.

Demo

One could think that one can't do better than a lookup table since the values are already computed. For instance, if we try all successive values from 1 to 1<<NBITS, one gets the following exec times (on my system)

[native       / contiguous values]   duration (ms): 162
[lookup table / contiguous values]   duration (ms): 40

where native refers to a call to the sqrt function. In such a case, one gets a real speedup with a LUT.

OTOH, when the values are random, one gets

[native       / shuffle values   ]   duration (ms): 163
[lookup table / shuffle values   ]   duration (ms): 476

so the LUT takes more time than the native approach: indeed, one gets many costly cache misses which makes the LUT not interesting in that case.

So, LUT are sometimes an interesting tool but can be deceptive when they are too big to fit L1 (or L2, L3) cache.

GoWiser · Answer

Use `consteval` instead of `const`

Declaring a variable as const or constexpr does not guarantie that it will be evaluated at compile time. Use consteval for that.

Define both of the consts you need and make sure they are calculated at compile time

consteval int BITLEN() { return sizeof(unsigned int) * 8; }
consteval int BITLEN_M2() { return BITLEN - 2; }

And then remember to change the places you use BITLEN to BITLEN().

Uninline `digit()` and compute it directly

Declaring a function as inline does not guarantee that it will be inlined. And avoid subtracting twice, by using BITLEN_M2 instead of BITLEN. To achive this, change remaining = (remaining << 2) + digit(n,i) to remaining = (remaining << 2) | ((n >> (BITLEN_M2() - i)) & 3).

Help the compiler if it's too stupid

Change (answer << 1) + 1 to (answer << 1) | 1.

Try to help the branch predictor

You assign to answer twice, once before a conditional, and then inside a conditional. By introducing a local variable and only assigning to answer inside the conditional, you could get a possible speed-up.

Remove the global variable `DUMMYBOOL`

Make the 2nd parameter not-optional, to prevent the default assignment of a parameter (to a global variable that isn't constant, which may prevent compiler optimization) and remember to never call the function with the 2nd parameter equal to null.

Here is the optimized code with all the suggestions

consteval int BITLEN() { return sizeof(unsigned int) * 8; }
consteval int BITLEN_M2() { return BITLEN() - 2; }

int isqrt(unsigned int n, bool* isSquare)
{
    int answer = 0;
    int remaining = 0;
    for (int i = 0; i < BITLEN(); i += 2) {
        int bits = (n >> (BITLEN_M2() - i)) & 3;
        remaining = (remaining << 2) + bits;
        int trial = (answer << 1) | 1;
        if (trial <= remaining) {
            remaining -= trial;
            answer = trial;
        } else {
            answer <<= 1;
        }
    }
    *isSquare = (remaining == 0);
    return answer;
}

Should probably change all the int to unsigned int too, but that's not performance related.

I was thinking that the algorithm could be created without the conditional. And I see that some of the other answers that arrived after mine are very good. When reading them I think, why didn't I think of that :D

How can I further optimize this C++ function that takes integer square roots?

Tags:

c++

optimization

numerical-methods

square-root

inline-assembly

doetoe

2 Answers

abcdefg

Use `consteval` instead of `const`

Define both of the consts you need and make sure they are calculated at compile time

Uninline `digit()` and compute it directly

Help the compiler if it's too stupid

Try to help the branch predictor

Remove the global variable `DUMMYBOOL`

Here is the optimized code with all the suggestions

GoWiser

Recent Activity

Donate For Us

How can I further optimize this C++ function that takes integer square roots?

Tags:

c++

optimization

numerical-methods

square-root

inline-assembly

doetoe

2 Answers

abcdefg

Use consteval instead of const

Define both of the consts you need and make sure they are calculated at compile time

Uninline digit() and compute it directly

Help the compiler if it's too stupid

Try to help the branch predictor

Remove the global variable DUMMYBOOL

Here is the optimized code with all the suggestions

GoWiser

Related questions

Recent Activity

Donate For Us

Use `consteval` instead of `const`

Uninline `digit()` and compute it directly

Remove the global variable `DUMMYBOOL`