I made a function to compute a fixed-point approximation of atan2(y, x). The problem is that of the ~83 cycles it takes to run the whole function, 70 cycles (compiling with gcc 4.9.1 mingw-w64 -O3 on an AMD FX-6100) are taken entirely by a simple 64-bit integer division! And sadly none of the terms of that division are constant. Can I speed up the division itself? Is there any way I can remove it?
I think I need this division because since I approximate atan2(y, x) with a 1D lookup table I need to normalise the distance of the point represented by x,y to something like a unit circle or unit square (I chose a unit 'diamond' which is a unit square rotated by 45°, which gives a pretty even precision across the positive quadrant). So the division finds (|y|-|x|) / (|y|+|x|). Note that the divisor is in 32-bits while the numerator is a 32-bit number shifted 29 bits right so that the result of the division has 29 fractional bits. Also using floating point division is not an option as this function is required not to use floating point arithmetic.
Any ideas? I can't think of anything to improve this (and I can't figure out why it takes 70 cycles just for a division). Here's the full function for reference:
int32_t fpatan2(int32_t y, int32_t x)       // does the equivalent of atan2(y, x)/2pi, y and x are integers, not fixed point
{
    #include "fpatan.h" // includes the atan LUT as generated by tablegen.exe, the entry bit precision (prec), LUT size power (lutsp) and how many max bits |b-a| takes (abdp)
    const uint32_t outfmt = 32; // final output format in s0.outfmt
    const uint32_t ofs=30-outfmt, ds=29, ish=ds-lutsp, ip=30-prec, tp=30+abdp-prec, tmask = (1<<ish)-1, tbd=(ish-tp);   // ds is the division shift, the shift for the index, bit precision of the interpolation, the mask, the precision for t and how to shift from p to t
    const uint32_t halfof = 1UL<<(outfmt-1);    // represents 0.5 in the output format, which since it is in turns means half a circle
    const uint32_t pds=ds-lutsp;    // division shift and post-division shift
    uint32_t lutind, p, t, d;
    int32_t a, b, xa, ya, xs, ys, div, r;
    xs = x >> 31;       // equivalent of fabs()
    xa = (x^xs) - xs;
    ys = y >> 31;
    ya = (y^ys) - ys;
    d = ya+xa;
    if (d==0)       // if both y and x are 0 then they add up to 0 and we must return 0
        return 0;
    // the following does 0.5 * (1. - (y-x) / (y+x))
    // (y+x) is u1.31, (y-x) is s0.31, div is in s1.29
    div = ((int64_t) (ya-xa)<<ds) / d;  // '/d' normalises distance to the unit diamond, immediate result of division is always <= +/-1^ds
    p = ((1UL<<ds) - div) >> 1;     // before shift the format is s2.29. position in u1.29
    lutind = p >> ish;      // index for the LUT
    t = (p & tmask) >> tbd;     // interpolator between two LUT entries
    a = fpatan_lut[lutind];
    b = fpatan_lut[lutind+1];
    r = (((b-a) * (int32_t) t) >> abdp) + (a<<ip);  // linear interpolation of a and b by t in s0.32 format
    // Quadrants
    if (xs)             // if x was negative
        r = halfof - r;     // r = 0.5 - r
    r = (r^ys) - ys;        // if y was negative then r is negated
    return r;
}
Unfortunately a 70 cycles latency is typical for a 64-bit integer division on x86 CPUs. Floating point division typically has about half the latency or less. The increased cost comes from the fact modern CPUs only have dividers in their floating point execution units (they're very expensive in terms silicon area), so need to convert the integers to floating point and back again. So just substituting a floating division in place of the integer one isn't likely to help. You'll need to refactor your code to use floating point instead to take advantage of faster floating point division.
If you're able to refactor your code you might also be able to benefit from the approximate floating-point reciprocal instruction RCPSS, if you don't need an exact answer. It has a latency of around 5 cycles.
Based on @Iwillnotexist Idonotexist's suggestion to use lzcnt, reciprocity and multiplication I implemented a division function that runs in about 23.3 cycles and with a pretty great precision of 1 part in 19 million with a 1.5 kB LUT, e.g. one of the worst cases being for 1428769848 / 1080138864 you might get 1.3227648959 instead of 1.3227649663.
I figured out an interesting technique while researching this, I was really struggling to think of something that could be fast and precise enough, as not even a quadratic approximation of 1/x in [0.5 , 1.0) combined with an interpolated difference LUT would do, then I had the idea of doing it the other way around so I made a lookup table that contains the quadratic coefficients that fit the curve on a short segment that represents 1/128th of the [0.5 , 1.0) curve, which gives you a very small error like so. And using the 7 most significant bits of what represents x in the [0.5 , 1.0) range as a LUT index I directly get the coefficients that work best for the segment that x falls into.
Here's the full code with the lookup tables ffo_lut.h and fpdiv.h:
#include "ffo_lut.h"
static INLINE int32_t log2_ffo32(uint32_t x)    // returns the number of bits up to the most significant set bit so that 2^return > x >= 2^(return-1)
{
    int32_t y;
    y = x>>21;  if (y)  return ffo_lut[y]+21;
    y = x>>10;  if (y)  return ffo_lut[y]+10;
    return ffo_lut[x];
}
// Usage note: for fixed point inputs make outfmt = desired format + format of x - format of y
// The caller must make sure not to divide by 0. Division by 0 causes a crash by negative index table lookup
static INLINE int64_t fpdiv(int32_t y, int32_t x, int32_t outfmt)   // ~23.3 cycles, max error (by division) 53.39e-9
{
    #include "fpdiv.h"  // includes the quadratic coefficients LUT (1.5 kB) as generated by tablegen.exe, the format (prec=27) and LUT size power (lutsp)
    const int32_t *c;
    int32_t xa, xs, p, sh;
    uint32_t expon, frx, lutind;
    const uint32_t ish = prec-lutsp-1, cfs = 31-prec, half = 1L<<(prec-1);  // the shift for the index, the shift for 31-bit xa, the value of 0.5
    int64_t out;
    int64_t c0, c1, c2;
    // turn x into xa (|x|) and sign of x (xs)
    xs = x >> 31;
    xa = (x^xs) - xs;
    // decompose |x| into frx * 2^expon
    expon = log2_ffo32(xa);
    frx = (xa << (31-expon)) >> cfs;    // the fractional part is now in 0.27 format
    // lookup the 3 quadratic coefficients for c2*x^2 + c1*x + c0 then compute the result
    lutind = (frx - half) >> ish;       // range becomes [0, 2^26 - 1], in other words 0.26, then >> (26-lutsp) so the index is lutsp bits
    lutind *= 3;                // 3 entries for each index
    c = &fpdiv_lut[lutind];         // c points to the correct c0, c1, c2
    c0 = c[0];    c1 = c[1];    c2 = c[2];
    p = (int64_t) frx * frx >> prec;    // x^2
    p = c2 * p >> prec;         // c2 * x^2
    p += c1 * frx >> prec;          // + c1 * x
    p += c0;                // + c0, p = (1.0 , 2.0] in 2.27 format
    // apply the necessary bit shifts and reapplies the original sign of x to make final result
    sh = expon + prec - outfmt;     // calculates the final needed shift
    out = (int64_t) y * p;          // format is s31 + 1.27 = s32.27
    if (sh >= 0)
        out >>= sh;
    else
        out <<= -sh;
    out = (out^xs) - xs;            // if x was negative then out is negated
    return out;
}
I think ~23.3 cycles is about as good as it's gonna get for what it does, but if you have any ideas to shave a few cycles off please let me know.
As for the fpatan2() question the solution would be to replace this line:
div = ((int64_t) (ya-xa)<<ds) / d;
with that line:
div = fpdiv(ya-xa, d, ds);
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With