I am googling the question for past hour, but there are only points to Taylor Series or some sample code that is either too slow or does not compile at all. Well, most answer I've found over Google is "Google it, it's already asked", but sadly it's not...
I am profiling my game on low-end Pentium 4 and found out that ~85% of execution time is wasted on calculating sinus, cosinus and square root (from standard C++ library in Visual Studio), and this seems to be heavily CPU dependent (on my I7 the same functions got only 5% of execution time, and the game is waaaaaaaaaay faster). I cannot optimize this three functions out, nor calculate both sine and cosine in one pass (there interdependent), but I don't need too accurate results for my simulation, so I can live with faster approximation.
So, the question: What are the fastest way to calculate sine, cosine and square root for float in C++?
EDIT Lookup table are more painful as resulting Cache Miss is way more costly on modern CPU than Taylor Series. The CPUs are just so fast these days, and cache is not.
I made a mistake, I though that I need to calculate several factorials for Taylor Series, and I see now they can be implemented as constants.
So the updated question: is there any speedy optimization for square root as well?
EDIT2
I am using square root to calculate distance, not normalization - can't use fast inverse square root algorithm (as pointed in comment: http://en.wikipedia.org/wiki/Fast_inverse_square_root
EDIT3
I also cannot operate on squared distances, I need exact distance for calculations
Here's the guaranteed fastest possible sine function in C++:
double FastSin(double x)
{
    return 0;
}
Oh, you wanted better accuracy than |1.0|? Well, here is a sine function that is similarly fast:
double FastSin(double x)
{
    return x;
}
This answer actually does not suck, when x is close to zero. For small x, sin(x) is approximately equal to x, because x is the first term of the Taylor expansion of sin(x).
What, still not accurate enough for you? Well read on.
Engineers in the 1970s made some fantastic discoveries in this field, but new programmers are simply unaware that these methods exist, because they're not taught as part of standard computer science curricula.
You need to start by understanding that there is no "perfect" implementation of these functions for all applications. Therefore, superficial answers to questions like "which one is fastest" are guaranteed to be wrong.
Most people who ask this question don't understand the importance of the tradeoffs between performance and accuracy. In particular, you are going to have to make some choices regarding the accuracy of the calculations before you do anything else. How much error can you tolerate in the result? 10^-4? 10^-16?
Unless you can quantify the error in any method, don't use it. See all those random answers below mine, that post a bunch of random uncommented source code, without clearly documenting the algorithm used and its exact maximum error across the input range? "The error is approximately sort of mumble mumble I would guess." That's strictly bush league. If you don't know how to calculate the PRECISE maximum error, to FULL precision, in your approximation function, across the ENTIRE range of the inputs... then you don't know how to write an approximation function!
No one uses the Taylor series alone to approximate transcendentals in software. Except for certain highly specific cases, Taylor series generally approach the target slowly across common input ranges.
The algorithms that your grandparents used to calculate transcendentals efficiently, are collectively referred to as CORDIC and were simple enough to be implemented in hardware. Here is a well documented CORDIC implementation in C. CORDIC implementations, usually, require a very small lookup table, but most implementations don't even require that a hardware multiplier be available. Most CORDIC implementations let you trade off performance for accuracy, including the one I linked.
There's been a lot of incremental improvements to the original CORDIC algorithms over the years. For example, last year some researchers in Japan published an article on an improved CORDIC with better rotation angles, which reduces the operations required.
If you have hardware multipliers sitting around (and you almost certainly do), or if you can't afford a lookup table like CORDIC requires, you can always use a Chebyshev polynomial to do the same thing. Chebyshev polynomials require multiplies, but this is rarely a problem on modern hardware. We like Chebyshev polynomials because they have highly predictable maximum errors for a given approximation. The maximum of the last term in a Chebyshev polynomial, across your input range, bounds the error in the result. And this error gets smaller as the number of terms gets larger. Here's one example of a Chebyshev polynomial giving a sine approximation across a huge range, ignoring the natural symmetry of the sine function and just solving the approximation problem by throwing more coefficients at it. And here's an example of estimating a sine function to within 5 ULPs. Don't know what an ULP is? You should.
We also like Chebyshev polynomials because the error in the approximation is equally distributed across the range of outputs. If you're writing audio plugins or doing digital signal processing, Chebyshev polynomials give you a cheap and predictable dithering effect "for free."
If you want to find your own Chebyshev polynomial coefficients across a specific range, many math libraries call the process of finding those coefficients "Chebyshev fit" or something like that.
Square roots, then as now, are usually calculated with some variant of the Newton-Raphson algorithm, usually with a fixed number of iterations. Usually, when someone cranks out an "amazing new" algorithm for doing square roots, it's merely Newton-Raphson in disguise.
Newton-Raphson, CORDIC, and Chebyshev polynomials let you trade off speed for accuracy, so the answer can be just as imprecise as you want.
Lastly, when you've finished all your fancy benchmarking and micro-optimization, make sure that your "fast" version is actually faster than the library version. Here is a typical library implementation of fsin() bounded on the domain from -pi/4 to pi/4. And it just ain't that damned slow.
One last caution to you: you're almost certainly using IEEE-754 math to perform your estimations, and anytime you're performing IEEE-754 math with a bunch of multiplies, then some obscure engineering decisions made decades ago will come back to haunt you, in the form of roundoff errors. And those errors start small, but they get bigger, and Bigger, and BIGGER! At some point in your life, please read "What every computer scientist should know about floating point numbers" and have the appropriate amount of fear. Keep in mind that if you start writing your own transcendental functions, you'll need to benchmark and measure the ACTUAL error due to floating-point roundoff, not just the maximum theoretical error. This is not a theoretical concern; "fast math" compilation settings have bit me in the butt, on more than one project.
tl:dr; go google "sine approximation" or "cosine approximation" or "square root approximation" or "approximation theory."
First, the Taylor series is NOT the best/fastest way to implement sine/cos. It is also not the way professional libraries implement these trigonometric functions, and knowing the best numerical implementation allows you to tweak the accuracy to get speed more efficiently. In addition, this problem has already been extensively discussed in StackOverflow. Here is just one example.
Second, the big difference you see between old/new PCS is due to the fact that modern Intel architecture has explicit assembly code to calculate elementary trigonometric functions. It is quite hard to beat them on execution speed.
Finally, let's talk about the code on your old PC. Check gsl gnu scientific library (or numerical recipes) implementation, and you will see that they basically use Chebyshev Approximation Formula.
Chebyshev approximation converges faster, so you need to evaluate fewer terms. I won't write implementation details here because there are already very nice answers posted on StackOverflow. Check this one for example . Just tweak the number of terms on this series to change the balance between accuracy/speed.
For this kind of problem, if you want implementation details of some special function or numerical method, you should take a look on GSL code before any further action - GSL is THE STANDARD numerical library.
EDIT: you may improve the execution time by including aggressive floating point optimization flags in gcc/icc. This will decrease the precision, but it seems that is exactly what you want.
EDIT2: You can try to make a coarse sin grid and use gsl routine (gsl_interp_cspline_periodic for spline with periodic conditions) to spline that table (the spline will reduce the errors in comparison to a linear interpolation => you need less points on your table => less cache miss)!
Based on the idea of 1 (web archive 2) and some manual rewriting to improve the performance in a micro benchmark I ended up with the following cosine implementation which is used in a HPC physics simulation that is bottlenecked by repeated cos calls on a large number space. It's accurate enough and much faster than a lookup table, most notably no division is required.
template<typename T>
inline T cos(T x) noexcept
{
    constexpr T tp = 1./(2.*M_PI);
    x *= tp;
    x -= T(.25) + std::floor(x + T(.25));
    x *= T(16.) * (std::abs(x) - T(.5));
    #if EXTRA_PRECISION
    x += T(.225) * x * (std::abs(x) - T(1.));
    #endif
    return x;
}
The Intel compiler at least is also smart enough in vectorizing this function when used in a loop.
If EXTRA_PRECISION is defined, the maximum error is about 0.00109 for the range -π to π, assuming T is double as it's usually defined in most C++ implementations. Otherwise, the maximum error is about 0.056 for the same range.
Here's a godbolt playground to show that clang also happily autovectorizes this: https://godbolt.org/z/35qj5f1zc
For square root, there is an approach called bit shift.
A float number defined by IEEE-754 is using some certain bit represent describe times of multiple based 2. Some bits are for represent the base value.
float squareRoot(float x)
{
  unsigned int i = *(unsigned int*) &x;
  // adjust bias
  i  += 127 << 23;
  // approximation of square root
  i >>= 1;
  return *(float*) &i;
}
That's a constant time calculating the squar root
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With