I am working on an architecture which does not feature floating point hardware, but only a 16 bit ALU and a 40 bit MAC.
I have already implemented 32-bit single precision floating point addition/subtraction, multiplication, cosine, sine, division, square root, and range reduction all in software on this architecture.
To implement cosine and sine I first used range reduction using the method described in the paper "ARGUMENT REDUCTION FOR HUGE ARGUMENTS" by K.C. NG I then implemented a cosine and sine function which are polynomial approximations to the cosine and sine functions on the range -pi/4 to +pi/4. I referred to the book "Computer Approximations", Hart, et al. for the polynomials.
I have also heard that I should consider the CORDIC algorithm. However, I was wondering if anyone knows if it would be more or less efficient (in terms of throughput, memory overhead, and number of instructions required) than the method I already used? I have implemented my software functions on a multicore architecture where each core features only 128 words of instruction memory and a 128 word 16-bit data memory. Also I have tried searching for how to implement the CORDIC algorithm for cosine and sine, but I couldn't find any good resources for 32-bit floating point implementations. Does anybody have suggestions?
Thank you!
In 1959, Volder [17], introduced the CORDIC algorithm in order to compute approximations of trigonometric functions. This method is still used because of its adequacy to hardware design.
CORDIC is a method of calculating a math function using much simpler math operations in a loop called a Binary Search. Most commonly CORDIC is used to calculate ATAN2 (Angle), and Hypotenuse (Distance) of a point. CORDIC can also be used to calculate other math functions like SIN and COS.
CORDIC is a hardware-efficient iterative method which uses rotations to calculate a wide range of elementary functions. CORDIC (coordinate rotation digital computer) is a hardware-efficient iterative method which uses rotations to calculate a wide range of elementary functions.
CORDIC, an acronym for COordinate Rotation DIgital Computer, is a class of shift-add algorithms that rotate a vector in a plane.
CORDIC gives you one bit per loop iteration, so implementing it in software will likely be slower than your polynomial version. That may also be why it is hard to find articles on software implementations of CORDIC: its performance is inferior, so nobody bothers.
Re your comment:  Horner's method is the practice of evaluating polynomials from highest-order coefficient to lowest, by repeatedly adding the coefficient, then multiplying by the variable x.  In contrast, the naive method (i.e., evaluating the powers of x first, then multiplying them by their coefficients and adding them together) takes more work and can be less numerically stable than Horner's method.
You haven't mentioned exactly how you're trying to evaluate your polynomials, so I will suggest a formula:
x2 = x * x
cos = ((COS_D * x2 + COS_C) * x2 + COS_B) * x2 + COS_A
sin = (((SIN_D * x2 + SIN_C) * x2 + SIN_B) * x2 + SIN_A) * x
Note that you can get better precision if you adapt your constants to the range over which you are evaluating the function, rather than using the Taylor coefficients. (Again, apologies if you have done some or all of these things, but you didn't mention what you had already tried...)
This is probably less relevant for your case (which presumably has just a 16x16-bit MAC), but if your processor can launch multiple arithmetic evaluations at once, you may be able to get better performance if you write your evaluation in a tree-like form, avoiding some of the sequential dependency of operations:
x2 = x * x
x4 = x2 * x2
cos = (COS_D * x2 + COS_C) * x4 + (COS_B * x2 + COS_A)
sin = ((SIN_D * x2 + SIN_C) * x4 + (SIN_B * x2 + SIN_A)) * x
If your processor has a vector ALU, this formula also suggests a productive use for it...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With