So, i would like to convert a part of C code to Haskell. I wrote this part (it's a simplified example of what I want to do) in C, but being the newbie I am in Haskell, I can't really make it work.
float g(int n, float a, float p, float s)
{
int c;
while (n>0)
{
c = n % 2;
if (!c) s += p;
else s -= p;
p *= a;
n--;
}
return s;
}
Anyone got any ideas/solutions?
Lee's translation is already pretty good (well, he confused the odd and even cases(1)), but he fell into a couple of performance traps.
g n a p s =
if n > 0
then
let c = n `mod` 2
s' = (if c == 0 then (-) else (+)) s p
p' = p * a
in g (n-1) a p' s'
else s
He used mod instead of rem. The latter maps to machine division, the former performs additional checks to ensure a non-negative result. Thus mod is a bit slower than rem, and if either satisfies the needs - because they yield identical results in the case where both arguments are non-negative; or because the result is only compared to 0 (both conditions are satisfied here) - rem is preferable. Even better, and a bit more idiomatic is to use even (which uses rem for the reasons mentioned above). The difference is not huge, though.
No type signature. That means that the code is (type-class) polymorphic, and thus no strictness analysis is possible, nor any specialisations. If the code is used in the same module at a specific type, GHC can (and usually will, if optimisations are enabled) create a specialised version for that specific type that allows strictness analysis and some other optimisations (inlining of class methods like (+) etc.), in that case, one does not pay the polymorhism penalty. But if the use site is in a different module, that cannot happen. If (type-class) polymorphic code is desired, one should mark it INLINABLE or INLINE (for GHC < 7), so that its unfolding is exposed in the .hi file and the function can be specialised and optimised at the use site.
Since g is recursive, it cannot be inlined [meaning, GHC cannot inline it; in principle it is possible] at use sites, which often would enable more optimisations than a mere specialisation.
One technique that often allows better optimisation for recursive functions is the worker/wrapper transformation. One creates a wrapper that calls a recursive (local) worker, then the non-recursive wrapper can be inlined, and when the worker is called with known arguments, that can enable further optimisations like constant folding or, in the case of function arguments, inlining. In particular the latter often has an enormous impact, when combined with a static-argument-transformation (arguments that never change in the recursive calls are not passed as arguments to the recursive worker).
In this case, we only have one static argument of type Float, so a worker/wrapper transformation with a SAT typically makes no difference (as a rule of thumb, a SAT pays off when
so by this rule, we shouldn't expect any benefit from w/w + SAT, and in general, there is none). Here we have one special case where w/w + SAT can make a big difference, and that is when the factor a is 1. GHC has {-# RULES #-} that eliminate multiplication by 1 for various types, and with such a short loop body, a multiplication more or less per iteration makes a difference, the running time is reduced by about 40% after points 3 and 4 have been applied. (There are no RULES for multiplication by 0 or by -1 for floating point types because 0*x = 0 resp. (-1)*x = -x don't hold for NaNs.) For all other a, the w/w + SATed
{-# INLINABLE g #-}
g n a p s = worker n p s
where
worker n p s
| n <= 0 = s
| otherwise = let s' = if even n then s + p else s - p
in worker (n-1) a (p*a) s'
does not perform measurably different from the top-level recursive version with the same optimisations done.
Strictness. GHC's strictness analyser is good, but not perfect. It cannot see far enough through the algorithm to determine that the function is
p if n >= 1 (assuming addition - (+) - is strict in both arguments)a if n >= 2 (assuming strictness of (*) in both arguments)and then produce a worker that is strict in both. Instead you get a worker that uses an unboxed Int# for n and an unboxed Float# for s (I'm using the type Int -> Float -> Float -> Float -> Float here, corresponding to the C), and boxed Floats for a and p. Thus in each iteration you get two unboxings and a re-boxing. That costs (relatively) a lot of time, since besides that it's just a bit of simple arithmetic and tests.
Help GHC along a bit, and make the worker (or g itself, if you don't do the worker/wrapper transform) strict in p (bang pattern for example). That is enough to allow GHC producing a worker using unboxed values throughout.
Using division to test parity (not applicable if the type is Int and the LLVM backend is used).
GHC's optimiser hasn't got down to the low-level bits very much yet, so the native code generator emits a division instruction for
x `rem` 2 == 0
and, when the rest of the loop body is as cheap as it is here, that costs a lot of time. LLVM's optimiser has already been taught to replace that with a bitmasking at type Int, so with ghc -O2 -fllvm you don't need to do that manually. With the native code generator, substituting that with
x .&. 1 == 0
(needs import Data.Bits of course) produces a significant speedup (on normal platforms where a bitwise and is much faster than a division).
The final result
{-# INLINABLE g #-}
g n a p s = worker n p s
where
worker k !ap acc
| k > 0 = worker (k-1) (ap*a) (if k .&. (1 :: Int) == 0 then acc + ap else acc - ap)
| otherwise = acc
performs not measurably different (for the tested values) from the result of gcc -O3 -msse2 loop.c, except for a = -1, where gcc replaces the multiplication with a negation (assuming all NaNs equivalent).
(1) He's not alone in that,
c = n % 2;
if (!c) s += p;
else s -= p;
seems to be really tricky, as far as I can see everybody(2) got that wrong.
(2) With one exception ;)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With