I have a haskell code to resolve a Fast Fourier Transformation, and i want to apply data parallelism on it. However, every strategy that i use generate too many sparks and most of them are being overflowed.
Does anyone have any idea on how to apply a good data parallelism strategy on the following algorithm:
-- radix 2 Cooley-Tukey FFT
fft::[Complex Float] -> [Complex Float]
fft [a] = [a]
fft as = interleave ls rs
where
(cs,ds) = bflyS as
ls = fft cs
rs = fft ds
interleave [] bs = bs
interleave (a:as) bs = a : interleave bs as
bflyS :: [Complex Float] -> ([Complex Float], [Complex Float])
bflyS as = (los,rts)
where
(ls,rs) = halve as
los = zipWith (+) ls rs
ros = zipWith (-) ls rs
rts = zipWith (*) ros [tw (length as) i | i <- [0..(length ros) - 1]]
halve as = splitAt n' as
where
n' = div (length as + 1) 2
-- twiddle factors
tw :: Int -> Int -> Complex Float
tw n k = cis (-2 * pi * fromIntegral k / fromIntegral n)
PAR MONAD
The answer from leftaroundabout helped me a lot about understanging on how to apply data parallelism on the code. However, i have studied the par monad and tried to apply task parallelism to it. The problem is that it is running way slower than the original bflyS. I think the code i developed is way to expensive to create threads comparing to the relative work I am doing. Does anyone know how to use the par monad in a better way ?
--New Task Parallelism bflyS
bflySTask :: [Complex Float] -> ([Complex Float], [Complex Float])
bflySTask as = runPar $ do
let (ls, rs) = halve as
v1<-new
v2<-new
let ros = DATA.zipWith (-) ls rs
let aux = DATA.map (tw n) [0..n-1]
fork $ put v1 (DATA.zipWith (+) ls rs)
fork $ put v2 (DATA.zipWith (*) ros aux)
los <- get v1
rts <- get v2
return (los, rts)
where
n = DATA.length as
First off: there's a lot of optimisation to be done here before I'd start to think about parallelism:
Lists rock, but their non-consecutive memory model means they just can't allow for traversals nearly as fast as what you can achieve with tight arrays such as Data.Vector, because you inevitably end up with lots of cache misses. Indeed I've seldom seen a list-based algorithm to gain much from parallelisation, because they're bound by memory- rather than CPU performance.
Your twiddle factors are computed over and over again, you can obviously gain a lot through memoisation here.
You keep on calling length, but that's an extremely wasteful function (O (n) for something that could be O (1)). Use some container that probably handles length; lists aren't meant to (we like to keep their ability to be infinite).
The parallelisation itself will be pretty simple; I'd check on the length as suggested by John L, indeed I'd rather require a pretty large size before sparking a thread, at least something like 256: as the performance probably becomes crucial only at sizes of several thousands, this should sill be enough threads to keep your cores busy.
import Data.Vector.Unboxed as UBV
import Control.Parallel.Strategies
type ℂ = Complex Float
fft' :: UBV.Vector ℂ -> UBV.Vector ℂ
fft' aₓs = interleave lᵥs rᵥs
where (lᵥs, rᵥs) = (fft lₓs, fft rₓs)
`using` if UBV.length aₓs > 256 then parTuple2 else r0
(lₓs, rₓs) = byflyS aₓs
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With