I am new to Haskell, and I've stuck with efficiency issues..
Task is: Build CSV file from 4GB text file where columns have constant size
column sizes are known, for example [col1: 4 chars wide, col2: 2 chars wide, etc...
file can only contain [A-Z0-9] ASCII chars so there is no point in escaping cells
I have: 
$ cat example.txt 
AAAABBCCCC...
AAA1B1CCC1...
... (72 chars per line, usually 50 mln lines)
I need: 
$ cat done.csv
AAAA,BB,CCCC, ...
AAA1,B1,CCC1, ...
...
this is my fastest code in Haskell,
takes about 2 minutes to process entire 4GB file.
I need at maximum 30 seconds
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as U
import Data.ByteString.Lazy.Builder
import Data.Monoid
import Data.List
col_sizes = intercalate [1] $ map (`replicate` 0) cs
  where
    cs = [4, 4, 4, 3, 5, 1, 1, 3, 3, 3, 3, 3, 3, 10, 3, 1, 1, 1, 2, 3, 10]
sp = char8 ',' -- column separator
nl = char8 '\n'
separator !cs !cl !xs !xl !ci !xi
  | c  == 1   = ps
  | xi == xl  = mempty -- at the end of bytestring, end recursion
  | cl == ci  = pr
  | otherwise = pc
  where
    c  = U.unsafeIndex cs ci         -- get column separation indicator
    w  = word8 . U.unsafeIndex xs    -- get char from BS at position
    p  = separator cs cl xs xl       -- partial recursion call
    pr = nl   <> p  0       (xi + 1) -- end of row, put '\n', reset counter, recur
    ps = sp   <> p (ci + 1)  xi      -- end of column, put column separator, recur
    pc = w xi <> p (ci + 1) (xi + 1) -- in the middle of column, copy byte, recur
main = do
  contents <- B.getContents
  BL.putStr . toLazyByteString $ init_sep sp_after_char contents
init_sep cs xs = separator cs (l cs) xs (l xs) 0 0
  where l = fromIntegral . B.length
sp_after_char = B.pack col_sizes
And this is my implementation in C http://pastebin.com/Kjz3Mugs 
(to long to paste it here...) 
takes about 5 seconds to process the same file
So my Haskell code is approx. 20x slower. 
Because Haskell ByteString filter and map are faster than my implementations in C,
(both take less than 2s to process same file doing some simple modifications) 
I hope there is something wrong in my Haskell code and I won't be forced to use C.
time cat test.txt > /dev/null
cat test.txt > /dev/null  0,00s user 0,35s system 99% cpu 0,353 total
Vanilla generator:
time ./data_builder | head -50000000 > /dev/null
./data_builder  0,02s user 1,09s system 30% cpu 3,709 total
head -50000000 > /dev/null  2,95s user 0,76s system 99% cpu 3,708 total
time ./tocsvc < test.txt > /dev/null 
./tocsvc < test.txt > /dev/null  5,35s user 0,35s system 100% cpu 5,689 total
with generator
time ./data_builder | head -50000000 | ./tocsvc > /dev/null
./data_builder  0,02s user 1,18s system 18% cpu 6,460 total
head -50000000  3,15s user 1,19s system 67% cpu 6,459 total
./tocsvc > /dev/null  5,81s user 0,55s system 98% cpu 6,459 total
time ./tocsvh1 < test.txt > /dev/null 
./tocsv < test.txt > /dev/null  19,56s user 0,41s system 100% cpu 19,950 total
with generator
time ./data_builder | head -50000000 | ./tocsvh1 > /dev/null 
./data_builder  0,11s user 3,04s system 7% cpu 41,320 total
head -50000000  7,29s user 3,56s system 26% cpu 41,319 total
./tocsvh2 > /dev/null  33,01s user 2,42s system 85% cpu 41,327 total
time ./tocsvh2 < test.txt > /dev/null 
./tocsvh2 < test.txt > /dev/null  128,63s user 2,95s system 100% cpu 2:11,45 total
with generator
time ./data_builder | head -50000000 | ./tocsvh2 > /dev/null 
./data_builder  0,02s user 1,26s system 28% cpu 4,526 total
head -50000000  3,17s user 1,33s system 99% cpu 4,524 total
./tocsvh2 > /dev/null  129,95s user 3,33s system 98% cpu 2:14,75 total
time ./tocsvh3 < test.txt > /dev/null 
./tocsv < test.txt > /dev/null  324,38s user 4,13s system 100% cpu 5:28,18 total
with generator
time ./data_builder | head -50000000 | ./tocsvh3 > /dev/null 
./data_builder  0,43s user 4,46s system 1% cpu 5:30,34 total
head -50000000  5,20s user 2,82s system 2% cpu 5:30,34 total
./tocsv > /dev/null  329,08s user 4,21s system 100% cpu 5:32,96 total
I was able to get within a factor of 3 of C just by using raw pointer operations:
import Control.Monad (unless, when, void)
import Foreign.Safe hiding (void)
import System.IO
import Foreign.C.Types
bufInSize :: Int
bufInSize = n * (1024 * 1024 `div` n) where n = sum sizes0 + 1
bufOutSize :: Int
bufOutSize = n * (1024 * 1024 `div` n) where n = sum sizes0 + length sizes0
sizes0 :: [Int]
sizes0 = [4, 4, 4, 3, 5, 1, 1, 3, 3, 3, 3, 3, 3, 10, 3, 1, 1, 1, 2, 3, 10]
-- I also tried using the C memset using the FFI, but got the same speed
memcpy :: Ptr Word8 -> Ptr Word8 -> Int -> IO ()
memcpy dst src n = when (n > 0) $ do
    x <- peek src
    poke dst x
    memcpy (dst `plusPtr` 1) (src `plusPtr` 1) (n - 1)
main = do
    allocaArray bufInSize  $ \bufIn0  -> do
    allocaArray bufOutSize $ \bufOut0 -> do
    with (44 :: Word8)  $ \cm -> do
        let loop bufIn bufOut sizes suffixIn suffixOut = do
                let (bytesIn, bytesOut, sizes', copy) = case sizes of
                        []     -> (1, 1    , sizes0,    memcpy bufOut bufIn 1)
                        [s]    -> (s, s    , []    ,    memcpy bufOut bufIn s)
                        s:izes -> (s, s + 1, izes  , do
                            memcpy  bufOut              bufIn s
                            memcpy (bufOut `plusPtr` s) cm    1 )
                if suffixIn < bytesIn
                then do
                    eof <- hIsEOF stdin
                    if eof
                        then hPutBuf stdout bufOut0 (bufOut `minusPtr` bufOut0)
                        else do
                            suffixIn' <- hGetBuf stdin bufIn0 bufInSize
                            loop bufIn0 bufOut sizes suffixIn' suffixOut
                else if suffixOut < bytesOut
                then do
                    hPutBuf stdout bufOut0 (bufOut `minusPtr` bufOut0)
                    loop bufIn bufOut0 sizes suffixIn bufOutSize
                else do
                    copy
                    loop (bufIn  `plusPtr` bytesIn )
                         (bufOut `plusPtr` bytesOut)
                         sizes'
                         (suffixIn  - bytesIn )
                         (suffixOut - bytesOut)
        loop bufIn0 bufOut0 sizes0 0 bufOutSize
Here are some rough time-based measurements using an input file of 1000000 lines:
$ # The C Version
$ time ./a.out < in.dat > out.dat
real    0m0.189s
user    0m0.116s
sys 0m0.068s
$ # The Haskell version
$ time ./csv < in.dat > out2.dat
real    0m0.536s
user    0m0.428s
sys 0m0.104s
$ diff out.dat out2.dat
$ # No difference
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With