The following program uses 100+ MB of RAM when counting the distinct line lengths in a 250 MB file. How do I fix it so it uses less RAM? I suppose I misused lazy IO, foldr, and the laziness of Data.Map in the values.
import Control.Applicative
import qualified Data.Map as M
import Data.List
main = do
content <- readFile "output.csv"
print $ (foldr count M.empty . map length . lines) content
count a b = M.insertWith (+) a 1 b
The first big mistake in
main = do
content <- readFile "output.csv"
print $ (foldr count M.empty . map length . lines) content
count a b = M.insertWith (+) a 1 b
is using foldr. That constructs an expression of the form
length firstLine `count` length secondLine `count` ... `count` length lastLine `count` M.empty
traversing the entire list of lines and building up that thunk, without even evaluating the length calls due to laziness, before the thunk is finally evaluated right to left. So the entire file contents are in memory, in addition to the thunk for building the Map.
If you build up a Map from a list of things, always use a strict left fold (well, if the list is short and the things not huge, it doesn't matter), unless the semantics require a right fold. That can be the case if you're combining values with a non-commutative function, but even then it is often preferable to use a left fold and reverse the list before building the Map.
Data.Map (and Data.IntMap) is spine-strict; that alone makes it impossible to produce any partial output before the entire list has been traversed, so the strengths of foldr cannot be exploited here.
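To make the fold-direction point concrete, here is a small self-contained sketch (with made-up input; the name countOccurrences is mine) of the strict left fold recommended above. foldl' forces the accumulating Map at every step, so the input list can be consumed and discarded incrementally instead of being held alive by one giant right-nested thunk:

```haskell
import Data.List (foldl')
import qualified Data.Map as M

-- Count occurrences with a strict left fold: the accumulator Map is
-- forced to WHNF at every step, so no chain of pending `count`
-- applications builds up while the list is traversed.
countOccurrences :: [Int] -> M.Map Int Int
countOccurrences = foldl' (\mp a -> M.insertWith (+) a 1 mp) M.empty

main :: IO ()
main = print (countOccurrences [3, 5, 3, 7, 5, 3])
-- prints fromList [(3,3),(5,2),(7,1)]
```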
The next (possible) problem is, again, laziness: you don't evaluate the mapped-to values when putting them into the Map, so if some line length occurs particularly often, its value becomes a huge thunk of the form
((...((1+1)+1)...+1)+1)
Make it
main = do
content <- readFile "output.csv"
print $ (foldl' count M.empty . map length . lines) content
count mp a = M.insertWith' (+) a 1 mp
so that the lines can be garbage collected as soon as they have been read in, and no thunks can build up in the values. That way you never need more than one line of the file in memory at once, and even that need not be in memory entirely, since the length is evaluated before it is recorded in the Map.
If your containers package is recent enough, you could also
import Data.Map.Strict
and leave count using insertWith (without the prime; the Data.Map.Strict module always evaluates the values put into the map).
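Assuming such a containers version, the whole corrected program could then be sketched like this (using an in-memory string instead of the 250 MB file, so the sketch runs standalone):

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Data.Map.Strict.insertWith evaluates the combined value before
-- storing it, so each count stays a plain Int instead of growing
-- into a chain of (+1) thunks.
count :: M.Map Int Int -> Int -> M.Map Int Int
count mp a = M.insertWith (+) a 1 mp

main :: IO ()
main = do
  let content = "ab\ncde\nab\nf\n"  -- stand-in for readFile "output.csv"
  print $ (foldl' count M.empty . map length . lines) content
-- prints fromList [(1,1),(2,2),(3,1)]
```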