Memory footprint of splitOn?

Tags:

haskell

I wrote a file indexing program that should read thousands of text file lines as records and finally group those records by fingerprint. It uses Data.List.Split.splitOn to split the lines at tabs and retrieve the record fields. The program consumes 10-20 GB of memory.
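
The splitting itself is essentially just this (a simplified sketch, not the actual program code):

import Data.List.Split (splitOn)

-- One record per line; fields are separated by tabs.
recordFields :: String -> [String]
recordFields = splitOn "\t"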

Probably there is not much I can do to reduce that huge memory footprint, but I cannot explain why a function like splitOn (its internal breakDelim shows up in the profile below) can consume that much memory:

    Mon Dec  9 21:07 2019 Time and Allocation Profiling Report  (Final)

       group +RTS -p -RTS file1 file2 -o 2 -h

    total time  =        7.40 secs   (7399 ticks @ 1000 us, 1 processor)
    total alloc = 14,324,828,696 bytes  (excludes profiling overheads)

COST CENTRE                          MODULE                    SRC                                                %time %alloc

fileToPairs.linesIncludingEmptyLines ImageFileRecordParser     ImageFileRecordParser.hs:35:7-47                    25.0   33.8
breakDelim                           Data.List.Split.Internals src/Data/List/Split/Internals.hs:(151,1)-(156,36)   24.9   39.3
sortAndGroup                         Aggregations              Aggregations.hs:6:1-85                              12.9    1.7
fileToPairs                          ImageFileRecordParser     ImageFileRecordParser.hs:(33,1)-(42,14)              8.2   10.7
matchDelim                           Data.List.Split.Internals src/Data/List/Split/Internals.hs:(73,1)-(77,23)      7.4    0.4
onSublist                            Data.List.Split.Internals src/Data/List/Split/Internals.hs:278:1-72            3.6    0.0
toHashesView                         ImageFileRecordStatistics ImageFileRecordStatistics.hs:(48,1)-(51,24)          3.0    6.3
main                                 Main                      group.hs:(47,1)-(89,54)                              2.9    0.4
numberOfUnique                       ImageFileRecord           ImageFileRecord.hs:37:1-40                           1.6    0.1
toHashesView.sortedLines             ImageFileRecordStatistics ImageFileRecordStatistics.hs:50:7-30                 1.4    0.1
imageFileRecordFromFields            ImageFileRecordParser     ImageFileRecordParser.hs:(11,1)-(30,5)               1.1    0.3
toHashView                           ImageFileRecord           ImageFileRecord.hs:(67,1)-(69,23)                    0.7    1.7

Or is the type [Char] too memory inefficient (compared to Text), causing splitOn to use that much memory?

UPDATE 1 (output of +RTS -s, as suggested by user HTNW)

  23,446,268,504 bytes allocated in the heap
  10,753,363,408 bytes copied during GC
   1,456,588,656 bytes maximum residency (22 sample(s))
      29,282,936 bytes maximum slop
            3620 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     45646 colls,     0 par    4.055s   4.059s     0.0001s    0.0013s
  Gen  1        22 colls,     0 par    4.034s   4.035s     0.1834s    1.1491s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    7.477s  (  7.475s elapsed)
  GC      time    8.089s  (  8.094s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.114s  (  0.114s elapsed)
  Total   time   15.687s  ( 15.683s elapsed)

  %GC     time      51.6%  (51.6% elapsed)

  Alloc rate    3,135,625,407 bytes per MUT second

  Productivity  48.4% of total user, 48.4% of total elapsed

The processed text files are smaller than usual (UTF-8 encoded, 37 MB). But still 3.6 GB of memory is used.

UPDATE 2 (critical part of the code)

Explanation: fileToPairs processes a text file. It returns a list of key-value pairs (key: fingerprint of record, value: record).

sortAndGroup associations = Map.fromListWith (++) [(k, [v]) | (k, v) <- associations]
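
For illustration, grouping collects all values that share a key (note that Map.fromListWith (++) accumulates them in reverse input order):

ghci> sortAndGroup [(1, "a"), (1, "b"), (2, "c")]
fromList [(1,["b","a"]),(2,["c"])]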

main = do
  CommandLineArguments{..} <- cmdArgs $ CommandLineArguments {
    ignored_paths_file = def &= typFile,
    files = def &= typ "FILES" &= args,
    number_of_occurrences = def &= name "o",
    minimum_number_of_occurrences = def &= name "l",
    maximum_number_of_occurrences = def &= name "u",
    number_of_hashes = def &= name "n",
    having_record_errors = def &= name "e",
    hashes = def
  }
    &= summary "Group image/video files"
    &= program "group"

  let ignoredPathsFilenameMaybe = ignored_paths_file
  let filenames = files
  let hashesMaybe = hashes

  ignoredPaths <- case ignoredPathsFilenameMaybe of
    Just ignoredPathsFilename -> ioToLines (readFile ignoredPathsFilename)
    _ -> return []

  recordPairs <- mapM (fileToPairs ignoredPaths) filenames

  let allRecordPairs = concat recordPairs
  let groupMap = sortAndGroup allRecordPairs
  let statisticsPairs = map toPair (Map.toList groupMap)
        where toPair item = (fst item, imageFileRecordStatisticsFromRecords . snd $ item)

  let filterArguments = FilterArguments {
    numberOfOccurrencesMaybe = number_of_occurrences,
    minimumNumberOfOccurrencesMaybe = minimum_number_of_occurrences,
    maximumNumberOfOccurrencesMaybe = maximum_number_of_occurrences,
    numberOfHashesMaybe = number_of_hashes,
    havingRecordErrorsMaybe = having_record_errors
  }
  let filteredPairs = filterImageRecords filterArguments statisticsPairs

  let filteredMap = Map.fromList filteredPairs
  case hashesMaybe of
        Just True -> mapM_ putStrLn (map toHashesView (map snd filteredPairs))
        _ -> Char8.putStrLn (encodePretty filteredMap)
asked by ideaboxer


1 Answer

As I'm sure you're aware, there's not really enough information here for us to help you make your program more efficient. It might be worth posting some (complete, self-contained) code on the Code Review site for that.

However, I think I can answer your specific question about why splitOn allocates so much memory. In fact, there's nothing particularly special about splitOn or how it's been implemented. Many straightforward Haskell functions will allocate lots of memory, and this in itself doesn't indicate that they've been poorly written or are running inefficiently. In particular, splitOn's memory usage seems similar to other straightforward approaches to splitting a string based on delimiters.

The first thing to understand is that GHC-compiled code works differently from other compiled code you're likely to have seen. If you know a lot of C and understand stack frames and heap allocation, or if you've studied some JVM implementations, you might reasonably expect some of that understanding to translate to GHC executables, but you'd be mostly wrong.

A GHC program is more or less an engine for allocating heap objects, and -- with a few exceptions -- that's all it really does. Nearly every argument passed to a function or constructor (as well as the constructor application itself) allocates a heap object of at least 16 bytes, and often more. Take a simple function like:

fact :: Int -> Int
fact 0 = 1
fact n = n * fact (n-1)

With optimization turned off, it compiles to the following so-called "STG" form (simplified from the actual -O0 -ddump-stg output):

fact = \n -> case n of
  I# n' -> case n' of
    0# -> I# 1#
    _  -> let sat1 = let sat2 = let one = I# 1# in n - one
                     in fact sat2
          in n * sat1

Everywhere you see a let, that's a heap allocation (16+ bytes), and there are presumably more hidden in the (-) and (*) calls. Compiling and running this program with:

main = print $ fact 1000000

gives:

 113,343,544 bytes allocated in the heap
  44,309,000 bytes copied during GC
  25,059,648 bytes maximum residency (5 sample(s))
      29,152 bytes maximum slop
          23 MB total memory in use (0 MB lost due to fragmentation)

meaning that each iteration allocates over a hundred bytes on the heap (113,343,544 bytes over a million iterations), even though it's literally just performing a comparison, a subtraction, a multiplication, and a recursive call.

This is what @HTNW meant in saying that total allocation in a GHC program is a measure of "work". A GHC program that isn't allocating probably isn't doing anything (again, with some rare exceptions), and a typical GHC program that is doing something will usually allocate at a relatively constant rate of several gigabytes per second when it's not garbage collecting. So total allocation has more to do with total runtime than anything else, and it isn't a particularly good metric for assessing code efficiency. Maximum residency is also a poor measure of overall efficiency, though it is helpful for spotting a space leak: if residency grows linearly (or worse) with input size for a program you expect to run in constant memory, you have a leak.
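
As a toy illustration of that space-leak signature (my example, not code from your program): a lazy left fold accumulates unevaluated thunks, so its maximum residency grows with the input, while the strict variant runs in constant space:

import Data.List (foldl')

-- Lazy accumulator: builds a chain of n unevaluated (+) thunks before
-- forcing any of them, so maximum residency grows linearly with n.
leakySum :: Int -> Int
leakySum n = foldl (+) 0 [1 .. n]

-- Strict accumulator: forced at every step, constant residency.
strictSum :: Int -> Int
strictSum n = foldl' (+) 0 [1 .. n]

main = print (leakySum 10000000)  -- try strictSum to see the difference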

For most programs, the most important true efficiency metric in the +RTS -s output is probably the "productivity" rate at the bottom -- the fraction of run time the program spends doing real work rather than garbage collecting. Admittedly, your program's productivity of 48% is pretty bad, which probably means that it is, technically speaking, allocating too much memory; but it's probably only allocating two or three times what it should, so, at a guess, it should "only" be allocating around 7-8 Gigs instead of 23 Gigs for this workload (and, consequently, running for about 5 seconds instead of 15).

With that in mind, if you consider the following simple breakDelim implementation:

breakDelim :: String -> [String]
breakDelim str = case break (=='\t') str of
  (a,_:b) -> a : breakDelim b
  (a,[])  -> [a]

and use it like so in a simple tab-to-comma delimited file converter:

import Data.List (intercalate)

main = interact (unlines . map (intercalate "," . breakDelim) . lines)

Then, unoptimized and run on a file with 10000 lines of 1000 3-character fields each, it allocates a whopping 17 Gigs:

  17,227,289,776 bytes allocated in the heap
   2,807,297,584 bytes copied during GC
         127,416 bytes maximum residency (2391 sample(s))
          32,608 bytes maximum slop
               0 MB total memory in use (0 MB lost due to fragmentation)

and profiling it places a lot of blame on breakDelim:

COST CENTRE MODULE    SRC                    %time %alloc

main        Main      Delim.hs:8:1-71         57.7   72.6
breakDelim  Main      Delim.hs:(4,1)-(6,16)   42.3   27.4

In this case, compiling with -O2 doesn't make much difference. The key efficiency metric, productivity, is only 46%. All these results seem to be in line with what you're seeing in your program.

The split package has a lot going for it, but looking through the code, it's pretty clear that little effort has been made to make it particularly efficient or fast, so it's no surprise that splitOn performs no better than my quick-and-dirty custom breakDelim function. And, as I said before, there's nothing special about splitOn that makes it unusually memory hungry -- my simple breakDelim has similar behavior.
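
For what it's worth, splitOn really is a drop-in replacement for my function in this task (breakDelim' is just a name I'm introducing for the comparison):

import Data.List.Split (splitOn)

-- Same results as the hand-rolled breakDelim above, e.g.:
-- splitOn "\t" "a\tb\tc" == ["a","b","c"] == breakDelim "a\tb\tc"
breakDelim' :: String -> [String]
breakDelim' = splitOn "\t"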

As for the String type, it certainly can be memory inefficient. But it can also participate in optimizations like list fusion in ways that Text can't. The utility above could be rewritten in the simpler form:

main = interact $ map (\c -> if c == '\t' then ',' else c)

which uses String but runs reasonably fast (about a quarter the speed of a naive C getchar/putchar implementation) at 84% productivity, while allocating about 5 Gigs on the heap.

It's quite likely that if you just take your program and "convert it to Text", you'll find it's slower and more memory hungry than the original! While Text has the potential to be much more efficient than String, it's a complicated package, and the way in which Text objects behave with respect to allocation when they're sliced and diced (as when you're chopping a big Text file up into little Text fields) makes it more difficult to get right.
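
For instance, a direct Text translation of the little converter above would look something like this sketch (untested for performance; note that T.splitOn produces a fresh Text slice for every field, which is exactly the sliced-and-diced allocation pattern at issue):

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Text analogue of the tab-to-comma converter.
main = TIO.interact $
  T.unlines . map (T.intercalate "," . T.splitOn "\t") . T.lines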

So, some take-home lessons:

  • Total allocation is a poor measure of efficiency. Most well written GHC programs can and should allocate several gigabytes per second of runtime.
  • Many innocuous Haskell functions will allocate lots of memory because of the way GHC-compiled code works. This isn't necessarily a sign that there's something wrong with the function.
  • The split package provides a flexible framework for all manner of cool list splitting manipulations, but it was not designed with speed in mind, and it may not be the best method of processing a tab-delimited file.
  • The String data type has a potential for terrible inefficiency, but isn't always inefficient, and Text is a complicated package that won't be a plug-in replacement to fix your String performance woes.

Most importantly:

  • Unless your program is too slow for its intended purpose, its run-time statistics and the theoretical advantages of Text over String are largely irrelevant.
answered by K. A. Buhr