size explosion file vs. string

I have a 261MB text file (xdebug output), and when I read it in, it occupies an additional 2GB of dynamic space.

(defun stream->string (tmp-stream)
  (do ((line (read-line tmp-stream nil nil)
             (read-line tmp-stream nil nil))
       (lines nil))
      ((not line) (progn 
                    (FORMAT T "COLLECTED~%")
                    (FORMAT nil "~{~a~^~%~}" (reverse lines))))
    (push line lines)))


(defparameter *test* nil)

  (progn
    (setf *test* nil)
    (sb-ext:gc :full t)
    (room)
    (FORMAT T "----~%")
    (with-open-file (stream "/home/.../debugFiles/xdebug_1.xt")
      (room)
      (FORMAT T "----~%")
      (setf *test* (stream->string stream))
      (sb-ext:gc :full t)
      (room)
      (FORMAT T "----~%"))
    (sb-ext:gc :full t)
    (room))  

Output

Dynamic space usage is:   84,598,224 bytes.
Read-only space usage is:      5,856 bytes.
Static space usage is:         4,160 bytes.
Control stack usage is:        8,408 bytes.
Binding stack usage is:        1,072 bytes.
Control and binding stack usage is for the current thread only.
Garbage collection is currently enabled.

Breakdown for dynamic space:
  20,841,808 bytes for    20,691 code objects.
  15,989,600 bytes for   999,350 cons objects.
  14,532,960 bytes for   118,880 simple-vector objects.
  13,951,792 bytes for   168,301 instance objects.
   5,994,864 bytes for    41,648 simple-character-string objects.
  13,287,200 bytes for   215,901 other objects.
  84,598,224 bytes for 1,564,771 dynamic objects (space total.)
----
Dynamic space usage is:   85,346,752 bytes.
Read-only space usage is:      5,856 bytes.
Static space usage is:         4,160 bytes.
Control stack usage is:        8,536 bytes.
Binding stack usage is:        1,072 bytes.
Control and binding stack usage is for the current thread only.
Garbage collection is currently enabled.

Breakdown for dynamic space:
  20,842,928 bytes for    20,692 code objects.
  16,125,008 bytes for 1,007,813 cons objects.
  14,698,784 bytes for   120,834 simple-vector objects.
  14,239,440 bytes for   171,411 instance objects.
   6,014,144 bytes for    41,776 simple-character-string objects.
  13,426,448 bytes for   219,723 other objects.
  85,346,752 bytes for 1,582,249 dynamic objects (space total.)
----
COLLECTED
Dynamic space usage is:   2,557,851,296 bytes.
Read-only space usage is:      5,856 bytes.
Static space usage is:         4,160 bytes.
Control stack usage is:        8,536 bytes.
Binding stack usage is:        1,072 bytes.
Control and binding stack usage is for the current thread only.
Garbage collection is currently enabled.

Breakdown for dynamic space:
  2,466,544,480 bytes for   817,255 simple-character-string objects.
  91,306,816 bytes for 2,303,370 other objects.
  2,557,851,296 bytes for 3,120,625 dynamic objects (space total.)
----
Dynamic space usage is:   1,131,069,056 bytes.
Read-only space usage is:      5,856 bytes.
Static space usage is:         4,160 bytes.
Control stack usage is:        8,360 bytes.
Binding stack usage is:        1,072 bytes.
Control and binding stack usage is for the current thread only.
Garbage collection is currently enabled.

Breakdown for dynamic space:
  1,053,183,424 bytes for    41,547 simple-character-string objects.
  77,885,632 bytes for 1,510,521 other objects.
  1,131,069,056 bytes for 1,552,068 dynamic objects (space total.)

I could understand a tripling of the size (even though this would still surprise me):

  1. the collection of lines
  2. the string object created by format
  3. the string saved in *test*

However, a factor of 10 increase is way too big.

How can that be?

Sim asked Dec 06 '25 08:12



1 Answer

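as Rainer points out, your problem is that sbcl represents a string of characters as a vector of utf-32 code points, which means that each character takes 32 bits. a 261MB file of ascii text therefore needs roughly 1GB as a string, which is about the 1,053,183,424 bytes of simple-character-string objects left after your last gc. the higher figure before that is consistent with the individual line strings still being alive next to the joined result (817,255 string objects versus 41,547 afterwards).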

ideally, the right way to handle large files is to process them as a stream, line by line, rather than slurping them into memory. if that isn't an option for you, and if you're confident that every character in your file is a base-char (in sbcl, an ascii character), you can pass :element-type 'base-char to with-open-file and coerce the result of read-line to simple-base-string, which in sbcl stores one byte per character. this might look like:

(defun file->lines (path)
  (with-open-file (stream path :element-type 'base-char)
    (do ((line (read-line stream nil nil)
               (read-line stream nil nil))
         (lines nil))
        ((not line) (nreverse lines))
      (push (coerce line 'simple-base-string) lines))))
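
if you are not fully confident that the file is pure ascii, the coerce above will signal an error on the first line that contains a non-base-char. a minimal sketch of a more forgiving variant follows; compact-line is a hypothetical helper name, not something from the standard or from sbcl, and it simply keeps the original string whenever the line cannot be stored as a base-string:

```lisp
;; COMPACT-LINE is a hypothetical helper, shown only as a sketch.
(defun compact-line (line)
  "Return LINE as a SIMPLE-BASE-STRING when every character is a BASE-CHAR,
otherwise return LINE unchanged."
  (if (every (lambda (char) (typep char 'base-char)) line)
      (coerce line 'simple-base-string)
      line))
```

in file->lines you would then write (push (compact-line line) lines). the extra pass over each line costs some time, but lines with non-ascii characters no longer abort the whole read.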

also, note that if your file has many lines, the overhead of storing the lines in a linked list may be significant. if you can predict the number of lines in your file, you may have better performance pre-allocating a large vector, and storing the lines in it, like:

(defun file->lines (path number-of-lines)
  (with-open-file (stream path :element-type 'base-char)
    (do ((line (read-line stream nil nil)
               (read-line stream nil nil))
         (lines (make-array number-of-lines :fill-pointer 0)))
        ((not line) lines)
      (vector-push (coerce line 'simple-base-string) lines))))

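but make sure your number-of-lines is an overestimate: vector-push never grows the vector, it just returns nil and drops the line once the vector is full. the alternative, vector-push-extend, would grow the vector, but then you may have to do a slow reallocate and copy. (that's why i wrote vector-push instead of vector-push-extend.)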
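
since vector-push returns nil instead of signalling anything when the vector is full, an underestimate would go unnoticed. a small guard can make it visible; this is only a sketch, and push-line-or-fail is a hypothetical helper name:

```lisp
;; PUSH-LINE-OR-FAIL is a hypothetical helper, shown only as a sketch.
(defun push-line-or-fail (line lines)
  "Append LINE to the fill-pointer vector LINES.
Signal an error instead of silently dropping LINE when LINES is full."
  ;; VECTOR-PUSH returns the index of the new element, or NIL when full.
  (or (vector-push line lines)
      (error "line vector is full at ~d entries; number-of-lines was too small"
             (length lines))))
```

in the function above, the call to vector-push would become (push-line-or-fail (coerce line 'simple-base-string) lines).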

if you can't predict a number of lines, you may be best off reading into a list, then coercing to a vector at the end, like:

(defun file->lines (path)
  (with-open-file (stream path :element-type 'base-char)
    (do ((line (read-line stream nil nil)
               (read-line stream nil nil))
         (lines nil))
        ((not line) (coerce (nreverse lines) 'vector))
      (push (coerce line 'simple-base-string) lines))))
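
for completeness, here is a rough sketch of the streaming approach mentioned at the start, where each line is handed to a function and then becomes garbage, so the whole file never has to sit in memory at once. map-file-lines and count-matching-lines are hypothetical names chosen for illustration, and counting lines that contain a substring is just a stand-in for whatever you actually need from the trace:

```lisp
;; Both functions are hypothetical helpers, shown only as a sketch.
(defun map-file-lines (function path)
  "Call FUNCTION on each line of the file at PATH, one line at a time."
  (with-open-file (stream path)
    (loop for line = (read-line stream nil nil)
          while line
          do (funcall function line))))

(defun count-matching-lines (substring path)
  "Return the number of lines in the file at PATH that contain SUBSTRING."
  (let ((count 0))
    (map-file-lines (lambda (line)
                      (when (search substring line)
                        (incf count)))
                    path)
    count))
```

only one line is reachable at any time here, so memory use should stay near the size of the longest line rather than the size of the file.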

Phoebe Goldman answered Dec 08 '25 20:12



