Save Large Scipy Sparse Matrix

I am trying to cPickle a large scipy sparse matrix for later use. I am getting this error:

  File "tfidf_scikit.py", line 44, in <module>
    pickle.dump([trainID, trainX, trainY], fout, protocol=-1)
SystemError: error return without exception set

trainX is the large sparse matrix; the other two are lists of 6 million elements each.

In [1]: trainX
Out[1]:
<6034195x755258 sparse matrix of type '<type 'numpy.float64'>'
    with 286674296 stored elements in Compressed Sparse Row format>

At this point, Python RAM usage is 4.6GB and I have 16GB of RAM on my laptop.

I think I'm running into the known cPickle memory bug where it fails on objects that are too big. I tried marshal as well, but it doesn't seem to support scipy matrices. Can someone offer a solution, preferably with an example of how to save and load this?

Python 2.7.5

Mac OS 10.9

Thank you.

asked Mar 22 '26 by mchangun

1 Answer

I had this problem with a multi-gigabyte Numpy matrix (Ubuntu 12.04 with Python 2.7.3 - seems to be this issue: https://github.com/numpy/numpy/issues/2396 ).

I solved it using numpy.savetxt() / numpy.loadtxt(). The matrix is compressed automatically if you add a .gz extension to the filename when saving.
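The round trip described above might look like this (a minimal sketch; the filename and matrix size are illustrative):

```python
# A minimal sketch of the savetxt/loadtxt round trip described above.
# Appending ".gz" to the filename makes numpy gzip the file transparently.
import numpy as np

dense = np.random.rand(20, 10)          # stand-in for the multi-gigabyte matrix
np.savetxt("matrix.txt.gz", dense)      # written gzip-compressed
restored = np.loadtxt("matrix.txt.gz")  # decompressed automatically on load

assert np.allclose(dense, restored)
```

Note that text serialization is slow for very large matrices; it traded speed for avoiding the pickle bug.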

Since I too had just a single matrix I did not investigate the use of HDF5.
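For a scipy CSR matrix like the one in the question, an alternative that avoids pickling one huge object is to save the matrix's component arrays with numpy.savez. A sketch, assuming these (illustrative) helper names:

```python
import numpy as np
import scipy.sparse as sp

def save_sparse_csr(filename, matrix):
    # A CSR matrix is fully defined by its data, indices, and indptr
    # arrays plus its shape, so bundle those four into one .npz file.
    np.savez(filename,
             data=matrix.data,
             indices=matrix.indices,
             indptr=matrix.indptr,
             shape=matrix.shape)

def load_sparse_csr(filename):
    # Rebuild the CSR matrix from its saved component arrays.
    loader = np.load(filename)
    return sp.csr_matrix(
        (loader['data'], loader['indices'], loader['indptr']),
        shape=tuple(loader['shape']))
```

numpy.savez_compressed works the same way if disk space matters more than save/load speed.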

answered Mar 23 '26 by Marco


