It's taking me up to an hour to read a 1-gigabyte NetworkX graph data structure using cPickle (its 1-GB when stored on disk as a binary pickle file).
Note that the file quickly loads into memory. In other words, if I run:
import cPickle as pickle
f = open("bigNetworkXGraph.pickle","rb")
binary_data = f.read() # This part doesn't take long
graph = pickle.loads(binary_data) # This takes ages
How can I speed this last operation up?
Note that I have tried pickling the data both in using both binary protocols (1 and 2), and it doesn't seem to make much difference which protocol I use. Also note that although I am using the "loads" (meaning "load string") function above, it is loading binary data, not ascii-data.
I have 128gb of RAM on the system I'm using, so I'm hoping that somebody will tell me how to increase some read buffer buried in the pickle implementation.
I had great success in reading a ~750 MB igraph data structure (a binary pickle file) using cPickle itself. This was achieved by simply wrapping up the pickle load call as mentioned here
Example snippet in your case would be something like:
import cPickle as pickle
import gc
f = open("bigNetworkXGraph.pickle", "rb")
# disable garbage collector
gc.disable()
graph = pickle.load(f)
# enable garbage collector again
gc.enable()
f.close()
This definitely isn't the most apt way to do it, however, it reduces the time required drastically.
(For me, it reduced from 843.04s to 41.28s, around 20x)
You're probably bound by Python object creation/allocation overhead, not the unpickling itself. If so, there is little you can do to speed this up, except not creating all the objects. Do you need the entire structure at once? If not, you could use lazy population of the data structure (for example: represent parts of the structure by pickled strings, then unpickle them only when they are accessed).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With