I need to import a very large dictionary into Python and I'm running into some unexpected memory bottlenecks. The dictionary has the form
d = {(1,2,3):(1,2,3,4), (2,5,6):(4,2,3,4,5,6), ... }
So each key is a 3-tuple and each value is a relatively small tuple of arbitrary size (probably never more than 30 elements). What makes the dictionary large is the number of keys. A smaller example of what I'm working with has roughly 247257 keys. I generate this dictionary through a simulation, so I can write out a text file that defines it; for the example I just mentioned this is a 94MB file.

The bottleneck I am running into is that the initial compile to Python byte code eats up about 14GB of RAM. The first time I import the dictionary I see the RAM usage spike, and after a good 10 seconds everything is loaded. If the .pyc file is already generated, the import is nearly instant. Using pympler, I've determined that this dictionary is only about 200MB in memory. What is the deal here? Do I have any other options for getting this dictionary loaded into Python, or at least compiled to byte code?

I'm running the generating simulations in C++ and I can write files in whatever format I need. Are there any options there (Python libraries, etc.)? I'm interfacing with some software that needs this data as a dictionary, so please don't suggest alternative data structures. Also, in case you are wondering, I have defined the dictionary in the text file both as in the definition above and like so:
d = {}
d[1,2,3] = (1,2,3,4)
d[2,5,6] = (4,2,3,4,5,6)
...
Both give the same memory spike when compiling to byte code. In fact, the second one seems to be slightly worse, which is surprising to me. There's got to be some way to tame the amount of RAM the initial compile needs. It seems like it should be possible to compile one key-value pair at a time. Any ideas?
Other info: using Python 2.6.5
There's a lot of redundant information and processing done here. This not only results in bigger compile times and memory consumption, but also in code bloat in the generated executable.
Those numbers can easily fit in a 64-bit integer, so one would hope Python would store those million integers in no more than ~8MB: a million 8-byte objects. In fact, Python uses more like 35MB of RAM to store these numbers. Why? Because Python integers are objects, and objects have a lot of memory overhead.
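The per-object overhead described here can be observed with `sys.getsizeof`. This is only an illustrative sketch: the exact byte counts depend on the Python version and platform, so none are quoted, and the sample values are taken from the question's dictionary format.

```python
import sys

# A single integer is a full object, so it costs more than the 8 bytes
# a raw 64-bit value would need.
n = 10**6
int_size = sys.getsizeof(n)

# A tuple adds its own header plus one pointer per element. getsizeof
# reports only the container, not the integers it refers to.
empty_tuple_size = sys.getsizeof(())
key_size = sys.getsizeof((1, 2, 3))

print("int object:", int_size, "bytes")
print("empty tuple:", empty_tuple_size, "bytes")
print("3-tuple container:", key_size, "bytes")
```

Multiplying figures like these by a few hundred thousand keys and values gives a rough feel for why the loaded dictionary is larger than the raw numbers alone.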
Python doesn't limit memory usage on your program. It will allocate as much memory as your program needs until your computer is out of memory. The most you can do is reduce the limit to a fixed upper cap. That can be done with the resource module, but it isn't what you're looking for.
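For reference, capping memory with the `resource` module might look like the sketch below. It assumes a Unix system (the module is not available on Windows) and assumes `RLIMIT_AS`, the address-space limit, is the limit you want. The helper names `effective_cap` and `cap_address_space` are made up for this example.

```python
import resource

def effective_cap(requested, hard):
    """Return the soft limit to apply: the request, clamped to the hard limit."""
    if hard == resource.RLIM_INFINITY:
        return requested
    return min(requested, hard)

def cap_address_space(max_bytes):
    """Set this process's address-space soft limit (Unix only)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    new_soft = effective_cap(max_bytes, hard)
    # The hard limit is left unchanged; only the soft limit is adjusted.
    resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))
    return new_soft
```

Calling something like `cap_address_space(2 * 1024**3)` would typically make allocations beyond roughly 2 GiB fail with a `MemoryError` instead of exhausting the machine. It does not make the compile step use any less memory.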
My guess is that parsing your file builds an enormous syntax tree, and the small overhead for each element adds up. Once the bytecode is generated, the syntax tree is no longer needed and is discarded, leaving your 200MB of data.
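To get a feel for how many tree nodes a dict literal produces, here is a small sketch using the `ast` module in a current Python 3. It is an illustration only: the compiler internals of Python 2.6.5 differ, and the node count varies between versions, so no exact number is claimed.

```python
import ast

# One key-value pair in the same shape as the question's dictionary.
source = "d = {(1, 2, 3): (1, 2, 3, 4)}"
tree = ast.parse(source)

# Count every node visited in the tree: the literals themselves plus
# the tuple, dict, assignment and context nodes wrapped around them.
node_count = sum(1 for _ in ast.walk(tree))
literal_count = 7  # the integers written in the source line

print(node_count, "AST nodes for", literal_count, "integer literals")
```

Each of those nodes is itself a Python or C-level object with its own overhead, which is consistent with the idea that the tree for a 94MB source file is far larger than the final data.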
Have you tried storing the data in a separate file in the following format and then loading it dynamically in Python?
1,2,3=1,2,3,4
2,5,6=4,2,3,4,5,6
The Python script should look something like this:
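d = {}
with open("filename") as f:
    for line in f:
        line = line.strip()  # drop the trailing newline
        if not line:
            continue  # skip blank lines
        key, val = line.split("=")
        # Convert the fields to ints (assuming all entries are integers)
        # so the keys and values match the tuples in the original dictionary.
        d[tuple(map(int, key.split(",")))] = tuple(map(int, val.split(",")))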
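Since the question notes that importing from an existing .pyc is nearly instant, a similar effect might be had by caching the parsed dictionary in a binary file after the first load. The following is a minimal sketch, assuming the standard `pickle` module is acceptable. `save_cache`, `load_cache` and the file name `d.pickle` are hypothetical names chosen for this example.

```python
import os
import pickle
import tempfile

def save_cache(d, path):
    """Write an already-parsed dictionary to a binary cache file."""
    with open(path, "wb") as f:
        pickle.dump(d, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_cache(path):
    """Read the dictionary back from the binary cache file."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip with a tiny sample in the question's format.
sample = {(1, 2, 3): (1, 2, 3, 4), (2, 5, 6): (4, 2, 3, 4, 5, 6)}
cache_path = os.path.join(tempfile.mkdtemp(), "d.pickle")
save_cache(sample, cache_path)
restored = load_cache(cache_path)
```

Only unpickle files you generated yourself, because loading a pickle can execute arbitrary code. Whether this is actually faster than re-parsing the text file for your data would need to be measured.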