I have been playing with memory_profiler for some time and got this interesting but confusing results from the small program below:
import pandas as pd
import numpy as np

@profile
def f(p):
    tmp = []
    for _, frame in p.iteritems():
        tmp.append([list(record) for record in frame.to_records(index=False)])

# initialize a list of pandas Panels
lp = []
for j in xrange(50):
    d = {}
    for i in xrange(50):
        df = pd.DataFrame(np.random.randn(200, 50))
        d[i] = df
    lp.append(pd.Panel(d))

# execution (iteration)
for panel in lp:
    f(panel)
Then if I use memory_profiler's mprof to analyze memory usage at runtime (mprof run test.py, with no other parameters), I get this:
[mprof plot: memory grows after each call to f()]
Some memory seems to remain unreleased after each call to f().
tmp is just a local list; it should be reassigned and its memory reclaimed each time f() is called, so there is an obvious discrepancy in the attached graph. I know that Python has its own memory-management arenas and keeps free lists for ints and some other types, and that gc.collect() should do the magic. It turns out an explicit gc.collect() doesn't work. (Maybe because we are working with pandas objects, Panels and DataFrames? I don't know.)
The most confusing part is that I don't change or modify any variable in f(). All it does is put some list-representation copies into a local list, so Python shouldn't need to copy anything. Then why and how does this happen?
=================
Some other observations:
1) If I call f() with f(panel.copy()) (last line of the code), passing a copy instead of the original object reference, I get a totally different memory usage result:
[memory_profiling_results_2.png: flat memory curve across calls]
Is Python smart enough to tell that the value passed in is a copy, so that it can do some internal trick to release the memory after each function call?
2) I think it might be because of df.to_records(). If I change it to frame.values, I get a similarly flat memory curve during the iteration, just like memory_profiling_results_2.png shown above (although I do need to_records() because it maintains the column dtypes, while .values messes the dtypes up). But I looked into the to_records() implementation in frame.py, and I don't see why it would hold on to memory while .values works just fine.
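The dtype difference between .values and .to_records() mentioned above can be checked directly. A minimal sketch, using a hypothetical mixed-dtype frame (runs on current pandas):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one int64 and one float64 column.
df = pd.DataFrame({'a': np.arange(3, dtype=np.int64),
                   'b': np.ones(3)})

# .values upcasts all columns to a single common dtype (float64 here),
# so the int64 column loses its original dtype.
print(df.values.dtype)  # float64

# .to_records() returns a structured array that keeps each column's dtype.
rec = df.to_records(index=False)
print(rec.dtype)
```

The structured dtype of rec still records 'a' as int64 and 'b' as float64, which is why to_records() is needed when per-column dtypes matter.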
I am running the program on Windows with Python 2.7.8, memory_profiler 0.43 and psutil 5.0.1.
This is not a memory leak. What you are seeing is a side effect of pandas.core.generic.NDFrame caching some results, which lets it return the same information the second time you ask for it without running the calculations again. Change the end of your sample code to the following and run it. You should find that the second time through, the memory increase does not happen and the execution time is lower.
import time

# execution (iteration)
start_time = time.time()
for panel in lp:
    f(panel)
print(time.time() - start_time)

print('-------------------------------------')

start_time = time.time()
for panel in lp:
    f(panel)
print(time.time() - start_time)
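The same item cache can be observed on a plain DataFrame. A minimal sketch (using modern pandas, where Panel no longer exists; the identity check below assumes classic caching mode, not Copy-on-Write):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 2), columns=['x', 'y'])

# NDFrame keeps an internal item cache: the first column lookup builds a
# Series and stores it, and later lookups can return the stored object.
s1 = df['x']
s2 = df['x']

# True while the item cache is active; False when Copy-on-Write is enabled.
print(s1 is s2)
```

Because the cached object is kept alive by the parent frame, memory attributed to the first pass is not returned to the OS, which is exactly the plateau mprof shows.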