I train a RNN network, the first epoch used 7.5 hours. But with the training process runs, tensorflow runs slower and slower, the second epoch used 55 hours. I checked the code, most APIs that become slower with time are these :
session.run([var1, var1, ...], feed_dict=feed),tensor.eval(feed_dict=feed). For example, one line code is session.run[var1, var2, ...], feed_dict=feed), as the program begins, It uses 0.1 seconds, but with the process runs, the time used for this line of code becomes bigger and bigger, After 10 hours, time this line spends comes to 10 seconds.
I have been befall this several times. Which triggered this? How could I do to avoid this?
If this line of code: self.shapes = [numpy.zeros(g[1].get_shape(), numy.float32) for g in self.compute_gradients]  adds nodes to the graph of tensorflow? I suspect this maybe the reason. This line of code will be called many times periodically,and self is not an object of tf.train.optimizer.
Try finalizing your graph after you create it (graph.finalize()). This will prevent operations to be added to the graph. I also think self.compute_gradients is adding operations to the graph. Try defining the operation outside your loop and running it inside your loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With