I am working with TensorFlow and have been training some models, saving them after each epoch with tf.train.Saver. Saving and loading work fine, and I am doing both in the usual way:
with tf.Graph().as_default(), tf.Session() as session:
    initialiser = tf.random_normal_initializer(config.mean, config.std)
    with tf.variable_scope("model",reuse=None, initializer=initialiser):
        m = a2p(session, config, training=True)
    saver = tf.train.Saver()   
    ckpt = tf.train.get_checkpoint_state(model_dir)
    if ckpt and tf.gfile.Exists(ckpt.model_checkpoint_path):
        saver.restore(session, ckpt.model_checkpoint_path)
    ...
    for i in range(epochs):
        runepoch()
        save_path = saver.save(session, '%s.ckpt' % i)
My code saves a checkpoint at the end of each epoch, named after the epoch number. However, after fifteen epochs of training I only have checkpoint files for the last five epochs (10, 11, 12, 13, 14). I couldn't find anything about this in the documentation, so I am at a loss as to why it is happening.
Does the saver only allow for keeping five checkpoints or have I done something wrong?
Is there a way to make sure that all of the checkpoints are kept?
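You can choose how many checkpoints to keep when you create your Saver object by setting the max_to_keep argument, which defaults to 5. Once that limit is reached, each new save deletes the oldest checkpoint, which is why only the last five epochs remain.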
saver = tf.train.Saver(max_to_keep=10000)
Setting max_to_keep=None makes the Saver keep all checkpoints.
For example:
saver = tf.train.Saver(max_to_keep=None)
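To see why exactly epochs 10 to 14 survive in the question, here is a rough plain-Python sketch of a rolling retention window. It is only an illustration and not TensorFlow's actual implementation: simulate_retention is a hypothetical helper, and it assumes that None (or 0) disables deletion while any positive value drops the oldest checkpoint once the limit is exceeded.

```python
from collections import deque


def simulate_retention(num_epochs, max_to_keep=5):
    """Return the checkpoint names that would remain after num_epochs saves.

    Hypothetical helper: models a rolling window of checkpoints,
    not TensorFlow's real Saver code.
    """
    kept = deque()
    for epoch in range(num_epochs):
        kept.append('%d.ckpt' % epoch)
        # Assumption: None or 0 means "never delete"; otherwise the
        # oldest checkpoint is dropped once the limit is exceeded.
        if max_to_keep and len(kept) > max_to_keep:
            kept.popleft()
    return list(kept)


# Default limit of 5: only the last five of fifteen epochs remain.
print(simulate_retention(15))
# No limit: all fifteen checkpoints remain.
print(len(simulate_retention(15, max_to_keep=None)))
```

With the default limit this prints the names for epochs 10 through 14, matching what the question observed, and with max_to_keep=None it prints 15.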