I'm using Keras as a submodule of TensorFlow v2, and I'm training my model with the fit_generator() method. I want to save my model every 10 epochs. How can I achieve this?
In standalone Keras (not the tf submodule), I can pass ModelCheckpoint(model_savepath, period=10). But in TF v2 this has changed to ModelCheckpoint(model_savepath, save_freq), where save_freq can be 'epoch', in which case the model is saved every epoch. If save_freq is an integer, the model is saved after that many samples (batches, in newer TF versions) have been processed. But I want it saved after every 10 epochs. How can I achieve this?
To save weights every epoch, you can use callbacks in Keras: create checkpoint = ModelCheckpoint(...) and set the argument period=1, which sets the periodicity in epochs. This should do it.
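A minimal sketch of that approach in standalone Keras (the filepath pattern is a placeholder, and model, x_train, y_train are assumed to already exist; note that period is deprecated in newer tf.keras):
from keras.callbacks import ModelCheckpoint

# Placeholder filepath pattern; {epoch:02d} is filled in by the callback.
checkpoint = ModelCheckpoint("weights-{epoch:02d}.h5",
                             save_weights_only=True,
                             period=1)  # save every epoch (deprecated in newer tf.keras)
model.fit(x_train, y_train, epochs=50, callbacks=[checkpoint])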
There are two formats you can use to save an entire model to disk: the TensorFlow SavedModel format and the older Keras H5 format. The recommended format is SavedModel; it is the default when you use model.save().
The save_weights() method saves only the weights of the layers contained in the model. For saving an entire model it is advised to use the save() method rather than save_weights(); however, weights-only H5 files can also be written with save_weights().
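A short sketch of the difference between the two formats and the two methods (model is assumed to be an already-built tf.keras model; the file names are placeholders):
# Default: SavedModel format (creates a directory on disk).
model.save("my_model")

# Older Keras H5 format (single file), selected by the .h5 extension.
model.save("my_model.h5")

# Weights only; the architecture must be rebuilt in code before load_weights().
model.save_weights("my_weights.h5")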
SavedModel is the more comprehensive save format: it stores the model architecture, the weights, and the traced TensorFlow subgraphs of the call functions. This enables Keras to restore both built-in layers and custom objects. Calling save('my_model') on a trained model creates a SavedModel folder named my_model.
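A minimal sketch of that flow (the model and the toy training data here are illustrative placeholders):
import numpy as np
import tensorflow as tf

# Create a simple model.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])
model.compile(optimizer="adam", loss="mse")

# Train the model on toy data.
x, y = np.random.rand(32, 3), np.random.rand(32, 1)
model.fit(x, y, epochs=1, verbose=0)

# Calling `save('my_model')` creates a SavedModel folder `my_model`.
model.save("my_model")

# The folder can be loaded back into an equivalent model.
restored = tf.keras.models.load_model("my_model")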
The function name is sufficient for loading, as long as it is registered as a custom object. Alternatively, it's possible to load the TensorFlow graph generated by Keras; if you do so, you won't need to provide any custom_objects. You can do so like this:
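A sketch of both options, assuming a model that uses a custom activation function (the swish function and the model are illustrative, not from the original post):
import tensorflow as tf

def swish(x):  # illustrative custom function used inside the model
    return x * tf.sigmoid(x)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation=swish, input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
model.save("my_model")  # SavedModel format also traces the call graph

# Option 1: register the function under its name at load time.
restored = tf.keras.models.load_model("my_model", custom_objects={"swish": swish})

# Option 2: rely on the traced TensorFlow graph; no custom_objects needed.
restored = tf.keras.models.load_model("my_model")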
Also, saving every N epochs is not an option for me. What I am trying to do is save the model after certain specific epochs are done. Let's say, for example, after epoch 150 it will be saved as model.save('model_1.h5'), after epoch 152 as model.save('model_2.h5'), and so on for a few specific epochs.
With tf.keras.callbacks.ModelCheckpoint, use save_freq='epoch' and pass the extra argument period=10.
Although this is not documented in the official docs, that is the way to do it (the docs do show that you can pass period; they just don't explain what it does).
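A sketch of that call (the filepath is a placeholder; period is undocumented here and has been removed in newer TF releases, so treat this as version-dependent):
import tensorflow as tf

cp = tf.keras.callbacks.ModelCheckpoint(
    "model-{epoch:03d}.h5",
    save_freq="epoch",  # check at epoch boundaries
    period=10)          # only save every 10th epoch (undocumented; removed in newer TF)
model.fit(x_train, y_train, epochs=100, callbacks=[cp])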
Explicitly computing the number of batches per epoch worked for me.
import tensorflow as tf

# (model, train_images, train_labels, test_images, test_labels are assumed defined)
BATCH_SIZE = 20
STEPS_PER_EPOCH = train_labels.size // BATCH_SIZE  # batches per epoch (integer)
SAVE_PERIOD = 10  # save every 10 epochs

# Placeholder path; {epoch:04d} is filled in by the callback.
checkpoint_path = "checkpoints/cp-{epoch:04d}.ckpt"

# Create a callback that saves the model's weights every 10 epochs.
# save_freq is measured in batches, so convert epochs to batches.
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    verbose=1,
    save_weights_only=True,
    save_freq=SAVE_PERIOD * STEPS_PER_EPOCH)

# Train the model with the new callback.
model.fit(train_images,
          train_labels,
          batch_size=BATCH_SIZE,
          steps_per_epoch=STEPS_PER_EPOCH,
          epochs=50,
          callbacks=[cp_callback],
          validation_data=(test_images, test_labels),
          verbose=0)
The period param mentioned in the accepted answer is no longer available.
Using the save_freq param is an alternative, but risky: if the dataset size changes, the number of batches per epoch changes and the saving interval drifts. The docs also warn that "if the saving isn't aligned to epochs, the monitored metric may potentially be less reliable."
Thus, I use a subclass as a solution:
class EpochModelCheckpoint(tf.keras.callbacks.ModelCheckpoint):
    """ModelCheckpoint that saves every `frequency` epochs instead of every epoch."""

    def __init__(self,
                 filepath,
                 frequency=1,
                 monitor='val_loss',
                 verbose=0,
                 save_best_only=False,
                 save_weights_only=False,
                 mode='auto',
                 options=None,
                 **kwargs):
        super(EpochModelCheckpoint, self).__init__(filepath=filepath,
                                                   monitor=monitor,
                                                   verbose=verbose,
                                                   save_best_only=save_best_only,
                                                   save_weights_only=save_weights_only,
                                                   mode=mode,
                                                   save_freq="epoch",
                                                   options=options,
                                                   **kwargs)
        self.epochs_since_last_save = 0
        self.frequency = frequency

    def on_epoch_end(self, epoch, logs=None):
        self.epochs_since_last_save += 1
        # pylint: disable=protected-access
        if self.epochs_since_last_save % self.frequency == 0:
            self._save_model(epoch=epoch, batch=None, logs=logs)

    def on_train_batch_end(self, batch, logs=None):
        pass  # prevent the parent class from saving at batch boundaries
Use it like this:
callbacks = [
    EpochModelCheckpoint("/your_save_location/epoch{epoch:02d}", frequency=10),
]
Note that, depending on your TF version, you may have to adjust the arguments in the call to the superclass __init__ (and the signature of _save_model), as these have changed between releases.