While playing with the model.train_on_batch method and custom training loops, I realized that in the custom training loop code the loss and gradients do not take any l1/l2 regularizers into account, and hence optimizer.apply_gradients does not take them into account either. Below you can find code that shows this, but the idea is pretty simple. So my question is whether there is a way to use all these optimizers, agnostic of optimizer details, that takes the regularizers into account. How is it implemented in Keras? On a related note, model.train_on_batch returns a value that is not the loss (as claimed in the docstring) but something else. I was wondering if someone here knows what it returns.
Code
To see this effect first create some data
import numpy as np
import tensorflow as tf
x=tf.constant([[1]])
y=tf.constant([[1]])
and create a function to make a reproducible model
def make_model(l1=.01,l2=.01):
    tf.random.set_seed(42)
    np.random.seed(42)
    model=tf.keras.models.Sequential([
        tf.keras.layers.Dense(2,'softmax',
                              use_bias=False,
                              kernel_regularizer=tf.keras.regularizers.l1_l2(l1=l1,l2=l2),
                              input_shape=(1,))
    ])
    return model
Now run Keras train_on_batch
model=make_model()
loss_object=tf.keras.losses.SparseCategoricalCrossentropy()
optimizer=tf.keras.optimizers.RMSprop()
model.compile(loss=loss_object,optimizer=optimizer)
model.train_on_batch(x,y)
and compare the outputs with the custom training loop as explained in the above link as well as here
model=make_model()
loss_object=tf.keras.losses.SparseCategoricalCrossentropy()
optimizer=tf.keras.optimizers.RMSprop()
@tf.function
def train_step(x,y):
    with tf.GradientTape() as tape:
        predictions = model(x)
        loss = loss_object(y, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
train_step(x,y).numpy()
You will see the two results are different unless l1==0 and l2==0.
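To make the gap concrete, here is a minimal pure-Python sketch of the arithmetic for a single bias-free Dense(2, softmax) layer with input 1 and label 1. The kernel values in w are made up for illustration (they are not what the seeded model produces), and the helper names are my own. The penalty formula, l1 * sum(|w|) + l2 * sum(w^2), is what tf.keras.regularizers.l1_l2 computes.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def data_loss(w, x, label):
    # Dense layer without bias: logits = x * w, then sparse categorical cross-entropy
    probs = softmax([x * wi for wi in w])
    return -math.log(probs[label])

def l1_l2_penalty(w, l1=0.01, l2=0.01):
    # Same form as the Keras l1_l2 regularizer
    return l1 * sum(abs(wi) for wi in w) + l2 * sum(wi * wi for wi in w)

w = [0.5, -0.3]  # hypothetical kernel values, for illustration only
plain = data_loss(w, x=1.0, label=1)
penalty = l1_l2_penalty(w)
print("data loss only:      ", plain)
print("regularizer penalty: ", penalty)
print("data loss + penalty: ", plain + penalty)
```

The naive custom loop only ever sees the first number, so both the reported loss and the gradients miss the penalty term whenever l1 or l2 is nonzero.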
Actually I found out the answer in Aurelien Geron's book 
In fact, after I implemented the code below, I found that this is covered in the TensorFlow guide on custom training (I don't know why it's not in the tutorials mentioned in the question, since it's an important point). The solution there is more general than the one shown here, but I am keeping this one as it sheds a bit more light on what's happening.
So it is as simple as modifying the custom training loop to
def add_model_regularizer_loss(model):
    loss=0
    for l in model.layers:
        if hasattr(l,'layers') and l.layers: # the layer itself is a model
            loss+=add_model_regularizer_loss(l) # recurse into the nested model
        if hasattr(l,'kernel_regularizer') and l.kernel_regularizer:
            loss+=l.kernel_regularizer(l.kernel)
        if hasattr(l,'bias_regularizer') and l.bias_regularizer:
            loss+=l.bias_regularizer(l.bias)
    return loss
def train_step(x,y):
    with tf.GradientTape() as tape:
        predictions = model(x)
        loss = loss_object(y, predictions)
        loss += add_model_regularizer_loss(model)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
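The more general approach from the guide mentioned above can be sketched with model.losses, which Keras fills with the regularization penalties of every layer, nested models included, so no manual traversal is needed. This is a minimal sketch rather than the guide's exact code: the float input, the training=True flag, and the choice to return both losses are my own assumptions for illustration.

```python
import tensorflow as tf

def make_model(l1=0.01, l2=0.01):
    tf.random.set_seed(42)
    return tf.keras.models.Sequential([
        tf.keras.layers.Dense(
            2, activation="softmax", use_bias=False,
            kernel_regularizer=tf.keras.regularizers.l1_l2(l1=l1, l2=l2),
            input_shape=(1,)),
    ])

x = tf.constant([[1.0]])  # float input (assumption; the question uses an int)
y = tf.constant([[1]])

model = make_model()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.RMSprop()

def train_step(x, y):
    with tf.GradientTape() as tape:
        predictions = model(x, training=True)
        data_loss = loss_object(y, predictions)
        # model.losses holds one scalar per regularizer attached to the model
        reg_losses = model.losses
        reg_loss = tf.add_n(reg_losses) if reg_losses else 0.0
        loss = data_loss + reg_loss
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return data_loss, loss

data_loss, total_loss = train_step(x, y)
print(float(data_loss), float(total_loss))
```

Because the penalties are read inside the GradientTape context, their gradients flow into apply_gradients regardless of which optimizer is used.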
To answer the second part of my question: it is this loss value, the data loss plus the regularization penalties, that Keras's model.train_on_batch returns.
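One way to check this claim is to compute the data loss and the regularization penalty by hand before the weights change, then compare their sum with what train_on_batch reports. The sketch below assumes a float input and uses model.losses for the penalty. I would expect the two printed numbers to agree up to float rounding, though I have not pinned down the exact values.

```python
import numpy as np
import tensorflow as tf

def make_model(l1=0.01, l2=0.01):
    tf.random.set_seed(42)
    return tf.keras.models.Sequential([
        tf.keras.layers.Dense(
            2, activation="softmax", use_bias=False,
            kernel_regularizer=tf.keras.regularizers.l1_l2(l1=l1, l2=l2),
            input_shape=(1,)),
    ])

x = tf.constant([[1.0]])  # float input (assumption; the question uses an int)
y = tf.constant([[1]])

model = make_model()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy()
model.compile(loss=loss_object, optimizer=tf.keras.optimizers.RMSprop())

# Evaluate both terms before train_on_batch updates the weights
data_loss = float(loss_object(y, model(x)))
reg_loss = float(tf.add_n(model.losses))

# train_on_batch reports the loss from its forward pass, i.e. before the update
reported = float(model.train_on_batch(x, y))

print("train_on_batch:     ", reported)
print("data + regularizers:", data_loss + reg_loss)
```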