After 85 epochs, the loss (a cosine distance) of my model (an RNN with 3 LSTM layers) becomes NaN. Why does this happen, and how can I fix it? The outputs of my model also become NaN.
My model:
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.contrib)

tf.reset_default_graph()
seqlen = tf.placeholder(tf.int32, [None])
x_id = tf.placeholder(tf.int32, [None, None])
y_id = tf.placeholder(tf.int32, [None, None])
embeddings_matrix = tf.placeholder(np.float32, [vocabulary_size, embedding_size])
x_emb = tf.nn.embedding_lookup(embeddings_matrix, x_id)
y_emb = tf.nn.embedding_lookup(embeddings_matrix, y_id)
cells = [tf.contrib.rnn.LSTMCell(s, activation=a) for s, a in [(400, tf.nn.relu), (400, tf.nn.relu), (400, tf.nn.tanh)]]
cell = tf.contrib.rnn.MultiRNNCell(cells)
outputs, _ = tf.nn.dynamic_rnn(cell, x_emb, dtype=tf.float32, sequence_length=seqlen)
loss = tf.losses.cosine_distance(tf.nn.l2_normalize(outputs, 2), tf.nn.l2_normalize(y_emb, 2), 1)
tf.summary.scalar('loss', loss)
opt = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
merged = tf.summary.merge_all()
The output of the training :
Epoch 80/100
    Time : 499 s    Loss : 0.972911523852701    Val Loss : 0.9729658
Epoch 81/100
    Time : 499 s    Loss : 0.9723407568655597   Val Loss : 0.9718646
Epoch 82/100
    Time : 499 s    Loss : 0.9718870568505438   Val Loss : 0.971976
Epoch 83/100
    Time : 499 s    Loss : 0.9913996352643445   Val Loss : 0.990693
Epoch 84/100
    Time : 499 s    Loss : 0.9901496524596137   Val Loss : 0.98957264
Epoch 85/100
    Time : 499 s    Loss : nan  Val Loss : nan
Epoch 86/100
    Time : 498 s    Loss : nan  Val Loss : nan
Epoch 87/100
    Time : 498 s    Loss : nan  Val Loss : nan
Epoch 88/100
    Time : 499 s    Loss : nan  Val Loss : nan
Epoch 89/100
    Time : 498 s    Loss : nan  Val Loss : nan
Epoch 90/100
    Time : 498 s    Loss : nan  Val Loss : nan
And here is the loss curve over the entire training run:

The blue curve is the loss on the training data and the orange one is the loss on the validation data.
The learning rate used for Adam is 0.001.
x and y have shape [batch size, maximum sequence length]. Both dimensions are set to None because the last batch of each epoch is smaller and the maximum sequence length changes from batch to batch.
x and y go through an embedding lookup and take shape [batch size, maximum sequence length, embedding size]; the embedding for the padding word is a vector of zeros.
The dynamic RNN takes the length of each sequence (seqlen in the code, with shape [batch size]), so it only makes predictions for the actual length of each sequence; the rest of the output is padded with zero vectors, as is y.
My guess is that the output values become so close to zero that, once they are squared to compute the cosine distance, they underflow to 0, which leads to a division by zero.
Cosine distance formula :
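For reference, a standard definition of the cosine distance between two vectors is written out below. The symbols a, b and d_cos are chosen here for illustration and do not come from the code.

```latex
\[
d_{\cos}(a, b)
  = 1 - \frac{a \cdot b}{\lVert a \rVert_2 \, \lVert b \rVert_2}
  = 1 - \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}
\]
```

The denominator is zero whenever either vector is all zeros, which is the case the guess above is concerned with.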
I don't know whether I'm right, nor how to prevent this.
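The division-by-zero guess can be checked in isolation with a small NumPy sketch. The function names and the sample vectors are hypothetical, and the guarded version only imitates, under my assumption about its behaviour, what the `epsilon` argument of `tf.nn.l2_normalize` does (clamping the squared norm from below).

```python
import numpy as np

def naive_cosine_distance(a, b):
    """1 - cos(a, b) with no guard against zero-length vectors."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def guarded_cosine_distance(a, b, epsilon=1e-12):
    """Same quantity, but each squared norm is clamped to at least epsilon."""
    a_unit = a / np.sqrt(max(np.dot(a, a), epsilon))
    b_unit = b / np.sqrt(max(np.dot(b, b), epsilon))
    return 1.0 - np.dot(a_unit, b_unit)

padding = np.zeros(4)                        # stands in for a padded time step
target = np.array([0.5, -1.0, 2.0, 0.0])     # arbitrary non-zero vector

naive = naive_cosine_distance(padding, target)      # 0 / 0 gives NaN
guarded = guarded_cosine_distance(padding, target)  # zero vector stays zero, distance is 1
```

If `tf.nn.l2_normalize` guards the norm in this way, the padded positions alone would probably not produce NaN in the forward pass, so it may be worth looking at the gradients and weights as well.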
EDIT:
I just checked the weights of every layer, and they are all NaN.
SOLVED:
Adding L2 regularization worked.
tf.reset_default_graph()
seqlen = tf.placeholder(tf.int32, [None])
x_id = tf.placeholder(tf.int32, [None, None])
y_id = tf.placeholder(tf.int32, [None, None])
embeddings_matrix = tf.placeholder(np.float32, [vocabulary_size, embedding_size])
x_emb = tf.nn.embedding_lookup(embeddings_matrix, x_id)
y_emb = tf.nn.embedding_lookup(embeddings_matrix, y_id)
cells = [tf.contrib.rnn.LSTMCell(s, activation=a) for s, a in [(400, tf.nn.relu), (400, tf.nn.relu), (400, tf.nn.tanh)]]
cell = tf.contrib.rnn.MultiRNNCell(cells)
outputs, _ = tf.nn.dynamic_rnn(cell, x_emb, dtype=tf.float32, sequence_length=seqlen)
regularizer = tf.reduce_sum([tf.nn.l2_loss(v) for v in tf.trainable_variables()])
cos_distance = tf.losses.cosine_distance(tf.nn.l2_normalize(outputs, 2), tf.nn.l2_normalize(y_emb, 2), 1)
loss = cos_distance + beta * regularizer
opt = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
tf.summary.scalar('loss', loss)
tf.summary.scalar('regularizer', regularizer)
tf.summary.scalar('cos_distance', cos_distance)
merged = tf.summary.merge_all()
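To make the regularization term concrete, here is a minimal NumPy sketch of what the `regularizer` line computes, assuming `tf.nn.l2_loss(t)` is `sum(t ** 2) / 2`. The weight arrays and the value of `beta` are made up for illustration; the value of `beta` actually used is not given above.

```python
import numpy as np

def l2_loss(t):
    """Half the sum of squared entries, mirroring tf.nn.l2_loss."""
    return np.sum(np.square(t)) / 2.0

# Hypothetical stand-ins for the model's trainable variables.
weights = [np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([3.0])]
beta = 1e-4  # hypothetical regularization strength

regularizer = sum(l2_loss(w) for w in weights)  # 2.625 + 4.5
penalty = beta * regularizer                    # term added to the cosine distance
```

Because the penalty grows with the squared magnitude of every weight, minimizing the combined loss pushes back against weights that drift toward very large values.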
Every layer's weights becoming NaN may be a sign that your model is suffering from exploding gradients.
As the number of epochs increases, the weight values in your layers may be growing too large. I suggest implementing some form of gradient clipping or weight regularization (check the link attached).
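As an illustration of the gradient clipping suggestion, the following is a minimal NumPy sketch of clipping by global norm. The function and the sample gradients are hypothetical; this is one common clipping scheme, not necessarily the one the attached link describes.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(np.square(g)) for g in grads))
    # scale is 1 when the norm is already small enough, otherwise clip_norm / global_norm
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

# Hypothetical gradients for two variables; joint norm is sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, clip_norm=5.0)
norm_after = np.sqrt(sum(np.sum(np.square(g)) for g in clipped))
```

In a TensorFlow 1.x graph like the one in the question, the analogous step would be to replace `minimize(loss)` with `compute_gradients`, pass the gradients through `tf.clip_by_global_norm`, and then call `apply_gradients`; the clipping threshold is a hyperparameter that has to be tuned.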