I have a simple sequential model using TimeDistributed(Dense...) as the final layer after an LSTM layer. I am training on time series data in sequences of 20 time steps. The loss function is Mean Absolute Error, defined as:
def mean_absolute_error(y_true, y_pred):
    return K.mean(K.abs(y_pred - y_true), axis=-1)
(from https://github.com/fchollet/keras/blob/master/keras/losses.py)
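For clarity, here is a small NumPy sketch of what that axis=-1 reduction does to a 3D prediction tensor (the shapes match the question; the values are made up):

    import numpy as np

    # Mimic K.mean(K.abs(y_pred - y_true), axis=-1) on a tensor of shape
    # (batches, timesteps, framelen): only the last axis is reduced,
    # leaving one loss value per time step per sample.
    y_true = np.zeros((4, 20, 13))
    y_pred = np.full((4, 20, 13), 0.5)

    per_step_loss = np.mean(np.abs(y_pred - y_true), axis=-1)
    print(per_step_loss.shape)  # (4, 20) -- one value per sample per step

    # Keras then reduces this to a scalar by averaging over everything
    # that remains, so every time step contributes to the final loss.
    scalar_loss = per_step_loss.mean()
    print(scalar_loss)  # 0.5
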
A snippet of the model is:
LSTM(
    framelen,
    return_sequences=True
)
TimeDistributed(
    Dense(
        framelen,
        activation="relu"
    )
)
The data being fed is of shape (batches, timesteps, framelen), where timesteps is 20 as stated, batches covers the whole dataset, and framelen is 13 parameters scaled to the range 0 to 1.0. The final result should be a set of framelen parameters predicting the next steps in the sequence.
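For intuition, TimeDistributed(Dense(...)) applies one and the same Dense layer at every time step. A hedged NumPy sketch of that semantics, with random placeholder weights and input standing in for the trained layer:

    import numpy as np

    rng = np.random.default_rng(0)
    batches, timesteps, framelen = 4, 20, 13

    x = rng.random((batches, timesteps, framelen))  # LSTM output (return_sequences=True)
    W = rng.random((framelen, framelen))            # one shared weight matrix
    b = rng.random(framelen)                        # one shared bias

    # TimeDistributed(Dense(framelen, activation="relu")) amounts to
    # applying relu(x_t @ W + b) independently at every time step t,
    # with the same W and b throughout.
    y = np.maximum(x @ W + b, 0.0)
    print(y.shape)  # (4, 20, 13): a full sequence of predictions

So the output really is a prediction per time step, and a per-step loss can be computed against a target of the same shape.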
I am trying to confirm whether the standard loss functions actually calculate loss across all the time steps in the output. Reading the code, it appears the loss may be calculated on only a single time step, but that could just be my poor understanding of the code.
I have run the same training with both this model and an equivalent one whose final layer is a plain Dense (restructuring the expected output as a single step each time). The plain Dense model appears to train far better than the TimeDistributed equivalent: it converges to a lower minimum and its qualitative output is much better.
Does anybody have good insight into the way the loss functions work when time series data is used with TimeDistributed as the output? Is the loss actually calculated for each time step in the output? And if so, how is that per-step loss reduced to the single scalar value that gets optimized?
The way they work depends entirely on how they're defined.
Usually, every element of the tensor participates in the loss function. What varies is the order of the reductions: which axis is reduced first, and at which point the sums or means are taken. The elements are grouped by the axes of your target data.
In Keras, the standard losses typically compute a sub-loss along the last axis of the tensors and then take the mean (or sum) over the remaining axes.
When your output is a time series of shape (samples, steps, featuresOrClasses), the standard functions reduce over featuresOrClasses first, producing one loss value per time step per sample, and then average those values.
This is logical for classification problems, for instance. If you have 3 output classes and use categorical_crossentropy, the result must be calculated individually at each time step, considering only those 3 classes. So it makes sense to compute the result along the last axis (the only axis of size 3) and then average over the steps and samples.
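To make that concrete, here is a hedged NumPy sketch of per-step categorical crossentropy on a (samples, steps, classes) tensor; the shapes and values are illustrative, not from the question:

    import numpy as np

    # 2 samples, 3 steps, 3 classes: one-hot targets, uniform predictions.
    y_true = np.array([[[1, 0, 0], [0, 1, 0], [0, 0, 1]],
                       [[0, 1, 0], [1, 0, 0], [0, 0, 1]]], dtype=float)
    y_pred = np.full_like(y_true, 1 / 3)

    # Crossentropy is reduced over the class axis only (axis=-1),
    # giving one loss per time step per sample...
    per_step = -np.sum(y_true * np.log(y_pred), axis=-1)
    print(per_step.shape)   # (2, 3)

    # ...and the final scalar averages over samples and steps.
    print(per_step.mean())  # ln(3), about 1.0986
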
But this alone isn't enough to explain why your losses were different. That depends on what your targets are: are they classes, or are they forecasting a series?
The main difference is that the TimeDistributed model has far more output elements participating in the loss. It is probably harder to fit all of them.