Positional encoding is added to the input before it is passed into the transformer model, because otherwise the attention mechanism would be order-invariant. However, both the encoder and decoder are layered, with attention used in every layer. So if order matters for the attention mechanism, shouldn't the positional encoding be added to the input of each multi-head attention block, rather than just once at the input to the model?
The transformer uses residual connections around every sub-layer, so the positional encodings added at the input are carried forward through all the layers of the encoder and decoder; each sub-layer computes x + Sublayer(x), so the positional signal is never discarded and there is no need to re-inject it at each attention block.
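Here is a minimal sketch of that idea (PyTorch assumed; the names sinusoidal_positional_encoding and SimpleEncoderLayer are illustrative, not taken from the original Transformer code). The encoding is added exactly once before the first layer, and the residual additions inside each layer propagate it upward:

```python
import math
import torch
import torch.nn as nn


def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal encoding from 'Attention Is All You Need'."""
    position = torch.arange(seq_len).unsqueeze(1)  # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # (seq_len, d_model)


class SimpleEncoderLayer(nn.Module):
    """One encoder layer with post-norm residual connections (illustrative)."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: the sub-layer output is *added* to x, so
        # whatever positional information x carries is passed along.
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norm2(x + self.ff(x))
        return x


# Positional encoding is added once, before the first layer.
batch, seq_len, d_model = 2, 10, 64
tokens = torch.randn(batch, seq_len, d_model)  # stand-in for token embeddings
x = tokens + sinusoidal_positional_encoding(seq_len, d_model)

# The residuals in every subsequent layer carry the positional signal forward.
layers = nn.ModuleList(SimpleEncoderLayer(d_model) for _ in range(6))
for layer in layers:
    x = layer(x)
print(x.shape)  # torch.Size([2, 10, 64])
```

Because each layer's output is x plus a learned update rather than a replacement of x, the network can keep (or transform) the positional component as needed without it being re-supplied at every block.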