The Gaussian Error Linear Unit (GELU) activation function is used in the popular NLP model BERT. Is there any solid reason for this choice?
It is not fully understood why certain activation functions work better than others in different contexts, so the only honest answer to "why use GELU instead of ReLU?" is "because it works better empirically."
Edit: some explanation is possible; see this blog post (archive link to the blog post here). ReLU can suffer from a problem where a significant number of neurons in the network become zero and effectively do nothing. GELU is smoother near zero: it is differentiable everywhere and allows small (though nonzero) gradients in the negative range, which helps with this problem.
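A minimal sketch of that point, assuming PyTorch is available (this code is not from the original answer), comparing the gradients of ReLU and GELU for negative inputs:

```python
# Compare gradients of ReLU and GELU over negative inputs (assumes PyTorch).
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 0.0, steps=7, requires_grad=True)

# ReLU: gradient is exactly zero for x < 0, so these units get no learning signal.
relu_grad, = torch.autograd.grad(F.relu(x).sum(), x)

# GELU: gradient is small but nonzero for x < 0, so negative pre-activations
# still contribute to training.
gelu_grad, = torch.autograd.grad(F.gelu(x).sum(), x)

for xi, rg, gg in zip(x.tolist(), relu_grad.tolist(), gelu_grad.tolist()):
    print(f"x={xi:+.1f}  d/dx ReLU={rg:+.4f}  d/dx GELU={gg:+.4f}")
```

For every x < 0 the printed ReLU gradient is exactly 0, while the GELU gradient is small but nonzero.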
GELU is a smoother version of ReLU.
ReLU vs GELU:
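The comparison (presumably a plot in the original answer) can be reproduced with a quick sketch like the following, assuming NumPy, SciPy, and Matplotlib are available:

```python
# Plot ReLU against GELU to show how GELU smooths the kink at zero.
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import erf

x = np.linspace(-4, 4, 400)
relu = np.maximum(0.0, x)
gelu = 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))  # exact GELU: x * Phi(x)

plt.plot(x, relu, label="ReLU")
plt.plot(x, gelu, label="GELU")
plt.axhline(0, color="gray", linewidth=0.5)
plt.axvline(0, color="gray", linewidth=0.5)
plt.legend()
plt.title("ReLU vs GELU")
plt.show()
```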
I think the reason is stated in the paper itself: GELU weights its input by the standard Gaussian CDF, GELU(x) = x * Φ(x), so inputs are weighted by their value rather than gated by their sign as in ReLU, and the authors report consistent improvements over ReLU and ELU across vision, speech, and NLP tasks.
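For reference, a small sketch (assuming NumPy and SciPy) of the exact definition x * Φ(x) and the tanh approximation used in the original BERT code:

```python
# Exact GELU vs the tanh approximation used in the original BERT implementation.
import numpy as np
from scipy.stats import norm

def gelu_exact(x):
    # x * Phi(x), where Phi is the standard Gaussian CDF
    return x * norm.cdf(x)

def gelu_tanh(x):
    # Tanh-based approximation from the GELU paper, used in BERT's code
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4, 4, 401)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # approximation error is tiny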