Why "GELU" activation function is used instead of ReLu in BERT?

The Gaussian Error Linear Unit (GELU) activation function is used in the popular NLP model BERT. Is there any solid reason for this choice?

asked by Shamane Siriwardhana

2 Answers

It is not known why certain activation functions work better than others in different contexts. So the only answer for "why use GELU instead of ReLU" is "because it works better."

Edit: there is some explanation possible; see this blog (archive link to the blog post here). ReLU can suffer from "problems where significant amount of neuron in the network become zero and don’t practically do anything." GELU is smoother near zero and "is differentiable in all ranges, and allows to have gradients(although small) in negative range," which helps with this problem.
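
A minimal PyTorch sketch (my own illustration, not from the blog post) makes this concrete: for negative inputs, ReLU's gradient is exactly zero, while GELU still passes a small nonzero gradient, so those neurons can keep learning:

```python
import torch
import torch.nn.functional as F

# Negative inputs: the regime where ReLU "dies" (zero output, zero gradient).
x = torch.tensor([-3.0, -1.0, -0.1], requires_grad=True)

relu_out = F.relu(x)
relu_out.sum().backward()
relu_grad = x.grad.clone()

x.grad = None                  # reset before the second backward pass
gelu_out = F.gelu(x)           # GELU: x * Phi(x), Phi = standard normal CDF
gelu_out.sum().backward()
gelu_grad = x.grad.clone()

print("input     :", x.detach())
print("ReLU grad :", relu_grad)   # all zeros -> no learning signal
print("GELU grad :", gelu_grad)   # nonzero (though small for very negative x)
```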

answered by Sam H.


GELU is a smoother version of ReLU.

ReLU vs GELU:

[plot comparing the ReLU and GELU activation functions]
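
For reference, GELU(x) = x · Φ(x), where Φ is the standard normal CDF. A small self-contained sketch (plain Python; the tanh approximation shown is the one used in the original BERT TensorFlow code) shows how GELU closely tracks ReLU for positive inputs while dipping slightly below zero for small negative inputs instead of cutting off hard:

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF (via erf).
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation used in the original BERT implementation.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):+.4f}  gelu={gelu(x):+.4f}  gelu_tanh={gelu_tanh(x):+.4f}")
```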

I think the reason is stated in the paper:

[screenshot of the relevant excerpt from the paper]

answered by Lerner Zhang