The Gaussian Error Linear Unit (GELU) activation function is used in the popular NLP model BERT. Is there any solid reason for this choice?
It is not fully understood why certain activation functions work better than others in different contexts, so the only honest answer to "why use GELU instead of ReLU?" is "because it works better empirically."
Edit: some explanation is possible; see this blog post (archive link to the blog post here). ReLU can suffer from a problem where a significant number of neurons in the network become zero and effectively do nothing. GELU is smoother near zero: it is differentiable everywhere and allows small (though nonzero) gradients in the negative range, which helps with this problem.
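A minimal sketch of that point, assuming PyTorch is available (this code is not from the original answer), comparing the gradients of ReLU and GELU for negative inputs:

```python
# Compare gradients of ReLU and GELU over negative inputs (assumes PyTorch).
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 0.0, steps=7, requires_grad=True)

# ReLU: gradient is exactly zero for x < 0, so these units get no learning signal.
relu_grad, = torch.autograd.grad(F.relu(x).sum(), x)

# GELU: gradient is small but nonzero for x < 0, so negative pre-activations
# still contribute to training.
gelu_grad, = torch.autograd.grad(F.gelu(x).sum(), x)

for xi, rg, gg in zip(x.tolist(), relu_grad.tolist(), gelu_grad.tolist()):
    print(f"x={xi:+.1f}  d/dx ReLU={rg:+.4f}  d/dx GELU={gg:+.4f}")
```

For every x < 0 the printed ReLU gradient is exactly 0, while the GELU gradient is small but nonzero.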
GELU is a smoother version of ReLU.
ReLU vs GELU:
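The comparison (presumably a plot in the original answer) can be reproduced with a quick sketch like the following, assuming NumPy, SciPy, and Matplotlib are available:

```python
# Plot ReLU against GELU to show how GELU smooths the kink at zero.
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import erf

x = np.linspace(-4, 4, 400)
relu = np.maximum(0.0, x)
gelu = 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))  # exact GELU: x * Phi(x)

plt.plot(x, relu, label="ReLU")
plt.plot(x, gelu, label="GELU")
plt.axhline(0, color="gray", linewidth=0.5)
plt.axvline(0, color="gray", linewidth=0.5)
plt.legend()
plt.title("ReLU vs GELU")
plt.show()
```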
I think the reason is stated in the paper itself: GELU weights its input by the standard Gaussian CDF, GELU(x) = x * Φ(x), so inputs are weighted by their value rather than gated by their sign as in ReLU, and the authors report consistent improvements over ReLU and ELU across vision, speech, and NLP tasks.
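For reference, a small sketch (assuming NumPy and SciPy) of the exact definition x * Φ(x) and the tanh approximation used in the original BERT code:

```python
# Exact GELU vs the tanh approximation used in the original BERT implementation.
import numpy as np
from scipy.stats import norm

def gelu_exact(x):
    # x * Phi(x), where Phi is the standard Gaussian CDF
    return x * norm.cdf(x)

def gelu_tanh(x):
    # Tanh-based approximation from the GELU paper, used in BERT's code
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4, 4, 401)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # approximation error is tiny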