In NLP tasks, it's very common for people to annotate a sentence with special tokens like SOS (start of sentence) and EOS (end of sentence). Why do they do that?
Is the reason task-dependent? For instance, is the reason you pad with these tokens in an NER problem different from the reason you pad in a translation problem? In NER you might pad to extract more useful features from the context, whereas in translation you pad to mark the end of a sentence, since the decoder is trained sentence by sentence.
Let's say we want to use an RNN (recurrent neural net) to complete a sentence for us. We give it the sentence "If at first you don't succeed,". We'd like it to output "try try again" and then know to stop. It's the stopping that's important: if we just used a period, we couldn't use the same RNN to produce a multi-sentence response.
If we are using the RNN instead to respond to a question, then perhaps the answer has multiple sentences.
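To make this concrete, here is a minimal sketch of a generation loop that halts when the model emits an end token. The `next_token` function is a hypothetical stand-in for a trained RNN's single-step prediction, not any particular library's API:

```python
# Minimal sketch of greedy decoding that stops on an end-of-sequence token.
# `next_token` is a hypothetical stand-in for a trained model's next-token
# prediction; a real RNN would replace it.

EOS = "<eos>"  # special end-of-sequence token in the vocabulary
MAX_LEN = 50   # safety cap so a badly trained model cannot loop forever

def complete(prompt_tokens, next_token):
    output = list(prompt_tokens)
    for _ in range(MAX_LEN):
        token = next_token(output)  # model predicts the next token
        if token == EOS:            # the model has learned when to stop
            break
        output.append(token)
    return output[len(prompt_tokens):]

# Example with a canned "model" that always finishes the proverb:
canned = iter(["try", "try", "again", EOS])
print(complete("if at first you don't succeed ,".split(),
               lambda _: next(canned)))
# ['try', 'try', 'again']
```

Because EOS is its own vocabulary item rather than a period, the same loop can emit any number of sentences before stopping.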
Let's say we train an RNN on poetry and we want it to produce original poetry in the style it was trained on. We have to give it a first token to start the poem with. We could give it the first word ourselves, ... or we could just say "start". If we train the RNN to always start from a unique token (like a start-of-output token), then the RNN can choose the first word itself.
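The start-token side is symmetric. Here is a sketch of seeding generation with a dedicated start token, again with a hypothetical `next_token` in place of a trained model:

```python
# Minimal sketch of generation seeded by a start-of-sequence token,
# so the model itself picks the first real word. `next_token` is again
# a hypothetical stand-in for a trained RNN step.

SOS, EOS = "<sos>", "<eos>"

def generate(next_token, max_len=100):
    output = [SOS]                 # seed with the start-of-sequence token
    for _ in range(max_len):
        token = next_token(output)
        if token == EOS:
            break
        output.append(token)
    return output[1:]              # drop the SOS seed before returning
```

Because the model was trained to always see SOS first, it learns a distribution over opening words; without the start token, we would have to choose the first word for it.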
The start and end of a thing are so intuitive to us that I think it's easy to forget that at one point we had to learn when enough is enough (the end token) and when or how to start (the start token). The RNN has to learn both of these things.