I tried spaCy for NER, but the results are highly unpredictable. Sometimes spaCy does not recognize a particular country. Can anyone please explain why this happens? I tried it on some random sentences.
CASE 1:
import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp)
sent = "hello china hello japan"
doc = nlp(sent)
for i in doc.ents:
    print(i.text, " ", i.label_)
OUTPUT: no entities were printed in this case.
CASE 2:
import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp)
sent = "china is a populous nation in East Asia whose vast landscape encompasses grassland, desert, mountains, lakes, rivers and more than 14,000km of coastline."
doc = nlp(sent)
for i in doc.ents:
    print(i.text, " ", i.label_)
OUTPUT:
<spacy.lang.en.English object at 0x7f2213bde080>
china GPE
East Asia LOC
more than 14,000km QUANTITY
Natural-language models like spaCy's NER learn from the contextual structure of the sentence (the surrounding words). Why does that matter? Let's take the word Anwarvic as an example: it's a new word that you haven't seen before, and the spaCy model probably hasn't seen it either. Let's see how the NER model behaves as the provided sentence changes:
>>> nlp = spacy.load("en_core_web_sm")
>>> sent = "I love Anwarvic"
>>> doc = nlp(sent)
>>> for i in doc.ents:
...     print(i.text, " ", i.label_)
Anwarvic PERSON
>>> nlp = spacy.load("en_core_web_sm")
>>> sent = "Anwarvic is gigantic"
>>> doc = nlp(sent)
>>> for i in doc.ents:
...     print(i.text, " ", i.label_)
Anwarvic ORG
>>> nlp = spacy.load("en_core_web_sm")
>>> sent = "Anwarvic is awesome"
>>> doc = nlp(sent)
>>> for i in doc.ents:
...     print(i.text, " ", i.label_)
As we can see, the extracted entities vary as the contextual structure around Anwarvic varies. In the first sentence, the verb love very commonly takes a person as its object, which is why the spaCy model predicted PERSON. The same happens in the second sentence, where gigantic is an adjective often used to describe organizations, so the model predicted ORG. In the third sentence, awesome is a generic adjective that can describe basically anything, so the spaCy NER model was confused and extracted no entity at all.
Actually, when I ran the first provided code on my machine, it extracted both china and japan, like so:
china GPE
japan GPE