I want to know if there is an elegant way to get the index of an Entity with respect to a Sentence. I know I can get the index of an Entity in a string using ent.start_char and ent.end_char, but that value is with respect to the entire string.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion. Apple just launched a new Credit Card.")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
I want the Entity Apple in both the sentences to point to start and end indexes 0 and 5 respectively. How can I do that?
When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the trained pipelines typically include a tagger, a lemmatizer, a parser and an entity recognizer.
Essentially, spacy. load() is a convenience wrapper that reads the pipeline's config. cfg , uses the language and pipeline information to construct a Language object, loads in the model data and weights, and returns it.
orth is simply an integer that indicates the index of the occurrence of the word that is kept in the spacy.
To obtain fine-grained POS tags, we could use the tag_ attribute.
You need to subtract the sentence start position from the entity start positions:
for ent in doc.ents:
    print(ent.text, ent.start_char-ent.sent.start_char, ent.end_char-ent.sent.start_char, ent.label_)
#                                 ^^^^^^^^^^^^^^^^^^^^              ^^^^^^^^^^^^^^^^^^^^
Output:
Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
Apple 0 5 ORG
Credit Card 26 37 ORG
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With