
spaCy: add a span (doc[a:b]) as an entity in a Doc (Python)

I am running a regex over a whole document to find the spans where it matches:

import spacy
import re

nlp = spacy.load("en_core_web_sm")
doc = nlp("The United States of America (USA) are commonly known as the United States (U.S. or US) or America.")

expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"
for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    # This is a Span object or None 
    # if match doesn't map to valid token sequence
    if span is not None:
        print("Found match:", span.text)

spaCy also documents a way to get the span (a sequence of tokens) covering a regex match even when the boundaries of the match do not fall on token boundaries. See "How can I expand the match to a valid token sequence?" at https://spacy.io/usage/rule-based-matching
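As a rough sketch of that expansion: in spaCy v3, Doc.char_span accepts an alignment_mode argument, and "expand" widens a misaligned character range to the enclosing tokens. This may differ from the approach shown on the linked page. The blank English pipeline is an assumption made here so that no trained model is needed; only tokenization matters for this step.

```python
import re
import spacy

# Assumption: a blank pipeline (tokenizer only) is enough to show alignment.
nlp = spacy.blank("en")
doc = nlp("The United States of America (USA) are commonly known as "
          "the United States (U.S. or US) or America.")

expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"

strict_texts = []    # matches that already align with token boundaries
expanded_texts = []  # every match, widened to whole tokens if needed
for match in re.finditer(expression, doc.text):
    start, end = match.span()
    strict = doc.char_span(start, end)  # default alignment_mode="strict"
    if strict is not None:
        strict_texts.append(strict.text)
    expanded = doc.char_span(start, end, alignment_mode="expand")
    if expanded is not None:
        expanded_texts.append(expanded.text)

print("strict:  ", strict_texts)
print("expanded:", expanded_texts)
```

The regex matches "US" inside "USA", which is not a full token, so the strict lookup drops it while the expanded lookup returns the whole token "USA".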

So far so good.

Now that I have a collection of spans, how do I convert them into entities? I am aware of the EntityRuler, a pipeline component (see also the link above), but it takes patterns to search for in the doc as input, not spans.

If I want to run a regex over the whole document and turn the resulting collection of spans into ents, what is the next step? The EntityRuler? How? Or something else?

Put simpler:

nlp = spacy.load("en_core_web_sm")
doc = nlp("The aplicable law is article 102 section b sentence 6 that deals with robery")

I would like to create a spaCy entity out of doc[5:10] with the label "law" so that I can: A) loop over all the law entities in the texts, and B) use the visualizer to display the different entities contained in the doc.

JFerro asked Oct 31 '25 13:10


1 Answer

The most flexible way to add spans as entities to a doc is to use Doc.set_ents:

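# doc, start and end come from the regex loop in the question
span = doc.char_span(start, end, label="ENT")
# char_span returns None when the offsets do not align with token boundaries
if span is not None:
    doc.set_ents(entities=[span], default="unmodified")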

The default argument controls how all the other tokens in the doc are annotated. If you omit it, the other tokens are set to O (outside any entity), but you can pass default="unmodified" to leave them untouched, e.g. if you're adding entities incrementally.
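To make the incremental case concrete, here is a minimal sketch that adds regex matches as entities one at a time. The patterns and the LAW and CRIME labels are hypothetical, chosen only for illustration, and a blank English pipeline is assumed so that no trained model is required. The sample sentence is the one from the question, kept verbatim.

```python
import re
import spacy

# Assumption: blank pipeline; with en_core_web_sm the doc would already
# carry model-predicted entities that "unmodified" would also preserve.
nlp = spacy.blank("en")
doc = nlp("The aplicable law is article 102 section b sentence 6 "
          "that deals with robery")

# Hypothetical label -> regex mapping, for illustration only.
patterns = {
    "LAW": r"article \d+ section \w+ sentence \d+",
    "CRIME": r"robery",
}

for label, expression in patterns.items():
    for match in re.finditer(expression, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end, label=label, alignment_mode="expand")
        if span is not None:
            # "unmodified" keeps entities added in earlier iterations.
            doc.set_ents(entities=[span], default="unmodified")

found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

Calling set_ents once per span with the default left at "outside" would reset all other tokens on each call, so only the last span would survive; collecting the spans in a list and calling set_ents once is an alternative.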

https://spacy.io/api/doc#set_ents
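For the shorter example in the question, where the token indices are already known, a Span can be built directly instead of going through character offsets. The following is a sketch under the same assumptions as above (blank English pipeline, no trained model). It covers both goals from the question: looping over the "law" entities and rendering them with displacy.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
doc = nlp("The aplicable law is article 102 section b sentence 6 "
          "that deals with robery")

# Token indices 5..9 (the end index is exclusive), i.e. doc[5:10].
law_span = Span(doc, 5, 10, label="law")
doc.set_ents(entities=[law_span], default="unmodified")

# A) loop over all entities with the "law" label
law_ents = [ent for ent in doc.ents if ent.label_ == "law"]
for ent in law_ents:
    print(ent.text, ent.start, ent.end, ent.label_)

# B) render the entities; outside Jupyter this returns an HTML string
html = displacy.render(doc, style="ent", jupyter=False)
```

Note that doc[5:10] starts at the token "102"; to include the word "article" the span would need to start at index 4.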

aab answered Nov 03 '25 02:11


