
Spacy Matcher - Only Match Longest String [duplicate]

I'm trying to create noun chunks using the spaCy pattern matcher. For example, given the sentence "The ice hockey scrimmage took hours.", I want to return "ice hockey scrimmage" and "hours". I currently have this:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab)
matcher.add("NounChunks", None, [{"POS": "NOUN"}, {"POS": "NOUN", "OP": "*"}, {"POS": "NOUN", "OP": "*"}])

doc = nlp("The ice hockey scrimmage took hours.")
matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id] 
    span = doc[start:end]  
    print(match_id, string_id, start, end, span.text)

But it returns every sub-span of "ice hockey scrimmage", not just the longest one:

12482938965902279598 NounChunks 1 2 ice
12482938965902279598 NounChunks 1 3 ice hockey
12482938965902279598 NounChunks 2 3 hockey
12482938965902279598 NounChunks 1 4 ice hockey scrimmage
12482938965902279598 NounChunks 2 4 hockey scrimmage
12482938965902279598 NounChunks 3 4 scrimmage
12482938965902279598 NounChunks 5 6 hours

Is there something I'm missing in how to define the pattern? I want it to return only:

12482938965902279598 NounChunks 1 4 ice hockey scrimmage
12482938965902279598 NounChunks 5 6 hours
user3242036 asked Sep 03 '25 09:09


1 Answer

I do not know of a built-in way to keep only the longest match, but there is a utility function, spacy.util.filter_spans(spans), that helps with this. Among overlapping spans it keeps the longest one; if overlapping spans have the same length, the one that starts earlier in the document is preferred.

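import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab)
# "+" matches one or more consecutive nouns.
# spaCy v2 call; in spaCy v3 the signature is matcher.add("NounChunks", [pattern]).
matcher.add("NounChunks", None, [{"POS": "NOUN", "OP": "+"}])

doc = nlp("The ice hockey scrimmage took hours.")
matches = matcher(doc)

# Convert (match_id, start, end) tuples to Span objects, then drop the shorter overlapping ones.
spans = [doc[start:end] for _, start, end in matches]
print(spacy.util.filter_spans(spans))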

Output

[ice hockey scrimmage, hours]
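
To make the filtering step less of a black box, here is a rough pure-Python sketch of the same idea, working on plain (start, end) token offsets instead of Span objects. The helper name filter_longest is made up for illustration and is not part of spaCy; it only assumes the behaviour described above (longest span wins, ties go to the earlier start, overlaps are dropped).

```python
def filter_longest(spans):
    """Keep the longest non-overlapping (start, end) spans; end is exclusive.

    Hypothetical helper for illustration, not a spaCy function.
    """
    # Longest first; among equal lengths, the span that starts earlier wins.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, seen = [], set()
    for start, end in ordered:
        # Skip a span if its first or last token is already covered.
        if start not in seen and end - 1 not in seen:
            kept.append((start, end))
            seen.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept)


# Token offsets of the seven matches listed in the question.
matches = [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4), (3, 4), (5, 6)]
result = filter_longest(matches)
print(result)  # [(1, 4), (5, 6)]
```

The two surviving offsets correspond to "ice hockey scrimmage" and "hours". If you are on spaCy v3, it may also be worth trying the greedy="LONGEST" argument of Matcher.add, which asks the matcher itself to filter overlapping matches.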
Raqib answered Sep 04 '25 23:09