
Add a SpaCy Tokenizer Exception: Do not split '>>'

I am trying to add an exception to recognize '>>' and '>> ' as an indicator to start a new sentence. For example,

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'>> We should. >>No.')

for sent in doc.sents:
    print(sent)

It prints out:

>> We should.
>
>
No.

But I'd like it to print:

>> We should.
>> No. 

Thank you for your time in advance!

Eric J asked Oct 25 '25 22:10


1 Answer

You need to create a custom pipeline component that runs before the parser. The spaCy documentation's code examples include one for custom sentence segmentation, which it describes as follows:

Example of adding a pipeline component to prohibit sentence boundaries before certain tokens.

The code (adapting the example to your needs):

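import spacy


def prevent_sentence_boundaries(doc):
    # Mark disallowed tokens before the parser runs; the parser keeps
    # sentence boundaries that are already set.
    for token in doc:
        if not can_be_sentence_start(token):
            token.is_sent_start = False
    return doc


def can_be_sentence_start(token):
    # A token directly after '>' must not open a new sentence. This keeps
    # the two '>' tokens of '>>' and the word after them in one sentence.
    if token.i > 0 and token.nbor(-1).text == '>':
        return False
    return True


nlp = spacy.load('en_core_web_sm')
# spaCy v2 style: the component function itself is passed to add_pipe.
nlp.add_pipe(prevent_sentence_boundaries, before='parser')

raw_text = u'>> We should. >> No.'
doc = nlp(raw_text)
# Span.text works in both spaCy v2 and v3 (Span.string was removed in v3).
sentences = [sent.text.strip() for sent in doc.sents]
for sentence in sentences:
    print(sentence)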

Output

>> We should.
>> No.
Dani Mesejo answered Oct 29 '25 11:10