Prevent Spacy tokenizer from splitting on specific character

When using spaCy to tokenize a sentence, I want it not to split tokens on the / character.

Example:

import en_core_web_lg
nlp = en_core_web_lg.load()
for i in nlp("Get 10ct/liter off when using our App"):
    print(i)

Output:

Get
10ct
/
liter
off
when
using
our
App

I want the tokens to be Get, 10ct/liter, off, when, and so on.

I was able to find how to add more split rules to the spaCy tokenizer, but not how to disable a specific existing one.

Siddharth Jain asked Nov 30 '25 20:11


1 Answer

I suggest customizing the tokenizer's infix patterns; see "Modifying existing rule sets" in the spaCy documentation:

import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, HYPHENS
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_trf")
text = "Get 10ct/liter off when using our App"
# Modify tokenizer infix patterns
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        #r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA),
    ]
)

infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp(text)
print([t.text for t in doc])
## =>  ['Get', '10ct/liter', 'off', 'when', 'using', 'our', 'App']

Note the commented-out line #r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),: I simply removed the / character from the [:<>=/] character class. That rule splits at a / that sits between a letter or digit and a letter.
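
To see the effect of that one-character change in isolation, here is a minimal sketch that uses only Python's re module. It assumes a simplified, ASCII-only stand-in for spaCy's ALPHA class (the real class covers many more scripts), and the helper name infix_matches is hypothetical, not part of spaCy.

```python
import re

# Assumption: simplified stand-in for spaCy's ALPHA class (ASCII letters only).
ALPHA = "a-zA-Z"

# Default-style rule: split on : < > = / between a letter/digit and a letter.
with_slash = re.compile(r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA))
# Modified rule from the answer: the same rule with "/" removed from the class.
without_slash = re.compile(r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA))


def infix_matches(pattern, token):
    """Return the infix characters that `pattern` would split on inside `token`."""
    return [m.group() for m in pattern.finditer(token)]


print(infix_matches(with_slash, "10ct/liter"))     # ['/']
print(infix_matches(without_slash, "10ct/liter"))  # []
print(infix_matches(without_slash, "a=b"))         # ['=']
```

The modified pattern no longer matches the slash, so the tokenizer has no infix to split on there, while the other characters in the class still act as split points.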

If you still need to split '12/ct' into three tokens, add another pattern right after the r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA) line:

r"(?<=[0-9])/(?=[{a}])".format(a=ALPHA),
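
As a rough check of what this extra rule matches, the sketch below applies it with plain re. It again assumes an ASCII-only stand-in for ALPHA, and split_on_infix is a hypothetical helper that only imitates infix splitting; spaCy's real tokenizer also applies prefix, suffix, and exception rules that are not modeled here.

```python
import re

# Assumption: simplified stand-in for spaCy's ALPHA class (ASCII letters only).
ALPHA = "a-zA-Z"

# Extra rule: split on "/" only when a digit precedes it and a letter follows.
digit_slash = re.compile(r"(?<=[0-9])/(?=[{a}])".format(a=ALPHA))


def split_on_infix(pattern, token):
    """Split `token` around every match of `pattern`, keeping each match as its own piece."""
    pieces, start = [], 0
    for m in pattern.finditer(token):
        if m.start() > start:
            pieces.append(token[start:m.start()])
        pieces.append(m.group())
        start = m.end()
    if start < len(token):
        pieces.append(token[start:])
    return pieces


print(split_on_infix(digit_slash, "12/ct"))       # ['12', '/', 'ct']
print(split_on_infix(digit_slash, "10ct/liter"))  # ['10ct/liter']
```

Because the lookbehind requires a digit immediately before the slash, '10ct/liter' stays whole (a letter precedes the slash) while '12/ct' is split.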
Wiktor Stribiżew answered Dec 03 '25 11:12
