Good day SO,
I am trying to post-process hyphenated words that are split into separate tokens when they should be a single token.
Example:
Sentence: "up-scaled"
Tokens: ['up', '-', 'scaled']
Expected: ['up-scaled']
For now, my solution is to use the matcher:
from spacy.matcher import Matcher

# nlp is an already loaded spaCy pipeline
matcher = Matcher(nlp.vocab)
pattern = [{'IS_ALPHA': True, 'IS_SPACE': False},
           {'ORTH': '-'},
           {'IS_ALPHA': True, 'IS_SPACE': False}]
matcher.add('HYPHENATED', None, pattern)
def quote_merger(doc):
    # this will be called on the Doc object in the pipeline
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:  # merge into one token after collecting all matches
        span.merge()
    #print(doc)
    return doc
nlp.add_pipe(quote_merger, first=True)  # add it right after the tokenizer
doc = nlp(text)
However, this causes the following problem:
Example 2:
Sentence: "I know I will be back - I had a very pleasant time"
Tokens: ['i', 'know', 'I', 'will', 'be', 'back - I', 'had', 'a', 'very', 'pleasant', 'time']
Expected: ['i', 'know', 'I', 'will', 'be', 'back', '-', 'I', 'had', 'a', 'very', 'pleasant', 'time']
Is there a way to merge only hyphenated words that have no spaces around the hyphen, so that words like 'up-scaled' are matched and combined into a single token, but '.. back - I ..' is not?
Thank you very much
EDIT: I have tried the solution posted in: Why does spaCy not preserve intra-word-hyphens during tokenization like Stanford CoreNLP does?
However, I didn't use that solution because it tokenized words with apostrophes (') and numbers with decimals incorrectly:
Sentence: "It's"
Tokens: ["I", "t's"]
Expected: ["It", "'s"]
Sentence: "1.50"
Tokens: ["1", ".", "50"]
Expected: ["1.50"]
That is why I used Matcher instead of trying to edit the regex.
In spaCy, tokenizing a text into words and punctuation happens in several steps, and the text is processed from left to right. First, the tokenizer splits the text on whitespace, similar to str.split(). Then it checks whether each substring matches a tokenizer exception rule, and otherwise tries to split off prefixes, suffixes, and infixes.
spaCy tokenizes a text automatically when you create a Doc by calling nlp on it.
Tokenization can split text into words or into sentences. Splitting into words is called word tokenization, and splitting into sentences is called sentence tokenization.
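To make the whitespace-then-rules order concrete, here is a toy sketch using only Python's re module. It is not spaCy's implementation: toy_tokenize and HYPHEN_INFIX are hypothetical names, and the ASCII-only character class is a simplified stand-in for spaCy's broader ALPHA and HYPHENS classes. It only illustrates why a hyphen inside a word and a hyphen surrounded by spaces reach the infix step differently.

```python
import re

# Simplified stand-in for the hyphen infix rule: a hyphen with a letter
# directly on both sides (ASCII-only; an assumption made for this sketch).
HYPHEN_INFIX = re.compile(r"(?<=[A-Za-z])-(?=[A-Za-z])")


def toy_tokenize(text, split_hyphens=True):
    """Split on whitespace first, then optionally split each chunk on infix hyphens."""
    tokens = []
    for chunk in text.split():
        if not split_hyphens:
            tokens.append(chunk)
            continue
        start = 0
        for match in HYPHEN_INFIX.finditer(chunk):
            tokens.append(chunk[start:match.start()])  # text before the hyphen
            tokens.append(match.group())               # the hyphen itself
            start = match.end()
        tokens.append(chunk[start:])                   # remainder of the chunk
    return tokens


print(toy_tokenize("up-scaled"))                       # ['up', '-', 'scaled']
print(toy_tokenize("up-scaled", split_hyphens=False))  # ['up-scaled']
print(toy_tokenize("back - I", split_hyphens=False))   # ['back', '-', 'I']
```

Because the whitespace split happens first, the free-standing hyphen in "back - I" is already its own chunk, so turning off the infix rule only affects hyphens inside a word.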
The Matcher is not really the right tool for this. You should modify the tokenizer instead.
If you want to preserve how everything else is handled and only change the behavior for hyphens, you should modify the existing infix pattern and preserve all the other settings. The current English infix pattern definition is here:
https://github.com/explosion/spaCy/blob/58533f01bf926546337ad2868abe7fc8f0a3b3ae/spacy/lang/punctuation.py#L37-L49
You can add new patterns without defining a custom tokenizer, but there's no way to remove a pattern without defining a custom tokenizer. So, if you comment out the hyphen pattern and define a custom tokenizer:
import spacy
from spacy.tokenizer import Tokenizer
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex
def custom_tokenizer(nlp):
    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            r"(?<=[0-9])[+\-\*^](?=[0-9-])",
            r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
                al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
            ),
            r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
            #r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
            r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
        ]
    )
    infix_re = compile_infix_regex(infixes)
    return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                                suffix_search=nlp.tokenizer.suffix_search,
                                infix_finditer=infix_re.finditer,
                                token_match=nlp.tokenizer.token_match,
                                rules=nlp.Defaults.tokenizer_exceptions)
nlp = spacy.load("en")
nlp.tokenizer = custom_tokenizer(nlp)
print([t.text for t in nlp("It's '1.50', up-scaled haven't")])
# ['It', "'s", "'", '1.50', "'", ',', 'up-scaled', 'have', "n't"]
You do need to provide the current prefix/suffix/token_match settings when initializing the new Tokenizer to preserve the existing tokenizer behavior. See also (for German, but very similar): https://stackoverflow.com/a/57304882/461847
Edited to add (since this does seem unnecessarily complicated and you really should be able to redefine the infix patterns without loading a whole new custom tokenizer):
If you have just loaded the model (for v2.1.8) and you haven't called nlp() yet, you can also just replace the infix_re.finditer without creating a custom tokenizer:
nlp = spacy.load('en')
# infix_re is the result of compile_infix_regex(infixes), built as in custom_tokenizer above
nlp.tokenizer.infix_finditer = infix_re.finditer
There's a caching bug that should hopefully be fixed in v2.2 that will let this work correctly at any point rather than just with a newly loaded model. (The behavior is extremely confusing otherwise, which is why creating a custom tokenizer has been a better general-purpose recommendation for v2.1.8.)
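As a rough picture of what dropping the hyphen pattern from the infix list does, the sketch below uses only Python's re module. The pattern list is a hypothetical, ASCII-only simplification of the real one, and compile_patterns simply joins the patterns with "|", which is assumed here to mirror what compile_infix_regex does. It is meant to show that removing one pattern leaves the remaining infix rules untouched, not to replace the spaCy code above.

```python
import re

# Hypothetical, ASCII-only stand-ins for a few of the infix patterns.
HYPHEN_PATTERN = r"(?<=[A-Za-z])-(?=[A-Za-z])"
TOY_INFIXES = [
    r"(?<=[0-9])[+\-*^](?=[0-9-])",          # arithmetic operators between digits
    r"(?<=[A-Za-z]),(?=[A-Za-z])",           # comma between letters
    HYPHEN_PATTERN,                          # hyphen between letters
    r"(?<=[A-Za-z0-9])[:<>=/](?=[A-Za-z])",  # :, <, >, =, / before a letter
]


def compile_patterns(patterns):
    """Join the patterns into one alternation (assumed to mirror compile_infix_regex)."""
    return re.compile("|".join(patterns))


def infix_hits(regex, text):
    """Return the infix characters the regex would split on in text."""
    return [m.group() for m in regex.finditer(text)]


with_hyphen = compile_patterns(TOY_INFIXES)
without_hyphen = compile_patterns([p for p in TOY_INFIXES if p != HYPHEN_PATTERN])

print(infix_hits(with_hyphen, "up-scaled"))     # ['-']
print(infix_hits(without_hyphen, "up-scaled"))  # []
print(infix_hits(without_hyphen, "3-4"))        # ['-']
print(infix_hits(without_hyphen, "and/or"))     # ['/']
```

The hyphen between digits is still split because it is covered by a separate pattern, which is why only the letter-hyphen-letter rule needs to be removed.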
If nlp = spacy.load('en') raises an error, load the model by its full name instead:
nlp = spacy.load("en_core_web_sm")