
Spacy replace token

Tags: python, spacy

I am trying to replace a word without destroying the whitespace structure of the sentence. Suppose I have the sentence text = "Hi this is my dog." and I wish to replace dog with Simba. Following the answer from https://stackoverflow.com/a/57206316/2530674, I did:

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_lg")

doc1 = nlp("Hi this is my dog.")
new_words = [token.text if token.text != "dog" else "Simba" for token in doc1]
Doc(doc1.vocab, words=new_words)
# Hi this is my Simba .

Notice the extra space before the full stop (it ought to be Hi this is my Simba.). Is there a way to avoid this behaviour? I'm happy with a general Python string-processing answer too.

sachinruk asked Oct 17 '25 13:10


2 Answers

The function below replaces any number of matches (found with spaCy's Matcher), keeps the same whitespace as the original text, and handles edge cases appropriately (such as a match at the very beginning of the text):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")

matcher = Matcher(nlp.vocab)
# spaCy v3 signature: add(key, patterns). In v2 this was matcher.add("dog", None, [{"LOWER": "dog"}]).
matcher.add("dog", [[{"LOWER": "dog"}]])

def replace_word(orig_text, replacement):
    tok = nlp(orig_text)
    text = ''
    buffer_start = 0  # index of the first token not yet copied into `text`
    for _, match_start, _ in matcher(tok):
        if match_start > buffer_start:
            # Copy the tokens skipped since the last match, plus the trailing
            # whitespace of the last skipped token (Span.text omits it).
            text += tok[buffer_start: match_start].text + tok[match_start - 1].whitespace_
        # Emit the replacement, followed by the matched token's trailing whitespace.
        text += replacement + tok[match_start].whitespace_
        buffer_start = match_start + 1
    text += tok[buffer_start:].text  # copy whatever follows the last match
    return text

>>> replace_word("Hi this is my dog.", "Simba")
'Hi this is my Simba.'

>>> replace_word("Hi this dog is my dog.", "Simba")
'Hi this Simba is my Simba.'
Ethan Perez answered Oct 19 '25 12:10


spaCy tokens have some attributes that could help you. First, there's token.text_with_ws, which gives you the token's text with its original trailing whitespace, if it had any. Second, there's token.whitespace_, which returns just the trailing whitespace of the token (an empty string if there was none). If you don't need the large pipeline (en_core_web_lg) for other things you're doing, you could just use spaCy's tokenizer.

from spacy.lang.en import English
nlp = English()  # you probably don't need to load the whole language model for this
tokenizer = nlp.tokenizer
tokens = tokenizer("Hi this is my dog.")

modified = ""
for token in tokens:
    if token.text != "dog":
        modified += token.text_with_ws
    else:
        modified += "Simba"
        modified += token.whitespace_  # keep the replaced token's trailing whitespace

print(modified)  # Hi this is my Simba.
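
If a Doc is needed rather than a plain string, the same whitespace information can be handed to the Doc constructor through its spaces argument. When spaces is omitted, every word is assumed to be followed by a space, which is where the stray space before the full stop comes from. A minimal sketch, assuming spacy.blank("en") as a tokenizer-only stand-in for the loaded pipeline (the variable names are illustrative, not from the original answers):

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank English pipeline is enough here, since only
# tokenization is needed. Swap in spacy.load(...) if you need more.
nlp = spacy.blank("en")
doc = nlp("Hi this is my dog.")

# Replace the target token's text, leaving every other token untouched.
words = ["Simba" if token.text == "dog" else token.text for token in doc]

# One boolean per token: True if the original token was followed by a space.
spaces = [bool(token.whitespace_) for token in doc]

new_doc = Doc(doc.vocab, words=words, spaces=spaces)
print(new_doc.text)  # Hi this is my Simba.
```

Note that the new Doc carries no annotations (tags, entities, and so on), so it would need to be run through the pipeline components again if those are required.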
campc704 answered Oct 19 '25 12:10
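
Since the question also welcomes a general Python string-processing answer, here is a minimal sketch that uses only the standard re module. The helper name replace_whole_word is hypothetical, and unlike the Matcher pattern with LOWER above, this version is case-sensitive unless re.IGNORECASE is passed.

```python
import re

def replace_whole_word(text, target, replacement):
    """Replace whole-word occurrences of `target`, leaving all spacing untouched."""
    # \b anchors restrict the match to whole words, so "dogma" is left alone.
    pattern = r"\b" + re.escape(target) + r"\b"
    # A function replacement keeps backslashes in `replacement` literal.
    return re.sub(pattern, lambda match: replacement, text)

print(replace_whole_word("Hi this is my dog.", "dog", "Simba"))
# Hi this is my Simba.
```

Because only the matched characters are substituted, the surrounding whitespace and punctuation are preserved exactly as in the input. This does not use spaCy's tokenization, so it may behave differently on text where word boundaries and spaCy tokens disagree.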