I am trying to replace a word without destroying the space structure in the sentence. Suppose I have the sentence text = "Hi this is my dog." and I wish to replace "dog" with "Simba". Following the answer from https://stackoverflow.com/a/57206316/2530674 I did:
import spacy
nlp = spacy.load("en_core_web_lg")
from spacy.tokens import Doc
doc1 = nlp("Hi this is my dog.")
new_words = [token.text if token.text!="dog" else "Simba" for token in doc1]
Doc(doc1.vocab, words=new_words)
# Hi this is my Simba .
Notice how there is an extra space before the full stop (it ought to be "Hi this is my Simba."). Is there a way to remove this behaviour? Happy with a general Python string-processing answer too.
The function below replaces any number of matches (found with spaCy), keeps the same whitespacing as the original text, and handles edge cases appropriately (such as when the match is at the very beginning of the text):
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)
matcher.add("dog", None, [{"LOWER": "dog"}])  # spaCy 2.x signature; in spaCy 3.x this is matcher.add("dog", [[{"LOWER": "dog"}]])

def replace_word(orig_text, replacement):
    tok = nlp(orig_text)
    text = ''
    buffer_start = 0
    for _, match_start, _ in matcher(tok):
        if match_start > buffer_start:  # If we've skipped over some tokens, add those in (with trailing whitespace if available)
            text += tok[buffer_start: match_start].text + tok[match_start - 1].whitespace_
        text += replacement + tok[match_start].whitespace_  # Replace the matched token, with trailing whitespace if available
        buffer_start = match_start + 1
    text += tok[buffer_start:].text
    return text
>>> replace_word("Hi this is my dog.", "Simba")
Hi this is my Simba.
>>> replace_word("Hi this dog is my dog.", "Simba")
Hi this Simba is my Simba.
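If you need the same trick for different target words, the approach generalises by building the Matcher inside the function. Here is a minimal sketch of that variation, assuming the spaCy 3.x matcher.add signature (the name replace_matches and the target parameter are mine, not part of the answer above):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")

def replace_matches(orig_text, target, replacement):
    # Build a matcher for this specific target word (case-insensitive match on the lowercase form)
    matcher = Matcher(nlp.vocab)
    matcher.add("target", [[{"LOWER": target.lower()}]])  # spaCy 3.x signature
    tok = nlp(orig_text)
    text = ''
    buffer_start = 0
    for _, match_start, _ in matcher(tok):
        if match_start > buffer_start:
            # Add the tokens we skipped, plus the whitespace that preceded the match
            text += tok[buffer_start: match_start].text + tok[match_start - 1].whitespace_
        # Insert the replacement, keeping the matched token's trailing whitespace
        text += replacement + tok[match_start].whitespace_
        buffer_start = match_start + 1
    text += tok[buffer_start:].text
    return text

>>> replace_matches("Hi this dog is my cat.", "cat", "Simba")
'Hi this dog is my Simba.'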
spaCy tokens have some attributes that can help you here. First there's token.text_with_ws, which gives you the token's text together with its original trailing whitespace, if it had any. Second, there's token.whitespace_, which returns just the trailing whitespace on the token (an empty string if there was none). If you don't need the large en_core_web_lg pipeline for anything else you're doing, you can just use spaCy's tokenizer:
from spacy.lang.en import English

nlp = English()  # you probably don't need to load the whole language model for this
tokenizer = nlp.tokenizer

tokens = tokenizer("Hi this is my dog.")

modified = ""
for token in tokens:
    if token.text != "dog":
        modified += token.text_with_ws
    else:
        modified += "Simba"
        modified += token.whitespace_
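Printing the result then shows the sentence with the original spacing preserved:

print(modified)
# Hi this is my Simba.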