Is it possible to properly tokenize emoticons like :), :(, ;~( using the spaCy Python library? For example, if I run the following code:
import spacy
nlp = spacy.load('en')
doc = nlp("Hello bright world :)")
And then visualize the doc with displaCy:

It incorrectly parses world :) as one token. How can I modify spaCy so it recognizes these additional symbols? Thanks.
Edit: I found https://github.com/ines/spacymoji, but I think it only supports Unicode emoji like ✨ and not ASCII emoticons like :)?
Yes, spaCy actually includes a pretty comprehensive list of text-based emoticons as part of its tokenizer exceptions. So using your example above and printing the individual tokens, the emoticon is tokenized correctly:
doc = nlp("Hello bright world :)")
print([token.text for token in doc])
# ['Hello', 'bright', 'world', ':)']
I think what happens here is that you actually came across an interesting (maybe non-ideal) edge case with the displacy defaults. To avoid very long dependency arcs for punctuation, the collapse_punct setting defaults to True. This means that when the visualisation is rendered, punctuation is merged onto the preceding token. Punctuation is identified by checking whether the token's is_punct attribute returns True – which also happens to be the case for ":)".
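To see both halves of this for yourself, here's a minimal sketch (assuming spaCy is installed; spacy.blank avoids downloading a trained model, and its tokenizer still includes the built-in emoticon exceptions):

```python
import spacy

# A blank English pipeline still uses the full tokenizer,
# including the built-in emoticon exceptions.
nlp = spacy.blank("en")
doc = nlp("Hello bright world :)")

# ":)" comes out as its own token...
print([token.text for token in doc])
# ['Hello', 'bright', 'world', ':)']

# ...but its is_punct flag is True, which is why displaCy's
# collapse_punct default merges it onto the preceding token.
print(doc[3].is_punct)
# True
```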
In your example, you can work around this by setting collapse_punct to False in the options passed to displacy.serve:
displacy.serve(doc, style='dep', options={'collapse_punct': False})
(The displaCy visualizer should probably include an exception for emoticons when merging punctuation. This is currently difficult, because spaCy doesn't have an is_emoji or is_symbol flag. However, it might be a nice addition in the future – you can vote for it on this thread.)
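In the meantime, if you need an emoticon flag in your own code, you can approximate one with a custom token extension. A sketch using spaCy's extension attribute API; the is_emoticon name and the emoticon set here are my own, not part of spaCy:

```python
import spacy
from spacy.tokens import Token

# Hypothetical emoticon set -- extend as needed.
EMOTICONS = {":)", ":(", ";~(", ":-)", ":-(", ":D", ":P"}

# Register a custom attribute, accessible as token._.is_emoticon.
# force=True allows re-registering when re-running the script.
Token.set_extension(
    "is_emoticon",
    getter=lambda token: token.text in EMOTICONS,
    force=True,
)

nlp = spacy.blank("en")
doc = nlp("Hello bright world :)")
print([(token.text, token._.is_emoticon) for token in doc])
# [('Hello', False), ('bright', False), ('world', False), (':)', True)]
```

This only covers emoticons you list explicitly, but it gives you a flag you could consult before deciding whether to merge a "punctuation" token in your own rendering logic.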