I'm trying to find a way to parse a string into grapheme clusters using Python 2.7.1. For example, the string:
details = u"Hello 🇦🇹🇻🇪"
I believe should be parsed as:
[u"H", u"e", u"l", u"l", u"o", u" ", u"\U0001f1e6\U0001f1f9", u"\U0001f1fb\U0001f1ea"]
I was using grapheme_clusters from the uniseg library, but this produces:
[u"H", u"e", u"l", u"l", u"o", u" ", u"\U0001f1e6\U0001f1f9\U0001f1fb\U0001f1ea"]
I have a hard requirement to use Python 2.7.1, although I know Unicode support is better in Python 3.x.
This behavior used to be correct, but the rules changed.
As of uniseg version 0.7.1 (current as of this post), the uniseg documentation refers to an outdated version of the Unicode grapheme cluster boundary rules, given in Unicode Standard Annex #29 Revision 21. That revision includes the rule
Do not break between regional indicator symbols.
whereas the most recent version of the rules, given in Unicode Standard Annex #29 Revision 29, says
Do not break within emoji flag sequences. That is, do not break between regional indicator (RI) symbols if there is an odd number of RI characters before the break point.
You could file a bug report with uniseg, or look for a different library with a more up-to-date implementation. The uniseg Bitbucket page links to a few alternatives, such as PyICU.
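Until the library catches up, one possible stopgap is to post-process the clusters you already get and re-split any run of regional indicators into pairs, which is what the newer rule amounts to. This is only a rough sketch, not part of uniseg: the names split_flag_clusters, _code_points and _is_regional_indicator are made up for illustration, and it assumes the only discrepancy you care about is the flag rule. It joins surrogate pairs by hand so it should also behave on narrow Python 2 builds, where each regional indicator is stored as two code units.

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers (not part of uniseg): re-split clusters made only of
# regional indicator (RI) symbols into pairs, following the newer UAX #29 rule.


def _code_points(text):
    """Yield each code point of text as a string.

    On narrow Python 2 builds an astral character is stored as a surrogate
    pair, so the two code units are joined back into one item here.
    """
    i = 0
    while i < len(text):
        ch = text[i]
        if (0xD800 <= ord(ch) <= 0xDBFF and i + 1 < len(text)
                and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF):
            ch = text[i:i + 2]
        yield ch
        i += len(ch)


def _is_regional_indicator(ch):
    """Return True if ch (one code point) lies in U+1F1E6..U+1F1FF."""
    if len(ch) == 2:  # surrogate pair on a narrow build
        cp = 0x10000 + ((ord(ch[0]) - 0xD800) << 10) + (ord(ch[1]) - 0xDC00)
    else:
        cp = ord(ch)
    return 0x1F1E6 <= cp <= 0x1F1FF


def split_flag_clusters(clusters):
    """Split any all-RI cluster longer than two symbols into pairs."""
    result = []
    for cluster in clusters:
        chars = list(_code_points(cluster))
        if len(chars) > 2 and all(_is_regional_indicator(c) for c in chars):
            # Pair up from the left; an odd trailing RI stays on its own.
            for i in range(0, len(chars), 2):
                result.append(u"".join(chars[i:i + 2]))
        else:
            result.append(cluster)
    return result
```

With uniseg you would call it as split_flag_clusters(grapheme_clusters(details)); any other cluster is passed through unchanged, so this only touches the flag sequences.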