I have a csv file with a column containing lists of strings. It seems like those strings have hidden characters which I only get to see when removing certain characters from each string.
#string copied from column
print(len('kommunikationsfähigkeit'))
#same string entered by me
print(len('kommunikationsfähigkeit'))
24
23
When I remove parts of the string copied from the column, I get this:
''̈igkeit'
Anyone know what's going on there? I tried reading the csv with encoding='utf8', but it didn't change anything. I obviously want to get rid of those characters.
Both are UTF-8, but there are different ways of encoding the same visual character. The string you typed yourself (length 23) contains U+00E4, LATIN SMALL LETTER A WITH DIAERESIS. The string copied from the column (length 24) contains “a” followed by U+0308, COMBINING DIAERESIS ( ̈ ), which, in combination, is rendered as “ä”.
You can inspect the strings yourself using unicodedata:
import unicodedata
# `string` is whichever value you want to inspect
for c in string:
    print(unicodedata.name(c))
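To make the difference visible without relying on copy and paste, here is a minimal sketch that builds both variants from explicit escape sequences. The names `composed`, `decomposed` and `describe` are made up for illustration, and the shortened word "fähig" stands in for the full string.

```python
import unicodedata

# Hypothetical sample values, written with escapes so the code points are unambiguous
composed = "f\u00e4hig"      # precomposed ä (one code point)
decomposed = "fa\u0308hig"   # "a" followed by a combining diaeresis (two code points)

def describe(text):
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

for label, text in (("composed", composed), ("decomposed", decomposed)):
    print(label, len(text))
    for entry in describe(text):
        print("   ", entry)
```

The two strings look the same when printed, but the decomposed one is one code point longer, which matches the 24 versus 23 in the question.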
Both of the above are valid ways of representing “ä”, and they count as equivalent under a suitable Unicode normalisation. You can use unicodedata.normalize to normalise different representations. For instance, you could transform both strings into normal form C (though the first one, a, already happens to be in NFC):
a = 'kommunikationsfähigkeit'
b = 'kommunikationsfähigkeit'
print(f'len(a) = {len(a)}')
# len(a) = 23
print(f'len(b) = {len(b)}')
# len(b) = 24
print(f'a == b: {a == b}')
# a == b: False
norm_a = unicodedata.normalize('NFC', a)
norm_b = unicodedata.normalize('NFC', b)
print(f'len(norm_a) = {len(norm_a)}')
# len(norm_a) = 23
print(f'len(norm_b) = {len(norm_b)}')
# len(norm_b) = 23
print(f'norm_a == norm_b: {norm_a == norm_b}')
# norm_a == norm_b: True
This is one of the reasons I dislike Unicode. Instead of saying "this is the one true way", the standard defines several ways of representing the same character. In this case, one string is using the "composed" form, while the other is using the "decomposed" form (separate letter and diaeresis).
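As a small illustration of the two forms, the sketch below converts a single “ä” back and forth with unicodedata.normalize. The variable names are chosen here for the example only.

```python
import unicodedata

composed = "\u00e4"      # ä as a single code point
decomposed = "a\u0308"   # a + COMBINING DIAERESIS

# NFD splits the precomposed character, NFC merges the pair again
as_nfd = unicodedata.normalize("NFD", composed)
as_nfc = unicodedata.normalize("NFC", decomposed)

print(as_nfd == decomposed, as_nfc == composed)
# prints True True
print(len(as_nfd), len(as_nfc))
# prints 2 1
```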
You may want to consider normalizing the data:
import unicodedata
s1 = 'kommunikationsfähigkeit'
s2 = 'kommunikationsfähigkeit'
ns1 = unicodedata.normalize('NFC', s1)
ns2 = unicodedata.normalize('NFC', s2)
print(s1 == s2, ns1 == ns2)
# prints False True
The above snippet normalizes the strings to composed form, which is what many systems use. The decomposed form tends to appear in data coming from macOS systems, as that is the default there. The strings originally don't compare equal, but they do after normalization.
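Since the data in the question comes from a CSV file, one option is to normalize every field while reading it. The following is only a sketch under assumptions: `normalize_rows` is a hypothetical helper, the column layout is invented, and an in-memory io.StringIO stands in for the real file, which you would open with encoding='utf-8' and newline=''.

```python
import csv
import io
import unicodedata

def normalize_rows(rows, form="NFC"):
    """Yield each row with every field normalized to the given Unicode form."""
    for row in rows:
        yield [unicodedata.normalize(form, field) for field in row]

# Stand-in for the real file; the second row contains the decomposed form
sample = io.StringIO("id,skill\n1,kommunikationsfa\u0308higkeit\n")

rows = list(normalize_rows(csv.reader(sample)))
skill = rows[1][1]

print(len(skill), skill == "kommunikationsf\u00e4higkeit")
# prints 23 True
```

If the file is loaded some other way (for example into a DataFrame), the same unicodedata.normalize call can be applied to each string after loading.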