
delete weird hidden characters from string in Python

I have a CSV file with a column containing lists of strings. Those strings seem to contain hidden characters, which only become visible when I remove certain characters from a string.

#string copied from column
print(len('kommunikationsfähigkeit'))
#same string entered by me 
print(len('kommunikationsfähigkeit'))

24
23

When I remove part of the string copied from the column, I get this:

''̈igkeit'

Does anyone know what's going on here? I tried reading the CSV with encoding='utf8', but it didn't change anything. I want to get rid of those characters.

Michael asked Dec 05 '25 15:12


2 Answers

Both strings are valid Unicode, but they encode the same visible character in different ways. The string you typed contains U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS). The string copied from the column contains “a” followed by U+0308 (COMBINING DIAERESIS), and the two together are rendered as “ä”. That extra code point is why its length is 24 rather than 23.

You can inspect the strings yourself using unicodedata:

import unicodedata

# `string` stands for the value to inspect, e.g. one entry from the CSV column
for c in string:
    print(unicodedata.name(c))
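To see exactly where the two representations differ, it can help to print the code point next to each name. This is only a minimal sketch: the helper name `describe` and the shortened sample strings are assumptions for illustration, and the strings are written with escapes so the two forms survive copy and paste.

```python
import unicodedata

def describe(s):
    """Return (code point, name) pairs for every non-ASCII character in s."""
    return [(f'U+{ord(c):04X}', unicodedata.name(c, '<unnamed>'))
            for c in s if ord(c) > 127]

composed = 'f\u00e4hig'     # precomposed "ä"
decomposed = 'fa\u0308hig'  # "a" followed by a combining diaeresis

print(describe(composed))
# [('U+00E4', 'LATIN SMALL LETTER A WITH DIAERESIS')]
print(describe(decomposed))
# [('U+0308', 'COMBINING DIAERESIS')]
```

Filtering on non-ASCII characters keeps the output short for long strings; drop the filter to list every character.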

Both of the above are valid ways of representing “ä”, and they count as equivalent under Unicode normalisation. You can use unicodedata.normalize to convert different representations to a single form. For instance, you could transform both strings into normal form C (though a, the composed string, is already in NFC):

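# Escapes make the two forms explicit, so they survive copy and paste
a = 'kommunikationsf\u00e4higkeit'   # composed: U+00E4
b = 'kommunikationsfa\u0308higkeit'  # decomposed: 'a' + U+0308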
print(f'len(a) = {len(a)}')
# len(a) = 23
print(f'len(b) = {len(b)}')
# len(b) = 24
print(f'a == b: {a == b}')
# a == b: False

norm_a = unicodedata.normalize('NFC', a)
norm_b = unicodedata.normalize('NFC', b)
print(f'len(norm_a) = {len(norm_a)}')
# len(norm_a) = 23
print(f'len(norm_b) = {len(norm_b)}')
# len(norm_b) = 23
print(f'norm_a == norm_b: {norm_a == norm_b}')
# norm_a == norm_b: True
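Since the data comes from a CSV file, the same normalisation could be applied to every cell while reading. The following is a sketch under assumptions: the column name `skill`, the helper `read_normalized`, and the in-memory sample standing in for the real file are all hypothetical.

```python
import csv
import io
import unicodedata

# Stand-in for the real file; the second data row uses the decomposed form.
sample = io.StringIO(
    'skill\n'
    'kommunikationsf\u00e4higkeit\n'
    'kommunikationsfa\u0308higkeit\n'
)

def read_normalized(handle, form='NFC'):
    """Yield each CSV row with every cell normalized to the given form."""
    for row in csv.reader(handle):
        yield [unicodedata.normalize(form, cell) for cell in row]

rows = list(read_normalized(sample))
print([len(row[0]) for row in rows[1:]])
# [23, 23]
print(rows[1] == rows[2])
# True
```

With a real file, the handle would come from something like open(path, encoding='utf-8', newline=''). The encoding argument only controls how bytes are decoded into code points; it does not normalise them, which is why passing encoding='utf8' alone made no difference.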
Konrad Rudolph answered Dec 07 '25 05:12


This is one of the reasons I dislike Unicode. Instead of saying "this is the one true way", the standard defines several ways of representing the same character. In this case, one string uses the "composed" form, while the other uses the "decomposed" form (a separate letter and diaeresis).

You may want to consider normalizing the data:

import unicodedata
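# Escapes make the two forms explicit, so they survive copy and paste
s1 = 'kommunikationsf\u00e4higkeit'   # composed
s2 = 'kommunikationsfa\u0308higkeit'  # decomposed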
ns1 = unicodedata.normalize('NFC', s1)
ns2 = unicodedata.normalize('NFC', s2)
print(s1 == s2, ns1 == ns2)
# prints False True

The above snippet normalizes both strings to the composed form (NFC), which is what many systems use. The decomposed form often shows up in data that has passed through macOS, whose file systems have traditionally stored file names in decomposed form. You can see that the strings don't compare equal originally, but they do after normalization.
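To check which form a given string is in before deciding whether to normalize, unicodedata.is_normalized can be used. A small sketch, assuming Python 3.8 or later (where that function was added) and a shortened sample string written with an escape:

```python
import unicodedata

decomposed = 'fa\u0308hig'  # "a" followed by a combining diaeresis

print(unicodedata.is_normalized('NFC', decomposed))
# False
print(unicodedata.is_normalized('NFD', decomposed))
# True

composed = unicodedata.normalize('NFC', decomposed)
print(len(decomposed), len(composed))
# 6 5

# Normalizing back to NFD recovers the original decomposed string.
print(unicodedata.normalize('NFD', composed) == decomposed)
# True
```

Note that simply deleting the combining character would turn "ä" into "a" and change the word, so normalizing is usually the safer way to "get rid of" it.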

John Szakmeister answered Dec 07 '25 04:12