delete weird hidden characters from string in Python

Question

I have a CSV file with a column containing lists of strings. Those strings seem to contain hidden characters, which I only get to see when I remove certain characters from each string.

#string copied from column
print(len('kommunikationsfähigkeit'))
#same string entered by me 
print(len('kommunikationsfähigkeit'))

24
23

When I remove parts of the string copied from the column, I get this:

''̈igkeit'

Anyone know what's going on there? I tried reading the csv with encoding='utf8', but it didn't change anything. I obviously want to get rid of those characters.

Konrad Rudolph · Accepted Answer

Both are UTF-8, but there are different ways of representing the same visual character. The string you typed yourself (length 23) contains U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS). The string copied from the column (length 24) contains “a” followed by U+0308 (COMBINING DIAERESIS, ̈ ), and the two together are rendered as “ä”.

You can inspect the strings yourself using unicodedata:

import unicodedata

string = 'fa\u0308hig'  # example input: 'a' followed by U+0308
for c in string:
    print(unicodedata.name(c))
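To spot such sequences without reading through every character name, one option is to list the code points and flag the combining ones. The following is only a minimal sketch: the helper name `describe` and the sample string are made up for illustration, and `unicodedata.is_normalized` needs Python 3.8 or newer.

```python
import unicodedata


def describe(text):
    """Return (code point, name, is_combining) for each character in text."""
    return [
        (
            f'U+{ord(c):04X}',
            unicodedata.name(c, '<unnamed>'),  # default avoids ValueError
            unicodedata.combining(c) != 0,     # non-zero class = combining mark
        )
        for c in text
    ]


# Hypothetical sample: 'a' followed by U+0308 COMBINING DIAERESIS.
sample = 'a\u0308'
for code_point, name, is_combining in describe(sample):
    print(code_point, name, '(combining)' if is_combining else '')

# is_normalized reports whether a string is already in the given normal form.
print(unicodedata.is_normalized('NFC', sample))
```

A string that contains combining marks and is not in NFC is a likely candidate for the length mismatch described in the question.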

Both representations are valid ways of writing “ä”, and they count as equivalent under a suitable Unicode normalisation. You can use unicodedata.normalize to convert different representations into a single one. For instance, you could transform both strings into normal form C (NFC), though the first one, a, already happens to be in NFC:

a = 'kommunikationsfähigkeit'
b = 'kommunikationsfähigkeit'
print(f'len(a) = {len(a)}')
# len(a) = 23
print(f'len(b) = {len(b)}')
# len(b) = 24
print(f'a == b: {a == b}')
# a == b: False

norm_a = unicodedata.normalize('NFC', a)
norm_b = unicodedata.normalize('NFC', b)
print(f'len(norm_a) = {len(norm_a)}')
# len(norm_a) = 23
print(f'len(norm_b) = {len(norm_b)}')
# len(norm_b) = 23
print(f'norm_a == norm_b: {norm_a == norm_b}')
# norm_a == norm_b: True

John Szakmeister · Answer

This is one of the reasons I dislike Unicode. Instead of saying "this is the one true way", the standard defines several ways of representing the same character. In this case, one string uses the "composed" form, while the other uses the "decomposed" form (a separate letter and diaeresis).
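The two forms can be written out explicitly with escape sequences, which makes the difference visible even when both strings look identical on screen. A small sketch (the variable names are only for illustration):

```python
import unicodedata

composed = '\u00e4'      # single code point: LATIN SMALL LETTER A WITH DIAERESIS
decomposed = 'a\u0308'   # 'a' followed by COMBINING DIAERESIS

print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False

# NFD splits the composed form; NFC recombines the decomposed form.
print(unicodedata.normalize('NFD', composed) == decomposed)   # True
print(unicodedata.normalize('NFC', decomposed) == composed)   # True
```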

You may want to consider normalizing the data:

import unicodedata
s1 = 'kommunikationsfähigkeit'
s2 = 'kommunikationsfähigkeit'
ns1 = unicodedata.normalize('NFC', s1)
ns2 = unicodedata.normalize('NFC', s2)
print(s1 == s2, ns1 == ns2)
# prints False True

The above snippet normalizes both strings to the composed form (NFC), which is what many systems use. The decomposed form tends to appear on macOS systems, where it is the default. The strings don't compare equal originally, but they do after normalization.
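For the CSV case in the question, the same normalization could be applied to every cell while reading. This is a rough sketch using only the standard library; the inline data, the `skills` column name, and the `read_normalized` helper are assumptions made up for the example, and it assumes no row is missing a column.

```python
import csv
import io
import unicodedata

# Hypothetical stand-in for the CSV file; the 'skills' column name is made up.
# Row 1 uses the composed "ä" (U+00E4), row 2 uses 'a' + U+0308 (decomposed).
raw = (
    'id,skills\n'
    '1,kommunikationsf\u00e4higkeit\n'
    '2,kommunikationsfa\u0308higkeit\n'
)


def read_normalized(handle, form='NFC'):
    """Read CSV rows as dicts, normalizing every key and value to one form."""
    return [
        {
            unicodedata.normalize(form, key): unicodedata.normalize(form, value)
            for key, value in row.items()
        }
        for row in csv.DictReader(handle)
    ]


rows = read_normalized(io.StringIO(raw))
print([len(row['skills']) for row in rows])        # [23, 23]
print(rows[0]['skills'] == rows[1]['skills'])      # True
```

With a real file, the handle would come from something like open(path, encoding='utf-8', newline='') instead of io.StringIO. If the data is already in a pandas DataFrame, Series.str.normalize('NFC') should do the same thing for a string column.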

Tags: python, string, pandas, csv, encoding

Michael
