When I download files from Korean websites, the filenames are often wrongly encoded/decoded and end up jumbled. I found that encoding with 'iso-8859-1' and then decoding with 'euc-kr' fixes this. However, I now have a new problem: two characters that look the same are in fact different. Check out the Python shell below:
>>> first_string = 'â'
>>> second_string = 'â'
>>> len(first_string)
1
>>> len(second_string)
2
>>> list(first_string)
['â']
>>> list(second_string)
['a', '̂']
>>>
Encoding the first string with 'iso-8859-1' is possible; encoding the second is not. So the question: how can I convert second_string to the likeness of first_string? Thank you.
An easy way to find out exactly what a character is is to ask vim. Put the cursor over a character and type ga to get info on it.
The first one is:
<â> 226, Hex 00e2, Octal 342
And the second:
<a>  97,  Hex 61,  Octal 141 < ̂> 770, Hex 0302, Octal 1402
In other words, the first is a complete "a with circumflex" character, and the second is a regular a followed by a circumflex combining character.
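If vim is not at hand, the same inspection can be sketched in Python with the standard unicodedata module. The helper name describe is made up for this illustration; it simply lists the code point and the official Unicode name of every character in a string.

```python
import unicodedata

def describe(text):
    """Hypothetical helper: return (code point, Unicode name) for each character."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in text]

composed = "\u00e2"      # one precomposed character
decomposed = "a\u0302"   # 'a' followed by a combining circumflex

print(describe(composed))
print(describe(decomposed))
```

The first call reports a single LATIN SMALL LETTER A WITH CIRCUMFLEX, while the second reports LATIN SMALL LETTER A followed by COMBINING CIRCUMFLEX ACCENT, matching what vim shows.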
Ask the website operators. How would we know?!
You need something which turns combining characters into regular characters. A Google search yielded this question, for example.
As you pointed out in your comment, and as clemens pointed out in another answer, in Python you can use unicodedata.normalize with 'NFC' as the form.
Unicode has more than one representation for letters with accents or diaereses. The first is a single precomposed character, here the code point U+00E2. The second is a sequence of two characters: the plain 'a' followed by COMBINING CIRCUMFLEX ACCENT (U+0302), which you can create with u'a\u0302' in Python 2.7.
A possible reason for the different representations is that the creator of the website copied the texts from different sources. For example, PDF documents often represent umlauts and accented letters as two characters (a base letter plus a combining character), while typing these characters on a keyboard generally produces the single-character representation.
You can use unicodedata.normalize to convert the combining sequences into single characters, e.g.:
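from __future__ import print_function  # print() then behaves the same on Python 2.7 and 3
from unicodedata import normalize

s = u'a\u0302'  # 'a' followed by COMBINING CIRCUMFLEX ACCENT
print(s, len(s), len(normalize("NFC", s)))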
will output â 2 1.
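Putting this together with the encode/decode trick from the question, a minimal Python 3 sketch might look like the following. The function name repair_filename and the sample name "한글" are assumptions made for illustration, and the approach only helps when every combining sequence has a precomposed equivalent that exists in ISO-8859-1.

```python
import unicodedata

def repair_filename(name):
    """Hypothetical helper: recompose combining sequences with NFC, then
    undo the ISO-8859-1 / EUC-KR mix-up described in the question."""
    composed = unicodedata.normalize("NFC", name)
    return composed.encode("iso-8859-1").decode("euc-kr")

# Build a sample garbled name: EUC-KR bytes wrongly decoded as ISO-8859-1.
garbled = "한글".encode("euc-kr").decode("iso-8859-1")

# Simulate the troublesome variant where accents are separate combining characters.
decomposed = unicodedata.normalize("NFD", garbled)

try:
    decomposed.encode("iso-8859-1")
except UnicodeEncodeError:
    print("the decomposed form cannot be encoded directly")

print(repair_filename(decomposed))
```

Normalizing first is harmless for names that are already composed, so the same helper can be applied to every downloaded filename.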