I am trying to normalise accented characters in a string in Python 3 like this:
from bs4 import BeautifulSoup
import os

def process_markup():
    # the file is utf-8 encoded
    fn = os.path.join(os.path.dirname(__file__), 'src.txt')
    markup = BeautifulSoup(open(fn), from_encoding="utf-8")
    for player in markup.find_all("div", class_="glossary-player"):
        text = player.span.string
        print(format_filename(text))  # Python console shows mangled characters not in utf-8
        player.span.string.replace_with(format_filename(text))
    dest = open("dest.txt", "w", encoding="utf-8")
    dest.write(str(markup))

def format_filename(s):
    # prepare string
    s = s.strip().lower().replace(" ", "-").strip("'")
    # transliterate accented characters to non-accented versions
    chars_in = "àèìòùáéíóú"
    chars_out = "aeiouaeiou"
    no_accented_chars = str.maketrans(chars_in, chars_out)
    return s.translate(no_accented_chars)

process_markup()
The input src.txt file is utf-8 encoded:
<div class="glossary-player">
<span class="gd"> Fàilte </span><span class="en"> Welcome </span>
</div>
<div class="glossary-player">
<span class="gd"> àèìòùáéíóú </span><span class="en"> aeiouaeiou </span>
</div>
The output file dest.txt looks like this:
<div class="glossary-player">
<span class="gd">fã ilte</span><span class="en"> Welcome </span>
</div>
<div class="glossary-player">
<span class="gd">ã ã¨ã¬ã²ã¹ã¡ã©ãÂã³ãº</span><span class="en"> aeiouaeiou </span>
</div>
and I am trying to get it to look like this:
<div class="glossary-player">
<span class="gd">failte</span><span class="en"> Welcome </span>
</div>
<div class="glossary-player">
<span class="gd">aeiouaeiou</span><span class="en"> aeiouaeiou </span>
</div>
I know there are solutions like unidecode, but I just wanted to find out what I'm doing wrong here.
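One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: the mangled output above has the shape you get when UTF-8 bytes are decoded with a single-byte codec such as cp1252. open(fn) without an encoding argument uses the platform's preferred encoding, and nothing in the question says what that is on this machine. The sketch below only reproduces the effect on the word Fàilte; the variable names are made up for illustration.

```python
# Hypothetical reproduction: the same UTF-8 bytes decoded two different ways.
raw = "F\u00e0ilte".encode("utf-8")   # "Fàilte" as it would be stored in a UTF-8 file

wrong = raw.decode("cp1252")          # assumed platform default (not confirmed by the question)
right = raw.decode("utf-8")           # decoding that matches the file's real encoding

# lower() turns the stray "Ã" into "ã", giving the same shape as the mangled output
print(repr(wrong.lower()))
print(repr(right.lower()))
```

If that is what is happening, opening the file with open(fn, encoding="utf-8") would be the first thing to try. As far as I know, Beautiful Soup's from_encoding only applies to byte input, so it would not help with a file that has already been decoded to text.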
chars.translate(no_accented_chars) doesn't modify chars. Python strings are immutable, so translate returns a new string with the translation applied and leaves the original untouched. If you want to use the translated string, save it to a variable (perhaps the original chars variable):
chars = chars.translate(no_accented_chars)
or pass it directly to the write call:
dest.write(chars.translate(no_accented_chars))
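To make the point concrete, here is a minimal sketch of the difference between discarding and keeping the result of translate. The table is the one from the question; the sample value of chars and the name unchanged are only there for illustration.

```python
# Translation table from the question: accented vowels to plain vowels.
chars_in = "àèìòùáéíóú"
chars_out = "aeiouaeiou"
no_accented_chars = str.maketrans(chars_in, chars_out)

chars = "fàilte"                              # illustrative sample value

chars.translate(no_accented_chars)            # result is thrown away
unchanged = chars                             # still holds the accented text

chars = chars.translate(no_accented_chars)    # rebind the name to the new string
print(unchanged, chars)
```

The bare call on its own line has no lasting effect; only the assignment (or passing the result straight to another call such as write) actually uses the translated text.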