I do the following:
re.sub(r'[^ \nA-Za-z0-9/]+', '', document)
to remove every character which is not alphanumeric, space, newline, or forward slash.
So basically, I want to remove all special characters except the newline and the forward slash.
However, I do not want to remove the accented letters used in languages such as French and German.
But if I run the code above then for example the word
Motörhead
becomes
Motrhead
and I do not want to do this.
So how do I run the code above but without removing the accented letters?
UPDATE:
@MattM below has suggested a solution which works for languages such as English, French, and German, but it does not work for languages such as Polish, where the accented letters were still removed.
I'm pretty sure this would do what you need:
x = re.sub(r'[^ \nA-Za-z0-9À-ÖØ-öø-ÿ/]+', '', 'Motörhead')
Also check here for a discussion about JavaScript regex, which has some relevant info despite the language differences.
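As a rough sketch of why this helps with French and German but not Polish: the added ranges cover only the Latin-1 Supplement letters (U+00C0 to U+00FF, skipping × and ÷), while Polish letters such as Ł (U+0141) sit in Latin Extended-A (U+0100 to U+017F). The helper names below are hypothetical, and adding the Latin Extended-A range is an assumption about what you want to keep, not part of the suggestion above.

```python
import re

# The suggestion above: ASCII alphanumerics plus Latin-1 Supplement letters.
LATIN1 = re.compile(r'[^ \nA-Za-z0-9À-ÖØ-öø-ÿ/]+')

# Assumed extension: also keep Latin Extended-A (U+0100 to U+017F),
# which holds letters such as Ł, ą, ę, ś, ź.
LATIN1_EXT_A = re.compile(r'[^ \nA-Za-z0-9À-ÖØ-öø-ÿ\u0100-\u017F/]+')

def strip_latin1(text):
    """Hypothetical helper: apply the character class suggested above."""
    return LATIN1.sub('', text)

def strip_latin1_ext_a(text):
    """Hypothetical helper: same, but Latin Extended-A letters survive."""
    return LATIN1_EXT_A.sub('', text)

print(strip_latin1('Motörhead'))      # Motörhead
print(strip_latin1('Łódź'))           # ód  (Ł and ź fall outside the ranges)
print(strip_latin1_ext_a('Łódź'))     # Łódź
print(hex(ord('ö')), hex(ord('Ł')))   # 0xf6 0x141
```

Checking ord() of a character that gets stripped is a quick way to see which Unicode block you would need to add.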
EDIT -
To expand on Outcast's new concern: yes, you could include non-Latin characters, but it may get cumbersome. If you look at a list of Unicode characters, you will see that I was including ranges of accented Latin characters. So if you wanted to include the Cyrillic block (U+0400 to U+04FF) as well, you would add Ѐ-ӿ to the regex:
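import re

yourString = 'Cyrillic Char Ѥ'
# Keep the Latin-1 accented letters and the Cyrillic block in addition to the original set
yourString = re.sub(r'[^ \nA-Za-z0-9À-ÖØ-öø-ÿЀ-ӿ/]+', '', yourString)
# Write the result as UTF-8 so the non-ASCII characters survive; "with" closes the file
with open("Output.txt", "w", encoding="utf-8") as text_file:
    text_file.write(yourString)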
However, with this method you may have to include many ranges, depending on which characters from which languages you want or don't want.
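If listing ranges gets out of hand, one alternative is to lean on \w instead. This is only a sketch and it assumes Python 3, where \w in a str pattern matches Unicode letters and digits from any script (plus the underscore). The helper name strip_special is hypothetical, and the NFC normalization step is an extra assumption to handle text that stores accents as separate combining marks.

```python
import re
import unicodedata

# Assumption: Python 3, where \w in a str pattern matches Unicode letters and
# digits (and "_"). The underscore is listed separately so it is still removed.
NOT_ALLOWED = re.compile(r'(?:[^\w \n/]|_)+')

def strip_special(text):
    """Hypothetical helper: keep only letters, digits, space, newline, and "/"."""
    # NFC folds "o" + combining diaeresis into the single letter "ö",
    # since \w does not match a standalone combining mark.
    text = unicodedata.normalize('NFC', text)
    return NOT_ALLOWED.sub('', text)

print(strip_special('Motörhead!'))            # Motörhead
print(strip_special('Zażółć gęślą, a/b_c'))   # Zażółć gęślą a/bc
print(strip_special('Cyrillic Char Ѥ?'))      # Cyrillic Char Ѥ
```

The trade-off is that this keeps letters and digits from every script, so it is broader than the range approach if you only want specific languages.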