Remove special characters but not accented letters

Question

I do the following:

re.sub(r'[^ \nA-Za-z0-9/]+', '', document)

to remove every character which is not alphanumeric, space, newline, or forward slash.

So basically, I want to remove all special characters except for the newline and the forward slash.

However, I do not want to remove the accented letters used in various languages such as French and German.

But if I run the code above then for example the word

Motörhead

becomes

Motrhead

and I do not want to do this.

So how do I run the code above but without removing the accented letters?

UPDATE:

@MattM below has suggested a solution that works for languages such as English, French and German, but it does not work for languages such as Polish, where all the accented letters were still removed.

Matt M · Accepted Answer

I'm pretty sure this would do what you need

x = re.sub(r'[^ \nA-Za-z0-9À-ÖØ-öø-ÿ/]+', '', 'Motörhead')
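
As a rough check of why this works for French and German but not for Polish: the ranges À-Ö, Ø-ö and ø-ÿ only cover the Latin-1 Supplement letters (U+00C0 to U+00FF), while most Polish letters, such as ł (U+0142), sit in Latin Extended-A (U+0100 to U+017F). The sketch below assumes Python 3 and precomposed (NFC) input; the helper name `strip_special`, the two constants and the sample strings are made up for illustration.

```python
import re

# Character class from the answer: ASCII alphanumerics plus Latin-1 accented letters.
LATIN1 = r'[^ \nA-Za-z0-9À-ÖØ-öø-ÿ/]+'
# Same class plus Latin Extended-A (U+0100-U+017F); an assumed extension for Polish.
LATIN1_EXT_A = r'[^ \nA-Za-z0-9À-ÖØ-öø-ÿ\u0100-\u017F/]+'


def strip_special(text, pattern):
    """Remove every run of characters the pattern does not allow (hypothetical helper)."""
    return re.sub(pattern, '', text)


print(strip_special('Motörhead!', LATIN1))
print(strip_special('Zażółć gęślą jaźń', LATIN1))
print(strip_special('Zażółć gęślą jaźń', LATIN1_EXT_A))
```

The second call drops most of the Polish letters (only ó falls inside Latin-1), while the third call keeps them all.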

Also check here for a discussion about JavaScript regex, which has some relevant info despite the differences between the two languages.

EDIT -

To expand on Outcast's new concern: yes, you could include non-Latin characters as well, but it may get cumbersome. If you look at a list of Unicode characters, what I included were ranges of accented Latin characters. So if you wanted to keep all Cyrillic characters too, you would add Ѐ-ӿ (U+0400 to U+04FF) to the regex.

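import re

yourString = 'Cyrillic Char Ѥ'
# Keep spaces, newlines, ASCII alphanumerics, Latin-1 accented letters,
# Cyrillic letters (U+0400-U+04FF) and the forward slash; drop everything else.
yourString = re.sub(r'[^ \nA-Za-z0-9À-ÖØ-öø-ÿЀ-ӿ/]+', '', yourString)
# Write the result as UTF-8 bytes; the with-block closes the file automatically.
with open("Output.txt", "wb") as text_file:
    text_file.write(yourString.encode('utf8'))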

However with this method you may have to include many ranges, depending on which chars from which languages you want or don't want.
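
If listing ranges gets out of hand, one alternative (not the approach in the answer above, just a sketch) is to lean on Python 3's Unicode-aware `\w`, which matches letters and digits from any script plus the underscore. The function name `remove_special` is hypothetical, and the NFC normalization step is an assumption that the input may contain decomposed accents (a base letter followed by a combining mark), which `\w` would not match.

```python
import re
import unicodedata

# Match runs of anything that is not a word character, space, newline or '/',
# and also match underscores, since \w would otherwise keep them.
_SPECIAL = re.compile(r'(?:[^\w \n/]|_)+')


def remove_special(text):
    """Strip special characters but keep letters and digits of any script (sketch)."""
    # Compose base letter + combining mark into a single code point first.
    text = unicodedata.normalize('NFC', text)
    return _SPECIAL.sub('', text)


print(remove_special('Motörhead!'))
print(remove_special('Zażółć gęślą jaźń?'))
print(remove_special('Cyrillic Char Ѥ #1'))
```

The trade-off is that this keeps letters and digits from every script, which may be more permissive than an explicit list of ranges.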

Tags:

python

diacritics

nlp

Outcast

1 Answers

Matt M
