
Python pattern matching with language-specific characters

Tags: python, regex

From a list of strings, I want to extract all words and add them to a new list. I managed to do this using pattern matching in the form of:

import re
p = re.compile('[a-z]+', re.IGNORECASE)
p.findall("02_Sektion_München_Gruppe_Süd")

Unfortunately, the strings contain language-specific characters, so the example above yields:

['Sektion', 'M', 'nchen', 'Gruppe', 'S', 'd']

I want it to yield:

['Sektion', 'München', 'Gruppe', 'Süd']

I would be grateful for suggestions on how to solve this problem.

Jones1220 asked Nov 01 '25 03:11


1 Answer

You may use

import re
p = re.compile(r'[^\W\d_]+')
print(p.findall("02_Sektion_München_Gruppe_Süd"))
# => ['Sektion', 'München', 'Gruppe', 'Süd']

See the Python 3 demo.

The [^\W\d_]+ pattern matches one or more characters that are not non-word characters, digits, or underscores. Since \w covers letters, digits and _, excluding \W, \d and _ leaves effectively only letters.
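To make the difference concrete, here is a minimal Python 3 sketch that runs both patterns over the string from the question and collects the matches into a new list. The helper name extract_words and the two constant names are made up for illustration and are not part of the re module.

```python
import re

# The pattern from the question: ASCII letters only.
ASCII_LETTERS = re.compile(r'[a-z]+', re.IGNORECASE)
# The pattern from the answer: word characters minus digits and underscore.
UNICODE_LETTERS = re.compile(r'[^\W\d_]+')


def extract_words(strings, pattern=UNICODE_LETTERS):
    """Collect every match of `pattern` from each string into one flat list."""
    words = []
    for s in strings:
        words.extend(pattern.findall(s))
    return words


sample = ["02_Sektion_München_Gruppe_Süd"]
print(extract_words(sample, ASCII_LETTERS))
# ['Sektion', 'M', 'nchen', 'Gruppe', 'S', 'd']
print(extract_words(sample))
# ['Sektion', 'München', 'Gruppe', 'Süd']
```

The leading 02 and the underscores are dropped in both cases; only the handling of ü differs.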

In Python 2.x, you will have to add the re.UNICODE flag to make it match Unicode letters:

p = re.compile(r'[^\W\d_]+', re.U)
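
For contrast, in Python 3 str patterns are Unicode-aware by default, which is why no flag is needed there. The re.ASCII flag switches \w, \W and \d back to ASCII-only matching. A small sketch, assuming Python 3, showing that the same pattern with re.ASCII falls back to the result from the question:

```python
import re

text = "02_Sektion_München_Gruppe_Süd"

# Default in Python 3: \W and \d follow Unicode rules, so ü counts as a letter.
unicode_default = re.findall(r'[^\W\d_]+', text)
# With re.ASCII, ü is treated as a non-word character and splits the match.
ascii_only = re.findall(r'[^\W\d_]+', text, flags=re.ASCII)

print(unicode_default)  # ['Sektion', 'München', 'Gruppe', 'Süd']
print(ascii_only)       # ['Sektion', 'M', 'nchen', 'Gruppe', 'S', 'd']
```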
Wiktor Stribiżew answered Nov 02 '25 17:11


