
Ruby Regular Expression to match words, including accents and other UTF8 characters

Tags: regex, ruby

We're trying to find a regular expression that lets us split sentences into words. The immediate answer is to use \w, except that it doesn't split on _, which we need. We then tried [a-zA-Z0-9] (we'd like to allow numbers inside words), but it splits on accented characters, which are fairly common in many languages.
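To make the two problems concrete, here is a minimal sketch of what each pattern does. The sample strings and variable names are illustrative assumptions, not part of the original question.

```ruby
# [a-zA-Z0-9] is ASCII-only, so the accented "é" ends the match early.
ascii_words = "déguste".scan(/[a-zA-Z0-9]+/)
p ascii_words        # => ["d", "guste"]

# \w includes the underscore, so "_" is never treated as a separator.
underscore_words = "foo_bar baz".scan(/\w+/)
p underscore_words   # => ["foo_bar", "baz"]
```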

So, ideally, what regexp should I use to split the following sentence:

"Je ne déguste pas d'asperges, car je n'aime pas ça"

into:

["Je","ne","déguste","pas","d", "asperges", "car","je", "n","aime","pas", "ça"]

Julien Genestoux asked Dec 05 '25 15:12



1 Answer

STR = "Je ne déguste pas d'asperges, car je n'aime pas ça"
# Split on runs of whitespace, commas, and apostrophes
words = STR.split(/[\s,']+/)
words.each { |w| puts w }

The output is:

Je
ne
déguste
pas
d
asperges
car
je
n
aime
pas
ça
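
Note that the split above treats only whitespace, commas, and apostrophes as separators, so other punctuation (a period, an underscore) would stay attached to the words. One possible alternative, sketched here under the assumption of UTF-8 strings on Ruby 1.9 or later, is to collect the words instead of listing the separators, using the Unicode properties \p{L} (letters) and \p{N} (numbers). The helper name split_words is hypothetical and only used for illustration.

```ruby
SENTENCE = "Je ne déguste pas d'asperges, car je n'aime pas ça"

# Hypothetical helper: returns every run of Unicode letters and digits.
# The underscore is neither a letter nor a number, so it acts as a separator.
def split_words(text)
  text.scan(/[\p{L}\p{N}]+/)
end

p split_words(SENTENCE)
# => ["Je", "ne", "déguste", "pas", "d", "asperges", "car", "je", "n", "aime", "pas", "ça"]

p split_words("foo_bar42 baz.")
# => ["foo", "bar42", "baz"]
```

Matching the words rather than the separators means new punctuation does not need to be added to the pattern, at the cost of also splitting on characters such as hyphens.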
Brent Newey answered Dec 08 '25 09:12



