I have a file and I want to remove all non-word characters from it, with the exception of ä, ö and ü, which are mutated vowels (umlauts) in the German language. Is there a way to do word.gsub!(/\W/, '') and put exceptions in it?
Example:
text = "übung bzw. äffchen"
text.gsub!(/\W/, '')
Now text is "bungbzwffchen". It deletes the non-word characters, but it also removes the mutated vowels ü and ä, which I want to keep.
You may be able to define a list of exclusions with a negative lookbehind, but I think the simplest approach is to use \w instead of \W and negate the whole character class:
word.gsub!(/[^\wÄäÖöÜü]/, '')
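As a minimal sketch, here is that negated class applied to the example string from the question. The helper name strip_non_word_keep_umlauts is made up for illustration, and the script is assumed to be a UTF-8 source file (the default in Ruby 2.0 and later).

```ruby
# Hypothetical helper name; the character class is the one from the answer.
def strip_non_word_keep_umlauts(text)
  # [^...] negates the class: every character that is NOT an ASCII word
  # character (\w) and NOT one of the listed umlauts is removed.
  text.gsub(/[^\wÄäÖöÜü]/, '')
end

puts strip_non_word_keep_umlauts("übung bzw. äffchen")  # prints übungbzwäffchen
```

The sketch uses gsub rather than gsub! because gsub! returns nil when nothing was replaced, which is easy to trip over if you use its return value.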
You could also use word.gsub(/[^\p{Letter}]/, ''), which removes every character that is not classified as a "Letter" in Unicode. Note that, unlike \w, this also removes digits and underscores.
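To make the difference between the two character classes concrete, here is a small comparison sketch. Both helper names are hypothetical, and \p{L} is used as the short form of \p{Letter}.

```ruby
# Hypothetical helper: keep only characters Unicode classifies as letters.
def keep_letters_only(text)
  text.gsub(/[^\p{L}]/, '')
end

# Hypothetical helper: keep ASCII word characters plus the German umlauts.
def keep_word_chars_and_umlauts(text)
  text.gsub(/[^\wÄäÖöÜü]/, '')
end

sample = "übung_1 bzw. äffchen"
puts keep_letters_only(sample)            # prints übungbzwäffchen
puts keep_word_chars_and_umlauts(sample)  # prints übung_1bzwäffchen
```

The letter-based version keeps ß and accented letters from other languages without listing them, while the \w-based version keeps digits and underscores but drops any letter that is not listed explicitly.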
You mention German vowels in your question; it's worth noting that the German alphabet also includes the Eszett (sharp s), ẞ / ß, which \W would strip as well.
Update:
To answer your original question: to define a list of exclusions, you can use a negative lookbehind, (?<!pat):
word.gsub(/\W(?<![ÄäÖöÜüẞß])/, '')
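A short sketch of the lookbehind form, using the umlauts from the question (ä, ö, ü) plus ß as the exclusion list. The helper name is hypothetical. The pattern first matches a non-word character with \W, then the lookbehind inspects that same character and rejects the match if it is on the list.

```ruby
# Hypothetical helper name; the exclusion list is ä, ö, ü and ß in both cases.
def strip_non_word_except(text)
  # \W consumes one non-word character; (?<![...]) then looks back at the
  # character just consumed and cancels the match if it is in the list.
  text.gsub(/\W(?<![ÄäÖöÜüẞß])/, '')
end

puts strip_non_word_except("übung bzw. äffchen, Straße")  # prints übungbzwäffchenStraße
```

For this input the result is the same as with a negated class such as [^\wÄäÖöÜüẞß]; the lookbehind is mainly useful when you want to keep \W as written and bolt the exceptions on.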