
How to strip vowels/nekudot/diacritics from hebrew utf-8 in ruby?

Tags: ruby, utf-8, hebrew

Hebrew prints vowels as nekudot/points around the letters.
with vowels: " יִתְגַּבֵּר כַּאֲרִי לַעֲמֹד בַּבֹּקֶר לַעֲבוֹדַת בּוֹרְאוֹ שֶׁיְּהֵא הוּא מְעוֹרֵר הַשַּׁחַר"
without vowels: "יתגבר כארי לעמוד בבקר לעבודת בוראו שיהא הוא מעורר השחר"

I need a way to strip these vowels from a string, that is, to convert the string with vowels into the string without them. Any suggestions?

P.S.

I have tried "hebrew.gsub(/[^א-ת]/, '')", but this has two problems: (a) it removes all punctuation, English, etc., when I only want to remove the vowels; (b) some letters get removed as well. (My understanding is limited, but it seems some letter/vowel combinations become "multibyte" in UTF-8 and will not match "א-ת".)

I found this gist online: https://gist.github.com/yakovsh/345a71d841871cc3d375, but the Ruby suggestion only works with Rails (assuming it works at all). Perhaps that page can still be helpful in finding a solution.

Please help, thanks in advance.

Jack G asked Oct 18 '25 21:10



1 Answer

The vowel points, along with the cantillation marks, all lie between U+0591 and U+05C7, so you can just do

hebrew.gsub(/[\u0591-\u05c7]/,"")

e.g.

" יִתְגַּבֵּר כַּאֲרִי לַעֲמֹד בַּבֹּקֶר לַעֲבוֹדַת בּוֹרְאוֹ שֶׁיְּהֵא הוּא מְעוֹרֵר הַשַּׁחַר".gsub(/[\u0591-\u05c7]/,"")
# => " יתגבר כארי לעמד בבקר לעבודת בוראו שיהא הוא מעורר השחר"
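One caveat about that range: besides the combining marks, it also covers a few Hebrew punctuation characters (U+05BE MAQAF, U+05C0 PASEQ, U+05C3 SOF PASUQ and U+05C6 NUN HAFUKHA), so the one-liner above removes those as well. If you want to keep them, a narrower character class is one option. The following is a minimal sketch, not the only way to do it; the constant `HEBREW_MARKS` and the helper `strip_hebrew_marks` are hypothetical names chosen for illustration.

```ruby
# Hypothetical constant: only the combining marks in the Hebrew block
# (cantillation accents and points). It deliberately skips the punctuation
# code points U+05BE, U+05C0, U+05C3 and U+05C6.
HEBREW_MARKS = /[\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7]/

# Hypothetical helper, not part of Ruby or any library.
def strip_hebrew_marks(text)
  text.gsub(HEBREW_MARKS, "")
end

# AYIN + PATAH, LAMED, MAQAF, KAF + DAGESH + TSERE, FINAL NUN, SOF PASUQ
sample = "\u05E2\u05B7\u05DC\u05BE\u05DB\u05BC\u05B5\u05DF\u05C3"

# true: the points are gone, the maqaf and sof pasuq are kept
puts strip_hebrew_marks(sample) == "\u05E2\u05DC\u05BE\u05DB\u05DF\u05C3"

# true: the full U+0591..U+05C7 range drops the punctuation too
puts sample.gsub(/[\u0591-\u05C7]/, "") == "\u05E2\u05DC\u05DB\u05DF"
```

Whether the punctuation should survive depends on your text; for unpunctuated input the two patterns behave the same.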

However, stripping by code-point range only works if the vowels are all separate code points in the string, or, to say the same thing in Unicode-speak, if the text is in Normalization Form D. You can ensure that is the case by calling String#unicode_normalize (available since Ruby 2.2) on it first:

hebrew.unicode_normalize(:nfd).gsub(/[\u0591-\u05c7]/,"")

This step is necessary because Unicode includes several individual characters that combine a letter with nekuddot in a single code point, for round-trip compatibility with older character sets that did not support combining diacritics. Those characters mean you can't tell just by looking whether the string "בּ" consists of the two-code-point sequence U+05D1 HEBREW LETTER BET followed by U+05BC HEBREW POINT DAGESH OR MAPIQ, or just the single character U+FB31 HEBREW LETTER BET WITH DAGESH. Putting the string into Normalization Form D replaces the latter with the former, and also splits up any other "precomposed" characters into their component parts.
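To make the difference concrete, here is a small sketch that compares the two encodings of bet with dagesh and shows what happens with and without normalization. The helper name `strip_nikkud` is hypothetical, used only for this illustration; `String#unicode_normalize` and `String#codepoints` are standard Ruby.

```ruby
precomposed = "\uFB31"        # HEBREW LETTER BET WITH DAGESH (one code point)
decomposed  = "\u05D1\u05BC"  # BET followed by DAGESH OR MAPIQ (two code points)

# NFD turns the precomposed form into the two-code-point sequence.
puts precomposed.unicode_normalize(:nfd) == decomposed   # true

# Hypothetical helper: normalize first, then strip the U+0591..U+05C7 range.
def strip_nikkud(text)
  text.unicode_normalize(:nfd).gsub(/[\u0591-\u05C7]/, "")
end

# Without normalization the precomposed character falls outside the range,
# so gsub leaves it (and its dagesh) untouched.
puts precomposed.gsub(/[\u0591-\u05C7]/, "") == precomposed   # true

# With normalization both spellings reduce to a plain BET.
puts strip_nikkud(precomposed) == "\u05D1"   # true
puts strip_nikkud(decomposed)  == "\u05D1"   # true
```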

Mark Reed answered Oct 22 '25 07:10
