
Separating between Hebrew and English strings

So I have this huge list of strings in Hebrew and English, and I want to extract only the Hebrew ones, but I couldn't find a regex example that works with Hebrew.

I have tried the stupid method of comparing every character:

import string
data = []
for s in slist:
    found = False
    for c in string.ascii_letters:
        if c in s:
            found = True
    if not found:
        data.append(s)

And it works, but it is of course very slow and my list is HUGE. Instead of this, I tried comparing only the first letter of the string to string.ascii_letters which was much faster, but it only filters out those that start with an English letter, and leaves the "mixed" strings in there. I only want those that are "pure" Hebrew.

I'm sure this can be done much better... Help, anyone?

P.S.: I'd prefer to do it within a Python program, but a grep command that does the same would also help.

Ofer Sadan asked Dec 10 '25 12:12


2 Answers

To check whether a string contains any ASCII letters (i.e. is not pure Hebrew), use:

re.search('[' + string.ascii_letters + ']', s)

If this returns a match object rather than None, the string contains at least one ASCII letter and is not pure Hebrew.
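As a rough sketch of how that check could be applied to the whole list: the pattern is compiled once and reused, which should help on a very large list. The helper name `without_ascii_letters` and the sample `slist` are made up for illustration and are not from the question.

```python
import re
import string

# Compile once so the pattern is not rebuilt for every string.
ASCII_LETTER = re.compile('[' + string.ascii_letters + ']')

def without_ascii_letters(strings):
    """Keep only the strings that contain no ASCII letter (hypothetical helper)."""
    return [s for s in strings if ASCII_LETTER.search(s) is None]

# Hypothetical sample data standing in for the asker's slist.
slist = ['שלום', 'hello', 'שלום world', 'עולם 123']
data = without_ascii_letters(slist)
print(data)
```

Note that this only rejects ASCII letters, so a string made of Hebrew plus digits or punctuation is still kept.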

Błotosmętek answered Dec 12 '25 01:12


This one should work:

import re
data = [s for s in slist if re.match('^[a-zA-Z ]+$', s)]

This will pick all the strings that consist only of lowercase and uppercase English letters and spaces. If the strings are allowed to contain digits or punctuation marks, those characters should be added to the character class in the regex.

Edit: Just noticed that this keeps the English-only strings, but you need it to do the opposite. You can try this instead:

data = [s for s in slist if not re.search('[a-zA-Z]', s)]

This will discard any string that contains at least one English letter.
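If "pure" Hebrew should mean that every character is Hebrew, a positive test may be closer to what is wanted than excluding English letters. The sketch below assumes Python 3 and uses the Unicode Hebrew block (U+0590 to U+05FF) plus whitespace. The names `PURE_HEBREW` and `is_pure_hebrew` and the sample `slist` are illustrative, not from the question.

```python
import re

# Hebrew Unicode block U+0590-U+05FF (letters, vowel points, Hebrew punctuation)
# plus whitespace. Assumption: digits and ASCII punctuation should disqualify
# a string; widen the character class if they should be allowed.
PURE_HEBREW = re.compile(r'[\u0590-\u05FF\s]+')

def is_pure_hebrew(s):
    """True if the whole string is Hebrew-block characters and whitespace."""
    return PURE_HEBREW.fullmatch(s) is not None

# Hypothetical sample data standing in for the asker's slist.
slist = ['שלום עולם', 'hello', 'שלום world', 'שלום 123']
data = [s for s in slist if is_pure_hebrew(s)]
print(data)
```

With this class, empty strings are rejected, while a whitespace-only string would still pass. Add an extra check if that matters for your data.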

Sufian Latif answered Dec 12 '25 02:12


