
java regex to filter out non-English text

Tags: java, regex

I found a few references to using regex to filter out non-English text, but none of them is in Java, and they all address somewhat different problems from the one I am trying to solve:

  1. Replace all non-English characters with a space.
  2. Create a method that returns true if a string contains any non-English character.

By "English text" I mean not only actual letters and numbers but also punctuation.

So far, what I have been able to come up with for goal #1 is quite simple:

String.replaceAll("\\W", " ")

In fact, so simple that I suspect that I am missing something... Do you spot any caveats in the above?

As for goal #2, I could simply trim() the string after the above replaceAll(), then check if it's empty. But... Is there a more efficient way to do this?

Regex Rookie asked Dec 10 '25 10:12


2 Answers

In fact, so simple that I suspect that I am missing something... Do you spot any caveats in the above?

\W is equivalent to [^\w], and in Java \w is equivalent to [a-zA-Z_0-9] by default. Using \W will therefore replace everything that isn't an ASCII letter, a digit, or an underscore, including punctuation, tabs, and newline characters. Whether or not that's a problem is really up to you.
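To see the caveat in practice, it can help to run the replacement on a small sample. The following is only a minimal sketch: the class name NonWordDemo, the helper replaceNonWord, and the sample string are made up for illustration.

```java
final class NonWordDemo {
    // Replaces every character matched by \W (anything outside [a-zA-Z_0-9]) with a space.
    static String replaceNonWord(String input) {
        return input.replaceAll("\\W", " ");
    }

    public static void main(String[] args) {
        // The comma, the accented letter, the exclamation mark, and the tab are all replaced.
        System.out.println(replaceNonWord("Hi, caf\u00e9!\tok_1"));
    }
}
```

Note that the underscore and the digit survive, while ordinary English punctuation does not.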

By "English text" I mean not only actual letters and numbers but also punctuation.

In that case, you might want to use a negated character class that also lists the punctuation you want to keep, so that those characters are not replaced; something like

[^\w.,;:'"]
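As a rough illustration of how that class behaves in Java, here is a small sketch. The class name KeepPunctuationDemo, the method clean, and the sample sentence are assumptions for this example only; the set of punctuation marks is the one shown above and would likely need extending (for instance, it does not include ? or !).

```java
import java.util.regex.Pattern;

final class KeepPunctuationDemo {
    // Matches anything that is NOT a word character or one of . , ; : ' "
    private static final Pattern NOT_ALLOWED = Pattern.compile("[^\\w.,;:'\"]");

    // Replaces every disallowed character with a single space.
    static String clean(String input) {
        return NOT_ALLOWED.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Apostrophes, the colon, and the comma are kept; accented letters and '?' are replaced.
        System.out.println(clean("It's 5 o'clock: caf\u00e9, na\u00efve?"));
    }
}
```

Spaces are also matched by this class, but replacing a space with a space leaves the text unchanged.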

Create a method that returns true if a string contains any non-English character.

Use Pattern and Matcher.

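import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Compile once and reuse; Pattern instances are immutable and thread-safe.
private static final Pattern NON_WORD = Pattern.compile("\\W");

// Returns true if the string contains at least one character outside [a-zA-Z_0-9].
static boolean containsSpecialChars(String string)
{
    Matcher m = NON_WORD.matcher(string);
    return m.find();
}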
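Because \W also matches spaces and punctuation, a check based on it reports true for ordinary English sentences. One possible alternative, which is a different technique from the one above, is to look for any character outside printable ASCII. The sketch below assumes that "English text" means the printable ASCII range (0x20 to 0x7E) plus tab, carriage return, and line feed; that definition and the names NonEnglishCheck, containsNonEnglish, and replaceNonEnglish are assumptions, not something the question specifies.

```java
import java.util.regex.Pattern;

final class NonEnglishCheck {
    // Assumption: "English" = printable ASCII (0x20-0x7E) plus tab, CR, and LF.
    private static final Pattern NON_ENGLISH = Pattern.compile("[^\\x20-\\x7E\\t\\r\\n]");

    // Goal #2: true if at least one character falls outside the assumed range.
    static boolean containsNonEnglish(String text) {
        return NON_ENGLISH.matcher(text).find();
    }

    // Goal #1: replace each character outside the assumed range with a space.
    static String replaceNonEnglish(String text) {
        return NON_ENGLISH.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(containsNonEnglish("Hello, world!"));
        System.out.println(containsNonEnglish("caf\u00e9"));
    }
}
```

Using find() stops at the first offending character, so it avoids building a new string just to test for a match.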
Matt Ball answered Dec 11 '25 23:12


Here is my solution. I assume the text may contain English words, punctuation marks, and standard ASCII symbols such as #, %, @, etc.

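// Inside a character class the spaces are literal, so a plain space is allowed too.
private static final String IS_ENGLISH_REGEX = "^[ \\w \\d \\s \\. \\& \\+ \\- \\, \\! \\@ \\# \\$ \\% \\^ \\* \\( \\) \\; \\\\ \\/ \\| \\< \\> \\\" \\' \\? \\= \\: \\[ \\] ]*$";

// Returns true only if every character of text is in the allowed set; null is rejected.
private static boolean isEnglish(String text) {
    if (text == null) {
        return false;
    }
    // String.matches() tests the whole string, so ^ and $ are redundant but harmless.
    return text.matches(IS_ENGLISH_REGEX);
}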
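A few sample calls can make the accepted character set more concrete. The sketch below copies the pattern from this answer into a hypothetical class named EnglishCheckDemo; the sample strings are made up. One thing it illustrates is that characters not listed in the class, such as curly braces, cause the whole string to be rejected.

```java
final class EnglishCheckDemo {
    // Pattern copied from the answer above; spaces inside the class are literal.
    private static final String IS_ENGLISH_REGEX = "^[ \\w \\d \\s \\. \\& \\+ \\- \\, \\! \\@ \\# \\$ \\% \\^ \\* \\( \\) \\; \\\\ \\/ \\| \\< \\> \\\" \\' \\? \\= \\: \\[ \\] ]*$";

    static boolean isEnglish(String text) {
        if (text == null) {
            return false;
        }
        return text.matches(IS_ENGLISH_REGEX);
    }

    public static void main(String[] args) {
        System.out.println(isEnglish("Hello, world! #1")); // letters, digits, listed symbols
        System.out.println(isEnglish("caf\u00e9"));         // contains an accented letter
        System.out.println(isEnglish("{curly}"));           // braces are not in the class
    }
}
```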
Eli Mashiah answered Dec 11 '25 23:12


