I need to create a Pattern that will match all Unicode digits and alphabetic characters. So far I have "\\p{IsAlphabetic}|[0-9]".
The first part is working well for me, it's doing a good job of identifying non-Latin characters as alphabetic characters. The problem is the second half. Obviously it will only work for Arabic Numerals. The character classes \\d and \p{Digit} are also just [0-9]. The javadoc for Pattern does not seem to mention a character class for Unicode digits. Does anyone have a good solution for this problem?
For my purposes, I would accept a way to match the set of all characters for which Character.isDigit returns true.
Quoting the Java docs about isDigit:
A character is a digit if its general category type, provided by getType(codePoint), is DECIMAL_DIGIT_NUMBER.
So, I believe the pattern to match digits should be \p{Nd}.
Here's a working example at ideone. As you can see, the results are consistent between Pattern.matches and Character.isDigit.
Use \d, but with the (?U) flag to enable the Unicode version of predefined character classes and POSIX character classes:
(?U)\d+
or in code:
System.out.println("3๓३".matches("(?U)\\d+")); // true
Using (?U) is equivalent to compiling the regex by calling Pattern.compile() with the UNICODE_CHARACTER_CLASS flag:
Pattern pattern = Pattern.compile("\\d", Pattern.UNICODE_CHARACTER_CLASS);
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With