In PCRE regular expressions, \p{N} is supposed to match "Any kind of numeric character from any script", according to the descriptions on RegexInfo and the sidebar explanations on regex101.com.
Trying this out reveals that 1¾६೬¹ all match, so both non-Latin digits and 'special' number characters are included.
But weirdly, Chinese and Japanese numerals do not match: 一六
What is happening here?
EDIT: the original question included ∩ as a non-matching example, stating that it is an Egyptian number. There is a hieroglyph that looks like that, but the ∩ sign I posted here is a mathematical intersection, so it is unsurprising that it is not a number.
\p{N} is a shorthand notation for \p{General_Category=Number}. It represents all Unicode characters with a General Category value of Nd, Nl, or No. Each Unicode character has a single General Category value and is assigned the most appropriate value for that character.
Unicode has other properties besides General Category. In this instance, Numeric Type would be a more suitable property than General Category.
The OP gives the examples of 1¾६೬¹, which match \p{N}, and 一∩六, which do not. Looking at the General Category and Numeric Type of each character:
| Character | General Category | Numeric Type | 
|---|---|---|
| 1 | Nd | De | 
| ¾ | No | Nu | 
| ६ | Nd | De | 
| ೬ | Nd | De | 
| ¹ | No | Di | 
| 一 | Lo | Nu | 
| ∩ | Sm | None | 
| 六 | Lo | Nu | 
DerivedNumericType.txt assigns a value of Decimal (De), Digit (Di), or Numeric (Nu) to any character with a numeric value. Any other character (i.e. one without a numeric type) is assigned the value None.
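The table above can be reproduced approximately with Python's standard `unicodedata` module. This is only a sketch: `unicodedata` does not expose Numeric Type directly, so the helper `numeric_type` below (a name made up for this illustration) infers it from which of `decimal()`, `digit()`, and `numeric()` return a value, and the result depends on the Unicode version bundled with your Python.

```python
import unicodedata


def numeric_type(ch):
    """Infer an approximate Numeric_Type (De, Di, Nu, or None) for one character.

    Hypothetical helper: the type is derived from which lookup succeeds,
    checked from the most specific (decimal) to the least specific (numeric).
    """
    if unicodedata.decimal(ch, None) is not None:
        return "De"
    if unicodedata.digit(ch, None) is not None:
        return "Di"
    if unicodedata.numeric(ch, None) is not None:
        return "Nu"
    return "None"


# Print code point, General Category, and inferred Numeric Type
# for each character from the question.
for ch in "1¾६೬¹一∩六":
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), numeric_type(ch))
```

For 一 and 六 this should report a General Category of Lo together with a numeric type of Nu, which is exactly why \p{N} skips them.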
The character ∩ given in OP is not a number. It is U+2229 INTERSECTION in the Mathematical Operators block, so its general category is Sm and it has no Numeric Type, i.e. None.
If your regular expression engine supports wider use of Unicode properties, instead of \p{N} it would be possible to use [\p{Numeric_Type=Decimal}\p{Numeric_Type=Digit}\p{Numeric_Type=Numeric}]. An abbreviated form would be [\p{nt=de}\p{nt=di}\p{nt=nu}]. But using Numeric Type depends on the engine you are using.
[\p{Numeric_Type=Decimal}\p{Numeric_Type=Digit}\p{Numeric_Type=Numeric}] could be rewritten as [\P{Numeric_Type=None}] or [\P{nt=None}].
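If the engine at hand has no Numeric Type support at all, the same test can be emulated outside the regular expression. The sketch below uses Python's standard `unicodedata` module and treats "has a numeric value" as a stand-in for `\P{nt=None}`; the function names `has_numeric_type` and `numeric_chars` are invented for this example, and the results depend on the Unicode data shipped with the interpreter.

```python
import unicodedata


def has_numeric_type(ch):
    """Rough stand-in for \\P{nt=None}: True if the character has a numeric value."""
    return unicodedata.numeric(ch, None) is not None


def numeric_chars(text):
    """Return the characters of `text` that carry a numeric value, in order."""
    return [ch for ch in text if has_numeric_type(ch)]


# The sample from the question: everything except the intersection sign is kept.
print(numeric_chars("1¾६೬¹一∩六"))
```

Unlike \p{N}, this keeps 一 and 六, while still rejecting ∩.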
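The Unicode set [\p{nt=de}\p{nt=di}\p{nt=nu}] is larger than the set [\p{N}], because it also includes characters such as 一 and 六 whose General Category is not one of the number categories.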
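One way to see the difference between the two sets is to walk every code point and compare. The following is a rough sketch using Python's standard `unicodedata` module; it again approximates "Numeric Type is not None" by "has a numeric value", the variable names are arbitrary, and the exact counts vary with the Unicode version your Python was built against.

```python
import sys
import unicodedata

in_gc_n = set()  # characters with General Category Nd, Nl, or No
in_nt = set()    # characters with a numeric value (approximates nt != None)

for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    if unicodedata.category(ch).startswith("N"):
        in_gc_n.add(ch)
    if unicodedata.numeric(ch, None) is not None:
        in_nt.add(ch)

# Characters that have a numeric value but would not match \p{N}.
only_nt = in_nt - in_gc_n

print(len(in_gc_n), len(in_nt), len(only_nt))
# General Categories of the characters that \p{N} misses.
print(sorted({unicodedata.category(ch) for ch in only_nt}))
```

The CJK numerals from the question end up in `only_nt`, so a pattern built on \p{N} alone will miss them.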