Does Perl's \w match all alphanumeric characters defined in the Unicode standard?
For example, will \w match all (say) Chinese and Russian alphanumeric characters?
I wrote a simple test script (see below) which suggests that \w does indeed match "as expected" for the non-ASCII alphanumeric characters I tested. But the testing is obviously far from exhaustive.
#!/usr/bin/perl
use utf8;
binmode(STDOUT, ':utf8');
my @ok;
$ok[0] = "abcdefghijklmnopqrstuvwxyz";
$ok[1] = "éèëáàåäöčśžłíżńęøáýąóæšćôı";
$ok[2] = "şźüęłâi̇ółńśłŕíáυσνχατςęςη";
$ok[3] = "τσιαιγολοχβςανنيرحبالтераб";
$ok[4] = "иневоаслкłјиневоцедањеволс";
$ok[5] = "рглсывызтоμςόκιναςόγο";
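foreach my $ok (@ok) {
    # Each sample string must consist entirely of \w characters.
    die "sample string contains a non-word character\n" unless $ok =~ /^\w+$/;
}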
\w (word character) matches any single letter, digit or underscore (in ASCII-only matching, the same as [a-zA-Z0-9_]). The uppercase counterpart \W (non-word character) matches any single character that \w does not match (in ASCII-only matching, the same as [^a-zA-Z0-9_]). In regex, an uppercase shorthand class such as \W, \D or \S is the inverse of its lowercase counterpart.
\w stands for “word character”. It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.
This will make your regular expressions work with all Unicode regex engines. In addition to the standard notation, \p{L}, Java, Perl, PCRE, the JGsoft engine, and XRegExp 3 allow you to use the shorthand \pL. The shorthand only works with single-letter Unicode properties.
A regular expression for an ASCII alphanumeric string checks that the string consists only of lowercase letters a-z, uppercase letters A-Z, and digits 0-9. A quantifier specifies the allowed string length; here + requires at least one character. The code that does all this looks like this: /^[a-zA-Z0-9]+$/
perldoc perlunicode says
Character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database.
\w can be used to match a Japanese ideograph, for instance.
So it looks like the answer to your question is "yes".
However, you might want to use the \p{} construct to directly access specific Unicode character properties. You can probably use \p{L} (or, shorter, \pL) for letters and \pN for numbers and feel a little more confident that you'll get exactly what you want.
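As a rough illustration of the difference between \w and the explicit [\pL\pN] class suggested above, here is a minimal sketch. The helper name compare_classes and the sample code points are my own choices for illustration, not something from perlunicode.

```perl
use v5.14;      # also turns on strict and the unicode_strings feature
use warnings;

# Report whether one code point matches \w and whether it matches [\pL\pN].
# compare_classes is a hypothetical helper written for this example.
sub compare_classes {
    my ($code_point) = @_;
    my $char = chr $code_point;
    return (
        word             => ($char =~ /\A\w\z/       ? 1 : 0),
        letter_or_number => ($char =~ /\A[\pL\pN]\z/ ? 1 : 0),
    );
}

# Sample code points, chosen only for illustration:
#   U+4E2D  CJK ideograph
#   U+0416  CYRILLIC CAPITAL LETTER ZHE
#   U+005F  LOW LINE (underscore)
#   U+00BD  VULGAR FRACTION ONE HALF
for my $cp (0x4E2D, 0x0416, 0x005F, 0x00BD) {
    my %result = compare_classes($cp);
    printf "U+%04X  \\w=%d  [\\pL\\pN]=%d\n",
        $cp, $result{word}, $result{letter_or_number};
}
```

The ideograph and the Cyrillic letter should match both classes. The underscore should match only \w, and the fraction should match only [\pL\pN], so the two are close but not interchangeable.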
Yes and no.
If you want all alphanumerics, you want [\p{Alphabetic}\p{GC=Number}]. The \w contains both more and less than that. It specifically excludes any \pN which is not \p{Nd} nor \p{Nl}, like the superscripts, subscripts, and fractions. Those are \p{GC=Other_Number}, and are not included in \w.
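To make the "both more and less" point concrete, a small sketch like the following compares \w with [\p{Alphabetic}\p{GC=Number}] on a few hand-picked code points. The helper names in_word and in_alnum and the sample list are assumptions made for this example.

```perl
use v5.14;
use warnings;
use charnames ();   # only for charnames::viacode, to label the output

# Hypothetical helpers: each tests a single code point against one class.
sub in_word  { chr($_[0]) =~ /\A\w\z/                            ? 1 : 0 }
sub in_alnum { chr($_[0]) =~ /\A[\p{Alphabetic}\p{GC=Number}]\z/ ? 1 : 0 }

# Sample code points, chosen for illustration:
#   U+00B2 and U+00BD are \p{GC=Other_Number}  (numbers that \w leaves out)
#   U+2160 is \p{GC=Letter_Number}, U+0661 is \p{GC=Decimal_Number}
#   U+0301 is a mark, U+203F is connector punctuation (extras that \w adds)
my @samples = (0x00B2, 0x00BD, 0x2160, 0x0661, 0x0301, 0x203F);

for my $cp (@samples) {
    printf "U+%04X %-28s \\w=%d alnum=%d\n",
        $cp, charnames::viacode($cp) // '(unnamed)',
        in_word($cp), in_alnum($cp);
}
```

The superscript and the fraction should match only the alphanumeric class, the combining accent and the undertie should match only \w, and the Roman numeral and the Arabic-Indic digit should match both.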
Because Perl, unlike most regex systems, complies with Requirement 1.2a, “Compatibility Properties”, from UTS #18 on Unicode Regular Expressions, a \w in a regex (assuming you have Unicode strings) matches any single code point that has any of the following four properties:
1. \p{Alphabetic}
2. \p{GC=Mark}
3. \p{GC=Connector_Punctuation}
4. \p{GC=Decimal_Number}
Number 4 above can be expressed in any of these ways, which are all considered equivalent:
\p{Digit}
\p{General_Category=Decimal_Number}
\p{GC=Decimal_Number}
\p{Decimal_Number}
\p{Nd}
\p{Numeric_Type=Decimal}
\p{Nt=De}
Note that \p{Digit} is not the same as \p{Numeric_Type=Digit}. For example, code point U+00B2, SUPERSCRIPT TWO, has only the \p{Numeric_Type=Digit} property and not plain \p{Digit}. That is because it is considered a \p{Other_Number} or \p{No}. It does, however, have the \p{Numeric_Value=2} property as you would imagine.
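The \p{Digit} versus \p{Numeric_Type=Digit} distinction can be checked with a short sketch along these lines. The helper numeric_props is a made-up name, and U+0662 ARABIC-INDIC DIGIT TWO is a comparison point I picked because it is a true decimal digit with the same numeric value.

```perl
use v5.14;
use warnings;

# Hypothetical helper: test one code point against several numeric properties.
sub numeric_props {
    my ($code_point) = @_;
    my $char = chr $code_point;
    return (
        'Digit'              => ($char =~ /\p{Digit}/              ? 1 : 0),
        'Nd'                 => ($char =~ /\p{Nd}/                 ? 1 : 0),
        'Numeric_Type=Digit' => ($char =~ /\p{Numeric_Type=Digit}/ ? 1 : 0),
        'Numeric_Value=2'    => ($char =~ /\p{Numeric_Value=2}/    ? 1 : 0),
        'No'                 => ($char =~ /\p{No}/                 ? 1 : 0),
        'word'               => ($char =~ /\w/                     ? 1 : 0),
    );
}

my @order = ('Digit', 'Nd', 'Numeric_Type=Digit', 'Numeric_Value=2', 'No', 'word');

# U+00B2 SUPERSCRIPT TWO versus U+0662 ARABIC-INDIC DIGIT TWO
for my $cp (0x00B2, 0x0662) {
    my %p = numeric_props($cp);
    printf "U+%04X: %s\n", $cp, join ', ', map { "$_=$p{$_}" } @order;
}
```

Both characters should report Numeric_Value=2, but only the Arabic-Indic digit should satisfy \p{Digit}, \p{Nd} and \w, while only the superscript should satisfy \p{Numeric_Type=Digit} and \p{No}.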
It’s really point number 1 above, \p{Alphabetic}, that gives people the most trouble. That’s because they too often mistakenly think it is somehow the same as \p{Letter} (\pL), but it is not.
Alphabetics include much more than that, all because of the \p{Other_Alphabetic} property, as this in turn includes some but not all \p{GC=Mark}, all of \p{Lowercase} (which is not the same as \p{GC=Ll} because it adds \p{Other_Lowercase}) and all of \p{Uppercase} (which is not the same as \p{GC=Lu} because it adds \p{Other_Uppercase}).
That’s how it pulls in \p{GC=Letter_Number} like Roman numerals and also all the circled letters, which are of type \p{Other_Symbol} and \p{Block=Enclosed_Alphanumerics}.
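The gap between \p{Alphabetic} and \p{Letter} described above can be seen with a sketch such as this one. The helper alpha_vs_letter and the sample code points are my own picks for illustration. It deliberately avoids \p{Other_Alphabetic} in the patterns, because as far as I know a default Perl build does not expose that property to \p{}.

```perl
use v5.14;
use warnings;

# Hypothetical helper: compare \p{Alphabetic}, \p{Letter} and \w for one code point.
sub alpha_vs_letter {
    my ($code_point) = @_;
    my $char = chr $code_point;
    return (
        alphabetic => ($char =~ /\p{Alphabetic}/ ? 1 : 0),
        letter     => ($char =~ /\p{Letter}/     ? 1 : 0),
        word       => ($char =~ /\w/             ? 1 : 0),
    );
}

# Sample code points, chosen for illustration:
#   U+0041  LATIN CAPITAL LETTER A            (an ordinary letter)
#   U+2160  ROMAN NUMERAL ONE                 (\p{GC=Letter_Number})
#   U+24B6  CIRCLED LATIN CAPITAL LETTER A    (\p{GC=Other_Symbol})
#   U+0345  COMBINING GREEK YPOGEGRAMMENI     (a mark that is Alphabetic)
#   U+0301  COMBINING ACUTE ACCENT            (a mark that is not Alphabetic)
for my $cp (0x0041, 0x2160, 0x24B6, 0x0345, 0x0301) {
    my %r = alpha_vs_letter($cp);
    printf "U+%04X  Alphabetic=%d  Letter=%d  \\w=%d\n",
        $cp, $r{alphabetic}, $r{letter}, $r{word};
}
```

Only the plain letter A should be a \p{Letter}. The Roman numeral, the circled letter and U+0345 should still be \p{Alphabetic}, and the combining acute accent should reach \w only through \p{GC=Mark}.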
Aren’t you glad we get to use \w? :)