I've written a Perl script that prints out characters matching a Unicode property. It seems to work all right for most properties so far.
But it prints out ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ among the characters matching [^\w]. These characters should match \w instead. Strangely enough, they do match \p{Word}.
I've tried without success:
map { decode ( "UTF-8", $_ ) }
map { pack 'U0C*', unpack 'C*', $_ }
How can I make [^\w] not match those word characters?
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
binmode STDOUT, ':utf8';
my $c;
my $cols = 80;
my $arg = shift;
my $regex = qr/$arg/;
for ( map { chr } 0x20 .. 0xFFFF )
{
    next if /\p{Unassigned}|\p{NChar}|\p{Cs}/;
    if ( $_ =~ $regex )
    {
        print STDOUT;
        print STDOUT "\n" if ++$c % $cols == 0;
    }
}
print STDOUT "\n" if defined $c and $c % $cols != 0;
exit 0;
Good:
$ ./chars.pl '\p{Cyrillic}'
ЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя
ѐёђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁ҂҃҄҇ҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡ
ҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿӀӁӂӃӄӅӆӇӈӉӊӋӌӍӎӏӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱ
ӲӳӴӵӶӷӸӹӺӻӼӽӾӿԀԁԂԃԄԅԆԇԈԉԊԋԌԍԎԏԐԑԒԓԔԕԖԗԘԙԚԛԜԝԞԟԠԡԢԣԤԥԦԧᴫᵸⷠⷡⷢⷣⷤⷥⷦⷧⷨⷩⷪⷫⷬⷭⷮⷯⷰⷱⷲⷳⷴⷵⷶⷷ
ⷸⷹⷺⷻⷼⷽⷾⷿꙀꙁꙂꙃꙄꙅꙆꙇꙈꙉꙊꙋꙌꙍꙎꙏꙐꙑꙒꙓꙔꙕꙖꙗꙘꙙꙚꙛꙜꙝꙞꙟꙠꙡꙢꙣꙤꙥꙦꙧꙨꙩꙪꙫꙬꙭꙮ꙯꙰꙱꙲꙳꙼꙽꙾ꙿꚀꚁꚂꚃꚄꚅꚆꚇꚈꚉꚊꚋꚌꚍꚎꚏ
ꚐꚑꚒꚓꚔꚕꚖꚗ
$
Good:
$ ./chars.pl '[^\p{Word}]' | grep É
$
Bad:
$ ./chars.pl '[^\w]' | grep É
°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþ
$
Perl v5.14.2
Unicode support in Perl is a huge topic; see e.g. this answer.
To make \w match the same characters as \p{Word}, you need to have the /u character set modifier in effect (available in Perl since version 5.14).
The simplest way is to start the program with
use v5.14;
which (among other things) enables the unicode_strings feature and makes all regexes default to the /u character set modifier. You can also enable just that feature explicitly:
use feature 'unicode_strings';
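The feature is lexically scoped, so it only affects regexes compiled inside the block or file where it is enabled. The following is a minimal sketch of that behaviour, assuming Perl 5.14 or later on an ASCII platform; the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $ch = chr(0xE9);    # é (U+00E9), held internally as one native byte

# Outside the feature's scope the regex is compiled with the default /d rules,
# so \w only matches ASCII word characters in a non-UTF-8 string.
my $without = $ch =~ /\w/ ? 1 : 0;

my $with;
{
    # Regexes compiled inside this block default to /u.
    use feature 'unicode_strings';
    $with = $ch =~ /\w/ ? 1 : 0;
}

print "without feature: $without\n";    # 0
print "with feature:    $with\n";       # 1
```

Putting the use line at the top of the file, as in the answer, simply makes the whole file the scope.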
A third way is to use the /u modifier on the regex itself, which changes the character set on a per-regex basis.
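For the script in the question, the per-regex variant would presumably mean adding /u where the pattern is compiled with qr//. A small sketch of the difference, assuming Perl 5.14 or later on an ASCII platform; $arg here is a hard-coded stand-in for the command-line argument.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $arg = '[^\w]';      # stand-in for the pattern passed on the command line
my $ch  = chr(0xC9);    # É (U+00C9), held internally as one native byte

my $plain   = qr/$arg/;     # default /d rules, as in the original script
my $unicode = qr/$arg/u;    # Unicode rules for this one regex only

my $plain_hit   = $ch =~ $plain   ? 1 : 0;
my $unicode_hit = $ch =~ $unicode ? 1 : 0;

print "qr/\$arg/  matches: $plain_hit\n";      # 1: É is wrongly treated as a non-word character
print "qr/\$arg/u matches: $unicode_hit\n";    # 0: É is a word character under /u
```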
You can read about the effects of the different regex character set modifiers in the perlre manpage. These are /d, /u, /a and /l.
\w itself is explained in the perlrecharclass manpage.
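To see how the modifiers differ in practice, here is a rough comparison of \w under the default (/d), /u and /a rules. It is only a sketch, assuming Perl 5.14 or later on an ASCII platform; /l is left out because its result depends on the locale in effect, and the hash names are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# é has a code point below 256, so chr() keeps it as one native byte.
# Ж has a code point above 255, so Perl stores the string as UTF-8 internally.
my %chars = ( 'U+00E9' => chr(0xE9), 'U+0416' => chr(0x416) );

my %result;
for my $name ( sort keys %chars ) {
    my $ch = $chars{$name};
    $result{$name} = {
        d => ( $ch =~ /\w/  ? 1 : 0 ),    # no modifier: default /d rules
        u => ( $ch =~ /\w/u ? 1 : 0 ),    # Unicode rules
        a => ( $ch =~ /\w/a ? 1 : 0 ),    # ASCII-only \w
    };
    printf "%s  /d=%d  /u=%d  /a=%d\n", $name, @{ $result{$name} }{qw(d u a)};
}

# Prints:
# U+00E9  /d=0  /u=1  /a=0
# U+0416  /d=1  /u=1  /a=0
```

The first row is the case from the question: under /d, a character in the U+0080 to U+00FF range only counts as a word character if the string happens to be stored as UTF-8, which is why Cyrillic worked and É did not.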