Test for Unicode properties on older versions of Perl

New properties are added to the Unicode database over time, and support for them arrives in later versions of Perl. I'd like to be able to test whether a property is allowed on my current version of Perl.

I have tried

perl -E 'say eval {qr/\p{wibble}/}; say q(OK)'

But the eval doesn't trap the missing property. I suspect this is because the property is looked up at compile time, so the script dies before it reaches the say q(OK).

Is there some way I can test whether a property exists before using it in a regex?

JGNI asked Dec 07 '25 08:12


2 Answers

As an example, we'll use the property Regional_Indicator. It was introduced in Unicode 10.0.0, and Perl added support for Unicode 10.0.0 in 5.28.0.

$ 5.26t/bin/perl -Mv5.10 -e'qr/\p{Regional_Indicator}/; say "ok"'
Can't find Unicode property definition "Regional_Indicator" in regex; marked by <-- HERE in m/\p{Regional_Indicator} <-- HERE / at -e line 1.

$ 5.28t/bin/perl -Mv5.10 -e'qr/\p{Regional_Indicator}/; say "ok"'
ok

It's not clear if you want to test if it's something allowed in \p{}, or if you want to test if it's a property name (and what exactly you consider a property name). We'll cover these separately.


Test if it's allowed in \p{}

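To test if a string is allowed in \p{}, we could catch the exception from qr//. This only works in Perl 5.24 and later. Before then, an unknown property would simply fail to match.

One catch: static patterns are compiled at compile time, so the exception would be thrown before the eval ever runs. Interpolating the name from a variable makes the pattern non-static, which defers compilation to run time.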

sub is_property_supported {
   my $maybe_prop = shift;
   return 0 if $maybe_prop =~ /[\\\}]/;
   require v5.24;
   return eval { qr/\p{$maybe_prop}/; 1 };
}

Demo:

use strict;
use warnings;
use feature qw( say );

sub is_property_supported {
   my $maybe_prop = shift;
   return 0 if $maybe_prop =~ /[\\\}]/;
   return eval { qr/\p{$maybe_prop}/; 1 };   # 5.24+
}

sub IsCustom { "0" }

say "$_: ", is_property_supported( $_ ) ? "Supported" : "Not recognized" for @ARGV;

$ 5.26t/bin/perl demo.pl Regional_Indicator
Regional_Indicator: Not recognized

$ 5.28t/bin/perl demo.pl Regional_Indicator
Regional_Indicator: Supported

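This tests for things that are valid in \p{}, which is not the same as the set of property names. Note that bare property names such as General_Category are rejected below, while values and name=value forms are accepted.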

L: Supported
Letter: Supported
  Letter: Supported
LeTtER: Supported
gc=L: Supported
General_Category=L: Supported
General_Category=Letter: Supported
General_Category = Letter: Supported
General-Category = Letter: Supported
General Category = Letter: Supported
General_Category: Letter: Supported
General_Category: Not recognized
General-Category: Not recognized
gc: Not recognized
GC: Not recognized
IsCustom: Supported
InCustom: Supported
main::IsCustom: Supported
main::InCustom: Supported

I have no solution for before 5.24.


Test if it's a property name

To test if a string is a Unicode property name, we can use Unicode::UCD.

use Unicode::UCD qw( prop_aliases );

# Returns cleaned up name (a true value) or `undef` (a false value).
sub property_name { ( prop_aliases( $_[0] ) )[ 1 ] }

Demo:

use strict;
use warnings;
use feature qw( say );

use Unicode::UCD qw( prop_aliases );

# Returns cleaned up name (a true value) or `undef` (a false value).
sub property_name { ( prop_aliases( $_[0] ) )[ 1 ] }

sub IsCustom { "0" }

say "$_: ", property_name( $_ ) ? "Supported" : "Not recognized" for @ARGV;

$ 5.26t/bin/perl demo.pl Regional_Indicator
Regional_Indicator: Not recognized

$ 5.28t/bin/perl demo.pl Regional_Indicator
Regional_Indicator: Supported

This matches property names and aliases case-insensitively and tolerates whitespace. It doesn't know about custom properties.

L: Supported
Letter: Supported
  Letter: Supported
LeTtER: Supported
gc=L: Not recognized
General_Category=L: Not recognized
General_Category=Letter: Not recognized
General_Category = Letter: Not recognized
General-Category = Letter: Not recognized
General Category = Letter: Not recognized
General_Category: Letter: Not recognized
General_Category: Supported
General-Category: Supported
gc: Supported
GC: Supported
IsCustom: Not recognized
InCustom: Not recognized
main::IsCustom: Not recognized
main::InCustom: Not recognized

Test if it's a property name (strictly)

prop_aliases returns the property's short name, then its long name, then a list of aliases. For example, for the General_Category property, it returns gc for the short name, General_Category for the long name, and the alias Category. (A property will always have a short and a long name, but they could be the same. It might not have any aliases.) This level of detail allows us to tweak the test to our liking.
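
As a rough sketch of unpacking that return value (the variable names here are illustrative only, not part of Unicode::UCD):

```perl
use strict;
use warnings;
use Unicode::UCD qw( prop_aliases );

# In list context: short name first, long name second, other aliases after.
my ( $short, $long, @others ) = prop_aliases( "General_Category" );

print "short:  $short\n";
print "long:   $long\n";
print "others: @others\n";
```

Which of these positions you compare against decides how permissive the test is.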

For example, if we wanted a case-sensitive match with no whitespace allowed, we could use the following:

use Unicode::UCD qw( prop_aliases );

sub is_property_name {
   my $maybe_prop = shift;
   return !!grep { $maybe_prop eq $_ } prop_aliases( $maybe_prop );
}

Demo:

use strict;
use warnings;
use feature qw( say );

use Unicode::UCD qw( prop_aliases );

sub is_property_name {
   my $maybe_prop = shift;
   return !!grep { $maybe_prop eq $_ } prop_aliases( $maybe_prop );
}

sub IsCustom { "0" }

say "$_: ", is_property_name( $_ ) ? "Supported" : "Not recognized" for @ARGV;

$ 5.26t/bin/perl demo.pl Regional_Indicator
Regional_Indicator: Not recognized

$ 5.28t/bin/perl demo.pl Regional_Indicator
Regional_Indicator: Supported

This matches property names more strictly than above.

L: Supported
Letter: Supported
  Letter: Not recognized
LeTtER: Not recognized
gc=L: Not recognized
General_Category=L: Not recognized
General_Category=Letter: Not recognized
General_Category = Letter: Not recognized
General-Category = Letter: Not recognized
General Category = Letter: Not recognized
General_Category: Letter: Not recognized
General_Category: Supported
General-Category: Not recognized
gc: Supported
GC: Not recognized
IsCustom: Not recognized
InCustom: Not recognized
main::IsCustom: Not recognized
main::InCustom: Not recognized

ikegami answered Dec 09 '25 23:12


This might not work for your problem, but in the rare cases I've needed to check this, I look at the property inversion map in Unicode::UCD. If the UCD doesn't know that property, you get back an empty list. See also "Look up Unicode properties with an inversion map". However, this will not help you with user-defined properties.
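
A minimal sketch of that check, assuming prop_invmap from Unicode::UCD is available on your perl. The helper name has_inversion_map is made up for this example:

```perl
use strict;
use warnings;
use Unicode::UCD qw( prop_invmap );

# Hypothetical helper: true if Unicode::UCD returns an inversion map
# for the given property name.
sub has_inversion_map {
   my $maybe_prop = shift;
   # prop_invmap must be called in list context.
   my @map = prop_invmap( $maybe_prop );
   return scalar @map;   # 0 when the list is empty
}

printf "%s: %s\n", $_, has_inversion_map( $_ ) ? "Known" : "Unknown"
   for qw( General_Category wibble );
```

This only asks whether the UCD has data for the name. It does not tell you whether the same string is accepted inside \p{}.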

My other technique is to know which version of the Unicode Standard I need, find the earliest perl that supports that, and require that as the minimum perl. That might bite you if something goes away or is renamed later, though.
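
In its simplest form this is just a minimum-version line such as use v5.28; at the top of the program. If you would rather check the Unicode version directly, a sketch along these lines might do, assuming Unicode::UCD::UnicodeVersion() reports the version of the Unicode data your perl ships with. The helper name unicode_at_least is invented for the example:

```perl
use strict;
use warnings;
use version;
use Unicode::UCD ();

# Hypothetical helper: true if this perl's Unicode data is at least $min.
sub unicode_at_least {
   my $min  = version->parse( shift );
   my $have = version->parse( Unicode::UCD::UnicodeVersion() );
   return $have >= $min;
}

# 10.0.0 is the version the other answer gives for Regional_Indicator.
print unicode_at_least( "10.0.0" )
   ? "Unicode 10.0.0 or later\n"
   : "Older than Unicode 10.0.0\n";
```

The same caveat applies: a version check says nothing about properties that were later removed or renamed.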

brian d foy answered Dec 09 '25 22:12