I'm trying to find a way to capitalise names in a Perl web app (using Perl v5.10.1). I originally planned to use Lingua::EN::NameCase, but I am seeing problems with accented characters.
I need to be able to handle accented characters from a variety of European languages (Irish, French, German).
I have seen some indications online that Lingua::EN::NameCase should work for my use case. For example, this page on PerlMonks: http://www.perlmonks.org/?node_id=889135
Here is my test code, based on the link above:
#!/usr/bin/perl
use strict;
use warnings;
use Lingua::EN::NameCase;
use locale;
use POSIX qw(locale_h);
my $locale = 'en_FR.utf8';
setlocale( LC_CTYPE, $locale );
binmode DATA,   ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
while (my $original_name = <DATA>) {
    chomp $original_name;
    my $normalized_name = nc($original_name);
    printf "%30s L::EN::NC %30s UCFIRST %30s\n", $original_name, $normalized_name, xlc($original_name);
}
sub xlc {
    my $str = shift;
    $_ = lc( $str );
    return join q{} => ( map { ucfirst(lc($_)) } ( $str =~ m/(\W+|\w+)/g ) );
};
__DATA__
ÉTIENNE DE LA BOÉTIE
ÉMILIE DU CHÂTELET
HÉLÈNE CIXOUS
Seán Ó Hannracháín
Máire Ó hÓgartaigh
This produces the output below. Both L::EN::NC and the custom ucfirst(lc()) solution give incorrect results (note the capital letter following each accented character). This seems to be because the Perl regex engine matches a word boundary before and after each accented character. I would have expected a word boundary to match only between a space character and a non-space character.
Can anybody suggest a solution?
Thanks,
Brian.
  ÉTIENNE DE LA BOÉTIE L::EN::NC           éTienne de la BoéTie UCFIRST           ÉTienne De La BoÉTie
    ÉMILIE DU CHÂTELET L::EN::NC             éMilie du ChâTelet UCFIRST             ÉMilie Du ChÂTelet
         HÉLÈNE CIXOUS L::EN::NC                  HéLèNe Cixous UCFIRST                  HÉLÈNe Cixous
    Seán Ó Hannracháín L::EN::NC             SeáN ó HannracháíN UCFIRST             SeÁN ó HannrachÁíN
    Máire Ó hÓgartaigh L::EN::NC             MáIre ó HóGartaigh UCFIRST             MÁIre ó HÓGartaigh
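The symptom in the UCFIRST column can be reproduced in isolation. The sketch below is only an illustration, not a diagnosis of the script above: cap_words is a hypothetical helper, and the assumption is that the regex sees the accented letter as something other than a word character, which is what happens when \w is applied to undecoded UTF-8 octets. Once the same name is decoded into a character string, \w treats É as part of the word.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: capitalise every run of \w characters.
sub cap_words {
    my ($str) = @_;
    $str =~ s/(\w+)/ucfirst(lc($1))/ge;
    return $str;
}

my $bytes = "\xC3\x89TIENNE";               # raw UTF-8 octets for "ÉTIENNE"

# On octets, the two bytes of É are not word characters, so the word
# is seen as starting at the T.
my $from_bytes = cap_words($bytes);

# On a decoded character string, É is a single word character.
my $chars      = decode('UTF-8', $bytes);
my $from_chars = encode('UTF-8', cap_words($chars));

print "octets:     $from_bytes\n";   # ÉTienne (on a UTF-8 terminal)
print "characters: $from_chars\n";   # Étienne (on a UTF-8 terminal)
```

If the output of a real script looks like the first line, it is worth checking whether the strings reaching the regex are decoded and whether a locale setting is changing what \w matches.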
Perl 5.10 is old; you should upgrade if you can.
Below is a version I use in similar situations (tested with Perl 5.14.2). It uses the CPAN module utf8::all to enable UTF-8 handling.
#!/usr/bin/perl
use strict;
use warnings;
use utf8::all;
while (<DATA>) { chomp;
    printf "%30s ==> %30s\n", $_, xlc($_);
}
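sub xlc { my $str = shift;
    # Capitalise every word: lowercase it, then uppercase its first letter.
    $str =~ s/(\w+)/ucfirst(lc($1))/ge;
    # Put common particles (Le, La, Les, Von, De, Du, Des, ...) back in lowercase.
    $str =~ s/( L[ea]s?
               | Von
               | D[aeou]s?
               )\b
              /lc($1)/xge;
    return $str;
};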
__DATA__
ÉTIENNE DE LA BOÉTIE
ÉMILIE DU CHÂTELET
HÉLÈNE CIXOUS
Seán Ó Hannracháín
Máire Ó hÓgartaigh
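If utf8::all cannot be installed, roughly the same idea can be sketched with core Perl only. This is a minimal sketch under some assumptions, not the code above: name_case is a hypothetical helper, the source file is assumed to be saved as UTF-8, \p{L} (any Unicode letter) is swapped in for \w, and a leading \b is added to the particle rule. It does not handle Irish lowercase prefixes, so hÓgartaigh comes out as Hógartaigh and would need its own rule.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # string literals in this file are UTF-8
binmode STDOUT, ':encoding(UTF-8)';    # encode characters on output

# Hypothetical helper: capitalise each run of letters, then lowercase
# a few common particles again.
sub name_case {
    my ($str) = @_;
    # \p{L} matches any Unicode letter, accented or not.
    $str =~ s/(\p{L}+)/ucfirst(lc($1))/ge;
    # Le, La, Les, Las, Von, De, Da, Do, Du, Des, ... go back to lowercase.
    $str =~ s/\b(L[ea]s?|Von|D[aeou]s?)\b/lc($1)/ge;
    return $str;
}

for my $name ('ÉTIENNE DE LA BOÉTIE', 'ÉMILIE DU CHÂTELET', 'HÉLÈNE CIXOUS') {
    print name_case($name), "\n";
}
# Étienne de la Boétie
# Émilie du Châtelet
# Hélène Cixous
```

Note that a particle at the very start of a name is lowercased as well; whether that is wanted depends on the data.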