I'm looking for some regex code with this pattern:
Must contain at least 1 of the following and match the whole string.
Can contain only alpha letters (a-z A-Z) ...
and accented alpha letters (á ä à etc).
I'm using preg_match('/^([\p{L}]*)$/iu', $input), but \p{L} matches all unicode letters, including Chinese. I just want to allow the English alphabet letters but also the accented variants of them.
So JohnDoe, Fübar, Lòrem, FírstNäme, Çákë would all be valid inputs, because they all contain at least 1 alpha letter and/or accented alpha letters, and the whole string matches.
I would suggest this compact regex:
(?i)(?:(?![×Þß÷þø])[a-zÀ-ÿ])+
See demo.
Most of the accented letters you want fall in the range À to ÿ (see this table), so we simply add that range to the character class. However, À-ÿ also contains a few unwanted characters. Unlike some engines, PCRE (PHP's regex engine) does not support character class subtraction, but we can mimic it with the negative lookahead (?![×Þß÷þø]).
Note that à can be expressed by several Unicode code points (the single precomposed à, or an a followed by a combining grave accent). This regex only matches the precomposed forms. Catching all variations is really hard.
In your code:
$regex = "~(?i)(?:(?![×Þß÷þø])[a-zÀ-ÿ])+~u";
$hit = preg_match($regex,$subject,$match);
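The pattern above is unanchored, so preg_match succeeds as soon as any run of allowed letters appears somewhere in the subject. Since the question asks for the whole string to match, here is a minimal sketch of an anchored variant. The function name isLatinName is hypothetical, and the normalization step assumes the optional intl extension is available; without it, only precomposed characters are matched. The file must be saved as UTF-8 for the /u pattern to work.

```php
<?php
// Sketch only: isLatinName() is a hypothetical helper, not part of the original answer.
function isLatinName($subject)
{
    // Optional step: compose "a" + combining accent into the single code point "à".
    // Normalizer is provided by the intl extension, which may not be installed.
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($subject, Normalizer::FORM_C);
        if ($normalized !== false) {
            $subject = $normalized;
        }
    }

    // \A and \z anchor the match at the very start and end of the subject,
    // so every character has to be an allowed letter.
    $regex = '~\A(?i)(?:(?![×Þß÷þø])[a-zÀ-ÿ])+\z~u';

    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match($regex, $subject) === 1;
}

var_dump(isLatinName('Çákë'));    // all characters are in a-z or À-ÿ
var_dump(isLatinName('J0hnDoe')); // contains a digit
```

Comparing the result with === 1 treats a preg_match error (for example, invalid UTF-8 input) the same as a non-match.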
I came up with the following solution, which combines preg_match and iconv. Tested with PHP 5.5 on Windows and Linux:
$testWords = array(
    // pass
    'Çákë',
    'JohnDoe',
    'Fübar',
    'Lòrem',
    'FírstNäme',
    // fail
    'Ç@kë',
    'J0hnDoe',
    'F行bar',
    'L高rem',
    'F前rstNäme',
    'Ç学kë',
    '0'
);
$matchedWords = array_filter($testWords, function ($word) {
    // these characters should not be in the search string but may appear after iconv conversion
    $regexCharsNot = '\^~"`\'';
    $valid = false;
    // reject input that already contains any of those characters
    if (!preg_match("/[$regexCharsNot]/u", $word)) {
        // transliterate to ASCII; iconv returns false on failure
        if ($word = @iconv('UTF-8', 'ASCII//TRANSLIT', $word)) {
            $valid = preg_match("/^[A-Za-z$regexCharsNot]+$/u", $word);
        }
    }
    return $valid;
});
print_r($matchedWords);
/*
Array
(
[0] => Çákë
[1] => JohnDoe
[2] => Fübar
[3] => Lòrem
[4] => FírstNäme
)
*/
iconv with ASCII//TRANSLIT can introduce extraneous characters into the converted string, which is why the double validation with $regexCharsNot is required. I came up with that list using the following:
// mb_str_split regex http://www.php.net/manual/en/function.mb-split.php#99851
// list of accented characters http://fasforward.com/list-of-european-special-characters/
$accentedCharacters = preg_split(
'/(?<!^)(?!$)/u',
'ÄäÀàÁáÂâÃãÅåĄąĂăÆæÇçĆćĈĉČčĎđĐďðÈèÉéÊêËëĚěĘęĜĝĢģĤĥÌìÍíÎîÏïĴĵĶķĹĺĻļŁłĽľÑñŃńŇňÖöÒòÓóÔôÕõŐőØøŒœŔŕŘřߌśŜŝŞşŠšŤťŢţÞþÜüÙùÚúÛûŰűŨũŲųŮůŴŵÝýŸÿŶŷŹźŽžŻż');
/*
$unsupported = ''; // 'Ǎǎẞ';
foreach ($accentedCharacters as $c) {
    if (!@iconv('UTF-8', 'ASCII//TRANSLIT', $c)) {
        $unsupported .= $c;
    }
}
*/
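The snippet above only reports characters that iconv cannot convert at all. As a rough sketch of how the list of extraneous characters could be collected, the helper below transliterates each character and records whatever is left once the ASCII letters are removed. The name collectIconvArtifacts is hypothetical, and the result depends on the iconv implementation and locale of the machine, so no expected output is given.

```php
<?php
// Sketch only: collectIconvArtifacts() is a hypothetical helper, not part of the original answer.
function collectIconvArtifacts($characters)
{
    $artifacts = array();

    // Split the UTF-8 string into individual characters.
    foreach (preg_split('//u', $characters, -1, PREG_SPLIT_NO_EMPTY) as $c) {
        $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $c);
        if ($ascii === false) {
            continue; // iconv could not convert this character at all
        }

        // Whatever remains after removing ASCII letters was added by the transliteration.
        $leftover = preg_replace('/[A-Za-z]/', '', $ascii);
        foreach (str_split($leftover) as $extra) {
            if ($extra !== '') {
                $artifacts[$extra] = true; // array keys keep the list unique
            }
        }
    }

    return implode('', array_keys($artifacts));
}

// Output varies by platform.
echo collectIconvArtifacts('ÄäÀàÁáÂâÃãÅåÇçÑñÖöÜü'), "\n";
```

Some iconv implementations substitute a placeholder such as ? for characters they cannot transliterate. Such placeholders would also show up in the result and should not be added to the allowed list.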