
Split MB string based on length

I have a string made up of multibyte (non-Latin) characters.

先秦兩漢先秦兩漢先秦兩漢漢先秦兩漢漢先秦兩漢( 243071)

I need to split it into an array of chunks in PHP whenever its length exceeds a given limit, say 15 characters.

For that, I have tried:

if(mb_strlen($string) > 15){

    $seed = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
}

But this breaks. It does not fail in every case, only for the string that has 35 characters.

Another approach I tried is this function:

function word_chunk($str, $len = 76, $end = "||") {
    $pattern = '~.{1,' . $len . '}~u'; // like "~.{1,76}~u"
    $str = preg_replace($pattern, '$0' . $end, $str);
    return rtrim($str, $end);
}

Please help; note that I only need this to work for multibyte (MB) characters.

Gagan asked Feb 01 '26 04:02



1 Answer

This will split your string after every 10th "extended grapheme cluster" (suggested by Wiktor up in the comments).

var_export(preg_split('~\X{10}\K~u', $string));

preg_split('~.{10}\K~u', $string) will work on your sample string, but for input beyond your sample, \X is more robust when dealing with Unicode, because it keeps a base character and its combining marks together.

From https://www.regular-expressions.info/unicode.html:

You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.
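To make the difference between the dot and \X concrete, here is a minimal sketch. The test strings and variable names are illustrative assumptions, not taken from the question.

```php
<?php
// "e" + U+0301 (combining acute accent): two code points, one grapheme cluster.
$accented = "e\u{0301}";
$dotCount = preg_match_all('~.~u', $accented);      // dot matches each code point
$clusterCount = preg_match_all('~\X~u', $accented); // \X matches the whole cluster

// A line feed is skipped by the dot (no "s" modifier) but matched by \X.
$multiline = "a\nb";
$dotLines = preg_match_all('~.~u', $multiline);
$clusterLines = preg_match_all('~\X~u', $multiline);

var_export([$dotCount, $clusterCount, $dotLines, $clusterLines]);
```

The dot counts 2 matches in the accented string where \X counts 1, so a chunk boundary computed with the dot could separate a letter from its accent.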

Here is a related SO page.

The \K resets the starting point of the full-string match, so the match becomes zero-length and no characters are lost in the split.
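The effect of \K is easiest to see by comparing the same split with and without it. This is a small sketch on an assumed ASCII sample string, chosen only to keep the result easy to read:

```php
<?php
$string = 'abcde';

// Without \K the matched characters act as the delimiter, so they are discarded.
$without = preg_split('~.{2}~u', $string);

// With \K the reported match is zero-length, so the split falls between characters.
$with = preg_split('~.{2}\K~u', $string);

var_export($without);
var_export($with);
```

The first call returns two empty strings followed by 'e', while the second returns 'ab', 'cd' and 'e'.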

Here is a demo where $len=10 https://regex101.com/r/uO6ur9/2

Code:

$string = '先秦兩漢先秦兩漢先秦兩漢漢先秦兩漢漢先秦兩漢( 243071)';
var_export(preg_split('~\X{10}\K~u', $string));

Output:

array (
  0 => '先秦兩漢先秦兩漢先秦',
  1 => '兩漢漢先秦兩漢漢先秦',
  2 => '兩漢( 243071',
  3 => ')',
)

Implementation:

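function word_chunk($str, $len) {
    // \X{n} matches n grapheme clusters; \K keeps them out of the delimiter.
    // PREG_SPLIT_NO_EMPTY drops the empty last element when the length is an exact multiple of $len.
    return preg_split('~\X{' . $len . '}\K~u', $str, -1, PREG_SPLIT_NO_EMPTY);
}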
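One edge case is worth checking: when the string length is an exact multiple of the chunk size, the last match sits at the very end of the string, and preg_split() returns a trailing empty element unless PREG_SPLIT_NO_EMPTY is passed. The sketch below uses a hypothetical helper name, grapheme_chunks, and an assumed 10-character sample to show both behaviours.

```php
<?php
// Hypothetical helper for illustration; the name is not from the answer.
function grapheme_chunks(string $str, int $len): array
{
    return preg_split('~\X{' . $len . '}\K~u', $str, -1, PREG_SPLIT_NO_EMPTY);
}

$exact = '先秦兩漢先秦兩漢先秦'; // exactly 10 characters

$plain = preg_split('~\X{5}\K~u', $exact); // no flag: ends with an empty string
$clean = grapheme_chunks($exact, 5);       // flag set: only the two real chunks

var_export($plain);
var_export($clean);
```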

While preg_split() might be slightly slower than preg_match_all(), it has the advantage of returning the desired one-dimensional array directly. preg_match_all() populates a multi-dimensional array, from which you would need only the elements of the [0] subarray.
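For comparison, the preg_match_all() route could look roughly like this. It is a sketch of the alternative mentioned above, and the {1,10} quantifier is an assumption made so that the final, shorter chunk is matched too.

```php
<?php
$string = '先秦兩漢先秦兩漢先秦兩漢漢先秦兩漢漢先秦兩漢( 243071)';

// Match up to 10 grapheme clusters at a time; the chunks end up in $matches[0].
preg_match_all('~\X{1,10}~u', $string, $matches);

var_export($matches[0]);
```

The chunks are the same four strings that preg_split() produced, but they have to be read from the [0] subarray.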

mickmackusa answered Feb 02 '26 17:02
