
Split unicode string into list of character strings

Tags: java, unicode

How can I split a Unicode string containing both surrogate-pair characters and normal characters into a List<String> of characters?

(Each element has to be a String, because a surrogate-pair character consists of two char values and does not fit in a single char.)

cdalxndr asked Jan 23 '26 05:01


1 Answer

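Try splitting on a zero-width lookbehind. Java's regex engine matches . against whole code points, so the split does not land inside a surrogate pair. Note that . does not match line terminators unless DOTALL mode, (?s), is enabled.

// List.of requires Java 9+; on Java 8, Arrays.asList works the same way here.
String s = "😊a👦c😊";
List<String> result = List.of(s.split("(?<=.)"));
for (String e : result)
    System.out.println(e + " : length=" + e.length());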

output:

😊 : length=2
a : length=1
👦 : length=2
c : length=1
😊 : length=2

Code points

Or, use a stream of code point integer numbers.

List<String> result =
    s
    .codePoints()                    // Produce an `IntStream` of code point numbers.
    .mapToObj(Character::toString)   // Produce a `String` of one or two Java chars per code point. `Character.toString(int)` requires Java 11+.
    .collect(Collectors.toList());

See this code run live at IdeOne.com.
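For anyone stuck on Java 8 to 10, where Character.toString(int) is not available, here is a minimal sketch of the same code point approach using Character.toChars instead. The class name CodePointSplit and the helper splitByCodePoint are made up for illustration; they are not part of the answer above.

```java
import java.util.List;
import java.util.stream.Collectors;

class CodePointSplit {

    // Hypothetical helper: returns one String per Unicode code point of s.
    static List<String> splitByCodePoint(String s) {
        return s.codePoints()
                // Character.toChars(int) yields one char for BMP code points
                // and a surrogate pair (two chars) for supplementary ones.
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String s = "😊a👦c😊";
        for (String e : splitByCodePoint(s)) {
            System.out.println(e + " : length=" + e.length());
        }
    }
}
```

With the sample string this should give the same five elements as the regex version: the emoji come back as Strings of length 2, the letters as Strings of length 1.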

To capture the code point numbers themselves rather than strings, use this variation of the above code.

List<Integer> codePointNumbers =
    s
    .codePoints()
    .boxed()
    .collect(Collectors.toList());

When run:

codePointNumbers.toString(): [128522, 97, 128102, 99, 128522]
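Going the other way, a list of code point numbers like the one above can be turned back into a String. The sketch below assumes a hypothetical helper named fromCodePoints (not from the original answer) and also prints the char count next to the code point count, which is where the difference between the two shows up.

```java
import java.util.List;
import java.util.stream.Collectors;

class CodePointRoundTrip {

    // Hypothetical helper: rebuilds a String from a list of code point numbers.
    static String fromCodePoints(List<Integer> codePoints) {
        StringBuilder sb = new StringBuilder();
        for (int cp : codePoints) {
            // appendCodePoint writes a surrogate pair when cp is above U+FFFF.
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "😊a👦c😊";
        List<Integer> codePointNumbers =
            s.codePoints().boxed().collect(Collectors.toList());

        System.out.println(s.length());                       // number of UTF-16 chars
        System.out.println(s.codePointCount(0, s.length()));  // number of code points
        System.out.println(fromCodePoints(codePointNumbers).equals(s)); // round trip
    }
}
```

For the sample string, length() counts 8 chars while codePointCount reports 5 code points, since each of the three emoji occupies two chars.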

