
Correct way to split UTF-8 String

I want to split a UTF-8 string.

I have tried StringTokenizer, but it does not give the result I expect.

The title should be "0" but it shows as "عُدي_صدّام_حُسين".

    import java.util.StringTokenizer;

    String test = "en.m عُدي_صدّام_حُسين 1 0";

    // Split on whitespace (StringTokenizer's default delimiters)
    StringTokenizer stringTokenizer = new StringTokenizer(test);
    String code = stringTokenizer.nextToken();
    String title = stringTokenizer.nextToken();

What is the correct way to split a UTF-8 string?

Jason asked Jun 06 '26 08:06


1 Answer

The problem here is that the Arabic text isn't "at the end" of the string.

For example, if I select the contents of the string literal (in Chrome), moving my mouse from left to right, it selects the en.m first, then all of the Arabic text, then the 0 1. The text only looks "at the end" because that's how it is being rendered.
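One way to see that rendering and storage can disagree is to ask Java's own bidirectional-text analyzer about the string. The following is a minimal sketch, not part of the original answer: the class name `BidiRuns` and the helper `isMixedDirection` are made up for illustration, and the Arabic word is a shortened stand-in written with Unicode escapes so the source itself is not reordered by your editor.

```java
import java.text.Bidi;

class BidiRuns {

    // True if the text is neither purely left-to-right nor purely
    // right-to-left, so its on-screen order may differ from its stored order.
    public static boolean isMixedDirection(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        return bidi.isMixed();
    }

    public static void main(String[] args) {
        // "\u0639\u062F\u064A" is a short Arabic stand-in for the title.
        String text = "en.m \u0639\u062F\u064A 1 0";
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);

        System.out.println("mixed: " + bidi.isMixed());

        // Each run is a stretch of characters sharing one embedding level;
        // odd levels are laid out right-to-left when displayed.
        for (int i = 0; i < bidi.getRunCount(); i++) {
            System.out.printf("run %d: chars [%d, %d) level %d%n",
                    i, bidi.getRunStart(i), bidi.getRunLimit(i), bidi.getRunLevel(i));
        }
    }
}
```

The run boundaries are reported as indexes into the stored string, which is the order that any splitting code works on.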

The string, as specified in your Java source code, actually does have the عُدي_صدّام_حُسين as the second token. So you're splitting it correctly; you're just not splitting what you think you're splitting.
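To check this without relying on how an editor or terminal draws the text, you can print each token by index along with its code points. This is a rough sketch under a few assumptions: the class `LogicalOrderSplit` and its helpers are hypothetical names, and the Arabic word is a shortened stand-in for the title, written as Unicode escapes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class LogicalOrderSplit {

    // Split on whitespace. Tokens come back in logical (stored) order,
    // regardless of how a bidi-aware renderer would display the string.
    public static List<String> tokens(String s) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // True if the token contains at least one strongly right-to-left character.
    public static boolean containsRtl(String token) {
        return token.codePoints().anyMatch(cp -> {
            byte d = Character.getDirectionality(cp);
            return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
        });
    }

    public static void main(String[] args) {
        // "\u0639\u062F\u064A" is a short Arabic stand-in for the title.
        List<String> parts = tokens("en.m \u0639\u062F\u064A 1 0");
        for (int i = 0; i < parts.size(); i++) {
            StringBuilder hex = new StringBuilder();
            // Print code points so the output is unambiguous on any display.
            parts.get(i).codePoints()
                    .forEach(cp -> hex.append(String.format("U+%04X ", cp)));
            System.out.println(i + ": " + hex.toString().trim()
                    + " rtl=" + containsRtl(parts.get(i)));
        }
    }
}
```

In this sketch the Arabic word is at index 1 and the two digits are at indexes 2 and 3, which matches the order in which the characters were typed into the literal.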

Andy Turner answered Jun 08 '26 22:06


