How can I prevent String.substring from returning a substring with an invalid Unicode character?

Question

I only recently noticed that substring can return a string containing an invalid Unicode character.

For instance:

public class Main {

    public static void main(String[] args) {
        String text = "🥦_Salade verte";

        /* We should avoid using endIndex = 1, as it will cause an invalid character in the returned substring. */
        // 1 : ?
        System.out.println("1 : " + text.substring(0, 1));

        // 2 : 🥦
        System.out.println("2 : " + text.substring(0, 2));

        // 3 : 🥦_
        System.out.println("3 : " + text.substring(0, 3));

        // 4 : 🥦_S
        System.out.println("4 : " + text.substring(0, 4));
    }
}

When trimming a long string with String.substring, what are some good ways to keep the returned substring from containing invalid Unicode?

Basil Bourque · Accepted Answer

`char` obsolete

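The char type has been legacy since Java 2 and is essentially broken. As a 16-bit value, a single char is physically incapable of representing most characters: anything outside the Basic Multilingual Plane, such as 🥦, needs a pair of chars (a surrogate pair).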

Your discovery shows that String#substring is char-based: its indexes count char values, not characters. Hence the problem shown in your code, where an endIndex of 1 cuts the surrogate pair for 🥦 in half.
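
If you need to stay with char indexes, for example because a length limit is expressed in chars, one option is to check whether the cut lands between the two halves of a surrogate pair and back off by one. The following is a minimal sketch, not part of the original code; the class name SafeTruncate and the method truncate are made up for illustration.

```java
class SafeTruncate {

    // Hypothetical helper: returns at most maxChars chars from the start of text,
    // moving the cut back by one char if it would split a surrogate pair.
    static String truncate(String text, int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars < 0");
        }
        if (maxChars >= text.length()) {
            return text;
        }
        int end = maxChars;
        // The cut falls between index end - 1 and index end.
        // If those two chars form a high/low surrogate pair, keep the pair together.
        if (end > 0
                && Character.isHighSurrogate(text.charAt(end - 1))
                && Character.isLowSurrogate(text.charAt(end))) {
            end--;
        }
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        String text = "🥦_Salade verte";
        System.out.println("1 : [" + truncate(text, 1) + "]");
        System.out.println("2 : [" + truncate(text, 2) + "]");
        System.out.println("3 : [" + truncate(text, 3) + "]");
    }
}
```

With the text from the question, a limit of 1 yields an empty string instead of a lone surrogate, and a limit of 2 yields the whole 🥦. The result can be shorter than the limit, which is usually what you want when truncating.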

Code point

Instead, use code point integer numbers when working with individual characters.

int[] codePoints = "🥦_Salade".codePoints().toArray() ;

[129382, 95, 83, 97, 108, 97, 100, 101]

Extract the first character’s code point.

int codePoint = codePoints[ 0 ] ;

129382

Make a single-character String object for that code point.

String firstCharacter = Character.toString( codePoint ) ;

🥦

You can grab a subset of that int array of code points.

int[] firstFewCodePoints = Arrays.copyOfRange( codePoints , 0 , 3 ) ;

And make a String object from those code points.

String s = 
    Arrays
        .stream( firstFewCodePoints ) 
        .collect( StringBuilder::new , StringBuilder::appendCodePoint , StringBuilder::append )
        .toString();

🥦_S

Or we can use a constructor of String to take a subset of the array.

String result = new String( codePoints , 0 , 3 ) ;

🥦_S

See this code run live at IdeOne.com.
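
If you would rather not build an int array at all, String#offsetByCodePoints can translate a count of code points into a char index, which can then be passed to the ordinary substring. This is a sketch of that alternative, not the approach shown above; the class name OffsetDemo and the method firstCodePoints are made up for illustration.

```java
class OffsetDemo {

    // Hypothetical helper: returns the first n code points of text.
    // offsetByCodePoints throws IndexOutOfBoundsException if text
    // contains fewer than n code points.
    static String firstCodePoints(String text, int n) {
        int endCharIndex = text.offsetByCodePoints(0, n);
        return text.substring(0, endCharIndex);
    }

    public static void main(String[] args) {
        String text = "🥦_Salade";
        System.out.println(firstCodePoints(text, 1)); // 🥦
        System.out.println(firstCodePoints(text, 3)); // 🥦_S
    }
}
```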

MC Emperor · Answer

The answer by Basil nicely shows that you should work with code points instead of chars.

A String does not expose Unicode code points directly; it exposes a sequence of char values (UTF-16 code units). So there is no way to know which chars belong together to form a single code point without inspecting the actual contents of the string.
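
To make the difference between chars and code points concrete, here is a small sketch using the string from the question. The class name CodePointCount is made up for illustration; the methods called are standard String and Character methods.

```java
class CodePointCount {

    public static void main(String[] args) {
        String text = "🥦_Salade verte";

        // Number of char values (UTF-16 code units).
        System.out.println(text.length()); // 15

        // Number of Unicode code points.
        System.out.println(text.codePointCount(0, text.length())); // 14

        // Number of chars needed for the first code point.
        System.out.println(Character.charCount(text.codePointAt(0))); // 2
    }
}
```

The two counts differ by one because 🥦 occupies two chars but is a single code point.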

Unicode-aware substring

Here is a Unicode-aware substring method. Since codePoints() returns an IntStream, we can utilize the skip and limit methods to extract a portion of the string.

public static String unicodeSubstring(String string, int beginIndex, int endIndex) {
    int length = endIndex - beginIndex;
    int[] codePoints = string.codePoints()
        .skip(beginIndex)
        .limit(length)
        .toArray();
    return new String(codePoints, 0, codePoints.length);
}

Here is what happens in the snippet above. We stream over the Unicode code points, skip the first beginIndex code points, limit the stream to endIndex − beginIndex elements, and then convert it to an int[]. The resulting array contains the code points from beginIndex (inclusive) up to endIndex (exclusive), or fewer if the string is too short.

Finally, the String class has a convenient constructor that builds a String from an int[] of code points, so we use it to get the result.
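
For illustration, here is the method wrapped in a class and called with the string from the question. The class name UnicodeSubstringDemo is made up; the method body is the one shown above.

```java
class UnicodeSubstringDemo {

    // Indexes count code points, not chars.
    public static String unicodeSubstring(String string, int beginIndex, int endIndex) {
        int length = endIndex - beginIndex;
        int[] codePoints = string.codePoints()
            .skip(beginIndex)
            .limit(length)
            .toArray();
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String text = "🥦_Salade verte";
        System.out.println(unicodeSubstring(text, 0, 1)); // 🥦
        System.out.println(unicodeSubstring(text, 0, 3)); // 🥦_S
        System.out.println(unicodeSubstring(text, 2, 8)); // Salade
    }
}
```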

Of course, you could tweak the method to be a little more strict by rejecting out-of-bounds values:

if (endIndex < beginIndex) {
    throw new IllegalArgumentException("endIndex < beginIndex");
}
int length = endIndex - beginIndex;
int[] codePoints = string.codePoints()
    .skip(beginIndex)
    .limit(length)
    .toArray();
if (codePoints.length < length) {
    throw new IllegalArgumentException(
        "begin %s, end %s, length %s".formatted(beginIndex, endIndex, codePoints.length)
    );
}
return new String(codePoints, 0, codePoints.length);

Online demo
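
Assembled into a complete method, the stricter variant might look like the sketch below. The class name StrictUnicodeSubstring is made up for illustration, and String.format is used here in place of formatted so that the example also compiles on Java versions older than 15.

```java
class StrictUnicodeSubstring {

    // Indexes count code points. Throws IllegalArgumentException when the
    // requested range is reversed or extends past the end of the string.
    public static String unicodeSubstring(String string, int beginIndex, int endIndex) {
        if (endIndex < beginIndex) {
            throw new IllegalArgumentException("endIndex < beginIndex");
        }
        int length = endIndex - beginIndex;
        int[] codePoints = string.codePoints()
            .skip(beginIndex)   // skip itself rejects a negative beginIndex
            .limit(length)
            .toArray();
        if (codePoints.length < length) {
            throw new IllegalArgumentException(
                String.format("begin %s, end %s, length %s", beginIndex, endIndex, codePoints.length)
            );
        }
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        System.out.println(unicodeSubstring("🥦_Salade verte", 0, 3));
        try {
            unicodeSubstring("🥦_S", 0, 10);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Unlike the lenient version, a request for more code points than the string contains fails loudly instead of silently returning a shorter result.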

Tags:

java

android

Cheok Yan Cheng
