I only recently noticed that substring can return a string containing an invalid Unicode character.
For instance:
public class Main {
    public static void main(String[] args) {
        String text = "🥦_Salade verte";
        /* We should avoid using endIndex = 1, as it will cause an invalid character in the returned substring. */
        // 1 : ?
        System.out.println("1 : " + text.substring(0, 1));
        // 2 : 🥦
        System.out.println("2 : " + text.substring(0, 2));
        // 3 : 🥦_
        System.out.println("3 : " + text.substring(0, 3));
        // 4 : 🥦_S
        System.out.println("4 : " + text.substring(0, 4));
    }
}
I was wondering: when trimming a long string with String.substring, what are some good ways to keep the returned substring from containing invalid Unicode?
char obsolete
The char type has been legacy since Java 2 and is essentially broken. As a 16-bit value, a char is physically incapable of representing most characters.
Your discovery suggests that the String#substring method is char based, meaning its indices count 16-bit char values rather than whole characters. Hence the problem shown in your code.
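To make the cause concrete, here is a minimal sketch of how the question's string is laid out in memory. The class name SurrogatePairDemo is made up for illustration, and the emoji is written as its two UTF-16 escapes so the pair is visible in the source.

```java
final class SurrogatePairDemo {
    // The question's string, with the broccoli emoji (U+1F966) spelled out
    // as its two UTF-16 code units (a surrogate pair).
    static final String TEXT = "\uD83E\uDD66_Salade verte";

    public static void main(String[] args) {
        // length() counts 16-bit char values, not characters.
        System.out.println(TEXT.length());                             // 15
        // codePointCount() counts Unicode code points.
        System.out.println(TEXT.codePointCount(0, TEXT.length()));     // 14
        // The emoji occupies indices 0 and 1 as a high/low surrogate pair.
        System.out.println(Character.isHighSurrogate(TEXT.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(TEXT.charAt(1)));  // true
        // substring(0, 1) therefore returns only the high surrogate,
        // which is not a valid character on its own.
        System.out.println(TEXT.substring(0, 1).equals("\uD83E"));     // true
    }
}
```

The one-unit difference between the two counts is exactly the emoji's second char, which is why endIndex = 1 cuts it in half.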
Instead, use code point integer numbers when working with individual characters.
int[] codePoints = "🥦_Salade".codePoints().toArray() ;
[129382, 95, 83, 97, 108, 97, 100, 101]
Extract the first character’s code point.
int codePoint = codePoints[ 0 ] ;
129382
Make a single-character String object for that code point.
String firstCharacter = Character.toString( codePoint ) ;
🥦
You can grab a subset of that int array of code points.
int[] firstFewCodePoints = Arrays.copyOfRange( codePoints , 0 , 3 ) ;
And make a String object from those code points.
String s =
        Arrays
                .stream( firstFewCodePoints )
                .collect( StringBuilder::new , StringBuilder::appendCodePoint , StringBuilder::append )
                .toString();
🥦_S
Or we can use a constructor of String to take a subset of the array.
String result = new String( codePoints , 0 , 3 ) ;
🥦_S
See this code run live at IdeOne.com.
The answer by Basil nicely shows that you should work with code points instead of chars.
A String does not store Unicode code points internally, so there is no way to know which characters belong together forming a Unicode code point, without inspecting the actual contents of the string.
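If you need to keep a limit measured in char values (for example, a database column sized in UTF-16 units), inspecting the contents at the cut point may be enough. The following is only a sketch under that assumption: SafeTruncate and truncate are hypothetical names, and the method only protects surrogate pairs. It does not keep multi-code-point sequences such as combining marks together.

```java
final class SafeTruncate {
    /**
     * Returns at most maxChars UTF-16 units of s, backing up by one unit
     * if the cut would otherwise fall inside a surrogate pair.
     */
    static String truncate(String s, int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars < 0");
        }
        if (s.length() <= maxChars) {
            return s; // nothing to cut
        }
        int end = maxChars;
        // s.length() > end here, so charAt(end) is in range.
        if (end > 0
                && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--; // drop the dangling high surrogate
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String text = "\uD83E\uDD66_Salade verte"; // the question's string
        System.out.println(truncate(text, 1).isEmpty()); // true: the emoji does not fit
        System.out.println(truncate(text, 2));           // the emoji alone
        System.out.println(truncate(text, 3));           // the emoji followed by _
    }
}
```

With this approach the result can be one char shorter than the limit, but it never ends on half of a pair.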
Here is a Unicode-aware substring method. Since codePoints() returns an IntStream, we can utilize the skip and limit methods to extract a portion of the string.
public static String unicodeSubstring(String string, int beginIndex, int endIndex) {
    int length = endIndex - beginIndex;
    int[] codePoints = string.codePoints()
        .skip(beginIndex)
        .limit(length)
        .toArray();
    return new String(codePoints, 0, codePoints.length);
}
This is what happens in the snippet above. We stream over the Unicode code points, skip the first beginIndex code points, limit the stream to endIndex − beginIndex elements, and then convert it to an int[]. The resulting array contains all code points from beginIndex (inclusive) up to endIndex (exclusive).
Finally, the String class has a constructor that builds a String from an int[] of code points, so we use it to get the result.
Of course, you could tweak the method to be a little more strict by rejecting out-of-bounds values:
    if (endIndex < beginIndex) {
        throw new IllegalArgumentException("endIndex < beginIndex");
    }
    int length = endIndex - beginIndex;
    int[] codePoints = string.codePoints()
        .skip(beginIndex)
        .limit(length)
        .toArray();
    if (codePoints.length < length) {
        throw new IllegalArgumentException(
            "begin %s, end %s, length %s".formatted(beginIndex, endIndex, codePoints.length)
        );
    }
    return new String(codePoints, 0, codePoints.length);
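As a possible alternative to streaming, String#offsetByCodePoints can translate code point positions into char indices, after which the ordinary substring is safe to call. This is a sketch rather than a drop-in replacement: the class and method names are made up for illustration, and out-of-range indices surface as IndexOutOfBoundsException instead of the IllegalArgumentException used above.

```java
final class CodePointSubstring {
    /**
     * Substring whose indices count Unicode code points instead of chars.
     * beginIndex is inclusive, endIndex is exclusive.
     */
    static String substringByCodePoints(String s, int beginIndex, int endIndex) {
        if (beginIndex < 0 || endIndex < beginIndex) {
            throw new IndexOutOfBoundsException(
                "begin " + beginIndex + ", end " + endIndex);
        }
        // Convert code point positions to char indices. offsetByCodePoints
        // throws IndexOutOfBoundsException if the string is too short.
        int start = s.offsetByCodePoints(0, beginIndex);
        int end = s.offsetByCodePoints(start, endIndex - beginIndex);
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "\uD83E\uDD66_Salade verte"; // the question's string
        System.out.println(substringByCodePoints(text, 0, 1)); // the emoji alone
        System.out.println(substringByCodePoints(text, 1, 3)); // _S
    }
}
```

This avoids allocating an int[] for the selected range, at the cost of scanning the string from the start to locate the offsets.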