I have a small demo app showing the problem with Java's substring implementation when a string contains Unicode code points that require surrogate pairs (i.e. code points outside the Basic Multilingual Plane, which cannot be represented in a single 16-bit char). I'm wondering whether my solution works well or whether I'm missing anything. I considered posting on Code Review, but this has much more to do with Java's implementation of Strings than with my simple code itself.
public class SubstringTest {
    public static void main(String[] args) {
        // Each of these four code points needs a surrogate pair (two chars).
        String stringWithPlus2ByteCodePoints = "👦👩👪👫";

        // String.substring takes char indices, so it can split a surrogate pair.
        String substring1 = stringWithPlus2ByteCodePoints.substring(0, 1);
        String substring2 = stringWithPlus2ByteCodePoints.substring(0, 2);
        String substring3 = stringWithPlus2ByteCodePoints.substring(1, 3);
        System.out.println(stringWithPlus2ByteCodePoints);
        System.out.println("invalid sub: " + substring1);
        System.out.println("invalid sub: " + substring2);
        System.out.println("invalid sub: " + substring3);

        // Same index arguments, interpreted as code point indices.
        String realSub1 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 1);
        String realSub2 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 2);
        String realSub3 = getRealSubstring(stringWithPlus2ByteCodePoints, 1, 3);
        System.out.println("real sub: " + realSub1);
        System.out.println("real sub: " + realSub2);
        System.out.println("real sub: " + realSub3);
    }

    // Returns the substring between two code point indices (begin inclusive, end exclusive).
    private static String getRealSubstring(String string, int beginIndex, int endIndex) {
        if (string == null)
            throw new IllegalArgumentException("String should not be null");
        // Validate against the number of code points, not the number of chars.
        int codePointCount = string.codePointCount(0, string.length());
        if (beginIndex < 0 || beginIndex > endIndex || endIndex > codePointCount)
            throw new IllegalArgumentException("Invalid indices");
        // Translate code point indices into char indices before calling substring.
        int realBeginIndex = string.offsetByCodePoints(0, beginIndex);
        int realEndIndex = string.offsetByCodePoints(0, endIndex);
        return string.substring(realBeginIndex, realEndIndex);
    }
}
The output:
👦👩👪👫
invalid sub: ?
invalid sub: 👦
invalid sub: ??
real sub: 👦
real sub: 👦👩
real sub: 👩👪
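To make the source of the ? characters concrete, here is a minimal sketch (the class name SurrogateInspection and the helper sample are illustrative, not part of the demo app). It assumes nothing beyond the standard String and Character APIs and shows that the string has twice as many chars as code points, so substring(0, 1) cuts a surrogate pair in half and leaves a lone high surrogate.

```java
class SurrogateInspection {
    // Builds the same four-emoji string from its code points
    // (U+1F466, U+1F469, U+1F46A, U+1F46B) to avoid source-encoding issues.
    static String sample() {
        return new String(new int[] {0x1F466, 0x1F469, 0x1F46A, 0x1F46B}, 0, 4);
    }

    public static void main(String[] args) {
        String s = sample();

        // length() counts UTF-16 chars, codePointCount counts code points.
        System.out.println("chars: " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));

        // A one-char substring holds only the first half of a surrogate pair.
        String broken = s.substring(0, 1);
        System.out.println("lone high surrogate: "
                + Character.isHighSurrogate(broken.charAt(0)));
    }
}
```

An unpaired surrogate is not a valid character on its own, which is why the console falls back to a replacement character for it.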
Can I rely on my substring implementation to always return the desired substring, avoiding the problems caused by Java's substring method indexing by char rather than by code point?
No need to walk to the beginIndex twice:
    public String codePointSubstring(String s, int start, int end) {
        int a = s.offsetByCodePoints(0, start);
        return s.substring(a, s.offsetByCodePoints(a, end - start));
    }
Translated from this Scala snippet:
    def codePointSubstring(s: String, begin: Int, end: Int): String = {
      val a = s.offsetByCodePoints(0, begin)
      s.substring(a, s.offsetByCodePoints(a, end - begin))
    }
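I omitted the IllegalArgumentExceptions because they don't seem to contain any more information than the exceptions that would be thrown anyway (an IndexOutOfBoundsException from offsetByCodePoints or substring for bad indices, and a NullPointerException for a null string).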
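As a rough check of that claim, the sketch below runs the same two-line method against the string from the question and reports which built-in exception surfaces for each kind of bad argument. The class name CodePointSubstringDemo and the helper describe are illustrative additions, and the method is made static here only so it can be called from main.

```java
class CodePointSubstringDemo {
    // Same logic as the method above: walk to the start once,
    // then continue from there for the remaining code points.
    static String codePointSubstring(String s, int start, int end) {
        int a = s.offsetByCodePoints(0, start);
        return s.substring(a, s.offsetByCodePoints(a, end - start));
    }

    // Hypothetical helper: returns the substring, or the kind of exception thrown.
    static String describe(String s, int start, int end) {
        try {
            return codePointSubstring(s, start, end);
        } catch (IndexOutOfBoundsException e) {
            // Also covers StringIndexOutOfBoundsException, which is a subclass.
            return "IndexOutOfBoundsException";
        } catch (NullPointerException e) {
            return "NullPointerException";
        }
    }

    public static void main(String[] args) {
        // The four emoji from the question, built from their code points.
        String s = new String(new int[] {0x1F466, 0x1F469, 0x1F46A, 0x1F46B}, 0, 4);

        System.out.println(describe(s, 1, 3));    // the two middle code points
        System.out.println(describe(s, 4, 4));    // empty string at the very end
        System.out.println(describe(s, 0, 5));    // end past the last code point
        System.out.println(describe(s, 3, 1));    // start after end
        System.out.println(describe(s, -1, 2));   // negative start
        System.out.println(describe(null, 0, 1)); // null input
    }
}
```

The only difference from the explicit checks is the exception type and message, so callers that specifically expect IllegalArgumentException would still need the validation.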