
Java substring by code point indices (treating pairs of surrogate code units as single code point)

I have a small demo app showing the issues with Java's substring implementation when a string contains Unicode code points that require surrogate pairs (i.e. that cannot be represented in a single 16-bit char). I'm wondering whether my solution works well or whether I'm missing anything. I considered posting on Code Review, but this has much more to do with Java's implementation of Strings than with my simple code itself.

public class SubstringTest {
    public static void main(String[] args) {

        String stringWithPlus2ByteCodePoints = "👦👩👪👫";

        String substring1 = stringWithPlus2ByteCodePoints.substring(0, 1);
        String substring2 = stringWithPlus2ByteCodePoints.substring(0, 2);
        String substring3 = stringWithPlus2ByteCodePoints.substring(1, 3);

        System.out.println(stringWithPlus2ByteCodePoints);
        System.out.println("invalid sub: " + substring1);
        System.out.println("invalid sub: " + substring2);
        System.out.println("invalid sub: " + substring3);

        String realSub1 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 1);
        String realSub2 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 2);
        String realSub3 = getRealSubstring(stringWithPlus2ByteCodePoints, 1, 3);
        System.out.println("real sub: " + realSub1);
        System.out.println("real sub: " + realSub2);
        System.out.println("real sub: " + realSub3);
    }

    private static String getRealSubstring(String string, int beginIndex, int endIndex) {
        if (string == null)
            throw new IllegalArgumentException("String should not be null");
        int length = string.length();
        if (endIndex < 0 || beginIndex > endIndex || beginIndex > length || endIndex > length)
            throw new IllegalArgumentException("Invalid indices");
        int realBeginIndex = string.offsetByCodePoints(0, beginIndex);
        int realEndIndex = string.offsetByCodePoints(0, endIndex);
        return string.substring(realBeginIndex, realEndIndex);
    }

}

The output:

👦👩👪👫
invalid sub: ?
invalid sub: 👦
invalid sub: ??
real sub: 👦
real sub: 👦👩
real sub: 👩👪
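
The `?` characters appear because `substring` indexes UTF-16 code units (`char`s), so a boundary at an odd index splits a surrogate pair. A minimal sketch of that, assuming the same four-emoji string as above (the class name `SurrogateInspection` and the helper `sample()` are only illustrative; the string is built from code point values so the example does not depend on source-file encoding):

```java
class SurrogateInspection {

    // Builds the same string as the demo: U+1F466, U+1F469, U+1F46A, U+1F46B.
    static String sample() {
        return new StringBuilder()
                .appendCodePoint(0x1F466).appendCodePoint(0x1F469)
                .appendCodePoint(0x1F46A).appendCodePoint(0x1F46B)
                .toString();
    }

    public static void main(String[] args) {
        String s = sample();

        // Each emoji lies outside the Basic Multilingual Plane, so it takes two chars.
        System.out.println(s.length());                      // 8
        System.out.println(s.codePointCount(0, s.length())); // 4

        // substring(0, 1) keeps only the high surrogate of the first emoji.
        String broken = s.substring(0, 1);
        System.out.println(Character.isHighSurrogate(broken.charAt(0))); // true

        // substring(1, 3) is the low surrogate of the first emoji followed by
        // the high surrogate of the second: two unpaired halves.
        String middle = s.substring(1, 3);
        System.out.println(Character.isLowSurrogate(middle.charAt(0)));  // true
        System.out.println(Character.isHighSurrogate(middle.charAt(1))); // true
    }
}
```

Only `substring(0, 2)` happens to land on a pair boundary, which is why it is the one "invalid" call that still prints a whole emoji.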

Can I rely on my implementation to always return the desired substring, avoiding the problems caused by Java's substring method indexing by char?

Sebastiaan van den Broek asked Sep 06 '25 03:09


1 Answer

No need to walk to the beginIndex twice:

    public String codePointSubstring(String s, int start, int end) {
        int a = s.offsetByCodePoints(0, start);
        return s.substring(a, s.offsetByCodePoints(a, end - start));
    }

Translated from this Scala snippet:

def codePointSubstring(s: String, begin: Int, end: Int): String = {
  val a = s.offsetByCodePoints(0, begin)
  s.substring(a, s.offsetByCodePoints(a, end - begin))
}

I omitted the IllegalArgumentException checks, because they don't seem to carry any more information than the IndexOutOfBoundsException that offsetByCodePoints and substring would throw anyway.
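
To illustrate that last point, here is a minimal sketch (the class name `CodePointSubstringDemo` and the sample inputs are assumptions made for illustration) that runs the single-walk version on a string of four supplementary code points and shows an out-of-range argument being rejected without any explicit check:

```java
class CodePointSubstringDemo {

    // Same logic as above: walk to the start once, then continue from there.
    static String codePointSubstring(String s, int start, int end) {
        int a = s.offsetByCodePoints(0, start);
        return s.substring(a, s.offsetByCodePoints(a, end - start));
    }

    public static void main(String[] args) {
        // U+1F466, U+1F469, U+1F46A, U+1F46B, built from code point values.
        String s = new StringBuilder()
                .appendCodePoint(0x1F466).appendCodePoint(0x1F469)
                .appendCodePoint(0x1F46A).appendCodePoint(0x1F46B)
                .toString();

        // Code points 1 and 2, i.e. the second and third emoji.
        System.out.println(codePointSubstring(s, 1, 3));

        // Mixed input: one BMP char followed by the emoji string.
        System.out.println(codePointSubstring("a" + s, 0, 2));

        // Asking for 5 code points when only 4 exist is rejected
        // by offsetByCodePoints itself.
        try {
            codePointSubstring(s, 0, 5);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("rejected: " + e.getClass().getSimpleName());
        }
    }
}
```

A negative `start`, or a `start` greater than `end`, ends in an `IndexOutOfBoundsException` (or its subclass `StringIndexOutOfBoundsException`) in the same way, so the only thing the explicit checks add is a different exception type.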

Andrey Tyukin answered Sep 07 '25 19:09