Java substring by code point indices (treating pairs of surrogate code units as a single code point)

Question

I have a small demo app showing the issues with Java's substring implementation when a string contains Unicode code points that require surrogate pairs (i.e. code points outside the Basic Multilingual Plane, which do not fit in a single 16-bit char). I'm wondering whether my solution works well or whether I'm missing anything. I considered posting on Code Review, but this has much more to do with Java's implementation of String than with my simple code itself.

public class SubstringTest {
    public static void main(String[] args) {

        String stringWithPlus2ByteCodePoints = "👦👩👪👫";

        String substring1 = stringWithPlus2ByteCodePoints.substring(0, 1);
        String substring2 = stringWithPlus2ByteCodePoints.substring(0, 2);
        String substring3 = stringWithPlus2ByteCodePoints.substring(1, 3);

        System.out.println(stringWithPlus2ByteCodePoints);
        System.out.println("invalid sub: " + substring1);
        System.out.println("invalid sub: " + substring2);
        System.out.println("invalid sub: " + substring3);

        String realSub1 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 1);
        String realSub2 = getRealSubstring(stringWithPlus2ByteCodePoints, 0, 2);
        String realSub3 = getRealSubstring(stringWithPlus2ByteCodePoints, 1, 3);
        System.out.println("real sub: " + realSub1);
        System.out.println("real sub: " + realSub2);
        System.out.println("real sub: " + realSub3);
    }

    private static String getRealSubstring(String string, int beginIndex, int endIndex) {
        if (string == null)
            throw new IllegalArgumentException("String should not be null");
        int length = string.length();
        if (endIndex < 0 || beginIndex > endIndex || beginIndex > length || endIndex > length)
            throw new IllegalArgumentException("Invalid indices");
        int realBeginIndex = string.offsetByCodePoints(0, beginIndex);
        int realEndIndex = string.offsetByCodePoints(0, endIndex);
        return string.substring(realBeginIndex, realEndIndex);
    }

}

The output:

👦👩👪👫
invalid sub: ?
invalid sub: 👦
invalid sub: ??
real sub: 👦
real sub: 👦👩
real sub: 👩👪
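The "?" entries come from counting in UTF-16 code units rather than code points. A minimal sketch of that difference, where the class and helper names are made up for illustration and are not part of the original demo:

```java
public class CodeUnitVsCodePoint {

    // Number of UTF-16 code units; this is the unit String.substring indexes by.
    public static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "👦👩👪👫";
        System.out.println(codeUnits(s));   // 8
        System.out.println(codePoints(s));  // 4
        // Index 1 sits in the middle of the first surrogate pair.
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
    }
}
```

So substring(0, 1) returns a lone high surrogate, and substring(1, 3) returns a low surrogate followed by an unrelated high surrogate, neither of which can be printed as a character.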

Can I rely on my substring implementation to always give the desired substring that avoids Java's issues with using chars for its substring method?

Andrey Tyukin · Accepted Answer

No need to walk to the beginIndex twice:

    public String codePointSubstring(String s, int start, int end) {
        int a = s.offsetByCodePoints(0, start);
        return s.substring(a, s.offsetByCodePoints(a, end - start));
    }

Translated from this Scala snippet:

def codePointSubstring(s: String, begin: Int, end: Int): String = {
  val a = s.offsetByCodePoints(0, begin)
  s.substring(a, s.offsetByCodePoints(a, end - begin))
}

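I omitted the IllegalArgumentExceptions because they don't seem to carry any more information than the exceptions that would be thrown anyway (an IndexOutOfBoundsException from offsetByCodePoints or substring for bad indices, a NullPointerException for a null string).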
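To see what "thrown anyway" looks like in practice, here is a small runnable sketch. The wrapper class name is hypothetical and the method is made static so it can be called from main; the method body is the same as the snippet above.

```java
public class CodePointSubstringDemo {

    // Same logic as above: walk to the start once, then continue from there.
    public static String codePointSubstring(String s, int start, int end) {
        int a = s.offsetByCodePoints(0, start);
        return s.substring(a, s.offsetByCodePoints(a, end - start));
    }

    public static void main(String[] args) {
        String s = "👦👩👪👫";
        System.out.println(codePointSubstring(s, 0, 1)); // 👦
        System.out.println(codePointSubstring(s, 1, 3)); // 👩👪

        // Indices are code point indices, so 5 is out of range for 4 code points.
        try {
            codePointSubstring(s, 0, 5);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("bad indices: " + e.getClass().getSimpleName());
        }

        // A null string fails on the first method call.
        try {
            codePointSubstring(null, 0, 1);
        } catch (NullPointerException e) {
            System.out.println("null string: " + e.getClass().getSimpleName());
        }
    }
}
```

If you want IllegalArgumentException specifically, or a message that mentions code point counts, you would still need your own checks; in that case compare against s.codePointCount(0, s.length()) rather than s.length().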

Tags: java, string, char, character-encoding, unicode

Sebastiaan van den Broek
