I need to display the first symbol of a string. The simplest code for this would be:
String text = "test string";
char firstSymbol = text[0];
But this doesn't work if the character doesn't fit in 16 bits, for example "\uD83D\uDC68" (👨, U+1F468). Only half of the character (the first surrogate) is returned, and it is rendered as a question mark. Reading the first code point instead avoids that:
String text = "test string";
int codePoint = text.codePointAt(0);
char[] chars = Character.toChars(codePoint);
String firstSymbol = new String(chars);
This works well for any character that is represented by a single Unicode code point. However, there are sequences of Unicode characters that are displayed as one symbol. When I run the code above on them, only part of the symbol is displayed, as happens for "\uD83D\uDC68\u200D\uD83D\uDCBB" (👨‍💻). In this case I want the result to be the whole string. How can I handle such cases?
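To make the two failure modes concrete, here is a small sketch that counts chars and code points for both examples. The class and method names (`CodePointDemo`, `codePoints`, `firstCodePoint`) are made up for illustration; only standard `String` and `Character` methods are used.

```java
final class CodePointDemo {
    // Counts the Unicode code points in text (a surrogate pair counts once).
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    // The approach from the question: the first code point as a String.
    static String firstCodePoint(String text) {
        return new String(Character.toChars(text.codePointAt(0)));
    }

    public static void main(String[] args) {
        String man = "\uD83D\uDC68";                            // U+1F468
        String technologist = "\uD83D\uDC68\u200D\uD83D\uDCBB"; // U+1F468 U+200D U+1F4BB

        System.out.println(man.length());                             // 2
        System.out.println(Character.isHighSurrogate(man.charAt(0))); // true
        System.out.println(codePoints(man));                          // 1
        System.out.println(technologist.length());                    // 5
        System.out.println(codePoints(technologist));                 // 3
        System.out.println(firstCodePoint(technologist).equals(man)); // true
    }
}
```

So charAt(0) yields a lone high surrogate, and the code point approach yields only the first of the three code points that make up the joined symbol.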
Edit: the first example should use charAt(), of course; my mistake. A char is a single UTF-16 code unit, though, so it can't hold several characters. The first example should be this:
String text = "test string";
char firstSymbol = text.charAt(0);
Another tough example of a single symbol is "\u0D23\u0D4D\u200D" (ണ്). It consists of two characters followed by a zero-width joiner.
I have tried the android.icu library, which is derived from ICU4J, but unfortunately it is only supported starting from API 24. Moreover, it produces the same result as the second example, i.e. it doesn't join characters when a zero-width joiner is between them.
BreakIterator breakIterator = BreakIterator.getCharacterInstance();
breakIterator.setText(text);
int begin = breakIterator.first();
int end = breakIterator.next();
String firstSymbol = text.substring(begin, end);
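For API levels below 24, the same pattern can be sketched with java.text.BreakIterator, which is part of the standard Java library. This is only a sketch: the names `FirstGrapheme` and `firstGrapheme` are hypothetical, and whether a zero-width-joiner sequence is kept together depends on the Unicode rules shipped with the runtime, so it may well show the same limitation described above.

```java
import java.text.BreakIterator;

final class FirstGrapheme {
    // Returns the text up to the first character boundary reported by
    // java.text.BreakIterator, or the text itself if it is empty.
    static String firstGrapheme(String text) {
        if (text.isEmpty()) {
            return text;
        }
        BreakIterator iterator = BreakIterator.getCharacterInstance();
        iterator.setText(text);
        int begin = iterator.first();
        int end = iterator.next();
        // next() returns DONE only when there is no further boundary.
        return end == BreakIterator.DONE ? text : text.substring(begin, end);
    }
}
```

For example, `FirstGrapheme.firstGrapheme("test string")` returns "t", and a surrogate pair at the start of the string is returned whole rather than split.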
\u200D is the Unicode code point U+200D ZERO WIDTH JOINER. If you want to extract a sequence of joined code points, you will have to iterate the string manually until you encounter a code point that is not joined, e.g.:
String text = ...;
StringBuilder sequence = new StringBuilder(text.length());
boolean isInJoin = false; // true while the previous code point was a joiner
int codePoint;
for (int i = 0; i < text.length(); i = text.offsetByCodePoints(i, 1))
{
    codePoint = text.codePointAt(i);
    if (codePoint == 0x200D)
    {
        isInJoin = true;
        // skip joiners that come before the first real code point
        if (sequence.length() == 0)
            continue;
    }
    else
    {
        // a code point that is not preceded by a joiner ends the sequence
        if ((sequence.length() > 0) && (!isInJoin)) break;
        isInJoin = false;
    }
    sequence.appendCodePoint(codePoint);
}
// drop joiners left dangling at the end of the sequence
if (isInJoin)
{
    for (int i = sequence.length()-1; i >= 0; --i)
    {
        if (sequence.charAt(i) == 0x200D)
            sequence.deleteCharAt(i);
        else
            break;
    }
}
String firstSymbols = sequence.toString();
Alternatively:
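String text = ...;
boolean isInJoin = false; // true while the previous code point was a joiner
int start = 0, length = 0, next;
int codePoint;
for (int i = 0; i < text.length(); i = next)
{
    codePoint = text.codePointAt(i);
    if (codePoint == 0x200D)
    {
        isInJoin = true;
        // skip joiners that come before the first real code point
        if (length == 0)
        {
            next = text.offsetByCodePoints(i, 1);
            start = next;
            continue;
        }
    }
    else
    {
        // a code point that is not preceded by a joiner ends the sequence
        if ((length > 0) && (!isInJoin)) break;
        isInJoin = false;
    }
    next = text.offsetByCodePoints(i, 1);
    length += (next - i);
}
// drop joiners left dangling at the end of the sequence
if (isInJoin)
{
    for (int i = length-1; i >= 0; --i)
    {
        // index from start, because leading joiners may have been skipped
        if (text.charAt(start + i) == 0x200D)
            --length;
        else
            break;
    }
}
String firstSymbols = text.substring(start, start+length);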
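For reuse, the same idea can be wrapped in a method. The following is a compact sketch of the joiner-following approach, not a full grapheme cluster implementation; the names `FirstSymbol` and `firstJoinedSequence` are made up for illustration. It only follows U+200D, so a combining sequence such as "\u0D23\u0D4D\u200D" still yields just its first code point.

```java
final class FirstSymbol {
    private static final int ZWJ = 0x200D; // ZERO WIDTH JOINER

    // Returns the first code point of text plus every code point that is
    // attached to it through zero-width joiners. Leading and trailing
    // joiners are not included in the result.
    static String firstJoinedSequence(String text) {
        int i = 0;
        // Skip joiners that come before the first real code point.
        while (i < text.length() && text.codePointAt(i) == ZWJ) {
            i += Character.charCount(ZWJ);
        }
        int start = i;
        int end = i;           // end of the last accepted non-joiner code point
        boolean joined = true; // the first code point is always accepted
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            if (codePoint == ZWJ) {
                joined = true;
            } else {
                if (!joined) {
                    break; // not attached to the sequence, so stop here
                }
                joined = false;
                end = i + Character.charCount(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return text.substring(start, end);
    }
}
```

With this sketch, `FirstSymbol.firstJoinedSequence("\uD83D\uDC68\u200D\uD83D\uDCBBabc")` returns the whole five-char joined sequence, while "test string" gives "t".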