This is my code
import java.util.Date;
import java.text.DateFormat;
class DateTime {
    public static void main(String[] args) {
        String dt = DateFormat.getDateTimeInstance().format(new Date());
        System.out.println(dt);
    }
}
When compiled and executed with Java 21, the call to 'format()' returns a UTF-16 string containing invalid bytes, represented by a question mark:
Oct 3, 2023, 7:01:17?PM
Has anyone else seen this problem? Thanks.
The Answer by David Conrad is correct. What you are seeing is a new feature, not a bug.
The localization rules defined in the Unicode Consortium’s Common Locale Data Repository (CLDR) are continually evolving. Modern Java relies upon the CLDR as its main source of localization rules. So new versions of the CLDR bring new behaviors in Java.
This is life in the real world. Never harden your expectation of localized values. Those localizations may change in future versions of the CLDR, Java, and human cultures.
If localization behavior is critical to some logic in your code, write unit tests to verify that behavior.
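As a rough illustration of such a test, here is a minimal sketch that formats a fixed date-time and reports which character sits in front of the AM/PM marker. The class name `LocalizedFormatCheck`, its helper methods, and the set of accepted separators are my own assumptions for illustration. A real project would likely use a test framework such as JUnit and pin exactly the value its logic depends on.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

public class LocalizedFormatCheck {

    // Formats a fixed date-time, so the result depends on the locale data rather than the clock.
    static String formatFixed() {
        LocalDateTime ldt = LocalDateTime.of(2023, 10, 3, 19, 1, 17);
        DateTimeFormatter f = DateTimeFormatter.ofLocalizedDateTime(FormatStyle.MEDIUM).withLocale(Locale.US);
        return ldt.format(f);
    }

    // Returns the code point immediately before the last AM/PM marker in the text.
    static int separatorBeforeAmPm(String text) {
        int idx = Math.max(text.lastIndexOf("AM"), text.lastIndexOf("PM"));
        if (idx <= 0) {
            throw new IllegalArgumentException("No AM/PM marker found in: " + text);
        }
        return text.codePointBefore(idx);
    }

    public static void main(String[] args) {
        int separator = separatorBeforeAmPm(formatFixed());
        System.out.println("separator = " + String.format("U+%04X", separator));
        // Assumption: this hypothetical application tolerates either a plain SPACE or an NNBSP.
        if (separator != 0x0020 && separator != 0x202F) {
            throw new AssertionError("Unexpected separator: " + String.format("U+%04X", separator));
        }
    }
}
```

If a future CLDR or Java release changes the separator again, a check like this fails loudly instead of letting the change slip through unnoticed.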
If you want a precise, reliable textual representation of date-time values, use only standard formats such as ISO 8601. Localization is for human reading, not machine reading.
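For example, the java.time classes use ISO 8601 formats by default when generating and parsing text. The following is a minimal sketch of a round trip through such text. The class name `IsoRoundTrip` and its helper methods are hypothetical names chosen for this illustration.

```java
import java.time.Instant;

public class IsoRoundTrip {

    // Instant.toString generates text in standard ISO 8601 format.
    static String toIso(Instant instant) {
        return instant.toString();
    }

    // Instant.parse reads that same standard format back in.
    static Instant fromIso(String text) {
        return Instant.parse(text);
    }

    public static void main(String[] args) {
        Instant original = Instant.parse("2023-10-03T19:01:17Z");
        String text = toIso(original);
        Instant parsed = fromIso(text);
        System.out.println(text);                    // 2023-10-03T19:01:17Z
        System.out.println(parsed.equals(original)); // true
    }
}
```

No locale is involved here, so changes to the CLDR have no effect on this text.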
We can verify Conrad’s claim that you are indeed seeing a U+202F NARROW NO-BREAK SPACE (NNBSP). Let's examine each character in your output.
We can inspect each character to get its number assigned by the Unicode Consortium, its code point. Our NNBSP character has a code point of 8,239 decimal, 202F hex.
String dt = DateFormat.getDateTimeInstance ( ).format ( new Date ( ) );
System.out.println ( dt );
String codePoints = dt.codePoints ( ).boxed ( ).toList ( ).toString ( );
System.out.println ( "codePoints = " + codePoints );
When run:
Oct 3, 2023, 6:02:35 PM
codePoints = [79, 99, 116, 32, 51, 44, 32, 50, 48, 50, 51, 44, 32, 54, 58, 48, 50, 58, 51, 53, 8239, 80, 77]
Sure enough, we see the 8239 of our NNBSP is third from the end, before the P and the M.
I would like to add a note about this change in the CLDR: This change is a good one, and makes sense. In logical typographical thinking, the AM/PM of a time-of-day should never be separated from the hours-minutes-seconds. Wrapping AM/PM onto another line makes for clumsy reading. Using a non-breaking space rather than a plain breaking space makes sense. Whether that space should be "thin" is a judgement I'll leave to the typography experts, but I gather that makes sense as well.
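As a side note, Java's own `Character` class reflects the non-breaking nature of this character. Here is a small sketch, with a hypothetical class name `NnbspProperties`, showing that Java counts the NNBSP as a Unicode space character but deliberately excludes it from its definition of whitespace. One consequence is that `String.strip` leaves the character in place.

```java
public class NnbspProperties {

    static final int NNBSP = 0x202F; // U+202F NARROW NO-BREAK SPACE

    // Length of the text after stripping leading and trailing whitespace.
    static int strippedLength(String text) {
        return text.strip().length();
    }

    public static void main(String[] args) {
        System.out.println(Character.getName(NNBSP));      // NARROW NO-BREAK SPACE
        System.out.println(Character.isSpaceChar(NNBSP));  // true: a Unicode space separator
        System.out.println(Character.isWhitespace(NNBSP)); // false: non-breaking spaces are excluded
        System.out.println(strippedLength("PM\u202F"));    // 3: strip() does not remove the NNBSP
        System.out.println(strippedLength("PM "));         // 2: a plain SPACE is removed
    }
}
```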
The immediate solution to your problem of a ? replacement character appearing is to 👉🏾 change the character-encoding of your console app. Whatever console app you are using (which you neglected to mention in your Question) is apparently configured for some archaic character encoding rather than a modern Unicode-savvy character encoding such as UTF-8.
Change the character encoding of your console app (see Comment). Then your errant ? should appear as the true character, a thin non-breaking space.
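To see how a ? can arise, consider this minimal sketch. It uses US-ASCII as a stand-in for whatever legacy encoding the console is actually using, which is an assumption on my part since the Question does not say. The class name `EncodingDemo` and its helper methods are hypothetical. The sketch also shows wrapping `System.out` in a `PrintStream` that writes UTF-8, which helps only if the console itself is set to interpret UTF-8.

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Encodes to US-ASCII and decodes again. Characters that cannot be mapped become '?'.
    static String asAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // Number of bytes needed to represent the text in UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "7:01:17\u202FPM";

        // A legacy encoding cannot represent U+202F, so it is replaced.
        System.out.println(asAscii(text));        // 7:01:17?PM

        // UTF-8 represents the NNBSP in three bytes.
        System.out.println(utf8Length("\u202F")); // 3

        // Write the original text as UTF-8 bytes. The console must be configured for UTF-8 to render it.
        PrintStream utf8Out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        utf8Out.println(text);
    }
}
```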
You are using terribly flawed date-time classes that were supplanted years ago by the modern java.time classes defined in JSR 310. Avoid the legacy date-time classes. Use only java.time for date-time work.
Your choice of legacy classes is not a factor in the particular issue of your Question. But just FYI, let me show you the modern version of your code.
An Instant object represents a moment as seen in UTC, that is, with an offset from UTC of zero hours-minutes-seconds. You can adjust that moment into a time zone, obtaining a ZonedDateTime. Same point on the timeline, but different wall-clock time/calendar.
Instant instant = Instant.now ( ); // `java.util.Date` was years ago replaced by `java.time.Instant`.
ZoneId z = ZoneId.of ( "Asia/Tokyo" ); // Or, `ZoneId.systemDefault()`.
ZonedDateTime zdt = instant.atZone ( z );
Locale locale = Locale.US;
DateTimeFormatter f = DateTimeFormatter.ofLocalizedDateTime ( FormatStyle.MEDIUM ).withLocale ( locale );
String output = zdt.format ( f );
System.out.println ( "output = " + output );
System.out.println ( output.codePoints ( ).boxed ( ).toList ( ).toString ( ) );
When run:
output = Oct 4, 2023, 10:21:32 AM
[79, 99, 116, 32, 52, 44, 32, 50, 48, 50, 51, 44, 32, 49, 48, 58, 50, 49, 58, 51, 50, 8239, 65, 77]
We see the same 8239 before the A and the M.
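To make the "same point on the timeline, but different wall-clock time" idea concrete, here is a minimal sketch using a fixed moment rather than the current one. The class name `SameMoment`, the helper `inZone`, and the choice of Europe/Paris as a second zone are my own additions for illustration.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class SameMoment {

    // Adjusts a moment in UTC into the wall-clock time of the named time zone.
    static ZonedDateTime inZone(Instant instant, String zoneName) {
        return instant.atZone(ZoneId.of(zoneName));
    }

    public static void main(String[] args) {
        Instant instant = Instant.parse("2023-10-04T01:21:32Z");
        ZonedDateTime tokyo = inZone(instant, "Asia/Tokyo");
        ZonedDateTime paris = inZone(instant, "Europe/Paris");

        System.out.println(tokyo); // 2023-10-04T10:21:32+09:00[Asia/Tokyo]
        System.out.println(paris); // 2023-10-04T03:21:32+02:00[Europe/Paris]

        // Different wall-clock times, yet both represent the very same moment.
        System.out.println(tokyo.toInstant().equals(paris.toInstant())); // true
    }
}
```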
We can examine the characters by their official Unicode names.
output.codePoints ( ).mapToObj ( Character :: getName ).forEach ( System.out :: println );
When run:
LATIN CAPITAL LETTER O
LATIN SMALL LETTER C
LATIN SMALL LETTER T
SPACE
DIGIT FIVE
COMMA
SPACE
DIGIT TWO
DIGIT ZERO
DIGIT TWO
DIGIT THREE
COMMA
SPACE
DIGIT ONE
DIGIT ZERO
COLON
DIGIT ZERO
DIGIT TWO
COLON
DIGIT TWO
DIGIT SIX
NARROW NO-BREAK SPACE
LATIN CAPITAL LETTER A
LATIN CAPITAL LETTER M
Notice the NARROW NO-BREAK SPACE, third from last.
And we can examine the characters by their code point in hexadecimal rather than decimal.
output.codePoints ( ).mapToObj ( ( int codePoint ) -> String.format ( "U+%04X" , codePoint ) ).forEach ( System.out :: println );
When run:
U+004F
U+0063
U+0074
U+0020
U+0035
U+002C
U+0020
U+0032
U+0030
U+0032
U+0033
U+002C
U+0020
U+0031
U+0030
U+003A
U+0030
U+0035
U+003A
U+0031
U+0037
U+202F
U+0041
U+004D
Notice the U+202F, third from last.
This topic turns out to be an interesting can of worms for Unicode geeks like me.
Section 1 of the Unicode Consortium document, Proposal to synchronize the Core Specification, explains that the character U+202F NARROW NO-BREAK SPACE (NNBSP) has been incorrectly described as a narrow version of U+00A0 NO-BREAK SPACE. This means the Width variation section of the Non-breaking space page on Wikipedia is incorrect. That Unicode document argues that NNBSP is actually a non-breaking version of U+2009 THIN SPACE.
Another interesting note in that document is that the NNBSP character has largely served two purposes. I quote (my bullets):
Apparently we can now add a third major use to that list: formatting in date-time formats defined by the CLDR.