Java 21 problem with DateFormat.getDateTimeInstance().format(new Date())

Question

This is my code:

import java.util.Date;
import java.text.DateFormat;

class DateTime {
    public static void main(String[] args) {
        String dt = DateFormat.getDateTimeInstance().format(new Date());
        System.out.println(dt);
    }
}

When compiled and executed with Java 21, the call to 'format()' returns a UTF-16 string containing invalid bytes, represented by a question mark:

Oct 3, 2023, 7:01:17?PM

Has anyone else seen this problem? Thanks.

Basil Bourque · Accepted Answer

New feature, not a bug

The Answer by David Conrad is correct. What you are seeing is a new feature, not a bug.

New version of CLDR

The localization rules defined in the Unicode Consortium’s Common Locale Data Repository (CLDR) are continually evolving. Modern Java relies upon the CLDR as its main source of localization rules. So new versions of the CLDR bring new behaviors in Java.

Localizations evolve

This is life in the real world. Never harden your expectation of localized values. Those localizations may change in future versions of the CLDR, Java, and human cultures.

If localization behavior is critical to some logic in your code, write unit tests to verify that behavior.
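As a rough illustration of such a test, here is a minimal sketch in plain Java with no test framework. The class `LocalizedFormatCheck` and its helpers are hypothetical names, not part of the Question or of any library. It assumes the `Locale.US` medium format ends in a two-letter AM/PM marker, and it reports which code point sits just before that marker on the current runtime.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

class LocalizedFormatCheck {
    // Hypothetical helper: the localized formatting that some application logic depends on.
    static String formatMediumUs(ZonedDateTime zdt) {
        DateTimeFormatter f = DateTimeFormatter.ofLocalizedDateTime(FormatStyle.MEDIUM).withLocale(Locale.US);
        return zdt.format(f);
    }

    // Returns the code point immediately before the AM/PM marker.
    // Assumption: the text ends with a two-letter marker such as "AM" or "PM".
    static int separatorBeforeAmPm(String text) {
        return text.codePointBefore(text.length() - 2);
    }

    public static void main(String[] args) {
        // A fixed moment, so the check does not depend on the current clock or default zone.
        ZonedDateTime fixed = Instant.parse("2023-10-03T19:01:17Z").atZone(ZoneOffset.UTC);
        String text = formatMediumUs(fixed);
        int separator = separatorBeforeAmPm(text);
        System.out.println(String.format("separator = U+%04X", separator));
        // A real unit test would assert the one value your logic relies on,
        // so that a CLDR or Java upgrade that changes it fails loudly.
        System.out.println("uses NNBSP = " + (separator == 0x202F));
    }
}
```

Running a check like this on each Java version you deploy to shows right away when a CLDR update has changed the separator.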

ISO 8601

If you want a precise, reliable textual representation of date-time values, use only standard formats such as ISO 8601. Localization is for human reading, not machine reading.
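For example, the java.time classes use ISO 8601 by default when generating and parsing text. The sketch below round-trips a moment through its ISO 8601 text; the class name `Iso8601RoundTrip` and its helper methods are made up for illustration.

```java
import java.time.Instant;

class Iso8601RoundTrip {
    // Instant.toString generates text in standard ISO 8601 format.
    static String toIso(Instant instant) {
        return instant.toString();
    }

    // Instant.parse reads that same ISO 8601 format back.
    static Instant fromIso(String text) {
        return Instant.parse(text);
    }

    public static void main(String[] args) {
        Instant original = Instant.parse("2023-10-03T23:01:17Z");
        String text = toIso(original);
        Instant restored = fromIso(text);
        System.out.println(text);                       // 2023-10-03T23:01:17Z
        System.out.println(original.equals(restored));  // true
    }
}
```

No locale is involved here, so a CLDR update cannot change this text.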

Detecting NNBSP character

We can verify Conrad’s claim that you are indeed seeing a U+202F NARROW NO-BREAK SPACE (NNBSP). Let's examine each character in your output.

We can inspect each character to get its number assigned by the Unicode Consortium, its code point. Our NNBSP character has a code point of 8,239 decimal, 202F hex.

String dt = DateFormat.getDateTimeInstance ( ).format ( new Date ( ) );
System.out.println ( dt );
String codePoints = dt.codePoints ( ).boxed ( ).toList ( ).toString ( );
System.out.println ( "codePoints = " + codePoints );

When run:

Oct 3, 2023, 6:02:35 PM
codePoints = [79, 99, 116, 32, 51, 44, 32, 50, 48, 50, 51, 44, 32, 54, 58, 48, 50, 58, 51, 53, 8239, 80, 77]

Sure enough, we see the 8239 of our NNBSP is third from the end, before the P and the M.

Change is good

I would like to add a note about this change in the CLDR: This change is a good one, and makes sense. In logical typographical thinking, the AM/PM of a time-of-day should never be separated from the hours-minutes-seconds. Wrapping AM/PM onto another line makes for clumsy reading. Using a non-breaking space rather than a plain breaking space makes sense. Whether that space should also be "thin" is a judgment I'll leave to the typography experts, but I gather that makes sense as well.

Solution: Fix your console

The immediate solution to your problem of a ? replacement character appearing is to 👉🏾 change the character-encoding of your console app. Whatever console app you are using (which you neglected to mention in your Question) is apparently configured for some archaic character encoding rather than a modern Unicode-savvy character encoding such as UTF-8.

Change the character encoding of your console app (see Comment). Then your errant ? should appear as the true character, a thin non-breaking space.
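To see where the ? comes from, here is a small sketch that pushes text through a non-Unicode charset, roughly as a misconfigured output path would. The class `ConsoleEncodingDemo` and the method `throughCharset` are hypothetical names. US-ASCII stands in for whatever legacy encoding your console actually uses, which the Question does not say.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ConsoleEncodingDemo {
    // Encodes the text to bytes in the given charset, then decodes it again.
    // Characters the charset cannot represent are replaced during encoding.
    static String throughCharset(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String sample = "7:01:17\u202FPM";  // Contains U+202F NARROW NO-BREAK SPACE.

        // US-ASCII has no NNBSP, so the encoder substitutes a question mark.
        System.out.println(throughCharset(sample, StandardCharsets.US_ASCII));  // 7:01:17?PM

        // Wrapping System.out so that Java emits UTF-8 bytes.
        // The console app itself must also be set to interpret its input as UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        utf8Out.println(sample);
    }
}
```

The wrapped stream only controls the bytes Java writes. The console still has to be configured to decode those bytes as UTF-8.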

Avoid legacy date-time classes

You are using terribly flawed date-time classes that were supplanted years ago by the modern java.time classes defined in JSR 310. Avoid the legacy date-time classes; use java.time instead for date-time work.

Your choice of legacy classes is not a factor in the particular issue of your Question. But just FYI, let me show you the modern version of your code.

An Instant object represents a moment as seen in UTC, that is, with an offset from UTC of zero hours-minutes-seconds. You can adjust that moment into a time zone, obtaining a ZonedDateTime. Same point on the timeline, but different wall-clock time/calendar.

Instant instant = Instant.now ( ); // `java.util.Date` was years ago replaced by `java.time.Instant`.
ZoneId z = ZoneId.of ( "Asia/Tokyo" );  // Or, `ZoneId.systemDefault()`.
ZonedDateTime zdt = instant.atZone ( z );
Locale locale = Locale.US;  
DateTimeFormatter f = DateTimeFormatter.ofLocalizedDateTime ( FormatStyle.MEDIUM ).withLocale ( locale );
String output = zdt.format ( f );
System.out.println ( "output = " + output );
System.out.println ( output.codePoints ( ).boxed ( ).toList ( ).toString ( ) );

When run:

output = Oct 4, 2023, 10:21:32 AM
[79, 99, 116, 32, 52, 44, 32, 50, 48, 50, 51, 44, 32, 49, 48, 58, 50, 49, 58, 51, 50, 8239, 65, 77]

We see the same 8239 before the A and the M.

We can examine the characters by their official Unicode names.

output.codePoints ( ).mapToObj ( Character :: getName ).forEach ( System.out :: println );

When run (this output comes from a different run than the one above, so the digits differ):

LATIN CAPITAL LETTER O
LATIN SMALL LETTER C
LATIN SMALL LETTER T
SPACE
DIGIT FIVE
COMMA
SPACE
DIGIT TWO
DIGIT ZERO
DIGIT TWO
DIGIT THREE
COMMA
SPACE
DIGIT ONE
DIGIT ZERO
COLON
DIGIT ZERO
DIGIT TWO
COLON
DIGIT TWO
DIGIT SIX
NARROW NO-BREAK SPACE
LATIN CAPITAL LETTER A
LATIN CAPITAL LETTER M

Notice the NARROW NO-BREAK SPACE, third from last.

And we can examine the characters by their code point in hexadecimal rather than decimal.

output.codePoints ( ).mapToObj ( ( int codePoint ) -> String.format ( "U+%04X" , codePoint ) ).forEach ( System.out :: println );

When run (again from a separate run, so the digits differ):

U+004F
U+0063
U+0074
U+0020
U+0035
U+002C
U+0020
U+0032
U+0030
U+0032
U+0033
U+002C
U+0020
U+0031
U+0030
U+003A
U+0030
U+0035
U+003A
U+0031
U+0037
U+202F
U+0041
U+004D

Notice the U+202F, third from last.

For Unicode geeks

This topic turns out to be an interesting can of worms for Unicode geeks like me.

Section 1 of the Unicode Consortium document, Proposal to synchronize the Core Specification, explains that character U+202F NARROW NO-BREAK SPACE (NNBSP) has been incorrectly described as a narrow version of U+00A0 NO-BREAK SPACE. This means the Width variation section of the Non-breaking space page on Wikipedia is incorrect. That Unicode document argues that NNBSP is actually a non-breaking version of U+2009 THIN SPACE.

Another interesting note in that document is that the NNBSP character has largely served two purposes. I quote (my bullets):

- The NNBSP can be used to represent the narrow space occurring around punctuation characters in French typography, which is called an “espace fine insécable.”
- It is used especially in Mongolian text, before certain grammatical suffixes, to provide a small gap that not only prevents word breaking and line breaking, but also triggers special shaping for those suffixes.

Apparently we can now add a third major use to this list: formatting in date-time formats defined by the CLDR.


Tags: java, datetime, formatting, non-breaking-characters, java-21

Spencer Shellman
