How to detect if a file is not utf-8 encoded?

Tags: java, utf-8

In Java, how can I test whether a file's encoding is definitely not UTF-8?

I want to validate that the contents are well-formed UTF-8.

I also need to verify that the file does not start with a byte order mark (BOM).

yas Avatar asked Sep 12 '25 15:09



1 Answer

If you just need to test the file, without actually retaining its contents:

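import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharacterCodingException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Path path = Paths.get("/home/dave/somefile.txt");
// No charset argument, so the reader decodes as UTF-8.
try (Reader reader = Files.newBufferedReader(path)) {
    int c = reader.read();
    if (c == 0xfeff) {
        System.out.println("File starts with a byte order mark.");
    } else if (c >= 0) {
        // Decode the rest of the file and throw the characters away.
        reader.transferTo(Writer.nullWriter());
    }
} catch (CharacterCodingException e) {
    // Thrown when the decoder meets a malformed byte sequence.
    System.out.println("Not a UTF-8 file.");
}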
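  • Files.newBufferedReader always uses UTF-8 if no charset is provided, and it reports malformed bytes by throwing a MalformedInputException (a subclass of CharacterCodingException) instead of silently substituting a replacement character.
  • 0xfeff is the byte order mark code point (U+FEFF). Java's UTF-8 decoder does not strip it, so a BOM shows up as the first character read.
  • reader.transferTo(Writer.nullWriter()) (available as of Java 11) decodes the rest of the file and immediately discards the result.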
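
If you would rather work on the raw bytes, the same two checks can be sketched with an explicit CharsetDecoder. This is only an illustration, not the only way to do it: the class name Utf8Check and the method names isWellFormedUtf8 and startsWithBom are made up for this example, and it assumes the file is small enough to read into memory in one go.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class; the names are illustrative only.
final class Utf8Check {

    // True if the bytes begin with EF BB BF, the UTF-8 encoding of U+FEFF.
    static boolean startsWithBom(byte[] bytes) {
        return bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
    }

    // True if the whole array decodes as UTF-8 without any malformed sequence.
    static boolean isWellFormedUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // The decoded characters are not needed, only success or failure.
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Utf8Check <file>");
            return;
        }
        // Assumption: the file fits comfortably in memory.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("well-formed UTF-8: " + isWellFormedUtf8(bytes));
        System.out.println("starts with BOM:   " + startsWithBom(bytes));
    }
}
```

Note that a file with a BOM is still well-formed UTF-8, so the two checks are independent; for very large files the streaming Reader approach above is the better fit.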
VGR Avatar answered Sep 14 '25 05:09
