
How can I decode a large, multi-byte string file progressively in Java?

I have a program that may need to process large files, possibly containing multi-byte encodings. My current code has the problem that it creates an in-memory structure holding the entire file, which can cause an out-of-memory error if the file is large:

Charset charset = Charset.forName( "UTF-8" );
CharsetDecoder decoder = charset.newDecoder();
FileInputStream fis = new FileInputStream( file );
FileChannel fc = fis.getChannel();
int lenFile = (int)fc.size();
MappedByteBuffer bufferFile = fc.map( FileChannel.MapMode.READ_ONLY, 0, lenFile );
CharBuffer cb = decoder.decode( bufferFile );
// process character buffer
fc.close();

The problem is that if I chop the file's bytes into smaller buffers and feed them piecemeal to the decoder, a buffer could end in the middle of a multi-byte sequence. How should I cope with this problem?

Tyler Durden asked Dec 01 '25 06:12



1 Answer

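It is as easy as using a Reader. A Reader decodes the byte stream incrementally and holds on to an incomplete multi-byte sequence until the rest of its bytes arrive, so a sequence split across two reads is not a problem.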

A CharsetDecoder is indeed the underlying mechanism which allows the decoding of bytes into chars. In short, you could say that:

// Extrapolation...
byte stream --> decoding       --> char stream
InputStream --> CharsetDecoder --> Reader
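To make that pipeline concrete, here is a rough sketch of what a Reader does on your behalf, written directly against CharsetDecoder. The class name ChunkedDecoder and the method decodeInChunks are hypothetical names chosen for illustration, and the chunk size is assumed to be at least 4 bytes (the longest UTF-8 sequence). The key step is compact(), which carries the bytes of an unfinished sequence over into the next chunk.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ChunkedDecoder {
    // Hypothetical helper: decodes a UTF-8 stream chunkSize bytes at a time.
    // It collects the result in a String only to keep the sketch testable;
    // real code would process each batch of chars and discard it.
    static String decodeInChunks(InputStream in, int chunkSize) throws IOException {
        if (chunkSize < 4) {
            throw new IllegalArgumentException("chunkSize must be at least 4");
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer bytes = ByteBuffer.allocate(chunkSize);
        CharBuffer chars = CharBuffer.allocate(chunkSize);
        byte[] chunk = new byte[chunkSize];
        StringBuilder out = new StringBuilder();

        boolean eof = false;
        while (!eof) {
            // Fill only the free space; leftover bytes from the last round stay in front.
            int n = in.read(chunk, 0, bytes.remaining());
            if (n == -1) {
                eof = true;
            } else {
                bytes.put(chunk, 0, n);
            }
            bytes.flip();
            CoderResult result;
            do {
                // With endOfInput == false, a trailing partial sequence is left unread.
                result = decoder.decode(bytes, chars, eof);
                if (result.isError()) {
                    result.throwException();
                }
                chars.flip();
                out.append(chars);
                chars.clear();
            } while (result.isOverflow());
            // Move any undecoded tail bytes to the start of the buffer.
            bytes.compact();
        }
        // Let the decoder emit any internal state it still holds.
        CoderResult flushed;
        do {
            flushed = decoder.flush(chars);
            chars.flip();
            out.append(chars);
            chars.clear();
        } while (flushed.isOverflow());
        return out.toString();
    }
}
```

An InputStreamReader does this bookkeeping internally, which is why the Reader route is the simpler one.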

A lesser-known fact is that most (but not all; see below) default decoders in the JDK (such as the one behind a FileReader, or behind an InputStreamReader created with only a charset) have a policy of CodingErrorAction.REPLACE. The effect is to replace any invalid byte sequence in the input with the Unicode replacement character (yes, that infamous �).

Now, if you are concerned about "bad characters" slipping in, you can instead choose a policy of REPORT. You can do that when reading a file, too, as follows; this throws a MalformedInputException on any malformed byte sequence:

// This is 2015. File is obsolete.
final Path path = Paths.get(...);
final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT);

try (
    final InputStream in = Files.newInputStream(path);
    final Reader reader = new InputStreamReader(in, decoder);
) {
    // use the reader
}
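For the "use the reader" part, one possible shape is a loop over a fixed-size char array, so that only one buffer's worth of text is in memory at a time. This is a minimal sketch: the class ProgressiveReader, the method countChars and the 8192-char buffer size are illustrative assumptions, and counting chars stands in for whatever processing you actually need.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ProgressiveReader {
    // Hypothetical helper: walks a UTF-8 file one buffer at a time
    // and returns the number of chars (UTF-16 code units) it decoded.
    static long countChars(Path path) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        char[] buf = new char[8192]; // assumed size; tune as needed
        long total = 0;
        try (InputStream in = Files.newInputStream(path);
             Reader reader = new InputStreamReader(in, decoder)) {
            int n;
            while ((n = reader.read(buf)) != -1) {
                // Process buf[0..n) here; this sketch only counts.
                total += n;
            }
        }
        return total;
    }
}
```

The Reader never hands out half of a multi-byte sequence, but a surrogate pair (a character outside the BMP) can still straddle two read() calls, so code that works per code point should allow for that.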

ONE EXCEPTION to that default REPLACE action appeared in Java 8: Files.newBufferedReader(somePath) always reads as UTF-8, and with a default action of REPORT.
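The difference between the two defaults can be observed with a small experiment. The sketch below uses hypothetical names (ErrorPolicyDemo, readWithDefaultPolicy, newBufferedReaderRejects) and feeds both readers the same input containing the byte 0xFF, which is never valid in UTF-8.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ErrorPolicyDemo {
    // InputStreamReader built from a charset only: malformed input is replaced.
    static String readWithDefaultPolicy(byte[] data) throws IOException {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        }
    }

    // Files.newBufferedReader(Path): returns true if malformed input is reported.
    static boolean newBufferedReaderRejects(byte[] data) throws IOException {
        Path tmp = Files.createTempFile("policy", ".txt");
        try {
            Files.write(tmp, data);
            try (BufferedReader r = Files.newBufferedReader(tmp)) {
                while (r.read() != -1) {
                    // drain the reader; nothing to do per char
                }
                return false;
            } catch (MalformedInputException e) {
                return true;
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

With the bytes 'a', 0xFF, 'b', the first method should yield "a\uFFFDb", while the second should report the malformed byte by throwing.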

fge answered Dec 02 '25 20:12
