
Java CharsetDecoder inserting space after every character

Tags: java, utf-8, groovy

I'm attempting to use this code (found on Stack Overflow) to remove invalid UTF-8 characters:

def text = file.text
CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder();
utf8Decoder.onMalformedInput(CodingErrorAction.IGNORE);
utf8Decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
ByteBuffer bytes = ByteBuffer.allocate(text.getBytes().length * 2)
CharBuffer cbuf = bytes.asCharBuffer()
cbuf.put(text)
cbuf.flip()
CharBuffer parsed = utf8Decoder.decode(bytes);
println parsed.toString()

The output I get looks like this:

 < d o c u m e n t >
     < t i t l e > S o me  T i t l e   < / t i t l e >
     < s i t e > A S i t e < / s i t e >

Any ideas on why it is behaving like this?

Mike Thomsen asked Dec 13 '25 01:12


1 Answer

I don't know why the original code didn't work, but this is what fixed it (the code is Groovy, not Java):

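import java.nio.charset.Charset
import java.nio.charset.CharsetDecoder
import java.nio.charset.CodingErrorAction

def cleaned = file.withInputStream { stream ->
    // Decode the raw bytes directly; malformed sequences are dropped.
    CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder()
    utf8Decoder.onMalformedInput(CodingErrorAction.IGNORE)
    utf8Decoder.onUnmappableCharacter(CodingErrorAction.IGNORE)
    def reader = new BufferedReader(new InputStreamReader(stream, utf8Decoder))
    def line = null

    def sb = new StringBuilder()
    while ((line = reader.readLine()) != null) {
        sb.append(line).append('\n')
    }
    reader.close()
    sb.toString() // last expression: becomes the value of withInputStream
}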

Mike Thomsen answered Dec 15 '25 19:12
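
A likely explanation for the spaced-out output, offered as a sketch rather than a confirmed diagnosis: `bytes.asCharBuffer()` is a UTF-16 view of the byte buffer, so `cbuf.put(text)` stores every character as two bytes (a zero byte followed by the ASCII byte for plain Latin text). Decoding that buffer as UTF-8 then yields a NUL character before each letter, which many terminals render as a space. Note also that `file.text` has already decoded the file into a String, so any malformed bytes were handled before the decoder ever ran. The Java class below is hypothetical (the names `Utf8Clean`, `decodeIgnoringMalformed` and `viaCharBufferView` are not from the original post); it decodes raw bytes directly and reproduces the widening effect on a two-character string.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Clean {
    // Decode raw bytes as UTF-8, silently dropping malformed sequences.
    static String decodeIgnoringMalformed(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try {
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with IGNORE, but decode() declares it.
            throw new IllegalStateException(e);
        }
    }

    // Mimics the question's buffer handling: the char view writes
    // two bytes per character (big-endian by default).
    static byte[] viaCharBufferView(String text) {
        ByteBuffer bytes = ByteBuffer.allocate(text.length() * 2);
        bytes.asCharBuffer().put(text);
        return bytes.array();
    }

    public static void main(String[] args) {
        byte[] raw = {'o', 'k', (byte) 0xFF, '!'}; // 0xFF is never valid in UTF-8
        System.out.println(decodeIgnoringMalformed(raw)); // prints "ok!"

        byte[] widened = viaCharBufferView("ok");
        System.out.println(Arrays.toString(widened)); // prints [0, 111, 0, 107]
    }
}
```

The key design point is that the decoder has to see the file's original bytes (for example from an input stream or a byte array read from disk), not bytes produced by re-encoding an already decoded String.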


