Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert Windows-1252 to UTF-16 in Java

I am trying to convert all Windows special characters to their Unicode equivalent. We have a Flex application, where a user saves some Rich Text, and then it is emailed through a Java Emailer to their recipient. However, we keep running into Word's special characters that just show up in the email as a ?.

So far I've tried

 private String replaceWordChars(String text_in) {
    String s = text_in;

    // smart single quotes and apostrophe
    s = s.replaceAll("[\\u2018|\\u2019|\\u201A]", "\'");
    // smart double quotes
    s = s.replaceAll("[\\u201C|\\u201D|\\u201E]", "\"");
    // ellipsis
    s = s.replaceAll("\\u2026", "...");
    // dashes
    s = s.replaceAll("[\\u2013|\\u2014]", "-");
    // circumflex
    s = s.replaceAll("\\u02C6", "^");
    // open angle bracket
    s = s.replaceAll("\\u2039", "<");
    // close angle bracket
    s = s.replaceAll("\\u203A", ">");
    // spaces
    s = s.replaceAll("[\\u02DC|\\u00A0]", " ");

    return s;

Which works, but I don't want to hand encode all Windows-1252 characters to their equivalent UTF-16 (assuming that's what default Java character set is)

However our users keep finding more characters from Microsoft Word that Java just can't handle. So I searched and searched, and found this example

private String replaceWordChars(String text_in) {
    String s = text_in;
    try {
        byte[] b = s.getBytes("Cp1252");
        byte[] encoded = new String(b, "Cp1252").getBytes("UTF-16");
        s = new String(encoded, "UTF-16");


    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    return s;

But when I watch the encoding happen in the Eclipse debugger, nothing changes.

There has to be a simple solution to dealing with Microsoft's lovely encoding with Java.

Any thoughts?

like image 821
idonaldson Avatar asked Jan 25 '26 12:01

idonaldson


2 Answers

You could try using java.nio.charset.Charset:

final Charset windowsCharset = Charset.forName("windows-1252");
final Charset utfCharset = Charset.forName("UTF-16");
final CharBuffer windowsEncoded = windowsCharset.decode(ByteBuffer.wrap(new byte[] {(byte) 0x91}));
final byte[] utfEncoded = utfCharset.encode(windowsEncoded).array();
System.out.println(new String(utfEncoded, utfCharset.displayName()));
like image 87
laz Avatar answered Jan 28 '26 00:01

laz


Use the following steps:

  1. Create an InputStreamReader using the source file's encoding (Windows-1252)
  2. Create an OutputStreamWriter using the destination file's encoding (UTF-16)
  3. Copy the information read from the reader to the writer. You can use BufferedReader and BufferedWriter to write contents line-by-line.

So your code may look like this:

public void reencode(InputStream source, OutputStream dest,
        String sourceEncoding, String destEncoding)
        throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
    BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(dest, destEncoding));
    String in;
    while ((in = reader.readLine()) != null) {
        writer.write(in);
        writer.newLine();
    }
}

This, of course, excludes try/catch stuff and delegates it to the caller.

If you're just trying to get the contents as a string of sorts, you can replace the writer with StringWriter and return its toString value. Then you don't need a destination stream or encoding, just a place to dump characters:

public String decode(InputStream source, String sourceEncoding)
        throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
    StringWriter writer = new StringWriter();
    String in;
    while ((in = reader.readLine()) != null) {
        writer.write(in);
        writer.write('\n'); // Java newline should be fine, test this just in case
    }
    return writer.toString();
}
like image 24
Brian Avatar answered Jan 28 '26 01:01

Brian



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!