I am trying to convert all Windows special characters to their Unicode equivalent. We have a Flex application, where a user saves some Rich Text, and then it is emailed through a Java Emailer to their recipient. However, we keep running into Word's special characters that just show up in the email as a ?.
So far I've tried
private String replaceWordChars(String text_in) {
String s = text_in;
// smart single quotes and apostrophe
s = s.replaceAll("[\\u2018|\\u2019|\\u201A]", "\'");
// smart double quotes
s = s.replaceAll("[\\u201C|\\u201D|\\u201E]", "\"");
// ellipsis
s = s.replaceAll("\\u2026", "...");
// dashes
s = s.replaceAll("[\\u2013|\\u2014]", "-");
// circumflex
s = s.replaceAll("\\u02C6", "^");
// open angle bracket
s = s.replaceAll("\\u2039", "<");
// close angle bracket
s = s.replaceAll("\\u203A", ">");
// spaces
s = s.replaceAll("[\\u02DC|\\u00A0]", " ");
return s;
Which works, but I don't want to hand encode all Windows-1252 characters to their equivalent UTF-16 (assuming that's what default Java character set is)
However our users keep finding more characters from Microsoft Word that Java just can't handle. So I searched and searched, and found this example
private String replaceWordChars(String text_in) {
String s = text_in;
try {
byte[] b = s.getBytes("Cp1252");
byte[] encoded = new String(b, "Cp1252").getBytes("UTF-16");
s = new String(encoded, "UTF-16");
} catch (UnsupportedEncodingException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return s;
But when I watch the encoding happen in the Eclipse debugger, nothing changes.
There has to be a simple solution to dealing with Microsoft's lovely encoding with Java.
Any thoughts?
You could try using java.nio.charset.Charset:
final Charset windowsCharset = Charset.forName("windows-1252");
final Charset utfCharset = Charset.forName("UTF-16");
final CharBuffer windowsEncoded = windowsCharset.decode(ByteBuffer.wrap(new byte[] {(byte) 0x91}));
final byte[] utfEncoded = utfCharset.encode(windowsEncoded).array();
System.out.println(new String(utfEncoded, utfCharset.displayName()));
Use the following steps:
InputStreamReader using the source file's encoding (Windows-1252)OutputStreamWriter using the destination file's encoding (UTF-16)BufferedReader and BufferedWriter to write contents line-by-line.So your code may look like this:
public void reencode(InputStream source, OutputStream dest,
String sourceEncoding, String destEncoding)
throws IOException {
BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(dest, destEncoding));
String in;
while ((in = reader.readLine()) != null) {
writer.write(in);
writer.newLine();
}
}
This, of course, excludes try/catch stuff and delegates it to the caller.
If you're just trying to get the contents as a string of sorts, you can replace the writer with StringWriter and return its toString value. Then you don't need a destination stream or encoding, just a place to dump characters:
public String decode(InputStream source, String sourceEncoding)
throws IOException {
BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
StringWriter writer = new StringWriter();
String in;
while ((in = reader.readLine()) != null) {
writer.write(in);
writer.write('\n'); // Java newline should be fine, test this just in case
}
return writer.toString();
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With