I have some HTML (String) that I am putting through Jsoup just so I can add something to all href and src attributes, that works fine. However, I'm noticing that for some special HTML characters, Jsoup is converting them from say “ to the actual character “. I output the value before and after and I see that change.
Before:
THIS — IS A “TEST”. 5 > 4. trademark: ™
After:
THIS — IS A “TEST”. 5 > 4. trademark: ?
What the heck is going on? I was specifically converting those special characters to their HTML entities before any Jsoup stuff to avoid this. The quotes changed to the actual quote characters, the greater-than stayed the same, and the trademark changed into a question mark. Aaaaaaa.
FYI, my Jsoup code is doing:
Document document = Jsoup.parse(fileHtmlStr);
//some stuff
String modifiedFileHtmlStr = document.html();
Thanks for any help!
The code below will give similar to the input markup. It changes the escaping mode for specific characters and sets ASCII mode to escape the TM sign for systems which don't support Unicode.
The output:
<p>THIS — IS A “TEST”. 5 > 4. trademark: ™</p>
The code:
Document doc = Jsoup.parse("" +
    "<p>THIS — IS A “TEST”. 5 > 4. trademark: ™</p>");
Document.OutputSettings settings = doc.outputSettings();
settings.prettyPrint(false);
settings.escapeMode(Entities.EscapeMode.extended);
settings.charset("ASCII");
String modifiedFileHtmlStr = doc.html();
System.out.println(modifiedFileHtmlStr);
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With