I'm using Jsoup to remove all the images from an HTML page. I receive the page in an HTTP response, which also specifies the content charset.
The problem is that Jsoup unescapes some special characters.
For example, for the input:
<html><head></head><body><p>isn&rsquo;t</p></body></html>
After running
String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>";
Document doc = Jsoup.parse(check);
System.out.println(doc.outerHtml());
I get:
<html><head></head><body><p>isn’t</p></body></html>
I want to remove the images without changing the HTML in any other way.
By using the command:
doc.outputSettings().prettyPrint(false).charset("ASCII").escapeMode(EscapeMode.extended);
I do get the correct output but I'm sure there are cases where that charset won't be good. I just want to use the charset specified in the HTTP header and I'm afraid this will change my document in ways I can't predict. Is there any other cleaner method for removing the images without changing anything else inadvertently?
Thank you!
Here is a workaround not involving any charset except the one specified in the HTTP header.
// Hide every entity from the parser by swapping its leading '&' for a "**" placeholder.
String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>".replaceAll("&([^;]+?);", "**$1;");
Document doc = Jsoup.parse(check);
doc.outputSettings().prettyPrint(false).escapeMode(EscapeMode.extended);
// Turn the placeholders back into entities after serializing.
System.out.println(doc.outerHtml().replaceAll("\\*\\*([^;]+?);", "&$1;"));
OUTPUT
<html><head></head><body><p>isn&rsquo;t</p></body></html>
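The regex in the workaround treats anything between an & and the next ; as an entity and uses ** as the placeholder, so it can also rewrite a bare ampersand or text that already contains **. A stricter variant can be sketched as follows, under some assumptions: EntityMasker, mask and unmask are hypothetical names (not Jsoup API), and the placeholder is the private-use character U+E000, which is assumed to be absent from the page and encodable in the output charset (true for UTF-8).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup: hides entity references from the
// parser and restores them after serialization.
final class EntityMasker {
    // Assumption: U+E000 (private use area) never occurs in the input page.
    private static final String MARK = "\uE000";

    // Body of a well-formed reference: #123, #x1F or a name such as rsquo.
    private static final String BODY = "(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*)";

    private static final Pattern ENTITY = Pattern.compile("&" + BODY + ";");
    private static final Pattern MASKED = Pattern.compile(MARK + BODY + ";");

    // Replace the leading '&' of each entity with the marker character.
    static String mask(String html) {
        return ENTITY.matcher(html).replaceAll(MARK + "$1;");
    }

    // Put the '&' back wherever the marker introduces an entity body.
    static String unmask(String html) {
        return MASKED.matcher(html).replaceAll("&$1;");
    }

    public static void main(String[] args) {
        String html = "<p>isn&rsquo;t a & b; **bold;</p>";
        String masked = mask(html);
        // In real use: parse `masked` with Jsoup, edit, serialize, then unmask.
        System.out.println(unmask(masked).equals(html)); // true
    }
}
```

mask would be called before Jsoup.parse and unmask on the result of doc.outerHtml(), in the same positions as the two replaceAll calls in the workaround.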
DISCUSSION
I wish there was a solution in Jsoup's API - @dlv
Using Jsoup's API would require you to write a custom NodeVisitor, which would mean reinventing code that already exists inside Jsoup. The custom NodeVisitor would write an HTML escape code back out instead of the Unicode character.
Another option would involve writing a custom character encoder. The default UTF-8 character encoder can encode ’ directly, which is why Jsoup writes the character itself instead of preserving the original escape sequence in the final HTML code.
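The encoder behaviour described here can be checked with the JDK alone. This is only a sketch of the underlying check, not Jsoup's own code; EncoderCheck is a hypothetical class name. It asks a charset's encoder whether it can represent ’ (U+2019), which is also why the ASCII setting from the question forces an escape while UTF-8 does not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: asks a charset whether it can represent a character.
final class EncoderCheck {
    // True when the charset can write the character without escaping it.
    static boolean canEncode(Charset charset, char c) {
        return charset.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char quote = '\u2019'; // right single quotation mark
        System.out.println(canEncode(StandardCharsets.UTF_8, quote));    // true
        System.out.println(canEncode(StandardCharsets.US_ASCII, quote)); // false
    }
}
```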
Either of the two options above represents a significant coding effort. Ultimately, an enhancement could be added to Jsoup to let us choose how characters are written in the final HTML code: as a hexadecimal escape (&#xAB;), a decimal escape (&#8212;), the original escape sequence (&rsquo;), or the encoded character itself (which is the case in your post).
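To make the output forms concrete, here is a minimal sketch of the numeric ones in plain Java. EscapeForms and its methods are hypothetical names, not Jsoup API. A custom NodeVisitor like the one mentioned earlier would need a step along these lines, although this version cannot recover the original named sequence, because that information is lost once the page is parsed.

```java
// Hypothetical helper: renders a code point in the numeric escape forms.
final class EscapeForms {
    // Hexadecimal character reference, e.g. &#x2019;
    static String hex(int codePoint) {
        return String.format("&#x%X;", codePoint);
    }

    // Decimal character reference, e.g. &#8217;
    static String decimal(int codePoint) {
        return "&#" + codePoint + ";";
    }

    // Leaves ASCII untouched and writes every other code point as a decimal escape.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp);
            } else {
                out.append(decimal(cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(0x2019));                  // &#x2019;
        System.out.println(decimal(0x2019));              // &#8217;
        System.out.println(escapeNonAscii("isn\u2019t")); // isn&#8217;t
    }
}
```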