I am trying to remove all the text between the tags of an HTML page using Jsoup.
For example, if the input HTML is:
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
the output should be:
<!DOCTYPE html>
<html>
<body>
<h1></h1>
<p></p>
</body>
</html>
Basically, I want to remove what is returned by doc.text().
I have found many posts that do the opposite and keep only the text, but nothing that solves my problem. Any idea how to do this?
EDIT
The solution proposed by maverick9999 (https://stackoverflow.com/a/24292349/3589481) handles most cases.
However, as noted in the comments, it also removes nested tags.
As an example:
    String str = "<!DOCTYPE html>" +
            "<html>" +
            "<body>" +
            "<div class='foo'>text <div class='THIS DIV WILL BE REMOVED'>text</div> text </div>" +
            "<h1>My First Heading</h1>\n" +
            "<p>My first paragraph.</p>\n" +
            "</body>\n" +
            "</html>";
    Document doc = Jsoup.parse(str);
    removeAllTexts(doc);
    System.out.println(doc);
    Elements all = doc.select("*");
    Iterator<Element> iterator = all.iterator();
    while (iterator.hasNext()) {
        Element e = iterator.next();
        if (!e.ownText().isEmpty()) {
            e.text("");
        }
    }
    System.out.println(doc);
This removes the nested div from the output:
    <html>
     <head></head>
     <body>
      <div class="foo">
      </div>
     </body>
    </html>
Any thoughts to avoid this?
EDIT 2
For some reason, Jsoup treats the "meta" tag as self-closing. So if you have something like this:
System.out.println("\n\n----");
String html = "<!DOCTYPE html>\r\n"
+ "<html>\r\n"
+ "<head>\n" 
+ "<meta content=\"/myimage.png\" itemprop=\"image\">\n"
+ "<title>Title</title>\n" 
+ "<script>Random Javascript here</script>"
+ "</meta>"
+ "</head>"
+ "<body>\r\n"
+ "<h1>My First <i>Heading</i></h1>\r\n"
+ "<hr/>\r\n"
+ "<p>My first paragraph.</p>\r\n"
+ "<p> <div class='foo'>text <div class='bar'> text </div> text </div> </p>\r\n"
+ "</body>\r\n" 
+ "</html>";
Document doc2 = Jsoup.parse(html,"",Parser.xmlParser());
printNodes(doc2);
then none of the tags after meta are read. With Pshemo's solution, the scripts are removed, and if you have br tags with children (for example), they are removed as well.
I finally ended up with the following solution (thanks to Pshemo for his help):
    public static void printNodes(Node node) {
        String name = node.nodeName();
        if (name.equals("#doctype")) {
            System.out.println(node);
        } else if (name.equals("#text")) {
            return;
        } else if (name.equals("#document")) {
            for (Node n : node.childNodes())
                printNodes(n);
        } 
        // There is no reason to have text here, so print everything
        else if (name.equals("head") || name.equals("script")){
            System.out.println(node.toString());
        }
        else {
            if (!Tag.valueOf(name).isSelfClosing() || node.childNodeSize()>0) {
                System.out.println("<" + name + getAttributes(node) + ">");
                for (Node n : node.childNodes())
                    printNodes(n);
                System.out.println("</" + name + ">");
            } else {
                // System.out.println("debug: " + name + " is self closing");
                System.out.println("<" + name + getAttributes(node) + "/>");
            }
        }
    }
    public static String getAttributes(Node node) {
        StringBuilder sb = new StringBuilder();
        for (Attribute attr : node.attributes()) {
            sb.append(" ").append(attr.getKey()).append("=\"")
                    .append(attr.getValue()).append("\"");
        }
        return sb.toString();
    }
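One caveat with the getAttributes helper above: it concatenates attr.getValue() as-is, and as far as I can tell that value is not escaped, so an attribute containing a double quote or an ampersand could produce broken markup. Below is a minimal sketch of the escaping step, using a plain Map as a stand-in for node.attributes() so it runs without Jsoup. The names AttributeEscapeDemo, escapeAttribute and renderAttributes are hypothetical and not part of Jsoup.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AttributeEscapeDemo {

    // Escapes the characters that would break a double-quoted attribute value.
    // '&' must be replaced first so the entities added afterwards are not re-escaped.
    static String escapeAttribute(String value) {
        return value.replace("&", "&amp;")
                    .replace("\"", "&quot;")
                    .replace("<", "&lt;");
    }

    // Same shape as getAttributes, but over a plain map (a stand-in for node.attributes()).
    static String renderAttributes(Map<String, String> attributes) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> attr : attributes.entrySet()) {
            sb.append(' ').append(attr.getKey())
              .append("=\"").append(escapeAttribute(attr.getValue())).append('"');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("class", "foo");
        attrs.put("title", "say \"hi\" & bye");
        // prints:  class="foo" title="say &quot;hi&quot; &amp; bye"
        System.out.println(renderAttributes(attrs));
    }
}
```

In the real helper, the same escaping could be applied to attr.getValue() before appending it.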
The code below should solve your problem with nested tags:
Updated code:
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
for (Element el : doc.select("*")){
    if (!el.ownText().isEmpty()){
        for (TextNode node : el.textNodes())
            node.remove();
    }
}
System.out.println(doc);
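For comparison, the same idea (remove only the text nodes, keep every element and attribute) can be sketched without Jsoup using the JDK's built-in DOM parser. This is a swapped-in technique, not the Jsoup approach above, and it assumes the input is well-formed XML/XHTML, since the JDK parser rejects ordinary tag-soup HTML. The names StripTextDemo, stripText and strip are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class StripTextDemo {

    // Recursively removes text and CDATA nodes; elements and attributes stay in place,
    // so nested tags survive (unlike calling text("") on the parent).
    static void stripText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // read before a possible removal
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                node.removeChild(child);
            } else {
                stripText(child);
            }
            child = next;
        }
    }

    // Parses well-formed XML, strips its text, and serializes it back to a string.
    static String strip(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            stripText(doc.getDocumentElement());
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div class=\"foo\">text <div class=\"bar\">text</div> text</div>";
        // The nested div is kept; only the text around and inside it is dropped.
        System.out.println(strip(xml));
    }
}
```

Note that an XML serializer may write emptied elements in the short form (for example a self-closed div), which is fine for XML but may not be what you want for HTML output.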