
Removing text nodes and checking for text nodes between tags in HTML: Jsoup

I am trying to parse an HTML string using jsoup:

<div class="test">
  <br>From: <b class="sendername">Divya</b> 
  <span dir="ltr">&lt;<a href="mailto:[email protected]" target="_blank">[email protected]</a>&gt;</span>
  <br>Date: Wed, May 27, 2015 at 11:10 AM
  <br>Subject: Plan for the day 27/05/2015
  <br>To: Abhishek&lt;<a href="mailto:[email protected]" target="_blank">abhishek.sharma@abc.<wbr>com</a>&gt;, 
    <a href="mailto:[email protected]" target="_blank">[email protected]</a>&gt;
  <br>Cc: Ram &lt;<a href="mailto:[email protected]" target="_blank">[email protected]</a>&gt;
  <br>
  <br>
  <br>
  <div dir="ltr">Hi,</div>
 </div>

Document doc = Jsoup.parse( mailBody.getBodyHtml().get( 0 ) );
Elements elem = doc.getElementsByClass( "test" );
int totalElements = 0;
// children() returns child elements only; text nodes are not included
Elements childElements = elem.get( 0 ).children();
int brCount = 0;
for( Element childElement : childElements )
{
    totalElements++;
    if( childElement.tagName().equalsIgnoreCase( "br" ) )
    {
        brCount++;
        if( brCount == 3 )
            break;
    }
    else
        brCount = 0;
}
// remove the elements counted above (indices 0 .. totalElements - 1)
for( int i = 0; i < totalElements; i++ )
{
    childElements.get( i ).remove();
}

I want to remove all content that precedes three consecutive br tags, where "consecutive" means there is no text node between them.
In the case above, everything before the inner div (HTML tags and text nodes alike) should be removed, and the output should be as follows:

<div class="test">
  <div dir="ltr">Hi,</div>
 </div>

  1. How can I check whether there is a text node between two br tags?
  2. The code above removes only the HTML elements; the text nodes are not deleted. How can I remove them as well?

Asked by Abhishek, Jan 22 '26 13:01


1 Answer

The structure of the HTML appears to be constant, so you can try the following CSS selector, which matches a div that directly follows three sibling br elements inside div.test:

div.test br + br + br + div
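
For context, `+` is the adjacent sibling combinator: it matches an element whose immediately preceding element sibling matches the left-hand side. Text nodes are not elements, so the combinator does not look at text between the tags; the selector finds the mail body here because the div directly follows the last br of a run of three. A minimal sketch of that behavior, using shortened, hypothetical versions of the mail HTML (the class `SelectorDemo` and the method `findBody` are illustrative names, not part of jsoup):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectorDemo {
    // Returns the first element matched by the selector, or null if nothing matches.
    static Element findBody(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("div.test br + br + br + div").first();
    }

    public static void main(String[] args) {
        // Three br tags separated only by whitespace, followed by the body div.
        String blankRun = "<div class=\"test\"><br>From: x<br>\n<br>\n<br>\n"
                + "<div dir=\"ltr\">Hi,</div></div>";
        // Three br tags with text between them, followed by the body div.
        String textInRun = "<div class=\"test\"><br>\n<br>Date: y<br>To: z"
                + "<div dir=\"ltr\">Hi,</div></div>";

        System.out.println(findBody(blankRun).text());
        // The selector still matches, because text nodes are ignored by "+".
        System.out.println(findBody(textInRun) != null);
    }
}
```

If text between the br tags must disqualify a match, the selector alone is not enough and the child nodes have to be inspected directly.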

DEMO

http://try.jsoup.org/~DiBi9Q_Ye88gi6Hq29Z44ar6xus

SAMPLE CODE

String html = "<div class=\"test\">\n  <br>From: <b class=\"sendername\">Divya</b> \n  <span dir=\"ltr\">&lt;<a href=\"mailto:[email protected]\" target=\"_blank\">[email protected]</a>&gt;</span>\n  <br>Date: Wed, May 27, 2015 at 11:10 AM\n  <br>Subject: Plan for the day 27/05/2015\n  <br>To: Abhishek&lt;<a href=\"mailto:[email protected]\" target=\"_blank\">abhishek.sharma@abc.<wbr>com</a>&gt;, \n    <a href=\"mailto:[email protected]\" target=\"_blank\">[email protected]</a>&gt;\n  <br>Cc: Ram &lt;<a href=\"mailto:[email protected]\" target=\"_blank\">[email protected]</a>&gt;\n  <br>\n  <br>\n  <br>\n  <div dir=\"ltr\">Hi,</div>\n </div>";

Document doc = Jsoup.parse(html);

Element mailBody = doc.select("div.test br + br + br + div").first();
if (mailBody == null) {
    throw new RuntimeException("Unable to locate mail body.");
}
System.out.println("** BEFORE:\n" + doc);

Document tmp = Jsoup.parseBodyFragment("<div class='test'>" + mailBody.outerHtml() + "</div>");
mailBody.parent().replaceWith(tmp.select("div.test").first());
System.out.println("\n** AFTER:\n" + doc);

OUTPUT

** BEFORE:
<html>
 <head></head>
 <body>
  <div class="test"> 
   <br>From: 
   <b class="sendername">Divya</b> 
   <span dir="ltr">&lt;<a href="mailto:[email protected]" target="_blank">[email protected]</a>&gt;</span> 
   <br>Date: Wed, May 27, 2015 at 11:10 AM 
   <br>Subject: Plan for the day 27/05/2015 
   <br>To: Abhishek&lt;
   <a href="mailto:[email protected]" target="_blank">abhishek.sharma@abc.<wbr>com</a>&gt;, 
   <a href="mailto:[email protected]" target="_blank">[email protected]</a>&gt; 
   <br>Cc: Ram &lt;
   <a href="mailto:[email protected]" target="_blank">[email protected]</a>&gt; 
   <br> 
   <br> 
   <br> 
   <div dir="ltr">
    Hi,
   </div> 
  </div>
 </body>
</html>

** AFTER:
<html>
 <head></head>
 <body>
  <div class="test">
   <div dir="ltr">
     Hi, 
   </div>
  </div>
 </body>
</html>
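
The selector approach depends on the div following the br run and does not inspect text nodes. If the check for text between the tags matters, a node-level pass over `childNodes()` (which, unlike `children()`, includes text nodes) is one option. The following is a minimal sketch under stated assumptions, not a definitive implementation: `HeaderStripper`, `stripHeader`, `isBr`, and `isBlankText` are hypothetical names, the sample HTML is shortened, and whitespace-only text between br tags is assumed not to break the run.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

class HeaderStripper {
    // True for a <br> element node.
    static boolean isBr(Node node) {
        return node instanceof Element
                && ((Element) node).tagName().equalsIgnoreCase("br");
    }

    // True for a text node that holds only whitespace.
    static boolean isBlankText(Node node) {
        return node instanceof TextNode && ((TextNode) node).isBlank();
    }

    // Removes every child node of container up to and including the first run of
    // three <br> tags separated only by whitespace. Returns false, and changes
    // nothing, when no such run exists.
    static boolean stripHeader(Element container) {
        // Work on a copy so that removals do not disturb the iteration.
        List<Node> nodes = new ArrayList<>(container.childNodes());
        int brRun = 0;
        for (int i = 0; i < nodes.size(); i++) {
            Node node = nodes.get(i);
            if (isBr(node)) {
                brRun++;
            } else if (!isBlankText(node)) {
                brRun = 0; // an element or non-blank text breaks the run
            }
            if (brRun == 3) {
                // Remove elements and text nodes alike, from the start through the third br.
                for (int j = 0; j <= i; j++) {
                    nodes.get(j).remove();
                }
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String html = "<div class=\"test\"><br>From: <b>Divya</b><br>Date: today"
                + "<br>\n<br>\n<br>\n<div dir=\"ltr\">Hi,</div></div>";
        Document doc = Jsoup.parse(html);
        Element test = doc.getElementsByClass("test").first();
        stripHeader(test);
        System.out.println(test.html());
    }
}
```

Because the removal happens in place, there is no need to re-parse a fragment and replace the parent element.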

Answered by Stephan, Jan 24 '26 01:01