
Point TagSoup Parser to use HTML5 version

Tags: html, tag-soup

I want to configure TagSoup to follow the HTML5 standard.
I am using the TagSoup parser (org.ccil.cowan.tagsoup), which adheres to HTML4. HTML4 does not allow a <div> inside an <a> tag, so the parser restructures markup that is valid in HTML5, which does allow this nesting. How do I make TagSoup follow HTML5 rules? For example,

<a>
  <div></div>
</a>

becomes,

<a></a>
<div></div>
Anish Somani asked Nov 28 '25 11:11


1 Answer

I had the same problem with the following structure:

<a>
  <li></li>
  <p></p>
</a>

became,

<a>
  <li></li>
</a>
<p></p>

I resolved it by using a custom HTMLSchema:

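import org.ccil.cowan.tagsoup.ElementType;
import org.ccil.cowan.tagsoup.HTMLSchema;

// Schema that additionally lets <a> contain block-level elements.
private class CustomHTMLSchema extends HTMLSchema
{
    public CustomHTMLSchema()
    {
        super();
        ElementType elA = getElementType("a");
        // Add the block group (M_BLOCK, inherited via HTMLSchema) to the content model of <a>.
        elA.setModel(elA.model() | M_BLOCK);
    }
}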

...

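// SAXParserImpl: org.ccil.cowan.tagsoup.jaxp; Parser: org.ccil.cowan.tagsoup
saxParser = SAXParserImpl.newInstance(null);
// Replace the default HTML schema with the custom one.
saxParser.setProperty(Parser.schemaProperty, new CustomHTMLSchema());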
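
Why this works, as far as I understand TagSoup: each element type carries two bitmasks, one for the groups it may contain (its model) and one for the groups it belongs to, and a child stays inside its parent only if the two masks share a bit. The sketch below is a standalone toy illustrating that check. It does not use TagSoup itself; the class name, the Element type and the flag values are hypothetical and much simpler than the real schema.

```java
public class ContentModelSketch {
    // Hypothetical bit flags standing in for TagSoup's model constants.
    static final int M_INLINE = 1 << 0;
    static final int M_BLOCK  = 1 << 1;

    /** Toy element type: 'model' = groups it may contain, 'memberOf' = groups it belongs to. */
    static final class Element {
        final String name;
        int model;
        final int memberOf;

        Element(String name, int model, int memberOf) {
            this.name = name;
            this.model = model;
            this.memberOf = memberOf;
        }

        /** A child is allowed when the parent's model overlaps the child's membership. */
        boolean canContain(Element child) {
            return (model & child.memberOf) != 0;
        }
    }

    public static void main(String[] args) {
        // Simplified assumption: <a> accepts only inline content, <div> is a block element.
        Element a = new Element("a", M_INLINE, M_INLINE);
        Element div = new Element("div", M_INLINE | M_BLOCK, M_BLOCK);

        System.out.println(a.canContain(div)); // false: <div> would be moved out of <a>

        a.model |= M_BLOCK;                    // same idea as setModel(model() | M_BLOCK)
        System.out.println(a.canContain(div)); // true: <div> may now stay inside <a>
    }
}
```

Widening the model with a bitwise OR keeps everything <a> could already contain and only adds the block group, so other parsing rules are left untouched.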
reyz answered Nov 30 '25 04:11


