
How to parse an HTML file with Qt?

Tags: c++, html, qt

The goal is to obtain a QDomDocument (or something similar) holding the content of an HTML (not XML) document.

The problem is that some tags, especially script, trigger errors:

<!DOCTYPE html>
<html>
<head>
    <script type="text/javascript">
        var a = [1,2,3];
        var b = (2<a.length);
    </script>
</head>
<body/>
</html>

Not well formed: Element type "a.length" must be followed by either attribute specifications, ">" or "/>".

I understand that HTML is not the same as XML, but it seems reasonable to expect Qt to offer a solution for this, such as:

  • Setting the parser to accept HTML
  • Another class for HTML
  • A way to treat the content of certain tags as CDATA

My current attempt only achieves normal XML parsing:

QString mainHtml;

{
    QFile file("main.html");
    if (!file.open(QIODevice::ReadOnly)) qDebug() << "Error reading file main.html";
    QTextStream stream(&file);
    mainHtml = stream.readAll();
    file.close();
}

QDomDocument doc;
QString errStr;
int errLine=0, errCol=0;
doc.setContent( mainHtml, false, &errStr, &errLine, &errCol);
if (!errStr.isEmpty())
{
    qDebug() << errStr << "L:" << errLine << ":" << errCol;
}

std::function<void(const QDomElement&, int)> printTags=
[&printTags](const QDomElement& elem, int tab)
{
    QString space(3*tab, ' ');
    QDomNode n = elem.firstChild();
    for( ;!n.isNull(); n=n.nextSibling()) 
    {
        QDomElement e = n.toElement();
        if(e.isNull()) continue;
        
        qDebug() << space + e.tagName(); 
        printTags( e, tab+1);
    }
};
printTags(doc.documentElement(), 0);

Note: I would like to avoid including the full WebKit for this.

Adrian Maire asked Sep 02 '25 06:09


1 Answer

I recommend using htmlcxx. It is licensed under the LGPL and works on Linux and Windows. On Windows, compile it with MSYS.

To compile it, extract the files and run:

./configure --prefix=/usr/local/htmlcxx
make
make install

In your .pro file, add the include and library directories:

INCLUDEPATH += /usr/local/htmlcxx/include
LIBS += -L/usr/local/htmlcxx/lib -lhtmlcxx

Usage example

#include <iostream>
#include <string>
#include <strings.h>  // strcasecmp (POSIX)
#include "htmlcxx/html/ParserDom.h"

int main (int argc, char *argv[])
{
  using namespace std;
  using namespace htmlcxx;

  //Parse some html code
  string html = "<html><body>hey<A href=\"www.bbxyard.com\">myhome</A></body></html>";
  HTML::ParserDom parser;
  tree<HTML::Node> dom = parser.parseTree(html);
  //Print whole DOM tree
  cout << dom << endl;

  //Dump all links in the tree
  tree<HTML::Node>::iterator it = dom.begin();
  tree<HTML::Node>::iterator end = dom.end();
  for (; it != end; ++it)
  {
     if (strcasecmp(it->tagName().c_str(), "A") == 0)
     {
       it->parseAttributes();
       cout << it->attribute("href").second << endl;
     }
  }

  //Dump all text of the document
  it = dom.begin();
  end = dom.end();
  for (; it != end; ++it)
  {
    if ((!it->isTag()) && (!it->isComment()))
    {
      cout << it->text() << " ";
    }
  }
  cout << endl;
  return 0;
}

Credits for the example: https://github.com/bbxyard/sdk/blob/master/examples/htmlcxx/htmlcxx-demo.cpp

You can't reliably use an XML parser for HTML. Either use an HTML parser such as htmlcxx, or convert the HTML to valid XML first. After that you are free to use QDomDocument, the Qt XML parsers, etc.
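
As a rough illustration of the "convert to valid XML" route for the script case in the question, here is a minimal sketch in plain standard C++ (no Qt or htmlcxx calls). The helper name wrapScriptsInCdata is hypothetical. It assumes lowercase tag names, no self-closing script tags, no "</script>" inside string literals and no "]]>" inside the script body. It only deals with script content; other HTML-isms such as unclosed tags or unquoted attributes would still need handling before an XML parser accepts the document.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of Qt or htmlcxx): wraps the body of every
// <script ...>...</script> element in a CDATA section, so that characters
// such as '<' inside JavaScript are no longer read as markup by an XML parser.
// Assumptions: lowercase tags, no self-closing <script/>, no "]]>" in the body.
std::string wrapScriptsInCdata(const std::string& html)
{
    const std::string openTag = "<script";
    const std::string closeTag = "</script>";
    std::string out;
    std::size_t pos = 0;  // start of the part of html not yet copied to out

    while (true) {
        // Locate the next opening tag and the '>' that terminates it.
        const std::size_t open = html.find(openTag, pos);
        if (open == std::string::npos) break;
        const std::size_t openEnd = html.find('>', open);
        if (openEnd == std::string::npos) break;
        const std::size_t bodyStart = openEnd + 1;

        // Locate the matching closing tag; give up if it is missing.
        const std::size_t close = html.find(closeTag, bodyStart);
        if (close == std::string::npos) break;
        assert(close >= bodyStart);

        // Copy everything up to and including the opening tag, then the
        // script body wrapped in CDATA, then the closing tag.
        out.append(html, pos, bodyStart - pos);
        out += "<![CDATA[";
        out.append(html, bodyStart, close - bodyStart);
        out += "]]>";
        out += closeTag;
        pos = close + closeTag.size();
    }

    // Copy whatever follows the last processed script element unchanged.
    out.append(html, pos, std::string::npos);
    return out;
}
```

The result could be converted with QString::fromStdString and then passed to QDomDocument::setContent. Whether the rest of a given page parses depends on how close to XML it already is.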

Qt WebEngine also has parsing functionality, but it adds a large overhead to the application.

user3606329 answered Sep 04 '25 21:09