The goal is to obtain a QDomDocument, or something similar, from the content of an HTML (not XML) document.
The problem is that some tags, especially script, trigger errors:
<!DOCTYPE html>
<html>
<head>
<script type="text/javascript">
var a = [1,2,3];
var b = (2<a.length);
</script>
</head>
<body/>
</html>
Not well formed: Element type "a.length" must be followed by either attribute specifications, ">" or "/>".
I understand that HTML is not the same as XML, but it seems reasonable to expect that Qt has a solution for this.
My current attempt only achieves normal XML parsing:
QString mainHtml;
{
QFile file("main.html");
if (!file.open(QIODevice::ReadOnly)) qDebug() << "Error reading file main.html";
QTextStream stream(&file);
mainHtml = stream.readAll();
file.close();
}
QDomDocument doc;
QString errStr;
int errLine=0, errCol=0;
doc.setContent( mainHtml, false, &errStr, &errLine, &errCol);
if (!errStr.isEmpty())
{
qDebug() << errStr << "L:" << errLine << ":" << errCol;
}
std::function<void(const QDomElement&, int)> printTags=
[&printTags](const QDomElement& elem, int tab)
{
QString space(3*tab, ' ');
QDomNode n = elem.firstChild();
for( ;!n.isNull(); n=n.nextSibling())
{
QDomElement e = n.toElement();
if(e.isNull()) continue;
qDebug() << space + e.tagName();
printTags( e, tab+1);
}
};
printTags(doc.documentElement(), 0);
Note: I would like to avoid pulling in all of WebKit for this.
I recommend using htmlcxx. It is licensed under the LGPL and works on Linux and Windows. On Windows, compile it with MSYS.
To build it, extract the files and run:
./configure --prefix=/usr/local/htmlcxx
make
make install
In your .pro file, add the include and library directories:
INCLUDEPATH += /usr/local/htmlcxx/include
LIBS += -L/usr/local/htmlcxx/lib -lhtmlcxx
Usage example
#include <iostream>
#include <string>
#include <strings.h> // strcasecmp (POSIX)
#include "htmlcxx/html/ParserDom.h"
int main (int argc, char *argv[])
{
using namespace std;
using namespace htmlcxx;
//Parse some html code
string html = "<html><body>hey<A href=\"www.bbxyard.com\">myhome</A></body></html>";
HTML::ParserDom parser;
tree<HTML::Node> dom = parser.parseTree(html);
//Print whole DOM tree
cout << dom << endl;
//Dump all links in the tree
tree<HTML::Node>::iterator it = dom.begin();
tree<HTML::Node>::iterator end = dom.end();
for (; it != end; ++it)
{
if (strcasecmp(it->tagName().c_str(), "A") == 0)
{
it->parseAttributes();
cout << it->attribute("href").second << endl;
}
}
//Dump all text of the document
it = dom.begin();
end = dom.end();
for (; it != end; ++it)
{
if ((!it->isTag()) && (!it->isComment()))
{
cout << it->text() << " ";
}
}
cout << endl;
return 0;
}
Credits for the example: https://github.com/bbxyard/sdk/blob/master/examples/htmlcxx/htmlcxx-demo.cpp
You can't use an XML parser for HTML. Either use htmlcxx, or convert the HTML to valid XML first; after that you are free to use QDomDocument, the Qt XML parsers, etc.
QWebEngine also has parsing functionality, but it adds a large overhead to the application.
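For the "convert the HTML to valid XML" route, one narrow approach that fits the document in the question is to wrap each script body in a CDATA section before handing the text to an XML parser. The sketch below is only an illustration under loud assumptions: wrapScriptsInCdata is a hypothetical helper, not part of Qt or htmlcxx; it uses plain text search rather than a real HTML tokenizer; and it does nothing about other HTML-only constructs such as unclosed tags (<br>, <img>) or unquoted attributes. For general HTML, htmlcxx or a real converter is still the safer choice.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not a Qt or htmlcxx API): wraps the body of every
// <script> element in a CDATA section so that an XML parser treats
// characters such as '<' and '&' inside it as plain text.
// Naive by design: tags are located by text search, so "<script" inside
// comments or attribute values is not handled.
std::string wrapScriptsInCdata(const std::string& html)
{
    // Lower-cased copy for case-insensitive search. It has the same length
    // as html, so positions found in it are valid in html as well.
    std::string lower(html);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    std::string out;
    std::size_t pos = 0;  // start of the part of html not yet copied to out

    for (;;) {
        const std::size_t open = lower.find("<script", pos);
        if (open == std::string::npos) break;
        const std::size_t tagEnd = lower.find('>', open);
        if (tagEnd == std::string::npos) break;

        // Copy everything up to and including the '>' of the opening tag.
        out.append(html, pos, tagEnd + 1 - pos);
        pos = tagEnd + 1;

        // A self-closing <script .../> has no body to wrap.
        if (html[tagEnd - 1] == '/') continue;

        const std::size_t close = lower.find("</script", pos);
        if (close == std::string::npos) break;

        const std::string body = html.substr(pos, close - pos);
        // Leave empty bodies alone, and do not touch bodies that already
        // contain CDATA markers, because nesting them would break the XML.
        const bool hasMarkers = body.find("<![CDATA[") != std::string::npos ||
                                body.find("]]>") != std::string::npos;
        if (body.empty() || hasMarkers) {
            out += body;
        } else {
            out += "<![CDATA[" + body + "]]>";
        }
        pos = close;  // the closing tag is copied by a later append
    }

    out.append(html, pos, std::string::npos);  // remainder of the document
    return out;
}
```

With the file from the question, passing QString::fromStdString(wrapScriptsInCdata(mainHtml.toStdString())) to doc.setContent should get past the script element, because the `<` in `2<a.length` then sits inside a CDATA section. The CDATA markers are only meant to help parsing; they are not valid JavaScript, so the result should not be written back out as an HTML page.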