My experience tells me that one should not use RegExp to parse HTML/XML, and I completely agree!
They all say "use a DOM parser" of some sort, which is fine by me. But now I got curious. How do those work?
I was searching for the DOMDocument class source, and couldn't find it.
This question comes from the fact that filter_var(), for instance, is considered a good alternative to validating emails with RegExp, but when you look at the source, you'll see it actually uses RegExp itself!
So, if you were to build a DOM parser in PHP, how would you go about parsing the HTML? How did they do it?
Java DOM Parser: Overview. The Document Object Model (DOM) is an official recommendation of the World Wide Web Consortium (W3C). It defines an interface that enables programs to access and update the style, structure, and contents of XML documents.
When you parse an XML document with a DOM parser, you get back a tree structure that contains all of the elements of your document. The DOM provides a variety of functions you can use to examine the contents and structure of the document.
The DOM API is implemented by a DOM parser, which is simple to use. The parser reads an XML document and builds an in-memory tree in which each element is a branch. Because the whole tree is held in memory, this approach requires more memory. The parser creates this internal structure for you.
The DOM is a common interface for manipulating document structures. One of its design goals is that Java code written for one DOM-compliant parser should run on any other DOM-compliant parser without having to do any modifications. The DOM defines several Java interfaces.
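To make the "tree in memory" idea concrete, here is a minimal sketch using Python's standard xml.dom.minidom module, which stands in for the PHP and Java parsers mentioned above. The sample document and the describe() helper are made up for illustration.

```python
from xml.dom import minidom

# Made-up sample document, used only for illustration.
XML = """<library>
  <book id="b1"><title>Dune</title></book>
  <book id="b2"><title>Emma</title></book>
</library>"""

def describe(node, depth=0, out=None):
    """Walk the in-memory tree depth-first, collecting one line per element."""
    if out is None:
        out = []
    if node.nodeType == node.ELEMENT_NODE:
        out.append("  " * depth + node.tagName)
        for child in node.childNodes:
            describe(child, depth + 1, out)
    return out

doc = minidom.parseString(XML)           # the whole document is parsed up front
root = doc.documentElement               # top of the tree: <library>
lines = describe(root)                   # element names, indented by depth
titles = [t.firstChild.data for t in doc.getElementsByTagName("title")]

print("\n".join(lines))
print(titles)
```

Once parseString() returns, the source text is no longer needed: every query runs against the tree, which is also why the memory cost grows with the size of the document.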
I think you should check out the article How Browsers Work: Behind the Scenes of Modern Web Browsers. It's a lengthy read, but well worth your time. Specifically, the HTML Parser section.
While I cannot do the article justice, perhaps a cursory summary will tide you over until you have the time to read and digest that masterpiece. I must admit, though, that I am a novice in this area. Although I have developed for the web professionally for about 10 years, the way in which the browser handles and interprets my code has long been a black box to me.
HTML, XHTML, CSS or JavaScript - take your pick. They all have a grammar, as well as a vocabulary. English is another great example. We have grammatical rules that we expect people, books, and more to follow. We also have a vocabulary made up of nouns, verbs, adjectives and more.
Browsers interpret a document by examining its grammar, as well as its vocabulary. When a browser comes across items it ultimately doesn't understand, it will let you know (raising exceptions, etc.). You and I do the same in common-speak.
I love StackOverflow, but if I could changed one thing it would be be absolutamente broken...
Note in the example above how you immediately start to pick apart the words and relationships between words. The beginning makes complete sense, "I love StackOverflow." Then we come to "...if I could changed," and we immediately stop. "Changed" doesn't belong here. It's likely the author meant "change" instead. Now the vocabulary is right, but the grammar is wrong. A little later we come across "be be" which may also violate a grammatical rule, and just a bit further we encounter the word "absolutamente", which is not part of the English vocabulary - another mistake.
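The same two-step reading, first splitting the input into words and then checking how they fit together, can be sketched for markup. This is a toy, not how any real browser or DOMDocument does it: the tokenize() and build_tree() names are hypothetical, attributes, comments and self-closing tags are ignored, and every error simply raises.

```python
def tokenize(markup):
    """Split markup into ('open', name), ('close', name) and ('text', data) tokens."""
    tokens, i = [], 0
    while i < len(markup):
        if markup[i] == "<":
            end = markup.find(">", i)
            if end == -1:
                raise ValueError(f"unterminated tag at offset {i}")
            body = markup[i + 1:end].strip()
            parts = body.lstrip("/").split()
            if not parts:
                raise ValueError(f"empty tag at offset {i}")
            # Keep only the tag name; attributes are ignored in this sketch.
            kind = "close" if body.startswith("/") else "open"
            tokens.append((kind, parts[0].lower()))
            i = end + 1
        else:
            end = markup.find("<", i)
            end = len(markup) if end == -1 else end
            text = markup[i:end].strip()
            if text:
                tokens.append(("text", text))
            i = end
    return tokens

def build_tree(tokens):
    """Nest tokens into {'tag', 'children'} dicts using a stack of open elements."""
    root = {"tag": "#document", "children": []}
    stack = [root]
    for kind, value in tokens:
        if kind == "open":
            node = {"tag": value, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif kind == "close":
            # A closing tag must match the most recently opened element.
            if len(stack) == 1 or stack[-1]["tag"] != value:
                raise ValueError(f"unexpected </{value}>")
            stack.pop()
        else:
            stack[-1]["children"].append(value)
    if len(stack) > 1:
        raise ValueError(f"unclosed <{stack[-1]['tag']}>")
    return root

tokens = tokenize("<p>Hello <b>World</b></p>")
tree = build_tree(tokens)
```

Feeding it "<p><b>oops</p>" raises a ValueError at the </p>, the markup equivalent of stopping at "if I could changed". Real HTML parsers are far more forgiving and try to recover instead of giving up.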
Think of all of this in terms of a DOCTYPE. Right now I have the source of the XHTML 1.0 Strict DOCTYPE open on my second monitor. Among its internals are lines like the following:
<!ENTITY % heading "h1|h2|h3|h4|h5|h6">
This defines the heading entities. And as long as I adhere to the grammar of XHTML, I can use any one of these in my document (<h1>Hello World</h1>). But if I try to make one up, say h7, the browser will stumble over the vocabulary as "foreign," and inform me:
"Line 7, Column 8: element "h7" undefined"
Perhaps while parsing the document we come across <table. We know that we're now dealing with a table element, which has its own set of vocabulary such as tbody, tr, etc. As long as we know the language, the grammar rules, etc., we know when something is wrong. Returning to the XHTML 1.0 Strict Doctype, we find the following:
<!ELEMENT table
     (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>
<!ELEMENT caption  %Inline;>
<!ELEMENT thead    (tr)+>
<!ELEMENT tfoot    (tr)+>
<!ELEMENT tbody    (tr)+>
<!ELEMENT colgroup (col)*>
<!ELEMENT col      EMPTY>
<!ELEMENT tr       (th|td)+>
<!ELEMENT th       %Flow;>
<!ELEMENT td       %Flow;>
Given this reference, we can keep a running check against whatever source we're parsing. If the author writes tread, instead of thead, we have a standard by which we can determine that to be in error. When issues are unresolved, and we cannot find rules to match certain uses of grammar and vocabulary, we inform the author that their document is invalid.
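As a rough sketch of such a running check: a DTD content model is itself a regular expression over child element names, so the table rule can be translated by hand and matched against the names of a table's direct children. The table_is_valid() helper is hypothetical and only handles well-formed fragments, since it relies on xml.etree.ElementTree for the parsing. The regular expression never touches the markup itself, only the list of child names.

```python
import re
import xml.etree.ElementTree as ET

# The table content model from the DTD, hand-translated into a regular
# expression over a space-terminated list of child element names.
TABLE_MODEL = re.compile(
    r"(caption )?((col )*|(colgroup )*)(thead )?(tfoot )?((tbody )+|(tr )+)"
)

def table_is_valid(fragment):
    """Return True if the <table> root's direct children match the content model."""
    table = ET.fromstring(fragment)
    children = "".join(child.tag + " " for child in table)
    return TABLE_MODEL.fullmatch(children) is not None

good = ("<table><thead><tr><th>a</th></tr></thead>"
        "<tbody><tr><td>1</td></tr></tbody></table>")
bad = good.replace("thead", "tread")   # the misspelling from the text above

print(table_is_valid(good))
print(table_is_valid(bad))
```

A full validator would repeat this for every element, using the rules for thead, tr and the rest, and would report where the mismatch happened rather than returning a bare boolean.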
I am by no means doing this science justice. However, I hope this is enough, if nothing more, to persuade you to sit down and read the article referenced at the beginning of this answer, and perhaps to study the various DTDs that we encounter day to day.