In other words may one use /<tag[^>]*>.*?<\/tag>/ regex to match the tag html element which does not contain nested tag elements?
For example (lt.html):
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
  <head>
    <title>greater than sign in attribute value</title>
  </head>
  <body>
    <div>1</div>
    <div title=">">2</div>
  </body>
</html>
Regex:
$ perl -nE"say $1 if m~<div[^>]*>(.*?)</div>~" lt.html
And screen-scraper:
#!/usr/bin/env python
import sys
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(sys.stdin)
for div in soup.findAll('div'):
    print div.string
$ python lt.py <lt.html
Both give the same output:
1
">2
Expected output:
1
2
w3c says:
Attribute values are a mixture of text and character references, except with the additional restriction that the text cannot contain an ambiguous ampersand.
Tags contain a tag name , giving the element's name. HTML elements all have names that only use characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z, and U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z.
Definition and UsageThe value attribute specifies the value of an <input> element. The value attribute is used differently for different input types: For "button", "reset", and "submit" - it defines the text on the button. For "text", "password", and "hidden" - it defines the initial (default) value of the input field.
HTML attributes are generally classified as required attributes, optional attributes, standard attributes, and event attributes: Usually the required and optional attributes modify specific HTML elements.
Yes, it is allowed (W3C Validator accepts it, only issues a warning).
Unescaped < and > are also allowed inside comments, so such simple regexp can be fooled.
If BeautifulSoup doesn't handle this, it could be a bug or perhaps a conscious design decision to make it more resilient to missing closing quotes in attributes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With