Let's say I wanted to remove vowels from HTML:
<a href="foo">Hello there!</a>Hi!
becomes
<a href="foo">Hll thr!</a>H!
I figure this is a job for Beautiful Soup. How can I select the text in between tags and operate on it like this?
find returns the first element that matches the search, or None if nothing matches. find_all scans the entire document and returns a list of every match.
When you construct a BeautifulSoup object without naming a parser, Beautiful Soup parses the document with the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser.
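To make the difference between find and find_all concrete, here is a minimal sketch. It assumes Beautiful Soup 4 (the bs4 package) and names Python's built-in "html.parser" explicitly; the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup only; it is not taken from a real page.
html = '<p>One</p><a href="foo">Hello there!</a><a href="bar">Hi!</a>'

# Naming the parser keeps the result the same regardless of which
# optional parsers (lxml, html5lib) happen to be installed.
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")       # first matching tag only
all_links = soup.find_all("a")    # list-like result with every matching tag
missing = soup.find("table")      # no match, so this is None

first_href = first_link["href"]
all_hrefs = [a["href"] for a in all_links]
print(first_href, all_hrefs, missing)
```

Because find returns None when nothing matches, it is worth checking the result before indexing into it.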
Suppose the variable test_html holds the following HTML content:
<html>
<head><title>Test title</title></head>
<body>
<p>Some paragraph</p>
Useless Text
<a href="http://stackoverflow.com">Some link</a>not a link
<a href="http://python.org">Another link</a>
</body></html>
Just do this:
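from bs4 import BeautifulSoup  # Beautiful Soup 4; the old BeautifulSoup 3 module is Python 2 only

test_html = load_html_from_above()  # placeholder: the HTML shown above, as a string
soup = BeautifulSoup(test_html, "html.parser")
# find_all(string=True) returns every text node and skips tags and attributes
for t in soup.find_all(string=True):
    text = str(t)
    for vowel in "aeiou":  # lowercase only, so capital vowels are kept
        text = text.replace(vowel, "")
    t.replace_with(text)
print(soup)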
That prints:
<html>
<head><title>Tst ttl</title></head>
<body>
<p>Sm prgrph</p>
Uslss Txt
<a href="http://stackoverflow.com">Sm lnk</a>nt  lnk
<a href="http://python.org">Anthr lnk</a>
</body></html>
Note that the tags and attributes are untouched. Capital vowels are also kept ("Uslss", "Anthr"), because only lowercase vowels are replaced.
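The same idea can be wrapped in a reusable function. This is a sketch rather than a definitive version: it assumes Beautiful Soup 4, the helper name strip_vowels is made up here, and unlike the answer's code it also removes capital vowels and leaves HTML comments alone.

```python
import re

from bs4 import BeautifulSoup, Comment

# Matches both lowercase and capital vowels.
VOWELS = re.compile(r"[aeiouAEIOU]")


def strip_vowels(html):
    """Return html with vowels removed from text nodes only (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a materialized list, so replacing nodes while looping is safe.
    for node in soup.find_all(string=True):
        if isinstance(node, Comment):
            continue  # leave <!-- comments --> as they are
        node.replace_with(VOWELS.sub("", str(node)))
    return str(soup)


result = strip_vowels('<a href="foo">Hello there!</a>Hi!')
print(result)
```

Run on the markup from the question, this should give the output the question asks for, with the href value "foo" left intact.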