I am working with the BeautifulSoup (BS) library for HTML parsing. My task is to remove everything between the head tags, so if I have <head> A lot of Crap! </head>, the result should be <head></head>. This is the code for it:
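from bs4 import BeautifulSoup
raw_html = "entire_web_document_as_string"
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(raw_html, "html.parser")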
head = soup.head
head.unwrap()
print(head)
And this works fine. But I want these changes to be applied to the raw_html string that contains the entire HTML document. How do I reflect these commands in the original string and not only in the head string? Can you share a code snippet for doing it?
You're basically asking how to export a string of HTML from BS's soup object.
You can do it this way:
# Python 2.7
modified_raw_html = unicode(soup)
# Python3
modified_raw_html = str(soup)
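As a minimal end-to-end sketch (Python 3, assuming the bs4 package and the built-in "html.parser"; the sample document below is made up for illustration), the whole round trip could look like the following. Note that it uses clear() rather than unwrap(): unwrap() replaces a tag with its own contents, so the serialized soup would lose the <head> tag itself, while clear() empties the tag and leaves it in the tree.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document standing in for the real page
raw_html = (
    "<html><head><title>A lot of Crap!</title></head>"
    "<body><p>Hello</p></body></html>"
)

soup = BeautifulSoup(raw_html, "html.parser")

# clear() removes all children of <head> but keeps the tag in the tree
soup.head.clear()

# Serialize the whole modified tree back to a string
modified_raw_html = str(soup)
print(modified_raw_html)
```

The original raw_html string is never changed in place, since Python strings are immutable; the modified document is the new string returned by str(soup), which you can assign back to raw_html if you want to reuse the name.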