
HTML Parsing issue with BeautifulSoup Library

I am working with the BeautifulSoup (BS) library for HTML parsing. My task is to remove everything between the head tags. So if I have <head> A lot of Crap! </head>, the result should be <head></head>. This is the code for it:

from bs4 import BeautifulSoup

raw_html = "entire_web_document_as_string"
soup = BeautifulSoup(raw_html)
head = soup.head
head.unwrap()
print(head)

And this works fine. But I want these changes to be reflected in the raw_html string that contains the entire HTML document. How do I apply these commands to the original string, and not only to the head string? Can you share a code snippet for doing it?

hnvasa asked Dec 06 '25 08:12



1 Answer

You're basically asking how to export a string of HTML from BS's soup object.

You can do it this way:

# Python 2.7
modified_raw_html = unicode(soup)

# Python3
modified_raw_html = str(soup)
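
One caveat worth checking before exporting: in bs4, unwrap() removes the tag itself and leaves its children in the tree, so the exported string would likely still contain the head's contents, just without the <head> tag around them. If the goal is an empty <head></head> in the full document, clear() may be the better fit. Below is a minimal sketch under stated assumptions: bs4 is installed, the sample raw_html is made up for illustration, and the built-in "html.parser" is chosen only so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the real document (an assumption, not the asker's data).
raw_html = (
    "<html><head><title>A lot of Crap!</title></head>"
    "<body><p>Hello</p></body></html>"
)

# "html.parser" ships with Python; any installed parser could be used instead.
soup = BeautifulSoup(raw_html, "html.parser")

# clear() drops every child of <head> but keeps the tag itself in the tree.
if soup.head is not None:
    soup.head.clear()

# Serialize the whole (modified) tree back into a string.
modified_raw_html = str(soup)
print(modified_raw_html)
```

The original raw_html string is never changed in place, since Python strings are immutable; the modified document is a new string produced from the soup object.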
Jivan answered Dec 07 '25 22:12



