
Python: Remove HTML tags and the text in between them

I'm trying to remove HTML tags in Python 3, along with the text between them. The snippet below doesn't give the result I'm looking for, and the other questions I've found on SO only cover removing the tags while preserving the text inside them, which is not what I want.

Current Code

import re
...
price = "12.00 <b>17.50</b>"
price = re.sub(r'<[^>]*>', '', price)

String

12.00 <b>17.50</b>

Expected Outcome

12.00

Current Outcome

12.00 17.50
llanato asked Oct 20 '25 01:10



1 Answer

You can also do this with an HTML parser such as BeautifulSoup. The idea is to find all the tags and call decompose() on each one, which removes the tag together with its contents, and then print what is left:

In [8]: from bs4 import BeautifulSoup

In [9]: price = "12.00 <b>17.50</b>"

In [10]: soup = BeautifulSoup(price, "html.parser")

In [11]: for elm in soup.find_all():
    ...:     elm.decompose()
    ...:     

In [12]: print(soup)
12.00 
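The same idea can be wrapped in a small helper. This is only a sketch: the function name strip_tags_and_content is made up for illustration, it assumes the bs4 package is installed, and it decomposes only the top-level tags (recursive=False), since removing a parent also removes everything nested inside it. It also strips the trailing space that the session above leaves behind.

```python
from bs4 import BeautifulSoup


def strip_tags_and_content(markup: str) -> str:
    """Return the text that sits outside every HTML element in `markup`.

    Hypothetical helper name; not part of BeautifulSoup itself.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # Only the top-level tags need to be visited: decompose() removes
    # a tag together with all of its children.
    for elm in soup.find_all(recursive=False):
        elm.decompose()
    # get_text() returns the remaining text nodes; strip() drops the
    # whitespace left where the tags used to be.
    return soup.get_text().strip()


if __name__ == "__main__":
    print(strip_tags_and_content("12.00 <b>17.50</b>"))
```

Called on the string from the question, this should give 12.00 with no trailing space.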

Here is a well-known topic explaining why you should not process HTML with regular expressions:

  • RegEx match open tags except XHTML self-contained tags
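To make the point concrete, here is a rough illustration of how a regex-only approach can go wrong. The pattern and the function name naive_strip are assumptions made up for this example, not something from the question. It handles the simple input, but breaks as soon as tags with the same name are nested.

```python
import re

# Naive pattern (illustrative only): an opening tag, the shortest possible
# run of text, then a closing tag with the same name.
PAIRED_TAG = re.compile(r"<(\w+)[^>]*>.*?</\1>", re.DOTALL)


def naive_strip(markup: str) -> str:
    """Remove paired tags and their content using only a regex."""
    return PAIRED_TAG.sub("", markup)


if __name__ == "__main__":
    # Works for the flat case from the question.
    print(repr(naive_strip("12.00 <b>17.50</b>")))
    # Nested <b> tags: the non-greedy match stops at the first </b>,
    # so part of the element is left behind.
    print(repr(naive_strip("x <b>a<b>b</b>c</b>")))
```

In the nested case the match ends at the inner closing tag, leaving a stray fragment in the output, which is the kind of problem a real parser avoids.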
alecxe answered Oct 21 '25 18:10
