I'm trying to remove HTML tags in Python 3, along with the text between them. The snippet below doesn't give me the result I'm looking for, and the other questions I've found on SO only cover removing the tags while preserving the text inside them, which is not what I'm trying to do.
Current Code
import re
...
price="12.00 <b>17.50</b>"
price=re.sub('<[^>]*>', '', price)
String
12.00 <b>17.50</b>
Expected Outcome
12.00
Current Outcome
12.00 17.50
You can also do it with an HTML parser, like BeautifulSoup. The idea is to find all the tags and decompose them, then get what is left:
In [8]: from bs4 import BeautifulSoup
In [9]: price = "12.00 <b>17.50</b>"
In [10]: soup = BeautifulSoup(price, "html.parser")
In [11]: for elm in soup.find_all():
    ...:     elm.decompose()
    ...:
In [12]: print(soup)
12.00
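If installing BeautifulSoup is not an option, the same idea (keep only the text that sits outside every element) can be sketched with the standard library's `html.parser`. This is a minimal sketch, not a drop-in replacement: the names `TopLevelText`, `strip_tags_and_contents`, and `VOID_ELEMENTS` are made up for illustration, and it assumes reasonably well-formed markup where every non-void tag is closed.

```python
from html.parser import HTMLParser

# Hypothetical helper set: elements that never have a closing tag,
# so they must not change the nesting depth.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class TopLevelText(HTMLParser):
    """Collect only the text that is not inside any element."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many open elements we are currently inside
        self.parts = []  # text chunks seen at depth 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when no element is open.
        if self.depth == 0:
            self.parts.append(data)


def strip_tags_and_contents(markup):
    """Return markup with all elements and their contents removed."""
    parser = TopLevelText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts).strip()


price = "12.00 <b>17.50</b>"
print(strip_tags_and_contents(price))  # 12.00
```

Unlike `print(soup)`, this version also strips the surrounding whitespace, so the result is exactly `12.00`.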
And, here is a famous topic explaining why you should not process HTML with regular expressions: