Python: Remove HTML Tags & text in between HTML Tags

Question

I'm trying to remove HTML tags (Python 3) and also the text between them. The snippet below doesn't give me the result I'm looking for, and all the other questions I've found on SO only cover removing the tags while preserving the text inside them, which is not what I'm trying to do.

Current Code

import re
...
price="12.00 <b>17.50</b>"
price=re.sub('<[^>]*>', '', price)

String

12.00 <b>17.50</b>

Expected Outcome

12.00

Current Outcome

12.00 17.50

alecxe · Accepted Answer

You can also do it with an HTML parser, like BeautifulSoup. The idea is to find all the tags and decompose them (decompose() removes a tag together with everything inside it), then get what is left:

In [8]: from bs4 import BeautifulSoup

In [9]: price = "12.00 <b>17.50</b>"

In [10]: soup = BeautifulSoup(price, "html.parser")

In [11]: for elm in soup.find_all():
    ...:     elm.decompose()
    ...:     

In [12]: print(soup)
12.00

And here is a famous topic explaining why you should not process HTML with regular expressions:

RegEx match open tags except XHTML self-contained tags
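
If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser instead of a regex. This is a different technique from the decompose() approach above, not a drop-in equivalent: it keeps only the text that sits outside every element. The names TopLevelText and strip_tags_and_contents are made up for this example, and the sketch assumes reasonably well-formed markup where non-void tags are closed.

```python
from html.parser import HTMLParser


class TopLevelText(HTMLParser):
    """Collect only the text that sits outside every element (hypothetical helper)."""

    # Void elements never get a closing tag, so they must not change the depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0   # number of currently open (non-void) elements
        self.parts = []  # text chunks seen while no element is open

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside any element is dropped; only top-level text is kept.
        if self.depth == 0:
            self.parts.append(data)


def strip_tags_and_contents(html):
    """Return the text of `html` with every element and its contents removed."""
    parser = TopLevelText()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


print(strip_tags_and_contents("12.00 <b>17.50</b>"))  # 12.00
```

Note the final strip(): without it the result would keep the space that preceded the removed tag, just as print(soup) does in the session above.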
