I have an HTML document and I want to replace all smart quotes with regular quotes. I tried this:
for text_element in html.findAll():
    content = text_element.string
    if content:
        new_content = content \
            .replace(u"\u2018", "'") \
            .replace(u"\u2019", "'") \
            .replace(u"\u201c", '"') \
            .replace(u"\u201d", '"') \
            .replace("e", "x")
        text_element.string.replaceWith(new_content)
(with the e/x transformation just to make it easy to see if things were working or not)
but this is my output:
<p>
 This amount of investment is producing results: total final consumption in IEA countries is estimated to be
 <strong>
  60% lowxr
 </strong>
 today because of energy efficiency improvements over the last four decades. This has had the effect of
 <strong>
  avoiding morx xnxrgy consumption than thx total final consumption of thx Europxan Union in 2011
 </strong>
 .
</p>
It seems Beautiful Soup is only reaching the innermost tags, but I need to process all the text in the entire page.
Instead of selecting and filtering all the elements/tags, you could just select the text nodes directly by specifying True for the string argument:
for text_node in soup.find_all(string=True):
    # do something with each text node
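To see what this actually returns, here is a quick sketch (using the stdlib html.parser and a made-up snippet for illustration) — each match is a NavigableString text node, not a tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hi <strong>there</strong></p>", "html.parser")
# string=True matches every text node in the tree, at any depth
print(soup.find_all(string=True))  # ['Hi ', 'there']
```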
As the documentation states, the string argument is new in version 4.4.0, which means that you may need to use the text argument instead, depending on your version:
for text_node in soup.find_all(text=True):
    # do something with each text node
Here is the relevant code for replacing the values:
from bs4 import BeautifulSoup

def remove_smart_quotes(text):
    return text.replace(u"\u2018", "'") \
               .replace(u"\u2019", "'") \
               .replace(u"\u201c", '"') \
               .replace(u"\u201d", '"')

soup = BeautifulSoup(html, 'lxml')
for text_node in soup.find_all(string=True):
    text_node.replace_with(remove_smart_quotes(text_node))
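Run against a small snippet, the whole pipeline looks like this (here with the built-in html.parser so no extra parser install is needed; the sample markup is made up):

```python
from bs4 import BeautifulSoup

def remove_smart_quotes(text):
    return text.replace(u"\u2018", "'") \
               .replace(u"\u2019", "'") \
               .replace(u"\u201c", '"') \
               .replace(u"\u201d", '"')

html = u"<p>\u201cHello\u201d <strong>\u2018world\u2019</strong></p>"
soup = BeautifulSoup(html, "html.parser")
# find_all returns a list up front, so replacing nodes while looping is safe
for text_node in soup.find_all(string=True):
    text_node.replace_with(remove_smart_quotes(text_node))
print(soup)  # <p>"Hello" <strong>'world'</strong></p>
```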
As a side note, the Beautiful Soup documentation actually has a section on smart quotes.
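That section covers the case where the smart quotes arrive as raw Windows-1252 bytes rather than already-decoded Unicode: UnicodeDammit can convert them to plain ASCII quotes during decoding via its smart_quotes_to argument. A minimal sketch, assuming byte input:

```python
from bs4 import UnicodeDammit

# In Windows-1252, \x93 and \x94 are the left/right curly double quotes
markup = b"\x93Hello\x94"
dammit = UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii")
print(dammit.unicode_markup)  # "Hello"
```

Note this only helps at decode time; if your document is already a Unicode string, the replace-based approach above is the way to go.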