
Replace all smart quotes in Beautiful Soup

I have an HTML document and I want to replace all smart quotes with regular quotes. I tried this:

for text_element in html.findAll():
    content = text_element.string
    if content:
        new_content = content \
            .replace(u"\u2018", "'") \
            .replace(u"\u2019", "'") \
            .replace(u"\u201c", '"') \
            .replace(u"\u201d", '"') \
            .replace("e", "x")
        text_element.string.replaceWith(new_content)

(with the e/x transformation just to make it easy to see if things were working or not)

but this is my output:

<p>
 This amount of investment is producing results: total final consumption in IEA countries is estimated to be
   <strong>
      60% lowxr
   </strong>
 today because of energy efficiency improvements over the last four decades. This has had the effect of
   <strong>
      avoiding morx xnxrgy consumption than thx total final consumption of thx Europxan Union in 2011
   </strong>
 .
</p>

It seems Beautiful Soup is only reaching the innermost tags, but I need to replace the text across the entire page.

thumbtackthief asked Sep 18 '25 11:09


1 Answer

Instead of selecting and filtering all the elements/tags, you could just select the text nodes directly by specifying True for the string argument:

for text_node in soup.find_all(string=True):
  # do something with each text node

As the documentation states, the string argument is new in Beautiful Soup 4.4.0, so on earlier versions you need to use the text argument instead:

for text_node in soup.find_all(text=True):
  # do something with each text node
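
For context on why the loop in the question only changed the text inside the strong tags: .string returns None for a tag that has more than one child, so the text sitting directly inside the p tag was skipped. Below is a minimal sketch of the difference, assuming Python 3 and the built-in html.parser; the sample markup is made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: a <p> that mixes text with a child tag.
html = "<p>It\u2019s <strong>60% lower</strong> today</p>"
soup = BeautifulSoup(html, "html.parser")

# .string is None when a tag has more than one child, so a loop over
# tags that reads .string never sees the text directly inside <p>.
print(soup.p.string)       # None
print(soup.strong.string)  # 60% lower

# find_all(string=True) returns every text node, whatever its parent.
text_nodes = soup.find_all(string=True)
print([str(node) for node in text_nodes])
```

The last line should list three text nodes: the one before the strong tag, the one inside it, and the one after it.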

Here is the relevant code for replacing the values:

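from bs4 import BeautifulSoup

def remove_smart_quotes(text):
  # Swap curly single and double quotes for their ASCII equivalents.
  return text.replace(u"\u2018", "'") \
             .replace(u"\u2019", "'") \
             .replace(u"\u201c", '"') \
             .replace(u"\u201d", '"')

# html holds the document's markup as a string
soup = BeautifulSoup(html, 'lxml')

# find_all returns a list, so replacing nodes while looping is safe
for text_node in soup.find_all(string=True):
  text_node.replace_with(remove_smart_quotes(text_node))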
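
If you are on Python 3, the chained replace calls could also be written as a single str.translate pass. This is only a sketch of an alternative, not a change in behavior; SMART_QUOTES is a made-up name for the lookup table:

```python
# Map each smart-quote character to its ASCII equivalent (Python 3 only).
SMART_QUOTES = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def remove_smart_quotes(text):
    # One pass over the string instead of four chained replace() calls.
    return text.translate(SMART_QUOTES)

print(remove_smart_quotes("\u201cIt\u2019s fine\u201d"))  # "It's fine"
```

It drops into the loop above unchanged, since a Beautiful Soup text node is a str subclass.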

As a side note, the Beautiful Soup documentation actually has a section on smart quotes.

Josh Crozier answered Sep 20 '25 01:09