I'm using BeautifulSoup (version 4.3.2 with Python 3.4) to convert HTML documents to text. The problem is that web pages sometimes contain newline characters ("\n") that a browser would not render as line breaks, but when BeautifulSoup converts the page to text, it leaves the "\n" in.
Example:
Your browser probably renders the following all on one line (even though I have a newline character in the middle):
This is a paragraph.
And your browser probably renders the following on multiple lines, even though I'm entering it with no newlines:
This is a paragraph.
This is another paragraph.
But when BeautifulSoup converts the same strings to text, the only line breaks it uses are the literal newlines, and it always uses them:
from bs4 import BeautifulSoup
doc = "<p>This is a\nparagraph.</p>"
soup = BeautifulSoup(doc)
soup.text
Out[181]: 'This is a\nparagraph.'
doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
soup = BeautifulSoup(doc)
soup.text
Out[187]: 'This is a paragraph.This is another paragraph.'
Does anyone know how to make BeautifulSoup extract text in a more beautiful way (or really just get all the newlines correct)? Are there any other simple ways around the problem?
get_text with a separator argument might be helpful here:
>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc)
>>> soup.get_text(separator="\n")
u'This is a paragraph.\nThis is another paragraph.'
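The separator takes care of the missing breaks between paragraphs, but a "\n" inside a paragraph is still kept. One possible way to handle both cases is sketched below; it is only an approximation of what a browser does, not a built-in BeautifulSoup feature. The helper name html_to_text and the BLOCK_TAGS list are made up for this example, the list is an incomplete set of block-level tags, and whitespace inside <pre> is collapsed too. The string= argument needs Beautiful Soup 4.4 or later; on older releases such as 4.3.2 it is spelled text=.

```python
import re

from bs4 import BeautifulSoup, NavigableString

# Assumption: a small, illustrative subset of block-level tags; extend as needed.
BLOCK_TAGS = ["p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"]


def html_to_text(html):
    """Roughly imitate browser line breaking (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # Newlines in the HTML source are ordinary whitespace to a browser,
    # so collapse every whitespace run inside a plain text node to one space.
    # Comments and other special string types are left untouched.
    for node in soup.find_all(string=True):
        if type(node) is NavigableString:
            node.replace_with(re.sub(r"\s+", " ", node))

    # A new line starts after each block element and at each <br>.
    for tag in soup.find_all(BLOCK_TAGS):
        tag.append("\n")
    for br in soup.find_all("br"):
        br.replace_with("\n")

    # Trim each resulting line and drop the empty ones.
    lines = (line.strip() for line in soup.get_text().split("\n"))
    return "\n".join(line for line in lines if line)


print(repr(html_to_text("<p>This is a\nparagraph.</p>")))
print(repr(html_to_text(
    "<p>This is a paragraph.</p><p>This is another paragraph.</p>")))
```

With the two documents from the question, this should print 'This is a paragraph.' and 'This is a paragraph.\nThis is another paragraph.'. Inline tags such as <b> do not introduce breaks, because only the tags in BLOCK_TAGS get a newline appended.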