Let's say I have a page with a div. I can easily get that div with soup.find().
Now that I have the result, I'd like to print the WHOLE innerhtml of that div: I mean, I'd need a string with ALL the html tags and text all toegether, exactly like the string I'd get in javascript with obj.innerHTML. Is this possible?
To use beautiful soup, you need to install it: $ pip install beautifulsoup4 . Beautiful Soup also relies on a parser, the default is lxml . You may already have it, but you should check (open IDLE and attempt to import lxml). If not, do: $ pip install lxml or $ apt-get install python-lxml .
The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.
With BeautifulSoup 4 use element.encode_contents() if you want a UTF-8 encoded bytestring or use element.decode_contents() if you want a Python Unicode string. For example the DOM's innerHTML method might look something like this:
def innerHTML(element): """Returns the inner HTML of an element as a UTF-8 encoded bytestring""" return element.encode_contents() These functions aren't currently in the online documentation so I'll quote the current function definitions and the doc string from the code.
encode_contents - since 4.0.4def encode_contents( self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING, formatter="minimal"): """Renders the contents of this tag as a bytestring. :param indent_level: Each line of the rendering will be indented this many spaces. :param encoding: The bytestring will be in this encoding. :param formatter: The output formatter responsible for converting entities to Unicode characters. """ See also the documentation on formatters; you'll most likely either use formatter="minimal" (the default) or formatter="html" (for html entities) unless you want to manually process the text in some way.
encode_contents returns an encoded bytestring. If you want a Python Unicode string then use decode_contents instead.
decode_contents - since 4.0.1decode_contents does the same thing as encode_contents but returns a Python Unicode string instead of an encoded bytestring.
def decode_contents(self, indent_level=None, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter="minimal"): """Renders the contents of this tag as a Unicode string. :param indent_level: Each line of the rendering will be indented this many spaces. :param eventual_encoding: The tag is destined to be encoded into this encoding. This method is _not_ responsible for performing that encoding. This information is passed in so that it can be substituted in if the document contains a <META> tag that mentions the document's encoding. :param formatter: The output formatter responsible for converting entities to Unicode characters. """ BeautifulSoup 3 doesn't have the above functions, instead it has renderContents
def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING, prettyPrint=False, indentLevel=0): """Renders the contents of this tag as a string in the given encoding. If encoding is None, returns a Unicode string..""" This function was added back to BeautifulSoup 4 (in 4.0.4) for compatibility with BS3.
One of the options could be use something like that:
innerhtml = "".join([str(x) for x in div_element.contents])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With