Using BeautifulSoup to Extract CData

Question

I'm trying to use BeautifulSoup from bs4/Python 3 to extract CData. However, whenever I search for it using the following, it returns an empty result. Can anyone point out what I'm doing wrong?

from bs4 import BeautifulSoup, CData

txt = '''<foobar>We have
         <![CDATA[some data here]]>
         and more.
         </foobar>'''
soup = BeautifulSoup(txt)
for cd in soup.findAll(text=True):
    if isinstance(cd, CData):
        print('CData contents: %r' % cd)

Ryan Heathcote · Accepted Answer

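The problem appears to be that the default parser doesn't handle CDATA the way you expect. When you don't name a parser, Beautiful Soup picks the best HTML parser it finds installed (lxml first, then html5lib, then Python's built-in html.parser), and the one chosen here seems to discard the CDATA section. If you ask for html.parser explicitly, the CDATA shows up:

soup = BeautifulSoup(txt, 'html.parser')

For more information on parsers, see the Beautiful Soup docs.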
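For completeness, here is a minimal sketch of the question's snippet with the parser named explicitly. The helper name extract_cdata is my own, not part of bs4, and I've used find_all(string=True), the newer spelling of findAll(text=True), which assumes Beautiful Soup 4.4 or later.

```python
from bs4 import BeautifulSoup, CData

txt = '''<foobar>We have
         <![CDATA[some data here]]>
         and more.
         </foobar>'''


def extract_cdata(markup):
    """Return the text of every CDATA section found in markup.

    extract_cdata is a hypothetical helper for illustration only.
    """
    # Name the parser explicitly so the result does not depend on
    # which optional parsers (lxml, html5lib) happen to be installed.
    soup = BeautifulSoup(markup, 'html.parser')
    # string=True matches every text node; CData is a NavigableString
    # subclass, so isinstance() picks out just the CDATA sections.
    return [str(node) for node in soup.find_all(string=True)
            if isinstance(node, CData)]


print(extract_cdata(txt))
```

Naming the parser also stops the "no parser was explicitly specified" warning that bs4 emits when you leave it out.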

I found this by running the diagnose() function, which the docs recommend:

If you have questions about Beautiful Soup, or run into problems, send mail to the discussion group. If your problem involves parsing an HTML document, be sure to mention what the diagnose() function says about that document.

The diagnose() function prints how each installed parser interprets your HTML, which lets you pick the right parser for your use case.
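Something along these lines should reproduce the check. diagnose() lives in bs4.diagnose; the cdata_by_parser helper is a hypothetical addition of mine that narrows the comparison to CDATA only and marks parsers that aren't installed as None. Treat it as a sketch, since what each parser does with CDATA may vary between versions.

```python
from bs4 import BeautifulSoup, CData, FeatureNotFound
from bs4.diagnose import diagnose

txt = '''<foobar>We have
         <![CDATA[some data here]]>
         and more.
         </foobar>'''


def cdata_by_parser(markup, parsers=('html.parser', 'lxml', 'html5lib')):
    """Map each parser name to the CDATA strings it keeps.

    Hypothetical helper for illustration; a parser that is not
    installed maps to None.
    """
    found = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            found[name] = None  # this parser is not installed
            continue
        found[name] = [str(node) for node in soup.find_all(string=True)
                       if isinstance(node, CData)]
    return found


# Prints a report of how each installed parser handles the markup.
diagnose(txt)

# Narrower check: which parsers keep the CDATA section as a CData node?
for name, sections in cdata_by_parser(txt).items():
    print(name, '->', sections)
```

An empty list for a parser means it parsed the document but produced no CData nodes, which is the symptom in the question.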

Tags: python, python-3.x, beautifulsoup, cdata

user2694306
