Python: Parsing SGML

Question

I'm trying to parse some SGML like the following in Python:

<!DOCTYPE lewis SYSTEM "lewis.dtd">
<TEXT>
    <TITLE>One</TITLE>
    <BODY>Sample One</BODY>
</TEXT>
<TEXT>
    <TITLE>Two</TITLE>
    <BODY>Sample Two</BODY>
</TEXT>

Here, I'm just looking for everything inside the <BODY> tags (i.e. ["Sample One", "Sample Two"]).

I've tried using BeautifulSoup, but it doesn't like the <!DOCTYPE> on the first line and also expects everything to be wrapped in a single root tag like <everything></everything>. While I can make these changes manually before passing the text to BeautifulSoup, that feels a bit too hacky.

I'm pretty new to SGML, and also not married to BeautifulSoup, so I'm open to any suggestions.

(For those curious: my specific use case is the Reuters-21578 dataset.)

Anand S Kumar · Accepted Answer

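You can try using 'html.parser' as the parser instead of lxml-xml. lxml-xml expects the text to be well-formed XML, which is not the case here because there is no single root element. Note that html.parser lowercases tag names, which is why the example below searches for 'body' rather than 'BODY'.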

Example:

>>> from bs4 import BeautifulSoup
>>> s = """<!DOCTYPE lewis SYSTEM "lewis.dtd">
... <TEXT>
...     <TITLE>One</TITLE>
...     <BODY>Sample One</BODY>
... </TEXT>
... <TEXT>
...     <TITLE>Two</TITLE>
...     <BODY>Sample Two</BODY>
... </TEXT>"""
>>> soup = BeautifulSoup(s,'html.parser')
>>> soup.find_all('body')
[<body>Sample One</body>, <body>Sample Two</body>]
>>> [body.get_text() for body in soup.find_all('body')]
['Sample One', 'Sample Two']
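
If you would rather not depend on BeautifulSoup at all, the same lenient parsing is available directly from the standard library's html.parser module, which is what the 'html.parser' option uses underneath. The following is only a minimal sketch, not part of the original answer: the class name BodyCollector and its attributes are made up for illustration, and it assumes <BODY> elements are never nested.

```python
from html.parser import HTMLParser


class BodyCollector(HTMLParser):
    """Collect the text found inside each <BODY> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.bodies = []     # text of every completed <BODY> element
        self._chunks = None  # pieces of the <BODY> currently open, else None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BODY> arrives as "body".
        if tag == "body":
            self._chunks = []

    def handle_data(self, data):
        # Only keep text while we are inside a <BODY> element.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "body" and self._chunks is not None:
            self.bodies.append("".join(self._chunks))
            self._chunks = None


sgml = """<!DOCTYPE lewis SYSTEM "lewis.dtd">
<TEXT>
    <TITLE>One</TITLE>
    <BODY>Sample One</BODY>
</TEXT>
<TEXT>
    <TITLE>Two</TITLE>
    <BODY>Sample Two</BODY>
</TEXT>"""

collector = BodyCollector()
collector.feed(sgml)
collector.close()
print(collector.bodies)  # ['Sample One', 'Sample Two']
```

The parser simply skips the <!DOCTYPE> declaration and does not care that there is no root element, so the input needs no preprocessing. For a file as large as the real dataset, feed() can be called repeatedly with successive chunks.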

Tags:

python

parsing

xml-parsing

beautifulsoup

sgml

scip
