
How to use a local DTD file to parse an XML file using lxml?

Tags: python, xml, lxml, dtd

I am trying to parse the DBLP data set using lxml in Python, but it fails with this error:

lxml.etree.XMLSyntaxError: Entity 'uuml' not defined, line 54, column 43

DBLP provides a DTD file that defines these entities. How can I use a local copy of that file when parsing the DBLP XML document?

Here is my current code:

import sqlite3
import sys

from lxml import etree as ET

filename = sys.argv[1]
dtd_name = sys.argv[2]
db_name = sys.argv[3]

conn = sqlite3.connect(db_name)

dblp_record_types_for_publications = ('article', 'inproceedings', 'proceedings', 'book', 'incollection',
    'phdthesis', 'masterthesis', 'www')

# read dtd
dtd = ET.DTD(dtd_name) #pylint: disable=E1101

# get an iterable
context = ET.iterparse(filename, events=('start', 'end'), load_dtd=True, #pylint: disable=E1101
    resolve_entities=True) 

# turn it into an iterator
context = iter(context)

# get the root element
event, root = next(context)

n_records_parsed = 0
for event, elem in context:
    if event == 'end' and elem.tag in dblp_record_types_for_publications:
        pub_year = None
        for year in elem.findall('year'):
            pub_year = year.text
        if pub_year is None:
            continue

        pub_title = None
        for title in elem.findall('title'):
            pub_title = title.text
        if pub_title is None:
            continue

        pub_authors = []
        for author in elem.findall('author'):
            if author.text is not None:
                pub_authors.append(author.text)

        # print(pub_year)
        # print(pub_title)
        # print(pub_authors)
        # insert the publication, authors in sql tables
        # parameterized queries handle quoting, so no manual escaping is needed
        conn.execute("INSERT OR IGNORE INTO publications VALUES (?, ?)",
            (pub_title, pub_year))
        for author in pub_authors:
            conn.execute("INSERT OR IGNORE INTO authors VALUES (?)", (author,))
            conn.execute("INSERT INTO authored VALUES (?, ?)", (author, pub_title))

        elem.clear()
        root.clear()

        n_records_parsed += 1
        print("No. of records parsed: {}".format(n_records_parsed))

conn.commit()
conn.close()
Asked by In78, Oct 21 '25 13:10


1 Answer

As suggested by mzjn in the comments, I placed the DTD file in the same directory as the XML file and made sure its filename matches the name given in the XML document's doctype declaration (<!DOCTYPE dblp SYSTEM "dblp.dtd">). With that in place, lxml no longer raises the syntax error.
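A minimal sketch of that setup, for anyone who wants to reproduce it without downloading DBLP. The two tiny files and the parse_authors helper are made up for illustration; only the doctype line and the iterparse keyword arguments come from the question. It assumes lxml resolves the relative system identifier "dblp.dtd" against the location of the XML file, which is what the fix above relies on.

```python
import os
import tempfile

from lxml import etree

# Made-up stand-ins for dblp.dtd and dblp.xml (not the real DBLP files).
DTD_TEXT = '<!ENTITY uuml "&#252;">\n'
XML_TEXT = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE dblp SYSTEM "dblp.dtd">\n'
    '<dblp><article><author>J&uuml;rgen</author></article></dblp>\n'
)


def parse_authors(xml_path):
    """Hypothetical helper: collect author names while streaming the file."""
    authors = []
    # load_dtd=True makes the parser read the DTD named in the doctype;
    # resolve_entities=True replaces &uuml; with its defined text.
    for _event, elem in etree.iterparse(
        xml_path, events=("end",), load_dtd=True, resolve_entities=True
    ):
        if elem.tag == "author":
            authors.append(elem.text)
    return authors


with tempfile.TemporaryDirectory() as workdir:
    # Both files sit in the same directory, and the DTD filename matches
    # the name used in the doctype declaration.
    with open(os.path.join(workdir, "dblp.dtd"), "w", encoding="utf-8") as f:
        f.write(DTD_TEXT)
    xml_path = os.path.join(workdir, "dblp.xml")
    with open(xml_path, "w", encoding="utf-8") as f:
        f.write(XML_TEXT)
    authors = parse_authors(xml_path)

print(authors)
```

If the DTD is renamed or moved so that it no longer matches the doctype declaration, the same call should fail again with the "Entity 'uuml' not defined" error from the question.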

Answered by In78, Oct 23 '25 03:10


