Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Flatten xml into pandas dataframe, deeply nested

Tags:

python

pandas

xml

I'm thinking this might be very easy, and I simply not figured it out yet.

The objective is to 'flatten' into a pandas DataFrame.

Here is one xml (A direct download of a 60~ MB zip file, which extracted inflates to around 800~ MB).

I have tried the following 2 approaches:

The first one, taken from here, has been modified a little bit:

def xml2dfa(xml_data):
    tree = ET.parse(xml_data)
    root = tree.getroot()[1] # Modification here
    all_records = []
    headers = []
    for i, child in enumerate(root):
        record = []
        for subchild in child:
            record.append(subchild.text)
            if subchild.tag not in headers:
                headers.append(subchild.tag)
        all_records.append(record)
    return pd.DataFrame(all_records, columns=headers)

Line 3 (root) was modified to get the element LEIRecords rather than LEIHeader

The previous results in a DataFrame of correct number of rows but only 4 columns:

array(['{http://www.leiroc.org/data/schema/leidata/2014}LEI',
   '{http://www.leiroc.org/data/schema/leidata/2014}Entity',
   '{http://www.leiroc.org/data/schema/leidata/2014}Registration',
   '{http://www.leiroc.org/data/schema/leidata/2014}Extension'], dtype=object)

From columns 2 to 4 there are still nested children with information that could be extracted, but all of the information is lost, as the unique value of any column is an array that looks like this:

array(['\n        '], dtype=object)

The second approach I have been running for at least 16 hours, with no result, so something is not right. I took that from here.

The expected output would be a DataFrame that is completely flat and for whatever information is not there (because a particular tree branch did not go that far, or was not populated, filled with NaN (as in this question)

like image 341
erasmortg Avatar asked Oct 28 '25 12:10

erasmortg


1 Answers

I was facing a similar problem. I had xml from ebscohost about research articles returned from a search.

Using xmltodict https://github.com/martinblech/xmltodict

import xmltodict

with open(filename) as fd:
    doc = xmltodict.parse(fd.read())

This converted the xml to a nested dict

Using the example code from the stack overflow link,

def flatten_dict(dd, separator='_', prefix=''):
    return { prefix + separator + k if prefix else k : v
             for kk, vv in dd.items()
             for k, v in flatten_dict(vv, separator, kk).items()
             } if isinstance(dd, dict) else { prefix : dd }

I flattened the dict at the level of individual articles (two levels down in my case - doc['records']['rec'])

flattened_doc = [flatten_dict(x) for x in doc['records']['rec']]

and then made a Dataframe from the resulting list

data1 = pd.DataFrame(flattened_doc)

Some of the columns still contain dicts, but it is at a level that I don't care about. The function to flatten the dict will only flatten two levels down as it is written.

like image 170
memebrain Avatar answered Oct 30 '25 02:10

memebrain



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!