Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Getting uneven elements in XML file into dataframe

Tags:

r

xml

I have been trying to get an XML file into a dataframe but am struggling, I have tried a few approaches and this is where I am at.

My XML file looks like 20k segments of this:

<?xml version="1.0"?>
<data experimentId="5244" savingTime="2018-01-06T14:25:48-0500" eventType="Workflow" userId="303">
  <root>
    <set id="ASSAY_WORKFLOW">
      <row state="MODIFIED" pk="5905_Standard_Validation_Standard_Validation">
        <field name="ASSAY_ID">5244</field>
        <field name="WORKFLOW_ID">5905_Standard_Validation_Standard_Validation</field>
        <field name="WORKFLOW_STATE">0</field>
        <field name="ASSAY_WORKFLOW_STATE">InDelegation</field>
        <field name="WORKFLOW_LAST_STEP_ID">17896</field>
      </row>
    </set>
    <set id="WORKFLOW_STEPS">
      <row state="NEW" pk="17896">
        <field name="STEP_ID">17896</field>
        <field name="WORKFLOW_ID">5905_Standard_Validation_Standard_Validation</field>
        <field name="STEP_DATE">2018-01-06T14:25:45-0500</field>
        <field name="STEP_DATE_TZ">America/New_York</field>
        <field name="USER_ID">303</field>
        <field name="USER_FULL_NAME">Ron Swanson</field>
        <field name="NEW_WORKFLOW_ASSAY_STATE">InDelegation</field>
        <field name="FORMER_WORKFLOW_ASSAY_STATE">Draft</field>
        <field name="ROLE_ID">1</field>
      </row>
    </set>
    <set id="WORKFLOW_STEP_VARIABLES">
      <row state="NEW" pk="17896&#xA4;nextActorId">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">nextActorId</field>
        <field name="VALUE">2</field>
      </row>
      <row state="NEW" pk="17896&#xA4;validateToPendingValidation">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">validateToPendingValidation</field>
        <field name="VALUE">false</field>
      </row>
      <row state="NEW" pk="17896&#xA4;signToPendingSignature">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">signToPendingSignature</field>
        <field name="VALUE">false</field>
      </row>
      <row state="NEW" pk="17896&#xA4;comment">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">comment</field>
        <field name="VALUE">GH-VAP, IgG1 repeats,</field>
      </row>
      <row state="NEW" pk="17896&#xA4;actionDelegateU">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">actionDelegateU</field>
        <field name="VALUE">directDelegateU</field>
      </row>
    </set>
    <set id="WORKFLOW_ROLE_NAMES">
      <row state="NEW" pk="1">
        <field name="ROLE_ID">1</field>
        <field name="LANGUAGE_ID">2</field>
        <field name="DESCRIPTION">Author</field>
      </row>
    </set>
  </root>
</data>

For each root node there are child elements that have the same tag "field" with attribute "name". The values of which identify the value and name of the column I want it under in my data frame.

I can get everything out with this:

library(XML)

xmlfilealt <- xmlParse("data/eln_audit_workflow.xml")

username <- xpathSApply(xmlfilealt, "//field[@name='USER_FULL_NAME']", xmlValue)
title <- xpathSApply(xmlfilealt, "//field[@name='VALUE']", xmlValue)
state <- xpathSApply(xmlfilealt, "//field[@name='ASSAY_WORKFLOW_STATE']", xmlValue)
actionDate <- xpathSApply(xmlfilealt, "//field[@name='STEP_DATE']", xmlValue)
actor <- xpathSApply(xmlfilealt, "//field[@name='DESCRIPTION']", xmlValue)

I planned to create a data.frame with them but the vectors are all slighty different lengths which I assume is because there are probably some missing elements in some of the root nodes. Can someone clue me in on how to handle this?

Thanks

like image 527
r3vdev Avatar asked Dec 20 '25 07:12

r3vdev


2 Answers

For children elements that may or may not be present, consider iterating across the parent node, here being <row>, by node position. Then, build lists of dataframes casting each column to the needed value or zero length string using XPath's concat to always return a result for equal length columns. Finally rbind all dataframes of list for final.

row_length <- length(xpathSApply(xmlfilealt, "//row"))

df_List <- lapply(seq(row_length), function(i){

  data.frame(
    username = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='USER_FULL_NAME'],'')", i), xmlValue),
    title = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='VALUE'],'')", i), xmlValue),
    state = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='ASSAY_WORKFLOW_STATE'],'')", i), xmlValue),
    actionDate = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='STEP_DATE'],'')", i), xmlValue),
    actor = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='DESCRIPTION'],'')", i), xmlValue),
    stringsAsFactors = FALSE
  )

})

# CONCATENATE ALL DFs 
finaldf <- do.call(rbind, df_List)

# CONVERT ZERO-LENGTH STRINGS TO NA
finaldf[] <- sapply(finaldf, function(col) ifelse(col=='', NA, col))

finaldf
#      username                 title        state               actionDate  actor
# 1 Ron Swanson                     2 InDelegation 2018-01-06T14:25:45-0500 Author
# 2        <NA>                 false         <NA>                     <NA>   <NA>
# 3        <NA>                 false         <NA>                     <NA>   <NA>
# 4        <NA> GH-VAP, IgG1 repeats,         <NA>                     <NA>   <NA>
# 5        <NA>       directDelegateU         <NA>                     <NA>   <NA>
# 6        <NA>                  <NA>         <NA>                     <NA>   <NA>
# 7        <NA>                  <NA>         <NA>                     <NA>   <NA>
# 8        <NA>                  <NA>         <NA>                     <NA>   <NA>
like image 168
Parfait Avatar answered Dec 22 '25 00:12

Parfait


This xml is very inconsistent and difficult to parse out in a consistent manner. I prefer to use the xml2 package, since I find the syntax easier to use.

library(xml2)

# parse all of the root nodes into separate nodes
rootnodes<-xml_find_all(page, "root")

# read the desired fields from each individual root nodes
a<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='ASSAY_WORKFLOW_STATE']"))})
b<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='STEP_DATE']"))})
c<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='USER_FULL_NAME']"))})
d<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='DESCRIPTION']"))})

#Create the desired output
df=data.frame(assaystate = a, stepdate=b, name = c, description = d)

The advantage of this approach is the expectation of each root node should contain each desired field. xml2's xml_find_first will return NA if that field/node is missing thus all the results will be the same size.

I left out the value field since there were at least fields with the value attribute and it was not clear if one or all of those values were required.

like image 35
Dave2e Avatar answered Dec 22 '25 00:12

Dave2e



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!