Getting uneven elements in XML file into dataframe

Question

I have been trying to get an XML file into a dataframe but am struggling, I have tried a few approaches and this is where I am at.

My XML file looks like 20k segments of this:

<?xml version="1.0"?>
<data experimentId="5244" savingTime="2018-01-06T14:25:48-0500" eventType="Workflow" userId="303">
  <root>
    <set id="ASSAY_WORKFLOW">
      <row state="MODIFIED" pk="5905_Standard_Validation_Standard_Validation">
        <field name="ASSAY_ID">5244</field>
        <field name="WORKFLOW_ID">5905_Standard_Validation_Standard_Validation</field>
        <field name="WORKFLOW_STATE">0</field>
        <field name="ASSAY_WORKFLOW_STATE">InDelegation</field>
        <field name="WORKFLOW_LAST_STEP_ID">17896</field>
      </row>
    </set>
    <set id="WORKFLOW_STEPS">
      <row state="NEW" pk="17896">
        <field name="STEP_ID">17896</field>
        <field name="WORKFLOW_ID">5905_Standard_Validation_Standard_Validation</field>
        <field name="STEP_DATE">2018-01-06T14:25:45-0500</field>
        <field name="STEP_DATE_TZ">America/New_York</field>
        <field name="USER_ID">303</field>
        <field name="USER_FULL_NAME">Ron Swanson</field>
        <field name="NEW_WORKFLOW_ASSAY_STATE">InDelegation</field>
        <field name="FORMER_WORKFLOW_ASSAY_STATE">Draft</field>
        <field name="ROLE_ID">1</field>
      </row>
    </set>
    <set id="WORKFLOW_STEP_VARIABLES">
      <row state="NEW" pk="17896&#xA4;nextActorId">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">nextActorId</field>
        <field name="VALUE">2</field>
      </row>
      <row state="NEW" pk="17896&#xA4;validateToPendingValidation">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">validateToPendingValidation</field>
        <field name="VALUE">false</field>
      </row>
      <row state="NEW" pk="17896&#xA4;signToPendingSignature">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">signToPendingSignature</field>
        <field name="VALUE">false</field>
      </row>
      <row state="NEW" pk="17896&#xA4;comment">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">comment</field>
        <field name="VALUE">GH-VAP, IgG1 repeats,</field>
      </row>
      <row state="NEW" pk="17896&#xA4;actionDelegateU">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">actionDelegateU</field>
        <field name="VALUE">directDelegateU</field>
      </row>
    </set>
    <set id="WORKFLOW_ROLE_NAMES">
      <row state="NEW" pk="1">
        <field name="ROLE_ID">1</field>
        <field name="LANGUAGE_ID">2</field>
        <field name="DESCRIPTION">Author</field>
      </row>
    </set>
  </root>
</data>

For each root node there are child elements that have the same tag "field" with attribute "name". The values of which identify the value and name of the column I want it under in my data frame.

I can get everything out with this:

library(XML)

xmlfilealt <- xmlParse("data/eln_audit_workflow.xml")

username <- xpathSApply(xmlfilealt, "//field[@name='USER_FULL_NAME']", xmlValue)
title <- xpathSApply(xmlfilealt, "//field[@name='VALUE']", xmlValue)
state <- xpathSApply(xmlfilealt, "//field[@name='ASSAY_WORKFLOW_STATE']", xmlValue)
actionDate <- xpathSApply(xmlfilealt, "//field[@name='STEP_DATE']", xmlValue)
actor <- xpathSApply(xmlfilealt, "//field[@name='DESCRIPTION']", xmlValue)

I planned to create a data.frame with them but the vectors are all slighty different lengths which I assume is because there are probably some missing elements in some of the root nodes. Can someone clue me in on how to handle this?

Thanks

Parfait · Accepted Answer

For children elements that may or may not be present, consider iterating across the parent node, here being <row>, by node position. Then, build lists of dataframes casting each column to the needed value or zero length string using XPath's concat to always return a result for equal length columns. Finally rbind all dataframes of list for final.

row_length <- length(xpathSApply(xmlfilealt, "//row"))

df_List <- lapply(seq(row_length), function(i){

  data.frame(
    username = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='USER_FULL_NAME'],'')", i), xmlValue),
    title = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='VALUE'],'')", i), xmlValue),
    state = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='ASSAY_WORKFLOW_STATE'],'')", i), xmlValue),
    actionDate = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='STEP_DATE'],'')", i), xmlValue),
    actor = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='DESCRIPTION'],'')", i), xmlValue),
    stringsAsFactors = FALSE
  )

})

# CONCATENATE ALL DFs 
finaldf <- do.call(rbind, df_List)

# CONVERT ZERO-LENGTH STRINGS TO NA
finaldf[] <- sapply(finaldf, function(col) ifelse(col=='', NA, col))

finaldf
#      username                 title        state               actionDate  actor
# 1 Ron Swanson                     2 InDelegation 2018-01-06T14:25:45-0500 Author
# 2        <NA>                 false         <NA>                     <NA>   <NA>
# 3        <NA>                 false         <NA>                     <NA>   <NA>
# 4        <NA> GH-VAP, IgG1 repeats,         <NA>                     <NA>   <NA>
# 5        <NA>       directDelegateU         <NA>                     <NA>   <NA>
# 6        <NA>                  <NA>         <NA>                     <NA>   <NA>
# 7        <NA>                  <NA>         <NA>                     <NA>   <NA>
# 8        <NA>                  <NA>         <NA>                     <NA>   <NA>

Dave2e · Answer

This xml is very inconsistent and difficult to parse out in a consistent manner. I prefer to use the xml2 package, since I find the syntax easier to use.

library(xml2)

# parse all of the root nodes into separate nodes
rootnodes<-xml_find_all(page, "root")

# read the desired fields from each individual root nodes
a<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='ASSAY_WORKFLOW_STATE']"))})
b<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='STEP_DATE']"))})
c<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='USER_FULL_NAME']"))})
d<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='DESCRIPTION']"))})

#Create the desired output
df=data.frame(assaystate = a, stepdate=b, name = c, description = d)

The advantage of this approach is the expectation of each root node should contain each desired field. xml2's xml_find_first will return NA if that field/node is missing thus all the results will be the same size.

I left out the value field since there were at least fields with the value attribute and it was not clear if one or all of those values were required.

Getting uneven elements in XML file into dataframe

Tags:

r

xml

r3vdev

2 Answers

Parfait

Dave2e

Recent Activity

Donate For Us

Getting uneven elements in XML file into dataframe

Tags:

r

xml

r3vdev

2 Answers

Parfait

Dave2e

Related questions

Recent Activity

Donate For Us