I have been trying to get an XML file into a dataframe but am struggling, I have tried a few approaches and this is where I am at.
My XML file looks like 20k segments of this:
<?xml version="1.0"?>
<data experimentId="5244" savingTime="2018-01-06T14:25:48-0500" eventType="Workflow" userId="303">
<root>
<set id="ASSAY_WORKFLOW">
<row state="MODIFIED" pk="5905_Standard_Validation_Standard_Validation">
<field name="ASSAY_ID">5244</field>
<field name="WORKFLOW_ID">5905_Standard_Validation_Standard_Validation</field>
<field name="WORKFLOW_STATE">0</field>
<field name="ASSAY_WORKFLOW_STATE">InDelegation</field>
<field name="WORKFLOW_LAST_STEP_ID">17896</field>
</row>
</set>
<set id="WORKFLOW_STEPS">
<row state="NEW" pk="17896">
<field name="STEP_ID">17896</field>
<field name="WORKFLOW_ID">5905_Standard_Validation_Standard_Validation</field>
<field name="STEP_DATE">2018-01-06T14:25:45-0500</field>
<field name="STEP_DATE_TZ">America/New_York</field>
<field name="USER_ID">303</field>
<field name="USER_FULL_NAME">Ron Swanson</field>
<field name="NEW_WORKFLOW_ASSAY_STATE">InDelegation</field>
<field name="FORMER_WORKFLOW_ASSAY_STATE">Draft</field>
<field name="ROLE_ID">1</field>
</row>
</set>
<set id="WORKFLOW_STEP_VARIABLES">
<row state="NEW" pk="17896¤nextActorId">
<field name="STEP_ID">17896</field>
<field name="VARIABLE_ID">nextActorId</field>
<field name="VALUE">2</field>
</row>
<row state="NEW" pk="17896¤validateToPendingValidation">
<field name="STEP_ID">17896</field>
<field name="VARIABLE_ID">validateToPendingValidation</field>
<field name="VALUE">false</field>
</row>
<row state="NEW" pk="17896¤signToPendingSignature">
<field name="STEP_ID">17896</field>
<field name="VARIABLE_ID">signToPendingSignature</field>
<field name="VALUE">false</field>
</row>
<row state="NEW" pk="17896¤comment">
<field name="STEP_ID">17896</field>
<field name="VARIABLE_ID">comment</field>
<field name="VALUE">GH-VAP, IgG1 repeats,</field>
</row>
<row state="NEW" pk="17896¤actionDelegateU">
<field name="STEP_ID">17896</field>
<field name="VARIABLE_ID">actionDelegateU</field>
<field name="VALUE">directDelegateU</field>
</row>
</set>
<set id="WORKFLOW_ROLE_NAMES">
<row state="NEW" pk="1">
<field name="ROLE_ID">1</field>
<field name="LANGUAGE_ID">2</field>
<field name="DESCRIPTION">Author</field>
</row>
</set>
</root>
</data>
For each root node there are child elements that have the same tag "field" with attribute "name". The values of which identify the value and name of the column I want it under in my data frame.
I can get everything out with this:
library(XML)
xmlfilealt <- xmlParse("data/eln_audit_workflow.xml")
username <- xpathSApply(xmlfilealt, "//field[@name='USER_FULL_NAME']", xmlValue)
title <- xpathSApply(xmlfilealt, "//field[@name='VALUE']", xmlValue)
state <- xpathSApply(xmlfilealt, "//field[@name='ASSAY_WORKFLOW_STATE']", xmlValue)
actionDate <- xpathSApply(xmlfilealt, "//field[@name='STEP_DATE']", xmlValue)
actor <- xpathSApply(xmlfilealt, "//field[@name='DESCRIPTION']", xmlValue)
I planned to create a data.frame with them but the vectors are all slighty different lengths which I assume is because there are probably some missing elements in some of the root nodes. Can someone clue me in on how to handle this?
Thanks
For children elements that may or may not be present, consider iterating across the parent node, here being <row>, by node position. Then, build lists of dataframes casting each column to the needed value or zero length string using XPath's concat to always return a result for equal length columns. Finally rbind all dataframes of list for final.
row_length <- length(xpathSApply(xmlfilealt, "//row"))
df_List <- lapply(seq(row_length), function(i){
data.frame(
username = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='USER_FULL_NAME'],'')", i), xmlValue),
title = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='VALUE'],'')", i), xmlValue),
state = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='ASSAY_WORKFLOW_STATE'],'')", i), xmlValue),
actionDate = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='STEP_DATE'],'')", i), xmlValue),
actor = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='DESCRIPTION'],'')", i), xmlValue),
stringsAsFactors = FALSE
)
})
# CONCATENATE ALL DFs
finaldf <- do.call(rbind, df_List)
# CONVERT ZERO-LENGTH STRINGS TO NA
finaldf[] <- sapply(finaldf, function(col) ifelse(col=='', NA, col))
finaldf
# username title state actionDate actor
# 1 Ron Swanson 2 InDelegation 2018-01-06T14:25:45-0500 Author
# 2 <NA> false <NA> <NA> <NA>
# 3 <NA> false <NA> <NA> <NA>
# 4 <NA> GH-VAP, IgG1 repeats, <NA> <NA> <NA>
# 5 <NA> directDelegateU <NA> <NA> <NA>
# 6 <NA> <NA> <NA> <NA> <NA>
# 7 <NA> <NA> <NA> <NA> <NA>
# 8 <NA> <NA> <NA> <NA> <NA>
This xml is very inconsistent and difficult to parse out in a consistent manner. I prefer to use the xml2 package, since I find the syntax easier to use.
library(xml2)
# parse all of the root nodes into separate nodes
rootnodes<-xml_find_all(page, "root")
# read the desired fields from each individual root nodes
a<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='ASSAY_WORKFLOW_STATE']"))})
b<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='STEP_DATE']"))})
c<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='USER_FULL_NAME']"))})
d<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='DESCRIPTION']"))})
#Create the desired output
df=data.frame(assaystate = a, stepdate=b, name = c, description = d)
The advantage of this approach is the expectation of each root node should contain each desired field. xml2's xml_find_first will return NA if that field/node is missing thus all the results will be the same size.
I left out the value field since there were at least fields with the value attribute and it was not clear if one or all of those values were required.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With