How to parse a sub-path of an XML file using the xml2 package

Tags: r, xml, tidyverse

I need to parse the XML page at https://www.uniprot.org/uniprot/P09429.xml using xml2.

However, with the following code I cannot get the list under the subcellularLocation XPath:

library(xml2)
library(magrittr)  # for the %>% pipe used below
xmlfile <- "https://www.uniprot.org/uniprot/P09429.xml"

doc <- xmlfile %>%
  xml2::read_xml()

xml_name(doc)
xml_children(doc)
x <- xml_find_all(doc, "//subcellularLocation")
xml_path(x)
# character(0)

What is the right way to do it?


Update

The desired output is a vector:

[1] "Nucleus"                                                   
[2] "Chromosome"                                                
[3] "Cytoplasm"                                                 
[4] "Secreted"                                                  
[5] "Cell membrane"
[6] "Peripheral membrane protein" 
[7] "Extracellular side"
[8] "Endosome"                                                  
[9] "Endoplasmic reticulum-Golgi intermediate compartment"  
scamander asked Oct 24 '25 09:10

1 Answer

Use x <- xml_find_all(doc, "//d1:subcellularLocation")

When you hit a problem like this, checking the documentation is the first thing to do. Run ?xml_find_all and you will see this example at the end of the help page:

# Namespaces ---------------------------------------------------------------
# If the document uses namespaces, you'll need use xml_ns to form
# a unique mapping between full namespace url and a short prefix
x <- read_xml('
 <root xmlns:f = "http://foo.com" xmlns:g = "http://bar.com">
   <f:doc><g:baz /></f:doc>
   <f:doc><g:baz /></f:doc>
 </root>
')
xml_find_all(x, ".//f:doc")
xml_find_all(x, ".//f:doc", xml_ns(x))

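If you then check xml_ns(doc), you will find that the document's default namespace is mapped to the prefix d1, which is why the element has to be addressed as d1:subcellularLocation: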

d1  <-> http://uniprot.org/uniprot
xsi <-> http://www.w3.org/2001/XMLSchema-instance
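
To see why the unprefixed query comes back empty, here is a minimal sketch on a small inline document. The XML below is a hypothetical miniature, not the real UniProt file; it only assumes that the root declares a default namespace, as the xml_ns(doc) output above suggests. The child element names are illustrative.

```r
library(xml2)

# Hypothetical miniature document: the root declares a default namespace,
# so no element carries a prefix in the markup itself.
doc <- read_xml('
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <subcellularLocation>
      <location>Nucleus</location>
    </subcellularLocation>
    <subcellularLocation>
      <location>Cell membrane</location>
      <topology>Peripheral membrane protein</topology>
    </subcellularLocation>
  </entry>
</uniprot>')

# xml2 maps the default namespace to a generated prefix (d1).
ns <- xml_ns(doc)

# An unprefixed name only matches elements in no namespace: nothing is found.
no_prefix <- xml_find_all(doc, "//subcellularLocation")

# With the d1 prefix, the namespaced elements are matched.
with_prefix <- xml_find_all(doc, "//d1:subcellularLocation", ns)

length(no_prefix)
length(with_prefix)
```

Passing ns explicitly is optional here, since xml_find_all() uses xml_ns(doc) by default; it is spelled out only to make the prefix mapping visible.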

Update

# The pipe must end each line; a line that starts with %>% is a syntax error.
xml_find_all(doc, "//d1:subcellularLocation") %>%
  xml_children() %>%
  xml_text()

## [1] "Nucleus"                                             
## [2] "Chromosome"                                          
## [3] "Cytoplasm"                                           
## [4] "Secreted"                                            
## [5] "Cell membrane"                                       
## [6] "Peripheral membrane protein"                         
## [7] "Extracellular side"                                  
## [8] "Endosome"                                            
## [9] "Endoplasmic reticulum-Golgi intermediate compartment"
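
If you would rather not depend on the generated d1 prefix, there are a couple of alternatives. The sketch below runs on a hypothetical inline document (not the real UniProt file, and the child element names are illustrative) and compares three ways of reaching the same text nodes: the d1 prefix with the children selected in the XPath itself, a local-name() test that ignores namespaces, and xml_ns_strip(), which removes the default namespace from the document in place.

```r
library(xml2)

# Hypothetical miniature document with a default namespace on the root.
doc <- read_xml('
<uniprot xmlns="http://uniprot.org/uniprot">
  <subcellularLocation>
    <location>Nucleus</location>
  </subcellularLocation>
  <subcellularLocation>
    <location>Cell membrane</location>
    <topology>Peripheral membrane protein</topology>
  </subcellularLocation>
</uniprot>')

# Option 1: keep the d1 prefix and select the child elements in the XPath.
via_prefix <- xml_text(xml_find_all(doc, "//d1:subcellularLocation/*"))

# Option 2: match on the local element name, ignoring the namespace.
via_local <- xml_text(
  xml_find_all(doc, "//*[local-name()='subcellularLocation']/*")
)

# Option 3: strip the default namespace, then query without any prefix.
# Note that xml_ns_strip() modifies doc itself, so it is done last here.
xml_ns_strip(doc)
via_strip <- xml_text(xml_find_all(doc, "//subcellularLocation/*"))

via_prefix
```

Stripping the namespace is convenient for one-off scripts, but it changes the document, so the prefixed form is the safer choice if the document is written back out or shared with other code.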
VicaYang answered Oct 26 '25 01:10