Parse RSS Feeds with variable XML structures in R

Question

I am an XML novice trying to scrape and parse the following RSS feed: http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml. Along the way, I ran into two questions:

1) I would like to extract the nodes of individual news stories using xmlChildren on the parsed document as follows:

library(RCurl)
library(XML)
xml.url <- "http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml"
script <- getURL(xml.url)
doc <- xmlParse(script)
doc.children <- xpathApply(doc, "//entry", xmlChildren)

Although this procedure works well on other feeds, where the individual news releases are stored in <item> nodes, it does not work in this particular case with <entry> nodes: it returns an empty list. I am stuck here, as I cannot figure out what I am missing in the structure of the XML document.

2) More generally: can I extend this approach to handle both cases, where the individual news stories are stored either in <item> nodes or in <entry> nodes, without knowing the particular structure in advance?

Any help is very much appreciated, thank you.

hrbrmstr · Accepted Answer

You'll need to work with namespaces. Here are XML and xml2 options:

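# XML
# Read the namespace definitions declared on the root node
ns <- xmlNamespaceDefinitions(doc, simplify=TRUE)
# The default namespace has no prefix; name it "x" so XPath can refer to it
names(ns)[1] <- "x"
nodes <- getNodeSet(doc, "//x:entry", namespaces=ns)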

# xml2
library(xml2)

XML_URL <- "http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml"
doc <- read_xml(XML_URL)
# xml2 labels the default namespace "d1"; rename it to "x"
ns <- xml_ns_rename(xml_ns(doc), d1="x")
xml_find_all(doc, "//x:entry", ns=ns)
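
To see why the un-prefixed //entry query comes back empty, here is a minimal sketch using a small inline Atom document. The feed text and the variable names (atom_txt, atom_doc, atom_ns and so on) are made up for illustration and are not taken from the feed in the question. Atom declares a default namespace on the root element, and XPath 1.0 treats un-prefixed names as belonging to no namespace, so the query only matches once a prefix is bound to that namespace.

```r
library(xml2)

# Hypothetical two-entry Atom feed, made up for illustration
atom_txt <- '<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry><title>First story</title></entry>
  <entry><title>Second story</title></entry>
</feed>'

atom_doc <- read_xml(atom_txt)

# Un-prefixed names only match elements in no namespace, so nothing is found
no_prefix <- xml_find_all(atom_doc, "//entry")

# Bind the default namespace (reported by xml2 as "d1") to the prefix "x"
atom_ns <- xml_ns_rename(xml_ns(atom_doc), d1 = "x")
with_prefix <- xml_find_all(atom_doc, "//x:entry", ns = atom_ns)

# Child elements live in the same namespace, so they need the prefix too
entry_titles <- xml_text(xml_find_first(with_prefix, "./x:title", ns = atom_ns))
```

Checking length(no_prefix) against length(with_prefix) on a real feed is a quick way to confirm that a namespace is the cause of an empty result.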

Look at using the XPath boolean() function to handle multiple cases (i.e., the different feed formats).
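
One way this could look with xml2 is sketched below. It is only an illustration under a few assumptions: feed_stories() and is_atom() are hypothetical helper names, not part of any package, and the two inline feeds are made up. The sketch matches on local-name(), which ignores namespaces, so the same query covers RSS <item> and Atom <entry> without a prefix mapping; boolean() is used separately to report which format was found.

```r
library(xml2)

# Hypothetical helper (not part of xml2): returns the story nodes of a feed,
# whether they are RSS <item> elements or Atom <entry> elements
feed_stories <- function(doc) {
  # local-name() ignores namespaces, so no prefix mapping is needed
  xml_find_all(doc, "//*[local-name()='item' or local-name()='entry']")
}

# Hypothetical helper: uses the XPath boolean() function to test for <entry>
is_atom <- function(doc) {
  xml_find_lgl(doc, "boolean(//*[local-name()='entry'])")
}

# Made-up RSS feed: no namespace, stories in <item>
rss_doc <- read_xml('<rss version="2.0"><channel>
  <item><title>RSS story</title></item>
</channel></rss>')

# Made-up Atom feed: default namespace, stories in <entry>
atom_doc <- read_xml('<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Atom story</title></entry>
</feed>')

rss_stories <- feed_stories(rss_doc)
atom_stories <- feed_stories(atom_doc)
```

The trade-off is that local-name() matches any element called item or entry regardless of namespace, which is convenient for unknown feeds but less strict than binding the Atom namespace explicitly.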

Karsten W. · Answer

This may not exactly answer your question, but have you considered using a ready-made package like tm.plugin.webmining?

If you do not want to use the package, you can still inspect the code and see how they parsed the data.

Tags: parsing, r, xml, rss

Nico21
