Parse RSS Feeds with variable XML structures in R

Tags: parsing, r, xml, rss

I am an XML novice trying to scrape and parse the following RSS feed: http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml. Along the way, I ran into two questions:

1) I would like to extract the nodes of individual news stories using xmlChildren on the parsed document as follows:

library(RCurl)
library(XML)
xml.url <- "http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml"
script <- getURL(xml.url)
doc <- xmlParse(script)
doc.children = xpathApply(doc,"//entry",xmlChildren)

Although this procedure works well on other feeds, where the individual news releases are stored in <item> nodes, it does not work in this particular case with <entry> nodes: it returns an empty list. I am stuck here, as I cannot figure out what I am missing in the structure of the XML document.

2) More generally: can I extend this approach to handle both cases, where the individual news stories sit either in <item> nodes or in <entry> nodes, without knowing the particular structure in advance?

Any help is very much appreciated, thank you.

Nico21, asked Nov 29 '25 14:11


2 Answers

You'll need to work with namespaces: the feed's elements live in a default namespace, so an unprefixed //entry matches nothing. Here are XML and xml2 options:

# XML (uses `doc` as parsed in the question)
ns <- xmlNamespaceDefinitions(doc, simplify=TRUE)
names(ns)[1] <- "x"  # give the unnamed default namespace a usable prefix
nodes <- getNodeSet(doc, "//x:entry", namespaces=ns)

# xml2
library(xml2)

XML_URL <- "http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml"
doc <- read_xml(XML_URL)
ns <- xml_ns_rename(xml_ns(doc), d1="x")  # xml2 labels the default namespace d1
xml_find_all(doc, "//x:entry", ns=ns)
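
To see why a bare //entry comes back empty, here is a minimal sketch on a small inline Atom snippet. The feed text below is invented for illustration and is not taken from the feed in the question. It assumes the xml2 package is installed.

```r
library(xml2)

# Hypothetical two-entry Atom document, invented for illustration
atom_txt <- '<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example</title>
  <entry><title>First</title></entry>
  <entry><title>Second</title></entry>
</feed>'

atom_doc <- read_xml(atom_txt)

# Unprefixed name: XPath looks for <entry> in no namespace and finds nothing
no_ns <- xml_find_all(atom_doc, "//entry")

# xml2 exposes the default namespace under the prefix d1
with_ns <- xml_find_all(atom_doc, "//d1:entry", ns = xml_ns(atom_doc))

length(no_ns)
length(with_ns)
```

The same reasoning applies to the XML package: the default namespace has to be given a prefix before XPath can refer to it.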

Look at using the boolean() XPath function to be able to handle multiple cases (i.e. the different feed formats).
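
One possible way to cover both feed types without knowing the format in advance is to match on each element's local name, which ignores namespaces entirely. This is a sketch that uses the XPath local-name() function instead of boolean(). The helper name feed_items and both inline feeds are made up for illustration.

```r
library(xml2)

# Hypothetical helper: return the story nodes of an RSS or Atom document
feed_items <- function(doc) {
  xml_find_all(doc, "//*[local-name()='item' or local-name()='entry']")
}

# Invented RSS-style document: stories in <item>, no namespace
rss_doc <- read_xml('<rss version="2.0"><channel>
  <item><title>RSS story</title></item>
</channel></rss>')

# Invented Atom-style document: stories in <entry>, default namespace
atom_doc <- read_xml('<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Atom story</title></entry>
</feed>')

# The title lookup also uses local-name() so it works in both documents
rss_titles  <- xml_text(xml_find_first(feed_items(rss_doc),  "./*[local-name()='title']"))
atom_titles <- xml_text(xml_find_first(feed_items(atom_doc), "./*[local-name()='title']"))
```

The trade-off is that local-name() matches elements from any namespace, so a feed that embeds unrelated <item> or <entry> elements would need a stricter expression.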

hrbrmstr, answered Dec 01 '25 04:12


This may not exactly answer your question, but did you consider using a ready-made package like tm.plugin.webmining?

If you do not want to use the package, you can still inspect the code and see how they parsed the data.

Karsten W., answered Dec 01 '25 03:12


