I am an XML novice trying to scrape and parse the following RSS feed: http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml. Along the way, I ran into two questions:
1) I would like to extract the nodes of individual news stories using xmlChildren on the parsed document as follows:
library(RCurl)
library(XML)
xml.url <- "http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml"
script <- getURL(xml.url)
doc <- xmlParse(script)
doc.children = xpathApply(doc,"//entry",xmlChildren)
Although this procedure works well on other feeds, where the individual news releases are stored in <item> nodes, it does not work in this particular case, where they are stored in <entry> nodes: it returns an empty list. I am stuck here, as I cannot figure out what I am missing in the structure of the XML document.
2) More generally: can I adapt this approach to handle both cases, where the individual news stories are stored either in <item> nodes or in <entry> nodes, without knowing the particular structure in advance?
Any help is very much appreciated, thank you.
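You'll need to work with namespaces. The feed declares a default namespace, so an unprefixed //entry matches nothing; register a prefix for that namespace and use it in the XPath expression. Here are XML and xml2 options: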
# XML
ns <- xmlNamespaceDefinitions(doc, simplify=TRUE)
names(ns)[1] <- "x"
nodes <- getNodeSet(doc, "//x:entry", namespaces=ns)
# xml2
library(xml2)
XML_URL <- "http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml"
doc <- read_xml(XML_URL)
ns <- xml_ns_rename(xml_ns(doc), d1="x")
xml_find_all(doc, "//x:entry", ns=ns)
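To see the namespace effect without relying on the live feed, here is a minimal sketch with the XML package. The inline Atom document and its titles are made up for illustration, and the namespace URI is assumed to be the standard Atom one; the real feed may declare additional namespaces.

```r
library(XML)

# Hypothetical miniature Atom feed, inlined so no network access is needed.
atom_txt <- paste0(
  '<feed xmlns="http://www.w3.org/2005/Atom">',
  '<entry><title>First</title><id>1</id></entry>',
  '<entry><title>Second</title><id>2</id></entry>',
  '</feed>'
)
doc <- xmlParse(atom_txt, asText = TRUE)

# An unprefixed name only matches elements in no namespace, so this finds nothing.
plain <- getNodeSet(doc, "//entry")

# Bind an arbitrary prefix ("x") to the default namespace URI and use it in the XPath.
ns <- c(x = "http://www.w3.org/2005/Atom")
entries <- getNodeSet(doc, "//x:entry", namespaces = ns)

# The children of each entry, as the question wanted from xmlChildren.
children <- lapply(entries, xmlChildren)

# Child elements live in the same namespace, so they need the prefix as well.
titles <- xpathSApply(doc, "//x:entry/x:title", xmlValue, namespaces = ns)
```

The prefix name is arbitrary; what matters is that it is bound to the same URI the document declares.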
Look at using the XPath boolean() function to handle multiple cases (i.e. the different feed formats).
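One possible way to cover both feed formats in a single query is an XPath union combined with local-name(), which ignores namespaces. This is a different technique from boolean(), shown here only as a sketch: find_stories is a hypothetical helper, and both inline feeds are made up for illustration.

```r
library(xml2)

# Hypothetical miniature feeds, inlined so no network access is needed.
atom_txt <- paste0(
  '<feed xmlns="http://www.w3.org/2005/Atom">',
  '<entry><title>A1</title></entry>',
  '<entry><title>A2</title></entry>',
  '</feed>'
)
rss_txt <- paste0(
  '<rss version="2.0"><channel>',
  '<item><title>R1</title></item>',
  '</channel></rss>'
)

# Hypothetical helper: return the story nodes whether the feed uses
# un-namespaced <item> elements or <entry> elements in any namespace.
find_stories <- function(doc) {
  xml_find_all(doc, "//item | //*[local-name() = 'entry']")
}

atom_stories <- find_stories(read_xml(atom_txt))
rss_stories <- find_stories(read_xml(rss_txt))

# Number of stories found in each feed.
length(atom_stories)
length(rss_stories)
```

Matching on local-name() trades precision for convenience: it would also pick up an <entry> element from an unrelated namespace, so registering the namespace explicitly is safer when the feed format is known.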
This may not exactly answer your question, but have you considered using a ready-made package like tm.plugin.webmining?
If you do not want to use the package, you can still inspect its code to see how it parses the data.