
Converting XHTML to XML in R

Tags:

r

xml

xhtml

I want to parse a court document that I downloaded, expecting XML. The response type, however, is application/xhtml+xml, and I get an error when I try to read this XHTML document as XML in R so that I can extract the information I need. See below. Can anyone help? Thank you.

resp_xml <- readRDS("had_NH_xml.rds")

# Load xml2 for parsing and httr for http_type() and content()
library(xml2)
library(httr)

# Check response is XML
http_type(resp_xml)
[1] "application/xhtml+xml"

# Examine returned text with content()
NH_text <- content(resp_xml, as = "text") 
NH_text
[1] "<!DOCTYPE html>\n<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n        \t<meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\" /><link type=\"text/css\" rel=\"stylesheet\" href=\"/csologin/javax.faces.resource/theme.css.jsf?ln=primefaces-redmond\" /><link type=\"text/css\" rel=\"stylesheet\" href=\"/csologin/javax.faces.resource/primefaces.css.jsf?ln=primefaces&amp;v=5.3\" /><script type=\"text/javascript\" src=\"/csologin/javax.faces.resource/jquery/jquery.js.jsf?ln=primefaces&amp;v=5.3\"></script><script type=\"text/javascript\" src=\"/csologin/javax.faces.resource/jquery/jquery-plugins.js.jsf?ln=primefaces&amp;v=5.3\"></script><script type=\"text/javascript\" src=\"/csologin/javax.faces.resource/primefaces.js.jsf?ln=primefaces&amp;v=5.3\"></script><script type=\"text/javascript\" src=\"/csologin/javax.faces.resource/primefaces-extensions.js.jsf?ln=primefaces-extensions&amp;v=4.0.0\"></script><link type=\"text/css\" rel=\"stylesheet\" href=\"/csologin/javax.faces.resou... <truncated>

# Check htmltidy package: https://cran.r-project.org/web/packages/htmltidy/htmltidy.pdf

# Turn NH_text into an XML document
NH_xml <- read_xml(NH_text)

Error in doc_parse_raw(x, encoding = encoding, base_url = base_url,
as_html = as_html, : Entity 'nbsp' not defined [26]

Daniel Lee asked Jan 23 '26 00:01



1 Answer

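Named HTML entities such as &nbsp; are not defined in XML (regardless of what any potential troll comments might otherwise "suggest"); XML predefines only &amp;, &lt;, &gt;, &quot; and &apos;. I do not know R programming, but what I can tell you is that you need to do a string replacement for the following array:

'&nbsp;','&gt;','&lt;'

...and replace them with the following strings, in the same order:

'&#160;','&#62;','&#60;'

Strictly speaking, &gt; and &lt; are already valid in XML, so only &nbsp; and any other HTML-only entities actually need replacing.

In PHP this would simply be:

// $a holds the XHTML source as a string
$f = array('&nbsp;','&gt;','&lt;');
$r = array('&#160;','&#62;','&#60;');
$a = str_ireplace($f,$r,$a);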

...and each search string would be replaced by the string at the same position in the replacement array. I'm not confident enough to post R code based only on a look at basic tutorials, though.

What I can tell you is that if you clean out those strings (and any doctype), and the rest of the markup is well-formed, it should parse just fine as application/xml.
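
For readers working in R, as in the question, the replacement described above might look roughly like this. It is a minimal sketch, not a tested solution for this particular court page: the helper name html_entities_to_numeric, the entity mapping, and the sample_xhtml string are all illustrative assumptions that do not come from the original post. It uses base R's gsub() and, only if the package is installed, xml2's read_xml().

```r
# Hypothetical helper (not from the original post): replace HTML-only named
# entities with numeric character references, which any XML parser accepts.
html_entities_to_numeric <- function(x) {
  # Assumed, incomplete mapping; extend it with whatever entities the page uses.
  entity_map <- c(
    "&nbsp;"  = "&#160;",
    "&copy;"  = "&#169;",
    "&mdash;" = "&#8212;"
  )
  for (name in names(entity_map)) {
    # fixed = TRUE matches the entity literally (and case-sensitively).
    x <- gsub(name, entity_map[[name]], x, fixed = TRUE)
  }
  x
}

# Small stand-in for NH_text so the sketch runs on its own.
sample_xhtml <- paste0(
  "<!DOCTYPE html>\n",
  "<html xmlns=\"http://www.w3.org/1999/xhtml\">",
  "<body><p>Case&nbsp;No. 1 &amp; 2</p></body></html>"
)

clean_xhtml <- html_entities_to_numeric(sample_xhtml)

# Parse only if xml2 is available, so the replacement step works without it.
if (requireNamespace("xml2", quietly = TRUE)) {
  doc <- xml2::read_xml(clean_xhtml)
  xml2::xml_ns_strip(doc)  # drop the default XHTML namespace for simpler XPath
  print(xml2::xml_text(xml2::xml_find_first(doc, "//p")))
}
```

With the real data, NH_text would take the place of sample_xhtml. Predefined XML entities such as &amp; are left untouched on purpose. If a strict XML parse is not actually required, xml2::read_html(NH_text) may be a simpler route, since it uses the HTML parser, which recognises named entities like &nbsp;.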

John answered Jan 24 '26 13:01



