
How to convert HTML lists into a data frame in R?

Tags:

r

web-scraping

I have pieces of HTML that I need to convert to values in a dataframe.

For example this piece of html:

<div class="header">
<h3>title 1</h3>
</div>
<div class="content">
<ul>
<li>info1</li>
<li>info2
</li>
<li>info3
</li>
</ul>
</div>
<div class="header">
<h2>title 2</h2>
</div>
<div class="content">
<ul>
<li>info4</li>
<li>info5
</li>
<li>info6
</li>
</ul>
</div>

I want it to be changed into a dataframe like:

    Title  Info
1 title 1 info1
2 title 1 info2
3 title 1 info3
4 title 2 info4
5 title 2 info5
6 title 2 info6

I tried functions in the XML package and the tm.plugin.webmining package. I also tried the code from this page: http://tonybreyal.wordpress.com/2011/11/18/htmltotext-extracting-text-from-html-via-xpath/ So far I haven't found a function that does what I want. Does anyone have an idea how to approach this problem?

rdatasculptor Avatar asked Dec 31 '25 17:12



1 Answer

I think the HTML parsing in the XML package will help here. Let's assume that the HTML input you've shown above is stored in a character variable called intext. We can then process your data with:

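library(XML)

# parse the HTML string (asText = TRUE: intext holds markup, not a file name)
hh <- htmlParse(intext, asText = TRUE)

# use XPath to extract the data
# one title per header div; "*" matches both the <h3> and the <h2>
titles <- xpathSApply(hh, "//div[@class='header']/*/text()", xmlValue)
# one character vector of items per <ul>, with leading/trailing whitespace trimmed
info <- xpathApply(hh, "//div[@class='content']/ul", function(x)
    gsub("^\\s+|\\s+$", "", xpathSApply(x, "./li/text()", xmlValue)))

# merge results together: pair each title with its items, then stack the pieces
do.call(rbind, Map(cbind, titles, info))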

This returns

     [,1]      [,2]   
[1,] "title 1" "info1"
[2,] "title 1" "info2"
[3,] "title 1" "info3"
[4,] "title 2" "info4"
[5,] "title 2" "info5"
[6,] "title 2" "info6"

which is a matrix that you can easily turn into a data.frame if you like.
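
For the conversion itself, here is a minimal sketch using only base R. The matrix m is typed in by hand to mirror the output above, so the snippet runs without the XML package; in practice m would be the result of the do.call(rbind, ...) call. The names m and df, and the column names Title and Info (taken from the layout requested in the question), are illustrative choices rather than part of the original answer.

```r
# Stand-in for the matrix returned by do.call(rbind, Map(cbind, titles, info)).
# Typed in by hand here so the example is self-contained.
m <- matrix(
  c("title 1", "info1",
    "title 1", "info2",
    "title 1", "info3",
    "title 2", "info4",
    "title 2", "info5",
    "title 2", "info6"),
  ncol = 2, byrow = TRUE
)

# Convert the character matrix to a data frame, keeping the columns as
# character vectors rather than factors.
df <- as.data.frame(m, stringsAsFactors = FALSE)

# Replace the default column names (V1, V2) with the ones from the question.
names(df) <- c("Title", "Info")

print(df)
```

This should give a six-row data frame with one row per list item and the matching title repeated in the Title column.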

MrFlick Avatar answered Jan 02 '26 05:01
