
How to scrape messages from web-based forums with rvest

Tags:

r

rvest

Take a vBulletin site like the one in the example below. I want to scrape just the message text from the threads. However, the CSS selectors for the messages have the form #post_message_xxx, where xxx is a variable ID number.

How can I partially match the selector with html_nodes so I get all the ones that start with #post_message regardless of how they end?

Or maybe I should ask a more general question: how should I go about scraping the page if I want to attribute authors to the messages and keep track of message order?

Thanks.

library(rvest)
# read_html() replaces html(), which is deprecated in current rvest releases
html <- read_html("http://www.acme.com/forums/new_rules_28429/")
cast <- html_nodes(html, "#post_message_28429")
cast

<div id="post_message_28429">&#13;            &#13;           Thanks for posting this.&#13;        </div>
attr(,"class")
[1] "XMLNodeSet"
asked Jan 28 '26 03:01 by variable


1 Answer

Rather than using a CSS selector, use an XPath selector. XPath has a starts-with() function that matches on the beginning of an attribute value:

cast <- html_nodes(html, xpath="//div[starts-with(@id,'post_message')]")
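
For the more general question (attributing authors and keeping message order), one approach is to select each post's container first and then extract the pieces per post, so that author and message stay paired. The following is only a minimal sketch: the inline HTML, the "post" wrapper class, and the "username" link class are hypothetical stand-ins, and real vBulletin themes use different markup, so inspect the page and adjust the selectors. It also shows that a CSS attribute-prefix selector ([id^='...']) should select the same nodes as the XPath above.

```r
library(rvest)

# Hypothetical, simplified forum markup used only for illustration.
# On a real site, replace this with read_html("<thread URL>").
page <- read_html('
<html><body>
  <div class="post">
    <a class="username">alice</a>
    <div id="post_message_28429"> Thanks for posting this. </div>
  </div>
  <div class="post">
    <a class="username">bob</a>
    <div id="post_message_28430"> You are welcome. </div>
  </div>
</body></html>')

# Partial match on the id, written two ways: XPath and CSS.
by_xpath <- html_nodes(page, xpath = "//div[starts-with(@id, 'post_message')]")
by_css   <- html_nodes(page, "div[id^='post_message']")

# Select each post container (assumed class "post"); nodes come back
# in document order, which gives the message order.
posts <- html_nodes(page, xpath = "//div[@class='post']")

# html_node() returns one match per container, keeping rows aligned.
msg_nodes <- html_node(posts, xpath = ".//div[starts-with(@id, 'post_message')]")

messages <- data.frame(
  order   = seq_along(posts),
  author  = html_text(html_node(posts, xpath = ".//a[@class='username']"), trim = TRUE),
  post_id = html_attr(msg_nodes, "id"),
  text    = html_text(msg_nodes, trim = TRUE),
  stringsAsFactors = FALSE
)
messages
```

Working from the container down, rather than scraping authors and messages as two separate vectors, avoids misaligned rows when a post is missing one of the pieces.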
answered Jan 30 '26 19:01 by MrFlick


