I'm having trouble working with Twitter data that I extracted using the CRAN version of the twitteR package. In particular, the tolower conversion applied through tm_map from the tm package fails.
I'm following this example
This is what I'm currently doing:
library(twitteR)
library(tm)
# OAuth handshake and so on work fine
google_8.10 <- searchTwitter("#Google", n=1500, cainfo="cacert.pem")
google_8.10_text <- sapply(google_8.10, function(x) x$getText())
google_8.10_text_corpus <- Corpus(VectorSource(google_8.10_text))
google_8.10_text_corpus <- tm_map(google_8.10_text_corpus, tolower)
google_8.10_text_corpus <- tm_map(google_8.10_text_corpus, removePunctuation)
google_8.10_text_corpus <- tm_map(google_8.10_text_corpus, function(x) removeWords(x, stopwords()))
The other conversions complete just fine (if tolower isn't run). However, the tolower conversion returns:
google_8.10_text_corpus <- tm_map(google_8.10_text_corpus, tolower)
Warning message:
In parallel::mclapply(x, FUN, ...) :
all scheduled cores encountered errors in user code
I suspect this is caused by some character in one of the tweets, but how can I track the problem down?
Edit: Indeed, certain characters seem to cause this, e.g.:
"#Google #TheInternship THE BEST MOVIE EVER @Jeennyy01 @dylanobrien I love this part \ud83d\ude1c http://t.co/iok5vm83cP"
Here the "\ud83d\ude1c" part causes the error. Any idea how to automatically strip such character sequences (this one is: http://www.charbase.com/1f61c-unicode-face-with-stuck-out-tongue-and-winking-eye) from the tweets?
According to the R source, tolower can give an error:
Support for "bytes" marked encoding
nzchar and nchar(, "bytes") are independent of the encoding.
nchar(, "char") nchar(, "width") give NA (if allowed) or error.
substr substr<- work in bytes
abbreviate chartr make.names strtrim tolower toupper give error.
Here is an example where an error is thrown for an invalid UTF-8 code point (a lone surrogate):
tolower("\udc80")
Error in tolower("<ed><U+00B2><U+0080>") :
invalid input 'í²€' in 'utf8towcs'
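One way to both track down and work around such strings is sketched below. This is only a sketch under assumptions: find_bad and safe_tolower are hypothetical helper names (not part of tm or twitteR), and it assumes the tweets are meant to be UTF-8. It relies on base R's tryCatch to flag the offending elements, and on iconv with sub = "" to drop bytes that cannot be converted. How strictly iconv validates input can differ between platforms, so the ASCII target shown in the last comment is a more aggressive fallback that also removes valid non-ASCII characters.

```r
# Hypothetical helper: indices of the elements on which tolower() fails.
find_bad <- function(x) {
  ok <- vapply(x, function(s) {
    tryCatch({
      tolower(s)
      TRUE
    }, error = function(e) FALSE)
  }, logical(1), USE.NAMES = FALSE)
  which(!ok)
}

# Hypothetical helper: drop bytes that do not convert, then lowercase.
# `to` defaults to "UTF-8"; passing "ASCII" strips every non-ASCII
# character (emoji, accents, ...), which is lossy but simple.
safe_tolower <- function(x, to = "UTF-8") {
  cleaned <- iconv(x, from = "UTF-8", to = to, sub = "")
  tolower(cleaned)
}

# Example input: one clean string and one with an invalid byte (0xff),
# marked as UTF-8 so that tolower() treats it as such.
bad <- rawToChar(as.raw(c(0x41, 0xff, 0x42)))
Encoding(bad) <- "UTF-8"
tweets <- c("Hello #Google", bad)

find_bad(tweets)                      # positions of problematic elements
safe_tolower(tweets)                  # invalid bytes dropped, then lowercased
safe_tolower(tweets, to = "ASCII")    # fallback: keep ASCII only
```

Applied to the question's data, this would mean cleaning google_8.10_text with safe_tolower before building the corpus, so that the tm_map(..., tolower) step is no longer needed. In newer tm versions, wrapping the function as content_transformer(safe_tolower) inside tm_map should also work, though that depends on the installed tm version.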