
Compute ngrams for each row of text data in R

I have a data column of the following format:

Text

Hello world  
Hello  
How are you today  
I love stackoverflow  
blah blah blahdy  

I would like to compute the 3-grams for each row of this dataset, perhaps using the tau package's textcnt() function. However, when I tried that, I got a single numeric vector of n-grams for the entire column. How can I apply the function to each observation in my data separately?
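For reference, a minimal sketch of the call I tried is below (the method = "string" / n = 3L arguments are my best guess at the word-trigram settings); it returns one pooled count vector for the whole column rather than one per row:

library(tau)

# The 'Text' column from above as a character vector
Text <- c("Hello world", "Hello", "How are you today",
          "I love stackoverflow", "blah blah blahdy")

# method = "string" counts word n-grams, n = 3L asks for trigrams.
# Called on the whole vector, the counts are pooled across every row.
textcnt(Text, method = "string", n = 3L)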

asked Nov 23 '25 by Brian
1 Answer

Is this what you're after?

library("RWeka")
library("tm")

TrigramTokenizer <- function(x) NGramTokenizer(x, 
                                Weka_control(min = 3, max = 3))
# Using Tyler's method of making the 'Text' object here
tdm <- TermDocumentMatrix(Corpus(VectorSource(Text)), 
                          control = list(tokenize = TrigramTokenizer))

inspect(tdm)

A term-document matrix (4 terms, 5 documents)

Non-/sparse entries: 4/16
Sparsity           : 80%
Maximal term length: 20 
Weighting          : term frequency (tf)

                      Docs
Terms                  1 2 3 4 5
  are you today        0 0 1 0 0
  blah blah blahdy     0 0 0 0 1
  how are you          0 0 1 0 0
  i love stackoverflow 0 0 0 1 0
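
If you want the trigrams back as a plain list with one element per row, here is a small follow-up sketch (the name trigrams_per_row is just illustrative):

m <- as.matrix(tdm)

# One character vector of trigrams per document, i.e. per row of the original column
trigrams_per_row <- lapply(seq_len(ncol(m)),
                           function(j) rownames(m)[m[, j] > 0])
trigrams_per_row

Each list element lines up with a row of the original column, so rows with fewer than three words simply come back empty.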
answered Nov 26 '25 by Ben

