I'm creating a document-term matrix with the tm package in R, but some of the words in my corpus get lost somewhere in the process.
Here is an example. Let's say I have this small corpus:
library(tm)
crps <- " more hours to my next class bout to go home and go night night"
crps <- VCorpus(VectorSource(crps))
When I use DocumentTermMatrix() from the tm-package, it will return these results:
dm <- DocumentTermMatrix(crps)
dm_matrix <- as.matrix(dm)
dm_matrix
# Terms
# Docs and bout class home hours more next night
# 1 1 1 1 1 1 1 1 2
However, what I want (and expected) is:
# Docs and bout class home hours more next night my go to
# 1 1 1 1 1 1 1 1 2 1 2 1
Why does DocumentTermMatrix() skip the words "my", "go" and "to"? Is there a way to control this behavior and fix it?
By default, DocumentTermMatrix() discards words that are shorter than three characters. That is why the words to, my and go are left out when the document-term matrix is constructed.
From the help page ?DocumentTermMatrix, you can see there's an optional argument called control. It takes a list of options, each with a default value (see the help page ?termFreq for details). One of these defaults is a minimum word length of three characters, i.e. wordLengths = c(3, Inf). You can change this to keep all words, regardless of length:
dm <- DocumentTermMatrix(crps, control = list(wordLengths = c(1, Inf)))
inspect(dm)
# <<DocumentTermMatrix (documents: 1, terms: 11)>>
# Non-/sparse entries: 11/0
# Sparsity : 0%
# Maximal term length: 5
# Weighting : term frequency (tf)
#
# Terms
# Docs and bout class go home hours more my next night to
# 1 1 1 1 2 1 1 1 1 1 2 2
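To check which terms the length filter removes, one option is to build the matrix twice and compare the term lists. The following is a minimal sketch, assuming the tm package is installed; the names txt, dm_default, dm_all and dropped are illustrative and not part of the original question.

```r
library(tm)

# Same one-document corpus as in the question.
txt <- " more hours to my next class bout to go home and go night night"
crps <- VCorpus(VectorSource(txt))

# Default control: terms shorter than three characters are dropped.
dm_default <- DocumentTermMatrix(crps)

# Lower the minimum term length to one character.
dm_all <- DocumentTermMatrix(crps, control = list(wordLengths = c(1, Inf)))

# Terms that appear only once the length filter is relaxed.
dropped <- setdiff(Terms(dm_all), Terms(dm_default))
print(dropped)

# Their lengths, to confirm they all fall below the default minimum of 3.
print(nchar(dropped))
```

For this corpus, dropped should contain only the short words from the question. Setting wordLengths = c(2, Inf) instead would also keep them while still excluding single-character tokens.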