I'm a complete beginner in R and this is my first time posting on Stack Overflow. Please be gentle :)
I'm trying to learn R by following tutorials and practical examples, but I got stuck on this one and don't know what I'm doing wrong.
I'm following the tutorial as posted here, but partway through, when I try to train the model, I get the following error message:
Error in na.fail.default(list(doc.class = c(3L, 1L, 1L, 1L, 1L, 1L, 1L,  : 
  missing values in object
I hope someone can help me understand what is going on here. I inspected tdmTrain and it contains only NA values. I'm just not sure why, or how to fix it.
This is the code up to the step where I get the error message.
library(NLP)
library(tm) 
library(caret) 
r8train <- read.table("r8-train-all-terms.txt", header=FALSE, sep='\t')
r8test <- read.table("r8-test-all-terms.txt", header=FALSE, sep='\t')
# rename variables
names(r8train) <- c("Class", "docText")
names(r8test) <- c("Class", "docText")
# convert the document text variable to character type
r8train$docText <- as.character(r8train$docText)
r8test$docText <- as.character(r8test$docText)
# create variable to denote if observation is train or test
r8train$train_test <- c("train")
r8test$train_test <- c("test")
# merge the train/test data
merged <- rbind(r8train, r8test)
# remove objects that are no longer needed 
remove(r8train, r8test)
merged <- merged[which(merged$Class %in% c("crude","money-fx","trade")),]
# drop unused levels in the response variable
merged$Class <- droplevels(merged$Class) 
# counts of each class in the train/test sets
table(merged$Class,merged$train_test)
# a vector source interprets each element of the vector as a document
sourceData <- VectorSource(merged$docText)
# create the corpus
corpus <- Corpus(sourceData)
# preprocess/clean the training corpus
corpus <- tm_map(corpus, content_transformer(tolower)) # convert to lowercase
corpus <- tm_map(corpus, removeNumbers) # remove digits
corpus <- tm_map(corpus, removePunctuation) # remove punctuation
corpus <- tm_map(corpus, stripWhitespace) # strip extra whitespace
corpus <- tm_map(corpus, removeWords, stopwords('english')) # remove stopwords
# create term document matrix (tdm)
tdm <- DocumentTermMatrix(corpus)
as.matrix(tdm)[10:20,200:210] # inspect a portion of the tdm
# create tf-idf weighted version of term document matrix
weightedtdm <- weightTfIdf(tdm)
as.matrix(weightedtdm)[10:20,200:210] # inspect same portion of the weighted tdm
# find frequent terms: terms that occur at least 250 times across the corpus
findFreqTerms(tdm, 250)
# convert tdm's into data frames 
tdm <- as.data.frame(inspect(tdm))
weightedtdm <- as.data.frame(inspect(weightedtdm))
# split back into train and test sets
tdmTrain <- tdm[which(merged$train_test == "train"),]
weightedTDMtrain <- weightedtdm[which(merged$train_test == "train"),]
tdmTest <-  tdm[which(merged$train_test == "test"),]
weightedTDMtest <- weightedtdm[which(merged$train_test == "test"),]
# remove objects that are no longer needed to conserve memory
remove(tdm,weightedtdm)
# append document labels as last column
tdmTrain$doc.class <- merged$Class[which(merged$train_test == "train")]
tdmTest$doc.class <- merged$Class[which(merged$train_test == "test")]
weightedTDMtrain$doc.class <- merged$Class[which(merged$train_test == "train")]
weightedTDMtest$doc.class  <- merged$Class[which(merged$train_test == "test")]
# set resampling scheme
ctrl <- trainControl(method="repeatedcv",number = 10, repeats = 3) #,classProbs=TRUE)
# fit a kNN model using the weighted (tf-idf) term document matrix
# tuning parameter: K
set.seed(100)
knn.tfidf <- train(doc.class ~ ., data = weightedTDMtrain, method = "knn", trControl = ctrl) #, tuneLength = 20)
To see which values in a vector R recognizes as missing, we can use the is.na function. It returns a TRUE/FALSE vector with as many elements as the vector we provide. R distinguishes between NA and "NA" in a character vector: NA is seen as a missing value, "NA" is not.
In R, missing values are represented by the symbol NA (not available). Undefined results (e.g., 0/0) are represented by the symbol NaN (not a number). Unlike SAS, R uses the same symbol for character and numeric data.
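As a small illustration of the points above, here is a sketch using two made-up vectors (x1 and x2 are hypothetical names chosen for this example, not objects from the question's code):

```r
# Hypothetical vectors for illustration only
x1 <- c(1, 4, NA, 7)          # numeric vector with one missing value
x2 <- c("a", NA, "NA", "d")   # character vector: a real NA and the string "NA"

is.na(x1)   # FALSE FALSE  TRUE FALSE
is.na(x2)   # FALSE  TRUE FALSE FALSE  (the string "NA" is not missing)

# 0/0 is undefined, so R returns NaN; is.na() also reports NaN as missing
is.na(0 / 0)   # TRUE
is.nan(NA)     # FALSE: NA is missing, but it is not NaN
```

Note that is.na() is TRUE for both NA and NaN, while is.nan() is TRUE only for NaN.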
The problem lies in this part of the code:
tdm <- as.data.frame(inspect(tdm))
weightedtdm <- as.data.frame(inspect(weightedtdm))
dim(weightedtdm) #returns rows and columns
[1] 10 10
Never use inspect() to create a data.frame out of a tdm. It is meant for displaying the matrix, and here it returns only the first 10 rows and 10 columns, not all the data from the tdm.
You need to use:
tdm <- as.data.frame(as.matrix(tdm))
weightedtdm <- as.data.frame(as.matrix(weightedtdm))
dim(weightedtdm)
[1]  993 9243
Here you can see the enormous difference between the two approaches.
With the first version, weightedtdm has only 10 rows, so weightedtdm[which(merged$train_test == "train"),] asks for rows that do not exist and R fills them with NA. After you run weightedTDMtrain$doc.class <- merged$Class[which(merged$train_test == "train")], 700 rows are NA in every column except doc.class.
This is why train returns the error message.
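The mechanism can be reproduced in base R without the Reuters data. The sketch below uses a toy 10-row data frame and made-up row indices (small, train_idx and train_set are hypothetical names) to show how indexing past the last row produces NA rows, and how na.fail(), the check named in the error message, reacts to them:

```r
# Toy stand-in for the truncated data frame: only 10 rows survive
small <- data.frame(term_a = 1:10, term_b = 11:20)

# Hypothetical training indices; three of them point past row 10
train_idx <- c(1, 2, 3, 50, 200, 700)

# Rows that do not exist come back filled with NA
train_set <- small[train_idx, ]
train_set$doc.class <- factor(c("crude", "trade", "crude",
                                "trade", "crude", "trade"))

# Count missing values per column: only doc.class is complete
colSums(is.na(train_set))

# na.fail() stops with "missing values in object" when any row is incomplete
result <- tryCatch(na.fail(train_set),
                   error = function(e) conditionMessage(e))
result
```

Running dim() on the data frame right after the conversion step, as shown above, is the quickest way to spot this kind of truncation.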
The second approach works and train will start to run (slowly, because of the repeated cross-validation).
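Because the repeated cross-validation takes a while, it can be worth checking the training data before calling train. The helper below is a hypothetical sketch (check_training_data is not part of caret or tm) that stops early if the row count is unexpected or any row has missing values:

```r
# Hypothetical helper, not part of caret or tm: fail fast on bad training data
check_training_data <- function(df, expected_rows) {
  stopifnot(is.data.frame(df))
  if (nrow(df) != expected_rows) {
    stop("expected ", expected_rows, " rows but found ", nrow(df))
  }
  incomplete <- sum(!complete.cases(df))
  if (incomplete > 0) {
    stop(incomplete, " rows contain missing values")
  }
  invisible(TRUE)
}

# Toy data frame standing in for weightedTDMtrain
ok_df <- data.frame(term_a = c(0.1, 0.0, 0.3),
                    doc.class = factor(c("crude", "trade", "crude")))
check_training_data(ok_df, expected_rows = 3)
```

In the question's code, one possible call would be check_training_data(weightedTDMtrain, sum(merged$train_test == "train")), placed just before train.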