I have data with a binary YES/NO Class response. I am using the following code to run an RF model, but I have a problem getting the confusion matrix result.
library(readxl)
library(caret)
dataR <- read_excel("*:/*.xlsx")
Train <- createDataPartition(dataR$Class, p = 0.7, list = FALSE)
training <- dataR[Train, ]
testing <- dataR[-Train, ]
model_rf <- train(Class ~ ., tuneLength = 3, data = training, method = "rf",
                  importance = TRUE,
                  trControl = trainControl(method = "cv", number = 5))
Results:
Random Forest
3006 samples
82 predictor
2 classes: 'NO', 'YES'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 2405, 2406, 2405, 2404, 2404
Additional sampling using SMOTE
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.7870921 0.2750655
44 0.7787721 0.2419762
87 0.7767760 0.2524898
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
So far, so good, but when I run this code:
# Apply threshold of 0.50: p_class
class_log <- ifelse(model_rf[,1] > 0.50, "YES", "NO")
# Create confusion matrix
p <- confusionMatrix(class_log, testing[["Class"]])
## gives the accuracy
p$overall[1]
I get this error:
Error in model_rf[, 1] : incorrect number of dimensions
I would appreciate it if you could help me get the confusion matrix result.
As I understand it, you would like to obtain the confusion matrix for the cross-validation in caret.
For this you need to specify savePredictions in trainControl. If it is set to "final", the predictions for the best model are saved. By specifying classProbs = TRUE, the probabilities for each class will also be saved.
data(iris)
iris_2 <- iris[iris$Species != "setosa", ] # make a two-class problem
iris_2$Species <- factor(iris_2$Species)   # drop the unused level
library(caret)
model_rf <- train(Species ~ ., tuneLength = 3, data = iris_2, method = "rf",
                  importance = TRUE,
                  trControl = trainControl(method = "cv",
                                           number = 5,
                                           savePredictions = "final",
                                           classProbs = TRUE))
Predictions are in:
model_rf$pred
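You can inspect their structure first; with these settings the columns include pred, obs, the class probabilities, rowIndex, mtry and Resample. The exact column order can vary across caret versions, which is why the snippets below index the columns by name:
head(model_rf$pred)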
These are sorted per CV fold; to sort them as in the original data frame:
model_rf$pred$pred[order(model_rf$pred$rowIndex)]
And to obtain a confusion matrix:
confusionMatrix(model_rf$pred$pred[order(model_rf$pred$rowIndex)], iris_2$Species)
#output
Confusion Matrix and Statistics
Reference
Prediction versicolor virginica
versicolor 46 6
virginica 4 44
Accuracy : 0.9
95% CI : (0.8238, 0.951)
No Information Rate : 0.5
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8
Mcnemar's Test P-Value : 0.7518
Sensitivity : 0.9200
Specificity : 0.8800
Pos Pred Value : 0.8846
Neg Pred Value : 0.9167
Prevalence : 0.5000
Detection Rate : 0.4600
Detection Prevalence : 0.5200
Balanced Accuracy : 0.9000
'Positive' Class : versicolor
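As a side note, if you only need the cross-validated confusion matrix averaged over the resamples, caret can compute it directly from the train object (the cells are then shown as percentages of the training data rather than counts):
confusionMatrix(model_rf)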
In a two-class setting, specifying 0.5 as the threshold probability is often sub-optimal. The optimal threshold can be found after training by optimizing Kappa or Youden's J statistic (or any other preferred metric) as a function of the probability. Here is an example:
thresholds <- 1:40 / 40
kappas <- sapply(thresholds, function(x){
  versicolor <- model_rf$pred$versicolor[order(model_rf$pred$rowIndex)]
  class <- factor(ifelse(versicolor >= x, "versicolor", "virginica"),
                  levels = levels(iris_2$Species))
  unname(confusionMatrix(class, iris_2$Species)$overall["Kappa"])
})
data.frame(prob = thresholds, kappa = kappas)
Here the highest Kappa is not obtained at a threshold of 0.5 but at 0.1. This should be used carefully, because optimizing the threshold on the data at hand can lead to over-fitting.
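A similar scan works for Youden's J statistic mentioned above. A minimal sketch, computing J = sensitivity + specificity - 1 from the same sorted CV predictions:
sapply(1:40 / 40, function(x){
  versicolor <- model_rf$pred$versicolor[order(model_rf$pred$rowIndex)]
  class <- factor(ifelse(versicolor >= x, "versicolor", "virginica"),
                  levels = levels(iris_2$Species))
  cm <- confusionMatrix(class, iris_2$Species)
  # Youden's J = sensitivity + specificity - 1
  unname(cm$byClass["Sensitivity"] + cm$byClass["Specificity"] - 1)
})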
You can try this to create the confusion matrix and check the accuracy (see the note below on how class_log should be created):
m <- table(class_log, testing[["Class"]])
m # confusion table
# Accuracy
sum(diag(m)) / nrow(testing)
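Note that class_log here must come from predictions on the held-out test set, not from the train object itself (indexing model_rf directly is what raised the "incorrect number of dimensions" error in the question). A minimal sketch, assuming the testing split from the question:
# predict class probabilities on the test set
probs <- predict(model_rf, newdata = testing, type = "prob")
# apply the 0.50 threshold to the "YES" class probability
class_log <- ifelse(probs[, "YES"] > 0.50, "YES", "NO")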