I have a question regarding the rfe function from the caret library. The caret homepage gives the following RFE algorithm:
[Figure: RFE algorithm as shown on the caret homepage]
For this example I am using the rfe function with 3-fold cross-validation and the train function with a linear-SVM and 5-fold cross-validation. 
library(kernlab)
library(caret)
data(iris)
# parameters for the train function, used for fitting the SVM (inner 5-fold CV)
trControl <- trainControl(method = "cv", number = 5)
# parameters for the rfe function (outer 3-fold CV)
rfeControl <- rfeControl(functions = caretFuncs, method = "cv",
                         number = 3, verbose = FALSE)
rf1 <- rfe(as.matrix(iris[, 1:4]), as.factor(iris[, 5]), sizes = c(2, 3),
           rfeControl = rfeControl, trControl = trControl, method = "svmLinear")
As I understand it:
1. rfe would split the data (150 samples) into 3 folds.
2. The train function would be run on each training set (100 samples) with 5-fold cross-validation to tune the model parameters, with subsequent RFE.
What confuses me is what I see when I look at the results of the rfe function:
> lapply(rf1$control$index, length)
$Fold1
[1] 100
$Fold2
[1] 101
$Fold3
[1] 99
> lapply(rf1$fit$control$index, length)
$Fold1
[1] 120
$Fold2
[1] 120
$Fold3
[1] 120
$Fold4
[1] 120
$Fold5
[1] 120
From that it appears that the size of the training sets from the 5-fold CV is 120 samples, when I would expect a size of 80 (4/5 of the 100 samples in each rfe training set).
So it would be great if someone could clarify how rfe and train work together.
Cheers
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)
locale:
[1] C
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
 [1] pROC_1.5.4      e1071_1.6-1     class_7.3-5     caret_5.15-048 
 [5] foreach_1.4.0   cluster_1.14.3  plyr_1.7.1      reshape2_1.2.1 
 [9] lattice_0.20-10 kernlab_0.9-15 
loaded via a namespace (and not attached):
 [1] codetools_0.2-8 compiler_2.15.1 grid_2.15.1     iterators_1.0.6
 [5] stringr_0.6.1   tools_2.15.1   
The problem here is that rf1$fit$control$index does not store what we think it does.
To understand that, I had to look into the code. What happens there is the following:
When you call rfe, the whole data set is passed to nominalRfeWorkflow.
In nominalRfeWorkflow, the data are split into training and test sets according to rfeControl (in our example 3 times, following the 3-fold CV rule), and each split is passed to rfeIter.
We can find these splits in our result under rf1$control$index.
In rfeIter, the ~100 training samples (in our example) are used to find the final variables (which is the output of that function).
As I understand it, the ~50 test samples (in our example) are used to calculate the performance for the different variable sets, but this is only stored as external performance and is not used to select the final variables.
For selecting these, the performance estimates of the 5-fold cross-validation are used.
But we cannot find these indices in the final result returned by rfe.
If we really need them, we have to fetch them from fitObject$control$index in rfeIter, return them to nominalRfeWorkflow, then to rfe, and from there store them in the rfe-class object returned by rfe.
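Without patching caret, the inner resampling can at least be approximated by hand. The following is only a minimal sketch under some assumptions: createFolds stands in for the outer splits stored in rf1$control$index, train is run once per outer training set on all four predictors (no feature elimination), and the names outer_index, inner_ctrl and inner_sizes are made up for this illustration.

```r
library(caret)
library(kernlab)  # backend for method = "svmLinear"

data(iris)
set.seed(42)  # arbitrary seed, only to make the fold assignment reproducible

x <- iris[, 1:4]
y <- iris[, 5]

# Outer 3-fold split; a stand-in for rf1$control$index (assumption).
outer_index <- createFolds(y, k = 3, returnTrain = TRUE)

# Inner 5-fold CV, run on the training rows of each outer fold.
inner_ctrl <- trainControl(method = "cv", number = 5)
inner_sizes <- lapply(outer_index, function(rows) {
  fit <- train(x[rows, ], y[rows], method = "svmLinear", trControl = inner_ctrl)
  # Sizes of the inner training sets used by this train call.
  sapply(fit$control$index, length)
})

inner_sizes
```

Each element of inner_sizes should hold five values close to 80, which is the size the question expected to see.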
So what is stored in rf1$fit$control$index? Once rfe has found the best variables, the final model fit is created with the best variables and the full reference data (150 samples). rf1$fit is created in rfe as follows:
fit <- rfeControl$functions$fit(x[, bestVar, drop = FALSE],
                                 y,
                                 first = FALSE,
                                 last = TRUE,
                                 ...)
This function again runs the train function and does a final cross-validation with the full reference data, the final feature set and the trControl passed via the ellipsis (...).
Since our trControl specifies 5-fold CV, it is thus correct that lapply(rf1$fit$control$index, length) returns 120, since each training set contains 150/5*4 = 120 samples.
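One way to check this is to rerun the rfe call from the question and inspect rf1$fit directly. This is a sketch, assuming a caret version where caretFuncs fits via train and the train object keeps its trainingData (the default); the data frame is passed instead of as.matrix, and the names inner_ctrl and outer_ctrl are chosen here only to avoid shadowing the caret functions.

```r
library(caret)
library(kernlab)  # backend for method = "svmLinear"

data(iris)
set.seed(1)  # arbitrary seed for reproducible folds

inner_ctrl <- trainControl(method = "cv", number = 5)      # used by train
outer_ctrl <- rfeControl(functions = caretFuncs, method = "cv",
                         number = 3, verbose = FALSE)      # used by rfe

rf1 <- rfe(iris[, 1:4], iris[, 5], sizes = c(2, 3),
           rfeControl = outer_ctrl, trControl = inner_ctrl,
           method = "svmLinear")

class(rf1$fit)              # the final model is a train object
nrow(rf1$fit$trainingData)  # number of rows the final model was fitted on
rf1$optVariables            # variables kept for the final fit
```

If the final fit really uses the full reference data, nrow(rf1$fit$trainingData) should report all 150 rows rather than the ~100 rows of an outer training fold.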
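The fold-size arithmetic behind the three numbers in this thread can be written out in a few lines of base R. cv_train_size is a hypothetical helper for this illustration, not a caret function, and it ignores the small rounding differences that stratified folds produce.

```r
# Training-set size of one resample in k-fold CV on n samples
# (hypothetical helper, not part of caret).
cv_train_size <- function(n, k) n * (k - 1) / k

outer_train <- cv_train_size(150, 3)          # outer rfe folds
inner_train <- cv_train_size(outer_train, 5)  # inner train folds inside rfeIter
final_train <- cv_train_size(150, 5)          # final fit on the full data

c(outer = outer_train, inner = inner_train, final = final_train)
```

The first value matches rf1$control$index (about 100), the second is the 80 the question expected, and the third is the 120 actually stored in rf1$fit$control$index.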