 

Nested cross-validation for predictions using groups

I'm not able to do something and I would like to know whether it's a bug or expected behaviour.

I was trying to run a nested cross-validation on a dataset in which every sample belongs to a patient. To avoid training and testing on the same patient, I've seen that a "group" mechanism is implemented, and GroupKFold seems to be the right one in my case. As my classifier takes different parameters, I use GridSearchCV to fix the hyperparameters of my model. In the same way, I suppose the inner training/testing folds have to come from different patients.

(For those interested in nested cross-validation: http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html )

I proceed as follows:

from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, GroupKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([('pca', PCA()),
                 ('clf', SVC()),
                 ])
# Find the best parameters for both the feature extraction and the classifier
grid_search = GridSearchCV(estimator=pipe, param_grid=some_param,
                           cv=GroupKFold(n_splits=5), verbose=1)
grid_search.fit(X=features, y=labels, groups=groups)

# Nested CV with parameter optimization
predictions = cross_val_predict(grid_search, X=features, y=labels,
                                cv=GroupKFold(n_splits=5), groups=groups)

And I get:

File "_split.py", line 489, in _iter_test_indices
    raise ValueError("The 'groups' parameter should not be None.")
ValueError: The 'groups' parameter should not be None.

Looking at the code, it appears that groups is not passed on by the _fit_and_predict() method to the estimator, so the groups needed by the inner GroupKFold can't be used.
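As far as I can tell, the outer split itself is not the problem; my guess (a sketch only, not verified against the source) is that the cloned grid_search is refit on each outer training fold without groups, so the inner GroupKFold receives groups=None:

# Guess/sketch: pre-computing the outer folds works, since groups are available here
outer_folds = list(GroupKFold(n_splits=5).split(features, labels, groups=groups))

# ...but as far as I can tell this still raises the same ValueError, because
# cross_val_predict refits the cloned grid_search without forwarding `groups`,
# so the inner GroupKFold inside GridSearchCV gets groups=None
predictions = cross_val_predict(grid_search, X=features, y=labels, cv=outer_folds)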

Can I have some clues about this? Have a nice day, best regards.

Asked Oct 27 '25 by Romain Cendre

1 Answer

I had the same problem and couldn't find any way other than implementing it in a more hands-on fashion:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

# X_data, y_data, groups, random_grid and my_squared_score are defined elsewhere
outer_cv = GroupKFold(n_splits=4).split(X_data, y_data, groups=groups)
nested_cv_scores = []
for train_ids, test_ids in outer_cv:
    # Inner folds are built only from the outer training patients, so the
    # hyper-parameter search never sees the outer test groups
    inner_cv = GroupKFold(n_splits=4).split(X_data[train_ids, :], y_data.iloc[train_ids],
                                            groups=groups[train_ids])

    rf = RandomForestClassifier()
    rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100,
                                   cv=inner_cv, verbose=2, random_state=42,
                                   n_jobs=-1, scoring=my_squared_score)
    # Fit the random search model on the outer training fold
    rf_random.fit(X_data[train_ids, :], y_data.iloc[train_ids])
    print(rf_random.best_params_)

    # Score the tuned model on the held-out outer fold
    nested_cv_scores.append(rf_random.score(X_data[test_ids, :], y_data.iloc[test_ids]))

print("Nested cv score - meta learning: " + str(np.mean(nested_cv_scores)))

I hope this helps.

Best regards, Felix

Answered Oct 28 '25 by Felix