I can't get something to work and would like to know whether it's a bug or the expected behaviour.
I was trying to run a nested cross-validation on a dataset in which each sample belongs to a patient. To avoid training and testing on the same patient, I saw that scikit-learn implements a "groups" mechanism, and GroupKFold seems to be the right splitter in my case. As my classifier has several parameters, I use GridSearchCV to tune the hyperparameters of my model. There too, I assume that the training and test samples have to come from different patients.
( For those that are interested in Nested Cross Validation: http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html )
I proceeded as follows:
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, GroupKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([('pca', PCA()),
                 ('clf', SVC()),
                 ])
# Find the best parameters for both the feature extraction and the classifier
grid_search = GridSearchCV(estimator=pipe, param_grid=some_param, cv=GroupKFold(n_splits=5), verbose=1)
grid_search.fit(X=features, y=labels, groups=groups)
# Nested CV with parameter optimization
predictions = cross_val_predict(grid_search, X=features, y=labels, cv=GroupKFold(n_splits=5), groups=groups)
And I get this error:
File : _split.py", line 489, in _iter_test_indices
raise ValueError("The 'groups' parameter should not be None.")
ValueError: The 'groups' parameter should not be None.
Looking at the code, it appears that groups is not passed on to the estimator by the _fit_and_predict() function, so the inner GroupKFold never receives the groups it needs.
Could anyone give me some pointers? Have a nice day. Best regards
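For anyone who wants to reproduce the behaviour described above without the original data, here is a minimal sketch on synthetic data. The array shapes, the ten made-up "patients" and the small parameter grid are all assumptions for illustration, not the setup from the question. It only shows that a GridSearchCV built with cv=GroupKFold fits when groups reaches fit() and raises a ValueError when it does not.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.svm import SVC

# Synthetic stand-in data: 10 hypothetical patients with 6 samples each.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 4))
y = np.tile([0, 1], 30)
groups = np.repeat(np.arange(10), 6)

search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1.0]},
                      cv=GroupKFold(n_splits=5))

# When groups is forwarded to fit(), the inner GroupKFold can split the data.
search.fit(X, y, groups=groups)

# When fit() is called without groups, the inner splitter has nothing
# to group by and raises a ValueError.
try:
    search.fit(X, y)
    failed = False
except ValueError as exc:
    failed = True
    print("ValueError:", exc)
```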
I had the same problem and I couldn't find another way than implementing it in a more hands-on fashion:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

# random_grid (the parameter distributions) and my_squared_score (the scorer) are defined elsewhere
outer_cv = GroupKFold(n_splits=4).split(X_data, y_data, groups=groups)
nested_cv_scores = []
for train_ids, test_ids in outer_cv:
    # Inner splits are computed on the outer training fold only, so their indices are relative to it
    inner_cv = GroupKFold(n_splits=4).split(X_data[train_ids, :], y_data.iloc[train_ids], groups=groups[train_ids])
    rf = RandomForestClassifier()
    rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100,
                                   cv=inner_cv, verbose=2, random_state=42,
                                   n_jobs=-1, scoring=my_squared_score)
    # Fit the random search model
    rf_random.fit(X_data[train_ids, :], y_data.iloc[train_ids])
    print(rf_random.best_params_)
    nested_cv_scores.append(rf_random.score(X_data[test_ids, :], y_data.iloc[test_ids]))
print("Nested cv score - meta learning: " + str(np.mean(nested_cv_scores)))
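A variation on the loop above, as a rough sketch rather than a drop-in replacement: instead of pre-computing the inner splits, the inner search can be given cv=GroupKFold(...) and the matching slice of groups through fit(). The data here is synthetic and the pipeline and parameter grid are assumptions that mirror the PCA + SVC setup from the question, so that the snippet runs on its own.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in data: 12 hypothetical patients with 10 samples each.
rng = np.random.RandomState(0)
X = rng.normal(size=(120, 6))
y = np.tile([0, 1], 60)
groups = np.repeat(np.arange(12), 10)

pipe = Pipeline([("pca", PCA()), ("clf", SVC())])
param_grid = {"pca__n_components": [2, 4], "clf__C": [0.1, 1.0]}

nested_scores = []
outer_cv = GroupKFold(n_splits=4)
for train_ids, test_ids in outer_cv.split(X, y, groups=groups):
    # Sanity check: no patient appears on both sides of the outer split.
    assert not set(groups[train_ids]) & set(groups[test_ids])

    # Inner search: the groups of the outer training fold go to fit(),
    # which hands them to the inner GroupKFold.
    search = GridSearchCV(pipe, param_grid, cv=GroupKFold(n_splits=3))
    search.fit(X[train_ids], y[train_ids], groups=groups[train_ids])

    # Score the refit best model on the held-out patients.
    nested_scores.append(search.score(X[test_ids], y[test_ids]))

print("Nested CV accuracy:", np.mean(nested_scores))
```

Since the labels in this sketch are random, the printed accuracy is only there to show that the loop runs end to end.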
I hope this helps.
Best regards, Felix