If I exclude my custom transformer, GridSearchCV runs fine, but with it included, it raises an error. Here is a fake dataset:
import pandas
import numpy
from sklearn_pandas import DataFrameMapper
from sklearn_pandas import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelBinarizer
from sklearn.ensemble import RandomForestClassifier
import sklearn_pandas
from sklearn.preprocessing import MinMaxScaler
df = pandas.DataFrame({"Letter":["a","b","c","d","a","b","c","d","a","b","c","d","a","b","c","d"],
                       "Number":[1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4], 
                       "Label":["G","G","B","B","G","G","B","B","G","G","B","B","G","G","B","B"]})
class MyTransformer(TransformerMixin):
    def transform(self, x, **transform_args):
        x["Number"] = x["Number"].apply(lambda row: row*2)
        return x
    def fit(self, x, y=None, **fit_args):
        return self
x_train = df
y_train = x_train.pop("Label")    
mapper = DataFrameMapper([
    ("Number", MinMaxScaler()),
    ("Letter", LabelBinarizer()),
    ])
pipe = Pipeline([
    ("custom", MyTransformer()),
    ("mapper", mapper),
    ("classifier", RandomForestClassifier()),
    ])
param_grid = {"classifier__min_samples_split":[10,20], "classifier__n_estimators":[2,3,4]}
model_grid = sklearn_pandas.GridSearchCV(pipe, param_grid, verbose=2, scoring="accuracy")
model_grid.fit(x_train, y_train)
and the error is
list indices must be integers, not str
How can I make GridSearchCV work while there is a custom transformer in my pipeline?
I know this answer comes rather late, but I've encountered the same behavior with sklearn and classes derived from BaseSearchCV. The problem seems to stem from the _PartitionIterator class in the sklearn cross_validation module: it assumes that everything emitted by every TransformerMixin class in the pipeline is array-like, so it generates integer indices that are used to index the incoming X in an array-like manner. Here's the __iter__ method:
def __iter__(self):
    ind = np.arange(self.n)
    for test_index in self._iter_test_masks():
        train_index = np.logical_not(test_index)
        train_index = ind[train_index]
        test_index = ind[test_index]
        yield train_index, test_index 
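To see what kind of indices this produces, here is a standalone sketch of the same loop. The `iter_test_masks` helper and its contiguous-fold layout are assumptions made for illustration, not the library's actual `_iter_test_masks`; the point is only that each split is a pair of integer position arrays.

```python
import numpy as np


def iter_test_masks(n, n_folds):
    """Hypothetical stand-in for _iter_test_masks: contiguous folds."""
    fold_sizes = np.full(n_folds, n // n_folds)
    fold_sizes[: n % n_folds] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        mask = np.zeros(n, dtype=bool)
        mask[start:start + size] = True
        yield mask
        start += size


def iter_splits(n, n_folds):
    """Mirror the __iter__ logic: turn boolean masks into integer indices."""
    ind = np.arange(n)
    for test_mask in iter_test_masks(n, n_folds):
        train_index = ind[np.logical_not(test_mask)]
        test_index = ind[test_mask]
        yield train_index, test_index


splits = list(iter_splits(n=6, n_folds=3))
for train_index, test_index in splits:
    print(train_index, test_index)
```

Whatever X is, it later gets indexed with these row positions, which is only meaningful for array-like containers.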
The BaseSearchCV grid search base class calls cross_validation's _fit_and_score, which uses a helper called _safe_split. Here's the relevant line:
X_subset = [X[idx] for idx in indices]
This will absolutely produce unexpected results if X is a pandas DataFrame, which is what your transform function emits: X[idx] on a DataFrame looks up a column by label, not a row by position.
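The difference is easy to reproduce outside of scikit-learn. The following is a small sketch, assuming a toy frame shaped like the one in the question. It applies the same list comprehension to a NumPy array and to a DataFrame, then shows that the resulting plain list cannot be indexed by column name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Letter": ["a", "b", "c", "d"], "Number": [1, 2, 3, 4]})
indices = np.array([0, 2])  # positional row indices, as a CV splitter yields

# On a NumPy array, X[idx] selects a row by position.
arr = df.values
rows_from_array = [arr[idx] for idx in indices]

# On a DataFrame, X[idx] is a column lookup by label, so an integer fails.
try:
    [df[idx] for idx in indices]
    dataframe_error = None
except KeyError as exc:
    dataframe_error = exc

# The comprehension returns a plain list, which has no column labels:
# indexing it with a string raises TypeError.
try:
    rows_from_array["Number"]
    list_error = None
except TypeError as exc:
    list_error = exc

# Positional selection that keeps the DataFrame type.
rows_from_frame = df.iloc[indices]
print(type(dataframe_error).__name__, type(list_error).__name__)
print(rows_from_frame)
```

The TypeError on the list is the same kind of failure as the "list indices must be integers, not str" message in the question.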
There are two ways I've found to fix this:
1. Make sure to return an array from your transformer:
return x.values  # as_matrix() was removed in pandas 1.0
This first option is a hack.
2. If the pipeline requires the input to the next transformer to be a DataFrame, as was my case, you can write a utilities script that is essentially the same as the sklearn grid_search module, but adds some validation functions that are called in the _fit method of the BaseSearchCV class:
import numpy as np
import pandas as pd

def _validate_X(X):
    """Return X unchanged unless it is a pandas DataFrame, in which
    case return the underlying NumPy array."""
    # .values replaces as_matrix(), which was removed in pandas 1.0
    return X.values if isinstance(X, pd.DataFrame) else X

def _validate_y(y):
    """Return y as a NumPy array if it is a Series or a single-column
    DataFrame; otherwise return it unchanged."""
    if y is None:
        return y
    # if it's a Series
    elif isinstance(y, pd.Series):
        return np.array(y.tolist())
    # if it's a DataFrame
    elif isinstance(y, pd.DataFrame):
        # check its column count: y must be a single column
        if y.shape[1] > 1:
            raise ValueError('matrix provided as y')
        return np.array(y[y.columns[0]].tolist())
    # bail and let the sklearn function handle validation
    return y
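As a rough illustration of what this conversion buys you, here is a self-contained sketch. The `to_arrays` wrapper is a hypothetical name that bundles the same checks. It is not part of sklearn or of the module mentioned here. After conversion, the positional indexing used during the split behaves as intended.

```python
import numpy as np
import pandas as pd


def to_arrays(X, y=None):
    """Hypothetical wrapper: convert pandas inputs to NumPy arrays."""
    if isinstance(X, pd.DataFrame):
        X = X.values
    if isinstance(y, pd.DataFrame):
        if y.shape[1] > 1:
            raise ValueError("matrix provided as y")
        y = y.iloc[:, 0]  # reduce to a Series, handled below
    if isinstance(y, pd.Series):
        y = y.values
    return X, y


df = pd.DataFrame({"Number": [1, 2, 3, 4], "Label": ["G", "G", "B", "B"]})
y = df.pop("Label")
X_arr, y_arr = to_arrays(df, y)

# The same kind of positional indexing the split helper performs.
train_index = np.array([0, 1, 3])
X_train = [X_arr[idx] for idx in train_index]
y_train = y_arr[train_index]
print(X_train, y_train)
```

The trade-off is that column names are gone after conversion, so any later pipeline step that selects columns by name needs the DataFrame rebuilt or the selection done beforehand.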
As an example, here's my "custom grid_search module".