I have a sklearn pipeline performing feature engineering on heterogeneous data types (boolean, categorical, numeric, text) and wanted to try a neural network as my learning algorithm to fit the model. I am running into some problems with the shape of the input data.
I am wondering whether what I am trying to do is even possible, or whether I should try a different approach.
I have tried a couple of different methods but am receiving these errors:
Error when checking input: expected dense_22_input to have shape (11,) but got array with shape (30513,) => I have 11 input features, so I then tried converting my X and y to arrays, and now get this error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames => I think this is caused by the ColumnTransformer(), where I specify the columns by name.
print(X_train_OS.shape)
print(y_train_OS.shape)
(22354, 11)
(22354,)
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import to_categorical # OHE
X_train_predictors = df_train_OS.drop("label", axis=1)
X_train_predictors = X_train_predictors.values
y_train_target = to_categorical(df_train_OS["label"])
X_test_predictors = test_set.drop("label", axis=1)
X_test_predictors = X_test_predictors.values
y_test_target = to_categorical(test_set["label"])
print(X_train_predictors.shape)
print(y_train_target.shape)
(22354, 11)
(22354, 2)
def keras_classifier_wrapper():
    clf = Sequential()
    clf.add(Dense(32, input_dim=11, activation='relu'))
    clf.add(Dense(2, activation='softmax'))
    clf.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
    return clf
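from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# PandasDataFrameSelector, stopwords and the *_FEATURES lists are defined elsewhere
TOKENS_ALPHANUMERIC_HYPHEN = r"[A-Za-z0-9\-]+(?=\s+)"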
boolTransformer = Pipeline(steps=[
    ('bool', PandasDataFrameSelector(BOOL_FEATURES))])
catTransformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('cat_ohe', OneHotEncoder(handle_unknown='ignore'))])
numTransformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('num_scaler', StandardScaler())])
textTransformer_0 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=TOKENS_ALPHANUMERIC_HYPHEN,
                                 stop_words=stopwords))])
textTransformer_1 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=TOKENS_ALPHANUMERIC_HYPHEN,
                                 stop_words=stopwords))])
FE = ColumnTransformer(
    transformers=[
        ('bool', boolTransformer, BOOL_FEATURES),
        ('cat', catTransformer, CAT_FEATURES),
        ('num', numTransformer, NUM_FEATURES),
        ('text0', textTransformer_0, TEXT_FEATURES[0]),
        ('text1', textTransformer_1, TEXT_FEATURES[1])])
clf = KerasClassifier(keras_classifier_wrapper, epochs=100, batch_size=500, verbose=0)
PL = Pipeline(steps=[('feature_engineer', FE),
                     ('keras_clf', clf)])
PL.fit(X_train_predictors, y_train_target)
#PL.fit(X_train_OS, y_train_OS)
I think I understand the problem here, but I am not sure how to solve it. If it is not possible to integrate an sklearn ColumnTransformer + Pipeline with a Keras model, does Keras have a good way to do feature engineering on mixed data types? Thank you!
sklearn is Python's general-purpose machine learning library, and it provides many utilities not just for building learners but also for pipelining and structuring them. Keras models don't work with sklearn out of the box, but they can be made compatible quite easily.
Since Scikit-Learn allows you to implement your own estimators, there's nothing stopping you from using TensorFlow within Scikit-Learn's framework to compare TensorFlow models against other Scikit-Learn models.
It looks like you are passing your 11 columns of original data through your various column transformers and the number of dimensions is expanding to 30,513 (after count vectorizing your text, one hot encoding etc). Your neural network architecture is set up to accept only 11 input features but is being passed your (now transformed) 30,513 features, which is what error 1 is explaining.
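As a rough illustration of that expansion, here is a minimal sketch on a toy DataFrame. The column names and values are made up for the example and are not taken from the question's data; the point is only that the number of columns leaving a ColumnTransformer can be much larger than the number entering it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with hypothetical column names: three raw columns go in.
df = pd.DataFrame({
    "colour": ["red", "green", "blue", "red"],
    "price": [1.0, 2.5, 3.0, 4.5],
    "title": ["small red box", "large green bag", "blue box", "red bag large"],
})

fe = ColumnTransformer(transformers=[
    # One output column per distinct category.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["colour"]),
    # One output column per numeric input column.
    ("num", StandardScaler(), ["price"]),
    # The text column is named with a plain string (not a list) so that
    # CountVectorizer receives 1-D input; one output column per token.
    ("text", CountVectorizer(), "title"),
])

transformed = fe.fit_transform(df)
n_in = df.shape[1]
n_out = transformed.shape[1]
print(n_in, "->", n_out)
```

The width of the transformed matrix, not the width of the raw DataFrame, is what the first Dense layer sees.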
You therefore need to amend the input_dim of your neural network to match the number of features being created in the feature extraction pipeline.
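One way to avoid hard-coding that number is to fit the feature-engineering step once and read the width off the result. The sketch below assumes a CountVectorizer as a stand-in for the question's FE object, and `make_model_builder` is a hypothetical helper name, not part of Keras or sklearn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the question's feature-engineering step.
texts = ["red box", "green bag", "blue box", "red bag"]
feature_engineer = CountVectorizer()

# Fit once on the training data and read off the transformed width.
n_features = feature_engineer.fit_transform(texts).shape[1]


def make_model_builder(input_dim):
    """Return a zero-argument builder with the input width baked in."""
    def build():
        # Keras is imported lazily so the width calculation above
        # runs even where Keras is not installed.
        from keras.models import Sequential
        from keras.layers import Dense

        model = Sequential()
        model.add(Dense(32, input_dim=input_dim, activation="relu"))
        model.add(Dense(2, activation="softmax"))
        model.compile(loss="categorical_crossentropy", optimizer="adam",
                      metrics=["accuracy"])
        return model
    return build


build_fn = make_model_builder(n_features)
print(n_features)
```

`build_fn` can then be handed to KerasClassifier in place of keras_classifier_wrapper. Note that the width depends on the data the vectorizers were fitted on, so it can change whenever the pipeline is refitted on a different sample, for example inside cross-validation.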
One thing you could do is add an intermediate step between them, such as SelectKBest with k set to something like 20,000, so that you know exactly how many features will eventually be passed to the classifier.
There is a good guide and flowchart on the Google machine learning website (link). In the flowchart you can see a 'select top k features' step in the pipeline before the model is trained.
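The effect of such a selection step can be sketched on toy text data. This example uses chi2 as the scoring function only because the toy input consists of non-negative counts; that is a choice made for this sketch, and SelectKBest's default scoring function is f_classif. The texts and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline

# Invented toy corpus with binary labels.
texts = [
    "cheap red widget", "cheap blue widget", "red gadget sale",
    "premium green gadget", "premium blue device", "green device offer",
]
labels = [0, 0, 0, 1, 1, 1]

K = 4  # fixed output width, chosen smaller than the vocabulary size

fe = Pipeline(steps=[
    ("text_bow", CountVectorizer()),
    # Keeps the K highest-scoring columns, so the output width is always K.
    ("select_k_best", SelectKBest(score_func=chi2, k=K)),
])

train_out = fe.fit_transform(texts, labels)
# Unseen text is mapped onto the same K columns chosen during fitting.
new_out = fe.transform(["unseen words plus one cheap gadget"])

print(train_out.shape, new_out.shape)
```

With the width pinned to K, the network's input_dim can be set to the same constant.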
So, try updating these parts of your code to:
def keras_classifier_wrapper():
    clf = Sequential()
    clf.add(Dense(32, input_dim=20000, activation='relu'))
    clf.add(Dense(2, activation='softmax'))
    clf.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
    return clf
and
from sklearn.feature_selection import SelectKBest
select_best_features = SelectKBest(k=20000)
PL = Pipeline(steps=[('feature_engineer', FE),
                     ('select_k_best', select_best_features),
                     ('keras_clf', clf)])
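As for the second error in the question, it most likely comes from calling .values before the pipeline: a ColumnTransformer that selects columns by name needs a pandas DataFrame, because a NumPy array has no column names. A minimal sketch with made-up column names, assuming the question's feature lists are lists of strings:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric columns, for illustration only.
df = pd.DataFrame({"age": [20.0, 30.0, 40.0], "income": [1.0, 2.0, 3.0]})

# Selecting columns by name works when the input is a DataFrame.
by_name = ColumnTransformer(
    transformers=[("num", StandardScaler(), ["age", "income"])])
out_df = by_name.fit_transform(df)

# The same transformer fails on a bare array, which has no column names.
try:
    ColumnTransformer(
        transformers=[("num", StandardScaler(), ["age", "income"])]
    ).fit_transform(df.values)
    array_error = None
except ValueError as exc:
    array_error = str(exc)

# Positional indices are the alternative if arrays must be passed.
by_position = ColumnTransformer(
    transformers=[("num", StandardScaler(), [0, 1])])
out_arr = by_position.fit_transform(df.values)

print(array_error)
```

So the simpler route is to keep X as a DataFrame when calling PL.fit and to convert only the target with to_categorical.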