sklearn DecisionTreeClassifier with CountVectorizer and additional predictor

Question

I have built a text classification model with sklearn's DecisionTreeClassifier and would like to add another predictor. My data is in a pandas dataframe with columns labeled 'Impression' (text), 'Volume' (floats), and 'Cancer' (label). I've been using only Impression to predict Cancer but would like to use Impression and Volume to predict Cancer instead.

My previous code, which ran without issue:

X_train, X_test, y_train, y_test = train_test_split(data['Impression'], data['Cancer'], test_size=0.2)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

dt = DecisionTreeClassifier(class_weight='balanced', max_depth=6, min_samples_leaf=3, max_leaf_nodes=20)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

I've tried a few different ways to add the Volume predictor:

1) Only fit_transform the Impressions

X_train, X_test, y_train, y_test = train_test_split(data[['Impression', 'Volume']], data['Cancer'], test_size=0.2)

vectorizer = CountVectorizer()
X_train['Impression'] = vectorizer.fit_transform(X_train['Impression'])
X_test = vectorizer.transform(X_test)

dt = DecisionTreeClassifier(class_weight='balanced', max_depth=6, min_samples_leaf=3, max_leaf_nodes=20)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

This throws the error

TypeError: float() argument must be a string or a number, not 'csr_matrix'
...
ValueError: setting an array element with a sequence.

2) Call fit_transform on both Impressions and Volumes. Same code as above except for fit_transform line:

X_train = vectorizer.fit_transform(X_train)

This of course throws the error:

ValueError: Number of labels=1800 does not match number of samples=2
...
X_train.shape
(2, 2)
y_train.shape
(1800,)
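
The (2, 2) shape follows from how CountVectorizer reads its input: it iterates over whatever it is given, and iterating a DataFrame yields the column labels rather than the rows. A minimal sketch, using a made-up three-row frame in place of the real data:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the real frame (values are invented for illustration).
X = pd.DataFrame({'Impression': ['small nodule seen', 'no nodule seen', 'large mass'],
                  'Volume': [1.5, 0.2, 9.0]})

# Iterating a DataFrame yields its column labels, so the vectorizer sees
# two "documents": the strings 'Impression' and 'Volume'.
docs_seen = list(X)
counts = CountVectorizer().fit_transform(X)

print(docs_seen)      # ['Impression', 'Volume']
print(counts.shape)   # (2, 2): two "documents", two vocabulary terms
```

The row count of the result therefore depends on the number of columns, not on the number of samples, which is why it no longer matches y_train.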

I'm pretty sure method #1 is the right way to go but I haven't been able to find any tutorials or solutions for how I can add the float predictor to this text classification model.

Any help would be appreciated!

Venkatachalam · Accepted Answer

ColumnTransformer solves exactly this problem. Instead of manually appending the output of CountVectorizer to the other columns, set the remainder parameter to 'passthrough' so that Volume is passed through unchanged alongside the vectorized text.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn import set_config

set_config(print_changed_only=True, display='diagram')

data = pd.DataFrame({'Impression': ['this is the first text',
                                    'second one goes like this',
                                    'third one is very short',
                                    'This is the final statement'],
                     'Volume': [123, 1, 2, 123],
                     'Cancer': [1, 0, 0, 1]})

X_train, X_test, y_train, y_test = train_test_split(
    data[['Impression', 'Volume']], data['Cancer'], test_size=0.5)

ct = make_column_transformer(
    (CountVectorizer(), 'Impression'), remainder='passthrough')

pipeline = make_pipeline(ct, DecisionTreeClassifier())
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)

Use scikit-learn 0.23.0 or later to see the diagram rendering of pipeline objects (the display parameter of set_config).
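
To check what the column transformer actually hands to the tree, it can be fitted on its own and inspected. The sketch below uses a made-up two-row frame and assumes the default step name 'countvectorizer' that make_column_transformer derives from the class name. ColumnTransformer may return either a sparse or a dense matrix depending on its sparse_threshold, so the code handles both.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer

# Toy frame, invented for illustration only.
X = pd.DataFrame({'Impression': ['this is the first text',
                                 'second one goes like this'],
                  'Volume': [123.0, 1.0]})

ct = make_column_transformer(
    (CountVectorizer(), 'Impression'), remainder='passthrough')
Xt = ct.fit_transform(X)

# The result may be sparse or dense; normalise to a dense array to inspect it.
dense = Xt.toarray() if hasattr(Xt, 'toarray') else Xt

# One column per vocabulary term, plus the passed-through Volume column.
vocab_size = len(ct.named_transformers_['countvectorizer'].vocabulary_)
print(dense.shape == (2, vocab_size + 1))

# Remainder columns are appended after the transformed ones,
# so Volume is the last column.
print(dense[:, -1])
```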

Ehsan · Answer

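You can use scipy.sparse.hstack to combine the vectorized text with the Volume column. hstack takes a sequence of blocks, and Volume must come from the same rows as the text and be shaped as a single column.

from scipy.sparse import csr_matrix, hstack

# Vectorize the text only, then append Volume from the same (training) rows.
X_text = vectorizer.fit_transform(X_train['Impression'])
X_train_new = hstack([X_text, csr_matrix(X_train[['Volume']].to_numpy())]).tocsr()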

X_train_new now contains both features; build the test matrix the same way, using vectorizer.transform instead of fit_transform. One suggestion: consider TfidfVectorizer instead of CountVectorizer. TF-IDF down-weights words that appear in most documents/Impressions, whereas CountVectorizer only counts occurrences, so a common word like "the" can receive larger values than the words that really matter to us.
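
For reference, here is a minimal end-to-end sketch of the hstack approach covering both the training and the test rows, with TfidfVectorizer swapped in as suggested. The two small frames are invented for illustration, and the tree is left at default settings apart from a fixed random_state.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Invented data standing in for an already split train/test set.
train = pd.DataFrame({'Impression': ['this is the first text',
                                     'second one goes like this',
                                     'third one is very short',
                                     'this is the final statement'],
                      'Volume': [123.0, 1.0, 2.0, 123.0],
                      'Cancer': [1, 0, 0, 1]})
test = pd.DataFrame({'Impression': ['another short text', 'the final one'],
                     'Volume': [100.0, 3.0]})

# Fit the vocabulary on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer()
text_train = vectorizer.fit_transform(train['Impression'])
text_test = vectorizer.transform(test['Impression'])

# Append Volume as one extra sparse column, taken from the matching rows.
X_train_new = hstack([text_train, csr_matrix(train[['Volume']].to_numpy())]).tocsr()
X_test_new = hstack([text_test, csr_matrix(test[['Volume']].to_numpy())]).tocsr()

dt = DecisionTreeClassifier(random_state=0)
dt.fit(X_train_new, train['Cancer'])
y_pred = dt.predict(X_test_new)
print(X_train_new.shape, X_test_new.shape, y_pred.shape)
```

Both matrices end up with the same number of columns because the test text is transformed with the vocabulary learned from the training text.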

Tags: python, machine-learning, scikit-learn, decision-tree

user139260
