sklearn DecisionTreeClassifier with CountVectorizer and additional predictor

I have built a text classification model with sklearn's DecisionTreeClassifier and would like to add another predictor. My data is in a pandas dataframe with columns labeled 'Impression' (text), 'Volume' (floats), and 'Cancer' (label). I've been using only Impression to predict Cancer but would like to use Impression and Volume to predict Cancer instead.

My previous code, which ran without issue:

X_train, X_test, y_train, y_test = train_test_split(data['Impression'], data['Cancer'], test_size=0.2)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

dt = DecisionTreeClassifier(class_weight='balanced', max_depth=6, min_samples_leaf=3, max_leaf_nodes=20)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

I've tried a few different ways to add the Volume predictor:

1) Only fit_transform the Impressions

X_train, X_test, y_train, y_test = train_test_split(data[['Impression', 'Volume']], data['Cancer'], test_size=0.2)

vectorizer = CountVectorizer()
X_train['Impression'] = vectorizer.fit_transform(X_train['Impression'])
X_test = vectorizer.transform(X_test)

dt = DecisionTreeClassifier(class_weight='balanced', max_depth=6, min_samples_leaf=3, max_leaf_nodes=20)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

This throws the error

TypeError: float() argument must be a string or a number, not 'csr_matrix'
...
ValueError: setting an array element with a sequence.

2) Call fit_transform on both Impression and Volume. Same code as above except for the fit_transform line:

X_train = vectorizer.fit_transform(X_train)

This of course throws the error:

ValueError: Number of labels=1800 does not match number of samples=2
...
X_train.shape
(2, 2)
y_train.shape
(1800,)

I'm pretty sure method #1 is the right way to go but I haven't been able to find any tutorials or solutions for how I can add the float predictor to this text classification model.

Any help would be appreciated!

user139260 asked May 07 '26 09:05


2 Answers

ColumnTransformer solves exactly this problem. Instead of manually appending the CountVectorizer output to the other columns, set the remainder parameter of the ColumnTransformer to 'passthrough', so that the remaining columns (here, Volume) are passed through unchanged alongside the vectorized text.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn import set_config

# print_changed_only expects a boolean, not the string 'True'
set_config(print_changed_only=True, display='diagram')

data = pd.DataFrame({'Impression': ['this is the first text',
                                    'second one goes like this',
                                    'third one is very short',
                                    'This is the final statement'],
                     'Volume': [123, 1, 2, 123],
                     'Cancer': [1, 0, 0, 1]})

X_train, X_test, y_train, y_test = train_test_split(
    data[['Impression', 'Volume']], data['Cancer'], test_size=0.5)

ct = make_column_transformer(
    (CountVectorizer(), 'Impression'), remainder='passthrough')

pipeline = make_pipeline(ct, DecisionTreeClassifier())
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)

Use scikit-learn 0.23.0 or later to see the pipeline rendered as a diagram (the display parameter of set_config).
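
To see what the classifier actually receives, here is a minimal sketch that runs only the column transformer on the same toy data. It assumes the step name 'countvectorizer', which is the lowercased class name that make_column_transformer assigns automatically. The variable names features, dense and n_words are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer

# Same toy data as in the answer above.
data = pd.DataFrame({'Impression': ['this is the first text',
                                    'second one goes like this',
                                    'third one is very short',
                                    'This is the final statement'],
                     'Volume': [123, 1, 2, 123]})

ct = make_column_transformer(
    (CountVectorizer(), 'Impression'), remainder='passthrough')
features = ct.fit_transform(data)

# The result may be sparse or dense depending on its density; normalise to dense.
dense = features.toarray() if hasattr(features, 'toarray') else np.asarray(features)

# make_column_transformer names each step after its lowercased class name.
n_words = len(ct.named_transformers_['countvectorizer'].vocabulary_)

print(dense.shape)    # one column per vocabulary word, plus one for Volume
print(dense[:, -1])   # the passthrough Volume column comes last
```

The word-count columns come first and the passthrough Volume column is appended on the right, so the tree can split on either kind of feature.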

Venkatachalam answered May 09 '26 22:05


You can use scipy's hstack to combine the two sets of features.

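import numpy as np
from scipy.sparse import hstack
# Vectorize only the text column, then append Volume as one extra column.
# hstack expects a list of blocks, and Volume must come from the same split.
X_train_text = vectorizer.fit_transform(X_train['Impression'])
X_train_new = hstack([X_train_text, np.array(X_train[['Volume']])]).tocsr()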

Your new training matrix now contains both features. If I may offer some advice, consider TfidfVectorizer instead of CountVectorizer: TF-IDF weights each word by how informative it is across documents (Impressions), whereas CountVectorizer only counts occurrences, so a common word like "the" can end up weighted more heavily than the words that really matter to us.
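
The test split needs the same treatment, reusing the vectorizer fitted on the training text. Below is a rough end-to-end sketch of the hstack approach. The toy rows are made up for illustration, and random_state and stratify are added here only to make the split reproducible. Swapping in TfidfVectorizer would be a one-line change.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data using the column names from the question.
data = pd.DataFrame({
    'Impression': ['large mass seen', 'no mass seen', 'suspicious large lesion',
                   'normal scan', 'lesion with irregular border', 'clear scan',
                   'irregular mass present', 'nothing abnormal seen'],
    'Volume': [12.5, 0.4, 9.8, 0.2, 7.1, 0.3, 10.4, 0.1],
    'Cancer': [1, 0, 1, 0, 1, 0, 1, 0]})

X_train, X_test, y_train, y_test = train_test_split(
    data[['Impression', 'Volume']], data['Cancer'],
    test_size=0.25, random_state=0, stratify=data['Cancer'])

vectorizer = CountVectorizer()
# Fit the vocabulary on the training text only, then reuse it for the test text.
train_text = vectorizer.fit_transform(X_train['Impression'])
test_text = vectorizer.transform(X_test['Impression'])

# hstack takes a sequence of blocks; Volume is passed as an (n, 1) column.
X_train_new = hstack([train_text, X_train[['Volume']].to_numpy()]).tocsr()
X_test_new = hstack([test_text, X_test[['Volume']].to_numpy()]).tocsr()

dt = DecisionTreeClassifier(random_state=0)
dt.fit(X_train_new, y_train)
y_pred = dt.predict(X_test_new)
```

Calling transform rather than fit_transform on the test text keeps the column layout identical between the two matrices, which the fitted tree relies on.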

Ehsan answered May 09 '26 23:05