
How to combine TF-IDF features with self-made features

For a simple web page classification system I am trying to combine some self-made features (frequency of HTML tags, frequency of certain word collocations) with the features obtained after applying TF-IDF. However, I am facing the following problem and don't really know how to proceed from here.

Right now I am trying to put all of these together in one dataframe, mainly by following the code from the following link:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

vectorizer = TfidfVectorizer(stop_words="english")
X_train_counts = vectorizer.fit_transform(train_data['text_no_punkt'])
feature_names = vectorizer.get_feature_names()
dense = X_train_counts.todense()
denselist = dense.tolist()

tfidf_df = pd.DataFrame(denselist, columns=feature_names, index=train_data['text_no_punkt'])

But this doesn't keep the index (from 0 to 2464) that I had in my original dataframe with the other features. It also doesn't seem to produce readable column names: instead of using the different words as titles, it uses numbers.

Furthermore, I am not sure if this is the right way to combine features, as it will result in an extremely high-dimensional dataframe, which will probably not benefit the classifiers.

milvala asked Dec 14 '25 05:12



1 Answer

You can use hstack to merge the two sparse matrices, without having to convert to dense format.

from scipy.sparse import hstack

hstack([X_train_counts, X_train_custom])
nicogen answered Dec 16 '25 21:12
