For a simple web page classification system, I am trying to combine some self-made features (frequency of HTML tags, frequency of certain word collocations) with the features obtained after applying TF-IDF. However, I have run into the following problem and don't really know how to proceed from here.
Right now I am trying to put all of these together in one dataframe, mainly by following the code from the following link:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
vectorizer = TfidfVectorizer(stop_words="english")
X_train_counts = vectorizer.fit_transform(train_data['text_no_punkt'])
feature_names = vectorizer.get_feature_names()
dense = X_train_counts.todense()
denselist = dense.tolist()
tfidf_df = pd.DataFrame(denselist, columns=feature_names, index=train_data['text_no_punkt'])
But this doesn't preserve the index (from 0 to 2464) I had in my original dataframe with the other features. It also doesn't seem to produce readable column names: instead of using the different words as column titles, it uses numbers.
Furthermore, I am not sure whether this is the right way to combine features, as it will result in an extremely high-dimensional dataframe, which will probably not benefit the classifiers.
You can use hstack from scipy.sparse to merge the two sparse matrices column-wise, without having to convert them to dense format.
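from scipy.sparse import hstack
# X_train_custom holds the self-made features, one row per document, in the same row order as X_train_counts
X_train_combined = hstack([X_train_counts, X_train_custom])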
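To make the hstack suggestion concrete, here is a minimal, self-contained sketch. The toy `train_data` and the custom feature columns `n_div_tags` and `n_collocations` are hypothetical stand-ins for the real data, not names from the question. The sketch also shows one possible way to get a small, labelled dataframe for inspection that keeps the original index; for training, the sparse matrix itself is usually what you would pass to the classifier.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for train_data; the two custom feature columns are hypothetical.
train_data = pd.DataFrame({
    "text_no_punkt": [
        "cheap flights to paris",
        "python pandas tutorial",
        "book cheap hotel",
    ],
    "n_div_tags": [12, 40, 7],
    "n_collocations": [1, 0, 2],
})
custom_cols = ["n_div_tags", "n_collocations"]

vectorizer = TfidfVectorizer(stop_words="english")
X_train_tfidf = vectorizer.fit_transform(train_data["text_no_punkt"])

# Custom features as a sparse matrix, rows in the same order as the TF-IDF rows.
X_train_custom = csr_matrix(train_data[custom_cols].to_numpy(dtype=float))

# Column-wise concatenation; everything stays sparse.
X_train_combined = hstack([X_train_tfidf, X_train_custom]).tocsr()

# Column labels in matrix order: vocabulary_ maps each term to its column index.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Optional: a sparse-backed dataframe for inspection, reusing the original index.
combined_df = pd.DataFrame.sparse.from_spmatrix(
    X_train_combined,
    index=train_data.index,
    columns=feature_names + custom_cols,
)
print(combined_df.shape)
```

Passing `index=train_data.index` (rather than the text column) is what keeps the original row labels, and the explicit `columns` list gives word-level column names. Since TF-IDF values lie between 0 and 1 while raw counts can be much larger, it may be worth scaling the custom features before stacking, depending on the classifier.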