How to combine tfidf features with selfmade features

Question

For a simple web page classification system, I am trying to combine some self-made features (frequency of HTML tags, frequency of certain word collocations) with the features obtained from TF-IDF. I have run into the following problem and don't know how to proceed.

Right now I am trying to put all of these together in one dataframe, mainly by following the code from the following link:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

vectorizer = TfidfVectorizer(stop_words="english")
X_train_counts = vectorizer.fit_transform(train_data['text_no_punkt'])
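# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (1.0+)
feature_names = vectorizer.get_feature_names_out()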
dense = X_train_counts.todense()
denselist = dense.tolist()

tfidf_df = pd.DataFrame(denselist, columns=feature_names, index=train_data['text_no_punkt'])

But this doesn't keep the index (0 to 2464) from my original dataframe with the other features. It also doesn't seem to produce readable column names: instead of using the words as column titles, it uses numbers.

Furthermore, I am not sure this is the right way to combine features, since it results in an extremely high-dimensional dataframe, which will probably not benefit the classifiers.

nicogen · Accepted Answer

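You can use scipy.sparse.hstack to merge the two sparse matrices column-wise, without converting to a dense format. Here X_train_custom is assumed to hold your custom features as a sparse matrix with the same number of rows, in the same order, as X_train_counts.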

from scipy.sparse import hstack

hstack([X_train_counts, X_train_custom])

Tags: python, pandas, nlp, scikit-learn, tf-idf

milvala
