I have several pandas data series, and want to train this data to map to an output, df["output"].
So far I have merged the series into one, and separated each by commas.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("sourcedata.csv")
# Concatenate the three category columns into one comma-separated string
sample = df["catA"] + "," + df["catB"] + "," + df["catC"]

def my_tokenizer(s):
    return s.split(",")

vect = CountVectorizer(analyzer='word', tokenizer=my_tokenizer, ngram_range=(1, 3), min_df=1)
train = vect.fit_transform(sample.values)

lf = LogisticRegression()
lfit = lf.fit(train, df["output"])
pred = lambda x: lfit.predict_proba(vect.transform([x]))
The problem is that this is a bag-of-words approach and doesn't consider:
- the order of words within each category ("orange banana" is different from "banana orange")
- that text in one category has different significance than the same text in another ("US" in one category could mean country of origin vs. destination)
For example, the entire string could be:
pred("US, Chiquita Banana, China")
Category A: Country of origin
Category B: Company & Type of Fruit (order does matter)
Category C: Destination
The way I am doing it currently ignores any kind of ordering, and also generates extra spaces in my feature names for some reason (which messes things up further):
In [1242]: vect.get_feature_names()[0:10]
Out[1242]:
[u'',
u' ',
u' ',
u' ',
u' ',
u' ',
u' US',
u' CA',
u' UK']
Any suggestions are welcome! Thanks a lot.
OK, first let's prepare your data set by selecting the relevant columns and removing leading and trailing spaces with str.strip:
sample = df[['catA', 'catB', 'catC']]
# Apply to sample (not df) so only the feature columns are stripped
sample = sample.apply(lambda col: col.str.strip())
From here you have a couple of options as to how to vectorize this for a training set. If you have a smallish number of levels across all of your features (say, fewer than 1000 in total), you can simply treat them as categorical variables and set train = pd.get_dummies(sample) to convert them to binary indicator variables. After this your data will look something like this:
catA_US  catA_CA  ...  catB_chiquita_banana  catB_morningstar_tomato  ...  catC_China  ...
1        0        ...  1                     0                        ...  1           ...
...
Notice that the variable names start with their origin column, which makes sure the model knows where each one comes from. Also, since you're using exact strings, word order in the second column is preserved.
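As a rough sketch (using the same df and sample as above, with a hypothetical new row for illustration), training and predicting with the dummy encoding could look like this; reindex realigns the prediction-time columns with the training columns:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# One binary indicator column per (column, level) pair
train = pd.get_dummies(sample)
lf = LogisticRegression().fit(train, df["output"])

# Encode a new example the same way and realign its columns with the
# training matrix; levels unseen at training time become all-zero columns
new = pd.DataFrame({'catA': ['US'], 'catB': ['Chiquita Banana'], 'catC': ['China']})
new_enc = pd.get_dummies(new).reindex(columns=train.columns, fill_value=0)
probs = lf.predict_proba(new_enc)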
If you have too many levels for this to work, or you want to consider the individual words in catB as well as the bigrams, you could apply a CountVectorizer separately to each column and then use hstack to concatenate the resulting output matrices:
import scipy.sparse as sp

# Fit a separate vectorizer per column so each column keeps its own vocabulary
vects = {col: CountVectorizer(ngram_range=(1, 3)) for col in sample.columns}
train = sp.hstack([vects[col].fit_transform(sample[col]) for col in sample.columns])
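To score a new example under this encoding, each field has to pass through the vectorizer fitted on its own column and be stacked in the same column order. A minimal sketch, assuming the per-column vects dict above and a hypothetical pred helper:

lfit = LogisticRegression().fit(train, df["output"])

def pred(catA, catB, catC):
    # Transform each field with its own fitted vectorizer, then stack the
    # pieces in the same order used to build the training matrix
    parts = {'catA': catA, 'catB': catB, 'catC': catC}
    row = sp.hstack([vects[col].transform([parts[col]]) for col in sample.columns])
    return lfit.predict_proba(row)

pred("US", "Chiquita Banana", "China")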