
Looking for a way to preprocess string features

For a machine learning problem, every sample has a location feature (a state in America). The whole feature vector looks like this:

array(['oklahoma', 'florida', 'idaho', ..., 'pennsylvania', 'alabama',
   'washington'], dtype=object)

I cannot feed this directly into a sklearn algorithm, so I have to convert it into numerical features somehow, but I don't know how. What are the best ways to convert these string features? Would ASCII conversion work?

Edit: I want every state to have its own unique numerical value.

Learner asked Nov 22 '25 12:11


2 Answers

You can use the LabelEncoder from scikit-learn's preprocessing module:

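from sklearn import preprocessing

# Learn the mapping from each distinct state name to an integer code.
le = preprocessing.LabelEncoder()
le.fit(['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama',
        'washington'])

# The learned classes are stored in sorted order; the index is the code.
le.classes_
# array(['alabama', 'florida', 'idaho', 'oklahoma', 'pennsylvania',
#        'washington'],
#       dtype='<U12')   # shown as dtype='|S12' on Python 2

# Encode a state name as its integer code.
le.transform(["oklahoma"])
# array([3])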
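
To make the mapping concrete, here is a rough plain-Python sketch of the same idea: sort the distinct state names and number them. This is an illustration only, not scikit-learn's actual implementation, and the names `states`, `classes`, `to_code`, `codes` and `decoded` are made up for the example.

```python
# Example feature vector; 'florida' appears twice to show repeated values.
states = ['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama',
          'washington', 'florida']

# Sort the distinct values and number them, mirroring the sorted classes_ above.
classes = sorted(set(states))
to_code = {name: code for code, name in enumerate(classes)}

# Encode the whole feature vector, then map the codes back to names.
codes = [to_code[name] for name in states]
decoded = [classes[code] for code in codes]

print(codes)
print(decoded == states)
```

With the encoder itself, the equivalent calls would be `le.fit_transform(...)` on the full array and `le.inverse_transform(...)` to get the names back.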
alko answered Nov 25 '25 04:11


If you just want to turn each state name into a numerical value that is unique in practice, then hash(text) would work well.

However, the result of hash() is not guaranteed to be the same every time Python is run, so a more robust hash function may be needed. Since Python 3.3, string hashes are salted differently for each process by default, unless you fix the seed with the PYTHONHASHSEED environment variable. The hashlib module contains various hash algorithms that may suit better.
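
As a minimal sketch of the hashlib route, the snippet below derives a run-independent integer from each state name. The helper name `stable_code`, the choice of SHA-256, and the truncation to 8 bytes are assumptions made for illustration; any hashlib algorithm would work the same way.

```python
import hashlib


def stable_code(text: str) -> int:
    """Map a string to a 64-bit integer that is the same on every run."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Keep the first 8 bytes so the value fits in an unsigned 64-bit integer.
    return int.from_bytes(digest[:8], byteorder="big")


states = ['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama',
          'washington']
codes = {name: stable_code(name) for name in states}

for name, code in codes.items():
    print(name, code)
```

Unlike the built-in hash(), these values do not depend on PYTHONHASHSEED. Truncating the digest means collisions are possible in principle, although they are very unlikely for a few dozen state names.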

neil answered Nov 25 '25 02:11