Looking for a way to preprocess string features

Question

For a machine learning problem, every sample has a location feature (a US state). The whole feature vector looks like this:

array(['oklahoma', 'florida', 'idaho', ..., 'pennsylvania', 'alabama',
   'washington'], dtype=object)

I cannot feed this directly into an sklearn algorithm, so I have to convert it into numerical features somehow, but I don't know how. What are the best ways to convert these string features? Would ASCII conversion work?

Edit: I want every state to have its own unique numerical value.

alko · Accepted Answer

You can use the label preprocessing tools in sklearn, specifically LabelEncoder:

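from sklearn import preprocessing

# Fit the encoder on the known state names
le = preprocessing.LabelEncoder()
le.fit(['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama',
        'washington'])
# classes_ holds the sorted unique labels; a label's index is its code
le.classes_
# array(['alabama', 'florida', 'idaho', 'oklahoma', 'pennsylvania',
#         'washington'],
#       dtype='|S12')
# Encode a state name as its integer code
le.transform(["oklahoma"])
# array([3])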
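
To make the mapping above more concrete, here is a rough pure-Python sketch of the same idea: sort the unique labels and use each label's position as its code. This is only an illustration, not sklearn's actual implementation, and the names `states`, `classes` and `state_to_code` are made up for the example.

```python
# Hypothetical sample of the location feature from the question
states = ['oklahoma', 'florida', 'idaho', 'pennsylvania',
          'alabama', 'washington', 'florida']

# Sorted unique values play the role of le.classes_
classes = sorted(set(states))

# Map each state name to its position in the sorted list
state_to_code = {name: i for i, name in enumerate(classes)}

# Encode every sample (comparable to le.transform)
codes = [state_to_code[s] for s in states]

# Decode back to names (comparable to le.inverse_transform)
decoded = [classes[c] for c in codes]

print(codes)
print(decoded)
```

Repeated states get the same code, and the mapping can be reversed, which is what the "one unique value per state" requirement asks for.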

neil · Answer

If you just want to turn each state name into a unique numerical value, then hash(text) would work well.

A more robust hash function may be needed, though, because the built-in hash() is not guaranteed to return the same value every time Python is run. In fact, from Python 3.3 onward, string hashes are salted differently for each process unless you fix the seed yourself (for example via the PYTHONHASHSEED environment variable). The hashlib module contains various hash algorithms that may suit better.
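
As a minimal sketch of the hashlib route, assuming SHA-256 and a 64-bit truncation are acceptable (both are arbitrary choices for this example, and `stable_state_id` is a hypothetical helper name):

```python
import hashlib

def stable_state_id(name, num_bytes=8):
    """Hypothetical helper: derive a deterministic integer from a state name."""
    # SHA-256 of the UTF-8 bytes gives the same digest in every Python process
    digest = hashlib.sha256(name.encode('utf-8')).digest()
    # Keep only the first num_bytes bytes so the integer stays a manageable size
    return int.from_bytes(digest[:num_bytes], 'big')

states = ['oklahoma', 'florida', 'idaho']
ids = [stable_state_id(s) for s in states]
print(ids)
```

Unlike hash(), this gives the same integers across runs. The values are large and carry no ordering, and truncating the digest means collisions are possible in principle, although very unlikely for a few dozen state names.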

Tags:

python

machine-learning

scikit-learn
