For a machine learning problem, every sample has a location feature (a US state). The whole feature vector looks like this:
array(['oklahoma', 'florida', 'idaho', ..., 'pennsylvania', 'alabama',
'washington'], dtype=object)
I cannot feed this directly into a sklearn algorithm, so I have to convert it into numerical features somehow, but I don't know how to do this. What are the best ways to convert these string features? Would ASCII conversion work?
Edit: I want every state to have its own unique numerical value.
You can use LabelEncoder from sklearn's preprocessing module:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama',
'washington'])
le.classes_
# array(['alabama', 'florida', 'idaho', 'oklahoma', 'pennsylvania',
# 'washington'],
# dtype='|S12')
le.transform(["oklahoma"])
# array([3])
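To encode the whole feature vector in one step rather than one value at a time, something like the following sketch may help. It assumes a small stand-in array named states (the real vector from the question is elided, so the values here are only illustrative) and uses fit_transform and inverse_transform from the same LabelEncoder.

```python
import numpy as np
from sklearn import preprocessing

# Illustrative stand-in for the full feature vector (assumed values).
states = np.array(['oklahoma', 'florida', 'idaho', 'pennsylvania',
                   'alabama', 'washington', 'florida'], dtype=object)

le = preprocessing.LabelEncoder()

# Learn the sorted set of unique states and map every sample to its index.
codes = le.fit_transform(states)

# Map the integer codes back to the original strings.
restored = le.inverse_transform(codes)
```

The codes simply follow the alphabetical order of the state names, so their numeric order carries no meaning of its own.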
If you just want to turn each state name into a unique numerical value then hash(text) would work well.
A more stable hash function may be needed, though, because the built-in hash is not guaranteed to return the same value every time Python is run. In fact, from Python 3.3 onward string hashes are salted differently for each process unless you specifically set it up to do otherwise (for example with the PYTHONHASHSEED environment variable). The hashlib module contains various hash algorithms that may suit better.
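As a rough sketch of the hashlib route, the helper below derives an integer from a SHA-256 digest so that the value stays the same across Python runs. The function name stable_state_id and the choice to keep only the first 8 bytes are assumptions made for illustration, not something the answer prescribes.

```python
import hashlib

def stable_state_id(name: str) -> int:
    """Return a run-independent integer for a state name (hypothetical helper)."""
    # SHA-256 is deterministic, unlike the salted built-in hash() for strings.
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # Keep the first 8 bytes so the result fits in an unsigned 64-bit integer.
    return int.from_bytes(digest[:8], "big")

ids = [stable_state_id(s) for s in ["oklahoma", "florida", "idaho"]]
```

Collisions are theoretically possible with any hash, although they are extremely unlikely for a few dozen state names.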