I'm trying to convert a Pandas dataframe to a NumPy array to create a model with Sklearn. I'll simplify the problem here.
>>> mydf.head(10)
IdVisita
445                                  latam
446                                    NaN
447                                 grados
448                                 grados
449                                eventos
450                                eventos
451         Reescribe-medios-clases-online
454                             postgrados
455                             postgrados
456                             postgrados
Name: cat1, dtype: object
>>> from sklearn import preprocessing
>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit(mydf)
Traceback:
ValueError                                Traceback (most recent call last)
<ipython-input-74-f581ab15cbed> in <module>()
      2 mydf.head(10)
      3 enc = preprocessing.OneHotEncoder()
----> 4 enc.fit(mydf)
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit(self, X, y)
    996         self
    997         """
--> 998         self.fit_transform(X)
    999         return self
   1000 
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit_transform(self, X, y)
   1052         """
   1053         return _transform_selected(X, self._fit_transform,
-> 1054                                    self.categorical_features, copy=True)
   1055 
   1056     def _transform(self, X):
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _transform_selected(X, transform, selected, copy)
    870     """
    871     if selected == "all":
--> 872         return transform(X)
    873 
    874     X = atleast2d_or_csc(X, copy=copy)
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _fit_transform(self, X)
   1001     def _fit_transform(self, X):
   1002         """Assumes X contains only categorical features."""
-> 1003         X = check_arrays(X, sparse_format='dense', dtype=np.int)[0]
   1004         if np.any(X < 0):
   1005             raise ValueError("X needs to contain only non-negative integers.")
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_arrays(*arrays, **options)
    279                     array = np.ascontiguousarray(array, dtype=dtype)
    280                 else:
--> 281                     array = np.asarray(array, dtype=dtype)
    282                 if not allow_nans:
    283                     _assert_all_finite(array)
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
    460 
    461     """
--> 462     return array(a, dtype, copy=False, order=order)
    463 
    464 def asanyarray(a, dtype=None, order=None):
ValueError: invalid literal for long() with base 10: 'postgrados'
Note that IdVisita is the index here, and its values are not necessarily consecutive.
Any clues?
The error comes from calling OneHotEncoder, which, according to the docs, expects integer input:
The input to this transformer should be a matrix of integers
Your df, however, has a single column 'cat1' of dtype object, which in fact holds strings.
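The last frame of the traceback is a plain NumPy integer conversion, so the failure can be reproduced without scikit-learn. The sketch below uses a few made-up category values (not the full data from the question) and only shows that strings cannot be cast to an integer array. The exact wording of the message differs between Python 2 and Python 3.

```python
import numpy as np

# Hypothetical sample of the string categories from the question
values = ["latam", "grados", "postgrados"]

message = None
try:
    # This is essentially what the old OneHotEncoder did internally
    np.asarray(values, dtype=int)
except ValueError as exc:
    message = str(exc)

print(message)
```

Recent scikit-learn releases (0.20 and later) let OneHotEncoder handle string categories directly, so this limitation applies to the older version shown in the traceback.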
You should use LabelEncoder first to turn the strings into integer codes:
In [13]:
le = preprocessing.LabelEncoder()
le.fit(df.dropna().values)
le.classes_
C:\WinPython-64bit-3.3.3.2\python-3.3.3.amd64\lib\site-packages\sklearn\preprocessing\label.py:108: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Out[13]:
array(['Reescribe-medios-clases-online', 'eventos', 'grados', 'latam',
       'postgrados'], dtype=object)
Note that I had to drop the NaN row: NaN is a float, so keeping it gives the column a mixed dtype that cannot be sorted (comparing a float with a str, e.g. float > str, does not work).
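As a minimal sketch of the same idea, the snippet below fits a LabelEncoder on a 1-D list of values, which avoids the DataConversionWarning shown above. The Series `cat1` is a small made-up stand-in for the question's data, not the real column.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical stand-in for the 'cat1' column, including one missing value
cat1 = pd.Series(
    ["latam", np.nan, "grados", "grados", "eventos", "postgrados"],
    name="cat1",
)

# Drop the NaN row and pass a flat (1-D) list of strings
clean = cat1.dropna()
le = preprocessing.LabelEncoder()
codes = le.fit_transform(clean.tolist())

print(list(le.classes_))  # categories in sorted order
print(codes.tolist())     # integer code for each remaining row
```

The integer codes follow the sorted order of the categories, so they carry no meaning beyond identity. That is why they are usually one-hot encoded afterwards rather than fed to a model as numbers.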
A simpler approach is to use DictVectorizer, which does the conversion to integers and the one-hot encoding in a single step. It expects a list of dicts (one per row) rather than a DataFrame.
With DictVectorizer(sparse=False), fit_transform returns a dense NumPy array instead of a SciPy sparse matrix, which is easy to wrap back into a DataFrame to keep working with pandas.
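A rough sketch of the DictVectorizer route follows. The DataFrame is a made-up sample shaped like the question's data, and dropping the NaN row first is an assumption on my part, since a NaN value would otherwise be treated as a numeric feature.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical sample resembling the question's data
mydf = pd.DataFrame(
    {"cat1": ["latam", np.nan, "grados", "grados", "eventos", "postgrados"]},
    index=pd.Index([445, 446, 447, 448, 449, 454], name="IdVisita"),
)

# Assumption: rows with a missing category are simply dropped
clean = mydf.dropna()

# DictVectorizer wants one dict per row, e.g. {'cat1': 'latam'}
records = clean.to_dict(orient="records")

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)  # dense NumPy array, one column per category

# Wrap the array back into a DataFrame, keeping the original index
encoded = pd.DataFrame(X, columns=vec.feature_names_, index=clean.index)
print(encoded)
```

The column names come out as `cat1=<value>`, and the IdVisita index is preserved, so the encoded frame can be joined back to other columns.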