I'm just starting out with Python and getting stuck on something while playing with the Kaggle Titanic data. https://www.kaggle.com/c/titanic/data
Here's what I am typing in an IPython notebook (train.csv comes from the Titanic data at the Kaggle link above):
import pandas as pd
df = pd.read_csv("C:/fakepath/titanic/data/train.csv")
I then continue with this to check if there's any bad data in the 'Sex' column:
df['Sex'].value_counts()
Which returns:
male 577
female 314
dtype: int64
I then map the 'Sex' values to integers in a new 'Gender' column:
df['Gender'] = df['Sex'].map( {'male': 1, 'female': 0} ).astype(int)
This doesn't produce any errors. To verify that it created a new column called 'Gender' with integer values:
df
which returns:
# PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Gender
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C 0
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S 0
...success, the Gender column is appended to the end and is 0 for female, 1 for male. Now, I create a new pandas dataframe which is a subset of the df dataframe.
df2 = df[ ['Survived', 'Pclass', 'Age', 'Gender', 'Embarked'] ]
df2
which returns:
Survived Pclass Age Gender Embarked
0 0 3 22 1 S
1 1 1 38 0 C
2 1 3 26 0 S
3 1 1 35 0 S
4 0 3 35 1 S
5 0 3 NaN 1 Q
df2['Embarked'].value_counts()
...shows that there are 3 unique values (S, C, Q):
S 644
C 168
Q 77
dtype: int64
However, when I try to execute what I think is the same type of operation as when I converted male/female to 1/0, I get an error:
df2['Embarked_int'] = df2['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2}).astype(int)
returns:
ValueError Traceback (most recent call last)
<ipython-input-29-294c08f2fc80> in <module>()
----> 1 df2['Embarked_int'] = df2['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2}).astype(int)
C:\Anaconda\lib\site-packages\pandas\core\generic.pyc in astype(self, dtype, copy, raise_on_error)
2212
2213 mgr = self._data.astype(
-> 2214 dtype=dtype, copy=copy, raise_on_error=raise_on_error)
2215 return self._constructor(mgr).__finalize__(self)
2216
C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in astype(self, dtype, **kwargs)
2500
2501 def astype(self, dtype, **kwargs):
-> 2502 return self.apply('astype', dtype=dtype, **kwargs)
2503
2504 def convert(self, **kwargs):
C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in apply(self, f, axes, filter, do_integrity_check, **kwargs)
2455 copy=align_copy)
2456
-> 2457 applied = getattr(b, f)(**kwargs)
2458
2459 if isinstance(applied, list):
C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in astype(self, dtype, copy, raise_on_error, values)
369 def astype(self, dtype, copy=False, raise_on_error=True, values=None):
370 return self._astype(dtype, copy=copy, raise_on_error=raise_on_error,
--> 371 values=values)
372
373 def _astype(self, dtype, copy=False, raise_on_error=True, values=None,
C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in _astype(self, dtype, copy, raise_on_error, values, klass)
399 if values is None:
400 # _astype_nansafe works fine with 1-d only
--> 401 values = com._astype_nansafe(self.values.ravel(), dtype, copy=True)
402 values = values.reshape(self.values.shape)
403 newb = make_block(values,
C:\Anaconda\lib\site-packages\pandas\core\common.pyc in _astype_nansafe(arr, dtype, copy)
2616
2617 if np.isnan(arr).any():
-> 2618 raise ValueError('Cannot convert NA to integer')
2619 elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
2620 # work around NumPy brokenness, #1987
ValueError: Cannot convert NA to integer
Any idea why I get this error on the second use of the map function but not the first? There are no NaN values in the Embarked column according to value_counts(). I'm guessing it's a noob problem :)
By default, value_counts does not count NaN values. You can change this with df['Embarked'].value_counts(dropna=False).
I compared your value_counts for the 'Sex' column (577 + 314 = 891) with the 'Embarked' column (644 + 168 + 77 = 889). They differ by 2, which means the 'Embarked' column must have 2 NaN values.
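To make the check above concrete, here is a minimal sketch on a tiny made-up Series standing in for the 'Embarked' column (the values are invented for illustration, not taken from train.csv). It compares the default value_counts with dropna=False and counts the missing values directly with isnull().sum().

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df['Embarked']; two entries are missing on purpose.
embarked = pd.Series(['S', 'C', np.nan, 'Q', 'S', np.nan], name='Embarked')

# Default behaviour: rows holding NaN are silently left out of the counts.
counts_default = embarked.value_counts()

# dropna=False gives NaN its own row, so the counts add up to len(embarked).
counts_with_nan = embarked.value_counts(dropna=False)

# Direct count of the missing values.
n_missing = embarked.isnull().sum()

print(counts_default)
print(counts_with_nan)
print("missing:", n_missing)
```

On the real data you would run the same calls on df['Embarked'] and, going by the totals in the question, expect to see 2 missing values.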
So either drop those rows first (using dropna) or fill them with some desired value (using fillna) before converting to int.
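Both options could look roughly like the sketch below. It uses a small made-up frame in place of df2, and the -1 placeholder for unknown ports is my own assumption; pick whatever value makes sense for your analysis.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df2 with one missing 'Embarked' value.
df2 = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'Q']})
codes = {'S': 0, 'C': 1, 'Q': 2}

# Option 1: drop the rows where 'Embarked' is missing, then map.
# .copy() gives an independent frame, so adding a column is safe.
dropped = df2.dropna(subset=['Embarked']).copy()
dropped['Embarked_int'] = dropped['Embarked'].map(codes).astype(int)

# Option 2: keep every row and replace the unmapped (NaN) results
# with a placeholder. -1 is a hypothetical sentinel, not a Kaggle convention.
filled = df2.copy()
filled['Embarked_int'] = filled['Embarked'].map(codes).fillna(-1).astype(int)

print(dropped)
print(filled)
```

In option 2 the astype(int) is needed, because the mapped column is float until the NaN values are filled.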
Also, once the NaN values are handled, the astype(int) is redundant: map already returns integers when every value has a match in the dict.
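You can see the difference between the two columns by looking at the dtype that map returns. This is a small sketch on made-up Series, not the actual Titanic columns:

```python
import numpy as np
import pandas as pd

# Every value has a match in the dict, like the 'Sex' column.
sex = pd.Series(['male', 'female', 'male'])
mapped_sex = sex.map({'male': 1, 'female': 0})

# One value is missing, like the 'Embarked' column.
embarked = pd.Series(['S', np.nan, 'Q'])
mapped_embarked = embarked.map({'S': 0, 'C': 1, 'Q': 2})

# The first result is already an integer column; the second is float,
# because NaN cannot be stored in a plain integer column.
print(mapped_sex.dtype)
print(mapped_embarked.dtype)
```

That float column with a NaN in it is what astype(int) refuses to convert in your second call.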