I have a dataset with a string column (name: 14) that I want to treat as a categorical feature. As far as I know, there are two ways to do that:
pd.Categorical(data[14])
data[14].astype('category')
While both of these produce a result with the same .dtype, CategoricalDtype(categories=[' <=50K', ' >50K'], ordered=False), they are not the same.
Calling .describe() on the two results produces different output. The first one reports information about the individual categories, while the second one (astype(..)) gives the typical describe output with count, unique, top, freq, and name, listing dtype: object.
My question is, then, why / how do they differ?
It's this dataset: http://archive.ics.uci.edu/ml/datasets/Adult
data = pd.read_csv("./adult/adult.data", header=None)
pd.Categorical(data[14]).describe()
data[14].astype('category').describe()
pd.Categorical(data[14]).dtype
data[14].astype('category').dtype
As Bakuriu points out, type(pd.Categorical(data[14])) is Categorical, while
type(data[14].astype('category')) is Series:
import pandas as pd
data = pd.read_csv("./adult/adult.data", header=None)
cat = pd.Categorical(data[14])
ser = data[14].astype('category')
print(type(cat))
# pandas.core.arrays.categorical.Categorical
print(type(ser))
# pandas.core.series.Series
The behavior of describe() differs
because Categorical.describe is defined differently than Series.describe.
Whenever you call Categorical.describe(), you'll get counts and freqs per category:
In [174]: cat.describe()
Out[174]:
            counts    freqs
categories
 <=50K       24720  0.75919
 >50K         7841  0.24081
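and whenever you call Series.describe() on a categorical Series, you'll get count, unique, top and freq. Note that count and freq have a different meaning here too: count is the total number of non-null values, and freq is the number of occurrences of the most common value (top):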
In [175]: ser.describe()
Out[175]:
count      32561
unique         2
top        <=50K
freq       24720
Name: 14, dtype: object
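If you have the Series and want the per-category numbers anyway, something along these lines should work. This is only a sketch: the four-row `raw` Series is a made-up stand-in for column 14 of the Adult data, and `value_counts` is used here as one way to get counts and relative frequencies, not as what `Categorical.describe` does internally.

```python
import pandas as pd

# Hypothetical stand-in for column 14 of the Adult dataset (not the real data).
raw = pd.Series([" <=50K", " >50K", " <=50K", " <=50K"], name=14)

cat = pd.Categorical(raw)       # a Categorical array
ser = raw.astype("category")    # a Series whose values are stored as a Categorical

# The dtypes compare equal even though the container types differ.
same_dtype = cat.dtype == ser.dtype

# The Categorical that backs the Series is reachable through .array.
backing = ser.array

# Per-category counts and relative frequencies, computed from the Series.
counts = ser.value_counts()
freqs = ser.value_counts(normalize=True)

print(type(cat).__name__, type(ser).__name__, type(backing).__name__)
print(same_dtype)
print(counts)
print(freqs)
```

So the data is the same in both cases; the Series is just a labelled wrapper around the Categorical, and which describe() you get depends on which of the two objects you call it on.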