I have a categorical variable with known levels (e.g. hour, which only contains values between 0 and 23), but not all of them are present in the data right now (say, we have measurements for hours 0 to 11, while hours 12 to 23 are not covered yet), though the other values are going to be added later. If we naively use pandas.get_dummies() to map values to indicator variables, we will end up with only 12 of them instead of 24. Is there a way to map values of the categorical variable to a predefined list of dummy variables?
Here's an example of expected behaviour:
possible_values = range(24)
hours = get_dummies_on_steroids(df['hour'], prefix='hour', levels=possible_values)
Using the new and improved Categorical type in pandas 0.15:
import pandas as pd
import numpy as np
df = pd.DataFrame({'hour': [0, 1, 3, 8, 13, 14], 'val': np.random.randn(6)})
df
Out[4]:
   hour       val
0     0 -0.098287
1     1 -0.682777
2     3  1.000749
3     8 -0.558877
4    13  1.423675
5    14  1.461552
df['hour_cat'] = pd.Categorical(df['hour'], categories=range(24))
pd.get_dummies(df['hour_cat'])
Out[6]:
   0  1  2  3  4  5  6  7  8  9 ...
0  1  0  0  0  0  0  0  0  0  0 ...
1  0  1  0  0  0  0  0  0  0  0 ...
2  0  0  0  1  0  0  0  0  0  0 ...
3  0  0  0  0  0  0  0  0  1  0 ...
4  0  0  0  0  0  0  0  0  0  0 ...
5  0  0  0  0  0  0  0  0  0  0 ...
The situation you describe, where you know your data can take a specific set of values but you haven't necessarily observed all of them, is exactly what Categorical is good for.
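To get something close to the interface asked for in the question, the same idea can be wrapped in a small helper. This is only a sketch: get_dummies_with_levels is a hypothetical name (it is not part of pandas), and it assumes a pandas version recent enough to provide pd.CategoricalDtype.

```python
import pandas as pd


def get_dummies_with_levels(values, levels, prefix=None):
    """One-hot encode `values` against a fixed, predefined list of `levels`.

    Hypothetical helper, not a pandas function. Any value that is not in
    `levels` becomes NaN in the categorical and so gets an all-zero row.
    """
    dtype = pd.CategoricalDtype(categories=list(levels))
    # Casting a Series keeps its index, so the dummies line up with the input.
    cat = pd.Series(values).astype(dtype)
    # get_dummies emits one column per category, observed or not.
    return pd.get_dummies(cat, prefix=prefix)


df = pd.DataFrame({'hour': [0, 1, 3, 8, 13, 14]})
hours = get_dummies_with_levels(df['hour'], levels=range(24), prefix='hour')
print(hours.shape)  # (6, 24)
```

The columns come out as hour_0 through hour_23, and the ones for hours that have not been observed yet (e.g. hour_12) contain no positive entries. When more data arrives later, the column layout stays the same.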