I have this table in Excel:
id class
0 2 3
1 1 3
2 3 5
Now, I want to do a 'special' one-hot encoding in Python. For each id in the first table, there are two numbers. Each number corresponds to a class (class1, class2, etc.). The second table is created based off of the first such that for each id, each number in its row shows up in its corresponding class column and the other columns just get zeros. For example, the numbers for id 0 are 2 and 3. The 2 is placed at class2 and the 3 is placed at class3. Classes 1, 4, and 5 get the default of 0. The result should be like:
id class1 class2 class3 class4 class5
0 0 2 3 0 0
1 1 0 3 0 0
2 0 0 3 0 5
My previous solution,
foo = lambda x: pd.Series([i for i in x.split()])
result=onehot['hotel'].apply(foo)
result.columns=['class1','class2']
pd.get_dummies(result, prefix='class', columns=['class1','class2'])
results in:
class_1 class_2 class_3 class_3 class_5
0 0.0 1.0 0.0 1.0 0.0
1 1.0 0.0 0.0 1.0 0.0
2 0.0 0.0 1.0 0.0 1.0
(class_3 appears twice). What can I do to fix this? (After this step, I can transform it to the final format I want.)
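You need to make your variables categorical, with all five classes declared up front, so that get_dummies produces the same five columns for both numbers. Then you can one-hot encode them as shown: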
In [18]: df1 = pd.DataFrame({"class": pd.Categorical(['2','1','3'], categories=['1','2','3','4','5'])})
In [19]: df2 = pd.DataFrame({"class": pd.Categorical(['3','3','5'], categories=['1','2','3','4','5'])})
In [20]: df_1 = pd.get_dummies(df1, dtype=int)
In [21]: df_2 = pd.get_dummies(df2, dtype=int)
In [22]: df_1.add(df_2).apply(lambda x: x * list(range(1, len(df_1.columns) + 1)), axis=1).astype(int).rename_axis('id')
Out[22]:
class_1 class_2 class_3 class_4 class_5
id
0 0 2 3 0 0
1 1 0 3 0 0
2 0 0 3 0 5
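The duplicate class_3 in the question appears because get_dummies encodes each column separately, using only the values present in that column. A rough sketch of an alternative, assuming the numbers arrive as space-separated strings in onehot['hotel'] as in the question (the sample values and the fixed range of five classes are assumptions), is to stack both numbers into one Series before encoding:

```python
import pandas as pd

# Assumed input: one string column holding the two class numbers per id.
onehot = pd.DataFrame({"hotel": ["2 3", "1 3", "3 5"]})

# Split into one column per number, then stack into a single long Series
# so every number is encoded against the same set of values.
numbers = onehot["hotel"].str.split(expand=True).stack().astype(int)

# One indicator column per distinct number, collapsed back to one row per id.
dummies = pd.get_dummies(numbers, dtype=int).groupby(level=0).max()

# Add the classes that never occur (here 4), then scale each indicator
# by its class number. The range 1..5 is an assumption about the data.
classes = list(range(1, 6))
result = dummies.reindex(columns=classes, fill_value=0) * classes
result.columns = [f"class{c}" for c in classes]
result = result.rename_axis("id")
print(result)
```

Because the encoding runs on a single stacked Series, each class gets exactly one column, and the reindex step fills in any class that no row uses.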
Does this solve your problem as stated?
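#!/usr/bin/env python3
from functools import reduce

rows = [
    (0, (2, 3)),
    (1, (1, 3)),
    (2, (3, 5)),
]
# Highest class number in the data; this sets the number of columns.
maximum = max(reduce(lambda x, y: x + list(y[1]), rows, []))
# Or ...
# maximum = 0
# for i, classes in rows:
#     maximum = max(maximum, *classes)
# Print header.
print("\t".join(["id"] + ["class_%d" % i for i in range(1, maximum + 1)]))
for i, classes in rows:
    print(i, end="")
    for r in range(1, maximum + 1):
        print("\t", end="")
        # Print the class number itself where it occurs, 0.0 elsewhere.
        if r in classes:
            print(float(r), end="")
        else:
            print(0.0, end="")
    print()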
Output:
id class_1 class_2 class_3 class_4 class_5
0 0.0 2.0 3.0 0.0 0.0
1 1.0 0.0 3.0 0.0 0.0
2 0.0 0.0 3.0 0.0 5.0
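If you want the values back as data rather than printed text, the same loop can be wrapped in a function. This is only a sketch; the name encode_rows and its n_classes parameter are made up for illustration:

```python
def encode_rows(rows, n_classes=None):
    """Map each id to a list with the class number in its own slot, 0 elsewhere."""
    if n_classes is None:
        # Default to the largest class number seen in the data.
        n_classes = max(c for _, classes in rows for c in classes)
    return {
        i: [r if r in classes else 0 for r in range(1, n_classes + 1)]
        for i, classes in rows
    }


rows = [(0, (2, 3)), (1, (1, 3)), (2, (3, 5))]
encoded = encode_rows(rows)
for i, values in encoded.items():
    print(i, *values, sep="\t")
```

Passing n_classes explicitly keeps the column count stable when the highest class happens not to appear in a given batch of rows.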