While working with the DecisionTreeClassifier I visualized it using graphviz, and I have to say, to my astonishment, it seems it takes categorical data and uses it as continuous data.
All my features are categorical; for example, see the following tree (please note that the first feature, X[0], has 6 possible values: 0, 1, 2, 3, 4, 5):
 From what I found here the class uses a tree class which is a binary tree, so it is a limitation in sklearn.
Does anyone know a way, which I may be missing, to make the tree treat the features as categorical? (I know it is not better for the task, but since I need categories I am currently using one-hot vectors on the data.)
A sample of the original data looks like this:
  f1 f2 f3  f4  f5  f6  f7  f8  f9  f10  c1  c2  c3
0  C  S  O   1   2   1   1   2   1    2   0   0   0
1  D  S  O   1   3   1   1   2   1    2   0   0   0
2  C  S  O   1   3   1   1   2   1    1   0   0   0
3  D  S  O   1   3   1   1   2   1    2   0   0   0
4  D  A  O   1   3   1   1   2   1    2   0   0   0
5  D  A  O   1   2   1   1   2   1    2   0   0   0
6  D  A  O   1   2   1   1   2   1    1   0   0   0
7  D  A  O   1   2   1   1   2   1    2   0   0   0
8  D  K  O   1   3   1   1   2   1    2   0   0   0
9  C  R  O   1   3   1   1   2   1    1   0   0   0
where X[0] = f1 and I encoded strings to integers as sklearn does not accept strings.
Decision trees are able to handle both numerical and categorical data. However, the scikit-learn implementation does not support categorical variables for now. Other techniques are usually specialized in analyzing datasets that have only one type of variable.
Decision trees as an algorithm can handle both categorical and numerical variables as features at the same time; there is no problem in doing that.
Encoding is needed because not all machine learning algorithms can deal with categorical data. Many of them cannot operate on label data directly and require all input and output variables to be numeric. That is why we need to encode them.
One of the most important features of the Random Forest algorithm is that it can handle data sets containing continuous variables, as in the case of regression, and categorical variables, as in the case of classification. It tends to give better results on classification problems.
Well, I am surprised, but it turns out that sklearn's decision tree indeed cannot handle categorical data. There is a GitHub issue on this (#4899) from June 2015, but it is still open (UPDATE: it is now closed, but continued in #12866, so the issue is still not resolved).
The problem with coding categorical variables as integers, as you have done here, is that it imposes an order on them, which may or may not be meaningful, depending on the case. For example, you could encode ['low', 'medium', 'high'] as [0, 1, 2], since 'low' < 'medium' < 'high' (we call these categorical variables ordinal), although you are still implicitly making the additional (and possibly undesired) assumption that the distance between 'low' and 'medium' is the same as the distance between 'medium' and 'high' (of no impact in decision trees, but of importance e.g. in k-nn and clustering). But this approach fails completely in cases like, say, ['red', 'green', 'blue'] or ['male', 'female'], since we cannot claim any meaningful relative order between them.
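To make the ordering issue concrete, here is a minimal sketch on made-up data (the feature name color_code and the labels are purely illustrative, not taken from the question). A single feature is integer-coded as 0, 1, 2, and only the middle code belongs to class 1. Because the tree can only ask "is the code <= some threshold?", it cannot isolate the middle category in one split:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical integer-coded categorical feature, e.g. red=0, green=1, blue=2.
# Only the middle code (1) belongs to class 1.
X = [[0], [1], [2], [0], [1], [2]]
y = [0, 1, 0, 0, 1, 0]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Print the learned rules; every test is a numeric comparison on the code.
print(export_text(tree, feature_names=["color_code"]))

# Collect the thresholds of the internal (non-leaf) nodes.
# Leaf nodes are marked with a negative feature index.
internal = tree.tree_.feature >= 0
thresholds = sorted(float(t) for t in tree.tree_.threshold[internal])
print(thresholds)
```

On this toy data the tree should need two threshold splits (at 0.5 and 1.5) to single out one category, which is exactly the "categorical treated as continuous" behaviour seen in the graphviz plot.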
So, for non-ordinal categorical variables, the way to properly encode them for use in sklearn's decision tree is to use the OneHotEncoder class. The Encoding categorical features section of the user guide might also be helpful.
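As a rough sketch of that approach, assuming all features are categorical strings as in the question, the encoder and the tree can be chained in a pipeline so the raw strings never have to be integer-coded by hand. The toy rows and labels below are invented for illustration and only loosely resemble the f1/f2 columns of the sample:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data: two categorical string features (illustrative values only).
X = [
    ["C", "S"],
    ["D", "S"],
    ["D", "A"],
    ["C", "R"],
    ["D", "K"],
    ["C", "A"],
]
y = [0, 0, 1, 1, 0, 1]

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising an error at prediction time.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X, y)

# Inspect what the encoder learned: one binary column per category.
encoder = model.named_steps["onehotencoder"]
encoded = encoder.transform(X)
print(encoder.categories_)
print(encoded.shape)
print(model.predict([["D", "A"]]))
```

Each split in the fitted tree now tests a single 0/1 indicator column, so it reads as "is f2 equal to A or not" rather than as a comparison between arbitrary integer codes. The cost is a wider feature matrix when a variable has many categories.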