
How can I get information about the trees in a Random Forest in sklearn?

I would like to learn more about the Random Forest Regressors I am building with sklearn. For example, which depth do the trees have on average if I do not regularise?

The reason for this is that I need to regularise the model and want to get a feeling for what the model looks like at the moment. Also, if I set e.g. max_leaf_nodes, will it still be necessary to also restrict max_depth, or will this "problem" sort of solve itself because the tree cannot grow too deep if max_leaf_nodes is set? Does this make sense, or am I thinking in the wrong direction? I could not find anything on this.
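(For what it's worth, a quick empirical check of the max_leaf_nodes question is possible: a binary tree with k leaves has depth at most k - 1, so setting max_leaf_nodes does implicitly bound the depth, though in practice the trees stay far below that worst case. A minimal sketch, assuming a regressor like the one described:)

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic data just for illustration
X, y = make_regression(n_samples=1000, n_features=4, random_state=0)

# limit only the number of leaves, leave max_depth unrestricted
rf = RandomForestRegressor(n_estimators=100, max_leaf_nodes=16,
                           random_state=0).fit(X, y)

depths = [est.tree_.max_depth for est in rf.estimators_]
# with 16 leaves the depth can never exceed 15, and is usually much smaller
print("max depth over all trees:", max(depths))
```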

asked Oct 20 '25 16:10 by evilolive

2 Answers

If you want to know the average maximum depth of the trees constituting your Random Forest model, you have to access each tree individually, query its maximum depth, and then compute a statistic over the results.

Let's first build a reproducible Random Forest classifier (adapted from the scikit-learn documentation):

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

clf = RandomForestClassifier(n_estimators=100,
                             random_state=0)
clf.fit(X, y)

Now we can iterate over its estimators_ attribute, which contains the individual decision trees. For each tree, we read the attribute tree_.max_depth, store the value, and take the average once the iteration is complete:

max_depth = list()
for tree in clf.estimators_:
    max_depth.append(tree.tree_.max_depth)

print("avg max depth %0.1f" % (sum(max_depth) / len(max_depth)))

This gives you an idea of the average maximum depth of the trees composing your Random Forest model (it works exactly the same for a regressor, as in your case).

In any case, as a suggestion: if you want to regularize your model, you are better off testing hyperparameter hypotheses under a cross-validation and grid/random search paradigm. In that setting you don't actually need to reason about how the hyperparameters interact with each other; you simply test different combinations and pick the best one based on the cross-validation score.
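A minimal sketch of that idea with GridSearchCV, tuning max_depth and max_leaf_nodes jointly on the same toy data as above (the grid values are arbitrary examples, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# let cross-validation decide how the two size limits interact
param_grid = {
    "max_depth": [None, 5, 10],
    "max_leaf_nodes": [None, 16, 64],
}
search = GridSearchCV(RandomForestClassifier(n_estimators=100,
                                             random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
print("best CV accuracy: %.3f" % search.best_score_)
```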

answered Oct 23 '25 06:10 by Luca Massaron

In addition to @Luca Massaron's answer:

I found https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html#sphx-glr-auto-examples-tree-plot-unveil-tree-structure-py, which can be applied to each tree in the forest by iterating:

for tree in clf.estimators_:

The number of leaf nodes can be calculated like this:

import numpy as np

n_trees = len(clf.estimators_)
n_leaves = np.zeros(n_trees, dtype=int)
for i in range(n_trees):
    n_nodes = clf.estimators_[i].tree_.node_count
    # a node is a leaf when it has no children; checking
    # children_left (or, equivalently, children_right) is enough
    children_left = clf.estimators_[i].tree_.children_left
    for x in range(n_nodes):
        if children_left[x] == -1:
            n_leaves[i] += 1
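As a side note, in recent scikit-learn versions (0.21 and later) the manual bookkeeping above can be replaced by the built-in helpers get_depth() and get_n_leaves() on each tree, e.g.:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# per-tree leaf counts and depths via the public helper methods
n_leaves = np.array([est.get_n_leaves() for est in clf.estimators_])
depths = np.array([est.get_depth() for est in clf.estimators_])
print("mean leaves: %.1f, mean depth: %.1f" % (n_leaves.mean(), depths.mean()))
```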
answered Oct 23 '25 05:10 by evilolive


