It's not clear to me why some online resources instantiate a multi-target Random Forest regression as either
model = MultiOutputRegressor(RandomForestRegressor())
versus:
model = RandomForestRegressor()
when both seemingly generate multiple regressed outputs. Can anyone clarify?
The internal models are different, but they are both multioutput regressors.
MultiOutputRegressor
fits one separate random forest per target, so every tree inside a given forest predicts just one of your outputs.
Without the wrapper, RandomForestRegressor
fits trees that target all the outputs at once: the split criterion is based on the average impurity reduction across the outputs. See the User Guide.
The latter may be better computationally, since fewer trees are being built. It can also make use of the fact that the several outputs for a given input may well be correlated. That's all discussed in the user guide as well.
Some conjecture on my part: On the other hand, if the several outputs for a given input are not correlated, internal splits that are good for one output may be lousy for the other outputs, so simply averaging them might not work as well. I think in that case increasing the tree complexity can alleviate the issue (but will also take more computation).
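To make the difference concrete, here's a small sketch on toy data (the dataset shapes and hyperparameters are just illustrative assumptions). Both variants predict an array with one column per target, but the wrapper fits one forest per target while the bare estimator fits a single multi-output forest:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Toy multi-output regression problem: 3 targets per sample.
X, y = make_regression(n_samples=200, n_features=10, n_targets=3, random_state=0)

# One forest per target: 3 independent RandomForestRegressor instances get fit.
wrapped = MultiOutputRegressor(RandomForestRegressor(n_estimators=50, random_state=0))
wrapped.fit(X, y)

# One forest whose trees split on the averaged impurity across all 3 targets.
native = RandomForestRegressor(n_estimators=50, random_state=0)
native.fit(X, y)

# Both produce (n_samples, n_targets)-shaped predictions.
print(wrapped.predict(X[:5]).shape)  # (5, 3)
print(native.predict(X[:5]).shape)   # (5, 3)

# The wrapper exposes its per-target forests separately.
print(len(wrapped.estimators_))      # 3
```

Note that `wrapped` builds three full forests (150 trees total here) versus one forest of 50 trees for `native`, which is the computational difference mentioned above.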