random_forests
Random Forest
Bootstrap
Bagging
Earlier, we saw that the bootstrap can be used as a way of assessing the accuracy of a parameter estimate or a prediction. Here, we show how to use the bootstrap to improve the estimate or prediction itself.
Consider fitting a model to training data $Z = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, obtaining the prediction $\hat{f}(x)$ at input $x$. Bagging averages this prediction over a collection of bootstrap samples, thereby reducing its variance: for each bootstrap sample $Z^{*b}$, $b = 1, \ldots, B$, we fit the model to obtain $\hat{f}^{*b}(x)$, and the bagging estimate is $\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x)$.
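A minimal sketch of this procedure for regression (the toy data and the use of scikit-learn trees are assumptions for illustration, not part of the original notes): draw $B$ bootstrap samples, fit a tree to each, and average the predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy regression data (hypothetical): y = sin(x) + noise
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

B = 50           # number of bootstrap samples
N = len(X)
trees = []

for b in range(B):
    # Draw bootstrap sample Z*b: N points sampled with replacement
    idx = rng.integers(0, N, size=N)
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Bagging estimate: average of the B individual tree predictions
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
f_bag = np.mean([t.predict(X_test) for t in trees], axis=0)
print(f_bag)
```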
Random Forest
Random Forests is a substantial modification of bagging that builds a large collection of de-correlated trees and then averages them. The essential idea in bagging is to average many noisy but approximately unbiased models and hence reduce the variance. Trees are ideal candidates for bagging, since they can capture complex interaction structures in the data and, if grown sufficiently deep, have relatively low bias. Since trees are notoriously noisy, they benefit greatly from the averaging (variance reduction).
Since each tree generated in bagging is identically distributed, the expectation of an average of $B$ such trees is the same as the expectation of any one of them. The bias of bagged trees is therefore the same as that of the individual trees, and the only hope of improvement is through variance reduction.

An average of $B$ i.i.d. random variables, each with variance $\sigma^2$, has variance $\frac{1}{B}\sigma^2$. If the variables are only identically distributed (not independent) with positive pairwise correlation $\rho$, the variance of the average is
$$\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2.$$

As $B$ increases, the second term disappears but the first remains, so the correlation between pairs of bagged trees limits the benefit of averaging. The idea in random forests is to improve on bagging by reducing the correlation between the trees, without increasing the variance too much. This is achieved in the tree-growing process through random selection of the input variables.
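A quick numerical check of this formula (with $\rho$ and $\sigma^2$ set to illustrative values, an assumption for the example) shows why the correlation term dominates as $B$ grows:

```python
# Variance of an average of B identically distributed variables
# with variance sigma2 and pairwise correlation rho:
#   var = rho * sigma2 + (1 - rho) / B * sigma2
sigma2, rho = 1.0, 0.5   # assumed illustrative values

for B in (1, 10, 100, 1000):
    var = rho * sigma2 + (1 - rho) / B * sigma2
    print(f"B={B:5d}  var={var:.4f}")   # approaches rho * sigma2 = 0.5
```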
Specifically, when growing a tree on a bootstrapped dataset:
- Before each split, select $m \le p$ of the input variables at random as candidates for splitting (typical values for $m$ are $\sqrt{p}$, or even as low as 1).
- After $B$ such trees $\{T_b\}_{b=1}^{B}$ are grown, the random forest regression predictor is $\hat{f}^{B}_{\mathrm{rf}}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x)$ (see the sketch after this list).
- For classification, a random forest obtains a class vote from each tree, and then classifies using majority vote.
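A minimal regression sketch of this procedure, under the assumption that scikit-learn trees are available; their `max_features` option performs the per-split random selection of $m$ candidate variables, so the hand-rolled loop only needs to handle the bootstrap and the averaging.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))                       # toy data, p = 10 features
y = X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(size=300)

B, N, p = 100, len(X), X.shape[1]
m = max(1, p // 3)                                   # m <= p candidates per split
forest = []

for b in range(B):
    idx = rng.integers(0, N, size=N)                 # bootstrap sample
    tree = DecisionTreeRegressor(max_features=m)     # random m features at each split
    forest.append(tree.fit(X[idx], y[idx]))

# Random forest regression predictor: average of the B trees
def rf_predict(X_new):
    return np.mean([t.predict(X_new) for t in forest], axis=0)

print(rf_predict(X[:5]))
```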
Out-of-Bag Samples
An important feature of random forests is their use of out-of-bag (OOB) samples:
- For each observation $z_i = (x_i, y_i)$, construct its random forest predictor by averaging only those trees corresponding to bootstrap samples in which $z_i$ did not appear.
An out-of-bag error estimate (which can serve as a validation error when data are short) is almost identical to that obtained by $N$-fold cross-validation.
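A minimal sketch of reading the OOB estimate from scikit-learn (the data here are hypothetical; `oob_score_` reports OOB accuracy for a classifier, so the OOB error is one minus it):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 8))                      # toy classification data
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

# Each observation is scored only by trees whose bootstrap sample excluded it
print("OOB error:", 1 - rf.oob_score_)
```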
Variable Importance
Random forests also use the OOB samples to construct a different variable-importance measure, apparently to measure the prediction strength of each variable. When the $b$th tree is grown, the OOB samples are passed down the tree and the prediction accuracy is recorded. Then the values of the $j$th variable are randomly permuted in the OOB samples and the accuracy is computed again. The decrease in accuracy caused by this permuting, averaged over all trees, is used as a measure of the importance of variable $j$ in the forest.
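A sketch of the permute-and-score idea using scikit-learn's `permutation_importance`. Note the substitution: this utility permutes features on a supplied held-out set rather than on each tree's own OOB samples, so it is a related measure, not the per-tree OOB version described above. The data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 6))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(size=500)    # only features 0 and 1 matter

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Permute each feature in turn on held-out data and record the drop in score
result = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean)   # features 0 and 1 should dominate
```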
Overfitting
It is certainly true that increasing $B$ does not cause the random forest sequence to overfit; like bagging, the random forest estimate approximates the expectation $\hat{f}_{\mathrm{rf}}(x) = \mathrm{E}_{\Theta}\, T(x;\Theta) = \lim_{B\to\infty} \hat{f}^{B}_{\mathrm{rf}}(x)$, with an average over $B$ realizations of $\Theta$. However, this limit can itself overfit the data: the average of fully grown trees can result in too rich a model and incur unnecessary variance.
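One way to see the first claim empirically, as a sketch with assumed toy data and scikit-learn: track the OOB error as $B$ grows and observe that it stabilizes rather than degrading.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 10))
y = X[:, 0] + X[:, 1] ** 2 + rng.normal(size=400)

# OOB R^2 as the number of trees B grows: it plateaus, it does not fall off
for B in (25, 50, 100, 200, 400):
    rf = RandomForestRegressor(n_estimators=B, oob_score=True, random_state=0)
    rf.fit(X, y)
    print(f"B={B:4d}  OOB R^2={rf.oob_score_:.3f}")
```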