107 Random Forests
Random forests are among the most reliable off the shelf predictors in supervised learning. They take a high variance base learner, the decision tree, and tame it through two complementary forms of randomization: bootstrap resampling of the training data and random restriction of the candidate features at each split. The result is an ensemble that is accurate, robust to noise, nearly free of tuning, and equipped with built in tools for error estimation and variable assessment. This chapter develops the method from the bias variance perspective, explains why decorrelation is the central idea, derives the out of bag error estimate, scrutinizes two notions of feature importance, and closes with practical tuning guidance.
107.1 1. From Bagging to Random Forests
107.1.1 1.1 The variance problem of trees
A single classification or regression tree partitions the feature space into axis aligned regions and fits a constant in each region. Grown deep, a tree has low bias: it can carve out arbitrarily fine structure. But it has high variance. A small perturbation of the training set can change an early split, which cascades into a completely different partition downstream. Formally, if \(\hat{f}(x)\) is the prediction of a tree at a point \(x\), the expected squared error decomposes as
\[ \mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\sigma^2}_{\text{noise}} + \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}(\hat{f}(x))}_{\text{variance}}. \]
For deep trees the bias term is small and the variance term dominates. The natural remedy is averaging, since averaging many noisy but roughly unbiased estimates reduces variance while preserving the low bias.
107.1.2 1.2 Bagging
Bagging, short for bootstrap aggregating, is the direct application of this idea. Given a training set of \(n\) examples, we draw \(B\) bootstrap samples, each formed by sampling \(n\) points with replacement. We fit a tree \(\hat{f}^{(b)}\) on each bootstrap sample and average their predictions:
\[ \hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{(b)}(x). \]
For classification we instead take a majority vote, or average the class probability estimates and then take the argmax. Because each tree is grown deep and left unpruned, the individual trees keep their low bias, and the ensemble buys a reduction in variance.
107.1.3 1.3 Why averaging alone is not enough
The catch is that the bootstrap samples overlap heavily, so the trees they produce are correlated. Consider \(B\) identically distributed predictions, each with variance \(\sigma^2\) and pairwise correlation \(\rho\). The variance of their average is
\[ \operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}^{(b)}(x)\right) = \rho\,\sigma^2 + \frac{1 - \rho}{B}\,\sigma^2. \]
As \(B \to \infty\) the second term vanishes, but the first term, \(\rho\sigma^2\), does not. The correlation between trees sets a floor on how much variance averaging can remove. If we want a better ensemble, we must drive down \(\rho\). This single equation is the conceptual engine of the random forest.
107.2 2. Decorrelating the Trees
107.2.1 2.1 Random feature subsets at each split
Random forests, introduced by Breiman in 2001, add a second source of randomness on top of bagging. When growing each tree, at every node the algorithm does not consider all \(p\) features as split candidates. Instead it draws a random subset of \(m \le p\) features and searches for the best split only among those. A fresh subset is drawn at every node.
The effect is to break the dominance of strong predictors. In plain bagging, if one or two features are highly informative, nearly every tree will split on them at the top, producing very similar trees and a high \(\rho\). By hiding those features at a fraction of the nodes, random feature selection forces trees to explore alternative structure, which lowers \(\rho\) at the cost of a small increase in the bias and variance of each individual tree. The net effect on the ensemble is usually favorable, because the bound \(\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\) is reduced.
107.2.2 2.2 The algorithm
The full procedure is compact.
for b = 1 to B:
draw a bootstrap sample D_b of size n from the data
grow a tree T_b on D_b:
at each node:
select m features at random from the p available
choose the best split among those m features
split the node
grow until a stopping rule (e.g. min node size) is met
do not prune
prediction:
regression: average the B tree outputs
classification: majority vote, or average class probabilities
The two knobs that distinguish a forest from a bag of trees are \(m\), the number of features sampled per split, and the absence of pruning. Everything else is standard tree machinery.
107.2.3 2.3 Choosing the split dimension \(m\)
Common defaults are \(m = \lfloor \sqrt{p} \rfloor\) for classification and \(m = \lfloor p/3 \rfloor\) for regression, with a floor of one. Small \(m\) yields stronger decorrelation but weaker individual trees; large \(m\) approaches plain bagging. Setting \(m = p\) recovers bagging exactly. The optimal value depends on how many features are relevant: when only a few features carry signal among many noise features, a very small \(m\) raises the chance that a given split never sees a useful feature, so a moderately larger \(m\) can help. In practice \(m\) is the one hyperparameter most worth tuning.
107.3 3. Out of Bag Error
107.3.1 3.1 The out of bag sample
Bootstrap sampling gives random forests a free validation mechanism. When we draw a bootstrap sample of size \(n\) with replacement, each particular observation has probability \((1 - 1/n)^n\) of being omitted. As \(n\) grows this tends to
\[ \lim_{n \to \infty}\left(1 - \frac{1}{n}\right)^n = e^{-1} \approx 0.368. \]
So on average about 37 percent of the observations are left out of any given bootstrap sample. These are the out of bag, or OOB, observations for that tree. Each observation is OOB for roughly a third of the trees.
107.3.2 3.2 Constructing the OOB estimate
To form the OOB prediction for observation \(i\), we aggregate only over those trees for which \(i\) was out of bag:
\[ \hat{f}_{\text{oob}}(x_i) = \operatorname{aggregate}\big\{\,\hat{f}^{(b)}(x_i) : i \notin D_b \,\big\}. \]
Because each such tree never saw \(x_i\) during training, the resulting prediction is honest in the same sense as a held out prediction. The OOB error is the average loss of these predictions over all \(i\). For classification it is the misclassification rate of the OOB votes; for regression it is the OOB mean squared error.
107.3.3 3.3 Why this matters in practice
The OOB error closely tracks the error one would obtain from a separate test set or from cross validation, and it comes at essentially no extra cost, since the trees are already built. This lets you monitor performance as a function of \(B\) and stop adding trees once the OOB error plateaus. One caveat: each observation’s OOB estimate is based on only about a third of the trees, so for small \(B\) the OOB error can be slightly pessimistic. With a few hundred trees this bias is negligible. The OOB estimate can also be used to tune \(m\) without a separate validation split, though for final model selection a proper cross validation remains the safer choice when data permit.
107.4 4. Feature Importance
Random forests offer two principled ways to rank the contribution of each feature. They answer different questions and can disagree, so understanding both is essential.
107.4.1 4.1 Mean decrease in impurity
The first measure accumulates, for each feature, the total reduction in node impurity that its splits achieve, averaged over all trees. When a node \(t\) is split on feature \(j\), the impurity decrease is
\[ \Delta i(t) = i(t) - \frac{n_{t_L}}{n_t}\,i(t_L) - \frac{n_{t_R}}{n_t}\,i(t_R), \]
where \(i(\cdot)\) is the node impurity (Gini index or entropy for classification, variance for regression), \(n_t\) is the number of samples reaching node \(t\), and \(t_L, t_R\) are the children. The importance of feature \(j\) is the sum of \(\Delta i(t)\) over all nodes that split on \(j\), weighted by the fraction of samples reaching each node, averaged across the forest. This is often called mean decrease in impurity, or MDI, or Gini importance.
MDI is computed for free during training, which makes it attractive. But it has a well known bias: it inflates the apparent importance of features with many possible split points, such as continuous variables or high cardinality categoricals, because such features have more opportunities to reduce impurity by chance. It is also computed on the training data, so it can reward overfitting. Treat MDI as a fast first look, not a final verdict.
107.4.2 4.2 Permutation importance
The second measure is model agnostic and directly operational. After the forest is trained, we measure a baseline error, ideally the OOB error. Then for a chosen feature \(j\) we randomly permute its values across the OOB observations, breaking any association between \(j\) and the target while leaving its marginal distribution intact, and recompute the error. The importance of \(j\) is the increase in error:
\[ \text{Imp}(j) = \text{err}_{\text{permuted}(j)} - \text{err}_{\text{baseline}}. \]
If permuting \(j\) barely changes the error, the model was not relying on \(j\). If the error jumps, \(j\) was important. Averaging over several permutations and over the trees gives a stable estimate. Because the permutation is applied to held out data, this measure does not reward training set overfitting in the way MDI can, and it is not systematically biased toward high cardinality features.
baseline = oob_error(forest, X, y)
for each feature j:
X_perm = copy(X); shuffle column j of X_perm over OOB rows
importance[j] = oob_error(forest, X_perm, y) - baseline
107.5 5. Practical Tuning
Random forests are forgiving, which is much of their appeal. Still, a handful of settings repay attention.
107.5.1 5.1 Number of trees
The number of trees \(B\) is not a parameter that overfits. Adding more trees only refines the Monte Carlo average and monotonically improves, or at worst stabilizes, the estimate; it never increases the expected generalization error in the way that, say, depth does. The practical guidance is to use as many trees as your compute budget allows, then verify via the OOB error curve that performance has plateaued. A few hundred to a couple thousand trees is typical. The only cost of more trees is time and memory.
107.5.2 5.2 The split dimension and tree depth
The split dimension \(m\) is the lever with the largest effect on the bias variance tradeoff of the ensemble, so it is the first thing to tune, sweeping values around the default \(\sqrt{p}\) or \(p/3\) and reading off the OOB error. Tree depth and the related stopping rules, minimum samples per leaf and minimum samples to split, are usually left permissive so that trees grow deep and keep low bias, with the ensemble averaging away the variance. On very noisy data or very large datasets, mildly restricting leaf size can reduce variance further and speed up training, so a light sweep of minimum leaf size is sometimes worthwhile.
107.5.3 5.3 Sampling and class imbalance
By default each tree is grown on a bootstrap sample of size \(n\) drawn with replacement. Reducing the sample size or sampling without replacement, often called subsampling, can further decorrelate trees and accelerate training on large data. For imbalanced classification, where one class dominates, the forest will tend to favor the majority class. Effective fixes include balanced bootstrap sampling, in which each draw is stratified to equalize class counts, class weighting in the split criterion, and adjusting the decision threshold on the averaged class probabilities rather than defaulting to a half. Evaluate such models with metrics suited to imbalance, such as the area under the precision recall curve, not raw accuracy.
107.5.4 5.4 A sensible default recipe
For a first pass on a new problem, a strong baseline is several hundred trees, \(m\) at the standard default, deep unpruned trees, and the OOB error used to confirm convergence and to compare a small grid of \(m\) values. This typically lands close to the best achievable accuracy with almost no manual effort, which is precisely why random forests remain a default choice for tabular data.
forest = RandomForest(
n_estimators = 500, # raise until OOB error plateaus
max_features = "sqrt", # tune around this value
min_samples_leaf = 1, # raise slightly only if noisy
oob_score = True, # free validation estimate
class_weight = "balanced" # only for imbalanced targets
)
107.5.5 5.5 Strengths and limits
Random forests handle mixed feature types, missing values through surrogate splits in some implementations, nonlinearities, and interactions with little preprocessing, and they need no feature scaling. They are highly parallel, since trees are independent. Their main limitations are model size and prediction latency for very large forests, weaker extrapolation than parametric models since predictions are bounded by the range of the training targets, and reduced interpretability relative to a single tree. When the goal is maximal predictive accuracy on structured data and the priority is reliability over interpretability, the random forest is hard to beat, and gradient boosted trees are the usual next step when squeezing out further accuracy is worth additional tuning.
107.6 References
- Breiman, L. Random Forests. Machine Learning, 45(1), 5 to 32, 2001. https://link.springer.com/article/10.1023/A:1010933404324
- Breiman, L. Bagging Predictors. Machine Learning, 24(2), 123 to 140, 1996. https://link.springer.com/article/10.1007/BF00058655
- Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, 2nd edition, Springer, 2009. https://hastie.su.domains/ElemStatLearn/
- Louppe, G. Understanding Random Forests: From Theory to Practice. PhD thesis, University of Liege, 2014. https://arxiv.org/abs/1407.7502
- Strobl, C., Boulesteix, A., Zeileis, A., and Hothorn, T. Bias in Random Forest Variable Importance Measures. BMC Bioinformatics, 8:25, 2007. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-25
- Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825 to 2830, 2011. https://jmlr.org/papers/v12/pedregosa11a.html
- Probst, P., Wright, M., and Boulesteix, A. Hyperparameters and Tuning Strategies for Random Forest. WIREs Data Mining and Knowledge Discovery, 9(3), 2019. https://wires.onlinelibrary.wiley.com/doi/10.1002/widm.1301