107 Random Forests

Random forests are among the most reliable off the shelf predictors in supervised learning. They take a high variance base learner, the decision tree, and tame it through two complementary forms of randomization: bootstrap resampling of the training data and random restriction of the candidate features at each split. The result is an ensemble that is accurate, robust to noise, nearly free of tuning, and equipped with built in tools for error estimation and variable assessment. This chapter develops the method from the bias variance perspective, explains why decorrelation is the central idea, derives the out of bag error estimate, scrutinizes two notions of feature importance, lays out the cost model, and closes with a runnable demonstration in three languages using mature open-source libraries rather than a hand-rolled implementation.

107.1 1. From Bagging to Random Forests

107.1.1 1.1 The variance problem of trees

A single classification or regression tree partitions the feature space into axis aligned regions and fits a constant in each region. Grown deep, a tree has low bias: it can carve out arbitrarily fine structure. But it has high variance. A small perturbation of the training set can change an early split, which cascades into a completely different partition downstream. Formally, if $\hat{f}(x)$ is the prediction of a tree at a point $x$, the expected squared error decomposes as

\[ \mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\sigma^2}_{\text{noise}} + \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}(\hat{f}(x))}_{\text{variance}}. \]

For deep trees the bias term is small and the variance term dominates. The natural remedy is averaging, since averaging many noisy but roughly unbiased estimates reduces variance while preserving the low bias.

107.1.2 1.2 Bagging

Bagging, short for bootstrap aggregating, is the direct application of this idea. Given a training set of $n$ examples, we draw $B$ bootstrap samples, each formed by sampling $n$ points with replacement. We fit a tree $\hat{f}^{(b)}$ on each bootstrap sample and average their predictions:

\[ \hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{(b)}(x). \]

For classification we instead take a majority vote, or average the class probability estimates and then take the argmax. Because each tree is grown deep and left unpruned, the individual trees keep their low bias, and the ensemble buys a reduction in variance.
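The bagging recipe can be sketched by hand on top of scikit-learn's `DecisionTreeRegressor`. This is an illustration only: the helper names `fit_bagged_trees` and `predict_bagged`, the synthetic dataset, and the choice of 50 trees are assumptions made for the example, not part of the method's definition.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_trees(X, y, n_trees=50, seed=0):
    """Fit n_trees unpruned trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # n row indices drawn with replacement
        tree = DecisionTreeRegressor(random_state=int(rng.integers(2**31 - 1)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X):
    """Average the per-tree predictions, as in the bagging formula."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)

# Synthetic data (an assumption for illustration only).
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
X_tr, y_tr, X_te, y_te = X[:300], y[:300], X[300:], y[300:]

single = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
trees = fit_bagged_trees(X_tr, y_tr)

mse_single = np.mean((single.predict(X_te) - y_te) ** 2)
mse_bagged = np.mean((predict_bagged(trees, X_te) - y_te) ** 2)
print(f"single tree MSE: {mse_single:.1f}")
print(f"bagged MSE:      {mse_bagged:.1f}")
```

On data like this the averaged ensemble should show a clearly lower test error than the single deep tree, which is the variance reduction at work.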

107.1.3 1.3 Why averaging alone is not enough

The catch is that the bootstrap samples overlap heavily, so the trees they produce are correlated. Consider $B$ identically distributed predictions, each with variance $\sigma^2$ and pairwise correlation $\rho$. The variance of their average is

\[ \operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}^{(b)}(x)\right) = \rho\,\sigma^2 + \frac{1 - \rho}{B}\,\sigma^2. \]

To see where this comes from, write the variance of the sum as the sum of all $B^2$ entries of the covariance matrix. There are $B$ diagonal entries equal to $\sigma^2$ and $B(B-1)$ off diagonal entries equal to $\rho\sigma^2$, so the sum is $B\sigma^2 + B(B-1)\rho\sigma^2$. Dividing by $B^2$ for the average and rearranging yields the expression above. As $B \to \infty$ the second term vanishes, but the first term, $\rho\sigma^2$, does not. The correlation between trees sets a floor on how much variance averaging can remove. If we want a better ensemble, we must drive down $\rho$. This single equation is the conceptual engine of the random forest.
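The identity is easy to check numerically. The sketch below builds equicorrelated Gaussian "predictions" by mixing one shared component with independent ones; that construction, and the function names, are assumptions chosen for the check rather than a model of real trees.

```python
import numpy as np

def theoretical_variance(rho, sigma2, B):
    """Variance of the mean of B equicorrelated predictions."""
    return rho * sigma2 + (1.0 - rho) / B * sigma2

def simulated_variance(rho, sigma2, B, n_draws=50_000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n_draws, 1))  # component common to all B
    own = rng.standard_normal((n_draws, B))     # component private to each
    # Each column has variance sigma2; any two columns have correlation rho.
    preds = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own)
    return float(preds.mean(axis=1).var())

for B in (1, 10, 100):
    theory = theoretical_variance(0.3, 1.0, B)
    simulated = simulated_variance(0.3, 1.0, B)
    print(f"B = {B:3d}: theory = {theory:.4f}, simulated = {simulated:.4f}")
```

With $\rho = 0.3$ and $\sigma^2 = 1$ the theoretical values are 1.0, 0.37, and 0.307: going from 10 to 100 trees buys very little, because the floor $\rho\sigma^2 = 0.3$ already dominates.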

107.2 2. Decorrelating the Trees

107.2.1 2.1 Random feature subsets at each split

Random forests, introduced by Breiman in 2001, add a second source of randomness on top of bagging. When growing each tree, at every node the algorithm does not consider all $p$ features as split candidates. Instead it draws a random subset of $m \le p$ features and searches for the best split only among those. A fresh subset is drawn at every node.

The effect is to break the dominance of strong predictors. In plain bagging, if one or two features are highly informative, nearly every tree will split on them at the top, producing very similar trees and a high $\rho$. By hiding those features at a fraction of the nodes, random feature selection forces trees to explore alternative structure, which lowers $\rho$ at the cost of a small increase in the bias and variance of each individual tree. The net effect on the ensemble is usually favorable, because the floor $\rho\sigma^2$ in the variance bound is reduced faster than the per-tree variance $\sigma^2$ grows.
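One rough way to see the decorrelation is to compare how similar the individual trees' predictions are under bagging ($m = p$) and under a smaller $m$. The sketch below uses the correlation between per-tree prediction vectors over a held-out set as a proxy for $\rho$; this is an assumption for illustration, since $\rho$ in the formula is defined at a fixed point over random training sets. The dataset and the helper `mean_tree_correlation` are likewise hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

def mean_tree_correlation(forest, X):
    """Average off-diagonal correlation between per-tree prediction vectors."""
    preds = np.array([tree.predict(X) for tree in forest.estimators_])  # (B, n)
    corr = np.corrcoef(preds)                                           # (B, B)
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    return float(corr[off_diagonal].mean())

# Synthetic data with a few strong predictors among many noise columns.
X, y = make_regression(n_samples=600, n_features=16, n_informative=4,
                       noise=5.0, random_state=0)
X_tr, y_tr, X_te = X[:400], y[:400], X[400:]
p = X.shape[1]

rho = {}
for m in (p, 4):  # m = p is plain bagging; m = 4 restricts each split
    forest = RandomForestRegressor(n_estimators=100, max_features=m,
                                   random_state=0)
    forest.fit(X_tr, y_tr)
    rho[m] = mean_tree_correlation(forest, X_te)
    print(f"m = {m:2d}: mean pairwise correlation = {rho[m]:.3f}")
```

The restricted forest should report the lower correlation; how much lower depends on how concentrated the signal is.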

107.2.2 2.2 The algorithm

The full procedure is compact.

for b = 1 to B:
    draw a bootstrap sample D_b of size n from the data
    grow a tree T_b on D_b:
        at each node:
            select m features at random from the p available
            choose the best split among those m features
            split the node
        grow until a stopping rule (e.g. min node size) is met
        do not prune
prediction:
    regression:      average the B tree outputs
    classification:  majority vote, or average class probabilities

The one knob that distinguishes a forest from a bag of trees is $m$, the number of features sampled per split. As in bagging, the trees are grown deep and left unpruned. Everything else is standard tree machinery.

107.2.3 2.3 Choosing the split dimension $m$

Common defaults are $m = \lfloor \sqrt{p} \rfloor$ for classification and $m = \lfloor p/3 \rfloor$ for regression, with a floor of one. Small $m$ yields stronger decorrelation but weaker individual trees; large $m$ approaches plain bagging. Setting $m = p$ recovers bagging exactly. The optimal value depends on how many features are relevant: when only a few features carry signal among many noise features, a very small $m$ raises the chance that a given split never sees a useful feature, so a moderately larger $m$ can help. In practice $m$ is the one hyperparameter most worth tuning.
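The defaults and a small sweep can be sketched as follows. The helper `default_m`, the synthetic dataset, and the grid of $m$ values are assumptions for illustration; the sweep reads the error off scikit-learn's `oob_score_` attribute.

```python
import math
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def default_m(p, task):
    """Conventional defaults: floor(sqrt(p)) or floor(p/3), never below one."""
    if task == "classification":
        return max(1, math.isqrt(p))
    if task == "regression":
        return max(1, p // 3)
    raise ValueError(f"unknown task: {task!r}")

# Few informative features among many noise features.
X, y = make_classification(n_samples=600, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)
p = X.shape[1]

oob_by_m = {}
for m in (1, default_m(p, "classification"), 8, p):  # m = p is plain bagging
    forest = RandomForestClassifier(n_estimators=200, max_features=m,
                                    oob_score=True, random_state=0)
    forest.fit(X, y)
    oob_by_m[m] = 1.0 - forest.oob_score_
    print(f"m = {m:2d}: OOB error = {oob_by_m[m]:.3f}")
```

Which value wins depends on the data, which is the point of sweeping rather than trusting the default blindly.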

107.3 3. Out of Bag Error

107.3.1 3.1 The out of bag sample

Bootstrap sampling gives random forests a free validation mechanism. When we draw a bootstrap sample of size $n$ with replacement, each particular observation has probability $(1 - 1/n)^n$ of being omitted. As $n$ grows this tends to

\[ \lim_{n \to \infty}\left(1 - \frac{1}{n}\right)^n = e^{-1} \approx 0.368. \]

So on average about 37 percent of the observations are left out of any given bootstrap sample. These are the out of bag, or OOB, observations for that tree. Each observation is OOB for roughly a third of the trees.
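A short numerical check of the omission probability, comparing the exact expression with simulated bootstrap samples. The function names and the number of simulated samples are choices made for this sketch.

```python
import numpy as np

def omit_probability(n):
    """Exact probability that a given row is absent from a bootstrap sample."""
    return (1.0 - 1.0 / n) ** n

def simulated_oob_fraction(n, n_samples=200, seed=0):
    """Average share of rows left out across simulated bootstrap samples."""
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(n_samples):
        idx = rng.integers(0, n, size=n)  # one bootstrap sample of row indices
        fractions.append(1.0 - np.unique(idx).size / n)
    return float(np.mean(fractions))

for n in (10, 100, 1000):
    print(f"n = {n:4d}: exact = {omit_probability(n):.4f}, "
          f"simulated = {simulated_oob_fraction(n):.4f}")
print(f"limit e^-1 = {np.exp(-1):.4f}")
```

The exact value is already within about 0.02 of $e^{-1}$ at $n = 10$, so the "about 37 percent" rule holds even for small datasets.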

107.3.2 3.2 Constructing the OOB estimate

To form the OOB prediction for observation $i$, we aggregate only over those trees for which $i$ was out of bag:

\[ \hat{f}_{\text{oob}}(x_i) = \operatorname{aggregate}\big\{\,\hat{f}^{(b)}(x_i) : i \notin D_b \,\big\}. \]

Because each such tree never saw $x_i$ during training, the resulting prediction is honest in the same sense as a held out prediction. The OOB error is the average loss of these predictions over all $i$. For classification it is the misclassification rate of the OOB votes; for regression it is the OOB mean squared error.
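The bookkeeping can be made concrete with a hand-rolled version for classification. This is a sketch under assumptions (the function name `oob_error`, majority voting as the aggregate, a synthetic dataset); library implementations do the same accounting more efficiently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, n_trees=200, seed=0):
    """Return (OOB misclassification rate, share of rows with an OOB vote)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_classes = int(y.max()) + 1
    votes = np.zeros((n, n_classes))
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)       # bootstrap sample D_b
        oob = np.setdiff1d(np.arange(n), idx)  # rows not in D_b
        tree = DecisionTreeClassifier(
            max_features="sqrt", random_state=int(rng.integers(2**31 - 1))
        ).fit(X[idx], y[idx])
        # Each tree votes only on the rows it never saw during training.
        votes[oob, tree.predict(X[oob])] += 1
    covered = votes.sum(axis=1) > 0            # rows with at least one OOB vote
    oob_pred = votes[covered].argmax(axis=1)
    return float(np.mean(oob_pred != y[covered])), float(covered.mean())

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)
err, coverage = oob_error(X, y)
print(f"OOB error: {err:.3f} (rows with an OOB vote: {coverage:.0%})")
```

With a couple of hundred trees every row is out of bag for some tree, so the coverage is effectively complete; with very few trees some rows may have no OOB vote at all and must be skipped.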

107.3.3 3.3 Why this matters in practice

The OOB error closely tracks the error one would obtain from a separate test set or from cross validation, and it comes at essentially no extra cost, since the trees are already built. This lets you monitor performance as a function of $B$ and stop adding trees once the OOB error plateaus. One caveat: each observation’s OOB estimate is based on only about a third of the trees, so for small $B$ the OOB error can be slightly pessimistic. With a few hundred trees this bias is negligible. The OOB estimate can also be used to tune $m$ without a separate validation split, though for final model selection a proper cross validation remains the safer choice when data permit.
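Monitoring the OOB error as trees are added can be sketched with scikit-learn's `warm_start` option, which keeps the existing trees and fits only the additional ones on each call to `fit`. The dataset and the grid of forest sizes are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=800, n_features=12, n_informative=5,
                           random_state=0)

forest = RandomForestClassifier(
    n_estimators=25,
    warm_start=True,   # reuse trees already grown when n_estimators increases
    oob_score=True,    # recompute the OOB estimate after each growth step
    max_features="sqrt",
    random_state=0,
)

oob_curve = {}
for B in (25, 50, 100, 200, 400):
    forest.set_params(n_estimators=B)
    forest.fit(X, y)   # grows the forest up to B trees
    oob_curve[B] = 1.0 - forest.oob_score_
    print(f"B = {B:3d}: OOB error = {oob_curve[B]:.3f}")
```

In a real workflow one would stop growing once successive values differ by less than the noise one is willing to tolerate.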

107.4 4. Feature Importance

Random forests offer two principled ways to rank the contribution of each feature. They answer different questions and can disagree, so understanding both is essential.

107.4.1 4.1 Mean decrease in impurity

The first measure accumulates, for each feature, the total reduction in node impurity that its splits achieve, averaged over all trees. When a node $t$ is split on feature $j$, the impurity decrease is

\[ \Delta i(t) = i(t) - \frac{n_{t_L}}{n_t}\,i(t_L) - \frac{n_{t_R}}{n_t}\,i(t_R), \]

where $i(\cdot)$ is the node impurity (Gini index or entropy for classification, variance for regression), $n_t$ is the number of samples reaching node $t$, and $t_L, t_R$ are the children. The importance of feature $j$ is the sum of $\Delta i(t)$ over all nodes that split on $j$, weighted by the fraction of samples reaching each node, averaged across the forest. This is often called mean decrease in impurity, or MDI, or Gini importance.

MDI is computed for free during training, which makes it attractive. But it has a well known bias: it inflates the apparent importance of features with many possible split points, such as continuous variables or high cardinality categoricals, because such features have more opportunities to reduce impurity by chance. It is also computed on the training data, so it can reward overfitting. Treat MDI as a fast first look, not a final verdict.
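To make the impurity decrease formula concrete, here is a minimal computation of $\Delta i(t)$ with the Gini index on two hand-made splits. The helper names are illustrative; a full MDI would additionally weight each decrease by the share of samples reaching the node and average over trees.

```python
import numpy as np

def gini(labels):
    """Gini index 1 - sum_k p_k^2 of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def impurity_decrease(parent, left, right):
    """Delta i(t) = i(t) - (n_L / n_t) i(t_L) - (n_R / n_t) i(t_R)."""
    n_t = len(parent)
    return (gini(parent)
            - len(left) / n_t * gini(left)
            - len(right) / n_t * gini(right))

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A split that separates the classes perfectly removes all impurity.
perfect = impurity_decrease(parent, parent[:4], parent[4:])

# A split whose children mirror the parent's class mix removes none.
useless = impurity_decrease(parent,
                            np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1]))

print(f"perfect split: {perfect:.2f}")  # 0.50
print(f"useless split: {useless:.2f}")  # 0.00
```

The balanced parent has Gini index 0.5, so 0.5 is the largest decrease any split of this node can achieve.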

107.4.2 4.2 Permutation importance

The second measure is model agnostic and directly operational. After the forest is trained, we measure a baseline error, ideally the OOB error. Then for a chosen feature $j$ we randomly permute its values across the OOB observations, breaking any association between $j$ and the target while leaving its marginal distribution intact, and recompute the error. The importance of $j$ is the increase in error:

\[ \text{Imp}(j) = \text{err}_{\text{permuted}(j)} - \text{err}_{\text{baseline}}. \]

If permuting $j$ barely changes the error, the model was not relying on $j$. If the error jumps, $j$ was important. Averaging over several permutations and over the trees gives a stable estimate. Because the permutation is applied to held out data, this measure does not reward training set overfitting in the way MDI can, and it is not systematically biased toward high cardinality features.

baseline = oob_error(forest, X, y)
for each feature j:
    X_perm = copy(X); shuffle column j of X_perm over OOB rows
    importance[j] = oob_error(forest, X_perm, y) - baseline

107.4.3 4.3 The correlated feature trap

Both measures are distorted by correlated predictors, in different ways that are worth internalizing. With permutation importance, if two features are strongly correlated, permuting one of them leaves the model able to recover most of the signal from the other, so each appears less important than it truly is; the importance is split or hidden between them. With MDI, correlated features tend to share credit somewhat arbitrarily depending on which one happened to be chosen at each split. Neither measure is causal. A feature with high importance is predictive in the context of this model, not necessarily a cause of the outcome, and a feature with low importance may simply be redundant given the others. When interpretation matters, consider grouped permutation, conditional permutation schemes, or downstream tools such as SHAP values, and always validate conclusions against domain knowledge.
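The permutation side of the trap can be reproduced in a few lines. The sketch below fits one forest on four independent features and another on the same data plus a noisy copy of feature 0, then compares the permutation importance of feature 0 in each. The data-generating process, the noise levels, and the helper name are assumptions made purely for the demonstration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1500
X = rng.standard_normal((n, 4))
# Feature 0 drives the label, feature 1 helps a little, the rest are noise.
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.standard_normal(n) > 0).astype(int)
# Same data plus a near-duplicate of feature 0 appended as a fifth column.
X_dup = np.column_stack([X, X[:, 0] + 0.05 * rng.standard_normal(n)])

def importance_of_feature0(features, labels):
    """Held-out permutation importance (drop in accuracy) of column 0."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, random_state=0)
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X_tr, y_tr)
    result = permutation_importance(forest, X_te, y_te,
                                    n_repeats=10, random_state=0)
    return float(result.importances_mean[0])

imp_alone = importance_of_feature0(X, y)
imp_with_copy = importance_of_feature0(X_dup, y)
print(f"feature 0 alone:          {imp_alone:.3f}")
print(f"feature 0 with a twin:    {imp_with_copy:.3f}")
```

Feature 0 is exactly as informative in both datasets, yet its measured importance should fall once the twin is present, because trees that split on the twin are unaffected when feature 0 is shuffled.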

107.5 5. Computational Cost

The training cost is dominated by the per-node split search. Building one tree on $n$ samples, evaluating $m$ candidate features per node, and sorting the relevant values, costs on the order of $O(m\,n\log n)$ for a balanced tree. Across $B$ trees the total training time is roughly

\[ O\big(B\,m\,n\log n\big), \]

which is linear in the number of trees and, crucially, linear in $m$ rather than $p$. Because $m$ is typically $\sqrt{p}$ or $p/3$, random forests are markedly cheaper than full bagging on wide data. The trees are entirely independent, so training parallelizes almost perfectly across cores and machines; wall clock time scales close to $1/(\text{number of cores})$.

Prediction cost is $O(B\,d)$ where $d$ is the typical tree depth, since each test point traverses each tree from root to leaf. Memory scales with the total number of nodes, on the order of $O(B\,n)$ for fully grown trees, which is the main practical limit for very large forests. Restricting leaf size or tree depth trades a little accuracy for substantial reductions in both memory and latency.
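A back-of-the-envelope version of the cost model, alongside measured node counts and depths from a fitted forest. The `training_cost` helper ignores constants and is only meant for ratios; the dataset sizes are assumptions for illustration.

```python
import math
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def training_cost(B, m, n):
    """Order-of-magnitude operation count B * m * n * log2(n)."""
    return B * m * n * math.log2(n)

n, p, B = 1000, 100, 50
m = math.isqrt(p)  # floor(sqrt(p)) = 10

# Bagging searches all p features per node; the forest searches only m.
ratio = training_cost(B, p, n) / training_cost(B, m, n)
print(f"modelled cost of bagging relative to the forest: {ratio:.1f}x")

X, y = make_classification(n_samples=n, n_features=p, n_informative=10,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=B, max_features=m,
                                random_state=0).fit(X, y)

# Memory follows the total node count; prediction cost follows depth.
total_nodes = sum(tree.tree_.node_count for tree in forest.estimators_)
mean_depth = sum(tree.get_depth() for tree in forest.estimators_) / B
print(f"total nodes: {total_nodes} (upper bound B * (2n - 1) = {B * (2 * n - 1)})")
print(f"mean tree depth: {mean_depth:.1f}")
```

A binary tree with at most $n$ leaves has at most $2n - 1$ nodes, which is where the $O(B\,n)$ memory figure comes from.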

107.6 6. Mature Open-Source Tooling

You should almost never implement a random forest from scratch for production. The hard parts, efficient split finding, presorting, parallel tree construction, OOB bookkeeping, and missing value handling, are exactly where mature libraries have invested years of engineering, and a naive reimplementation will be slower and subtly wrong. Reach for the established free and open-source tools.

In Python, scikit-learn’s RandomForestClassifier and RandomForestRegressor are the reference implementations: well tested, parallelized through joblib, and integrated with the wider ecosystem of pipelines, cross validation, and permutation_importance. They are released under the permissive BSD license.
In Julia, DecisionTree.jl provides a fast native random forest and integrates with the MLJ model-selection framework, so the same forest plugs into MLJ’s tuning and evaluation machinery.
In Rust, smartcore offers a random forest with a scikit-learn-flavored API, suitable for embedding a trained model in a compiled service, while the linfa-trees crate (part of the linfa toolkit) supplies the decision tree primitive on which a forest can be assembled.

When you need to push tabular accuracy further, gradient boosted tree libraries such as XGBoost, LightGBM, and CatBoost, all open source, are the usual next step; they trade the forest’s near-zero tuning for higher ceilings at the cost of more hyperparameters.

The demonstration below uses scikit-learn as the executed reference, shows a comparable forest in Julia, and sketches the tree primitive in Rust. The dataset is generated inline so the example is fully self-contained and reproducible.


import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score, roc_auc_score

rng = 0

# A realistic tabular problem: 1000 rows, 12 features, of which only
# 5 carry signal, 2 are redundant (linear combos of the informative
# ones), and the rest are noise. This is exactly the regime where the
# random feature subset and OOB error earn their keep.
X, y = make_classification(
    n_samples=1000, n_features=12, n_informative=5, n_redundant=2,
    n_repeated=0, n_classes=2, flip_y=0.03, class_sep=1.1,
    random_state=rng,
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=rng
)

forest = RandomForestClassifier(
    n_estimators=400,        # raise until OOB error plateaus
    max_features="sqrt",     # m = floor(sqrt(p)); the key decorrelation knob
    min_samples_leaf=1,      # deep, unpruned trees
    oob_score=True,          # free held-out estimate, no extra data
    bootstrap=True,
    n_jobs=-1,
    random_state=rng,
)
forest.fit(X_tr, y_tr)

# Held-out performance plus the free OOB estimate.
proba = forest.predict_proba(X_te)[:, 1]
pred = forest.predict(X_te)
print(f"shapes: X_tr={X_tr.shape}, X_te={X_te.shape}")
print(f"sqrt(p) features per split: {int(np.sqrt(X.shape[1]))} of {X.shape[1]}")
print(f"OOB accuracy (training):    {forest.oob_score_:.3f}")
print(f"test accuracy:              {accuracy_score(y_te, pred):.3f}")
print(f"test ROC AUC:               {roc_auc_score(y_te, proba):.3f}")

# Permutation importance on the test split is the honest ranking:
# it is computed on held-out data and is not biased toward
# high-cardinality features the way impurity importance can be.
perm = permutation_importance(
    forest, X_te, y_te, n_repeats=20, random_state=rng, n_jobs=-1
)
order = np.argsort(perm.importances_mean)[::-1]
print("\ntop 5 features by permutation importance (drop in accuracy):")
for j in order[:5]:
    print(f"  feature {j:2d}: {perm.importances_mean[j]:.3f}"
          f" +/- {perm.importances_std[j]:.3f}")

shapes: X_tr=(750, 12), X_te=(250, 12)
sqrt(p) features per split: 3 of 12
OOB accuracy (training):    0.952
test accuracy:              0.912
test ROC AUC:               0.965

top 5 features by permutation importance (drop in accuracy):
  feature  0: 0.255 +/- 0.027
  feature  4: 0.017 +/- 0.008
  feature  1: 0.010 +/- 0.007
  feature  8: 0.005 +/- 0.006
  feature  5: 0.001 +/- 0.006

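The forest reaches high accuracy with no tuning beyond sensible defaults. The OOB accuracy (0.952) lands within about four points of the test accuracy (0.912), a gap of roughly two standard errors for a 250-row test split, so the bootstrap leftovers give a usable, if not exact, performance estimate for free. The permutation ranking is dominated by feature 0, with a few other features showing small positive drops and the rest sitting near zero. That is fewer clearly important features than the five informative columns we planted, which is plausibly the correlated feature trap at work: the redundant columns let the model recover part of a permuted feature's signal. Note also that make_classification shuffles the column order by default, so the informative columns are not necessarily the first five.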

# Pkg.add("DecisionTree")   # Random ships with Julia's standard library
using DecisionTree
using Random

rng = MersenneTwister(0)

# Stand-in for make_classification: a small synthetic tabular set.
n, p = 1000, 12
X = randn(rng, n, p)
# Signal lives in the first five columns; the rest are noise.
w = [1.4, -1.1, 0.9, 0.8, -0.7]
logits = X[:, 1:5] * w
y = Int.(logits .+ 0.3 .* randn(rng, n) .> 0)

idx = shuffle(rng, 1:n)
cut = round(Int, 0.75n)
tr, te = idx[1:cut], idx[cut+1:end]

# n_trees=400, m = floor(sqrt(p)) features per split, full-depth trees.
n_subfeatures = floor(Int, sqrt(p))
forest = build_forest(y[tr], X[tr, :], n_subfeatures, 400)

pred = apply_forest(forest, X[te, :])
acc  = sum(pred .== y[te]) / length(te)
println("features per split: ", n_subfeatures, " of ", p)
println("test accuracy:      ", round(acc, digits=3))

# For an error estimate without a manual split, DecisionTree.jl provides
# n-fold cross validation via nfoldCV_forest, and it integrates with MLJ
# via the RandomForestClassifier model for tuning and evaluation.

// Cargo.toml:
//   linfa = "0.7"
//   linfa-trees = "0.7"
//   ndarray = "0.15"
//   ndarray-rand = "0.14"
//
// linfa-trees ships a decision tree; a random forest is a thin bagging
// wrapper over it (bootstrap rows + sqrt(p) feature subsets). The
// smartcore crate offers a ready-made RandomForestClassifier if you
// prefer a forest out of the box. Sketch using linfa's tree:

use linfa::prelude::*;
use linfa_trees::DecisionTree;
use ndarray::{Array, Axis};
use ndarray_rand::{RandomExt, rand_distr::Uniform};

fn main() {
    // Synthetic tabular data: 1000 x 12, signal in the first columns.
    let n = 1000;
    let x = Array::random((n, 12), Uniform::new(-2.0, 2.0));
    let y = x
        .slice(ndarray::s![.., 0..5])
        .map_axis(Axis(1), |row| {
            let s = 1.4 * row[0] - 1.1 * row[1] + 0.9 * row[2]
                  + 0.8 * row[3] - 0.7 * row[4];
            // Labels are typed as usize so they satisfy linfa's Label trait.
            if s > 0.0 { 1usize } else { 0usize }
        });

    let ds = Dataset::new(x, y);
    let (train, test) = ds.split_with_ratio(0.75);

    // One tree shown for brevity; bag B of these over bootstrap rows
    // and sqrt(p) feature subsets to assemble the forest, or use
    // smartcore::ensemble::random_forest_classifier directly.
    let model = DecisionTree::params().fit(&train).unwrap();
    let pred = model.predict(&test);
    let acc = pred.confusion_matrix(&test).unwrap().accuracy();
    println!("test accuracy: {:.3}", acc);
}

Note: in the Rust ecosystem, smartcore provides a complete RandomForestClassifier out of the box, while linfa-trees currently exposes the decision tree primitive over which a forest is a short bagging wrapper. For production Rust prefer smartcore’s forest or, for maximum accuracy, the xgboost/lightgbm bindings.

107.7 7. Practical Tuning

Random forests are forgiving, which is much of their appeal. Still, a handful of settings repay attention.

107.7.1 7.1 Number of trees

The number of trees $B$ is not a parameter that overfits. Adding more trees only refines the Monte Carlo average: in expectation the ensemble's error falls toward a limit and then stabilizes, although the curve measured on any finite sample will fluctuate. More trees do not systematically raise the generalization error in the way that, say, depth can. The practical guidance is to use as many trees as your compute budget allows, then verify via the OOB error curve that performance has plateaued. A few hundred to a couple thousand trees is typical. The only cost of more trees is time and memory.

107.7.2 7.2 The split dimension and tree depth

The split dimension $m$ is the lever with the largest effect on the bias variance tradeoff of the ensemble, so it is the first thing to tune, sweeping values around the default $\sqrt{p}$ or $p/3$ and reading off the OOB error. Tree depth and the related stopping rules, minimum samples per leaf and minimum samples to split, are usually left permissive so that trees grow deep and keep low bias, with the ensemble averaging away the variance. On very noisy data or very large datasets, mildly restricting leaf size can reduce variance further and speed up training, so a light sweep of minimum leaf size is sometimes worthwhile.

107.7.3 7.3 Sampling and class imbalance

By default each tree is grown on a bootstrap sample of size $n$ drawn with replacement. Reducing the sample size or sampling without replacement, often called subsampling, can further decorrelate trees and accelerate training on large data. For imbalanced classification, where one class dominates, the forest will tend to favor the majority class. Effective fixes include balanced bootstrap sampling, in which each draw is stratified to equalize class counts, class weighting in the split criterion, and adjusting the decision threshold on the averaged class probabilities rather than defaulting to a half. Evaluate such models with metrics suited to imbalance, such as the area under the precision recall curve, not raw accuracy.
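Two of these fixes, class weighting and threshold adjustment, can be sketched with scikit-learn. The 95/5 class split, the thresholds tried, and the use of `class_weight="balanced_subsample"` (which reweights classes within each tree's bootstrap sample) are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Roughly 5 percent positives.
X, y = make_classification(n_samples=3000, n_features=12, n_informative=5,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=300,
    class_weight="balanced_subsample",  # reweight classes per bootstrap sample
    random_state=0,
).fit(X_tr, y_tr)

proba = forest.predict_proba(X_te)[:, 1]  # averaged minority-class probability
pr_auc = average_precision_score(y_te, proba)
print(f"area under the precision-recall curve: {pr_auc:.3f}")

# Lowering the threshold below one half trades precision for recall.
recall_by_threshold = {}
for threshold in (0.5, 0.3, 0.2):
    pred = (proba >= threshold).astype(int)
    recall_by_threshold[threshold] = recall_score(y_te, pred)
    print(f"threshold {threshold:.1f}: "
          f"minority recall = {recall_by_threshold[threshold]:.3f}")
```

The threshold is best chosen on validation or OOB probabilities against the actual cost of each error type, not on the test split as done here for brevity.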

107.7.4 7.4 A sensible default recipe

For a first pass on a new problem, a strong baseline is several hundred trees, $m$ at the standard default, deep unpruned trees, and the OOB error used to confirm convergence and to compare a small grid of $m$ values. This typically lands close to the best achievable accuracy with almost no manual effort, which is precisely why random forests remain a default choice for tabular data.

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=500,          # raise until OOB error plateaus
    max_features="sqrt",       # tune around this value
    min_samples_leaf=1,        # raise slightly only if noisy
    oob_score=True,            # free validation estimate
    class_weight="balanced",   # only for imbalanced targets
)

107.7.5 7.5 Strengths and limits

Random forests handle mixed feature types, missing values through surrogate splits in some implementations, nonlinearities, and interactions with little preprocessing, and they need no feature scaling. They are highly parallel, since trees are independent. Their main limitations are model size and prediction latency for very large forests, weaker extrapolation than parametric models since predictions are bounded by the range of the training targets, and reduced interpretability relative to a single tree. When the goal is maximal predictive accuracy on structured data and the priority is reliability over interpretability, the random forest is hard to beat, and gradient boosted trees are the usual next step when squeezing out further accuracy is worth additional tuning.

107.8 References

Breiman, L. Random Forests. Machine Learning, 45(1), 5 to 32, 2001. https://doi.org/10.1023/A:1010933404324
Breiman, L. Bagging Predictors. Machine Learning, 24(2), 123 to 140, 1996. https://doi.org/10.1007/BF00058655
Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, 2nd edition, Springer, 2009. https://doi.org/10.1007/978-0-387-84858-7
Louppe, G. Understanding Random Forests: From Theory to Practice. PhD thesis, University of Liege, 2014. https://arxiv.org/abs/1407.7502
Strobl, C., Boulesteix, A., Zeileis, A., and Hothorn, T. Bias in Random Forest Variable Importance Measures. BMC Bioinformatics, 8:25, 2007. https://doi.org/10.1186/1471-2105-8-25
Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825 to 2830, 2011. https://jmlr.org/papers/v12/pedregosa11a.html
Probst, P., Wright, M., and Boulesteix, A. Hyperparameters and Tuning Strategies for Random Forest. WIREs Data Mining and Knowledge Discovery, 9(3), 2019. https://doi.org/10.1002/widm.1301

# Random Forests Random forests are among the most reliable off the shelf predictors in supervised learning. They take a high variance base learner, the decision tree, and tame it through two complementary forms of randomization: bootstrap resampling of the training data and random restriction of the candidate features at each split. The result is an ensemble that is accurate, robust to noise, nearly free of tuning, and equipped with built in tools for error estimation and variable assessment. This chapter develops the method from the bias variance perspective, explains why decorrelation is the central idea, derives the out of bag error estimate, scrutinizes two notions of feature importance, lays out the cost model, and closes with a runnable demonstration in three languages using mature open-source libraries rather than a hand-rolled implementation. ## 1. From Bagging to Random Forests ### 1.1 The variance problem of trees A single classification or regression tree partitions the feature space into axis aligned regions and fits a constant in each region. Grown deep, a tree has low bias: it can carve out arbitrarily fine structure. But it has high variance. A small perturbation of the training set can change an early split, which cascades into a completely different partition downstream. Formally, if $\hat{f}(x)$ is the prediction of a tree at a point $x$, the expected squared error decomposes as $$ \mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\sigma^2}_{\text{noise}} + \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}(\hat{f}(x))}_{\text{variance}}. $$ For deep trees the bias term is small and the variance term dominates. The natural remedy is averaging, since averaging many noisy but roughly unbiased estimates reduces variance while preserving the low bias. ### 1.2 Bagging Bagging, short for bootstrap aggregating, is the direct application of this idea. 
Given a training set of $n$ examples, we draw $B$ bootstrap samples, each formed by sampling $n$ points with replacement. We fit a tree $\hat{f}^{(b)}$ on each bootstrap sample and average their predictions: $$ \hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{(b)}(x). $$ For classification we instead take a majority vote, or average the class probability estimates and then take the argmax. Because each tree is grown deep and left unpruned, the individual trees keep their low bias, and the ensemble buys a reduction in variance. ### 1.3 Why averaging alone is not enough The catch is that the bootstrap samples overlap heavily, so the trees they produce are correlated. Consider $B$ identically distributed predictions, each with variance $\sigma^2$ and pairwise correlation $\rho$. The variance of their average is $$ \operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}^{(b)}(x)\right) = \rho\,\sigma^2 + \frac{1 - \rho}{B}\,\sigma^2. $$ To see where this comes from, write the variance of the sum as the sum of all $B^2$ entries of the covariance matrix. There are $B$ diagonal entries equal to $\sigma^2$ and $B(B-1)$ off diagonal entries equal to $\rho\sigma^2$, so the sum is $B\sigma^2 + B(B-1)\rho\sigma^2$. Dividing by $B^2$ for the average and rearranging yields the expression above. As $B \to \infty$ the second term vanishes, but the first term, $\rho\sigma^2$, does not. The correlation between trees sets a floor on how much variance averaging can remove. If we want a better ensemble, we must drive down $\rho$. This single equation is the conceptual engine of the random forest. ## 2. Decorrelating the Trees ### 2.1 Random feature subsets at each split Random forests, introduced by Breiman in 2001, add a second source of randomness on top of bagging. When growing each tree, at every node the algorithm does not consider all $p$ features as split candidates. 
Instead it draws a random subset of $m \le p$ features and searches for the best split only among those. A fresh subset is drawn at every node. The effect is to break the dominance of strong predictors. In plain bagging, if one or two features are highly informative, nearly every tree will split on them at the top, producing very similar trees and a high $\rho$. By hiding those features at a fraction of the nodes, random feature selection forces trees to explore alternative structure, which lowers $\rho$ at the cost of a small increase in the bias and variance of each individual tree. The net effect on the ensemble is usually favorable, because the floor $\rho\sigma^2$ in the variance bound is reduced faster than the per-tree variance $\sigma^2$ grows. ### 2.2 The algorithm The full procedure is compact. ```text for b = 1 to B: draw a bootstrap sample D_b of size n from the data grow a tree T_b on D_b: at each node: select m features at random from the p available choose the best split among those m features split the node grow until a stopping rule (e.g. min node size) is met do not prune prediction: regression: average the B tree outputs classification: majority vote, or average class probabilities ``` The two knobs that distinguish a forest from a bag of trees are $m$, the number of features sampled per split, and the absence of pruning. Everything else is standard tree machinery. ### 2.3 Choosing the split dimension $m$ Common defaults are $m = \lfloor \sqrt{p} \rfloor$ for classification and $m = \lfloor p/3 \rfloor$ for regression, with a floor of one. Small $m$ yields stronger decorrelation but weaker individual trees; large $m$ approaches plain bagging. Setting $m = p$ recovers bagging exactly. The optimal value depends on how many features are relevant: when only a few features carry signal among many noise features, a very small $m$ raises the chance that a given split never sees a useful feature, so a moderately larger $m$ can help. 
In practice $m$ is the one hyperparameter most worth tuning.

## 3. Out of Bag Error

### 3.1 The out of bag sample

Bootstrap sampling gives random forests a free validation mechanism. When we draw a bootstrap sample of size $n$ with replacement, each particular observation has probability $(1 - 1/n)^n$ of being omitted. As $n$ grows this tends to

$$
\lim_{n \to \infty}\left(1 - \frac{1}{n}\right)^n = e^{-1} \approx 0.368.
$$

So on average about 37 percent of the observations are left out of any given bootstrap sample. These are the out of bag, or OOB, observations for that tree. Each observation is OOB for roughly a third of the trees.

### 3.2 Constructing the OOB estimate

To form the OOB prediction for observation $i$, we aggregate only over those trees for which $i$ was out of bag:

$$
\hat{f}_{\text{oob}}(x_i) = \operatorname{aggregate}\big\{\,\hat{f}^{(b)}(x_i) : i \notin D_b \,\big\}.
$$

Because each such tree never saw $x_i$ during training, the resulting prediction is honest in the same sense as a held out prediction. The OOB error is the average loss of these predictions over all $i$. For classification it is the misclassification rate of the OOB votes; for regression it is the OOB mean squared error.

### 3.3 Why this matters in practice

The OOB error closely tracks the error one would obtain from a separate test set or from cross validation, and it comes at essentially no extra cost, since the trees are already built. This lets you monitor performance as a function of $B$ and stop adding trees once the OOB error plateaus. One caveat: each observation's OOB estimate is based on only about a third of the trees, so for small $B$ the OOB error can be slightly pessimistic. With a few hundred trees this bias is negligible. The OOB estimate can also be used to tune $m$ without a separate validation split, though for final model selection a proper cross validation remains the safer choice when data permit.

## 4. Feature Importance

Random forests offer two principled ways to rank the contribution of each feature. They answer different questions and can disagree, so understanding both is essential.

### 4.1 Mean decrease in impurity

The first measure accumulates, for each feature, the total reduction in node impurity that its splits achieve, averaged over all trees. When a node $t$ is split on feature $j$, the impurity decrease is

$$
\Delta i(t) = i(t) - \frac{n_{t_L}}{n_t}\,i(t_L) - \frac{n_{t_R}}{n_t}\,i(t_R),
$$

where $i(\cdot)$ is the node impurity (Gini index or entropy for classification, variance for regression), $n_t$ is the number of samples reaching node $t$, and $t_L, t_R$ are the children. The importance of feature $j$ is the sum of $\Delta i(t)$ over all nodes that split on $j$, weighted by the fraction of samples reaching each node, averaged across the forest. This is often called mean decrease in impurity, or MDI, or Gini importance.

MDI is computed for free during training, which makes it attractive. But it has a well known bias: it inflates the apparent importance of features with many possible split points, such as continuous variables or high cardinality categoricals, because such features have more opportunities to reduce impurity by chance. It is also computed on the training data, so it can reward overfitting. Treat MDI as a fast first look, not a final verdict.

### 4.2 Permutation importance

The second measure is model agnostic and directly operational. After the forest is trained, we measure a baseline error, ideally the OOB error. Then for a chosen feature $j$ we randomly permute its values across the OOB observations, breaking any association between $j$ and the target while leaving its marginal distribution intact, and recompute the error. The importance of $j$ is the increase in error:

$$
\text{Imp}(j) = \text{err}_{\text{permuted}(j)} - \text{err}_{\text{baseline}}.
$$

If permuting $j$ barely changes the error, the model was not relying on $j$.
If the error jumps, $j$ was important. Averaging over several permutations and over the trees gives a stable estimate. Because the permutation is applied to held out data, this measure does not reward training set overfitting in the way MDI can, and it is not systematically biased toward high cardinality features.

```text
baseline = oob_error(forest, X, y)
for each feature j:
    X_perm = copy(X); shuffle column j of X_perm over OOB rows
    importance[j] = oob_error(forest, X_perm, y) - baseline
```

### 4.3 The correlated feature trap

Both measures are distorted by correlated predictors, but in opposite ways that are worth internalizing. With permutation importance, if two features are strongly correlated, permuting one of them leaves the model able to recover most of the signal from the other, so each appears less important than it truly is; the importance is split or hidden between them. With MDI, correlated features tend to share credit somewhat arbitrarily depending on which one happened to be chosen at each split.

Neither measure is causal. A feature with high importance is predictive in the context of this model, not necessarily a cause of the outcome, and a feature with low importance may simply be redundant given the others. When interpretation matters, consider grouped permutation, conditional permutation schemes, or downstream tools such as SHAP values, and always validate conclusions against domain knowledge.

## 5. Computational Cost

The training cost is dominated by the per-node split search. Building one tree on $n$ samples, evaluating $m$ candidate features per node, and sorting the relevant values, costs on the order of $O(m\,n\log n)$ for a balanced tree. Across $B$ trees the total training time is roughly

$$
O\big(B\,m\,n\log n\big),
$$

which is linear in the number of trees and, crucially, linear in $m$ rather than $p$. Because $m$ is typically $\sqrt{p}$ or $p/3$, random forests are markedly cheaper than full bagging on wide data.
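
As a rough worked example under the cost model above, ignoring constant factors and assuming both methods grow trees of similar size, the split search work of a forest relative to full bagging (which evaluates all $p$ features per node) is

$$
\frac{B\,m\,n\log n}{B\,p\,n\log n} = \frac{m}{p}.
$$

With $p = 100$ features, the classification default $m = \lfloor\sqrt{100}\rfloor = 10$ gives $m/p = 0.1$, and the regression default $m = \lfloor 100/3 \rfloor = 33$ gives $m/p = 0.33$. The value $p = 100$ is an illustrative assumption, not a figure from any benchmark.
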
The trees are entirely independent, so training parallelizes almost perfectly across cores and machines; wall clock time scales close to $1/(\text{number of cores})$. Prediction cost is $O(B\,d)$ where $d$ is the typical tree depth, since each test point traverses each tree from root to leaf. Memory scales with the total number of nodes, on the order of $O(B\,n)$ for fully grown trees, which is the main practical limit for very large forests. Restricting leaf size or tree depth trades a little accuracy for substantial reductions in both memory and latency.

## 6. Mature Open-Source Tooling

You should almost never implement a random forest from scratch for production. The hard parts (efficient split finding, presorting, parallel tree construction, OOB bookkeeping, and missing value handling) are exactly where mature libraries have invested years of engineering, and a naive reimplementation will be slower and subtly wrong. Reach for the established free and open-source tools.

- In Python, scikit-learn's `RandomForestClassifier` and `RandomForestRegressor` are the reference implementations: well tested, parallelized through joblib, and integrated with the wider ecosystem of pipelines, cross validation, and `permutation_importance`. They are released under the permissive BSD license.
- In Julia, `DecisionTree.jl` provides a fast native random forest and integrates with the `MLJ` model-selection framework, so the same forest plugs into MLJ's tuning and evaluation machinery.
- In Rust, `smartcore` offers a random forest with a scikit-learn-flavored API, while the `linfa-trees` crate (part of the `linfa` toolkit) provides the decision tree primitive from which a forest can be assembled. Both are suitable for embedding a trained model in a compiled service.

When you need to push tabular accuracy further, gradient boosted tree libraries such as XGBoost, LightGBM, and CatBoost, all open source, are the usual next step; they trade the forest's near-zero tuning for higher ceilings at the cost of more hyperparameters.
The demonstration below uses scikit-learn as the executed reference and shows the equivalent forest in Julia and Rust. The dataset is generated inline so the example is fully self-contained and reproducible.

::: {.panel-tabset}

## Python

```{python}
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score, roc_auc_score

rng = 0

# A realistic tabular problem: 1000 rows, 12 features, of which only
# 5 carry signal, 2 are redundant (linear combos of the informative
# ones), and the rest are noise. This is exactly the regime where the
# random feature subset and OOB error earn their keep.
X, y = make_classification(
    n_samples=1000,
    n_features=12,
    n_informative=5,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    flip_y=0.03,
    class_sep=1.1,
    random_state=rng,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=rng
)

forest = RandomForestClassifier(
    n_estimators=400,      # raise until OOB error plateaus
    max_features="sqrt",   # m = floor(sqrt(p)); the key decorrelation knob
    min_samples_leaf=1,    # deep, unpruned trees
    oob_score=True,        # free held-out estimate, no extra data
    bootstrap=True,
    n_jobs=-1,
    random_state=rng,
)
forest.fit(X_tr, y_tr)

# Held-out performance plus the free OOB estimate.
proba = forest.predict_proba(X_te)[:, 1]
pred = forest.predict(X_te)

print(f"shapes: X_tr={X_tr.shape}, X_te={X_te.shape}")
print(f"sqrt(p) features per split: {int(np.sqrt(X.shape[1]))} of {X.shape[1]}")
print(f"OOB accuracy (training): {forest.oob_score_:.3f}")
print(f"test accuracy: {accuracy_score(y_te, pred):.3f}")
print(f"test ROC AUC: {roc_auc_score(y_te, proba):.3f}")

# Permutation importance on the test split is the honest ranking:
# it is computed on held-out data and is not biased toward
# high-cardinality features the way impurity importance can be.
perm = permutation_importance(
    forest, X_te, y_te, n_repeats=20, random_state=rng, n_jobs=-1
)
order = np.argsort(perm.importances_mean)[::-1]
print("\ntop 5 features by permutation importance (drop in accuracy):")
for j in order[:5]:
    print(f"  feature {j:2d}: {perm.importances_mean[j]:.3f}"
          f" +/- {perm.importances_std[j]:.3f}")
```

The forest reaches high accuracy with no tuning beyond sensible defaults, and the OOB accuracy closely tracks the genuine test accuracy, confirming that the bootstrap leftovers give an honest performance estimate for free. The permutation ranking surfaces a handful of features as carrying nearly all the predictive signal, consistent with the five informative columns we planted, while the noise columns sit near zero.

## Julia

```julia
# Pkg.add("DecisionTree")   # Random ships with Julia
using DecisionTree
using Random

rng = MersenneTwister(0)

# Stand-in for make_classification: a small synthetic tabular set.
n, p = 1000, 12
X = randn(rng, n, p)

# Signal lives in the first five columns; the rest are noise.
w = [1.4, -1.1, 0.9, 0.8, -0.7]
logits = X[:, 1:5] * w
y = Int.(logits .+ 0.3 .* randn(rng, n) .> 0)

idx = shuffle(rng, 1:n)
cut = round(Int, 0.75n)
tr, te = idx[1:cut], idx[cut+1:end]

# n_trees=400, m = floor(sqrt(p)) features per split, full-depth trees.
n_subfeatures = floor(Int, sqrt(p))
forest = build_forest(y[tr], X[tr, :], n_subfeatures, 400)

pred = apply_forest(forest, X[te, :])
acc = sum(pred .== y[te]) / length(te)
println("features per split: ", n_subfeatures, " of ", p)
println("test accuracy: ", round(acc, digits=3))

# DecisionTree.jl also provides n-fold cross validation (nfoldCV_forest)
# and integrates with MLJ via the RandomForestClassifier model for
# tuning and evaluation.
```

## Rust

```rust
// Cargo.toml:
//   linfa = "0.7"
//   linfa-trees = "0.7"
//   ndarray = "0.15"
//   ndarray-rand = "0.14"
//
// linfa-trees ships a decision tree; a random forest is a thin bagging
// wrapper over it (bootstrap rows + sqrt(p) feature subsets). The
// smartcore crate offers a ready-made RandomForestClassifier if you
// prefer a forest out of the box. Sketch using linfa's tree:
use linfa::prelude::*;
use linfa_trees::DecisionTree;
use ndarray::{Array, Axis};
use ndarray_rand::{RandomExt, rand_distr::Uniform};

fn main() {
    // Synthetic tabular data: 1000 x 12, signal in the first columns.
    let n = 1000;
    let x = Array::random((n, 12), Uniform::new(-2.0, 2.0));
    let y = x
        .slice(ndarray::s![.., 0..5])
        .map_axis(Axis(1), |row| {
            let s = 1.4 * row[0] - 1.1 * row[1] + 0.9 * row[2]
                + 0.8 * row[3] - 0.7 * row[4];
            // usize class labels, a label type linfa supports.
            if s > 0.0 { 1usize } else { 0usize }
        });

    let ds = Dataset::new(x, y);
    let (train, test) = ds.split_with_ratio(0.75);

    // One tree shown for brevity; bag B of these over bootstrap rows
    // and sqrt(p) feature subsets to assemble the forest, or use
    // smartcore::ensemble::random_forest_classifier directly.
    let model = DecisionTree::params().fit(&train).unwrap();
    let pred = model.predict(&test);
    let acc = pred.confusion_matrix(&test).unwrap().accuracy();
    println!("test accuracy: {:.3}", acc);
}
```

Note: in the Rust ecosystem, `smartcore` provides a complete `RandomForestClassifier` out of the box, while `linfa-trees` currently exposes the decision tree primitive over which a forest is a short bagging wrapper.
For production Rust prefer `smartcore`'s forest or, for maximum accuracy, the `xgboost`/`lightgbm` bindings.

:::

## 7. Practical Tuning

Random forests are forgiving, which is much of their appeal. Still, a handful of settings repay attention.

### 7.1 Number of trees

The number of trees $B$ is not a parameter that overfits. Adding more trees only refines the Monte Carlo average and monotonically improves, or at worst stabilizes, the estimate; it never increases the expected generalization error in the way that, say, depth does. The practical guidance is to use as many trees as your compute budget allows, then verify via the OOB error curve that performance has plateaued. A few hundred to a couple thousand trees is typical. The only cost of more trees is time and memory.

### 7.2 The split dimension and tree depth

The split dimension $m$ is the lever with the largest effect on the bias variance tradeoff of the ensemble, so it is the first thing to tune, sweeping values around the default $\sqrt{p}$ or $p/3$ and reading off the OOB error. Tree depth and the related stopping rules, minimum samples per leaf and minimum samples to split, are usually left permissive so that trees grow deep and keep low bias, with the ensemble averaging away the variance. On very noisy data or very large datasets, mildly restricting leaf size can reduce variance further and speed up training, so a light sweep of minimum leaf size is sometimes worthwhile.

### 7.3 Sampling and class imbalance

By default each tree is grown on a bootstrap sample of size $n$ drawn with replacement. Reducing the sample size or sampling without replacement, often called subsampling, can further decorrelate trees and accelerate training on large data. For imbalanced classification, where one class dominates, the forest will tend to favor the majority class.
Effective fixes include balanced bootstrap sampling, in which each draw is stratified to equalize class counts, class weighting in the split criterion, and adjusting the decision threshold on the averaged class probabilities rather than defaulting to a half. Evaluate such models with metrics suited to imbalance, such as the area under the precision recall curve, not raw accuracy.

### 7.4 A sensible default recipe

For a first pass on a new problem, a strong baseline is several hundred trees, $m$ at the standard default, deep unpruned trees, and the OOB error used to confirm convergence and to compare a small grid of $m$ values. This typically lands close to the best achievable accuracy with almost no manual effort, which is precisely why random forests remain a default choice for tabular data.

```text
forest = RandomForest(
    n_estimators     = 500,        # raise until OOB error plateaus
    max_features     = "sqrt",     # tune around this value
    min_samples_leaf = 1,          # raise slightly only if noisy
    oob_score        = True,       # free validation estimate
    class_weight     = "balanced"  # only for imbalanced targets
)
```

### 7.5 Strengths and limits

Random forests handle mixed feature types, missing values through surrogate splits in some implementations, nonlinearities, and interactions with little preprocessing, and they need no feature scaling. They are highly parallel, since trees are independent. Their main limitations are model size and prediction latency for very large forests, weaker extrapolation than parametric models since predictions are bounded by the range of the training targets, and reduced interpretability relative to a single tree. When the goal is maximal predictive accuracy on structured data and the priority is reliability over interpretability, the random forest is hard to beat, and gradient boosted trees are the usual next step when squeezing out further accuracy is worth additional tuning.

## References

1. Breiman, L. Random Forests. Machine Learning, 45(1), 5 to 32, 2001.
   https://doi.org/10.1023/A:1010933404324
2. Breiman, L. Bagging Predictors. Machine Learning, 24(2), 123 to 140, 1996. https://doi.org/10.1007/BF00058655
3. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, 2nd edition, Springer, 2009. https://doi.org/10.1007/978-0-387-84858-7
4. Louppe, G. Understanding Random Forests: From Theory to Practice. PhD thesis, University of Liege, 2014. https://arxiv.org/abs/1407.7502
5. Strobl, C., Boulesteix, A., Zeileis, A., and Hothorn, T. Bias in Random Forest Variable Importance Measures. BMC Bioinformatics, 8:25, 2007. https://doi.org/10.1186/1471-2105-8-25
6. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825 to 2830, 2011. https://jmlr.org/papers/v12/pedregosa11a.html
7. Probst, P., Wright, M., and Boulesteix, A. Hyperparameters and Tuning Strategies for Random Forest. WIREs Data Mining and Knowledge Discovery, 9(3), 2019. https://doi.org/10.1002/widm.1301