113 LightGBM: Fast Gradient Boosting at Scale
LightGBM is a gradient boosting framework engineered for speed and memory efficiency on large, high dimensional datasets. It implements the same statistical model as other gradient boosted decision tree (GBDT) systems, yet it reaches comparable or better accuracy with a fraction of the training time. Three ideas drive this advantage: a histogram based representation of feature values, a leaf-wise tree growth policy, and two sampling techniques named gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB). This chapter develops each idea from first principles, situates LightGBM against XGBoost and CatBoost, and offers guidance on when to reach for it in practice.
113.1 1. Gradient Boosting Background
113.1.1 1.1 The Additive Model
Gradient boosting fits an additive ensemble of regression trees. Given a differentiable loss \(L(y, \hat{y})\), the model after \(t\) rounds is \(F_t(x) = F_{t-1}(x) + \eta f_t(x)\), where \(f_t\) is a newly fitted tree and \(\eta\) is the learning rate. At each round we expand the loss to second order around the current prediction. For example \(i\), define the gradient \(g_i = \partial_{\hat{y}} L(y_i, F_{t-1}(x_i))\) and the Hessian \(h_i = \partial^2_{\hat{y}} L(y_i, F_{t-1}(x_i))\). A node containing instance set \(I\) contributes the approximate objective
\[ \tilde{L} = -\frac{1}{2} \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda} + \gamma, \]
where \(\lambda\) is an L2 penalty on leaf weights and \(\gamma\) penalizes adding leaves. The optimal leaf weight is \(w^* = -\sum_{i \in I} g_i / (\sum_{i \in I} h_i + \lambda)\).
113.1.2 1.2 Split Finding as the Bottleneck
Training a tree means repeatedly choosing the split that maximizes the gain
\[ \text{Gain} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right], \]
with \(G_L = \sum_{i \in I_L} g_i\) and so on. The cost of training is dominated by evaluating candidate splits. A naive exact algorithm sorts every feature and scans all possible thresholds, costing roughly \(O(\#\text{data} \times \#\text{feature})\) per level after the sort, plus the sort itself. LightGBM attacks this cost on two fronts: it shrinks the number of candidate thresholds with histograms, and it shrinks the effective number of rows and columns with GOSS and EFB.
113.2 2. Histogram Based Split Finding
113.2.1 2.1 Binning Continuous Features
Before training, LightGBM discretizes each continuous feature into a small number of bins, typically 255 so that a bin index fits in a single byte. The bin edges come from the empirical distribution of the feature, so dense regions receive finer bins. Once binning is done, the raw floating point values are discarded and the dataset is stored as compact integer bin indices. This alone cuts memory use by a large factor and improves cache locality.
113.2.2 2.2 The Histogram Algorithm
To find the best split on a feature, LightGBM accumulates a histogram whose bins hold the summed gradients and Hessians of the instances falling in each bin.
for each instance i in node:
b = bin_index(feature, i)
hist[b].g += g_i
hist[b].h += h_i
for each candidate threshold (between adjacent bins):
compute gain from prefix sums of hist
Building the histogram costs \(O(\#\text{data})\), and scanning thresholds costs \(O(\#\text{bins})\), which is constant. Contrast this with the pre-sorted exact method at \(O(\#\text{data} \times \log \#\text{data})\) per feature for sorting. Histograms trade a small amount of resolution for a large speedup, and because bin edges are chosen from the data distribution, the accuracy cost is usually negligible.
113.2.3 2.3 The Histogram Subtraction Trick
A decisive optimization exploits a simple identity. A parent node’s histogram equals the sum of its two children’s histograms. So once we build the histogram for the child with fewer instances, the sibling’s histogram is obtained by subtraction:
\[ \text{hist}_{\text{sibling}} = \text{hist}_{\text{parent}} - \text{hist}_{\text{smaller child}}. \]
Because the framework always builds the histogram for the smaller child directly and derives the larger one, it roughly halves the histogram construction work at every level. This trick is one reason LightGBM scales so well as trees deepen.
113.3 3. Leaf-Wise Tree Growth
113.3.1 3.1 Level-Wise versus Leaf-Wise
Most earlier GBDT implementations grow trees level-wise: they split every node at the current depth before moving deeper, producing balanced trees. LightGBM instead grows leaf-wise (also called best-first). At each step it expands the single leaf, anywhere in the tree, that promises the largest gain reduction.
Leaf-wise growth converges to lower loss for a fixed number of leaves because it spends splits where they help most rather than spreading them uniformly. The cost is that trees become deep and unbalanced, which raises the risk of overfitting on small datasets.
113.3.2 3.2 Controlling Complexity
Because depth is no longer the natural complexity knob, LightGBM parameterizes capacity primarily through num_leaves. A leaf-wise tree with \(2^d\) leaves can be far deeper than a level-wise tree of depth \(d\), so setting num_leaves near \(2^d\) tends to overfit. A common starting point keeps num_leaves well below \(2^{\text{max\_depth}}\). The key regularization parameters are summarized below.
num_leaves # main capacity control for leaf-wise growth
max_depth # hard cap on depth to bound the unbalanced tree
min_data_in_leaf # minimum instances per leaf, fights noise fitting
min_gain_to_split # gain floor below which a leaf stays a leaf
lambda_l1, lambda_l2 # L1 and L2 penalties on leaf weights
In practice, on datasets with tens of thousands of rows or fewer, leaf-wise growth should be paired with a modest num_leaves, a positive min_data_in_leaf, and a bounded max_depth. On large datasets the leaf-wise advantage in accuracy is most pronounced and the overfitting risk is smaller.
113.4 4. Gradient-Based One-Side Sampling (GOSS)
113.4.1 4.1 Motivation
Subsampling rows speeds training, but uniform random sampling treats all instances equally. In gradient boosting the gradient magnitude \(|g_i|\) measures how poorly an instance is currently fit: well trained instances have small gradients and contribute little to further split gains. GOSS exploits this by keeping all of the large gradient instances and randomly sampling among the small gradient ones, rather than discarding information uniformly.
113.4.2 4.2 The Algorithm
GOSS proceeds as follows. Sort instances by \(|g_i|\) in descending order. Keep the top fraction \(a\) of instances (the large gradient set \(A\)). From the remaining \((1-a)\) fraction, randomly sample a fraction \(b\) to form set \(B\). When computing gains over \(A \cup B\), multiply the gradients and Hessians of the sampled small gradient instances by the constant
\[ \frac{1 - a}{b}. \]
sort instances by |g| descending
top_set = first a * N instances # all kept
rand_set = sample b * N from the rest # amplified by (1-a)/b
use top_set + rand_set to build histograms
113.4.3 4.3 Why the Reweighting Is Unbiased
The amplification factor restores, in expectation, the contribution that the dropped instances would have made. Suppose the small gradient pool has \(N(1-a)\) instances and we sample \(N b\) of them. Each sampled instance stands in for \(N(1-a) / (Nb) = (1-a)/b\) original instances, so scaling its statistics by that factor keeps the estimated split gain approximately unbiased. The LightGBM paper proves that the variance introduced by GOSS is bounded and that, because large gradient instances are never dropped, the gain estimate remains close to the full-data estimate. The result is a training set that is smaller but information dense, often giving the same accuracy as full data while using a fraction of the instances per round.
113.5 5. Exclusive Feature Bundling (EFB)
113.5.1 5.1 Sparse High Dimensional Features
High dimensional data, especially after one-hot encoding of categorical variables, is typically very sparse. Many features are mutually exclusive: they rarely take nonzero values for the same instance. One-hot indicator columns are the extreme case, since exactly one is nonzero per category. EFB recognizes that such features can be packed into a single bundled feature without losing information, because their nonzero values never collide.
113.5.2 5.2 Bundling as Graph Coloring
Finding the best bundling is equivalent to a graph coloring problem. Build a graph whose vertices are features and whose edges connect features that conflict, meaning they are simultaneously nonzero for too many instances. The weight of an edge is the number of conflicts. Bundling features into the fewest groups so that within-group conflict stays under a tolerance is graph coloring, which is NP-hard. LightGBM uses a greedy heuristic: sort features by degree (total conflict), then assign each feature to an existing bundle if doing so keeps conflicts under a threshold max_conflict_rate, otherwise open a new bundle. Allowing a small conflict rate trades a tiny amount of accuracy for far fewer bundles.
113.5.3 5.3 Merging Values Within a Bundle
Once features are grouped, their values are merged into one feature by offsetting each member’s bin range so the ranges do not overlap. If feature \(A\) uses bins \([0, k_A)\) and feature \(B\) uses bins \([0, k_B)\), then \(B\) is shifted to occupy \([k_A, k_A + k_B)\) in the bundled feature.
offset = 0
for feature f in bundle:
bundled_bin(i) = bin(f, i) + offset if f nonzero for i
offset += num_bins(f)
A split on the bundled feature can therefore recover a split on any of its members. EFB reduces the effective number of features from #feature to #bundle, and histogram construction cost scales with the number of bundles rather than the original feature count. On sparse data this can be an order of magnitude reduction.
113.6 6. Categorical Feature Handling
LightGBM treats categorical features natively rather than requiring one-hot encoding. For a categorical feature it sorts the categories by the accumulated gradient statistic \(\sum g / \sum h\) within each category, then finds the best partition of categories into two groups along that ordering. This follows the classic Fisher result that the optimal partition for many splitting criteria lies along a sorted ordering, reducing an exponential search over subsets to a linear scan. Native handling avoids the dimensionality blowup of one-hot encoding and usually yields better splits, though high cardinality features still warrant care to avoid overfitting, for which min_data_per_group and cat_smooth provide control.
113.7 7. Parallel and Distributed Training
LightGBM supports several parallelism modes. Feature parallelism partitions features across workers, each of which finds the best split on its own features before the workers communicate to agree on the global best split. This works well when features outnumber rows. Data parallelism partitions instances across workers; each builds local histograms that are then merged, and the communication cost is reduced by merging only the smaller histograms and by reducing histograms per feature rather than per instance. Voting parallelism further cuts communication by having each worker propose only its top candidate features for a global vote, so only a subset of histograms is communicated. The choice depends on the shape of the data and the cluster, but voting parallelism scales best when both rows and features are large.
113.8 8. When LightGBM Is Preferred
113.8.1 8.1 Strengths
LightGBM is usually the first choice when training speed and memory matter. On large tabular datasets, from hundreds of thousands to many millions of rows, it trains substantially faster than level-wise alternatives at comparable accuracy, and its histogram representation keeps memory modest. It shines on wide sparse data thanks to EFB, and on data with many informative but redundant instances thanks to GOSS. Native categorical support removes a preprocessing step. For rapid iteration during feature engineering or hyperparameter search, the speed advantage compounds across many runs.
113.8.2 8.2 Cautions and Alternatives
The leaf-wise policy makes LightGBM more prone to overfitting on small datasets, so on a few thousand rows a level-wise method or careful regularization is advisable. CatBoost is often a stronger default when categorical features dominate and target leakage is a concern, since its ordered boosting and ordered target statistics specifically combat that leakage. XGBoost remains a robust, well documented baseline with a mature ecosystem and an exact greedy mode for smaller problems where the histogram approximation is unnecessary. The table below summarizes the trade-offs.
LightGBM fastest on large/wide/sparse data; tune num_leaves carefully
XGBoost robust default; exact mode for small data; large ecosystem
CatBoost best with heavy categoricals; resists target leakage
113.8.3 8.3 Practical Tuning Order
A reasonable tuning sequence starts with capacity, then regularization, then sampling. First set num_leaves and max_depth to control tree complexity. Next raise min_data_in_leaf and the L1 and L2 penalties until validation loss stabilizes. Then lower the learning rate and raise the number of boosting rounds together, using early stopping on a validation set. Finally enable feature_fraction and bagging, or switch the boosting type to GOSS, to recover speed once accuracy targets are met. Throughout, monitor the gap between training and validation loss, since the leaf-wise growth makes that gap the clearest signal of overfitting.
113.9 9. Summary
LightGBM achieves its speed through a coherent set of design choices. Histogram binning turns split finding into a constant-cost scan and enables the histogram subtraction trick. Leaf-wise growth invests each split where it most reduces loss. GOSS shrinks the row count while preserving the high gradient instances that drive learning, with a reweighting that keeps gain estimates unbiased. EFB shrinks the column count by bundling mutually exclusive sparse features. Together these let LightGBM match or exceed the accuracy of other gradient boosting systems while training faster and using less memory, which makes it the default choice for large scale tabular learning, tempered by attention to overfitting on small or noisy datasets.
113.10 References
- Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T. “LightGBM: A Highly Efficient Gradient Boosting Decision Tree.” NeurIPS 2017. https://papers.nips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html
- LightGBM Documentation. “Features.” https://lightgbm.readthedocs.io/en/latest/Features.html
- LightGBM Documentation. “Parameters Tuning.” https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html
- Chen, T. and Guestrin, C. “XGBoost: A Scalable Tree Boosting System.” KDD 2016. https://dl.acm.org/doi/10.1145/2939672.2939785
- Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. “CatBoost: Unbiased Boosting with Categorical Features.” NeurIPS 2018. https://papers.nips.cc/paper/2018/hash/14491b756b3a51daac41c24863285549-Abstract.html
- Friedman, J. H. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics, 2001. https://projecteuclid.org/euclid.aos/1013203451
- Microsoft. “LightGBM GitHub Repository.” https://github.com/microsoft/LightGBM