114 CatBoost: Ordered Boosting and Categorical Features Done Right

CatBoost is a gradient boosting library developed at Yandex that addresses two problems most gradient boosting toolkits handle poorly: the statistical leakage introduced when categorical features are encoded with target information, and a subtler prediction shift that affects gradient boosted models in general. Its two signature ideas, ordered target statistics and ordered boosting, both rest on the same principle of using only the past to estimate the present. Combined with a fast oblivious tree learner and strong default hyperparameters, CatBoost is often the strongest out of the box choice on tabular data with high cardinality categorical columns.

This chapter develops the theory behind these mechanisms, explains why the obvious alternatives leak information, and gives practical guidance for using the library effectively.

114.1 1. Why Categorical Features Are Hard

114.1.1 1.1 The encoding problem

Tree based learners split on numeric thresholds. A categorical feature such as user_id, merchant, or city has no natural ordering, so it must be turned into numbers before a tree can use it. The classical options each have a failure mode.

One hot encoding creates one binary column per category. For a feature with thousands of levels this explodes the feature space, dilutes the signal across many sparse columns, and makes splits shallow and weak. Label encoding assigns an arbitrary integer to each category, which imposes a meaningless ordering that the tree will happily exploit in misleading ways.

A far more powerful approach is target encoding, also called target statistics. Replace each category with a statistic of the target computed over the rows that share that category. For a binary classification target $y \in \{0,1\}$ and a categorical value $x^i_k$ for feature $i$ in row $k$, the natural estimate is the category mean

\[ \hat{x}^i_k = \frac{\sum_{j : x^i_j = x^i_k} y_j}{\sum_{j : x^i_j = x^i_k} 1}. \]

This is compact, scales to arbitrary cardinality, and captures exactly the predictive relationship we care about. It is also dangerously leaky.

114.1.2 1.2 Target leakage and the smoothing band aid

The estimate above uses $y_k$, the label of the very row we are encoding, inside the numerator. A category that appears only once will be encoded with its own target value, so the feature becomes a perfect copy of the label on the training set and carries no information at test time. Even for categories that appear a handful of times, the encoded value is contaminated by the current row, producing an optimistic training signal that does not generalize. This is target leakage, and it is the central pathology CatBoost is built to remove.
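The singleton case is easy to reproduce by hand. The following is a minimal sketch on hypothetical toy data (the merchant names, labels, and the helper `greedy_target_encode` are illustrative assumptions, not part of CatBoost): it computes the plain category mean and shows that categories seen once are encoded as their own label.

```python
from collections import defaultdict

def greedy_target_encode(cats, ys):
    """Replace each category with the mean label of ALL rows sharing it."""
    total, count = defaultdict(float), defaultdict(int)
    for c, y in zip(cats, ys):
        total[c] += y
        count[c] += 1
    return [total[c] / count[c] for c in cats]

# Hypothetical toy data: merchants "m3" and "m4" each appear exactly once.
cats = ["m1", "m1", "m2", "m2", "m3", "m4"]
ys = [1, 0, 0, 0, 1, 0]

enc = greedy_target_encode(cats, ys)
print(enc)  # [0.5, 0.5, 0.0, 0.0, 1.0, 0.0]
```

The last two encoded values equal the last two labels exactly, so on the training set a tree can recover those labels from the feature alone.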

The standard mitigation is additive smoothing toward a prior $p$ with weight $a$:

\[ \hat{x}^i_k = \frac{\sum_{j} \mathbb{1}[x^i_j = x^i_k]\, y_j + a\,p}{\sum_{j} \mathbb{1}[x^i_j = x^i_k] + a}. \]

Smoothing reduces variance for rare categories but does not eliminate leakage, because $y_k$ still sits in the sum. Holdout encoding, where the statistic is computed on a separate fold, removes leakage but throws away data and increases variance. CatBoost wants the leakage gone without sacrificing the training rows, and this is what ordered target statistics achieve.
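To see that smoothing only dampens the leak, one can flip a single label and watch that row's own encoding move. This is a small sketch under assumed values ($a = 1$, $p = 0.5$, and a hypothetical helper `smoothed_encode`); it implements the smoothed formula above over all rows, the encoded row included.

```python
def smoothed_encode(cats, ys, k, a=1.0, p=0.5):
    """Smoothed target statistic for row k, computed over ALL rows (row k included)."""
    num = sum(y for c, y in zip(cats, ys) if c == cats[k])
    den = sum(1 for c in cats if c == cats[k])
    return (num + a * p) / (den + a)

cats = ["m1", "m1", "m3"]
# Flip only the label of the singleton row 2 and recompute its encoding.
enc_if_0 = smoothed_encode(cats, [1, 0, 0], k=2)
enc_if_1 = smoothed_encode(cats, [1, 0, 1], k=2)
print(enc_if_0, enc_if_1)  # 0.25 0.75
```

The encoding is pulled toward the prior, but it still separates the two label values perfectly for this row.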

114.2 2. Ordered Target Statistics

114.2.1 2.1 The ordering principle

CatBoost borrows the idea behind online learning. Imagine the training examples arrive in a sequence. When we encode row $k$, we are only allowed to use the labels of rows that arrived earlier. By construction $y_k$ can never appear in its own encoding, so leakage is impossible.

Concretely, CatBoost samples a random permutation $\sigma$ of the training set. Let $\mathcal{D}_k = \{ j : \sigma(j) < \sigma(k) \}$ be the set of rows preceding $k$ in that permutation. The ordered target statistic for a category is

\[ \hat{x}^i_k = \frac{\sum_{j \in \mathcal{D}_k} \mathbb{1}[x^i_j = x^i_k]\, y_j + a\, p}{\sum_{j \in \mathcal{D}_k} \mathbb{1}[x^i_j = x^i_k] + a}. \]

The prior $p$ and weight $a$ still provide smoothing, which matters a great deal for the earliest rows in the permutation, where $\mathcal{D}_k$ is small or empty. For regression, $p$ is typically the global target mean; for classification, the prior is a tunable constant.
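The formula can be computed in a single pass with running per category sums. The sketch below is one way to do it, assuming $a = 1$, $p = 0.5$, and a hypothetical function name; it is an illustration of the formula, not CatBoost's internal code.

```python
import random
from collections import defaultdict

def ordered_target_stats(cats, ys, a=1.0, p=0.5, seed=0):
    """Encode each row using only rows that precede it in a random permutation."""
    n = len(cats)
    order = list(range(n))
    random.Random(seed).shuffle(order)        # order[r] is the row at rank r
    total, count = defaultdict(float), defaultdict(int)
    enc = [0.0] * n
    for k in order:
        c = cats[k]
        enc[k] = (total[c] + a * p) / (count[c] + a)   # history only
        total[c] += ys[k]                               # reveal y_k afterwards
        count[c] += 1
    return enc, order

cats = ["m1", "m1", "m2", "m2", "m3", "m4"]
ys = [1, 0, 0, 0, 1, 0]
enc, order = ordered_target_stats(cats, ys)
# The singletons "m3" and "m4" have empty histories, so both get the prior 0.5.
print(enc[4], enc[5])  # 0.5 0.5
```

Because a row's label is added to the running sums only after the row has been encoded, changing $y_k$ cannot change the encoding of row $k$.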

114.2.2 2.2 Why this is unbiased in expectation

The property that makes ordered statistics sound is that, for any fixed row, the expectation of its encoded value over random permutations tracks the true category mean it is trying to estimate, up to the shrinkage toward the prior $p$ introduced by smoothing, while never including the row’s own label. Each row sees a different prefix of history, so no single row’s encoding is systematically inflated by its own target. The encoding behaves like a value computed on held out data, yet every training row still contributes to the encodings of rows that follow it. We get the leakage safety of holdout encoding and the data efficiency of full target encoding at the same time.

114.2.3 2.3 Multiple permutations

A single permutation has a drawback: rows near the front of the order have tiny histories and therefore high variance encodings, while rows near the back are stable. To keep this variance from biasing the model in a fixed direction, CatBoost samples several independent permutations and rotates among them across boosting iterations. Early rows in one permutation are late rows in another, so the noise averages out over the course of training. By default the library maintains a handful of permutations for this purpose.
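The benefit of several permutations can be seen by tracking how much history each row receives. The sketch below is an illustration only: the helper name, the choice of four permutations, and $n = 1000$ are assumptions, and it does not reproduce CatBoost's internal rotation scheme.

```python
import random

def history_sizes(n, n_perms=4, seed=0):
    """For each row, the number of preceding rows under each permutation."""
    rng = random.Random(seed)
    sizes = [[] for _ in range(n)]
    for _ in range(n_perms):
        order = list(range(n))
        rng.shuffle(order)
        for rank, row in enumerate(order):
            sizes[row].append(rank)       # rank equals the size of the history
    return sizes

sizes = history_sizes(n=1000)
single = [s[0] for s in sizes]            # history under the first permutation only
best = [max(s) for s in sizes]            # longest history across all four
print(min(single), min(best))
```

With one permutation some row always has an empty history; with several, a row that is unlucky in one order usually has a much longer history in another.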

114.2.4 2.4 Feature combinations

A major source of predictive power in tabular data is the interaction between categorical features, for example the pair (country, device_type). CatBoost constructs such combinations greedily during tree growth. At the root, no combinations exist. When a categorical feature is used at a split, the library considers combining the categoricals already on the current path with all remaining categoricals, encoding each resulting combination with the same ordered target statistic machinery. This lets the model discover high order interactions without the analyst enumerating them, and the ordered encoding keeps these combinations leakage free as well.
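The core construction is simple: a combination is just a new categorical whose values are tuples of the original values. A minimal sketch with hypothetical data follows; the greedy, per tree selection of which combinations to build is not shown.

```python
def combine(*columns):
    """Treat a tuple of categorical values as one new categorical value."""
    return list(zip(*columns))

country = ["BR", "BR", "US", "US", "BR"]
device = ["mobile", "desktop", "mobile", "mobile", "mobile"]

combo = combine(country, device)
print(sorted(set(combo)))
# [('BR', 'desktop'), ('BR', 'mobile'), ('US', 'mobile')]
```

The combined column can then be encoded with an ordered target statistic exactly like a single categorical feature.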

114.3 3. Prediction Shift and Ordered Boosting

114.3.1 3.1 A leak hiding inside gradient boosting

Ordered statistics fix leakage in feature encoding. CatBoost’s authors identified a second, more subtle leak that affects gradient boosting itself, independent of categorical features. They call the resulting bias prediction shift.

Recall the gradient boosting loop. At iteration $t$ we hold a model $F^{t-1}$. We compute the negative gradient of the loss at each training point,

\[ g_k = -\left.\frac{\partial L(y_k, s)}{\partial s}\right|_{s = F^{t-1}(x_k)}, \]

fit a new tree $h^t$ to approximate these gradients, and update $F^t = F^{t-1} + \eta\, h^t$. The problem is that $g_k$ is computed using $F^{t-1}(x_k)$, and $F^{t-1}$ was itself trained on a dataset that included row $k$. The model has already seen $y_k$ indirectly, so the gradient at $x_k$ is not representative of the gradient the model would produce on an unseen point with the same features. The distribution of $g_k \mid x_k$ on the training set differs from the distribution on test data. Trees fit to these shifted gradients inherit the bias, and the effect compounds across iterations. This is exactly the same leakage pattern as target encoding, now hiding one level deeper.
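As a concrete instance, assume the log loss with a raw score $s$ passed through a sigmoid (an assumption for illustration; the argument above holds for any differentiable loss). The negative gradient is then $y - \sigma(s)$, and the sketch below shows how a score that has already absorbed the label shrinks it.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def neg_gradient_logloss(y, s):
    """Negative gradient of the log loss with respect to the raw score s."""
    return y - sigmoid(s)

# With F^0 = 0, every positive has gradient +0.5 and every negative -0.5.
print(neg_gradient_logloss(1, 0.0), neg_gradient_logloss(0, 0.0))  # 0.5 -0.5

# A positive row the model has already fit (score 3.0) reports a much smaller
# gradient than an unseen positive scored at 0.0.
seen, unseen = neg_gradient_logloss(1, 3.0), neg_gradient_logloss(1, 0.0)
print(seen < unseen)  # True
```

The gap between the two gradients in the last lines is the kind of training versus test discrepancy that prediction shift describes.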

114.3.2 3.2 The ordered boosting algorithm

The fix mirrors ordered statistics. Fix a permutation $\sigma$. To compute the gradient for row $k$, use a model that was trained only on the rows preceding $k$ in $\sigma$. Then $F^{t-1}$ as applied to $x_k$ has never been exposed to $y_k$, and the gradient is unbiased.

Maintained literally, this requires a separate model $M_j$ for every prefix length, which is quadratic in cost. The conceptual algorithm is:

sample permutation sigma over n training rows
initialize supporting models M_0 ... M_n to zero   # M_j only ever sees the first j rows
for t in 1 .. number_of_trees:
    for each row k:
        # gradient uses only rows before k in sigma
        g_k = gradient(y_k, M_{sigma(k)-1}(x_k))
    for j in 1 .. n:
        # fit on the prefix only, so M_j is never exposed to later labels
        h = fit tree to {(x_k, g_k) : sigma(k) <= j}
        M_j += learning_rate * h
return M_n

The key line is that the gradient for row $k$ is read from the model $M_{\sigma(k)-1}$, which was updated only by rows earlier in the permutation. No row contributes to its own gradient estimate.
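The no self contribution property can be checked numerically with the simplest possible supporting model. In the sketch below the supporting model is a hypothetical stand-in, the mean label of the preceding rows, rather than a tree ensemble, and squared error is assumed so that the negative gradient is the residual. It covers a single gradient computation, not the full boosting loop.

```python
import random

def ordered_residuals(ys, seed=0):
    """Residual of each row against a constant model fit only on earlier rows."""
    n = len(ys)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    res, running_sum = [0.0] * n, 0.0
    for rank, k in enumerate(order):              # rank = number of preceding rows
        prefix_pred = running_sum / rank if rank else 0.0   # plays M_{sigma(k)-1}
        res[k] = ys[k] - prefix_pred
        running_sum += ys[k]
    return res

def plain_residuals(ys):
    mean = sum(ys) / len(ys)                      # model fit on ALL rows, row k included
    return [y - mean for y in ys]

ys = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
bumped = list(ys)
bumped[3] += 10.0                                 # perturb one label
d_ordered = ordered_residuals(bumped)[3] - ordered_residuals(ys)[3]
d_plain = plain_residuals(bumped)[3] - plain_residuals(ys)[3]
print(d_ordered, d_plain)
```

Up to floating point rounding, the ordered residual moves by the full 10 because the prediction for that row did not change, while the plain residual moves by only 8.75 because the model absorbed one eighth of the perturbation.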

114.3.3 3.3 Making it practical

The naive version keeps $n$ supporting models, which is infeasible. CatBoost approximates it by maintaining models for a geometric sequence of prefix lengths, roughly $\log_2 n$ of them, so a row of rank $r$ reads its gradient from the supporting model whose prefix is the largest power of two strictly less than $r$ (a prefix of length $r$ would include the row itself). This brings the overhead down to a constant factor while preserving the no self leakage property closely enough to remove most of the prediction shift. The same permutations used for ordered statistics are reused for ordered boosting, which keeps bookkeeping coherent.
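To keep a row out of its own supporting model, the prefix must be strictly shorter than the row's rank. The lookup can be sketched as follows, with ranks starting at 1 and a hypothetical function name; CatBoost's actual indexing may differ in detail.

```python
def supporting_prefix(rank):
    """Largest power of two strictly below `rank` (1-based); 0 means no history."""
    if rank <= 1:
        return 0
    return 1 << ((rank - 1).bit_length() - 1)

print([supporting_prefix(r) for r in range(1, 10)])
# [0, 1, 2, 2, 4, 4, 4, 4, 8]

# Number of distinct supporting models needed for 3000 rows.
print(len({supporting_prefix(r) for r in range(2, 3001)}))  # 12
```

Twelve prefix lengths (1, 2, 4, ..., 2048) cover 3000 rows, which is the $\log_2 n$ scaling described above.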

114.3.4 3.4 Ordered versus plain mode

CatBoost exposes this choice through the boosting_type parameter. The value Ordered runs the algorithm above and gives the strongest defense against prediction shift; it is most valuable on small datasets, where overfitting from the leak is severe. The value Plain uses the classical gradient boosting update and is faster and lighter on memory; CatBoost selects it automatically for large datasets, where the shift is negligible because each model is trained on enough data that the contribution of any single row is vanishing.

114.4 4. Symmetric (Oblivious) Trees

114.4.1 4.1 Structure

Most boosting libraries grow asymmetric trees: each internal node chooses its own splitting feature and threshold. CatBoost instead grows oblivious trees, also called symmetric trees. In an oblivious tree, every node at the same depth uses the same split condition. A tree of depth $d$ is therefore defined by an ordered list of $d$ split conditions, and all $2^d$ leaves are reached by evaluating those same $d$ tests in order.

A depth 3 oblivious tree might be:

level 0: age > 30 ?
level 1: country in {US, CA} ?   (same test for both children)
level 2: purchases > 5 ?         (same test for all four nodes)

Each example produces a 3 bit index from the three test outcomes, and that index selects one of 8 leaves directly.

114.4.2 4.2 Why obliviousness helps

This structure is more constrained than a general tree, and the constraint acts as a regularizer. An oblivious tree cannot overfit a narrow region of feature space by growing a deep idiosyncratic branch there, which reduces variance and improves robustness, a good match for the small to medium tabular datasets where CatBoost shines.

The structure is also extremely fast to evaluate. Because the same $d$ comparisons apply to every example, scoring reduces to computing a $d$ bit index and a single lookup into a leaf array of length $2^d$. This is branch free and cache friendly. The leaf index can be assembled as

\[ \text{index} = \sum_{l=0}^{d-1} 2^{l}\, b_l, \qquad b_l = \mathbb{1}[\text{condition } l \text{ is true}], \]

which vectorizes cleanly across a batch of examples. The result is prediction latency low enough for demanding serving environments, one reason CatBoost is popular in production ranking and recommendation systems.
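The index formula can be applied directly to the depth 3 tree from section 4.1. The leaf values below are hypothetical, and plain Python is used for readability rather than speed, so this sketch shows the lookup logic only.

```python
def leaf_index(bits):
    """index = sum over levels l of 2^l * b_l."""
    return sum(b << l for l, b in enumerate(bits))

# One test per level, shared by every node at that level.
tests = [
    lambda row: row["age"] > 30,
    lambda row: row["country"] in {"US", "CA"},
    lambda row: row["purchases"] > 5,
]
leaf_values = [0.1, -0.2, 0.4, 0.0, -0.3, 0.2, 0.5, -0.1]   # 2**3 hypothetical leaves

def predict(row):
    bits = [int(t(row)) for t in tests]
    return leaf_values[leaf_index(bits)]

row = {"age": 42, "country": "DE", "purchases": 9}
print(leaf_index([1, 0, 1]), predict(row))  # 5 0.2
```

The example row passes the first and third tests and fails the second, giving bits (1, 0, 1), index $1 + 0 + 4 = 5$, and the sixth leaf value.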

114.4.3 4.3 The cost

Forcing one split per level is a real restriction. If the ideal model needs different features in different regions, an oblivious tree must spend extra depth or extra trees to express what an asymmetric tree captures in one branch. CatBoost compensates with the ensemble: many shallow symmetric trees, typically depth 6, combine to represent complex functions, and the per tree regularization tends to win on the dataset sizes CatBoost targets. The tradeoff is favorable often enough that the symmetric tree is the default, though it is not universally optimal.

114.5 5. Complexity and Cost

It helps to know where the time and memory go. Let $n$ be the number of training rows, $f$ the number of features, $T$ the number of boosting iterations, and $d$ the tree depth.

CatBoost bins each numeric feature into at most $B$ quantile buckets (controlled by border_count, default 254 on CPU), so a histogram for one feature has $B$ entries. Growing one oblivious level scans every feature once to pick the single best split, costing $O(n f)$ to accumulate gradient histograms plus $O(f B)$ to score candidate borders. A tree of depth $d$ repeats this $d$ times, and the ensemble has $T$ trees, giving training time roughly

\[ O\!\left(T \, d \, (n f + f B)\right) \approx O(T\, d\, n f) \]

when $n \gg B$, which is the usual regime. The leaf count per tree is $2^d$, so memory for the model itself is $O(T \, 2^d)$ and is the reason depth is kept small.
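A quick calculation makes the exponential dependence on depth concrete. The ensemble size of 1000 trees is a hypothetical value chosen for illustration, and a single output value per leaf is assumed.

```python
def leaves_per_tree(depth):
    return 2 ** depth

def model_leaf_values(n_trees, depth):
    """Stored leaf values for a single-output model: T * 2^d."""
    return n_trees * leaves_per_tree(depth)

for d in (6, 10, 16):
    print(d, leaves_per_tree(d), model_leaf_values(1000, d))
# 6 64 64000
# 10 1024 1024000
# 16 65536 65536000
```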

Ordered boosting adds overhead but not a worse asymptotic order in $n$. Instead of one model it maintains roughly $\log_2 n$ supporting models, and each row reads its gradient from one of them. Because the prefix lengths double, the total number of per row approximations kept across all supporting models is bounded by a small multiple of $n$, so the per iteration cost grows by a constant factor rather than by a factor of $\log n$. Ordered target statistics cost one pass over the rows per categorical feature, since running per category sums are updated in permutation order, multiplied by the number of permutations (a handful). Prediction is the cheapest part: scoring one example costs $O(T \, d)$, hundreds to a few thousand branch free comparisons and lookups, which is why oblivious forests serve at low latency.

The practical takeaways are that model size is exponential in depth, so depth should stay near 6, that ordered mode is a constant factor slower than plain mode rather than a different complexity class, and that the GPU path mainly accelerates the $O(n f)$ histogram pass that dominates training.

114.6 6. Failure Modes and Limits

CatBoost is robust, but it is not magic, and a few situations reliably trip it up.

The first is misdeclared categorical features. If you forget to list a string column in cat_features, CatBoost will typically refuse to train because it cannot interpret the strings as numbers; if you pre encode a categorical into integers and then forget to declare it, the model treats those integers as an ordered numeric feature and the whole ordered statistics advantage evaporates. Declare categoricals as raw values and let the library encode them.

The second is the cost of ordered mode on large data. The supporting models and permutation bookkeeping carry real overhead, so on millions of rows Ordered boosting can be several times slower than Plain for a benefit that has shrunk to nothing, because each model now trains on so much data that any single row barely moves it. CatBoost defaults to Plain at scale for exactly this reason; overriding it back to Ordered on big data usually buys only a slower run.

The third is extreme cardinality combined with tiny support. A feature such as transaction_id that is nearly unique per row carries no generalizable signal, and even ordered encoding will mostly return the smoothed prior $p$ for it. Such features add noise and training cost; drop them rather than hoping the encoder rescues them.
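The collapse to the prior is easy to verify. The sketch below assumes $a = 1$ and $p = 0.5$ and, for simplicity, uses the given row order as the permutation; the identifiers are hypothetical.

```python
from collections import defaultdict

def ordered_ts(cats, ys, a=1.0, p=0.5):
    """Ordered target statistic using the given row order as the permutation."""
    total, count, enc = defaultdict(float), defaultdict(int), []
    for c, y in zip(cats, ys):
        enc.append((total[c] + a * p) / (count[c] + a))
        total[c] += y
        count[c] += 1
    return enc

transaction_id = [f"tx{i}" for i in range(6)]   # unique per row
ys = [1, 0, 0, 1, 1, 0]
print(ordered_ts(transaction_id, ys))  # [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
```

Every row has an empty history for its own identifier, so the encoded column is constant and carries no signal.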

The fourth is the oblivious tree’s rigidity. On problems where the right model genuinely needs different features in different regions, for example data with strong, localized interactions that do not factor through a shared split order, a leaf wise asymmetric learner such as LightGBM can reach the same accuracy with fewer trees. When CatBoost needs many more iterations than a competitor to match validation accuracy, the symmetric constraint is a plausible culprit.

Finally, like all gradient boosting, CatBoost extrapolates poorly. Predictions are piecewise constant and bounded by the leaf values seen in training, so a numeric feature pushed far outside its training range yields a flat, uninformed prediction rather than a sensible extrapolation. For genuinely extrapolative targets a parametric or additive model is the better tool.

114.7 7. Practical Use

114.7.1 7.1 The mature open-source tool

The reference implementation of these ideas is the CatBoost library itself, an Apache 2.0 licensed open-source project from Yandex, installable with pip install catboost. There is no reason to reimplement ordered boosting by hand: the published library is fast, GPU enabled, battle tested in production ranking systems, and exposes every mechanism described above through a clean API. This chapter teaches the method and then defers to that library, which is the right division of labor for an algorithm this intricate.

The package provides three primary estimators: CatBoostClassifier, CatBoostRegressor, and the ranking oriented CatBoostRanker, all built on the generic CatBoost base class. The scikit-learn compatible API means fit, predict, and predict_proba behave as expected, and the models drop into pipelines and cross validation utilities.

The most important practical point is how you declare categorical features. You pass column indices or names through cat_features and leave the raw string values in place. Do not pre encode them; CatBoost’s whole advantage is that it applies ordered target statistics internally. The Pool object bundles features, labels, and the categorical declaration into the efficient container CatBoost prefers, and it is the natural place to attach extras such as text features, per object weights, and group identifiers for ranking.

114.7.2 7.2 Key hyperparameters

CatBoost is known for sane defaults, but a few parameters reward attention.

iterations and learning_rate trade off in the usual way. A lower learning rate with more iterations and early stopping usually generalizes better. Set iterations generously and rely on od_type="Iter" with od_wait to stop when the validation metric stalls.

depth controls tree complexity. Because trees are oblivious, depth has an outsized effect; the default of 6 is a strong baseline, and values above 10 are rarely needed, since depth 10 already means $2^{10} = 1024$ leaves per tree and every extra level doubles that.

l2_leaf_reg is the L2 penalty on leaf values and is the main explicit regularizer. Increase it when the validation gap is wide.

boosting_type chooses Ordered or Plain as discussed; let CatBoost default it unless you have a small dataset and want to force Ordered.

For low cardinality features, one_hot_max_size sets the number of distinct values at or below which a categorical is one hot encoded rather than target encoded. Raising it applies one hot to slightly larger features, which can help when those features are not strongly predictive on their own.
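Collected in one place, a starting configuration might look like the fragment below. The values are illustrative assumptions rather than tuned recommendations; only the parameter names come from the discussion above.

```python
# Hypothetical starting point; the values are illustrative, not tuned.
params = {
    "iterations": 2000,       # generous budget; the overfitting detector stops early
    "learning_rate": 0.03,
    "depth": 6,               # 2**6 = 64 leaves per tree
    "l2_leaf_reg": 3.0,       # raise when the train/validation gap is wide
    "od_type": "Iter",        # stop after od_wait iterations without improvement
    "od_wait": 50,
    "one_hot_max_size": 4,    # categoricals with <= 4 distinct values are one hot encoded
}
# Intended use (requires the catboost package and an eval_set at fit time):
#   model = CatBoostClassifier(**params)
#   model.fit(train_pool, eval_set=valid_pool)
```

boosting_type is deliberately left out so that the library picks its own default.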

114.7.3 7.3 GPU training and large data

CatBoost has a mature GPU implementation enabled with task_type="GPU". The oblivious tree structure is particularly well suited to GPUs because the uniform per level split lets the histogram computation be laid out as dense regular work. On large datasets the speedup over CPU is substantial, and because plain boosting is selected automatically at scale the GPU path carries little statistical penalty.

114.7.4 7.4 Interpretation

CatBoost provides feature importances through model.get_feature_importance, including the standard prediction values change importance and the more rigorous LossFunctionChange importance, which measures the loss degradation when a feature is removed. For local explanations it supports SHAP values via type="ShapValues", giving per prediction attributions consistent with the SHAP framework. These tools matter because target encoded categoricals can otherwise be opaque.

114.7.5 7.5 A worked example across three languages

The example below builds a small, self contained fraud style classification problem: 4000 transactions with a high cardinality merchant column (200 levels, each carrying its own latent risk), two low cardinality categoricals (country, device), two numeric features (amount, age_days), and a planted country by device interaction. This is exactly the shape of data CatBoost is built for, and it lets the high cardinality categorical do real work without any manual encoding. Everything is generated inline with NumPy and a fixed seed, so the run is deterministic.

The Python tab executes. The Julia and Rust tabs show the equivalent calls in those ecosystems and are not run here.


import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(0)
n = 4000

# High-cardinality categorical: 200 merchants, each with a latent risk.
n_merch = 200
merch_risk = rng.normal(0, 1.3, size=n_merch)
merchant = rng.integers(0, n_merch, size=n)

# Low-cardinality categoricals with their own latent risks.
countries = np.array(["US", "CA", "GB", "DE", "IN", "BR"])
country_risk = {"US": -0.2, "CA": -0.1, "GB": 0.0, "DE": 0.1, "IN": 0.4, "BR": 0.5}
country = countries[rng.integers(0, len(countries), size=n)]

devices = np.array(["mobile", "desktop", "tablet"])
device_risk = {"mobile": 0.3, "desktop": -0.2, "tablet": 0.05}
device = devices[rng.integers(0, len(devices), size=n)]

# Numeric features.
amount = rng.lognormal(mean=3.0, sigma=1.0, size=n)
age_days = rng.integers(1, 2000, size=n)

# Latent logit with a planted country-by-device interaction, then a label.
inter = np.where((country == "BR") & (device == "mobile"), 0.8, 0.0)
logit = (
    merch_risk[merchant]
    + np.array([country_risk[c] for c in country])
    + np.array([device_risk[d] for d in device])
    + 0.45 * (np.log(amount) - 3.0)
    - 0.0008 * age_days
    + inter
    - 0.3
)
p = 1.0 / (1.0 + np.exp(-logit))
y = (rng.uniform(size=n) < p).astype(int)

df = pd.DataFrame({
    "merchant": merchant.astype(str),   # raw categorical, NOT pre-encoded
    "country": country,
    "device": device,
    "amount": amount,
    "age_days": age_days,
})
cat_features = ["merchant", "country", "device"]

X_tr, X_te, y_tr, y_te = train_test_split(
    df, y, test_size=0.25, random_state=42, stratify=y
)

train_pool = Pool(X_tr, y_tr, cat_features=cat_features)
test_pool = Pool(X_te, y_te, cat_features=cat_features)

model = CatBoostClassifier(
    iterations=400,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3.0,
    loss_function="Logloss",
    eval_metric="AUC",
    boosting_type="Ordered",   # ordered boosting; valuable on small data
    random_seed=42,
    verbose=False,
)
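# For brevity the test pool doubles as the evaluation set, so the best iteration
# is chosen on the same rows that are scored below; use a separate validation
# split in real work to keep the reported metrics honest.
model.fit(train_pool, eval_set=test_pool, use_best_model=True)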

proba = model.predict_proba(test_pool)[:, 1]
pred = model.predict(test_pool)

print(f"train rows: {len(X_tr)}, test rows: {len(X_te)}, features: {df.shape[1]}")
print(f"positive rate (train): {y_tr.mean():.3f}")
print(f"best iteration:        {model.get_best_iteration()}")
print(f"test AUC:              {roc_auc_score(y_te, proba):.4f}")
print(f"test accuracy:         {accuracy_score(y_te, pred):.4f}")

imp = model.get_feature_importance(test_pool, type="PredictionValuesChange")
order = np.argsort(imp)[::-1]
print("feature importance (PredictionValuesChange):")
for i in order:
    print(f"  {df.columns[i]:<10} {imp[i]:6.2f}")

train rows: 3000, test rows: 1000, features: 5
positive rate (train): 0.350
best iteration:        140
test AUC:              0.7528
test accuracy:         0.7200
feature importance (PredictionValuesChange):
  merchant    48.86
  age_days    17.74
  amount      17.35
  device       9.55
  country      6.50

The high cardinality merchant column dominates the importance ranking, which is the intended result: ordered target statistics turn 200 raw merchant labels into a leakage free risk signal without the column blowup of one hot encoding or the label leakage of naive target encoding. The numeric amount and age_days features and the low cardinality categoricals contribute the rest. The code sets no overfitting detector, so all 400 iterations are trained; use_best_model=True then truncates the ensemble at iteration 140, the best point on the evaluation set and well short of the budget.

# CatBoost.jl wraps the CatBoost Python package (via PythonCall.jl), so the API mirrors Python.
# Pkg.add("CatBoost"); Pkg.add("DataFrames")
using CatBoost
using DataFrames
using Random

Random.seed!(0)
n = 4000
n_merch = 200
merch_risk = randn(n_merch) .* 1.3
merchant = rand(0:n_merch-1, n)

countries = ["US", "CA", "GB", "DE", "IN", "BR"]
country_risk = Dict("US"=>-0.2, "CA"=>-0.1, "GB"=>0.0,
                    "DE"=>0.1, "IN"=>0.4, "BR"=>0.5)
country = countries[rand(1:length(countries), n)]

devices = ["mobile", "desktop", "tablet"]
device_risk = Dict("mobile"=>0.3, "desktop"=>-0.2, "tablet"=>0.05)
device = devices[rand(1:length(devices), n)]

amount = exp.(3.0 .+ randn(n))
age_days = rand(1:2000, n)

inter = [(country[i] == "BR" && device[i] == "mobile") ? 0.8 : 0.0 for i in 1:n]
logit = merch_risk[merchant .+ 1] .+
        [country_risk[c] for c in country] .+
        [device_risk[d] for d in device] .+
        0.45 .* (log.(amount) .- 3.0) .- 0.0008 .* age_days .+ inter .- 0.3
p = 1.0 ./ (1.0 .+ exp.(-logit))
y = Int.(rand(n) .< p)

df = DataFrame(merchant = string.(merchant), country = country,
               device = device, amount = amount, age_days = age_days)
cat_features = ["merchant", "country", "device"]

# Hold out the last 25% as a test set.
ntr = Int(round(0.75 * n))
train_pool = Pool(df[1:ntr, :], label = y[1:ntr], cat_features = cat_features)
test_pool  = Pool(df[ntr+1:n, :], label = y[ntr+1:n], cat_features = cat_features)

model = CatBoostClassifier(iterations = 400, learning_rate = 0.05,
                           depth = 6, l2_leaf_reg = 3.0,
                           loss_function = "Logloss", eval_metric = "AUC",
                           boosting_type = "Ordered", random_seed = 42,
                           verbose = false)
fit!(model, train_pool, eval_set = test_pool, use_best_model = true)

proba = predict_proba(model, test_pool)[:, 2]
println("best iteration: ", get_best_iteration(model))
println("test AUC:       ", eval_metrics(model, test_pool, ["AUC"]))
println("importances:    ", get_feature_importance(model))

// There is no mature pure-Rust CatBoost trainer. The honest option is the
// official CatBoost C API for INFERENCE of a model trained in Python/Julia,
// exposed through the `catboost` crate (thin FFI bindings, model.predict only).
// Training in Rust still goes through the C/C++ library, not idiomatic Rust.
//
// Cargo.toml:  catboost = "0.1"
use catboost::Model;

fn main() {
    // Load a model previously trained and saved with `model.save_model(...)`.
    let model = Model::load("fraud_catboost.cbm").expect("load model");

    // One transaction: numeric features then categorical features (as &str).
    // Order must match the training Pool's column layout.
    let numeric = vec![vec![42.0_f32, 365.0_f32]];          // amount, age_days
    let categorical = vec![vec!["57", "BR", "mobile"]];     // merchant, country, device

    let preds = model
        .calc_model_prediction(numeric, categorical)
        .expect("predict");
    println!("raw model output (logit): {:?}", preds);
    // Apply a logistic transform for a probability:
    let prob = 1.0 / (1.0 + (-preds[0]).exp());
    println!("fraud probability: {:.4}", prob);
}

For Rust the honest situation is that no mature crate trains CatBoost in pure Rust. The catboost crate provides FFI bindings to the official C API and covers fast model inference, so the production pattern is to train in Python or Julia and serve from Rust through these bindings.

114.7.6 7.6 When to reach for CatBoost

CatBoost is the strongest default when your data has many categorical columns, especially high cardinality ones such as user or item identifiers, and when the dataset is small enough that prediction shift and encoding leakage would otherwise hurt. It frequently wins with little tuning, which makes it an excellent baseline even when you intend to try LightGBM or XGBoost afterward. On purely numeric data with very large row counts, the gap narrows and the choice among the three libraries comes down to speed and tuning effort rather than the encoding machinery that distinguishes CatBoost. In every case, the discipline that makes the library work, using only the past to estimate the present, is the idea worth carrying to any modeling problem where leakage threatens.

114.8 References

Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. CatBoost: unbiased boosting with categorical features. NeurIPS 2018. https://arxiv.org/abs/1706.09516
Dorogush, A. V., Ershov, V., and Gulin, A. CatBoost: gradient boosting with categorical features support. 2018. https://arxiv.org/abs/1810.11363
CatBoost official documentation. https://catboost.ai/docs/
Micci-Barreca, D. A preprocessing scheme for high cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations, 2001. https://dl.acm.org/doi/10.1145/507533.507538
Friedman, J. H. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 2001. https://projecteuclid.org/journals/annals-of-statistics/volume-29/issue-5/Greedy-function-approximation-A-gradient-boosting-machine/10.1214/aos/1013203451.full
Lundberg, S. M. and Lee, S. A unified approach to interpreting model predictions. NeurIPS 2017. https://arxiv.org/abs/1705.07874
CatBoost GitHub repository. https://github.com/catboost/catboost
CatBoost.jl, Julia bindings to the CatBoost library. https://github.com/JuliaAI/CatBoost.jl
catboost-rs, Rust bindings to the CatBoost C API for model inference. https://github.com/catboost/catboost/tree/master/catboost/rust-package

# CatBoost: Ordered Boosting and Categorical Features Done Right CatBoost is a gradient boosting library developed at Yandex that addresses two problems most gradient boosting toolkits handle poorly: the statistical leakage introduced when categorical features are encoded with target information, and a subtler prediction shift that affects gradient boosted models in general. Its two signature ideas, ordered target statistics and ordered boosting, both rest on the same principle of using only the past to estimate the present. Combined with a fast oblivious tree learner and strong default hyperparameters, CatBoost is often the strongest out of the box choice on tabular data with high cardinality categorical columns. This chapter develops the theory behind these mechanisms, explains why the obvious alternatives leak information, and gives practical guidance for using the library effectively. ## 1. Why Categorical Features Are Hard ### 1.1 The encoding problem Tree based learners split on numeric thresholds. A categorical feature such as `user_id`, `merchant`, or `city` has no natural ordering, so it must be turned into numbers before a tree can use it. The classical options each have a failure mode. One hot encoding creates one binary column per category. For a feature with thousands of levels this explodes the feature space, dilutes the signal across many sparse columns, and makes splits shallow and weak. Label encoding assigns an arbitrary integer to each category, which imposes a meaningless ordering that the tree will happily exploit in misleading ways. A far more powerful approach is target encoding, also called target statistics. Replace each category with a statistic of the target computed over the rows that share that category. For a binary classification target $y \in \{0,1\}$ and a categorical value $x^i_k$ for feature $i$ in row $k$, the natural estimate is the category mean $$ \hat{x}^i_k = \frac{\sum_{j : x^i_j = x^i_k} y_j}{\sum_{j : x^i_j = x^i_k} 1}. 
$$ This is compact, scales to arbitrary cardinality, and captures exactly the predictive relationship we care about. It is also dangerously leaky. ### 1.2 Target leakage and the smoothing band aid The estimate above uses $y_k$, the label of the very row we are encoding, inside the numerator. A category that appears only once will be encoded with its own target value, so the feature becomes a perfect copy of the label on the training set and carries no information at test time. Even for categories that appear a handful of times, the encoded value is contaminated by the current row, producing an optimistic training signal that does not generalize. This is target leakage, and it is the central pathology CatBoost is built to remove. The standard mitigation is additive smoothing toward a prior $p$ with weight $a$: $$ \hat{x}^i_k = \frac{\sum_{j} \mathbb{1}[x^i_j = x^i_k]\, y_j + a\,p}{\sum_{j} \mathbb{1}[x^i_j = x^i_k] + a}. $$ Smoothing reduces variance for rare categories but does not eliminate leakage, because $y_k$ still sits in the sum. Holdout encoding, where the statistic is computed on a separate fold, removes leakage but throws away data and increases variance. CatBoost wants the leakage gone without sacrificing the training rows, and this is what ordered target statistics achieve. ## 2. Ordered Target Statistics ### 2.1 The ordering principle CatBoost borrows the idea behind online learning. Imagine the training examples arrive in a sequence. When we encode row $k$, we are only allowed to use the labels of rows that arrived earlier. By construction $y_k$ can never appear in its own encoding, so leakage is impossible. Concretely, CatBoost samples a random permutation $\sigma$ of the training set. Let $\mathcal{D}_k = \{ j : \sigma(j) < \sigma(k) \}$ be the set of rows preceding $k$ in that permutation. 
The ordered target statistic for a category is

$$
\hat{x}^i_k = \frac{\sum_{j \in \mathcal{D}_k} \mathbb{1}[x^i_j = x^i_k]\, y_j + a\, p}{\sum_{j \in \mathcal{D}_k} \mathbb{1}[x^i_j = x^i_k] + a}.
$$

The prior $p$ and weight $a$ still provide smoothing, which matters a great deal for the earliest rows in the permutation, where $\mathcal{D}_k$ is small or empty. For regression, $p$ is typically the global target mean; for classification, the prior is a tunable constant.

### 2.2 Why this is unbiased in expectation

The property that makes ordered statistics sound is that, for any fixed row, the expectation of its encoded value over random permutations matches the true category mean it is trying to estimate, while never including the row's own label. Each row sees a different prefix of history, so no single row's encoding is systematically inflated by its own target. The encoding behaves like a value computed on held out data, yet every training row still contributes to the encodings of rows that follow it. We get the leakage safety of holdout encoding and the data efficiency of full target encoding at the same time.

### 2.3 Multiple permutations

A single permutation has a drawback: rows near the front of the order have tiny histories and therefore high variance encodings, while rows near the back are stable. To keep this variance from biasing the model in a fixed direction, CatBoost samples several independent permutations and rotates among them across boosting iterations. Early rows in one permutation are late rows in another, so the noise averages out over the course of training. By default the library maintains a handful of permutations for this purpose.

### 2.4 Feature combinations

A major source of predictive power in tabular data is the interaction between categorical features, for example the pair (`country`, `device_type`). CatBoost constructs such combinations greedily during tree growth. At the root, no combinations exist.
When a categorical feature is used at a split, the library considers combining the categoricals already on the current path with all remaining categoricals, encoding each resulting combination with the same ordered target statistic machinery. This lets the model discover high order interactions without the analyst enumerating them, and the ordered encoding keeps these combinations leakage free as well.

## 3. Prediction Shift and Ordered Boosting

### 3.1 A leak hiding inside gradient boosting

Ordered statistics fix leakage in feature encoding. CatBoost's authors identified a second, more subtle leak that affects gradient boosting itself, independent of categorical features. They call the resulting bias prediction shift.

Recall the gradient boosting loop. At iteration $t$ we hold a model $F^{t-1}$. We compute the negative gradient of the loss at each training point,

$$
g_k = -\left.\frac{\partial L(y_k, s)}{\partial s}\right|_{s = F^{t-1}(x_k)},
$$

fit a new tree $h^t$ to approximate these gradients, and update $F^t = F^{t-1} + \eta\, h^t$.

The problem is that $g_k$ is computed using $F^{t-1}(x_k)$, and $F^{t-1}$ was itself trained on a dataset that included row $k$. The model has already seen $y_k$ indirectly, so the gradient at $x_k$ is not representative of the gradient the model would produce on an unseen point with the same features. The distribution of $g_k \mid x_k$ on the training set differs from the distribution on test data. Trees fit to these shifted gradients inherit the bias, and the effect compounds across iterations. This is exactly the same leakage pattern as target encoding, now hiding one level deeper.

### 3.2 The ordered boosting algorithm

The fix mirrors ordered statistics. Fix a permutation $\sigma$. To compute the gradient for row $k$, use a model that was trained only on the rows preceding $k$ in $\sigma$. Then $F^{t-1}$ as applied to $x_k$ has never been exposed to $y_k$, and the gradient is unbiased.
Maintained literally, this requires a separate model $M_j$ for every prefix length, which is quadratic in cost. The conceptual algorithm is:

```text
sample permutation sigma over n training rows
initialize models M_0 ... M_n to zero    # M_i only ever sees the first i rows of sigma
for t in 1 .. number_of_trees:
    for each row k:
        # gradient uses only rows before k in sigma
        g_k = gradient(y_k, M_{sigma(k)-1}(x_k))
    for each prefix length i in 1 .. n:
        # the update to M_i is fit only on the first i rows of sigma
        h_i = fit tree to {(x_j, g_j) : sigma(j) <= i}
        M_i += learning_rate * h_i
```

The key line is that the gradient for row $k$ is read from the model $M_{\sigma(k)-1}$, which was updated only by rows earlier in the permutation. No row contributes to its own gradient estimate.

### 3.3 Making it practical

The naive version keeps $n$ supporting models, which is infeasible. CatBoost approximates it by maintaining models for a geometric sequence of prefix lengths, roughly $\log n$ of them, so a row of rank $r$ reads its gradient from the supporting model whose prefix length is the largest power of two strictly below $r$, which keeps the row itself outside that prefix. This brings the overhead down to a constant factor while preserving the no self leakage property closely enough to remove most of the prediction shift. The same permutations used for ordered statistics are reused for ordered boosting, which keeps bookkeeping coherent.

### 3.4 Ordered versus plain mode

CatBoost exposes this choice through the `boosting_type` parameter. The value `Ordered` runs the algorithm above and gives the strongest defense against prediction shift; it is most valuable on small datasets, where overfitting from the leak is severe. The value `Plain` uses the classical gradient boosting update and is faster and lighter on memory; CatBoost selects it automatically for large datasets, where the shift is negligible because each model is trained on enough data that the contribution of any single row is vanishing.

## 4. Symmetric (Oblivious) Trees

### 4.1 Structure

Most boosting libraries grow asymmetric trees: each internal node chooses its own splitting feature and threshold. CatBoost instead grows oblivious trees, also called symmetric trees. In an oblivious tree, every node at the same depth uses the same split condition. A tree of depth $d$ is therefore defined by an ordered list of $d$ split conditions, and all $2^d$ leaves are reached by evaluating those same $d$ tests in order.

A depth 3 oblivious tree might be:

```text
level 0:  age > 30 ?
level 1:  country in {US, CA} ?   (same test for both children)
level 2:  purchases > 5 ?         (same test for all four nodes)
```

Each example produces a 3 bit index from the three test outcomes, and that index selects one of 8 leaves directly.

### 4.2 Why obliviousness helps

This structure is more constrained than a general tree, and the constraint acts as a regularizer. An oblivious tree cannot overfit a narrow region of feature space by growing a deep idiosyncratic branch there, which reduces variance and improves robustness, a good match for the small to medium tabular datasets where CatBoost shines.

The structure is also extremely fast to evaluate. Because the same $d$ comparisons apply to every example, scoring reduces to computing a $d$ bit index and a single lookup into a leaf array of length $2^d$. This is branch free and cache friendly. The leaf index can be assembled as

$$
\text{index} = \sum_{l=0}^{d-1} 2^{l}\, b_l, \qquad b_l = \mathbb{1}[\text{condition } l \text{ is true}],
$$

which vectorizes cleanly across a batch of examples. The result is prediction latency low enough for demanding serving environments, one reason CatBoost is popular in production ranking and recommendation systems.

### 4.3 The cost

Forcing one split per level is a real restriction. If the ideal model needs different features in different regions, an oblivious tree must spend extra depth or extra trees to express what an asymmetric tree captures in one branch.
CatBoost compensates with the ensemble: many shallow symmetric trees, typically depth 6, combine to represent complex functions, and the per tree regularization tends to win on the dataset sizes CatBoost targets. The tradeoff is favorable often enough that the symmetric tree is the default, though it is not universally optimal.

## 5. Complexity and Cost

It helps to know where the time and memory go. Let $n$ be the number of training rows, $f$ the number of features, $T$ the number of boosting iterations, and $d$ the tree depth. CatBoost bins each numeric feature into at most $B$ quantile buckets (controlled by `border_count`, default 254 on CPU), so a histogram for one feature has $B$ entries.

Growing one oblivious level scans every feature once to pick the single best split, costing $O(n f)$ to accumulate gradient histograms plus $O(f B)$ to score candidate borders. A tree of depth $d$ repeats this $d$ times, and the ensemble has $T$ trees, giving training time roughly

$$
O\!\left(T \, d \, (n f + f B)\right) \approx O(T\, d\, n f)
$$

when $n \gg B$, which is the usual regime. The leaf count per tree is $2^d$, so memory for the model itself is $O(T \, 2^d)$ and is the reason depth is kept small.

Ordered boosting adds a constant factor, not a worse asymptotic order. Instead of one model it maintains roughly $\log_2 n$ supporting models, and each row reads its gradient from one of them, so the per iteration cost grows by a factor of order $\log n$ in the worst bookkeeping but is engineered down to a small constant in practice. Ordered target statistics add an $O(\log n)$ factor per categorical encoding from maintaining running prefix sums in permutation order, multiplied by the number of permutations (a handful).

Prediction is the cheapest part: scoring one example costs $O(T \, d)$, a few hundred branch free comparisons and lookups, which is why oblivious forests serve at low latency.
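
To make the $O(T \, d)$ scoring cost concrete, here is a minimal NumPy sketch of oblivious tree evaluation built on the leaf index formula from Section 4.2. The function names `oblivious_leaf_index` and `predict_ensemble`, the `(feature_idx, thresholds, leaf_values)` tuple layout, and the toy numbers are assumptions made for illustration. This is not CatBoost's internal model format, and it handles only numeric `>` tests.

```python
import numpy as np

def oblivious_leaf_index(X, feature_idx, thresholds):
    """Return the leaf index of every row for one oblivious tree."""
    # bits[:, l] is 1 where the level-l test X[:, feature_idx[l]] > thresholds[l] holds.
    bits = (X[:, feature_idx] > thresholds).astype(np.int64)
    # index = sum_l 2**l * b_l, computed for the whole batch at once.
    weights = 1 << np.arange(len(feature_idx))
    return bits @ weights

def predict_ensemble(X, trees):
    """Sum leaf values over trees: d comparisons and one lookup per tree and row."""
    out = np.zeros(X.shape[0])
    for feature_idx, thresholds, leaf_values in trees:
        out += leaf_values[oblivious_leaf_index(X, feature_idx, thresholds)]
    return out

# Toy batch with hypothetical columns: age, purchases, is_us_or_ca (0/1).
X = np.array([[35.0, 7.0, 1.0],
              [25.0, 7.0, 0.0],
              [35.0, 2.0, 1.0]])

# Depth-3 tree mirroring Section 4.1: age > 30, is_us_or_ca > 0.5, purchases > 5.
tree_a = (np.array([0, 2, 1]), np.array([30.0, 0.5, 5.0]), np.arange(8, dtype=float))
# Depth-1 tree: purchases > 5.
tree_b = (np.array([1]), np.array([5.0]), np.array([10.0, 20.0]))

idx = oblivious_leaf_index(X, tree_a[0], tree_a[1])
scores = predict_ensemble(X, [tree_a, tree_b])
print(idx)     # [7 4 3]
print(scores)  # [27. 24. 13.]
```
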

The practical takeaways are that model size is exponential in depth, so depth should stay near 6, that ordered mode is a constant factor slower than plain mode rather than a different complexity class, and that the GPU path mainly accelerates the $O(n f)$ histogram pass that dominates training.

## 6. Failure Modes and Limits

CatBoost is robust, but it is not magic, and a few situations reliably trip it up.

The first is misdeclared categorical features. If you forget to list a string column in `cat_features`, CatBoost will typically refuse to train because it cannot parse the strings as numbers; if you pre encode a categorical into integers and then forget to declare it, the model treats those integers as an ordered numeric feature and the whole ordered statistics advantage evaporates. Declare categoricals as raw values and let the library encode them.

The second is the cost of ordered mode on large data. The supporting models and permutation bookkeeping carry real overhead, so on millions of rows `Ordered` boosting can be several times slower than `Plain` for a benefit that has shrunk to nothing, because each model now trains on so much data that any single row barely moves it. CatBoost defaults to `Plain` at scale for exactly this reason; overriding it back to `Ordered` on big data usually buys only a slower run.

The third is extreme cardinality combined with tiny support. A feature such as `transaction_id` that is nearly unique per row carries no generalizable signal, and even ordered encoding will mostly return the smoothed prior $p$ for it. Such features add noise and training cost; drop them rather than hoping the encoder rescues them.

The fourth is the oblivious tree's rigidity. On problems where the right model genuinely needs different features in different regions, for example data with strong, localized interactions that do not factor through a shared split order, a leaf wise asymmetric learner such as LightGBM can reach the same accuracy with fewer trees.
When CatBoost needs many more iterations than a competitor to match validation accuracy, the symmetric constraint is a plausible culprit.

Finally, like all gradient boosting, CatBoost extrapolates poorly. Predictions are piecewise constant and bounded by the leaf values seen in training, so a numeric feature pushed far outside its training range yields a flat, uninformed prediction rather than a sensible extrapolation. For genuinely extrapolative targets a parametric or additive model is the better tool.

## 7. Practical Use

### 7.1 The mature open-source tool

The reference implementation of these ideas is the CatBoost library itself, an Apache 2.0 licensed open-source project from Yandex, installable with `pip install catboost`. There is no reason to reimplement ordered boosting by hand: the published library is fast, GPU enabled, battle tested in production ranking systems, and exposes every mechanism described above through a clean API. This chapter teaches the method and then defers to that library, which is the right division of labor for an algorithm this intricate.

The package provides three primary estimators: `CatBoostClassifier`, `CatBoostRegressor`, and the ranking oriented `CatBoostRanker`, all built on the generic `CatBoost` class. The scikit-learn compatible API means `fit`, `predict`, and `predict_proba` behave as expected, and the models drop into pipelines and cross validation utilities.

The most important practical point is how you declare categorical features. You pass column indices or names through `cat_features` and leave the raw string values in place. Do not pre encode them; CatBoost's whole advantage is that it applies ordered target statistics internally. The `Pool` object bundles features, labels, and the categorical declaration into the efficient container CatBoost prefers, and it is required for some advanced features such as text features and custom feature weights.

### 7.2 Key hyperparameters

CatBoost is known for sane defaults, but a few parameters reward attention.

`iterations` and `learning_rate` trade off in the usual way. A lower learning rate with more iterations and early stopping usually generalizes better. Set `iterations` generously and rely on `od_type="Iter"` with `od_wait` to stop when the validation metric stalls.

`depth` controls tree complexity. Because trees are oblivious, depth has an outsized effect; the default of 6 is a strong baseline, and values above 10 are rarely needed because a tree of depth $d$ has $2^d$ leaves, so depth 10 already means 1024 leaves per tree.

`l2_leaf_reg` is the L2 penalty on leaf values and is the main explicit regularizer. Increase it when the gap between training and validation metrics is wide.

`boosting_type` chooses `Ordered` or `Plain` as discussed; let CatBoost default it unless you have a small dataset and want to force `Ordered`.

`one_hot_max_size` sets the cardinality threshold for one hot encoding: a categorical with at most that many distinct values is one hot encoded rather than target encoded. Raising it applies one hot encoding to slightly larger features, which can help when those features are not strongly predictive on their own.

### 7.3 GPU training and large data

CatBoost has a mature GPU implementation enabled with `task_type="GPU"`. The oblivious tree structure is particularly well suited to GPUs because the uniform per level split lets the histogram computation be laid out as dense regular work. On large datasets the speedup over CPU is substantial, and because plain boosting is selected automatically at scale the GPU path carries little statistical penalty.

### 7.4 Interpretation

CatBoost provides feature importances through `model.get_feature_importance`, including the standard `PredictionValuesChange` importance and the more rigorous `LossFunctionChange` importance, which measures the loss degradation when a feature is removed. For local explanations it supports SHAP values via `type="ShapValues"`, giving per prediction attributions consistent with the SHAP framework. These tools matter because target encoded categoricals can otherwise be opaque.

### 7.5 A worked example across three languages

The example below builds a small, self contained fraud style classification problem: 4000 transactions with a high cardinality `merchant` column (200 levels, each carrying its own latent risk), two low cardinality categoricals (`country`, `device`), two numeric features (`amount`, `age_days`), and a planted `country` by `device` interaction. This is exactly the shape of data CatBoost is built for, and it lets the high cardinality categorical do real work without any manual encoding. Everything is generated inline with NumPy and a fixed seed, so the run is deterministic.

The Python tab executes. The Julia and Rust tabs show the equivalent calls in those ecosystems and are not run here.

::: {.panel-tabset}

## Python

```{python}
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(0)
n = 4000

# High-cardinality categorical: 200 merchants, each with a latent risk.
n_merch = 200
merch_risk = rng.normal(0, 1.3, size=n_merch)
merchant = rng.integers(0, n_merch, size=n)

# Low-cardinality categoricals with their own latent risks.
countries = np.array(["US", "CA", "GB", "DE", "IN", "BR"])
country_risk = {"US": -0.2, "CA": -0.1, "GB": 0.0, "DE": 0.1, "IN": 0.4, "BR": 0.5}
country = countries[rng.integers(0, len(countries), size=n)]
devices = np.array(["mobile", "desktop", "tablet"])
device_risk = {"mobile": 0.3, "desktop": -0.2, "tablet": 0.05}
device = devices[rng.integers(0, len(devices), size=n)]

# Numeric features.
amount = rng.lognormal(mean=3.0, sigma=1.0, size=n)
age_days = rng.integers(1, 2000, size=n)

# Latent logit with a planted country-by-device interaction, then a label.
inter = np.where((country == "BR") & (device == "mobile"), 0.8, 0.0)
logit = (
    merch_risk[merchant]
    + np.array([country_risk[c] for c in country])
    + np.array([device_risk[d] for d in device])
    + 0.45 * (np.log(amount) - 3.0)
    - 0.0008 * age_days
    + inter
    - 0.3
)
p = 1.0 / (1.0 + np.exp(-logit))
y = (rng.uniform(size=n) < p).astype(int)

df = pd.DataFrame({
    "merchant": merchant.astype(str),  # raw categorical, NOT pre-encoded
    "country": country,
    "device": device,
    "amount": amount,
    "age_days": age_days,
})
cat_features = ["merchant", "country", "device"]

X_tr, X_te, y_tr, y_te = train_test_split(
    df, y, test_size=0.25, random_state=42, stratify=y
)
train_pool = Pool(X_tr, y_tr, cat_features=cat_features)
test_pool = Pool(X_te, y_te, cat_features=cat_features)

model = CatBoostClassifier(
    iterations=400,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3.0,
    loss_function="Logloss",
    eval_metric="AUC",
    boosting_type="Ordered",  # ordered boosting; valuable on small data
    random_seed=42,
    verbose=False,
)
model.fit(train_pool, eval_set=test_pool, use_best_model=True)

proba = model.predict_proba(test_pool)[:, 1]
pred = model.predict(test_pool)

print(f"train rows: {len(X_tr)}, test rows: {len(X_te)}, features: {df.shape[1]}")
print(f"positive rate (train): {y_tr.mean():.3f}")
print(f"best iteration: {model.get_best_iteration()}")
print(f"test AUC: {roc_auc_score(y_te, proba):.4f}")
print(f"test accuracy: {accuracy_score(y_te, pred):.4f}")

imp = model.get_feature_importance(test_pool, type="PredictionValuesChange")
order = np.argsort(imp)[::-1]
print("feature importance (PredictionValuesChange):")
for i in order:
    print(f"  {df.columns[i]:<10} {imp[i]:6.2f}")
```

The high cardinality `merchant` column dominates the importance ranking, which is the intended result: ordered target statistics turn 200 raw merchant labels into a clean leakage free risk signal that no one hot or naive target encoding could match.
The numeric `amount` and `age_days` features and the low cardinality categoricals contribute the rest. No overfitting detector is configured in this run, so all 400 iterations are trained; `use_best_model=True` then keeps only the trees up to the iteration with the best validation AUC, which is the value printed as the best iteration.

## Julia

```julia
# CatBoost.jl wraps the same native CatBoost library used by Python.
# Pkg.add("CatBoost"); Pkg.add("DataFrames")
using CatBoost
using DataFrames
using Random

Random.seed!(0)
n = 4000
n_merch = 200
merch_risk = randn(n_merch) .* 1.3
merchant = rand(0:n_merch-1, n)

countries = ["US", "CA", "GB", "DE", "IN", "BR"]
country_risk = Dict("US"=>-0.2, "CA"=>-0.1, "GB"=>0.0,
                    "DE"=>0.1, "IN"=>0.4, "BR"=>0.5)
country = countries[rand(1:length(countries), n)]
devices = ["mobile", "desktop", "tablet"]
device_risk = Dict("mobile"=>0.3, "desktop"=>-0.2, "tablet"=>0.05)
device = devices[rand(1:length(devices), n)]

amount = exp.(3.0 .+ randn(n))
age_days = rand(1:2000, n)

inter = [(country[i] == "BR" && device[i] == "mobile") ? 0.8 : 0.0 for i in 1:n]
logit = merch_risk[merchant .+ 1] .+
        [country_risk[c] for c in country] .+
        [device_risk[d] for d in device] .+
        0.45 .* (log.(amount) .- 3.0) .-
        0.0008 .* age_days .+
        inter .- 0.3
p = 1.0 ./ (1.0 .+ exp.(-logit))
y = Int.(rand(n) .< p)

df = DataFrame(merchant = string.(merchant), country = country,
               device = device, amount = amount, age_days = age_days)
cat_features = ["merchant", "country", "device"]

# Hold out the last 25% as a test set.
ntr = Int(round(0.75 * n))
train_pool = Pool(df[1:ntr, :], label = y[1:ntr], cat_features = cat_features)
test_pool = Pool(df[ntr+1:n, :], label = y[ntr+1:n], cat_features = cat_features)

model = CatBoostClassifier(iterations = 400, learning_rate = 0.05, depth = 6,
                           l2_leaf_reg = 3.0, loss_function = "Logloss",
                           eval_metric = "AUC", boosting_type = "Ordered",
                           random_seed = 42, verbose = false)
fit!(model, train_pool, eval_set = test_pool, use_best_model = true)

proba = predict_proba(model, test_pool)[:, 2]
println("best iteration: ", get_best_iteration(model))
println("test AUC: ", eval_metrics(model, test_pool, ["AUC"]))
println("importances: ", get_feature_importance(model))
```

## Rust

```rust
// There is no mature pure-Rust CatBoost trainer. The honest option is the
// official CatBoost C API for INFERENCE of a model trained in Python/Julia,
// exposed through the `catboost` crate (thin FFI bindings, model.predict only).
// Training in Rust still goes through the C/C++ library, not idiomatic Rust.
//
// Cargo.toml: catboost = "0.1"
use catboost::Model;

fn main() {
    // Load a model previously trained and saved with `model.save_model(...)`.
    let model = Model::load("fraud_catboost.cbm").expect("load model");

    // One transaction: numeric features then categorical features (as &str).
    // Order must match the training Pool's column layout.
    let numeric = vec![vec![42.0_f32, 365.0_f32]];      // amount, age_days
    let categorical = vec![vec!["57", "BR", "mobile"]]; // merchant, country, device

    let preds = model
        .calc_model_prediction(numeric, categorical)
        .expect("predict");
    println!("raw model output (logit): {:?}", preds);

    // Apply a logistic transform for a probability:
    let prob = 1.0 / (1.0 + (-preds[0]).exp());
    println!("fraud probability: {:.4}", prob);
}
```

For Rust the honest situation is that no mature crate trains CatBoost in pure Rust.
The `catboost` crate provides FFI bindings to the official C API and covers fast model inference, so the production pattern is to train in Python or Julia and serve from Rust through these bindings.

:::

### 7.6 When to reach for CatBoost

CatBoost is the strongest default when your data has many categorical columns, especially high cardinality ones such as user or item identifiers, and when the dataset is small enough that prediction shift and encoding leakage would otherwise hurt. It frequently wins with little tuning, which makes it an excellent baseline even when you intend to try LightGBM or XGBoost afterward. On purely numeric data with very large row counts, the gap narrows and the choice among the three libraries comes down to speed and tuning effort rather than the encoding machinery that distinguishes CatBoost. In every case, the discipline that makes the library work, using only the past to estimate the present, is the idea worth carrying to any modeling problem where leakage threatens.

## References

1. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. CatBoost: unbiased boosting with categorical features. NeurIPS 2018. https://arxiv.org/abs/1706.09516
2. Dorogush, A. V., Ershov, V., and Gulin, A. CatBoost: gradient boosting with categorical features support. 2018. https://arxiv.org/abs/1810.11363
3. CatBoost official documentation. https://catboost.ai/docs/
4. Micci-Barreca, D. A preprocessing scheme for high cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations, 2001. https://dl.acm.org/doi/10.1145/507533.507538
5. Friedman, J. H. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 2001. https://projecteuclid.org/journals/annals-of-statistics/volume-29/issue-5/Greedy-function-approximation-A-gradient-boosting-machine/10.1214/aos/1013203451.full
6. Lundberg, S. M. and Lee, S. A unified approach to interpreting model predictions. NeurIPS 2017. https://arxiv.org/abs/1705.07874
7. CatBoost GitHub repository. https://github.com/catboost/catboost
8. CatBoost.jl, Julia bindings to the CatBoost library. https://github.com/JuliaAI/CatBoost.jl
9. catboost-rs, Rust bindings to the CatBoost C API for model inference. https://github.com/catboost/catboost/tree/master/catboost/rust-package