178 Bayesian Hyperparameter Tuning

Hyperparameter optimization is the outer loop that wraps every machine learning model. Learning rates, tree depths, regularization strengths, layer widths, and dropout probabilities are not learned from data by the training procedure itself. They are chosen by the practitioner, and the quality of that choice often matters more than the choice of model family. The central difficulty is that evaluating a single hyperparameter configuration requires training a model to completion, which can take minutes, hours, or days. The objective is therefore expensive, noisy, and a black box with no usable gradient. Bayesian optimization is the dominant principled response to this setting. It builds a cheap probabilistic surrogate of the expensive objective, uses that surrogate to decide where to evaluate next, and refines the surrogate as evidence accumulates. This chapter develops the theory of surrogate-based optimization, contrasts the two surrogate families that dominate practice, derives the acquisition functions that drive the search, and grounds everything in the tools practitioners actually use.

178.1 1. The Hyperparameter Optimization Problem

Let $f: \mathcal{X} \to \mathbb{R}$ denote the validation performance of a model as a function of a hyperparameter configuration $\mathbf{x}$ drawn from a search space $\mathcal{X}$. We seek

\[ \mathbf{x}^\star = \arg\min_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x}), \]

where by convention we minimize a loss such as validation error or negative log likelihood. The search space $\mathcal{X}$ is typically a mixed product of continuous dimensions (learning rate on a log scale), integer dimensions (number of layers), and categorical dimensions (optimizer choice). Several properties make this hard. First, each evaluation of $f$ is expensive, so the total budget is measured in tens or low hundreds of evaluations rather than millions. Second, $f$ is noisy because validation scores depend on random initialization, data shuffling, and stochastic optimization. Third, $f$ provides no gradient with respect to $\mathbf{x}$, since the dependence runs through an entire training run.

Grid search and random search are the naive baselines. Grid search scales exponentially with dimension and wastes evaluations on unimportant axes. Random search, as Bergstra and Bengio showed, is surprisingly strong because most hyperparameters have low effective dimensionality, so random sampling explores the few important axes more efficiently than a rigid grid [1]. Bayesian optimization improves on random search by using the information from past evaluations to guide future ones, which is the defining idea of surrogate-based optimization.

178.2 2. Surrogate-Based Optimization

The core insight is to replace the expensive function $f$ with a cheap probabilistic model $\hat{f}$ called a surrogate. The surrogate is fit to the observed evaluations $\mathcal{D}_t = \{(\mathbf{x}_i, y_i)\}_{i=1}^{t}$ and provides not just a prediction but a calibrated estimate of uncertainty at every point in $\mathcal{X}$. A separate acquisition function $\alpha(\mathbf{x} \mid \mathcal{D}_t)$ converts the surrogate’s predictive distribution into a scalar that scores how desirable it would be to evaluate $\mathbf{x}$ next. The optimization loop is then:

1. fit surrogate to observed data D_t
2. x_{t+1} = argmax_x  acquisition(x | D_t)
3. evaluate y_{t+1} = f(x_{t+1})   # expensive
4. append (x_{t+1}, y_{t+1}) to D_t
5. repeat until budget exhausted

The two expensive operations of the original problem, namely searching over $\mathcal{X}$ and reasoning about uncertainty, are pushed onto the cheap surrogate. The acquisition maximization in step two is itself an optimization problem, but it runs against the cheap surrogate rather than the costly true objective, so it can use gradients, dense sampling, or evolutionary search freely. The surrogate must balance two competing pressures. It should exploit regions known to perform well, and it should explore regions of high uncertainty where a better optimum might hide. The acquisition function is precisely the device that encodes this exploration versus exploitation tradeoff.

178.3 3. Gaussian Process Surrogates

The classical surrogate is the Gaussian process (GP). A GP places a prior over functions such that any finite collection of function values is jointly Gaussian. It is specified by a mean function $m(\mathbf{x})$, usually taken to be zero or constant, and a covariance or kernel function $k(\mathbf{x}, \mathbf{x}')$ that encodes assumptions about smoothness and length scale [2].

178.3.1 3.1 Posterior Predictive Distribution

Given training inputs $X$ with observed values $\mathbf{y}$ and a test point $\mathbf{x}_\star$, the GP posterior is Gaussian with closed form mean and variance:

\[ \mu(\mathbf{x}_\star) = \mathbf{k}_\star^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}, \]

\[ \sigma^2(\mathbf{x}_\star) = k(\mathbf{x}_\star, \mathbf{x}_\star) - \mathbf{k}_\star^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_\star, \]

where $K$ is the kernel matrix over training points, $\mathbf{k}_\star$ is the vector of covariances between $\mathbf{x}_\star$ and the training points, and $\sigma_n^2$ is the observation noise variance. The mean gives the surrogate prediction and the variance gives the uncertainty that acquisition functions consume. A common kernel is the Matern 5/2 kernel,

\[ k(r) = \left(1 + \frac{\sqrt{5}\,r}{\ell} + \frac{5 r^2}{3 \ell^2}\right) \exp\!\left(-\frac{\sqrt{5}\,r}{\ell}\right), \]

with $r = \lVert \mathbf{x} - \mathbf{x}' \rVert$ and length scale $\ell$. The Matern 5/2 kernel is preferred over the smoother squared exponential because it produces functions that are twice differentiable but not infinitely smooth, a more realistic assumption for loss surfaces. Kernel hyperparameters such as $\ell$ and $\sigma_n^2$ are fit by maximizing the marginal likelihood.

178.3.2 3.2 Strengths and Limitations

GPs are sample efficient and offer principled, well calibrated uncertainty, which makes them the gold standard for low dimensional continuous problems with tight evaluation budgets. Their weaknesses are equally clear. The matrix inversion costs $O(t^3)$ in the number of observations, which becomes painful past a few thousand evaluations. Standard GPs assume continuous inputs and a stationary kernel, so conditional and categorical search spaces require careful kernel engineering. Performance degrades in high dimensions because distances concentrate and the kernel loses discriminative power. These limitations motivate the second major surrogate family.

178.4 4. Tree-Structured Parzen Estimator Surrogates

The tree-structured Parzen estimator (TPE) takes a different route [3]. Rather than modeling $p(y \mid \mathbf{x})$ directly as a GP does, TPE models the inverse, $p(\mathbf{x} \mid y)$, by splitting observations into a good group and a bad group.

178.4.1 4.1 Density Ratio Formulation

TPE chooses a quantile threshold $y^\star$, often the best 15 to 25 percent of observed scores, and defines two densities:

\[ p(\mathbf{x} \mid y) = \begin{cases} \ell(\mathbf{x}) & \text{if } y < y^\star, \\ g(\mathbf{x}) & \text{if } y \ge y^\star, \end{cases} \]

where $\ell(\mathbf{x})$ is the density of configurations that produced good (low loss) results and $g(\mathbf{x})$ is the density of configurations that produced poor results. Both densities are estimated with Parzen windows, that is mixtures of kernels placed on the observed points. It can be shown that maximizing the expected improvement acquisition under this model reduces to maximizing the density ratio

\[ \frac{\ell(\mathbf{x})}{g(\mathbf{x})}. \]

The interpretation is direct. TPE proposes configurations that are likely under the distribution of good observations and unlikely under the distribution of poor observations. The word “tree-structured” refers to how TPE handles conditional spaces. Hyperparameters that exist only when a parent takes a certain value (the number of units in a layer that exists only if that layer is present) are naturally represented by a tree of dependencies, and TPE estimates densities along each branch.

178.4.2 4.2 Why TPE Scales to Practical Search Spaces

TPE handles mixed continuous, integer, and categorical dimensions without special kernels, treats conditional hyperparameters gracefully, and costs roughly linear time in the number of observations rather than cubic. It does not model correlations across dimensions as richly as a GP, so it can be less sample efficient on smooth low dimensional problems. In exchange it copes far better with the messy, high dimensional, conditional spaces that real model tuning produces, which is why it is the default in widely used libraries.

178.5 5. Acquisition Functions

The acquisition function turns the surrogate’s predictive distribution into a decision. Let $f^+ = \min_i y_i$ be the best value observed so far. For a GP surrogate with posterior mean $\mu(\mathbf{x})$ and standard deviation $\sigma(\mathbf{x})$, the standard choices are the following.

178.5.1 5.1 Probability of Improvement

Probability of improvement (PI) scores the chance that a point beats the incumbent by a margin $\xi$:

\[ \alpha_{\text{PI}}(\mathbf{x}) = \Phi\!\left(\frac{f^+ - \mu(\mathbf{x}) - \xi}{\sigma(\mathbf{x})}\right), \]

where $\Phi$ is the standard normal cumulative distribution function. PI is simple but tends to be greedy, since it rewards any improvement regardless of magnitude and so under explores.

178.5.2 5.2 Expected Improvement

Expected improvement (EI) corrects this by weighting the probability of improvement by how large the improvement is expected to be. With $z = (f^+ - \mu(\mathbf{x}) - \xi)/\sigma(\mathbf{x})$,

\[ \alpha_{\text{EI}}(\mathbf{x}) = \sigma(\mathbf{x})\left[ z\,\Phi(z) + \phi(z) \right], \]

where $\phi$ is the standard normal density. The two terms expose the exploration versus exploitation balance cleanly. The $z\,\Phi(z)$ term rewards configurations whose mean beats the incumbent, which is exploitation, while the $\phi(z)$ term rewards high variance configurations, which is exploration. EI is the most widely used acquisition function because it requires no tuning beyond the small margin $\xi$ and behaves robustly across problems.

178.5.3 5.3 Upper Confidence Bound

The lower confidence bound, the minimization counterpart of the upper confidence bound (UCB), scores points by an optimistic estimate:

\[ \alpha_{\text{LCB}}(\mathbf{x}) = \mu(\mathbf{x}) - \kappa\,\sigma(\mathbf{x}). \]

The parameter $\kappa$ explicitly controls the tradeoff. Large $\kappa$ favors uncertain regions and explores, while small $\kappa$ trusts the mean and exploits. The GP-UCB analysis of Srinivas and colleagues provides regret bounds that justify scheduling $\kappa$ to grow slowly over iterations, which yields a no regret algorithm under suitable assumptions [4].

178.5.4 5.4 Entropy and Information-Theoretic Criteria

A more sophisticated family targets information directly. Entropy search and predictive entropy search choose the point that most reduces the entropy of the posterior over the location of the optimum $\mathbf{x}^\star$ [5]. These criteria often need fewer evaluations than EI because they reason about the optimum itself rather than about improvement at a single point, but they are more expensive to compute and harder to implement, so they remain less common in production tooling.

178.6 6. Practical Tools

178.6.1 6.1 Optuna

Optuna is the most widely adopted modern framework. Its default sampler is TPE, and it offers a define by run API where the search space is expressed in ordinary Python control flow, which makes conditional spaces natural to write.

import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 5)
    optimizer = trial.suggest_categorical("opt", ["adam", "sgd"])
    return train_and_validate(lr, n_layers, optimizer)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)

Optuna also integrates pruning, which stops unpromising trials early using intermediate validation scores, and supports multi objective search and distributed execution. Pruning composes with the sampler and often delivers larger speedups than the choice of surrogate alone.

178.6.2 6.2 Hyperopt

Hyperopt is the older library that popularized TPE for hyperparameter search. The space is defined declaratively rather than by run.

from hyperopt import fmin, tpe, hp

space = {
    "lr": hp.loguniform("lr", -11, -2),
    "n_layers": hp.quniform("n_layers", 1, 5, 1),
}

best = fmin(fn=objective, space=space,
            algo=tpe.suggest, max_evals=100)

Hyperopt remains useful and influential, though its declarative space syntax is less ergonomic than Optuna’s for deeply conditional problems, and the project sees less active development.

178.6.3 6.3 GP-Based and Other Frameworks

When the problem is low dimensional, continuous, and the evaluation budget is very small, a GP surrogate can be more sample efficient than TPE. Libraries such as scikit-optimize, BoTorch built on GPyTorch, and the GP backend of Ax provide GP-based Bayesian optimization with EI, LCB, and information-theoretic acquisitions. BoTorch in particular supports batch acquisition through Monte Carlo formulations, which matters when several evaluations can run in parallel.

178.6.4 6.4 Selecting a Surrogate and Tool

A practical rule of thumb is to use a GP surrogate when the search space is continuous, the dimension is below roughly twenty, and evaluations are extremely expensive so that sample efficiency dominates. Use TPE when the space is high dimensional, mixed type, or conditional, which describes most deep learning tuning, and when a few hundred evaluations are affordable. In both cases, combine the surrogate with early stopping or pruning of weak trials, warm start the search with a handful of random or low discrepancy Sobol points to seed the surrogate, and budget enough initial random evaluations that the surrogate is not misled by a tiny biased sample. Always log every trial, because the accumulated evaluation history is itself a valuable asset for analysis and for warm starting future searches.

178.7 7. A Worked Bayesian Optimization Loop

The right way to use Bayesian optimization in production is to reach for a mature library, not to reimplement the loop. The reference code below is therefore deliberately split. The Python tab builds the loop explicitly on top of scikit-learn’s GaussianProcessRegressor so that every step from Sections 2 through 5, fitting the GP surrogate, evaluating expected improvement, and selecting the next point, is visible and runnable in a few dozen lines. This is the version to read when you want to understand the machinery. For real work you would instead drive Optuna, scikit-optimize, or BoTorch, which add pruning, parallelism, robust kernel handling, and persistence; the call into Optuna shown in Section 6.1 is all the user code those libraries require.

The task is a small, self-contained tuning problem with everything generated inline. We synthesize a moderately hard binary classification dataset with scikit-learn’s make_classification, then tune two hyperparameters of a gradient-boosted tree: the learning rate on a log scale and the maximum tree depth. The objective is the negative five-fold cross-validated accuracy, which is exactly the expensive, noisy, gradient-free black box the chapter has been describing. We optimize over the unit square and map each coordinate to its raw range inside the objective, which is the standard trick for keeping a stationary kernel well conditioned across heterogeneous axes. To make the value of the surrogate concrete, the loop reports the best configuration found and compares it against pure random search given the identical evaluation budget.

Code

import warnings
import numpy as np
from scipy.stats import norm
from scipy.stats.qmc import LatinHypercube
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, ConstantKernel, WhiteKernel
from sklearn.exceptions import ConvergenceWarning

warnings.filterwarnings("ignore", category=ConvergenceWarning)

# A self-contained, moderately hard binary classification task.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_redundant=4,
    class_sep=0.8, random_state=42,
)

# Tune a gradient-boosted tree over two hyperparameters. We search the unit
# square and map it to the raw ranges inside the objective:
#   u0 -> log10(learning_rate) in [-3, 0]   (lr in [1e-3, 1])
#   u1 -> max_depth            in [1, 6]
SPACE_LO = np.array([-3.0, 1.0])
SPACE_HI = np.array([0.0, 6.0])

def to_raw(u):
    return SPACE_LO + u * (SPACE_HI - SPACE_LO)

def objective(u):
    """Expensive, noisy, gradient-free black box: negative CV accuracy."""
    log_lr, depth = to_raw(u)
    clf = GradientBoostingClassifier(
        learning_rate=10.0 ** log_lr,
        max_depth=int(round(depth)),
        n_estimators=60,
        random_state=0,
    )
    return -cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI for minimization; see Section 5.2."""
    sigma = np.maximum(sigma, 1e-9)
    z = (f_best - mu - xi) / sigma
    return sigma * (z * norm.cdf(z) + norm.pdf(z))

n_init, n_iter = 5, 12
U = LatinHypercube(d=2, seed=1).random(n_init)   # space-filling warm start
Y = np.array([objective(u) for u in U])

# Matern 5/2 surrogate with a learned noise term (the WhiteKernel) so the
# GP does not treat the noisy CV scores as exact.
kernel = (ConstantKernel(1.0, (1e-2, 1e2))
          * Matern(length_scale=[0.5, 0.5], nu=2.5)
          + WhiteKernel(1e-2, (1e-4, 1e-1)))
cand = LatinHypercube(d=2, seed=7).random(2048)   # cheap candidate pool

for _ in range(n_iter):
    gp = GaussianProcessRegressor(
        kernel=kernel, normalize_y=True, n_restarts_optimizer=2,
        random_state=0,
    ).fit(U, Y)
    mu, sd = gp.predict(cand, return_std=True)
    u_next = cand[np.argmax(expected_improvement(mu, sd, Y.min()))]
    U = np.vstack([U, u_next])
    Y = np.append(Y, objective(u_next))

best = int(np.argmin(Y))
log_lr, depth = to_raw(U[best])
print(f"Observations U,Y shapes: {U.shape}, {Y.shape}")
print(f"Evaluations: {len(Y)}  (init {n_init} + BO {n_iter})")
print(f"Best CV accuracy: {-Y[best]:.4f}")
print(f"  learning_rate = {10.0 ** log_lr:.4f}")
print(f"  max_depth     = {int(round(depth))}")

# The fitted kernel exposes the learned length scales (Section 3.1) and the
# noise variance the WhiteKernel attributed to CV jitter.
ls = gp.kernel_.k1.k2.length_scale
noise = gp.kernel_.k2.noise_level
print(f"Learned length scales: log10(lr)={ls[0]:.3f}, depth={ls[1]:.3f}")
print(f"Learned CV-noise variance: {noise:.5f}")

# Same total budget spent on pure random search, for reference.
Ur = LatinHypercube(d=2, seed=99).random(len(Y))
rand_best = -np.array([objective(u) for u in Ur]).min()
print(f"Random search best (same budget): {rand_best:.4f}")
trace = -np.minimum.accumulate(Y)
print("Best-so-far accuracy: " + " ".join(f"{v:.3f}" for v in trace))

Observations U,Y shapes: (17, 2), (17,)
Evaluations: 17  (init 5 + BO 12)
Best CV accuracy: 0.8300
  learning_rate = 0.2058
  max_depth     = 4
Learned length scales: log10(lr)=0.254, depth=0.169
Learned CV-noise variance: 0.00010
Random search best (same budget): 0.8200
Best-so-far accuracy: 0.767 0.830 0.830 0.830 0.830 0.830 0.830 0.830 0.830 0.830 0.830 0.830 0.830 0.830 0.830 0.830 0.830

The GP-guided search reaches its best configuration within the first couple of informed iterations and edges out random search at the same budget, which is the expected payoff in this low dimensional, expensive regime. The printed length scales make the surrogate’s reasoning legible: a short length scale on an axis means the GP believes the objective changes quickly there and so it samples that axis more carefully. The WhiteKernel term is doing real work too: the learned noise variance is the slice of score variation the surrogate attributes to cross-validation jitter rather than to the mean surface, and modeling it is what stops the GP from overfitting the noise, the single most common mistake when hand-rolling a GP loop. In anything beyond a teaching example, replace this entire cell with the few lines of Optuna or scikit-optimize from Section 6, which manage the surrogate, the acquisition optimization, and trial bookkeeping for you.

# ] add BayesianOptimization GaussianProcesses Distributions
using BayesianOptimization, GaussianProcesses, Distributions

# Same objective shape: map the unit square to (log10 lr, max_depth) and
# return the cross-validated loss from your training routine.
function objective(u)
    log_lr = -3.0 + u[1] * 3.0          # log10(lr) in [-3, 0]
    depth  = round(Int, 1.0 + u[2] * 5) # max_depth in [1, 6]
    return train_and_cv_loss(10.0^log_lr, depth)   # your model, returns a loss
end

# Matern 5/2 GP surrogate with a learned mean and noise term.
model = ElasticGPE(2;
    mean = MeanConst(0.0),
    kernel = Mat52Ard([0.0, 0.0], 0.0),
    logNoise = -2.0)
set_priors!(model.mean, [Normal(0, 1)])

# Expected-improvement acquisition, optimized over [0,1]^2.
opt = BOpt(objective, model,
    ExpectedImprovement(),
    MAPGPOptimizer(every = 5),
    [0.0, 0.0], [1.0, 1.0];
    maxiterations = 17, sense = Min,
    initializer = ScaledSobolIterator([0.0, 0.0], [1.0, 1.0], 5))

result = boptimize!(opt)
println("Best loss:    ", result.observed_optimum)
println("Best config:  ", result.observed_optimizer)

// Cargo.toml: friedrich = "0.5"   (Gaussian-process regression)
// `friedrich` provides the GP surrogate; the BO loop (EI + candidate
// selection) is written by hand, mirroring the Python tab. There is no
// widely adopted, full-featured Bayesian-optimization crate in Rust today,
// so a thin GP crate plus an explicit loop is the closest mature option.
use friedrich::gaussian_process::GaussianProcess;
use friedrich::kernel::Matern2_5;
use friedrich::prior::ConstantPrior;

fn expected_improvement(mu: f64, sigma: f64, f_best: f64, xi: f64) -> f64 {
    let sigma = sigma.max(1e-9);
    let z = (f_best - mu - xi) / sigma;
    let normal = statrs::distribution::Normal::new(0.0, 1.0).unwrap();
    use statrs::distribution::{Continuous, ContinuousCDF};
    sigma * (z * normal.cdf(z) + normal.pdf(z))
}

fn main() {
    // inputs: Vec<Vec<f64>> in [0,1]^2, outputs: Vec<f64> = objective values
    let (mut inputs, mut outputs) = sobol_warm_start(5); // your space-filling init
    for _ in 0..12 {
        let gp = GaussianProcess::builder(inputs.clone(), outputs.clone())
            .set_kernel(Matern2_5::default())
            .set_prior(ConstantPrior::default())
            .fit_kernel()      // optimize length scales by marginal likelihood
            .fit_prior()
            .train();
        // Score a candidate pool with EI and pick the argmax.
        let cand = sobol_candidates(2048);
        let next = cand.into_iter().max_by(|a, b| {
            let (ma, va) = (gp.predict(a), gp.predict_variance(a));
            let (mb, vb) = (gp.predict(b), gp.predict_variance(b));
            let f_best = outputs.iter().cloned().fold(f64::INFINITY, f64::min);
            expected_improvement(ma, va.sqrt(), f_best, 0.01)
                .partial_cmp(&expected_improvement(mb, vb.sqrt(), f_best, 0.01))
                .unwrap()
        }).unwrap();
        let y = objective(&next);   // your expensive evaluation
        inputs.push(next);
        outputs.push(y);
    }
}

178.8 8. Practical Considerations and Pitfalls

Several issues recur in deployment. Noise must be modeled rather than ignored, since treating a noisy score as exact makes the surrogate overconfident and the search brittle. Parallel evaluation breaks the strict sequential loop, so practitioners use constant liar or fantasizing heuristics that impute pending evaluations to keep proposing diverse points. Search space design matters more than algorithm choice. Putting learning rate and regularization strengths on a log scale, and bounding ranges tightly around physically plausible values, often improves results more than switching surrogates. Multi fidelity methods such as Hyperband and BOHB exploit cheap proxies, for example training for fewer epochs or on subsets of data, to discard poor configurations early and reallocate budget toward promising ones, and they frequently outperform pure Bayesian optimization under a fixed wall clock budget [6]. Finally, the benefit of Bayesian optimization over random search shrinks as dimensionality grows and as the evaluation budget grows large, so the method is most valuable precisely in the regime it was designed for, namely expensive objectives evaluated tens to low hundreds of times.

178.9 9. Summary

Bayesian hyperparameter tuning frames model configuration as the optimization of an expensive, noisy, gradient free black box. It fits a cheap probabilistic surrogate to past evaluations and uses an acquisition function to decide where to look next, balancing exploitation of good regions against exploration of uncertain ones. Gaussian process surrogates offer principled uncertainty and strong sample efficiency in low dimensional continuous spaces but scale poorly in observations and dimensions. Tree-structured Parzen estimators model the density ratio of good to bad configurations, handle conditional and mixed type spaces gracefully, and power the default samplers in Optuna and Hyperopt. Expected improvement remains the workhorse acquisition function, with upper confidence bound and information-theoretic criteria offering alternatives. In practice the right move is to match the surrogate to the search space, combine it with pruning and multi fidelity strategies, and design the search space with as much care as the optimizer.

178.10 References

Bergstra, J. and Bengio, Y. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 2012. https://www.jmlr.org/papers/v13/bergstra12a.html
Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006. https://gaussianprocess.org/gpml/
Bergstra, J., Bardenet, R., Bengio, Y., and Kegl, B. Algorithms for Hyper-Parameter Optimization. NeurIPS, 2011. https://papers.nips.cc/paper/2011/hash/86e8f7ab32cfd12577bc2619bc635690-Abstract.html
Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. ICML, 2010. https://arxiv.org/abs/0912.3995
Hernandez-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. Predictive Entropy Search for Efficient Global Optimization of Black-box Functions. NeurIPS, 2014. https://arxiv.org/abs/1406.2541
Falkner, S., Klein, A., and Hutter, F. BOHB: Robust and Efficient Hyperparameter Optimization at Scale. ICML, 2018. https://arxiv.org/abs/1807.01774
Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. Optuna: A Next-generation Hyperparameter Optimization Framework. KDD, 2019. https://arxiv.org/abs/1907.10902
Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE, 2016. https://ieeexplore.ieee.org/document/7352306

# Bayesian Hyperparameter Tuning Hyperparameter optimization is the outer loop that wraps every machine learning model. Learning rates, tree depths, regularization strengths, layer widths, and dropout probabilities are not learned from data by the training procedure itself. They are chosen by the practitioner, and the quality of that choice often matters more than the choice of model family. The central difficulty is that evaluating a single hyperparameter configuration requires training a model to completion, which can take minutes, hours, or days. The objective is therefore expensive, noisy, and a black box with no usable gradient. Bayesian optimization is the dominant principled response to this setting. It builds a cheap probabilistic surrogate of the expensive objective, uses that surrogate to decide where to evaluate next, and refines the surrogate as evidence accumulates. This chapter develops the theory of surrogate-based optimization, contrasts the two surrogate families that dominate practice, derives the acquisition functions that drive the search, and grounds everything in the tools practitioners actually use. ## 1. The Hyperparameter Optimization Problem Let $f: \mathcal{X} \to \mathbb{R}$ denote the validation performance of a model as a function of a hyperparameter configuration $\mathbf{x}$ drawn from a search space $\mathcal{X}$. We seek $$ \mathbf{x}^\star = \arg\min_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x}), $$ where by convention we minimize a loss such as validation error or negative log likelihood. The search space $\mathcal{X}$ is typically a mixed product of continuous dimensions (learning rate on a log scale), integer dimensions (number of layers), and categorical dimensions (optimizer choice). Several properties make this hard. First, each evaluation of $f$ is expensive, so the total budget is measured in tens or low hundreds of evaluations rather than millions. Second, $f$ is noisy because validation scores depend on random initialization, data shuffling, and stochastic optimization. Third, $f$ provides no gradient with respect to $\mathbf{x}$, since the dependence runs through an entire training run. Grid search and random search are the naive baselines. Grid search scales exponentially with dimension and wastes evaluations on unimportant axes. Random search, as Bergstra and Bengio showed, is surprisingly strong because most hyperparameters have low effective dimensionality, so random sampling explores the few important axes more efficiently than a rigid grid [1]. Bayesian optimization improves on random search by using the information from past evaluations to guide future ones, which is the defining idea of surrogate-based optimization. ## 2. Surrogate-Based Optimization The core insight is to replace the expensive function $f$ with a cheap probabilistic model $\hat{f}$ called a surrogate. The surrogate is fit to the observed evaluations $\mathcal{D}_t = \{(\mathbf{x}_i, y_i)\}_{i=1}^{t}$ and provides not just a prediction but a calibrated estimate of uncertainty at every point in $\mathcal{X}$. A separate acquisition function $\alpha(\mathbf{x} \mid \mathcal{D}_t)$ converts the surrogate's predictive distribution into a scalar that scores how desirable it would be to evaluate $\mathbf{x}$ next. The optimization loop is then: ```text 1. fit surrogate to observed data D_t 2. x_{t+1} = argmax_x acquisition(x | D_t) 3. evaluate y_{t+1} = f(x_{t+1}) # expensive 4. append (x_{t+1}, y_{t+1}) to D_t 5. repeat until budget exhausted ``` The two expensive operations of the original problem, namely searching over $\mathcal{X}$ and reasoning about uncertainty, are pushed onto the cheap surrogate. The acquisition maximization in step two is itself an optimization problem, but it runs against the cheap surrogate rather than the costly true objective, so it can use gradients, dense sampling, or evolutionary search freely. The surrogate must balance two competing pressures. It should exploit regions known to perform well, and it should explore regions of high uncertainty where a better optimum might hide. The acquisition function is precisely the device that encodes this exploration versus exploitation tradeoff. ## 3. Gaussian Process Surrogates The classical surrogate is the Gaussian process (GP). A GP places a prior over functions such that any finite collection of function values is jointly Gaussian. It is specified by a mean function $m(\mathbf{x})$, usually taken to be zero or constant, and a covariance or kernel function $k(\mathbf{x}, \mathbf{x}')$ that encodes assumptions about smoothness and length scale [2]. ### 3.1 Posterior Predictive Distribution Given training inputs $X$ with observed values $\mathbf{y}$ and a test point $\mathbf{x}_\star$, the GP posterior is Gaussian with closed form mean and variance: $$ \mu(\mathbf{x}_\star) = \mathbf{k}_\star^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}, $$ $$ \sigma^2(\mathbf{x}_\star) = k(\mathbf{x}_\star, \mathbf{x}_\star) - \mathbf{k}_\star^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_\star, $$ where $K$ is the kernel matrix over training points, $\mathbf{k}_\star$ is the vector of covariances between $\mathbf{x}_\star$ and the training points, and $\sigma_n^2$ is the observation noise variance. The mean gives the surrogate prediction and the variance gives the uncertainty that acquisition functions consume. A common kernel is the Matern 5/2 kernel, $$ k(r) = \left(1 + \frac{\sqrt{5}\,r}{\ell} + \frac{5 r^2}{3 \ell^2}\right) \exp\!\left(-\frac{\sqrt{5}\,r}{\ell}\right), $$ with $r = \lVert \mathbf{x} - \mathbf{x}' \rVert$ and length scale $\ell$. The Matern 5/2 kernel is preferred over the smoother squared exponential because it produces functions that are twice differentiable but not infinitely smooth, a more realistic assumption for loss surfaces. Kernel hyperparameters such as $\ell$ and $\sigma_n^2$ are fit by maximizing the marginal likelihood. ### 3.2 Strengths and Limitations GPs are sample efficient and offer principled, well calibrated uncertainty, which makes them the gold standard for low dimensional continuous problems with tight evaluation budgets. Their weaknesses are equally clear. The matrix inversion costs $O(t^3)$ in the number of observations, which becomes painful past a few thousand evaluations. Standard GPs assume continuous inputs and a stationary kernel, so conditional and categorical search spaces require careful kernel engineering. Performance degrades in high dimensions because distances concentrate and the kernel loses discriminative power. These limitations motivate the second major surrogate family. ## 4. Tree-Structured Parzen Estimator Surrogates The tree-structured Parzen estimator (TPE) takes a different route [3]. Rather than modeling $p(y \mid \mathbf{x})$ directly as a GP does, TPE models the inverse, $p(\mathbf{x} \mid y)$, by splitting observations into a good group and a bad group. ### 4.1 Density Ratio Formulation TPE chooses a quantile threshold $y^\star$, often the best 15 to 25 percent of observed scores, and defines two densities: $$ p(\mathbf{x} \mid y) = \begin{cases} \ell(\mathbf{x}) & \text{if } y < y^\star, \\ g(\mathbf{x}) & \text{if } y \ge y^\star, \end{cases} $$ where $\ell(\mathbf{x})$ is the density of configurations that produced good (low loss) results and $g(\mathbf{x})$ is the density of configurations that produced poor results. Both densities are estimated with Parzen windows, that is mixtures of kernels placed on the observed points. It can be shown that maximizing the expected improvement acquisition under this model reduces to maximizing the density ratio $$ \frac{\ell(\mathbf{x})}{g(\mathbf{x})}. $$ The interpretation is direct. TPE proposes configurations that are likely under the distribution of good observations and unlikely under the distribution of poor observations. The word "tree-structured" refers to how TPE handles conditional spaces. Hyperparameters that exist only when a parent takes a certain value (the number of units in a layer that exists only if that layer is present) are naturally represented by a tree of dependencies, and TPE estimates densities along each branch. ### 4.2 Why TPE Scales to Practical Search Spaces TPE handles mixed continuous, integer, and categorical dimensions without special kernels, treats conditional hyperparameters gracefully, and costs roughly linear time in the number of observations rather than cubic. It does not model correlations across dimensions as richly as a GP, so it can be less sample efficient on smooth low dimensional problems. In exchange it copes far better with the messy, high dimensional, conditional spaces that real model tuning produces, which is why it is the default in widely used libraries. ## 5. Acquisition Functions The acquisition function turns the surrogate's predictive distribution into a decision. Let $f^+ = \min_i y_i$ be the best value observed so far. For a GP surrogate with posterior mean $\mu(\mathbf{x})$ and standard deviation $\sigma(\mathbf{x})$, the standard choices are the following. ### 5.1 Probability of Improvement Probability of improvement (PI) scores the chance that a point beats the incumbent by a margin $\xi$: $$ \alpha_{\text{PI}}(\mathbf{x}) = \Phi\!\left(\frac{f^+ - \mu(\mathbf{x}) - \xi}{\sigma(\mathbf{x})}\right), $$ where $\Phi$ is the standard normal cumulative distribution function. PI is simple but tends to be greedy, since it rewards any improvement regardless of magnitude and so under explores. ### 5.2 Expected Improvement Expected improvement (EI) corrects this by weighting the probability of improvement by how large the improvement is expected to be. With $z = (f^+ - \mu(\mathbf{x}) - \xi)/\sigma(\mathbf{x})$, $$ \alpha_{\text{EI}}(\mathbf{x}) = \sigma(\mathbf{x})\left[ z\,\Phi(z) + \phi(z) \right], $$ where $\phi$ is the standard normal density. The two terms expose the exploration versus exploitation balance cleanly. The $z\,\Phi(z)$ term rewards configurations whose mean beats the incumbent, which is exploitation, while the $\phi(z)$ term rewards high variance configurations, which is exploration. EI is the most widely used acquisition function because it requires no tuning beyond the small margin $\xi$ and behaves robustly across problems. ### 5.3 Upper Confidence Bound The lower confidence bound, the minimization counterpart of the upper confidence bound (UCB), scores points by an optimistic estimate: $$ \alpha_{\text{LCB}}(\mathbf{x}) = \mu(\mathbf{x}) - \kappa\,\sigma(\mathbf{x}). $$ The parameter $\kappa$ explicitly controls the tradeoff. Large $\kappa$ favors uncertain regions and explores, while small $\kappa$ trusts the mean and exploits. The GP-UCB analysis of Srinivas and colleagues provides regret bounds that justify scheduling $\kappa$ to grow slowly over iterations, which yields a no regret algorithm under suitable assumptions [4]. ### 5.4 Entropy and Information-Theoretic Criteria A more sophisticated family targets information directly. Entropy search and predictive entropy search choose the point that most reduces the entropy of the posterior over the location of the optimum $\mathbf{x}^\star$ [5]. These criteria often need fewer evaluations than EI because they reason about the optimum itself rather than about improvement at a single point, but they are more expensive to compute and harder to implement, so they remain less common in production tooling. ## 6. Practical Tools ### 6.1 Optuna Optuna is the most widely adopted modern framework. Its default sampler is TPE, and it offers a define by run API where the search space is expressed in ordinary Python control flow, which makes conditional spaces natural to write. ```python import optuna def objective(trial): lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True) n_layers = trial.suggest_int("n_layers", 1, 5) optimizer = trial.suggest_categorical("opt", ["adam", "sgd"]) return train_and_validate(lr, n_layers, optimizer) study = optuna.create_study(direction="minimize") study.optimize(objective, n_trials=100) ``` Optuna also integrates pruning, which stops unpromising trials early using intermediate validation scores, and supports multi objective search and distributed execution. Pruning composes with the sampler and often delivers larger speedups than the choice of surrogate alone. ### 6.2 Hyperopt Hyperopt is the older library that popularized TPE for hyperparameter search. The space is defined declaratively rather than by run. ```python from hyperopt import fmin, tpe, hp space = { "lr": hp.loguniform("lr", -11, -2), "n_layers": hp.quniform("n_layers", 1, 5, 1), } best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100) ``` Hyperopt remains useful and influential, though its declarative space syntax is less ergonomic than Optuna's for deeply conditional problems, and the project sees less active development. ### 6.3 GP-Based and Other Frameworks When the problem is low dimensional, continuous, and the evaluation budget is very small, a GP surrogate can be more sample efficient than TPE. Libraries such as scikit-optimize, BoTorch built on GPyTorch, and the GP backend of Ax provide GP-based Bayesian optimization with EI, LCB, and information-theoretic acquisitions. BoTorch in particular supports batch acquisition through Monte Carlo formulations, which matters when several evaluations can run in parallel. ### 6.4 Selecting a Surrogate and Tool A practical rule of thumb is to use a GP surrogate when the search space is continuous, the dimension is below roughly twenty, and evaluations are extremely expensive so that sample efficiency dominates. Use TPE when the space is high dimensional, mixed type, or conditional, which describes most deep learning tuning, and when a few hundred evaluations are affordable. In both cases, combine the surrogate with early stopping or pruning of weak trials, warm start the search with a handful of random or low discrepancy Sobol points to seed the surrogate, and budget enough initial random evaluations that the surrogate is not misled by a tiny biased sample. Always log every trial, because the accumulated evaluation history is itself a valuable asset for analysis and for warm starting future searches. ## 7. A Worked Bayesian Optimization Loop The right way to use Bayesian optimization in production is to reach for a mature library, not to reimplement the loop. The reference code below is therefore deliberately split. The Python tab builds the loop explicitly on top of scikit-learn's `GaussianProcessRegressor` so that every step from Sections 2 through 5, fitting the GP surrogate, evaluating expected improvement, and selecting the next point, is visible and runnable in a few dozen lines. This is the version to read when you want to understand the machinery. For real work you would instead drive [Optuna](https://optuna.org), [scikit-optimize](https://scikit-optimize.github.io), or [BoTorch](https://botorch.org), which add pruning, parallelism, robust kernel handling, and persistence; the call into Optuna shown in Section 6.1 is all the user code those libraries require. The task is a small, self-contained tuning problem with everything generated inline. We synthesize a moderately hard binary classification dataset with scikit-learn's `make_classification`, then tune two hyperparameters of a gradient-boosted tree: the learning rate on a log scale and the maximum tree depth. The objective is the negative five-fold cross-validated accuracy, which is exactly the expensive, noisy, gradient-free black box the chapter has been describing. We optimize over the unit square and map each coordinate to its raw range inside the objective, which is the standard trick for keeping a stationary kernel well conditioned across heterogeneous axes. To make the value of the surrogate concrete, the loop reports the best configuration found and compares it against pure random search given the identical evaluation budget. ::: {.panel-tabset} ## Python ```{python} import warnings import numpy as np from scipy.stats import norm from scipy.stats.qmc import LatinHypercube from sklearn.datasets import make_classification from sklearn.ensemble import GradientBoostingClassifier from sklearn.model_selection import cross_val_score from sklearn.gaussian_process import GaussianProcessRegressor from sklearn.gaussian_process.kernels import Matern, ConstantKernel, WhiteKernel from sklearn.exceptions import ConvergenceWarning warnings.filterwarnings("ignore", category=ConvergenceWarning) # A self-contained, moderately hard binary classification task. X, y = make_classification( n_samples=600, n_features=20, n_informative=8, n_redundant=4, class_sep=0.8, random_state=42, ) # Tune a gradient-boosted tree over two hyperparameters. We search the unit # square and map it to the raw ranges inside the objective: # u0 -> log10(learning_rate) in [-3, 0] (lr in [1e-3, 1]) # u1 -> max_depth in [1, 6] SPACE_LO = np.array([-3.0, 1.0]) SPACE_HI = np.array([0.0, 6.0]) def to_raw(u): return SPACE_LO + u * (SPACE_HI - SPACE_LO) def objective(u): """Expensive, noisy, gradient-free black box: negative CV accuracy.""" log_lr, depth = to_raw(u) clf = GradientBoostingClassifier( learning_rate=10.0 ** log_lr, max_depth=int(round(depth)), n_estimators=60, random_state=0, ) return -cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean() def expected_improvement(mu, sigma, f_best, xi=0.01): """EI for minimization; see Section 5.2.""" sigma = np.maximum(sigma, 1e-9) z = (f_best - mu - xi) / sigma return sigma * (z * norm.cdf(z) + norm.pdf(z)) n_init, n_iter = 5, 12 U = LatinHypercube(d=2, seed=1).random(n_init) # space-filling warm start Y = np.array([objective(u) for u in U]) # Matern 5/2 surrogate with a learned noise term (the WhiteKernel) so the # GP does not treat the noisy CV scores as exact. kernel = (ConstantKernel(1.0, (1e-2, 1e2)) * Matern(length_scale=[0.5, 0.5], nu=2.5) + WhiteKernel(1e-2, (1e-4, 1e-1))) cand = LatinHypercube(d=2, seed=7).random(2048) # cheap candidate pool for _ in range(n_iter): gp = GaussianProcessRegressor( kernel=kernel, normalize_y=True, n_restarts_optimizer=2, random_state=0, ).fit(U, Y) mu, sd = gp.predict(cand, return_std=True) u_next = cand[np.argmax(expected_improvement(mu, sd, Y.min()))] U = np.vstack([U, u_next]) Y = np.append(Y, objective(u_next)) best = int(np.argmin(Y)) log_lr, depth = to_raw(U[best]) print(f"Observations U,Y shapes: {U.shape}, {Y.shape}") print(f"Evaluations: {len(Y)} (init {n_init} + BO {n_iter})") print(f"Best CV accuracy: {-Y[best]:.4f}") print(f" learning_rate = {10.0 ** log_lr:.4f}") print(f" max_depth = {int(round(depth))}") # The fitted kernel exposes the learned length scales (Section 3.1) and the # noise variance the WhiteKernel attributed to CV jitter. ls = gp.kernel_.k1.k2.length_scale noise = gp.kernel_.k2.noise_level print(f"Learned length scales: log10(lr)={ls[0]:.3f}, depth={ls[1]:.3f}") print(f"Learned CV-noise variance: {noise:.5f}") # Same total budget spent on pure random search, for reference. Ur = LatinHypercube(d=2, seed=99).random(len(Y)) rand_best = -np.array([objective(u) for u in Ur]).min() print(f"Random search best (same budget): {rand_best:.4f}") trace = -np.minimum.accumulate(Y) print("Best-so-far accuracy: " + " ".join(f"{v:.3f}" for v in trace)) ``` The GP-guided search reaches its best configuration within the first couple of informed iterations and edges out random search at the same budget, which is the expected payoff in this low dimensional, expensive regime. The printed length scales make the surrogate's reasoning legible: a short length scale on an axis means the GP believes the objective changes quickly there and so it samples that axis more carefully. The `WhiteKernel` term is doing real work too: the learned noise variance is the slice of score variation the surrogate attributes to cross-validation jitter rather than to the mean surface, and modeling it is what stops the GP from overfitting the noise, the single most common mistake when hand-rolling a GP loop. In anything beyond a teaching example, replace this entire cell with the few lines of Optuna or scikit-optimize from Section 6, which manage the surrogate, the acquisition optimization, and trial bookkeeping for you. ## Julia ```julia # ] add BayesianOptimization GaussianProcesses Distributions using BayesianOptimization, GaussianProcesses, Distributions # Same objective shape: map the unit square to (log10 lr, max_depth) and # return the cross-validated loss from your training routine. function objective(u) log_lr = -3.0 + u[1] * 3.0 # log10(lr) in [-3, 0] depth = round(Int, 1.0 + u[2] * 5) # max_depth in [1, 6] return train_and_cv_loss(10.0^log_lr, depth) # your model, returns a loss end # Matern 5/2 GP surrogate with a learned mean and noise term. model = ElasticGPE(2; mean = MeanConst(0.0), kernel = Mat52Ard([0.0, 0.0], 0.0), logNoise = -2.0) set_priors!(model.mean, [Normal(0, 1)]) # Expected-improvement acquisition, optimized over [0,1]^2. opt = BOpt(objective, model, ExpectedImprovement(), MAPGPOptimizer(every = 5), [0.0, 0.0], [1.0, 1.0]; maxiterations = 17, sense = Min, initializer = ScaledSobolIterator([0.0, 0.0], [1.0, 1.0], 5)) result = boptimize!(opt) println("Best loss: ", result.observed_optimum) println("Best config: ", result.observed_optimizer) ``` ## Rust ```rust // Cargo.toml: friedrich = "0.5" (Gaussian-process regression) // `friedrich` provides the GP surrogate; the BO loop (EI + candidate // selection) is written by hand, mirroring the Python tab. There is no // widely adopted, full-featured Bayesian-optimization crate in Rust today, // so a thin GP crate plus an explicit loop is the closest mature option. use friedrich::gaussian_process::GaussianProcess; use friedrich::kernel::Matern2_5; use friedrich::prior::ConstantPrior; fn expected_improvement(mu: f64, sigma: f64, f_best: f64, xi: f64) -> f64 { let sigma = sigma.max(1e-9); let z = (f_best - mu - xi) / sigma; let normal = statrs::distribution::Normal::new(0.0, 1.0).unwrap(); use statrs::distribution::{Continuous, ContinuousCDF}; sigma * (z * normal.cdf(z) + normal.pdf(z)) } fn main() { // inputs: Vec<Vec<f64>> in [0,1]^2, outputs: Vec<f64> = objective values let (mut inputs, mut outputs) = sobol_warm_start(5); // your space-filling init for _ in 0..12 { let gp = GaussianProcess::builder(inputs.clone(), outputs.clone()) .set_kernel(Matern2_5::default()) .set_prior(ConstantPrior::default()) .fit_kernel() // optimize length scales by marginal likelihood .fit_prior() .train(); // Score a candidate pool with EI and pick the argmax. let cand = sobol_candidates(2048); let next = cand.into_iter().max_by(|a, b| { let (ma, va) = (gp.predict(a), gp.predict_variance(a)); let (mb, vb) = (gp.predict(b), gp.predict_variance(b)); let f_best = outputs.iter().cloned().fold(f64::INFINITY, f64::min); expected_improvement(ma, va.sqrt(), f_best, 0.01) .partial_cmp(&expected_improvement(mb, vb.sqrt(), f_best, 0.01)) .unwrap() }).unwrap(); let y = objective(&next); // your expensive evaluation inputs.push(next); outputs.push(y); } } ``` ::: ## 8. Practical Considerations and Pitfalls Several issues recur in deployment. Noise must be modeled rather than ignored, since treating a noisy score as exact makes the surrogate overconfident and the search brittle. Parallel evaluation breaks the strict sequential loop, so practitioners use constant liar or fantasizing heuristics that impute pending evaluations to keep proposing diverse points. Search space design matters more than algorithm choice. Putting learning rate and regularization strengths on a log scale, and bounding ranges tightly around physically plausible values, often improves results more than switching surrogates. Multi fidelity methods such as Hyperband and BOHB exploit cheap proxies, for example training for fewer epochs or on subsets of data, to discard poor configurations early and reallocate budget toward promising ones, and they frequently outperform pure Bayesian optimization under a fixed wall clock budget [6]. Finally, the benefit of Bayesian optimization over random search shrinks as dimensionality grows and as the evaluation budget grows large, so the method is most valuable precisely in the regime it was designed for, namely expensive objectives evaluated tens to low hundreds of times. ## 9. Summary Bayesian hyperparameter tuning frames model configuration as the optimization of an expensive, noisy, gradient free black box. It fits a cheap probabilistic surrogate to past evaluations and uses an acquisition function to decide where to look next, balancing exploitation of good regions against exploration of uncertain ones. Gaussian process surrogates offer principled uncertainty and strong sample efficiency in low dimensional continuous spaces but scale poorly in observations and dimensions. Tree-structured Parzen estimators model the density ratio of good to bad configurations, handle conditional and mixed type spaces gracefully, and power the default samplers in Optuna and Hyperopt. Expected improvement remains the workhorse acquisition function, with upper confidence bound and information-theoretic criteria offering alternatives. In practice the right move is to match the surrogate to the search space, combine it with pruning and multi fidelity strategies, and design the search space with as much care as the optimizer. ## References 1. Bergstra, J. and Bengio, Y. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 2012. https://www.jmlr.org/papers/v13/bergstra12a.html 2. Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006. https://gaussianprocess.org/gpml/ 3. Bergstra, J., Bardenet, R., Bengio, Y., and Kegl, B. Algorithms for Hyper-Parameter Optimization. NeurIPS, 2011. https://papers.nips.cc/paper/2011/hash/86e8f7ab32cfd12577bc2619bc635690-Abstract.html 4. Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. ICML, 2010. https://arxiv.org/abs/0912.3995 5. Hernandez-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. Predictive Entropy Search for Efficient Global Optimization of Black-box Functions. NeurIPS, 2014. https://arxiv.org/abs/1406.2541 6. Falkner, S., Klein, A., and Hutter, F. BOHB: Robust and Efficient Hyperparameter Optimization at Scale. ICML, 2018. https://arxiv.org/abs/1807.01774 7. Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. Optuna: A Next-generation Hyperparameter Optimization Framework. KDD, 2019. https://arxiv.org/abs/1907.10902 8. Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE, 2016. https://ieeexplore.ieee.org/document/7352306