177 Random Search for Hyperparameters
177.1 1. Introduction
Every learning algorithm carries a set of knobs that are not fit from data. Learning rates, regularization strengths, tree depths, kernel bandwidths, layer widths, and dropout fractions are all chosen by the practitioner before optimization begins. These knobs are the hyperparameters, and the quality of a model often depends more on setting them well than on the choice of algorithm itself. The task of selecting them is a black box optimization problem. We can evaluate a configuration by training a model and measuring validation performance, but we have no gradient, no closed form, and each evaluation is expensive.
Grid search has long been the default protocol. The practitioner discretizes each hyperparameter into a small set of values and evaluates the full Cartesian product. It is simple, embarrassingly parallel, and reproducible. Yet for problems with more than a few hyperparameters, grid search wastes most of its budget. The central insight of this chapter, due to Bergstra and Bengio [1], is that random search over the same space is not merely competitive with grid search but is provably and empirically superior in the regimes that matter for modern machine learning. We develop why this is true, how to choose sampling distributions, and how to allocate a fixed budget of evaluations.
177.2 2. The Geometry of Grid Search
177.2.1 2.1 The Cartesian Trap
Suppose we tune \(d\) hyperparameters and place \(k\) grid points along each axis. The grid contains \(k^d\) configurations. This exponential growth is the curse of dimensionality in its plainest form. With \(k = 5\) and \(d = 6\) we already face \(15{,}625\) evaluations. If each evaluation takes an hour, the full grid takes nearly two years of serial compute. To stay within budget the practitioner is forced to shrink \(k\), which coarsens the resolution along every axis simultaneously.
177.2.2 2.2 Low Effective Dimensionality
The deeper problem is not cost but coverage. Empirically, validation performance for most models depends strongly on only a few hyperparameters and weakly on the rest. Bergstra and Bengio call this property low effective dimensionality. The function we are optimizing,
\[ \Psi(\lambda) = \mathbb{E}_{x}\,[\,\mathcal{L}(x; \lambda)\,], \]
where \(\lambda\) is the hyperparameter vector and \(\mathcal{L}\) the validation loss, varies almost entirely along a low dimensional subspace of the configuration space.
Now consider how grid search samples an important axis. With a \(k \times k\) grid over two parameters where only the first matters, the grid tests just \(k\) distinct values of the important parameter, even though it spends \(k^2\) evaluations. The other \(k^2 - k\) evaluations are duplicates along the axis that controls performance. Random search, by contrast, almost surely tests \(n\) distinct values of every parameter when it draws \(n\) configurations. This difference in marginal coverage is the heart of the argument.
177.3 3. Why Random Search Wins
177.3.1 3.1 The Marginal Coverage Argument
Project the search onto the single most important hyperparameter. Grid search with \(n = k^d\) total points yields only \(k = n^{1/d}\) distinct values along that axis. Random search yields \(n\) distinct values, one per draw. As \(d\) grows, \(n^{1/d}\) collapses toward one while \(n\) stays fixed, so random search interrogates the dimension that controls the loss far more finely for the same total cost. The wasted evaluations of grid search are wasted precisely because they vary only the irrelevant coordinates.
177.3.2 3.2 A Probabilistic Guarantee
Random search also admits a clean budget guarantee that requires no assumptions about smoothness. Define the target region as the top \(\alpha\) fraction of the configuration space ranked by performance, so a uniformly drawn configuration lands in it with probability \(\alpha\). After \(n\) independent draws, the probability that at least one lands in the target region is
\[ P(\text{hit}) = 1 - (1 - \alpha)^{n}. \]
To reach confidence \(1 - \delta\) of hitting the top \(\alpha\) region we need
\[ n \ge \frac{\log \delta}{\log(1 - \alpha)}. \]
For the top \(5\%\) region with \(\delta = 0.05\) this gives \(n \ge 59\). Strikingly, this bound is independent of \(d\). Whether we tune three hyperparameters or thirty, about sixty random draws place us in the best \(5\%\) with \(95\%\) confidence. Grid search has no comparable guarantee because its coverage of any single axis degrades as \(d\) increases.
177.3.3 3.3 Empirical Confirmation
Bergstra and Bengio demonstrated on neural network and deep belief network benchmarks that random search with a modest budget matched or exceeded grid search that used many times more evaluations, and that it did so while exploring more of the space along the influential axes [1]. The practical lesson is blunt. For more than two or three hyperparameters, prefer random search.
177.4 4. Sampling Distributions
A random search is only as good as the distribution it samples from. Choosing distributions thoughtfully is where domain knowledge enters.
177.4.1 4.1 Match the Scale of the Parameter
Parameters that act multiplicatively should be sampled on a logarithmic scale. A learning rate of \(0.001\) and one of \(0.01\) differ by an order of magnitude, and the interesting range often spans several decades. Sampling uniformly in \([10^{-5}, 10^{-1}]\) would place ninety percent of draws above \(10^{-2}\) and starve the small values. Instead sample the exponent uniformly,
\[ \log_{10} \eta \sim \mathrm{Uniform}(-5, -1), \qquad \eta = 10^{u}. \]
This is the log uniform or reciprocal distribution, and it is the right default for learning rates, regularization coefficients, and any positive scale parameter.
import numpy as np
def sample_config(rng):
return {
"learning_rate": 10 ** rng.uniform(-5, -1), # log-uniform
"weight_decay": 10 ** rng.uniform(-6, -2), # log-uniform
"dropout": rng.uniform(0.0, 0.6), # linear
"hidden_units": int(2 ** rng.integers(6, 11)),# 64..1024
"optimizer": rng.choice(["adam", "sgd"]), # categorical
}177.4.2 4.2 A Taxonomy of Choices
Use a linear uniform distribution for parameters bounded on both sides with no multiplicative character, such as a dropout fraction in \([0, 0.6]\). Use a discrete uniform over integers for counts such as layers or trees. Use a categorical distribution for unordered choices such as the optimizer or activation function. When prior runs suggest a good neighborhood, a Gaussian centered there focuses the search while still exploring.
177.4.3 4.3 Quantization and Conditional Spaces
Some parameters take only specific values, for example a batch size that must be a power of two. Sample the exponent as an integer and exponentiate. Other parameters exist only when a parent is set a certain way. The number of units in a third layer is meaningful only if the network has at least three layers. Such conditional or tree structured spaces are sampled hierarchically by first drawing the parent, then drawing children only when they apply. Naively gridding a conditional space wastes enormous budget on configurations whose extra parameters are inert.
177.5 5. Budget Allocation
A fixed compute budget can be spent in many ways, and how it is spent matters as much as the sampling distribution.
177.5.1 5.1 Setting the Number of Trials
The guarantee of Section 3.2 turns a confidence target into a trial count. If you want the top \(\alpha\) fraction with confidence \(1 - \delta\), draw \(n = \lceil \log \delta / \log(1 - \alpha) \rceil\) configurations. This number does not grow with the dimension of the search space, so adding a hyperparameter you are unsure about costs nothing in trial count. It only widens the region random search can reach.
177.5.2 5.2 Parallelism and Reproducibility
Random search is embarrassingly parallel. Because draws are independent, \(n\) trials run across \(w\) workers in \(\lceil n / w \rceil\) wall clock rounds with no coordination. Seed the generator so the sequence of configurations is reproducible, and log every configuration with its score so the run can be audited and extended. A run that doubles its budget simply draws more samples from the same distribution and appends them.
177.5.3 5.3 Early Stopping and Successive Halving
Pure random search spends the same effort on every configuration, including obviously poor ones. Multi fidelity methods fix this by treating training length, dataset size, or epoch count as a cheap proxy for final performance. Successive halving allocates a small budget to many random configurations, keeps the top fraction, multiplies their budget, and repeats. Given \(n\) configurations and total budget \(B\), each surviving rung receives roughly
\[ b_i = \frac{B}{n_i \,\lceil \log_2 n \rceil} \]
resource per configuration as the population \(n_i\) shrinks geometrically. Hyperband wraps successive halving in an outer loop that hedges across different initial population sizes, trading breadth against the risk of cutting a slow starter too early [2]. These methods keep random sampling as their proposal mechanism and add only a principled rule for where to stop.
177.5.4 5.4 When to Reach for Something Smarter
Random search is uninformed. It does not learn from the trials it has already run. When evaluations are very expensive and the budget is small, a model based approach can do better by fitting a surrogate to past results and proposing the next configuration where expected improvement is high. Sequential model based optimization, including the Tree structured Parzen Estimator and Gaussian process Bayesian optimization, formalizes this idea [3]. Yet random search remains the right baseline and the right default. It is trivial to implement, parallelizes perfectly, has a dimension free guarantee, and is the protocol against which every adaptive method must justify its added complexity.
177.6 6. Practical Protocol
A reliable workflow combines the pieces above. Define each hyperparameter with an explicit distribution that respects its scale and type. Choose the number of trials from the confidence bound rather than from habit. Run the trials in parallel under a fixed seed, logging every configuration and score. Inspect the marginal scatter of score against each hyperparameter to learn which axes actually matter, then optionally narrow the ranges and run a second round. If evaluations are costly, layer successive halving on top so the budget concentrates on promising configurations. This protocol is cheap to build, easy to reason about, and hard to beat without considerably more machinery.
177.7 7. Summary
Grid search fails in high dimensions because its budget grows as \(k^d\) while its coverage of any single influential axis grows only as \(n^{1/d}\). Random search inverts this. It tests a distinct value of every parameter on every draw, it carries a dimension free probabilistic guarantee of landing in the best region, and it parallelizes without coordination. Its effectiveness rests on choosing sampling distributions that match the scale and type of each parameter, especially the log uniform distribution for multiplicative quantities. With a sensible trial count and optional multi fidelity early stopping, random search is the practical default for hyperparameter tuning and the baseline every more sophisticated method should be measured against.
177.8 References
- Bergstra, J., and Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13, 281 to 305. https://www.jmlr.org/papers/v13/bergstra12a.html
- Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2017). Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. Journal of Machine Learning Research, 18, 1 to 52. https://www.jmlr.org/papers/v18/16-558.html
- Bergstra, J., Bardenet, R., Bengio, Y., and Kegl, B. (2011). Algorithms for Hyper-Parameter Optimization. Advances in Neural Information Processing Systems 24. https://papers.nips.cc/paper/2011/hash/86e8f7ab32cfd12577bc2619bc635690-Abstract.html
- Feurer, M., and Hutter, F. (2019). Hyperparameter Optimization. In Automated Machine Learning, Springer, 3 to 33. https://link.springer.com/chapter/10.1007/978-3-030-05318-5_1