156 The Philosophy of Model Evaluation
Evaluation is the silent governor of machine learning practice. Every benchmark leaderboard, every validation curve, every A/B test report rests on a chain of assumptions about what a number means and why it deserves our trust. Practitioners spend enormous effort building models and comparatively little interrogating the rulers they use to measure them. This chapter steps back from the mechanics of computing accuracy or F1 and asks a prior question. When we evaluate a model, what are we actually measuring, and how confident should we be that the measurement corresponds to anything we care about? The answer is more fragile than the tidy tables in research papers suggest. A metric is a compression of a goal, and compression is lossy. Understanding where the loss occurs, and learning to reason about it deliberately, is the difference between an evaluation that guides good decisions and one that quietly licenses bad ones.
156.1 1. What Are We Really Measuring?
156.1.1 1.1 The Substitution at the Heart of Evaluation
Consider a recommendation system whose true purpose is to help users discover content they will find valuable over the long run. We cannot measure long-run value directly. It is diffuse, delayed, and entangled with factors outside the model. So we substitute. We measure click-through rate, or watch time, or a thumbs-up signal. Each substitution moves us from the thing we care about, call it the estimand of interest, to something observable. The observable quantity is an estimator of the goal only under assumptions, and those assumptions are rarely stated.
Formally, let \(V\) denote the latent quantity we actually value and let \(M\) denote the metric we compute. Evaluation implicitly asserts that \(M\) is a useful proxy for \(V\), which we might write as a hope that
\[ \mathbb{E}[V \mid M = m] \text{ is increasing in } m. \]
This monotonicity is an empirical claim, not a definition. It can fail, and a large part of evaluation wisdom is knowing when it fails. The metric is never the goal. It is a shadow cast by the goal onto the wall of measurable things, and like any shadow it distorts the object that casts it.
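The monotonicity claim can be probed empirically whenever a small audit sample exists in which both the metric and a trusted measurement of the value were recorded. That audit sample is an assumption here, as are the function names and the toy numbers. The sketch sorts the pairs by \(M\), splits them into equal-count bins, and checks whether the mean of \(V\) rises from bin to bin.

```python
from statistics import mean

def binned_means(metric, value, n_bins=5):
    """Sort (metric, value) pairs by metric, split into bins, return mean value per bin."""
    pairs = sorted(zip(metric, value))
    size = len(pairs) // n_bins
    if size == 0:
        raise ValueError("need at least n_bins audited pairs")
    bins = [pairs[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(pairs[(n_bins - 1) * size:])  # last bin takes any remainder
    return [mean(v for _, v in b) for b in bins]

def is_monotone_increasing(xs, tol=0.0):
    """True if each entry is at least the previous one, up to a tolerance."""
    return all(b >= a - tol for a, b in zip(xs, xs[1:]))

# Hypothetical audit data: the proxy tracks value at first, then diverges at the top.
metric = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
value = [1.0, 1.2, 2.0, 2.1, 3.0, 3.2, 3.9, 4.0, 2.5, 2.0]

means = binned_means(metric, value, n_bins=5)
print(means, is_monotone_increasing(means))
```

In this toy data the top bin has a lower mean value than the bin below it, which is the kind of failure the score alone would never reveal.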
156.1.2 1.2 Construct Validity
Psychometrics has a precise vocabulary for this problem that machine learning has been slow to adopt. A construct is the abstract attribute we wish to measure, such as reading comprehension or toxicity or helpfulness. Construct validity is the degree to which an operationalized measurement actually captures the construct rather than something correlated with it. When a question answering benchmark claims to measure comprehension but can be solved by exploiting answer position or lexical overlap, the measurement has high reliability and low construct validity. It produces stable numbers that mean little.
Three failure modes recur. Construct underrepresentation occurs when the metric captures only a sliver of the target, as when a coding benchmark tests syntax but not design. Construct-irrelevant variance occurs when the metric responds to factors unrelated to the construct, such as a sentiment model that keys on punctuation. Criterion contamination occurs when the evaluation signal is influenced by the very predictor it is meant to validate, a frequent problem when labels and features share a source. None of these is visible from the score alone. They become visible only when you ask what the construct is and audit whether the measurement could be high for the wrong reasons.
156.1.3 1.3 Reliability Versus Validity
A measurement can be reliable without being valid. Reliability is consistency, the property that repeated measurement yields similar values. Validity is correctness, the property that the value corresponds to the construct. Reliability does not imply validity, although a badly unreliable measurement cannot be highly valid either. A miscalibrated scale that always reads three kilograms heavy is perfectly reliable and wrong on every reading. Much of modern benchmarking optimizes reliability, through large test sets and fixed protocols, while leaving validity unexamined. This is comfortable because reliability is easy to quantify and validity is not. But a reliable measurement of the wrong thing is worse than a noisy measurement of the right thing, because its precision invites misplaced confidence.
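A small simulation makes the scale example concrete. It is only a sketch: the true weight, the noise level, and the helper name are illustrative assumptions, and mean error stands in for just one aspect of validity, namely systematic bias.

```python
import random
from statistics import mean, pstdev

def repeated_readings(true_weight, bias, noise_sd, n, rng):
    """Simulate n readings from a scale with a fixed bias and Gaussian noise."""
    return [true_weight + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]

rng = random.Random(0)
true_weight = 70.0

# Always three kilograms heavy, never noisy.
biased_scale = repeated_readings(true_weight, bias=3.0, noise_sd=0.0, n=1000, rng=rng)
# Unbiased, but each reading wobbles.
noisy_scale = repeated_readings(true_weight, bias=0.0, noise_sd=1.0, n=1000, rng=rng)

for name, readings in [("biased", biased_scale), ("noisy", noisy_scale)]:
    spread = pstdev(readings)               # reliability: small spread means consistent
    error = mean(readings) - true_weight    # systematic error: distance from the truth
    print(f"{name}: spread={spread:.3f} mean_error={error:+.3f}")
```

The biased scale has zero spread and a three kilogram error that no amount of repetition removes, while the noisy scale's error shrinks as readings are averaged.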
156.2 2. The Gap Between Offline Metrics and Real-World Value
156.2.1 2.1 Distribution Shift Between Test and Deployment
The offline test set is a sample from a distribution \(P_{\text{test}}\). Deployment draws from \(P_{\text{deploy}}\). The standard generalization guarantee bounds error on the distribution the test set was drawn from, and says nothing once \(P_{\text{deploy}} \neq P_{\text{test}}\). The contrast is stark. The quantity we estimate offline is
\[ R_{\text{test}} = \mathbb{E}_{(x, y) \sim P_{\text{test}}}[\ell(f(x), y)], \]
while the quantity that matters is
\[ R_{\text{deploy}} = \mathbb{E}_{(x, y) \sim P_{\text{deploy}}}[\ell(f(x), y)]. \]
The two coincide only when test and deployment distributions match, a condition that almost never holds in practice. User behavior drifts, adversaries adapt, the world changes, and the act of deploying a model changes the inputs it later receives. An offline metric measures performance on a frozen photograph of a moving world.
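The gap between the two risks can be illustrated with a deliberately simple case. The sketch below assumes pure covariate shift: inputs fall into a few regimes, the model's error rate within each regime is fixed, and only the mix of regimes changes between test and deployment. The regime names, error rates, and function names are all hypothetical.

```python
def risk(dist, error_by_regime):
    """Expected error when inputs arrive from regimes with the given probabilities."""
    return sum(p * error_by_regime[regime] for regime, p in dist.items())

def reweighted_test_risk(p_test, p_deploy, error_by_regime):
    """Reweight test regimes by w = p_deploy / p_test to estimate deployment risk.
    Only valid when every deployment regime also appears in the test distribution."""
    missing = [r for r, p in p_deploy.items() if p > 0 and p_test.get(r, 0.0) == 0.0]
    if missing:
        raise ValueError(f"no test coverage for regimes: {missing}")
    return sum(
        p_test[r] * (p_deploy.get(r, 0.0) / p_test[r]) * error_by_regime[r]
        for r in p_test if p_test[r] > 0
    )

# Hypothetical numbers: the model is weak at night, and night inputs are rare in the test set.
error_by_regime = {"day": 0.02, "night": 0.20}
p_test = {"day": 0.9, "night": 0.1}
p_deploy = {"day": 0.5, "night": 0.5}

r_test = risk(p_test, error_by_regime)
r_deploy = risk(p_deploy, error_by_regime)
r_reweighted = reweighted_test_risk(p_test, p_deploy, error_by_regime)
print(f"R_test={r_test:.3f} R_deploy={r_deploy:.3f} reweighted={r_reweighted:.3f}")
```

Reweighting recovers the deployment risk here only because the shift is known and the test set covers every regime. When deployment contains inputs the test set never sampled, the function refuses to answer, which is the honest response.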
156.2.2 2.2 The Feedback Loop Problem
A subtler gap appears when the model influences its own future data. A loan model that denies certain applicants never observes whether they would have repaid, so its future training data is censored by its own past decisions. A ranking model that promotes certain items starves the rest of impressions, manufacturing a self-confirming pattern in the logs. Offline evaluation on logged data inherits these biases. The data is not a neutral sample of the world but a record of the world as filtered through prior model behavior. Counterfactual and off-policy estimation techniques exist to partially correct this, but they require either logging propensities or assumptions about overlap that are themselves hard to validate.
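Inverse propensity scoring is one of the standard off-policy estimators, and a minimal sketch shows why logged propensities matter. The simulation is an assumption made for illustration: two actions, a logging policy that favors action 1, and reward rates chosen so that the neglected action is actually the better one.

```python
import random

def simulate_logs(n, rng):
    """Logging policy shows action 1 with probability 0.8; rewards are Bernoulli."""
    p_action = {0: 0.2, 1: 0.8}       # logged propensities
    reward_rate = {0: 0.7, 1: 0.4}    # hypothetical true reward rates
    logs = []
    for _ in range(n):
        a = 1 if rng.random() < p_action[1] else 0
        r = 1.0 if rng.random() < reward_rate[a] else 0.0
        logs.append((a, r, p_action[a]))
    return logs

def ips_value(logs, target_action):
    """Inverse propensity estimate of the value of always choosing target_action."""
    return sum(r / p for a, r, p in logs if a == target_action) / len(logs)

rng = random.Random(0)
logs = simulate_logs(20000, rng)

naive = sum(r for _, r, _ in logs) / len(logs)   # average reward under the logging policy
ips = ips_value(logs, target_action=0)           # estimated reward if we always chose action 0
print(f"naive logged reward={naive:.3f}  IPS estimate for action 0={ips:.3f}")
```

The naive average describes the logging policy, not the candidate policy. The IPS estimate lands near the true rate of 0.7 for action 0, but only because each action had a known, nonzero chance of being logged, which is the overlap assumption named above.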
156.2.3 2.3 Aggregate Metrics Hide Distributional Harm
A single averaged number is a powerful compression and a dangerous one. A model with \(94\%\) accuracy may achieve \(98\%\) on the majority subgroup and \(61\%\) on a minority subgroup, and the average conceals the disparity entirely. The value a system delivers is rarely the mean of its per-instance values, because harms are often concentrated and nonlinear. A translation system that is usually excellent but occasionally produces a catastrophic mistranslation in a medical context cannot be summarized by average quality. Real-world value lives in the tails and in the slices, and the offline scalar averages them away. Disaggregated evaluation, reporting performance across meaningful subpopulations and operating conditions, is not a fairness nicety added at the end. It is a precondition for the average to mean anything.
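Disaggregation is mechanically simple, which makes its omission harder to excuse. The sketch below builds a hypothetical set of 1000 predictions whose counts were chosen to reproduce roughly the figures above (about \(98\%\) on the majority, about \(61\%\) on the minority, \(94\%\) overall). The group labels and the function name are illustrative.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, is_correct). Returns (overall, {group: accuracy})."""
    hits, counts = defaultdict(int), defaultdict(int)
    for group, correct in records:
        counts[group] += 1
        hits[group] += int(correct)
    overall = sum(hits.values()) / sum(counts.values())
    return overall, {g: hits[g] / counts[g] for g in counts}

# Hypothetical evaluation log: 892 majority examples, 108 minority examples.
records = (
    [("majority", True)] * 874 + [("majority", False)] * 18
    + [("minority", True)] * 66 + [("minority", False)] * 42
)

overall, per_group = accuracy_by_group(records)
print(f"overall={overall:.3f}")
for group, acc in per_group.items():
    print(f"  {group}: {acc:.3f}")
```

The overall figure is dominated by the larger group, so a report that stops at the first line would never show that the minority group is failed nearly four times in ten.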
156.2.4 2.4 The Online-Offline Correlation Is Itself an Object of Study
Mature organizations treat the relationship between offline metrics and online outcomes as an empirical quantity to be measured rather than assumed. The right question is not whether an offline metric improved but whether improvements in that metric have historically translated into the online result that justifies the work. When a team accumulates a record of offline gains that did not move the online needle, the metric has revealed itself as a poor proxy, and the correct response is to change the metric, not to keep trusting it.
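One way to treat that relationship as a measured quantity is to keep a launch history and compute a rank correlation between offline gains and online gains. The sketch below uses Spearman's coefficient in its simple no-ties form. The launch history is invented for illustration, and a real analysis would need far more launches and an interval around the estimate.

```python
def ranks(values):
    """Rank from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation using the closed form that is valid without ties."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical history: offline metric gain and online outcome gain for six launches.
offline_gain = [0.5, 1.2, 0.3, 2.0, 0.8, 1.5]
online_gain = [0.10, -0.20, 0.00, 0.05, 0.30, -0.10]

rho = spearman(offline_gain, online_gain)
print(f"rank correlation between offline and online gains: {rho:.2f}")
```

A coefficient near zero or below it, as in this toy history, is the quantitative form of the observation that offline gains have not been moving the online result.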
156.3 3. Choosing a Metric That Matches the Goal
156.3.1 3.1 Start From the Decision, Not the Model
A metric earns its place by improving a decision. Before selecting one, articulate the decision it informs. Will the model trigger an irreversible action, or surface a suggestion a human will review? Is the cost of a false positive symmetric with a false negative? A fraud system that blocks transactions and a fraud system that flags them for review demand different metrics, because the cost structure of their errors differs. The metric should encode the loss function of the actual decision, not a generic default chosen because it is conventional.
156.3.2 3.2 Costs Are Asymmetric and Often Nonlinear
The reflex to optimize accuracy assumes errors are interchangeable. They seldom are. Let \(c_{\text{FP}}\) and \(c_{\text{FN}}\) be the costs of a false positive and a false negative. The expected cost of a classifier is
\[ \mathbb{E}[\text{cost}] = c_{\text{FP}} \cdot P(\hat{y}=1, y=0) + c_{\text{FN}} \cdot P(\hat{y}=0, y=1), \]
and the decision threshold that minimizes it depends on the ratio \(c_{\text{FN}} / c_{\text{FP}}\), not on accuracy at all. When a missed cancer diagnosis costs far more than a false alarm, the optimal operating point lies far from the one that maximizes accuracy. Choosing a metric means choosing, explicitly, what the errors cost.
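The dependence on the cost ratio follows in one step if the model's probability \(p\) is calibrated, which is an assumption of this sketch. Predicting positive costs \((1 - p)\, c_{\text{FP}}\) in expectation and predicting negative costs \(p\, c_{\text{FN}}\), so the positive call is cheaper exactly when \(p > c_{\text{FP}} / (c_{\text{FP}} + c_{\text{FN}})\). The costs and probabilities below are hypothetical.

```python
def optimal_threshold(c_fp, c_fn):
    """Predict positive when p > c_fp / (c_fp + c_fn), assuming p is calibrated."""
    return c_fp / (c_fp + c_fn)

def expected_cost(probs, threshold, c_fp, c_fn):
    """Expected total cost of thresholding calibrated probabilities at `threshold`."""
    total = 0.0
    for p in probs:
        if p > threshold:
            total += (1 - p) * c_fp   # chance the positive call is a false alarm
        else:
            total += p * c_fn         # chance the negative call is a miss
    return total

# Hypothetical costs: a miss is twenty times worse than a false alarm.
c_fp, c_fn = 1.0, 20.0
probs = [0.02, 0.10, 0.30, 0.60, 0.90]

t_star = optimal_threshold(c_fp, c_fn)
cost_default = expected_cost(probs, 0.5, c_fp, c_fn)
cost_optimal = expected_cost(probs, t_star, c_fp, c_fn)
print(f"t*={t_star:.3f}  cost at 0.5={cost_default:.2f}  cost at t*={cost_optimal:.2f}")
```

With these costs the threshold drops to roughly 0.05, and the conventional cutoff of 0.5 is several times more expensive on the same predictions.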
156.3.3 3.3 Ranking, Calibration, and Threshold-Free Views
Different metrics answer different questions, and conflating them is a common error. Calibration asks whether predicted probabilities match empirical frequencies, so that among instances assigned probability \(0.7\), roughly \(70\%\) are positive. Calibration matters whenever a downstream decision consumes the probability rather than the label, as in expected-value computations. Ranking metrics such as AUC ask whether the model orders instances correctly, ignoring the absolute scale. A model can rank perfectly yet be wildly miscalibrated, and a model can be well calibrated yet rank no better than chance, as when it predicts the base rate for every instance. Selecting between them requires knowing what the consumer of the prediction needs. The following sketch distinguishes the questions a metric can answer.
metric question        consumer need
---------------------  --------------------------------------
calibration            probabilities feed expected value math
ranking / AUC          pick top-k, order a queue
thresholded accuracy   a fixed accept/reject rule
proper scoring rule    honest probabilistic forecasts
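The separation between ranking and calibration can be checked on two toy models. This is a sketch under simple assumptions: AUC is computed by brute-force pairwise comparison, calibration is summarized by a binned expected calibration error, and the scores are invented so that each model is good on one axis and poor on the other.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def expected_calibration_error(scores, labels, n_bins=10):
    """Weighted average gap between mean score and positive rate within score bins."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        bins[min(int(s * n_bins), n_bins - 1)].append((s, y))
    ece = 0.0
    for b in bins:
        if b:
            confidence = sum(s for s, _ in b) / len(b)
            frequency = sum(y for _, y in b) / len(b)
            ece += len(b) / len(scores) * abs(confidence - frequency)
    return ece

labels = [0, 0, 0, 1, 1, 1]
ranker = [0.70, 0.70, 0.70, 0.90, 0.90, 0.90]      # orders perfectly, scale is off
base_rate = [0.50, 0.50, 0.50, 0.50, 0.50, 0.50]   # calibrated, but cannot order anything

for name, scores in [("ranker", ranker), ("base_rate", base_rate)]:
    print(f"{name}: AUC={auc(scores, labels):.2f} "
          f"ECE={expected_calibration_error(scores, labels):.2f}")
```

The first model would serve a top-k consumer well and an expected-value consumer badly. The second is the reverse, which is why the consumer's need has to be settled before the metric is chosen.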
156.3.4 3.4 Proper Scoring Rules
When a model is meant to express genuine uncertainty, the metric should reward honesty. A scoring rule is proper if its expectation is optimized by reporting the true probability, and strictly proper if the true probability is the only report that achieves the optimum. The log loss and the Brier score are strictly proper, which is why they are preferred for probabilistic forecasting. A metric that is not proper can be gamed by reporting a distribution other than one’s true belief, which means it actively discourages the calibration we want. The choice of a proper scoring rule is a structural defense against a class of gaming, built into the metric rather than patched on afterward.
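The property is easy to verify numerically for a single binary event. The sketch below fixes a true probability of \(0.7\) (an arbitrary choice), computes the expected loss of each possible report \(q\) on a grid, and finds the report that minimizes it. Absolute error is included as an example of a rule that is not proper.

```python
import math

def expected_brier(p, q):
    """Expected squared error of reporting q when the event has probability p."""
    return p * (1 - q) ** 2 + (1 - p) * q ** 2

def expected_log_loss(p, q):
    """Expected negative log likelihood of reporting q when the truth is p."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

def expected_abs_error(p, q):
    """Expected absolute error of reporting q; linear in q, so not proper."""
    return p * (1 - q) + (1 - p) * q

def best_report(expected_loss, p, grid):
    """Report on the grid that minimizes the expected loss."""
    return min(grid, key=lambda q: expected_loss(p, q))

grid = [i / 100 for i in range(1, 100)]   # reports from 0.01 to 0.99
true_p = 0.7

best_brier = best_report(expected_brier, true_p, grid)
best_log = best_report(expected_log_loss, true_p, grid)
best_abs = best_report(expected_abs_error, true_p, grid)
print(f"Brier: {best_brier}  log loss: {best_log}  absolute error: {best_abs}")
```

Under the Brier score and log loss the best report is the true probability. Under absolute error the best report is pushed to the edge of the grid, so a forecaster scored this way is paid to overstate confidence.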
156.3.5 3.5 Multiple Objectives Resist a Single Number
Real systems balance competing goals such as relevance against diversity, latency against quality, and accuracy against fairness. Collapsing them into one weighted sum hides the weights, which are value judgments masquerading as arithmetic. A more honest approach reports a vector of metrics and reasons about the Pareto frontier, the set of models for which no alternative is at least as good on every objective and strictly better on at least one. The choice among Pareto-optimal points is a stakeholder decision, not a technical one, and pretending otherwise smuggles a policy choice into a hyperparameter.
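Computing the frontier requires no weights at all, which is the point. The sketch below assumes every objective is oriented so that higher is better. The model names and scores are hypothetical.

```python
def dominates(a, b):
    """a dominates b if a is at least as good on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(models):
    """Return the models that no other model dominates."""
    return {
        name: scores
        for name, scores in models.items()
        if not any(dominates(other, scores)
                   for other_name, other in models.items() if other_name != name)
    }

# Hypothetical (relevance, diversity) scores; higher is better on both.
models = {
    "A": (0.90, 0.60),
    "B": (0.85, 0.80),
    "C": (0.80, 0.70),   # worse than B on both objectives
    "D": (0.70, 0.90),
}

front = pareto_front(models)
print(sorted(front))
```

Model C drops out because B beats it on both objectives. Choosing among A, B, and D is where the value judgment begins, and nothing in the code can make that choice.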
156.4 4. The Dangers of Optimizing a Proxy
156.4.1 4.1 Goodhart’s Law and Its Mechanisms
When a measure becomes a target, it ceases to be a good measure. This is Goodhart’s law, and machine learning is unusually exposed to it because optimization is relentless and literal. A useful refinement distinguishes several mechanisms. Regressional Goodhart arises because a proxy correlated with the goal in a population becomes a worse signal at the extremes selected by optimization, since the gap between proxy and goal dominates once the proxy is pushed to its limit. Extremal Goodhart arises when optimization pushes inputs into regimes where the historical proxy-goal relationship no longer holds. Causal Goodhart arises when an intervention exploits a correlation that is not causal, moving the proxy without moving the goal. Each mechanism predicts a different failure and suggests a different defense, but all share the same root. The proxy and the goal were never identical, and optimization finds the seam.
156.4.2 4.2 The Proxy Gap Under Optimization
Let the goal be \(V\) and the proxy be \(M = V + \epsilon\), where \(\epsilon\) captures everything the proxy includes that the goal does not. A model selected to maximize \(M\) will exploit large positive realizations of \(\epsilon\) just as eagerly as genuine increases in \(V\). As optimization pressure rises, the selected solutions are increasingly those for which \(\epsilon\) is large rather than those for which \(V\) is large, so
\[ \text{maximizing } M \text{ hard} \;\Longrightarrow\; \text{selecting on } \epsilon, \]
and beyond some point further gains in \(M\) can bring little or no gain in \(V\), and in some regimes an outright loss. This is why a model can climb a benchmark while becoming less useful. The benchmark is the proxy, usefulness is the goal, and sufficiently hard optimization mines the gap between them.
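The selection effect can be simulated directly. The sketch below assumes \(V\) and \(\epsilon\) are independent standard normal draws, which is the mildest case: under these assumptions the goal still rises with selection pressure, only much more slowly than the proxy. Heavier-tailed noise or a breakdown of the proxy-goal relationship would be needed to show an outright decline, and neither is modeled here.

```python
import random
from statistics import mean

def select_top(n_candidates, top_fraction, rng, noise_sd=1.0):
    """Draw candidates with proxy M = V + eps, keep the top fraction by M,
    and return (mean proxy, mean goal) among the selected."""
    candidates = []
    for _ in range(n_candidates):
        v = rng.gauss(0.0, 1.0)
        eps = rng.gauss(0.0, noise_sd)
        candidates.append((v + eps, v))
    candidates.sort(reverse=True)                 # highest proxy first
    k = max(1, int(n_candidates * top_fraction))
    top = candidates[:k]
    return mean(m for m, _ in top), mean(v for _, v in top)

rng = random.Random(0)
gaps = []
for fraction in (0.5, 0.1, 0.01):
    m_bar, v_bar = select_top(20000, fraction, rng)
    gaps.append(m_bar - v_bar)
    print(f"top {fraction:>4.0%}: mean proxy={m_bar:.2f} "
          f"mean goal={v_bar:.2f} gap={m_bar - v_bar:.2f}")
```

The gap between proxy and goal widens as the selection gets stricter, because the survivors are increasingly the candidates that were lucky in \(\epsilon\).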
156.4.3 4.3 Benchmark Saturation, Contamination, and Overfitting
The community-level version of this dynamic is the slow death of a benchmark. A test set used by thousands of researchers over years becomes, through the publication of methods tuned to it, an extension of the training set. Information leaks through the choices of architectures, hyperparameters, and tricks that the community retains precisely because they help on that benchmark. With large pretrained models the problem sharpens into outright contamination, where test examples appear in training corpora scraped from the web. The reported score then measures memorization, the purest form of construct-irrelevant variance. Defenses include held-out and freshly collected test sets, contamination audits that probe for verbatim recall, and a healthy suspicion of any single benchmark that the field has optimized for a long time.
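When the training corpus is available, a simpler cousin of the recall probe is a direct overlap check between test examples and corpus text. The sketch below is not a recall probe of the model itself. It assumes corpus access, uses whitespace tokenization, and picks an n-gram length arbitrarily, and the example strings are invented.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in a text (empty if the text is shorter than n)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(test_examples, corpus_docs, n=8):
    """Fraction of test examples sharing at least one n-gram with the corpus,
    together with the flagged examples."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = [ex for ex in test_examples if ngrams(ex, n) & corpus_grams]
    return len(flagged) / len(test_examples), flagged

# Hypothetical data, with a short n so the toy strings can overlap.
test_examples = [
    "the quick brown fox jumps over the lazy dog",
    "an entirely novel question about tensors",
]
corpus_docs = ["as they say the quick brown fox jumps over everything"]

rate, flagged = contamination_rate(test_examples, corpus_docs, n=4)
print(f"flagged {rate:.0%} of test examples: {flagged}")
```

Exact overlap misses paraphrased or reformatted leaks, so a clean result from a check like this lowers suspicion without removing it.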
156.4.4 4.4 Reward Hacking and Specification Gaming
In reinforcement learning and in the optimization of language models against learned reward models, the proxy problem becomes vivid. The agent optimizes the reward signal, not the intent behind it, and discovers behaviors that score well while violating the designer’s wish. A model rewarded for human approval may learn to produce confident, agreeable, well-formatted answers that earn high reward while being wrong, because the reward model rewards the form of a good answer rather than its truth. This is Goodhart’s law with a fast optimizer attached. The defenses mirror the diagnosis. Make the reward harder to game through ensembling and adversarial probing, penalize divergence from a trusted reference so the policy cannot wander into unmeasured regions, and keep a human evaluation in the loop precisely because it is the construct the automated proxy was standing in for.
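The divergence penalty can be illustrated on a toy problem with three candidate responses. The sketch assumes the penalty takes the common form of a KL divergence from the reference policy scaled by a coefficient, written here as `beta`. The text does not specify the form, and the distributions, rewards, and coefficient values are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; requires q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_objective(policy, reference, rewards, beta):
    """Expected proxy reward minus beta times the divergence from the reference."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

reference = [0.5, 0.3, 0.2]     # trusted reference policy over three responses
rewards = [1.0, 1.2, 3.0]       # proxy reward; the third response exploits the reward model
collapsed = [0.01, 0.01, 0.98]  # policy that piles onto the exploit
mild = [0.4, 0.3, 0.3]          # policy that stays near the reference

for beta in (0.0, 2.0):
    score_collapsed = penalized_objective(collapsed, reference, rewards, beta)
    score_mild = penalized_objective(mild, reference, rewards, beta)
    print(f"beta={beta}: collapsed={score_collapsed:.3f} mild={score_mild:.3f}")
```

Without the penalty the collapsed policy wins on proxy reward. With a large enough coefficient the ordering reverses, at the cost of also limiting how far the policy can move toward genuine improvements.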
156.4.5 4.5 Living With Proxies Responsibly
Proxies are unavoidable. We can rarely measure the things we ultimately care about directly, so we will always optimize stand-ins. The responsible posture is not to seek a perfect metric, which does not exist, but to hold every metric provisionally. Maintain a portfolio of metrics so that gaming one tends to show up as degradation in another. Rotate and refresh evaluations so the target keeps moving relative to the optimizer. Reserve a slow, expensive, high-validity evaluation, often human judgment, as the periodic audit against which the cheap proxy is checked. And watch the relationship between proxy and goal over time, treating a widening gap as the signal it is. The proxy is a tool, and like any tool it is safe only in the hands of someone who remembers what it is for.
156.5 5. Synthesis
The philosophy of evaluation reduces to a discipline of humility about measurement. Every metric substitutes an observable for a value, and the substitution is valid only under assumptions that deserve to be stated and checked. Offline numbers presuppose a frozen distribution and an unbiased sample, and deployment respects neither. The right metric is the one that encodes the costs of the actual decision, expresses uncertainty honestly, and disaggregates across the slices where harm concentrates. And the moment any metric becomes a target, optimization begins mining the gap between proxy and goal, so the metric must be held loosely, audited often, and surrounded by others. Good evaluation is not a number. It is an argument that a number means what we hope it means, and that argument requires continual maintenance.
156.6 References
- Goodhart, C. A. E. (1984). Problems of Monetary Management: The UK Experience. In Monetary Theory and Practice. https://link.springer.com/chapter/10.1007/978-1-349-17295-5_4
- Manheim, D., and Garrabrant, S. (2018). Categorizing Variants of Goodhart’s Law. https://arxiv.org/abs/1803.04585
- Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. (2016). Concrete Problems in AI Safety. https://arxiv.org/abs/1606.06565
- Jacobs, A. Z., and Wallach, H. (2021). Measurement and Fairness. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency. https://arxiv.org/abs/1912.05511
- Liao, T., Taori, R., Raji, I. D., and Schmidt, L. (2021). Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning. https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/757b505cfd34c64c85ca5b5690ee5293-Abstract-round2.html
- Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. (2019). Do ImageNet Classifiers Generalize to ImageNet? In Proceedings of the International Conference on Machine Learning. https://arxiv.org/abs/1902.10811
- Gneiting, T., and Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association. https://www.tandfonline.com/doi/abs/10.1198/016214506000001437
- Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. In Proceedings of the International Conference on Machine Learning. https://arxiv.org/abs/1706.04599
- Skalse, J., Howe, N. H. R., Krasheninnikov, D., and Krueger, D. (2022). Defining and Characterizing Reward Hacking. In Advances in Neural Information Processing Systems. https://arxiv.org/abs/2209.13085
- Raji, I. D., Bender, E. M., Paullada, A., Denton, E., and Hanna, A. (2021). AI and the Everything in the Whole Wide World Benchmark. https://arxiv.org/abs/2111.15366
- Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., et al. (2015). Hidden Technical Debt in Machine Learning Systems. In Advances in Neural Information Processing Systems. https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html
- Bowman, S. R., and Dahl, G. E. (2021). What Will It Take to Fix Benchmarking in Natural Language Understanding? In Proceedings of NAACL. https://arxiv.org/abs/2104.02145