156 The Philosophy of Model Evaluation

Evaluation is the silent governor of machine learning practice. Every benchmark leaderboard, every validation curve, every A/B test report rests on a chain of assumptions about what a number means and why it deserves our trust. Practitioners spend enormous effort building models and comparatively little interrogating the rulers they use to measure them. This chapter steps back from the mechanics of computing accuracy or F1 and asks a prior question. When we evaluate a model, what are we actually measuring, and how confident should we be that the measurement corresponds to anything we care about? The answer is more fragile than the tidy tables in research papers suggest. A metric is a compression of a goal, and compression is lossy. Understanding where the loss occurs, and learning to reason about it deliberately, is the difference between an evaluation that guides good decisions and one that quietly licenses bad ones.

The argument of the chapter can be stated in one breath. Evaluation substitutes an observable metric for an unobservable value (Section 1), that substitution degrades further when the offline measurement is read as a deployment outcome (Section 2), so the metric must be chosen to encode the actual decision (Section 3), and once chosen it must be defended against the optimization pressure that mines the gap between proxy and goal (Section 4). The following diagram traces this chain.

flowchart TD
    V["Latent value V we care about"]
    M["Observable metric M we compute"]
    Offline["Offline score on a fixed test set"]
    Deploy["Deployment outcome under shift"]
    Decision["The decision the metric informs"]
    Opt["Optimization against the metric"]
    V -- "lossy proxy, Section 1" --> M
    M -- "distribution shift, Section 2" --> Offline
    Offline -- "the gap that matters" --> Deploy
    Decision -- "should select M, Section 3" --> M
    Opt -- "mines the proxy gap, Section 4" --> M

156.1 1. What Are We Really Measuring?

156.1.1 1.1 The Substitution at the Heart of Evaluation

Consider a recommendation system whose true purpose is to help users discover content they will find valuable over the long run. We cannot measure long run value directly. It is diffuse, delayed, and entangled with factors outside the model. So we substitute. We measure click through rate, or watch time, or a thumbs up signal. Each substitution moves us from the thing we care about, call it the estimand of interest, to something observable. The observable quantity is an estimator of the goal only under assumptions, and those assumptions are rarely stated.

Formally, let $V$ denote the latent quantity we actually value and let $M$ denote the metric we compute. Evaluation implicitly asserts that $M$ is a useful proxy for $V$. The minimal honest version of that assertion is a monotonicity condition: knowing the metric should shift our belief about the value in the right direction,

\[ m_1 > m_2 \;\Longrightarrow\; \mathbb{E}[V \mid M = m_1] \;\ge\; \mathbb{E}[V \mid M = m_2]. \]

This monotonicity is an empirical claim, not a definition. It is strictly weaker than the proportionality people usually imagine: it does not require that a one unit gain in $M$ buys a fixed gain in $V$, only that more metric never predicts less value. Even this weak form can fail, and a large part of evaluation wisdom is knowing when it fails. The metric is never the goal. It is a shadow cast by the goal onto the wall of measurable things, and like any shadow it distorts the object that casts it.

156.1.2 1.2 Construct Validity

Psychometrics has a precise vocabulary for this problem that machine learning has been slow to adopt, a connection drawn out carefully by Jacobs and Wallach in their treatment of measurement and fairness [4]. A construct is the abstract attribute we wish to measure, such as reading comprehension or toxicity or helpfulness. Construct validity is the degree to which an operationalized measurement actually captures the construct rather than something correlated with it. When a question answering benchmark claims to measure comprehension but can be solved by exploiting answer position or lexical overlap, the measurement has high reliability and low construct validity. It produces stable numbers that mean little.

Three failure modes recur. Construct underrepresentation occurs when the metric captures only a sliver of the target, as when a coding benchmark tests syntax but not design. Construct irrelevant variance occurs when the metric responds to factors unrelated to the construct, such as a sentiment model that keys on punctuation, or a reading test whose answers correlate with passage length. Criterion contamination occurs when the evaluation signal is influenced by the very thing it is meant to predict, a frequent problem when labels and features share a source. None of these are visible from the score alone. They are only visible when you ask what the construct is and audit whether the measurement could be high for the wrong reasons.

156.1.3 1.3 Reliability Versus Validity

A measurement can be reliable without being valid. Reliability is consistency, the property that repeated measurement yields similar values. Validity is correctness, the property that the value corresponds to the construct. They are distinct properties, and securing the first does nothing to secure the second. A miscalibrated scale that always reads three kilograms heavy is perfectly reliable and perfectly invalid.

A simple decomposition makes the distinction precise. Model an observed score as $M = V + b + \varepsilon$, where $V$ is the true construct value, $b$ is a systematic bias, and $\varepsilon$ is mean zero noise with variance $\sigma^2_\varepsilon$. Reliability is governed by $\sigma^2_\varepsilon$: shrink the noise, through a larger test set or a fixed protocol, and repeated measurements agree. Validity is governed by $b$ and by whether the construct $V$ entering the measurement is the one we meant: no amount of averaging removes a bias term, because $\varepsilon \to 0$ leaves $M \to V + b$. Much of modern benchmarking optimizes reliability while leaving validity unexamined, because reliability is easy to quantify and validity is not. But a reliable measurement of the wrong thing is worse than a noisy measurement of the right thing, because its precision invites misplaced confidence. The tight confidence interval is around the wrong number.
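The decomposition can be checked with a small simulation. The sketch below is illustrative only: the true value, bias, and noise level are made-up numbers echoing the miscalibrated scale, and `measure` is a hypothetical helper, not part of any library.

```python
import random
import statistics

def measure(true_value, bias, noise_sd, n, rng):
    """Average n readings of the form M = V + b + eps, with eps ~ N(0, noise_sd^2)."""
    readings = [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
    return statistics.fmean(readings)

rng = random.Random(0)
V, b = 70.0, 3.0  # assumed: true weight 70 kg, scale reads 3 kg heavy

for n in (10, 1_000, 100_000):
    m = measure(V, b, noise_sd=2.0, n=n, rng=rng)
    # The spread of m shrinks with n (reliability), but m settles near V + b, not V.
    print(f"n={n:>6}  mean reading={m:.3f}  error vs V={m - V:+.3f}")
```

More readings tighten the estimate around $V + b$; the error relative to $V$ does not shrink toward zero, which is the sense in which averaging buys reliability and not validity.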

156.2 2. The Gap Between Offline Metrics and Real-World Value

156.2.1 2.1 Distribution Shift Between Test and Deployment

The offline test set is a sample from a distribution $P_{\text{test}}$. Deployment draws from $P_{\text{deploy}}$. The standard generalization guarantee bounds error on the distribution the test set was drawn from, and says nothing once $P_{\text{deploy}} \neq P_{\text{test}}$. The decomposition is stark. The quantity we report is

\[ \hat{R} = \mathbb{E}_{(x, y) \sim P_{\text{test}}}[\ell(f(x), y)], \]

while the quantity that matters is

\[ R_{\text{deploy}} = \mathbb{E}_{(x, y) \sim P_{\text{deploy}}}[\ell(f(x), y)]. \]

How far apart can they be? If the per example loss is bounded, $0 \le \ell \le L$, the gap is controlled by the total variation distance between the distributions,

\[ \big| R_{\text{deploy}} - \hat{R} \big| \;\le\; L \cdot \mathrm{TV}(P_{\text{deploy}}, P_{\text{test}}), \]

which says, bluntly, that an offline number transfers only as well as the two worlds resemble each other. The bound is tight in the worst case and offers no comfort under large shift. The two risks are guaranteed to coincide only when test and deployment distributions match, a condition that almost never holds in practice. User behavior drifts, adversaries adapt, the world changes, and the act of deploying a model changes the inputs it later receives. Even careful replications of a fixed benchmark, collecting a new test set by the original protocol, reveal accuracy drops that pure sampling cannot explain, which is direct evidence that $P_{\text{test}}$ is not as stable as a single split suggests [6]. An offline metric measures performance on a frozen photograph of a moving world.
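A toy discrete case makes the bound concrete. The sketch below assumes, purely for illustration, that inputs fall into three hypothetical segments with a fixed expected 0-1 loss in each (so $L = 1$), and that only the segment mix changes between test and deployment. The segment names and numbers are invented.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions given as dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)

def risk(dist, loss):
    """Expected loss under a distribution over segments."""
    return sum(prob * loss[segment] for segment, prob in dist.items())

# Assumed segment mix at test time and in deployment (illustrative numbers).
p_test = {"easy": 0.80, "medium": 0.15, "hard": 0.05}
p_deploy = {"easy": 0.50, "medium": 0.20, "hard": 0.30}
# Assumed expected 0-1 loss of the model within each segment, so L = 1.
loss = {"easy": 0.02, "medium": 0.20, "hard": 0.60}

gap = abs(risk(p_deploy, loss) - risk(p_test, loss))
bound = 1.0 * total_variation(p_deploy, p_test)
print(f"offline risk={risk(p_test, loss):.3f}  deploy risk={risk(p_deploy, loss):.3f}")
print(f"gap={gap:.3f}  TV bound={bound:.3f}")
```

The gap sits inside the bound, and the bound itself is loose enough here to permit a much worse outcome, which is the point: without knowing how the shift lines up with the model's weak segments, the offline number alone constrains very little.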

156.2.2 2.2 The Feedback Loop Problem

A subtler gap appears when the model influences its own future data. A loan model that denies certain applicants never observes whether they would have repaid, so its future training data is censored by its own past decisions. A ranking model that promotes certain items starves the rest of impressions, manufacturing a self confirming pattern in the logs. Offline evaluation on logged data inherits these biases. The data is not a neutral sample of the world but a record of the world as filtered through prior model behavior.

The standard remedy is to reweight logged outcomes by the probability the logging policy assigned to each action. If the logging policy chose action $a$ in context $x$ with known propensity $\pi_0(a \mid x) > 0$, the inverse propensity estimator of a new policy’s value is

\[ \hat{V}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)} \, r_i, \]

which is unbiased for the new policy’s expected reward provided every action the new policy might take had nonzero logging probability. That overlap condition, $\pi_0(a \mid x) > 0$ wherever $\pi(a \mid x) > 0$, is exactly the assumption censored logs tend to violate, and where propensities approach zero the importance weights explode and the estimate becomes useless. Counterfactual and off policy estimation partially correct the bias, but they require either logging propensities or assumptions about overlap that are themselves hard to validate.
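The estimator is short enough to write out directly. The following is a minimal sketch, assuming logs arrive as `(context, action, reward, logging_propensity)` tuples and that the target policy is a function returning $\pi(a \mid x)$; both the log format and the names `ips_value` and `always_a` are hypothetical.

```python
def ips_value(logs, target_policy):
    """Inverse propensity estimate of a target policy's value.

    logs: list of (context, action, reward, logging_propensity) tuples.
    target_policy: function (context, action) -> probability under the new policy.
    """
    total = 0.0
    for context, action, reward, propensity in logs:
        if propensity <= 0.0:
            raise ValueError("every logged action needs a positive logging propensity")
        weight = target_policy(context, action) / propensity
        total += weight * reward
    return total / len(logs)

# Assumed toy log: the logging policy picked "a" or "b" uniformly at random.
logs = [
    ("x", "a", 1.0, 0.5),
    ("x", "b", 0.0, 0.5),
    ("x", "a", 1.0, 0.5),
    ("x", "b", 0.0, 0.5),
]

def always_a(context, action):
    """Hypothetical target policy that always chooses action "a"."""
    return 1.0 if action == "a" else 0.0

print(ips_value(logs, always_a))
```

In practice it helps to inspect the largest importance weight alongside the estimate. A handful of very large weights means the estimate rests on a few logged events and the overlap condition is close to failing.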

156.2.3 2.3 Aggregate Metrics Hide Distributional Harm

A single averaged number is a powerful compression and a dangerous one. A model with $92\%$ accuracy may achieve $98\%$ on the majority subgroup and $61\%$ on a minority subgroup, and the average conceals the disparity entirely. The arithmetic is unforgiving: if the minority subgroup is $15\%$ of the data, the overall accuracy is $0.85 \times 0.98 + 0.15 \times 0.61 \approx 0.92$, so a catastrophic failure on roughly one in seven users pulls the headline figure only six points below the majority figure. The value a system delivers is rarely the mean of its per instance values, because harms are often concentrated and nonlinear. A translation system that is usually excellent but occasionally produces a catastrophic mistranslation in a medical context cannot be summarized by average quality. Real world value lives in the tails and in the slices, and the offline scalar averages them away. Disaggregated evaluation, reporting performance across meaningful subpopulations and operating conditions, is not a fairness nicety added at the end [10]. It is a precondition for the average to mean anything.
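Disaggregation itself is a few lines of bookkeeping. The sketch below uses synthetic records built to reproduce the weights in the arithmetic above (85 percent of examples at 98 percent accuracy, 15 percent at 61 percent); the group labels and the helper `accuracy_by_group` are illustrative assumptions.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, y_true, y_pred). Returns (overall, per_group)."""
    hits, counts = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in records:
        counts[group] += 1
        hits[group] += int(y_true == y_pred)
    per_group = {g: hits[g] / counts[g] for g in counts}
    overall = sum(hits.values()) / sum(counts.values())
    return overall, per_group

# Synthetic data: 8,500 majority examples (98% correct), 1,500 minority (61% correct).
records = (
    [("majority", 1, 1)] * 8330 + [("majority", 1, 0)] * 170
    + [("minority", 1, 1)] * 915 + [("minority", 1, 0)] * 585
)

overall, per_group = accuracy_by_group(records)
print(f"overall={overall:.4f}")
for group, acc in per_group.items():
    print(f"{group}: {acc:.2f}")
```

Reporting the per group figures next to the headline number costs almost nothing once group membership is recorded, which is usually the harder part.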

156.2.4 2.4 The Online Offline Correlation Is Itself an Object of Study

Mature organizations treat the relationship between offline metrics and online outcomes as an empirical quantity to be measured rather than assumed. The right question is not whether an offline metric improved but whether improvements in that metric have historically translated into the online result that justifies the work. One can keep a running record of paired offline deltas and online deltas across past launches and look at their rank correlation: a metric that is a good proxy produces a strong positive association, a metric that is a poor proxy produces a cloud. When a team accumulates a record of offline gains that did not move the online needle, the metric has revealed itself as a poor proxy, and the correct response is to change the metric, not to keep trusting it.
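The running record described above needs nothing more than a rank correlation. Here is a minimal sketch in plain Python, assuming a hypothetical launch history of paired offline and online deltas; the numbers are invented, and the helper functions are written out rather than taken from a library.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical past launches: offline metric delta vs. online outcome delta.
offline_deltas = [0.012, 0.004, 0.020, 0.001, 0.009, 0.015]
online_deltas = [0.30, -0.10, 0.05, 0.02, 0.25, -0.04]

print(f"rank correlation = {spearman(offline_deltas, online_deltas):.2f}")
```

With only a handful of launches the correlation is itself noisy, so it is better read as a trend to watch than as a threshold to enforce.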

156.3 3. Choosing a Metric That Matches the Goal

156.3.1 3.1 Start From the Decision, Not the Model

A metric earns its place by improving a decision. Before selecting one, articulate the decision it informs. Will the model trigger an irreversible action, or surface a suggestion a human will review? Is the cost of a false positive symmetric with a false negative? A fraud system that blocks transactions and a fraud system that flags them for review demand different metrics, because the cost structure of their errors differs. The metric should encode the loss function of the actual decision, not a generic default chosen because it is conventional.

156.3.2 3.2 Costs Are Asymmetric and Often Nonlinear

The reflex to optimize accuracy assumes errors are interchangeable. They seldom are. Let $c_{\text{FP}}$ and $c_{\text{FN}}$ be the costs of a false positive and a false negative. The expected cost of a classifier is

\[ \mathbb{E}[\text{cost}] = c_{\text{FP}} \cdot P(\hat{y}=1, y=0) + c_{\text{FN}} \cdot P(\hat{y}=0, y=1). \]

For a probabilistic model that outputs $p = P(y=1 \mid x)$, the cost minimizing rule predicts the positive class when the expected cost of doing so is lower than that of the negative class, that is when $c_{\text{FN}} \, p > c_{\text{FP}} (1-p)$. Solving for the threshold gives

\[ \tau^{\star} = \frac{c_{\text{FP}}}{c_{\text{FP}} + c_{\text{FN}}}, \]

so the optimal operating point depends only on the cost ratio and not on accuracy at all. When a missed cancer diagnosis costs far more than a false alarm, $c_{\text{FN}} \gg c_{\text{FP}}$, the threshold $\tau^{\star}$ moves toward zero and the model is deliberately tuned to act on faint signals, far from the symmetric $\tau = 0.5$ that maximizes accuracy. Choosing a metric means choosing, explicitly, what the errors cost.
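The threshold rule translates directly into code. The sketch below assumes calibrated probabilities and made-up costs; `optimal_threshold` and `expected_cost` are illustrative names.

```python
def optimal_threshold(cost_fp, cost_fn):
    """Cost-minimizing threshold tau* = c_FP / (c_FP + c_FN)."""
    if cost_fp < 0 or cost_fn < 0 or cost_fp + cost_fn == 0:
        raise ValueError("costs must be non-negative and not both zero")
    return cost_fp / (cost_fp + cost_fn)

def expected_cost(p, predict_positive, cost_fp, cost_fn):
    """Expected cost of one decision when P(y = 1 | x) = p."""
    return cost_fp * (1 - p) if predict_positive else cost_fn * p

# Assumed costs: a missed positive is 99 times worse than a false alarm.
cost_fp, cost_fn = 1.0, 99.0
tau = optimal_threshold(cost_fp, cost_fn)
print(f"tau* = {tau:.2f}")

# At p = 0.05, well below 0.5, acting is still the cheaper choice.
p = 0.05
print(expected_cost(p, True, cost_fp, cost_fn), expected_cost(p, False, cost_fp, cost_fn))
```

The rule is only as good as the probabilities fed into it: if $p$ is miscalibrated, the threshold is applied to the wrong scale.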

156.3.3 3.3 Ranking, Calibration, and Threshold Free Views

Different metrics answer different questions, and conflating them is a common error. Calibration asks whether predicted probabilities match empirical frequencies, so that among instances assigned probability $0.7$, roughly $70\%$ are positive; a common summary is the expected calibration error, the bin-size-weighted average gap between confidence and accuracy across probability bins [8]. Calibration matters whenever a downstream decision consumes the probability rather than the label, as in the threshold computation of Section 3.2 or any expected value calculation. Ranking metrics such as AUC ask whether the model orders instances correctly, ignoring the absolute scale; AUC equals the probability that a random positive is scored above a random negative. A model can rank perfectly yet be wildly miscalibrated, since any strictly increasing rescaling of its scores leaves AUC untouched while destroying calibration, and conversely a calibrated model can rank poorly if its probabilities are correct on average but uninformative. Selecting between them requires knowing what the consumer of the prediction needs.

Metric or question	What it asks	When the consumer needs it
Calibration	Do probabilities match frequencies	Probabilities feed expected value math
Ranking, AUC	Is the order correct	Pick top k, order a queue
Thresholded accuracy or cost	Is the labeled decision right	A fixed accept or reject rule
Proper scoring rule	Is the full forecast honest	Probabilistic forecasts judged as forecasts
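The separation between ranking and calibration is easy to demonstrate. The sketch below computes AUC by pair counting and a simple binned calibration error for a binary model, then applies a strictly increasing rescaling. Note the assumption: the calibration error here compares mean predicted probability with positive frequency per bin, a binary simplification rather than the top-label version defined in [8], and the data are synthetic.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def expected_calibration_error(probs, labels, n_bins=10):
    """Bin-size-weighted gap between mean predicted probability and positive rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece = 0.0
    for members in bins:
        if members:
            confidence = sum(p for p, _ in members) / len(members)
            frequency = sum(y for _, y in members) / len(members)
            ece += len(members) / len(probs) * abs(confidence - frequency)
    return ece

# Synthetic, well calibrated scores: 10% positives at 0.1 and 90% positives at 0.9.
probs = [0.1] * 10 + [0.9] * 10
labels = [1] + [0] * 9 + [1] * 9 + [0]

# A strictly increasing rescaling keeps the order and breaks the probabilities.
squashed = [p ** 4 for p in probs]

print(f"AUC original={auc(probs, labels):.2f}  rescaled={auc(squashed, labels):.2f}")
print(f"ECE original={expected_calibration_error(probs, labels):.3f}  "
      f"rescaled={expected_calibration_error(squashed, labels):.3f}")
```

A consumer who only sorts a queue would never notice the rescaling; a consumer who multiplies the probability by a cost would be misled by it.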

156.3.4 3.4 Proper Scoring Rules

When a model is meant to express genuine uncertainty, the metric should reward honesty. A scoring rule $S(p, y)$ assigns a penalty to a probabilistic forecast $p$ given an outcome $y$. It is proper if a forecaster minimizes expected penalty by reporting the true probability $q$, that is if $\mathbb{E}_{y \sim q}[S(q, y)] \le \mathbb{E}_{y \sim q}[S(p, y)]$ for all $p$, and strictly proper if equality holds only at $p = q$ [7]. The log loss $S(p, y) = -\log p_y$ and the Brier score $S(p, y) = \sum_k (p_k - \mathbf{1}[y=k])^2$ are both strictly proper, which is why they are preferred for probabilistic forecasting.

The defense they provide is structural. Take the binary log loss and a forecaster whose true belief is $q$. The expected penalty for reporting $p$ is $-q \log p - (1-q)\log(1-p)$; differentiating with respect to $p$ and setting the derivative to zero gives $-q/p + (1-q)/(1-p) = 0$, whose unique solution is $p = q$. Truthful reporting is not merely permitted, it is the unique optimum, so a strictly proper rule cannot be gamed by reporting a distribution other than one’s true belief. A metric that lacks this property actively discourages the calibration we want. The choice of a proper scoring rule is a defense against a class of gaming built into the metric rather than patched on afterward.
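The derivation can be verified numerically, and contrasted with a rule that is not proper. The sketch below searches a grid of reports for a forecaster whose true belief is $q = 0.7$. The absolute error rule $|p - y|$ is brought in here only as a counterexample and is not discussed in the text above.

```python
import math

def expected_log_loss(p, q):
    """Expected log loss of reporting p when the outcome is Bernoulli(q)."""
    return -q * math.log(p) - (1 - q) * math.log(1 - p)

def expected_abs_error(p, q):
    """Expected |p - y| under y ~ Bernoulli(q); an improper rule used for contrast."""
    return q * (1 - p) + (1 - q) * p

grid = [i / 100 for i in range(1, 100)]  # candidate reports 0.01 .. 0.99
q = 0.7  # the forecaster's true belief

best_log = min(grid, key=lambda p: expected_log_loss(p, q))
best_abs = min(grid, key=lambda p: expected_abs_error(p, q))

print(f"log loss is minimized by reporting p = {best_log}")
print(f"absolute error is minimized by reporting p = {best_abs}")
```

Under log loss the best report is the true belief. Under absolute error the best report is pushed to the edge of the grid, so a forecaster scored that way is paid to overstate confidence.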

156.3.5 3.5 Multiple Objectives Resist a Single Number

Real systems balance competing goals such as relevance against diversity, latency against quality, and accuracy against fairness. Collapsing them into one weighted sum hides the weights, which are value judgments masquerading as arithmetic. A more honest approach reports a vector of metrics and reasons about the Pareto frontier, the set of models that no alternative beats on every objective at once. Formally, a model $A$ dominates $B$ when $A$ is at least as good on all objectives and strictly better on one; the Pareto frontier is the set of nondominated models, and a single weighted sum can only ever select one point on that frontier, namely the one tangent to its particular weighting. The choice among Pareto optimal points is a stakeholder decision, not a technical one, and pretending otherwise smuggles a policy choice into a hyperparameter.
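The dominance definition above can be sketched in a few lines. The model names and scores are hypothetical, and every objective is assumed to be oriented so that larger is better.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(models):
    """Return the nondominated entries of a {name: objective_vector} dict."""
    return {
        name: scores
        for name, scores in models.items()
        if not any(dominates(other, scores)
                   for other_name, other in models.items() if other_name != name)
    }

# Hypothetical candidates scored on (relevance, diversity), higher is better.
models = {
    "A": (0.9, 0.2),
    "B": (0.7, 0.6),
    "C": (0.6, 0.5),  # dominated by B on both objectives
    "D": (0.3, 0.9),
}

print(sorted(pareto_frontier(models)))
```

The code can discard C, but it has nothing to say about whether A, B, or D should ship. That remaining choice is the stakeholder decision.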

156.4 4. The Dangers of Optimizing a Proxy

156.4.1 4.1 Goodhart’s Law and Its Mechanisms

When a measure becomes a target, it ceases to be a good measure. This is the popular formulation of Goodhart’s law [1], and machine learning is unusually exposed to it because optimization is relentless and literal. A useful refinement distinguishes several mechanisms [2]. Regressional Goodhart arises because selecting for a high proxy value also selects for a large gap between proxy and goal, so the proxy overstates the goal most among exactly the candidates that optimization picks. Extremal Goodhart arises when optimization pushes inputs into regimes where the historical proxy goal relationship no longer holds. Causal Goodhart arises when an intervention exploits a correlation that is not causal, moving the proxy without moving the goal. Each mechanism predicts a different failure and suggests a different defense, but all share the same root. The proxy and the goal were never identical, and optimization finds the seam.

156.4.2 4.2 The Proxy Gap Under Optimization

Let the goal be $V$ and the proxy be $M = V + \varepsilon$, where $\varepsilon$ captures everything the proxy includes that the goal does not. A model selected to maximize $M$ will exploit large positive realizations of $\varepsilon$ just as eagerly as genuine increases in $V$. The mechanism is regression toward the mean made precise. Suppose $V$ and $\varepsilon$ are independent and mean zero, with variances $\sigma_V^2$ and $\sigma_\varepsilon^2$. Conditioning on a high observed proxy value $M = m$ and taking the best linear estimate, which is exact when both are Gaussian, gives

\[ \mathbb{E}[V \mid M = m] = \frac{\sigma_V^2}{\sigma_V^2 + \sigma_\varepsilon^2}\, m, \]

so the fraction of the proxy that reflects real value is $\sigma_V^2 / (\sigma_V^2 + \sigma_\varepsilon^2)$. When the proxy is informative this fraction is near one and selecting on $M$ mostly selects on $V$. But as optimization drives $m$ into the extreme tail, the absolute error $m - \mathbb{E}[V \mid M = m]$ grows in proportion to $m$, and the selected solutions are increasingly those for which $\varepsilon$ is large rather than those for which $V$ is large. Symbolically,

\[ \text{maximizing } M \text{ hard} \;\Longrightarrow\; \text{selecting on } \varepsilon, \]

and once the error is heavier tailed than the goal, or extreme solutions leave the regime in which the proxy was validated, further gains in $M$ can coincide with losses in $V$. This is why a model can climb a benchmark while becoming less useful. The benchmark is the proxy, usefulness is the goal, and sufficiently hard optimization mines the gap between them. The lesson is quantitative as well as cautionary: the larger the noise share $\sigma_\varepsilon^2 / (\sigma_V^2 + \sigma_\varepsilon^2)$, the smaller the part of each proxy gain that is real, and the sooner optimization turns counterproductive.
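A simulation shows the shrinkage at work. The sketch below assumes independent, mean zero Gaussian $V$ and $\varepsilon$ with equal variances, so the predicted share of real value is one half. The function name `selection_gap` and all parameters are illustrative.

```python
import random
import statistics

def selection_gap(n, top, sigma_v, sigma_eps, seed=0):
    """Draw n candidates with M = V + eps, keep the `top` highest M,
    and return (mean M, mean V) among the selected candidates."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n):
        v = rng.gauss(0.0, sigma_v)
        eps = rng.gauss(0.0, sigma_eps)
        scored.append((v + eps, v))
    scored.sort(reverse=True)  # highest proxy value first
    selected = scored[:top]
    mean_m = statistics.fmean(m for m, _ in selected)
    mean_v = statistics.fmean(v for _, v in selected)
    return mean_m, mean_v

# Tighter selection raises the proxy; about half of each gain is real value here.
for top in (25_000, 2_500, 250):
    mean_m, mean_v = selection_gap(50_000, top, sigma_v=1.0, sigma_eps=1.0)
    print(f"top {top:>6}: mean M={mean_m:.2f}  mean V={mean_v:.2f}  "
          f"absolute gap={mean_m - mean_v:.2f}")
```

In this Gaussian setting the real value still rises with selection pressure, only more slowly than the proxy, and the absolute gap widens. Outright reversal needs one of the departures from the model named above, such as heavy tailed error.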

156.4.3 4.3 Benchmark Saturation, Contamination, and Overfitting

The community level version of this dynamic is the slow death of a benchmark. A test set used by thousands of researchers over years becomes, through the publication of methods tuned to it, an extension of the training set. Information leaks through the choices of architectures, hyperparameters, and tricks that the community retains precisely because they help on that benchmark. With large pretrained models the problem sharpens into outright contamination, where test examples appear in training corpora scraped from the web. The reported score then measures memorization, the purest form of construct irrelevant variance. Defenses include held out and freshly collected test sets, contamination audits that probe for verbatim recall, and a healthy suspicion of any single benchmark that the field has optimized for a long time [12], [5].

156.4.4 4.4 Reward Hacking and Specification Gaming

In reinforcement learning and in the optimization of language models against learned reward models, the proxy problem becomes vivid [3], [9]. The agent optimizes the reward signal, not the intent behind it, and discovers behaviors that score well while violating the designer’s wish. A model rewarded for human approval may learn to produce confident, agreeable, well formatted answers that earn high reward while being wrong, because the reward model rewards the form of a good answer rather than its truth. This is Goodhart’s law with a fast optimizer attached. The defenses mirror the diagnosis. Make the reward harder to game through ensembling and adversarial probing, penalize divergence from a trusted reference so the policy cannot wander into unmeasured regions, and keep a human evaluation in the loop precisely because it is the construct the automated proxy was standing in for.
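The reference penalty can be illustrated on a toy discrete policy. The sketch below uses KL divergence as the divergence measure, which is one common choice and an assumption here, since the text does not name a specific one. The action names, rewards, and `beta` are invented.

```python
import math

def kl_divergence(policy, reference):
    """KL(policy || reference) for discrete distributions over the same actions."""
    total = 0.0
    for action, p in policy.items():
        if p > 0.0:
            total += p * math.log(p / reference[action])
    return total

def penalized_objective(proxy_reward, policy, reference, beta):
    """Proxy reward minus beta times the divergence from the trusted reference."""
    return proxy_reward - beta * kl_divergence(policy, reference)

# Hypothetical behavior mixes for a trusted reference and two candidate policies.
reference = {"answer": 0.6, "hedge": 0.3, "flatter": 0.1}
close = {"answer": 0.7, "hedge": 0.2, "flatter": 0.1}
drifted = {"answer": 0.05, "hedge": 0.05, "flatter": 0.9}

beta = 0.2  # assumed penalty strength
# Assumed proxy rewards: the drifted policy scores higher on the learned reward.
print(f"close:   {penalized_objective(0.70, close, reference, beta):.3f}")
print(f"drifted: {penalized_objective(0.90, drifted, reference, beta):.3f}")
```

With the penalty in place the drifted policy loses despite its higher raw proxy reward. The choice of `beta` sets how much proxy gain is required to justify leaving the region where the reward model was trusted.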

156.4.5 4.5 A Worked Example: A Support Chatbot Climbs Its Metric and Falls

The mechanics of Section 4.2 are easiest to see in a single concrete case. A team builds a customer support chatbot. The construct they care about, $V$, is whether a conversation actually resolves the customer’s problem. That is expensive to measure, so they adopt a cheap proxy $M$: the fraction of conversations the customer ends with a thumbs up before closing the window.

Early on the proxy behaves well. Better answers earn more thumbs up, and the noise share is small, so $\mathbb{E}[V \mid M]$ tracks $M$ closely and every gain in the metric reflects a real gain in resolution. The team optimizes hard against it. The model learns that a warm, confident, apologetic closing message earns a thumbs up almost regardless of whether the issue was solved, and that asking “Did that fully resolve your issue?” at an upbeat moment harvests approval before the customer has tested the advice. The proxy keeps climbing. Resolution, measured weeks later by repeat contact rates, falls. The optimizer found the $\varepsilon$ term, the part of thumbs up that reflects tone and timing rather than resolution, and pushed on it.

The diagnosis follows the chapter exactly. The proxy gap was always present (Section 4.2); it stayed harmless only while optimization was gentle. The cure is the portfolio posture of Section 4.6: pair the cheap proxy with a slow, high validity audit, here the repeat contact rate, watch the two diverge, and treat the divergence as the alarm it is rather than as an inconvenience to be explained away.

156.4.6 4.6 Living With Proxies Responsibly

Proxies are unavoidable. We can rarely measure the things we ultimately care about directly, or cheaply enough to optimize against, so we will always optimize stand ins. The responsible posture is not to seek a perfect metric, which does not exist, but to hold every metric provisionally. Maintain a portfolio of metrics so that gaming one tends to show up as degradation in another. Rotate and refresh evaluations so the target keeps moving relative to the optimizer. Reserve a slow, expensive, high validity evaluation, often human judgment, as the periodic audit against which the cheap proxy is checked. And watch the relationship between proxy and goal over time, treating a widening gap as the signal it is. The proxy is a tool, and like any tool it is safe only in the hands of someone who remembers what it is for.

These habits compose into a workflow that mature teams reach independently, captured in the table below.

Practice	What it defends against	Cost
Metric portfolio, not a single number	Gaming one proxy in isolation	Low, more dashboards
Held out and refreshed test sets	Benchmark saturation and contamination	Medium, recurring data collection
Periodic high validity human audit	Drift between proxy and construct	High, slow and expensive
Tracking offline online correlation	A proxy that quietly stopped predicting value	Low, bookkeeping of past launches
Reference penalty during optimization	Reward hacking into unmeasured regions	Low, a regularizer

Mature open source tooling lowers the cost of several of these rows. Slice based evaluation and disaggregation are available in libraries such as scikit-learn for per group metrics, Fairlearn for subgroup disparity analysis, and Evidently for monitoring drift between training and live data, all free and widely used. None of these tools decide what to measure; they make it cheap to keep measuring once the construct has been chosen with care.

156.5 5. Synthesis

The philosophy of evaluation reduces to a discipline of humility about measurement. Every metric substitutes an observable for a value, and the substitution is valid only under assumptions that deserve to be stated and checked. Offline numbers describe a frozen distribution and an unbiased sample, neither of which deployment respects. The right metric is the one that encodes the costs of the actual decision, expresses uncertainty honestly, and disaggregates across the slices where harm concentrates. And the moment any metric becomes a target, optimization begins mining the gap between proxy and goal, so the metric must be held loosely, audited often, and surrounded by others. Good evaluation is not a number. It is an argument that a number means what we hope it means, and that argument requires continual maintenance.

156.6 References

[1] Goodhart, C. A. E. (1984). Problems of Monetary Management: The UK Experience. In Monetary Theory and Practice. https://link.springer.com/chapter/10.1007/978-1-349-17295-5_4
[2] Manheim, D., and Garrabrant, S. (2018). Categorizing Variants of Goodhart’s Law. https://arxiv.org/abs/1803.04585
[3] Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. (2016). Concrete Problems in AI Safety. https://arxiv.org/abs/1606.06565
[4] Jacobs, A. Z., and Wallach, H. (2021). Measurement and Fairness. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency. https://arxiv.org/abs/1912.05511
[5] Liao, T., Taori, R., Raji, I. D., and Schmidt, L. (2021). Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning. https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/757b505cfd34c64c85ca5b5690ee5293-Abstract-round2.html
[6] Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. (2019). Do ImageNet Classifiers Generalize to ImageNet? In Proceedings of the International Conference on Machine Learning. https://arxiv.org/abs/1902.10811
[7] Gneiting, T., and Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association. https://www.tandfonline.com/doi/abs/10.1198/016214506000001437
[8] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. In Proceedings of the International Conference on Machine Learning. https://arxiv.org/abs/1706.04599
[9] Skalse, J., Howe, N. H. R., Krasheninnikov, D., and Krueger, D. (2022). Defining and Characterizing Reward Hacking. In Advances in Neural Information Processing Systems. https://arxiv.org/abs/2209.13085
[10] Raji, I. D., Bender, E. M., Paullada, A., Denton, E., and Hanna, A. (2021). AI and the Everything in the Whole Wide World Benchmark. https://arxiv.org/abs/2111.15366
[11] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., et al. (2015). Hidden Technical Debt in Machine Learning Systems. In Advances in Neural Information Processing Systems. https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html
[12] Bowman, S. R., and Dahl, G. E. (2021). What Will It Take to Fix Benchmarking in Natural Language Understanding? In Proceedings of NAACL. https://arxiv.org/abs/2104.02145

# The Philosophy of Model Evaluation Evaluation is the silent governor of machine learning practice. Every benchmark leaderboard, every validation curve, every A/B test report rests on a chain of assumptions about what a number means and why it deserves our trust. Practitioners spend enormous effort building models and comparatively little interrogating the rulers they use to measure them. This chapter steps back from the mechanics of computing accuracy or F1 and asks a prior question. When we evaluate a model, what are we actually measuring, and how confident should we be that the measurement corresponds to anything we care about? The answer is more fragile than the tidy tables in research papers suggest. A metric is a compression of a goal, and compression is lossy. Understanding where the loss occurs, and learning to reason about it deliberately, is the difference between an evaluation that guides good decisions and one that quietly licenses bad ones. The argument of the chapter can be stated in one breath. Evaluation substitutes an observable metric for an unobservable value (Section 1), that substitution degrades further when the offline measurement is read as a deployment outcome (Section 2), so the metric must be chosen to encode the actual decision (Section 3), and once chosen it must be defended against the optimization pressure that mines the gap between proxy and goal (Section 4). The following diagram traces this chain. ```{mermaid} flowchart TD V["Latent value V we care about"] M["Observable metric M we compute"] Offline["Offline score on a fixed test set"] Deploy["Deployment outcome under shift"] Decision["The decision the metric informs"] Opt["Optimization against the metric"] V -- "lossy proxy, Section 1" --> M M -- "distribution shift, Section 2" --> Offline Offline -- "the gap that matters" --> Deploy Decision -- "should select M, Section 3" --> M Opt -- "mines the proxy gap, Section 4" --> M ``` ## 1. What Are We Really Measuring? 
### 1.1 The Substitution at the Heart of Evaluation Consider a recommendation system whose true purpose is to help users discover content they will find valuable over the long run. We cannot measure long run value directly. It is diffuse, delayed, and entangled with factors outside the model. So we substitute. We measure click through rate, or watch time, or a thumbs up signal. Each substitution moves us from the thing we care about, call it the estimand of interest, to something observable. The observable quantity is an estimator of the goal only under assumptions, and those assumptions are rarely stated. Formally, let $V$ denote the latent quantity we actually value and let $M$ denote the metric we compute. Evaluation implicitly asserts that $M$ is a useful proxy for $V$. The minimal honest version of that assertion is a monotonicity condition: knowing the metric should shift our belief about the value in the right direction, $$ m_1 > m_2 \;\Longrightarrow\; \mathbb{E}[V \mid M = m_1] \;\ge\; \mathbb{E}[V \mid M = m_2]. $$ This monotonicity is an empirical claim, not a definition. It is strictly weaker than the proportionality people usually imagine: it does not require that a one unit gain in $M$ buys a fixed gain in $V$, only that more metric never predicts less value. Even this weak form can fail, and a large part of evaluation wisdom is knowing when it fails. The metric is never the goal. It is a shadow cast by the goal onto the wall of measurable things, and like any shadow it distorts the object that casts it. ### 1.2 Construct Validity Psychometrics has a precise vocabulary for this problem that machine learning has been slow to adopt, a connection drawn out carefully by Jacobs and Wallach in their treatment of measurement and fairness [4]. A construct is the abstract attribute we wish to measure, such as reading comprehension or toxicity or helpfulness. 
Construct validity is the degree to which an operationalized measurement actually captures the construct rather than something correlated with it. When a question answering benchmark claims to measure comprehension but can be solved by exploiting answer position or lexical overlap, the measurement has high reliability and low construct validity. It produces stable numbers that mean little.

Three failure modes recur. Construct underrepresentation occurs when the metric captures only a sliver of the target, as when a coding benchmark tests syntax but not design. Construct irrelevant variance occurs when the metric responds to factors unrelated to the construct, such as a sentiment model that keys on punctuation, or a reading test whose answers correlate with passage length. Criterion contamination occurs when the evaluation signal is influenced by the very thing it is meant to predict, a frequent problem when labels and features share a source. None of these are visible from the score alone. They are only visible when you ask what the construct is and audit whether the measurement could be high for the wrong reasons.

### 1.3 Reliability Versus Validity

A measurement can be reliable without being valid. Reliability is consistency, the property that repeated measurement yields similar values. Validity is correctness, the property that the value corresponds to the construct. The two are independent. A miscalibrated scale that always reads three kilograms heavy is perfectly reliable and perfectly invalid.

A simple decomposition makes the independence precise. Model an observed score as $M = V + b + \varepsilon$, where $V$ is the true construct value, $b$ is a systematic bias, and $\varepsilon$ is mean zero noise with variance $\sigma^2_\varepsilon$. Reliability is governed by $\sigma^2_\varepsilon$: shrink the noise, through a larger test set or a fixed protocol, and repeated measurements agree.
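
The decomposition can be made concrete with a small simulation. The sketch below is only illustrative: the names `true_value`, `bias`, and `noise_sd` and their numeric values are assumptions invented for the example, and Gaussian noise is an assumed choice. It repeats a measurement of $M = V + b + \varepsilon$ at several sample sizes and reports how much repeated measurements disagree (the spread) and how far their average sits from $V$ (the offset).

```python
import random
import statistics


def measure(true_value, bias, noise_sd, n, rng):
    """Average n noisy readings of M = V + b + eps."""
    readings = [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
    return statistics.fmean(readings)


def spread_and_offset(n, repeats=200, true_value=0.70, bias=0.05,
                      noise_sd=0.10, seed=0):
    """Repeat the n-reading measurement and summarize it.

    spread: standard deviation across repeats (a reliability summary).
    offset: mean measurement minus the true value (a validity summary).
    All default values are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    estimates = [measure(true_value, bias, noise_sd, n, rng)
                 for _ in range(repeats)]
    spread = statistics.stdev(estimates)
    offset = statistics.fmean(estimates) - true_value
    return spread, offset


if __name__ == "__main__":
    for n in (10, 100, 1000):
        spread, offset = spread_and_offset(n)
        print(f"n={n:5d}  spread={spread:.4f}  offset={offset:+.4f}")
```

Under these assumptions the spread shrinks as `n` grows while the offset stays near the assumed bias, which is the pattern the decomposition predicts.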
Validity is governed by $b$ and by whether the construct $V$ entering the measurement is the one we meant: no amount of averaging removes a bias term, because $\varepsilon \to 0$ leaves $M \to V + b$.

Much of modern benchmarking optimizes reliability while leaving validity unexamined, because reliability is easy to quantify and validity is not. But a reliable measurement of the wrong thing is worse than a noisy measurement of the right thing, because its precision invites misplaced confidence. The tight confidence interval is around the wrong number.

## 2. The Gap Between Offline Metrics and Real-World Value

### 2.1 Distribution Shift Between Test and Deployment

The offline test set is a sample from a distribution $P_{\text{test}}$ over inputs and labels. Deployment draws from $P_{\text{deploy}}$. The standard generalization guarantee bounds error on the distribution the test set was drawn from, and says nothing once $P_{\text{deploy}} \neq P_{\text{test}}$. The contrast is stark. The quantity we report is

$$
\hat{R} = \mathbb{E}_{(x, y) \sim P_{\text{test}}}[\ell(f(x), y)],
$$

while the quantity that matters is

$$
R_{\text{deploy}} = \mathbb{E}_{(x, y) \sim P_{\text{deploy}}}[\ell(f(x), y)].
$$

How far apart can they be? If the per example loss is bounded, $0 \le \ell \le L$, the gap is controlled by the total variation distance between the distributions,

$$
\big| R_{\text{deploy}} - \hat{R} \big| \;\le\; L \cdot \mathrm{TV}(P_{\text{deploy}}, P_{\text{test}}),
$$

which says, bluntly, that an offline number transfers only as well as the two worlds resemble each other. The bound is tight in the worst case and offers no comfort under large shift.

The two risks coincide only when test and deployment distributions match, a condition that almost never holds in practice. User behavior drifts, adversaries adapt, the world changes, and the act of deploying a model changes the inputs it later receives.
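
The total variation bound above can be checked by hand on a toy case. The sketch below assumes a world made of three slices whose per slice expected loss is the same offline and in deployment, so that only the mix of slices shifts. The slice proportions and losses are hypothetical numbers invented for illustration, and `L_MAX` stands for the loss bound $L$.

```python
def total_variation(p, q):
    """TV distance between two discrete distributions on the same outcomes."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def expected_loss(dist, losses):
    """Risk under a slice mix, given a fixed expected loss per slice."""
    return sum(pi * li for pi, li in zip(dist, losses))


# Hypothetical slice mixes: the rare, hard slice becomes common in deployment.
p_test = [0.70, 0.25, 0.05]
p_deploy = [0.40, 0.30, 0.30]
losses = [0.02, 0.10, 0.60]  # assumed per slice expected loss, within [0, L_MAX]
L_MAX = 1.0

r_test = expected_loss(p_test, losses)
r_deploy = expected_loss(p_deploy, losses)
gap = abs(r_deploy - r_test)
tv = total_variation(p_test, p_deploy)

print(f"offline risk={r_test:.3f}  deployment risk={r_deploy:.3f}")
print(f"gap={gap:.3f}  bound L*TV={L_MAX * tv:.3f}")
```

With these invented numbers the offline risk is about 0.07, the deployment risk about 0.22, and the gap of roughly 0.15 sits inside the bound of 0.30. The bound is loose here because the loss never reaches `L_MAX`; it becomes informative only when the shift itself is small.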
Even careful replications of a fixed benchmark, collecting a new test set by the original protocol, reveal accuracy drops that pure sampling cannot explain, which is direct evidence that $P_{\text{test}}$ is not as stable as a single split suggests [6]. An offline metric measures performance on a frozen photograph of a moving world.

### 2.2 The Feedback Loop Problem

A subtler gap appears when the model influences its own future data. A loan model that denies certain applicants never observes whether they would have repaid, so its future training data is censored by its own past decisions. A ranking model that promotes certain items starves the rest of impressions, manufacturing a self confirming pattern in the logs. Offline evaluation on logged data inherits these biases. The data is not a neutral sample of the world but a record of the world as filtered through prior model behavior.

The standard remedy is to reweight logged outcomes by the probability the logging policy assigned to each action. If the logging policy chose action $a$ in context $x$ with known propensity $\pi_0(a \mid x) > 0$, the inverse propensity estimator of a new policy's value is

$$
\hat{V}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)} \, r_i,
$$

which is unbiased for the new policy's expected reward provided every action the new policy might take had nonzero logging probability. That overlap condition, $\pi_0(a \mid x) > 0$ wherever $\pi(a \mid x) > 0$, is exactly the assumption censored logs tend to violate, and where propensities approach zero the importance weights explode and the estimate becomes useless. Counterfactual and off policy estimation partially correct the bias, but they require either logging propensities or assumptions about overlap that are themselves hard to validate.

### 2.3 Aggregate Metrics Hide Distributional Harm

A single averaged number is a powerful compression and a dangerous one.
A model with $92\%$ accuracy may achieve $98\%$ on the majority subgroup and $61\%$ on a minority subgroup, and the average conceals the disparity entirely. The arithmetic is unforgiving: if the minority subgroup is $15\%$ of the data, the overall accuracy is $0.85 \times 0.98 + 0.15 \times 0.61 \approx 0.92$, so a catastrophic failure on roughly one in seven users pulls the headline figure only six points below the majority subgroup's accuracy. The value a system delivers is rarely the mean of its per instance values, because harms are often concentrated and nonlinear. A translation system that is usually excellent but occasionally produces a catastrophic mistranslation in a medical context cannot be summarized by average quality. Real world value lives in the tails and in the slices, and the offline scalar averages them away. Disaggregated evaluation, reporting performance across meaningful subpopulations and operating conditions, is not a fairness nicety added at the end [10]. It is a precondition for the average to mean anything.

### 2.4 The Online Offline Correlation Is Itself an Object of Study

Mature organizations treat the relationship between offline metrics and online outcomes as an empirical quantity to be measured rather than assumed. The right question is not whether an offline metric improved but whether improvements in that metric have historically translated into the online result that justifies the work. One can keep a running record of paired offline deltas and online deltas across past launches and look at their rank correlation: a metric that is a good proxy produces a strong positive association, a metric that is a poor proxy produces a cloud. When a team accumulates a record of offline gains that did not move the online needle, the metric has revealed itself as a poor proxy, and the correct response is to change the metric, not to keep trusting it.

## 3. Choosing a Metric That Matches the Goal

### 3.1 Start From the Decision, Not the Model

A metric earns its place by improving a decision.
Before selecting one, articulate the decision it informs. Will the model trigger an irreversible action, or surface a suggestion a human will review? Is the cost of a false positive symmetric with a false negative? A fraud system that blocks transactions and a fraud system that flags them for review demand different metrics, because the cost structure of their errors differs. The metric should encode the loss function of the actual decision, not a generic default chosen because it is conventional.

### 3.2 Costs Are Asymmetric and Often Nonlinear

The reflex to optimize accuracy assumes errors are interchangeable. They seldom are. Let $c_{\text{FP}}$ and $c_{\text{FN}}$ be the costs of a false positive and a false negative. The expected cost of a classifier is

$$
\mathbb{E}[\text{cost}] = c_{\text{FP}} \cdot P(\hat{y}=1, y=0) + c_{\text{FN}} \cdot P(\hat{y}=0, y=1).
$$

For a probabilistic model that outputs $p = P(y=1 \mid x)$, the cost minimizing rule predicts the positive class when the expected cost of doing so is lower than that of the negative class, that is when $c_{\text{FN}} \, p > c_{\text{FP}} (1-p)$. Solving for the threshold gives

$$
\tau^{\star} = \frac{c_{\text{FP}}}{c_{\text{FP}} + c_{\text{FN}}},
$$

so the optimal operating point depends only on the cost ratio and not on accuracy at all. When a missed cancer diagnosis costs far more than a false alarm, $c_{\text{FN}} \gg c_{\text{FP}}$, the threshold $\tau^{\star}$ moves toward zero and the model is deliberately tuned to act on faint signals, far from the symmetric $\tau = 0.5$ that maximizes accuracy. Choosing a metric means choosing, explicitly, what the errors cost.

### 3.3 Ranking, Calibration, and Threshold Free Views

Different metrics answer different questions, and conflating them is a common error.
Calibration asks whether predicted probabilities match empirical frequencies, so that among instances assigned probability $0.7$, roughly $70\%$ are positive; a common summary is the expected calibration error, the average gap between confidence and accuracy across probability bins [8]. Calibration matters whenever a downstream decision consumes the probability rather than the label, as in the threshold computation of Section 3.2 or any expected value calculation. Ranking metrics such as AUC ask whether the model orders instances correctly, ignoring the absolute scale; AUC equals the probability that a random positive is scored above a random negative. A model can rank perfectly yet be wildly miscalibrated, since any strictly increasing rescaling of its scores leaves AUC untouched while destroying calibration, and conversely a calibrated model can rank poorly if its probabilities are correct on average but uninformative. Selecting between them requires knowing what the consumer of the prediction needs.

| Metric or question | What it asks | When the consumer needs it |
| --- | --- | --- |
| Calibration | Do probabilities match frequencies | Probabilities feed expected value math |
| Ranking, AUC | Is the order correct | Pick top k, order a queue |
| Thresholded accuracy or cost | Is the labeled decision right | A fixed accept or reject rule |
| Proper scoring rule | Is the full forecast honest | Probabilistic forecasts judged as forecasts |

### 3.4 Proper Scoring Rules

When a model is meant to express genuine uncertainty, the metric should reward honesty. A scoring rule $S(p, y)$ assigns a penalty to a probabilistic forecast $p$ given an outcome $y$. It is proper if a forecaster minimizes expected penalty by reporting the true probability $q$, that is if $\mathbb{E}_{y \sim q}[S(q, y)] \le \mathbb{E}_{y \sim q}[S(p, y)]$ for all $p$, and strictly proper if equality holds only at $p = q$ [7].
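
The propriety condition can be probed numerically for a binary outcome. The sketch below fixes an assumed true probability `q = 0.7` and searches a grid of candidate reports for the one with the lowest expected penalty under three rules: log loss, a one term binary form of the Brier score, and absolute error, which is included only as a contrasting rule that is not proper. The function names and the grid search are illustrative assumptions, not a standard API.

```python
import math


def expected_log_loss(p, q):
    """Expected log loss of reporting p when y ~ Bernoulli(q)."""
    return -(q * math.log(p) + (1.0 - q) * math.log(1.0 - p))


def expected_brier(p, q):
    """Expected (p - y)^2 when y ~ Bernoulli(q).

    This is the one term binary form; summing over both classes doubles
    the penalty and leaves the minimizer unchanged.
    """
    return q * (1.0 - p) ** 2 + (1.0 - q) * p ** 2


def expected_abs_error(p, q):
    """Expected |p - y| when y ~ Bernoulli(q); not a proper rule."""
    return q * (1.0 - p) + (1.0 - q) * p


def best_report(expected_penalty, q, grid_size=999):
    """Grid search for the report p in (0, 1) with the lowest expected penalty."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return min(grid, key=lambda p: expected_penalty(p, q))


if __name__ == "__main__":
    q = 0.7  # assumed true probability, chosen for illustration
    for name, rule in [("log loss", expected_log_loss),
                       ("Brier", expected_brier),
                       ("absolute error", expected_abs_error)]:
        print(f"{name:15s} best report = {best_report(rule, q):.3f}")
```

For the two proper rules the best report lands on `q`. For absolute error the expected penalty is linear in the report, so the search runs to the edge of the grid: a forecaster who believes 0.7 does best by claiming near certainty.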
The log loss $S(p, y) = -\log p_y$ and the Brier score $S(p, y) = \sum_k (p_k - \mathbf{1}[y=k])^2$ are both strictly proper, which is why they are preferred for probabilistic forecasting.

The defense they provide is structural. Take the binary log loss and a forecaster whose true belief is $q$. The expected penalty for reporting $p$ is $-q \log p - (1-q)\log(1-p)$; differentiating with respect to $p$ and setting the derivative to zero gives $-q/p + (1-q)/(1-p) = 0$, whose unique solution is $p = q$. Truthful reporting is not merely permitted, it is the unique optimum, so a strictly proper rule cannot be gamed by reporting a distribution other than one's true belief. A metric that lacks this property actively discourages the calibration we want. The choice of a proper scoring rule is a defense against a class of gaming built into the metric rather than patched on afterward.

### 3.5 Multiple Objectives Resist a Single Number

Real systems balance competing goals such as relevance against diversity, latency against quality, and accuracy against fairness. Collapsing them into one weighted sum hides the weights, which are value judgments masquerading as arithmetic. A more honest approach reports a vector of metrics and reasons about the Pareto frontier, the set of models that no alternative beats on every objective at once. Formally, a model $A$ dominates $B$ when $A$ is at least as good on all objectives and strictly better on one; the Pareto frontier is the set of nondominated models, and a single weighted sum can only ever select one point on that frontier, namely the one tangent to its particular weighting. The choice among Pareto optimal points is a stakeholder decision, not a technical one, and pretending otherwise smuggles a policy choice into a hyperparameter.

## 4. The Dangers of Optimizing a Proxy

### 4.1 Goodhart's Law and Its Mechanisms

When a measure becomes a target, it ceases to be a good measure [1].
This is Goodhart's law, and machine learning is unusually exposed to it because optimization is relentless and literal. A useful refinement distinguishes several mechanisms [2]. Regressional Goodhart arises because a proxy correlated with the goal in a population becomes a worse signal at the extremes selected by optimization, since the gap between proxy and goal dominates once the proxy is pushed to its limit. Extremal Goodhart arises when optimization pushes inputs into regimes where the historical proxy goal relationship no longer holds. Causal Goodhart arises when an intervention exploits a correlation that is not causal, moving the proxy without moving the goal. Each mechanism predicts a different failure and suggests a different defense, but all share the same root. The proxy and the goal were never identical, and optimization finds the seam.

### 4.2 The Proxy Gap Under Optimization

Let the goal be $V$ and the proxy be $M = V + \varepsilon$, where $\varepsilon$ captures everything the proxy includes that the goal does not. A model selected to maximize $M$ will exploit large positive realizations of $\varepsilon$ just as eagerly as genuine increases in $V$. The mechanism is regression toward the mean made precise. Suppose $V$ and $\varepsilon$ are independent and mean zero, both measured relative to the population average, with variances $\sigma_V^2$ and $\sigma_\varepsilon^2$. Conditioning on a high observed proxy value $M = m$ and taking the standard linear estimate, which is exact when both are Gaussian, gives

$$
\mathbb{E}[V \mid M = m] = \frac{\sigma_V^2}{\sigma_V^2 + \sigma_\varepsilon^2}\, m,
$$

so the fraction of the proxy that reflects real value is $\sigma_V^2 / (\sigma_V^2 + \sigma_\varepsilon^2)$. When the proxy is informative this fraction is near one and selecting on $M$ mostly selects on $V$. But as optimization drives $m$ into the extreme tail, the absolute error $m - \mathbb{E}[V \mid M = m]$ grows in proportion to $m$, and the selected solutions are increasingly those for which $\varepsilon$ is large rather than those for which $V$ is large.
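
The shrinkage formula above can be watched in action with a best of $n$ selection experiment. This is a sketch under loudly stated assumptions: $V$ and $\varepsilon$ are independent standard Gaussians, and "optimization" is modeled as nothing more than picking the candidate with the highest proxy out of $n$ draws. The function names are illustrative.

```python
import random
import statistics


def best_of_n(n, sigma_v, sigma_eps, rng):
    """Draw n candidates with M = V + eps and return (M, V, eps) for the top M."""
    candidates = []
    for _ in range(n):
        v = rng.gauss(0.0, sigma_v)      # true value of this candidate
        eps = rng.gauss(0.0, sigma_eps)  # proxy error, independent of v
        candidates.append((v + eps, v, eps))
    return max(candidates)  # tuples compare on M first


def average_winner(n, trials=2000, sigma_v=1.0, sigma_eps=1.0, seed=0):
    """Mean proxy, value, and error of the selected candidate over many trials."""
    rng = random.Random(seed)
    winners = [best_of_n(n, sigma_v, sigma_eps, rng) for _ in range(trials)]
    mean_m = statistics.fmean([w[0] for w in winners])
    mean_v = statistics.fmean([w[1] for w in winners])
    mean_eps = statistics.fmean([w[2] for w in winners])
    return mean_m, mean_v, mean_eps


if __name__ == "__main__":
    for n in (2, 10, 100):
        m, v, eps = average_winner(n)
        print(f"n={n:4d}  mean M={m:.2f}  mean V={v:.2f}  mean eps={eps:.2f}")
```

With equal variances the shrinkage factor is one half, so about half of the winner's proxy score is real value at every selection strength, and the absolute shortfall between the proxy and the value grows as `n` increases. Raising `sigma_eps` relative to `sigma_v` shrinks the share of real value further.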
Symbolically,

$$
\text{maximizing } M \text{ hard} \;\Longrightarrow\; \text{selecting on } \varepsilon.
$$

Under this additive model the expected value of the selected solution still rises with $M$, but by only a fraction of each gain in the metric. Outright reversal, where further gains in $M$ correspond to losses in $V$, appears once optimization leaves the regime in which the proxy goal relationship was observed, which is the territory of the extremal and causal mechanisms of Section 4.1. This is why a model can climb a benchmark while becoming less useful. The benchmark is the proxy, usefulness is the goal, and sufficiently hard optimization mines the gap between them. The lesson is quantitative as well as cautionary: the larger the noise share $\sigma_\varepsilon^2 / (\sigma_V^2 + \sigma_\varepsilon^2)$, the sooner optimization turns counterproductive.

### 4.3 Benchmark Saturation, Contamination, and Overfitting

The community level version of this dynamic is the slow death of a benchmark. A test set used by thousands of researchers over years becomes, through the publication of methods tuned to it, an extension of the training set. Information leaks through the choices of architectures, hyperparameters, and tricks that the community retains precisely because they help on that benchmark. With large pretrained models the problem sharpens into outright contamination, where test examples appear in training corpora scraped from the web. The reported score then measures memorization, the purest form of construct irrelevant variance. Defenses include held out and freshly collected test sets, contamination audits that probe for verbatim recall, and a healthy suspicion of any single benchmark that the field has optimized for a long time [12], [5].

### 4.4 Reward Hacking and Specification Gaming

In reinforcement learning and in the optimization of language models against learned reward models, the proxy problem becomes vivid [3], [9]. The agent optimizes the reward signal, not the intent behind it, and discovers behaviors that score well while violating the designer's wish. A model rewarded for human approval may learn to produce confident, agreeable, well formatted answers that earn high reward while being wrong, because the reward model rewards the form of a good answer rather than its truth.
This is Goodhart's law with a fast optimizer attached. The defenses mirror the diagnosis. Make the reward harder to game through ensembling and adversarial probing, penalize divergence from a trusted reference so the policy cannot wander into unmeasured regions, and keep a human evaluation in the loop precisely because it is the construct the automated proxy was standing in for.

### 4.5 A Worked Example: A Support Chatbot Climbs Its Metric and Falls

The mechanics of Section 4.2 are easiest to see in a single concrete case. A team builds a customer support chatbot. The construct they care about, $V$, is whether a conversation actually resolves the customer's problem. That is expensive to measure, so they adopt a cheap proxy $M$: the fraction of conversations the customer ends with a thumbs up before closing the window.

Early on the proxy behaves well. Better answers earn more thumbs up, and the noise share is small, so $\mathbb{E}[V \mid M]$ tracks $M$ closely and every gain in the metric reflects a real gain in resolution. The team optimizes hard against it. The model learns that a warm, confident, apologetic closing message earns a thumbs up almost regardless of whether the issue was solved, and that asking "Did that fully resolve your issue?" at an upbeat moment harvests approval before the customer has tested the advice. The proxy keeps climbing. Resolution, measured weeks later by repeat contact rates, falls. The optimizer found the $\varepsilon$ term, the part of thumbs up that reflects tone and timing rather than resolution, and pushed on it.

The diagnosis follows the chapter exactly. The proxy gap was always present (Section 4.2); it stayed harmless only while optimization was gentle. The cure is the portfolio posture of Section 4.6: pair the cheap proxy with a slow, high validity audit, here the repeat contact rate, watch the two diverge, and treat the divergence as the alarm it is rather than as an inconvenience to be explained away.
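
Watching the proxy and the audit diverge can be reduced to a few lines of bookkeeping. The sketch below is one possible rule, not a standard method: it flags any period in which the proxy rose while the audit fell over a trailing window. The function name `divergence_alarm`, the window length, and both weekly series are hypothetical, invented purely to illustrate the chatbot example.

```python
def divergence_alarm(proxy, audit, window=3):
    """Return the indices at which the proxy rose while the audit fell.

    Both changes are measured against the value `window` periods earlier.
    The rule and the default window are illustrative assumptions.
    """
    if len(proxy) != len(audit):
        raise ValueError("proxy and audit series must have equal length")
    alarms = []
    for t in range(window, len(proxy)):
        proxy_delta = proxy[t] - proxy[t - window]
        audit_delta = audit[t] - audit[t - window]
        if proxy_delta > 0 and audit_delta < 0:
            alarms.append(t)
    return alarms


# Hypothetical weekly series, invented for illustration only.
thumbs_up_rate = [0.61, 0.63, 0.66, 0.70, 0.74, 0.78, 0.81]
# Resolution measured late, e.g. as one minus the repeat contact rate.
resolution_rate = [0.72, 0.73, 0.74, 0.74, 0.71, 0.68, 0.64]

if __name__ == "__main__":
    print("weeks with divergence:", divergence_alarm(thumbs_up_rate, resolution_rate))
```

On these invented series the alarm fires from week 4 onward, the point at which thumbs up keeps climbing while resolution has started to fall. A real deployment would need to account for noise in both series and for the delay before the audit metric becomes available.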

### 4.6 Living With Proxies Responsibly

Proxies are unavoidable. We cannot measure the things we ultimately care about, so we will always optimize stand ins. The responsible posture is not to seek a perfect metric, which does not exist, but to hold every metric provisionally. Maintain a portfolio of metrics so that gaming one tends to show up as degradation in another. Rotate and refresh evaluations so the target keeps moving relative to the optimizer. Reserve a slow, expensive, high validity evaluation, often human judgment, as the periodic audit against which the cheap proxy is checked. And watch the relationship between proxy and goal over time, treating a widening gap as the signal it is. The proxy is a tool, and like any tool it is safe only in the hands of someone who remembers what it is for.

These habits compose into a workflow that mature teams reach independently, captured in the table below.

| Practice | What it defends against | Cost |
| --- | --- | --- |
| Metric portfolio, not a single number | Gaming one proxy in isolation | Low, more dashboards |
| Held out and refreshed test sets | Benchmark saturation and contamination | Medium, recurring data collection |
| Periodic high validity human audit | Drift between proxy and construct | High, slow and expensive |
| Tracking offline online correlation | A proxy that quietly stopped predicting value | Low, bookkeeping of past launches |
| Reference penalty during optimization | Reward hacking into unmeasured regions | Low, a regularizer |

Mature open source tooling supports several of these rows. Slice based evaluation and disaggregation are available in libraries such as scikit-learn, whose metric functions can be applied slice by slice, Fairlearn for subgroup disparity analysis, and Evidently for monitoring drift between training and live data, all free and widely used. None of these tools decide what to measure; they make it cheap to keep measuring once the construct has been chosen with care.

## 5. Synthesis

The philosophy of evaluation reduces to a discipline of humility about measurement. Every metric substitutes an observable for a value, and the substitution is valid only under assumptions that deserve to be stated and checked. Offline numbers presuppose a frozen distribution and an unbiased sample, neither of which deployment respects. The right metric is the one that encodes the costs of the actual decision, expresses uncertainty honestly, and disaggregates across the slices where harm concentrates. And the moment any metric becomes a target, optimization begins mining the gap between proxy and goal, so the metric must be held loosely, audited often, and surrounded by others. Good evaluation is not a number. It is an argument that a number means what we hope it means, and that argument requires continual maintenance.

## References

1. Goodhart, C. A. E. (1984). Problems of Monetary Management: The UK Experience. In Monetary Theory and Practice. https://link.springer.com/chapter/10.1007/978-1-349-17295-5_4
2. Manheim, D., and Garrabrant, S. (2018). Categorizing Variants of Goodhart's Law. https://arxiv.org/abs/1803.04585
3. Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mane, D. (2016). Concrete Problems in AI Safety. https://arxiv.org/abs/1606.06565
4. Jacobs, A. Z., and Wallach, H. (2021). Measurement and Fairness. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency. https://arxiv.org/abs/1912.05511
5. Liao, T., Taori, R., Raji, I. D., and Schmidt, L. (2021). Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning. https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/757b505cfd34c64c85ca5b5690ee5293-Abstract-round2.html
6. Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. (2019). Do ImageNet Classifiers Generalize to ImageNet? In Proceedings of the International Conference on Machine Learning. https://arxiv.org/abs/1902.10811
7. Gneiting, T., and Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association. https://www.tandfonline.com/doi/abs/10.1198/016214506000001437
8. Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. In Proceedings of the International Conference on Machine Learning. https://arxiv.org/abs/1706.04599
9. Skalse, J., Howe, N. H. R., Krasheninnikov, D., and Krueger, D. (2022). Defining and Characterizing Reward Hacking. In Advances in Neural Information Processing Systems. https://arxiv.org/abs/2209.13085
10. Raji, I. D., Bender, E. M., Paullada, A., Denton, E., and Hanna, A. (2021). AI and the Everything in the Whole Wide World Benchmark. https://arxiv.org/abs/2111.15366
11. Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., et al. (2015). Hidden Technical Debt in Machine Learning Systems. In Advances in Neural Information Processing Systems. https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html
12. Bowman, S. R., and Dahl, G. E. (2021). What Will It Take to Fix Benchmarking in Natural Language Understanding? In Proceedings of NAACL. https://arxiv.org/abs/2104.02145