168 Information Retrieval Metrics
Evaluating a search or retrieval system means asking a deceptively simple question: when a user issues a query, did the system return the right documents, and did it place them where the user would actually look? This chapter develops the standard offline metrics used to answer that question, beginning with set based notions of precision and recall, extending them to ranked lists through precision at \(k\) and recall at \(k\), building the precision recall view, and then formalizing the two dominant summary metrics, Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG). It closes by contrasting offline evaluation against online evaluation in deployed systems.
168.1 1. Relevance and the Evaluation Setup
Fix a query \(q\) drawn from a query set \(Q\). A retrieval system returns an ordered list of documents \(\langle d_1, d_2, \ldots, d_n \rangle\) from a corpus. To evaluate it we need ground truth relevance judgments. In the simplest binary setting each document is either relevant (\(\mathrm{rel} = 1\)) or non relevant (\(\mathrm{rel} = 0\)) with respect to \(q\). In the graded setting relevance takes values in an ordinal scale, for example \(\{0, 1, 2, 3, 4\}\) corresponding to labels from “not relevant” to “perfect”.
Let \(R_q\) denote the set of documents judged relevant to \(q\), with \(|R_q|\) its cardinality. Two practical complications shape everything that follows. First, judgments are expensive, so in large corpora only a pooled subset of documents is ever judged, and unjudged documents are typically treated as non relevant. Second, real users do not read entire result lists; they read the top of the list. Both facts push us toward rank aware, top heavy metrics.
168.2 2. Set Based Precision and Recall
Before ranking, consider a system that returns an unordered set \(A\) of retrieved documents. Precision measures the purity of that set and recall measures its completeness:
\[ \mathrm{Precision} = \frac{|A \cap R_q|}{|A|}, \qquad \mathrm{Recall} = \frac{|A \cap R_q|}{|R_q|}. \]
Precision answers “of what I returned, how much was relevant?” Recall answers “of what was relevant, how much did I return?” These quantities trade off against each other. Returning the entire corpus guarantees recall of \(1\) but drives precision toward \(|R_q| / |\text{corpus}|\), which is tiny. Returning a single highly confident document can yield precision \(1\) but negligible recall.
A common scalar summary is the \(F_\beta\) score, the weighted harmonic mean
\[ F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}, \]
with \(F_1\) (\(\beta = 1\)) the balanced case. The harmonic mean is chosen deliberately: it is dominated by the smaller of the two values, so a system cannot earn a high \(F_1\) by excelling at one metric while neglecting the other.
168.3 3. Precision at k and Recall at k
Set based metrics ignore order, which is the wrong abstraction for search where position matters enormously. The first step toward rank awareness is to evaluate only a prefix of the ranked list. Let the top \(k\) retrieved documents be \(A_k = \langle d_1, \ldots, d_k \rangle\), and let \(\mathrm{rel}(d_i) \in \{0, 1\}\) be the binary relevance of the document at rank \(i\). Then
\[ P@k = \frac{1}{k} \sum_{i=1}^{k} \mathrm{rel}(d_i), \qquad R@k = \frac{1}{|R_q|} \sum_{i=1}^{k} \mathrm{rel}(d_i). \]
\(P@k\) is the fraction of the top \(k\) results that are relevant. \(R@k\) is the fraction of all relevant documents that appear within the top \(k\). The choice of \(k\) encodes a hypothesis about user behavior. For a web search interface showing ten blue links, \(P@10\) is natural. For a “feeling lucky” or voice assistant returning one answer, \(P@1\) is what matters. For a first stage retriever feeding a downstream reranker, \(R@100\) or \(R@1000\) is the quantity of interest, since the reranker can only reorder what the retriever surfaces.
ranks: 1 2 3 4 5
rel: 1 0 1 1 0
P@5 = (1+0+1+1+0)/5 = 0.60
R@5 = 3 / |R_q| (e.g. 3/4 = 0.75 if four documents are relevant)
Note an asymmetry. \(P@k\) uses \(k\) in its denominator and so is well defined and bounded in \([0,1]\) regardless of how many relevant documents exist. \(R@k\) can be capped below \(1\) when \(|R_q| > k\), since no prefix of length \(k\) can recover more than \(k\) relevant documents. A subtle variant, \(R\text{-precision}\), sets \(k = |R_q|\) for each query, at which point precision and recall coincide, giving a single self normalizing number per query.
168.4 4. The Precision Recall View for Retrieval
Sweeping \(k\) from \(1\) to \(n\) traces out a sequence of \((\mathrm{Recall}, \mathrm{Precision})\) pairs. Plotting precision against recall yields the precision recall curve, the central diagnostic for ranked retrieval. Each time the document at the next rank is relevant, recall increases and precision tends to rise locally; each time it is non relevant, recall stays fixed and precision falls. The curve therefore has a characteristic sawtooth shape.
To compare curves it is convention to interpolate. The interpolated precision at recall level \(r\) is the highest precision observed at any recall greater than or equal to \(r\):
\[ P_{\mathrm{interp}}(r) = \max_{r' \ge r} P(r'). \]
This produces a monotonically non increasing curve and removes the sawtooth, reflecting the optimistic stance that a user willing to accept recall \(r\) could in principle stop at the more favorable operating point further down the list. The eleven point interpolated average precision, computed at recall levels \(\{0.0, 0.1, \ldots, 1.0\}\), was the classic TREC summary before MAP became dominant.
The precision recall curve should not be confused with the receiver operating characteristic (ROC) curve, which plots true positive rate against false positive rate. In retrieval the non relevant class typically dwarfs the relevant class, and ROC’s false positive rate is insensitive to this imbalance because its denominator is the huge count of non relevant documents. Precision, by contrast, has the retrieved set in its denominator and so remains sensitive to how many junk documents the user must wade through. This is why precision recall analysis is preferred for search and other heavily imbalanced retrieval problems.
168.5 5. Average Precision and Mean Average Precision
The precision recall curve is informative but unwieldy as a single number. Average Precision (AP) collapses it into one scalar per query by averaging the precision values measured at exactly the ranks where relevant documents appear:
\[ \mathrm{AP}(q) = \frac{1}{|R_q|} \sum_{i=1}^{n} P@i \cdot \mathrm{rel}(d_i). \]
Each relevant document contributes the precision of the prefix that ends at its rank. Because \(P@i\) is large when relevant documents cluster near the top, AP rewards systems that rank relevant material early. Geometrically, AP approximates the area under the (uninterpolated) precision recall curve.
Worked example. Suppose \(|R_q| = 3\) and relevant documents appear at ranks \(1\), \(3\), and \(6\):
\[ P@1 = \tfrac{1}{1} = 1.00, \quad P@3 = \tfrac{2}{3} \approx 0.667, \quad P@6 = \tfrac{3}{6} = 0.500, \] \[ \mathrm{AP} = \frac{1.00 + 0.667 + 0.500}{3} \approx 0.722. \]
Documents that are relevant but never retrieved contribute a precision of zero to the sum, so AP implicitly penalizes missing recall: the divisor remains \(|R_q|\) even when fewer than \(|R_q|\) relevant documents are found.
Mean Average Precision is simply AP averaged over the query set:
\[ \mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q). \]
MAP has been the workhorse of academic IR evaluation for decades. Its strengths are that it is a single number, it incorporates the full ranking rather than a fixed cutoff, and it has a clean interpretation as mean area under the precision recall curve. Its principal limitation is that it assumes binary relevance: a marginally useful document and a perfect answer count identically. When relevance is graded, a different metric is needed.
168.6 6. Discounted Cumulative Gain and NDCG
NDCG addresses graded relevance directly and is the dominant metric in modern learning to rank and neural retrieval. It rests on two intuitions. First, more relevant documents are worth more, so we assign each document a gain that grows with its relevance grade. Second, a relevant document is worth less the further down the list it sits, because users are less likely to examine it, so we apply a position discount.
Let \(\mathrm{rel}_i\) be the graded relevance of the document at rank \(i\). The Discounted Cumulative Gain at cutoff \(k\) is
\[ \mathrm{DCG}@k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)}. \]
The numerator \(2^{\mathrm{rel}_i} - 1\) is the exponential gain, which sharply amplifies the value of highly relevant documents relative to merely acceptable ones; an older linear variant uses \(\mathrm{rel}_i\) directly. The denominator \(\log_2(i + 1)\) is the discount, equal to \(1\) at rank \(1\) and growing slowly thereafter, so that gains accrued late in the list are damped.
DCG is not comparable across queries because queries differ in how many relevant documents they have and how relevant those documents are. A query with five perfect documents can accumulate far more DCG than a query with one marginal document, regardless of system quality. Normalization fixes this. Let the Ideal DCG, \(\mathrm{IDCG}@k\), be the DCG of the best possible ranking, obtained by sorting all judged documents in decreasing order of relevance. Then
\[ \mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}. \]
NDCG lies in \([0, 1]\), equals \(1\) for an ideal ranking, and is averaged over queries to summarize a system. Because both numerator and denominator are computed at the same cutoff \(k\), NDCG@\(k\) measures how close the system’s top \(k\) comes to the best achievable top \(k\).
Worked example. Suppose the top three retrieved documents have grades \(3, 2, 3\) and the ideal ordering of judged documents is \(3, 3, 2\):
DCG@3 = (2^3-1)/log2(2) + (2^2-1)/log2(3) + (2^3-1)/log2(4)
= 7/1.000 + 3/1.585 + 7/2.000
= 7.000 + 1.893 + 3.500 = 12.393
IDCG@3 = 7/1.000 + 7/1.585 + 3/2.000
= 7.000 + 4.416 + 1.500 = 12.916
NDCG@3 = 12.393 / 12.916 = 0.960
NDCG’s flexibility with graded judgments and its tunable cutoff have made it the default reporting metric for benchmarks such as MS MARCO and the BEIR suite, and the standard objective proxy when training learning to rank models.
168.7 7. Reciprocal Rank and Other Companions
Two further metrics appear constantly. The Reciprocal Rank of a query is \(1 / \mathrm{rank}_1\), the inverse of the position of the first relevant document, and Mean Reciprocal Rank (MRR) averages it over queries:
\[ \mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_1(q)}. \]
MRR is the right metric for known item search and question answering, where the user wants one correct answer and cares only how quickly it appears. It ignores everything after the first relevant hit, which is appropriate for single answer tasks and inappropriate when coverage of multiple relevant documents matters.
A useful mental map: use \(P@k\) and \(R@k\) to communicate operating points to stakeholders; use MAP when relevance is binary and full ranking quality matters; use NDCG when relevance is graded; use MRR when a single right answer is the goal.
168.8 8. Offline versus Online Evaluation
Everything so far is offline evaluation: metrics computed against a fixed corpus, a fixed query set, and fixed relevance judgments. Offline evaluation is reproducible, cheap to rerun, and ideal for rapid iteration and model selection. Its weaknesses are equally clear. Judgments are incomplete, so unjudged but genuinely relevant documents are scored as non relevant, biasing comparisons against systems that surface novel documents. Judgments may also be stale, drifting away from current user intent as the world and the corpus change. Most fundamentally, offline metrics encode an assumed model of relevance and user behavior that may not match what real users actually do.
Online evaluation measures behavior in the live system. The gold standard is the controlled experiment, or A/B test, in which incoming traffic is randomly split between a control system and a treatment system and outcomes are compared. The metrics are behavioral rather than judged: click through rate, time to first click, dwell time, query reformulation and abandonment rates, and ultimately task success or downstream business outcomes such as conversions. Because assignment is randomized, differences in these metrics support causal claims about which system is better for real users, something no offline metric can do.
Online evaluation has its own hazards. Behavioral signals are noisy proxies for relevance and are confounded by presentation: position bias means users click top results regardless of quality, so naive click counts reward whatever ranking is already shown. Counterfactual estimators and click models attempt to correct for this. A/B tests also require substantial traffic to reach statistical significance, they expose real users to potentially worse experiences during the test, and they cannot evaluate systems that have not yet been built. Interleaving, which merges two rankings into one list and attributes clicks to the system that contributed each clicked document, is a more sensitive online design that detects differences with far less traffic than a classic A/B split.
In practice the two regimes are complementary and used in sequence. Teams iterate quickly offline using MAP or NDCG to filter a large space of candidate models down to a few promising ones, validating that offline gains correlate with the online metrics that actually matter. They then run online A/B tests or interleaving experiments on the survivors to confirm that offline improvements translate into better real world behavior before shipping. The persistent risk is metric divergence, where offline gains fail to move online metrics; tracking and minimizing this gap is a core discipline of any serious search team.
168.9 9. Summary
Retrieval evaluation moves from set based precision and recall, through the rank aware cutoff metrics \(P@k\) and \(R@k\), to the precision recall curve, and finally to the scalar summaries MAP for binary relevance and NDCG for graded relevance, with MRR serving single answer tasks. These offline metrics are indispensable for fast iteration but rest on assumed relevance models and incomplete judgments. Online evaluation through A/B testing and interleaving measures what users actually do and supports causal conclusions, at the cost of noise, position bias, and traffic requirements. Mature systems use both: offline metrics to search the model space efficiently, online experiments to validate that those gains are real.
168.10 References
- Manning, C. D., Raghavan, P., and Schütze, H. Introduction to Information Retrieval. Cambridge University Press, 2008. https://nlp.stanford.edu/IR-book/
- Järvelin, K. and Kekäläinen, J. “Cumulated gain based evaluation of IR techniques.” ACM Transactions on Information Systems, 2002. https://dl.acm.org/doi/10.1145/582415.582418
- Voorhees, E. M. and Harman, D. K. (eds.). TREC: Experiment and Evaluation in Information Retrieval. MIT Press, 2005. https://trec.nist.gov/
- Sanderson, M. “Test Collection Based Evaluation of Information Retrieval Systems.” Foundations and Trends in Information Retrieval, 2010. https://www.nowpublishers.com/article/Details/INR-009
- Kohavi, R., Tang, D., and Xu, Y. Trustworthy Online Controlled Experiments. Cambridge University Press, 2020. https://experimentguide.com/
- Chapelle, O., Joachims, T., Radlinski, F., and Yue, Y. “Large scale validation and analysis of interleaved search evaluation.” ACM Transactions on Information Systems, 2012. https://dl.acm.org/doi/10.1145/2094072.2094078
- Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., and Gurevych, I. “BEIR: A Heterogeneous Benchmark for Zero shot Evaluation of Information Retrieval Models.” NeurIPS Datasets and Benchmarks, 2021. https://arxiv.org/abs/2104.08663
- Nguyen, T., Rosenberg, M., Song, X., et al. “MS MARCO: A Human Generated MAchine Reading COmprehension Dataset.” 2016. https://arxiv.org/abs/1611.09268