168 Information Retrieval Metrics

Evaluating a search or retrieval system means asking a deceptively simple question: when a user issues a query, did the system return the right documents, and did it place them where the user would actually look? This chapter develops the standard offline metrics used to answer that question, beginning with set based notions of precision and recall, extending them to ranked lists through precision at $k$ and recall at $k$, building the precision recall view, and then formalizing the two dominant summary metrics, Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG). It closes by contrasting offline evaluation against online evaluation in deployed systems.

168.1 1. Relevance and the Evaluation Setup

Fix a query $q$ drawn from a query set $Q$. A retrieval system returns an ordered list of documents $\langle d_1, d_2, \ldots, d_n \rangle$ from a corpus. To evaluate it we need ground truth relevance judgments. In the simplest binary setting each document is either relevant ($\mathrm{rel} = 1$) or non relevant ($\mathrm{rel} = 0$) with respect to $q$. In the graded setting relevance takes values in an ordinal scale, for example $\{0, 1, 2, 3, 4\}$ corresponding to labels from “not relevant” to “perfect”.

Let $R_q$ denote the set of documents judged relevant to $q$, with $|R_q|$ its cardinality. Two practical complications shape everything that follows. First, judgments are expensive, so in large corpora only a pooled subset of documents is ever judged, and unjudged documents are typically treated as non relevant. Second, real users do not read entire result lists; they read the top of the list. Both facts push us toward rank aware, top heavy metrics.

168.2 2. Set Based Precision and Recall

Before ranking, consider a system that returns an unordered set $A$ of retrieved documents. Precision measures the purity of that set and recall measures its completeness:

\[ \mathrm{Precision} = \frac{|A \cap R_q|}{|A|}, \qquad \mathrm{Recall} = \frac{|A \cap R_q|}{|R_q|}. \]

Precision answers “of what I returned, how much was relevant?” Recall answers “of what was relevant, how much did I return?” These quantities trade off against each other. Returning the entire corpus guarantees recall of $1$ but drives precision toward $|R_q| / |\text{corpus}|$, which is tiny. Returning a single highly confident document can yield precision $1$ but negligible recall.

A common scalar summary is the $F_\beta$ score, the weighted harmonic mean

\[ F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}, \]

with $F_1$ ($\beta = 1$) the balanced case. The harmonic mean is chosen deliberately: it is dominated by the smaller of the two values, so a system cannot earn a high $F_1$ by excelling at one metric while neglecting the other.

168.3 3. Precision at k and Recall at k

Set based metrics ignore order, which is the wrong abstraction for search where position matters enormously. The first step toward rank awareness is to evaluate only a prefix of the ranked list. Let the top $k$ retrieved documents be $A_k = \langle d_1, \ldots, d_k \rangle$, and let $\mathrm{rel}(d_i) \in \{0, 1\}$ be the binary relevance of the document at rank $i$. Then

\[ P@k = \frac{1}{k} \sum_{i=1}^{k} \mathrm{rel}(d_i), \qquad R@k = \frac{1}{|R_q|} \sum_{i=1}^{k} \mathrm{rel}(d_i). \]

$P@k$ is the fraction of the top $k$ results that are relevant. $R@k$ is the fraction of all relevant documents that appear within the top $k$. The choice of $k$ encodes a hypothesis about user behavior. For a web search interface showing ten blue links, $P@10$ is natural. For a “feeling lucky” or voice assistant returning one answer, $P@1$ is what matters. For a first stage retriever feeding a downstream reranker, $R@100$ or $R@1000$ is the quantity of interest, since the reranker can only reorder what the retriever surfaces.

ranks:    1   2   3   4   5
rel:      1   0   1   1   0
P@5 = (1+0+1+1+0)/5 = 0.60
R@5 = 3 / |R_q|     (e.g. 3/4 = 0.75 if four documents are relevant)

Note an asymmetry. $P@k$ uses $k$ in its denominator and so is well defined and bounded in $[0,1]$ regardless of how many relevant documents exist. $R@k$ can be capped below $1$ when $|R_q| > k$, since no prefix of length $k$ can recover more than $k$ relevant documents. A subtle variant, $R\text{-precision}$, sets $k = |R_q|$ for each query, at which point precision and recall coincide, giving a single self normalizing number per query.

168.4 4. The Precision Recall View for Retrieval

Sweeping $k$ from $1$ to $n$ traces out a sequence of $(\mathrm{Recall}, \mathrm{Precision})$ pairs. Plotting precision against recall yields the precision recall curve, the central diagnostic for ranked retrieval. Each time the document at the next rank is relevant, recall increases and precision tends to rise locally; each time it is non relevant, recall stays fixed and precision falls. The curve therefore has a characteristic sawtooth shape.

To compare curves it is convention to interpolate. The interpolated precision at recall level $r$ is the highest precision observed at any recall greater than or equal to $r$:

\[ P_{\mathrm{interp}}(r) = \max_{r' \ge r} P(r'). \]

This produces a monotonically non increasing curve and removes the sawtooth, reflecting the optimistic stance that a user willing to accept recall $r$ could in principle stop at the more favorable operating point further down the list. The eleven point interpolated average precision, computed at recall levels $\{0.0, 0.1, \ldots, 1.0\}$, was the classic TREC summary before MAP became dominant.

The precision recall curve should not be confused with the receiver operating characteristic (ROC) curve, which plots true positive rate against false positive rate. In retrieval the non relevant class typically dwarfs the relevant class, and ROC’s false positive rate is insensitive to this imbalance because its denominator is the huge count of non relevant documents. Precision, by contrast, has the retrieved set in its denominator and so remains sensitive to how many junk documents the user must wade through. This is why precision recall analysis is preferred for search and other heavily imbalanced retrieval problems.

168.5 5. Average Precision and Mean Average Precision

The precision recall curve is informative but unwieldy as a single number. Average Precision (AP) collapses it into one scalar per query by averaging the precision values measured at exactly the ranks where relevant documents appear:

\[ \mathrm{AP}(q) = \frac{1}{|R_q|} \sum_{i=1}^{n} P@i \cdot \mathrm{rel}(d_i). \]

Each relevant document contributes the precision of the prefix that ends at its rank. Because $P@i$ is large when relevant documents cluster near the top, AP rewards systems that rank relevant material early. Geometrically, AP approximates the area under the (uninterpolated) precision recall curve.

Worked example. Suppose $|R_q| = 3$ and relevant documents appear at ranks $1$, $3$, and $6$:

\[ P@1 = \tfrac{1}{1} = 1.00, \quad P@3 = \tfrac{2}{3} \approx 0.667, \quad P@6 = \tfrac{3}{6} = 0.500, \] \[ \mathrm{AP} = \frac{1.00 + 0.667 + 0.500}{3} \approx 0.722. \]

Documents that are relevant but never retrieved contribute a precision of zero to the sum, so AP implicitly penalizes missing recall: the divisor remains $|R_q|$ even when fewer than $|R_q|$ relevant documents are found.

Mean Average Precision is simply AP averaged over the query set:

\[ \mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q). \]

MAP has been the workhorse of academic IR evaluation for decades. Its strengths are that it is a single number, it incorporates the full ranking rather than a fixed cutoff, and it has a clean interpretation as mean area under the precision recall curve. Its principal limitation is that it assumes binary relevance: a marginally useful document and a perfect answer count identically. When relevance is graded, a different metric is needed.

168.6 6. Discounted Cumulative Gain and NDCG

NDCG addresses graded relevance directly and is the dominant metric in modern learning to rank and neural retrieval. It rests on two intuitions. First, more relevant documents are worth more, so we assign each document a gain that grows with its relevance grade. Second, a relevant document is worth less the further down the list it sits, because users are less likely to examine it, so we apply a position discount.

Let $\mathrm{rel}_i$ be the graded relevance of the document at rank $i$. The Discounted Cumulative Gain at cutoff $k$ is

\[ \mathrm{DCG}@k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)}. \]

The numerator $2^{\mathrm{rel}_i} - 1$ is the exponential gain, which sharply amplifies the value of highly relevant documents relative to merely acceptable ones; an older linear variant uses $\mathrm{rel}_i$ directly. The denominator $\log_2(i + 1)$ is the discount, equal to $1$ at rank $1$ and growing slowly thereafter, so that gains accrued late in the list are damped.

DCG is not comparable across queries because queries differ in how many relevant documents they have and how relevant those documents are. A query with five perfect documents can accumulate far more DCG than a query with one marginal document, regardless of system quality. Normalization fixes this. Let the Ideal DCG, $\mathrm{IDCG}@k$, be the DCG of the best possible ranking, obtained by sorting all judged documents in decreasing order of relevance. Then

\[ \mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}. \]

NDCG lies in $[0, 1]$, equals $1$ for an ideal ranking, and is averaged over queries to summarize a system. Because both numerator and denominator are computed at the same cutoff $k$, NDCG@$k$ measures how close the system’s top $k$ comes to the best achievable top $k$.

Worked example. Suppose the top three retrieved documents have grades $3, 2, 3$ and the ideal ordering of judged documents is $3, 3, 2$:

DCG@3  = (2^3-1)/log2(2) + (2^2-1)/log2(3) + (2^3-1)/log2(4)
       = 7/1.000 + 3/1.585 + 7/2.000
       = 7.000 + 1.893 + 3.500 = 12.393
IDCG@3 = 7/1.000 + 7/1.585 + 3/2.000
       = 7.000 + 4.416 + 1.500 = 12.916
NDCG@3 = 12.393 / 12.916 = 0.960

NDCG’s flexibility with graded judgments and its tunable cutoff have made it the default reporting metric for benchmarks such as MS MARCO and the BEIR suite, and the standard objective proxy when training learning to rank models.

168.7 7. Reciprocal Rank and Other Companions

Two further metrics appear constantly. The Reciprocal Rank of a query is $1 / \mathrm{rank}_1$, the inverse of the position of the first relevant document, and Mean Reciprocal Rank (MRR) averages it over queries:

\[ \mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_1(q)}. \]

MRR is the right metric for known item search and question answering, where the user wants one correct answer and cares only how quickly it appears. It ignores everything after the first relevant hit, which is appropriate for single answer tasks and inappropriate when coverage of multiple relevant documents matters.

A useful mental map: use $P@k$ and $R@k$ to communicate operating points to stakeholders; use MAP when relevance is binary and full ranking quality matters; use NDCG when relevance is graded; use MRR when a single right answer is the goal.

168.8 8. Reference Implementation

The metrics above are short to implement exactly, and doing so removes any ambiguity about edge cases (an empty relevant set, a cutoff $k$ larger than the list, a query with no relevant hit). The aiinaction companion library ships parity-tested implementations in Python, Julia, and Rust; all three agree on shared numeric fixtures to within $10^{-9}$.

A few design choices are worth making explicit, because they are exactly the places where two well-meaning implementations silently disagree:

Cutoff clamping. When $k$ exceeds the list length, $P@k$ uses the available length as its denominator rather than $k$. This keeps $P@k \in [0, 1]$ and matches the natural reading “precision over the documents we actually returned.”
Recall’s denominator. $R@k$ divides by $|R_q|$, supplied separately as num_relevant, because relevant documents may live beyond the returned prefix and must still count against recall. The implementation rejects a num_relevant smaller than the number of relevant labels actually present, which would otherwise produce a recall above $1$.
AP and missing recall. Average precision divides the sum of prefix precisions at hits by $|R_q|$, so a relevant document that is never retrieved contributes nothing to the numerator while still inflating the denominator. With the binary ranking $\langle 1,0,1,1,0,1\rangle$ and $|R_q| = 5$, \[ \mathrm{AP} = \frac{P@1 + P@3 + P@4 + P@6}{5} = \frac{1 + \tfrac{2}{3} + \tfrac{3}{4} + \tfrac{4}{6}}{5} \approx 0.6167, \] lower than the $0.7708$ it would score if the divisor were the four hits found, which is precisely the recall penalty.
NDCG normalization. The ideal DCG is computed by sorting the judged pool in decreasing grade order; when relevant documents exist beyond the returned list, pass that pool explicitly as ideal_grades so the denominator reflects the true best ranking rather than just the returned grades.

Code

from aiinaction.ch163_ir_metrics import (
    precision_at_k, recall_at_k, average_precision,
    dcg_at_k, ndcg_at_k, reciprocal_rank,
    mean_average_precision, mean_reciprocal_rank,
)

# A single ranked query: binary relevance labels in rank order (top first).
ranking = [1, 0, 1, 1, 0, 1]   # relevant at ranks 1, 3, 4, 6
num_relevant = 5               # one relevant doc was never retrieved

print(f"P@3   = {precision_at_k(ranking, 3):.4f}")
print(f"R@3   = {recall_at_k(ranking, 3, num_relevant):.4f}")
print(f"AP    = {average_precision(ranking, num_relevant):.4f}")
print(f"RR    = {reciprocal_rank(ranking):.4f}")

# Graded relevances for DCG / NDCG (and the full judged pool for the ideal).
grades = [3, 2, 3, 0, 1, 2]
ideal_pool = [3, 3, 2, 2, 1, 0]
print(f"DCG@3 = {dcg_at_k(grades, 3):.4f}")
print(f"NDCG@3 = {ndcg_at_k(grades, 3):.4f}")
print(f"NDCG@6 (pool) = {ndcg_at_k(grades, 6, ideal_grades=ideal_pool):.4f}")

# Aggregate over a small query set.
query_set = [[1, 0, 1, 1, 0, 1], [0, 1, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1]]
print(f"MAP   = {mean_average_precision(query_set):.4f}")
print(f"MRR   = {mean_reciprocal_rank(query_set):.4f}")

P@3   = 0.6667
R@3   = 0.4000
AP    = 0.6167
RR    = 1.0000
DCG@3 = 12.3928
NDCG@3 = 0.9595
NDCG@6 (pool) = 0.9488
MAP   = 0.4625
MRR   = 0.5556

using AIInAction.Ch163IrMetrics

ranking = [1, 0, 1, 1, 0, 1]   # relevant at ranks 1, 3, 4, 6
num_relevant = 5

println("P@3   = ", precision_at_k(ranking, 3))
println("R@3   = ", recall_at_k(ranking, 3, num_relevant))
println("AP    = ", average_precision(ranking; num_relevant=num_relevant))
println("RR    = ", reciprocal_rank(ranking))

grades = [3, 2, 3, 0, 1, 2]
ideal_pool = [3, 3, 2, 2, 1, 0]
println("DCG@3 = ", dcg_at_k(grades, 3))
println("NDCG@3 = ", ndcg_at_k(grades, 3))
println("NDCG@6 (pool) = ", ndcg_at_k(grades, 6; ideal_grades=ideal_pool))

query_set = [[1, 0, 1, 1, 0, 1], [0, 1, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1]]
println("MAP   = ", mean_average_precision(query_set))
println("MRR   = ", mean_reciprocal_rank(query_set))

use aiinaction::ch163_ir_metrics::*;

fn main() {
    let ranking = [1, 0, 1, 1, 0, 1]; // relevant at ranks 1, 3, 4, 6
    let num_relevant = 5;

    println!("P@3   = {:.4}", precision_at_k(&ranking, 3).unwrap());
    println!("R@3   = {:.4}", recall_at_k(&ranking, 3, num_relevant).unwrap());
    println!("AP    = {:.4}", average_precision(&ranking, Some(num_relevant)).unwrap());
    println!("RR    = {:.4}", reciprocal_rank(&ranking).unwrap());

    let grades = [3, 2, 3, 0, 1, 2];
    let ideal_pool = [3, 3, 2, 2, 1, 0];
    println!("DCG@3 = {:.4}", dcg_at_k(&grades, 3).unwrap());
    println!("NDCG@3 = {:.4}", ndcg_at_k(&grades, 3, None).unwrap());
    println!("NDCG@6 (pool) = {:.4}", ndcg_at_k(&grades, 6, Some(&ideal_pool)).unwrap());

    let query_set = vec![
        vec![1, 0, 1, 1, 0, 1],
        vec![0, 1, 0, 0, 1, 0],
        vec![0, 0, 0, 0, 0, 1],
    ];
    println!("MAP   = {:.4}", mean_average_precision(&query_set, None).unwrap());
    println!("MRR   = {:.4}", mean_reciprocal_rank(&query_set).unwrap());
}

168.9 9. Offline versus Online Evaluation

Everything so far is offline evaluation: metrics computed against a fixed corpus, a fixed query set, and fixed relevance judgments. Offline evaluation is reproducible, cheap to rerun, and ideal for rapid iteration and model selection. Its weaknesses are equally clear. Judgments are incomplete, so unjudged but genuinely relevant documents are scored as non relevant, biasing comparisons against systems that surface novel documents. Judgments may also be stale, drifting away from current user intent as the world and the corpus change. Most fundamentally, offline metrics encode an assumed model of relevance and user behavior that may not match what real users actually do.

Online evaluation measures behavior in the live system. The gold standard is the controlled experiment, or A/B test, in which incoming traffic is randomly split between a control system and a treatment system and outcomes are compared. The metrics are behavioral rather than judged: click through rate, time to first click, dwell time, query reformulation and abandonment rates, and ultimately task success or downstream business outcomes such as conversions. Because assignment is randomized, differences in these metrics support causal claims about which system is better for real users, something no offline metric can do.

Online evaluation has its own hazards. Behavioral signals are noisy proxies for relevance and are confounded by presentation: position bias means users click top results regardless of quality, so naive click counts reward whatever ranking is already shown. Counterfactual estimators and click models attempt to correct for this. A/B tests also require substantial traffic to reach statistical significance, they expose real users to potentially worse experiences during the test, and they cannot evaluate systems that have not yet been built. Interleaving, which merges two rankings into one list and attributes clicks to the system that contributed each clicked document, is a more sensitive online design that detects differences with far less traffic than a classic A/B split.

In practice the two regimes are complementary and used in sequence. Teams iterate quickly offline using MAP or NDCG to filter a large space of candidate models down to a few promising ones, validating that offline gains correlate with the online metrics that actually matter. They then run online A/B tests or interleaving experiments on the survivors to confirm that offline improvements translate into better real world behavior before shipping. The persistent risk is metric divergence, where offline gains fail to move online metrics; tracking and minimizing this gap is a core discipline of any serious search team.

168.10 10. Summary

Retrieval evaluation moves from set based precision and recall, through the rank aware cutoff metrics $P@k$ and $R@k$, to the precision recall curve, and finally to the scalar summaries MAP for binary relevance and NDCG for graded relevance, with MRR serving single answer tasks. These offline metrics are indispensable for fast iteration but rest on assumed relevance models and incomplete judgments. Online evaluation through A/B testing and interleaving measures what users actually do and supports causal conclusions, at the cost of noise, position bias, and traffic requirements. Mature systems use both: offline metrics to search the model space efficiently, online experiments to validate that those gains are real.

168.11 References

Manning, C. D., Raghavan, P., and Schütze, H. Introduction to Information Retrieval. Cambridge University Press, 2008. https://nlp.stanford.edu/IR-book/
Järvelin, K. and Kekäläinen, J. “Cumulated gain based evaluation of IR techniques.” ACM Transactions on Information Systems, 2002. https://dl.acm.org/doi/10.1145/582415.582418
Voorhees, E. M. and Harman, D. K. (eds.). TREC: Experiment and Evaluation in Information Retrieval. MIT Press, 2005. https://trec.nist.gov/
Sanderson, M. “Test Collection Based Evaluation of Information Retrieval Systems.” Foundations and Trends in Information Retrieval, 2010. https://www.nowpublishers.com/article/Details/INR-009
Kohavi, R., Tang, D., and Xu, Y. Trustworthy Online Controlled Experiments. Cambridge University Press, 2020. https://experimentguide.com/
Chapelle, O., Joachims, T., Radlinski, F., and Yue, Y. “Large scale validation and analysis of interleaved search evaluation.” ACM Transactions on Information Systems, 2012. https://dl.acm.org/doi/10.1145/2094072.2094078
Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., and Gurevych, I. “BEIR: A Heterogeneous Benchmark for Zero shot Evaluation of Information Retrieval Models.” NeurIPS Datasets and Benchmarks, 2021. https://arxiv.org/abs/2104.08663
Nguyen, T., Rosenberg, M., Song, X., et al. “MS MARCO: A Human Generated MAchine Reading COmprehension Dataset.” 2016. https://arxiv.org/abs/1611.09268

# Information Retrieval Metrics Evaluating a search or retrieval system means asking a deceptively simple question: when a user issues a query, did the system return the right documents, and did it place them where the user would actually look? This chapter develops the standard offline metrics used to answer that question, beginning with set based notions of precision and recall, extending them to ranked lists through precision at $k$ and recall at $k$, building the precision recall view, and then formalizing the two dominant summary metrics, Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG). It closes by contrasting offline evaluation against online evaluation in deployed systems. ## 1. Relevance and the Evaluation Setup Fix a query $q$ drawn from a query set $Q$. A retrieval system returns an ordered list of documents $\langle d_1, d_2, \ldots, d_n \rangle$ from a corpus. To evaluate it we need ground truth relevance judgments. In the simplest binary setting each document is either relevant ($\mathrm{rel} = 1$) or non relevant ($\mathrm{rel} = 0$) with respect to $q$. In the graded setting relevance takes values in an ordinal scale, for example $\{0, 1, 2, 3, 4\}$ corresponding to labels from "not relevant" to "perfect". Let $R_q$ denote the set of documents judged relevant to $q$, with $|R_q|$ its cardinality. Two practical complications shape everything that follows. First, judgments are expensive, so in large corpora only a pooled subset of documents is ever judged, and unjudged documents are typically treated as non relevant. Second, real users do not read entire result lists; they read the top of the list. Both facts push us toward rank aware, top heavy metrics. ## 2. Set Based Precision and Recall Before ranking, consider a system that returns an unordered set $A$ of retrieved documents. Precision measures the purity of that set and recall measures its completeness: $$ \mathrm{Precision} = \frac{|A \cap R_q|}{|A|}, \qquad \mathrm{Recall} = \frac{|A \cap R_q|}{|R_q|}. $$ Precision answers "of what I returned, how much was relevant?" Recall answers "of what was relevant, how much did I return?" These quantities trade off against each other. Returning the entire corpus guarantees recall of $1$ but drives precision toward $|R_q| / |\text{corpus}|$, which is tiny. Returning a single highly confident document can yield precision $1$ but negligible recall. A common scalar summary is the $F_\beta$ score, the weighted harmonic mean $$ F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}, $$ with $F_1$ ($\beta = 1$) the balanced case. The harmonic mean is chosen deliberately: it is dominated by the smaller of the two values, so a system cannot earn a high $F_1$ by excelling at one metric while neglecting the other. ## 3. Precision at k and Recall at k Set based metrics ignore order, which is the wrong abstraction for search where position matters enormously. The first step toward rank awareness is to evaluate only a prefix of the ranked list. Let the top $k$ retrieved documents be $A_k = \langle d_1, \ldots, d_k \rangle$, and let $\mathrm{rel}(d_i) \in \{0, 1\}$ be the binary relevance of the document at rank $i$. Then $$ P@k = \frac{1}{k} \sum_{i=1}^{k} \mathrm{rel}(d_i), \qquad R@k = \frac{1}{|R_q|} \sum_{i=1}^{k} \mathrm{rel}(d_i). $$ $P@k$ is the fraction of the top $k$ results that are relevant. $R@k$ is the fraction of all relevant documents that appear within the top $k$. The choice of $k$ encodes a hypothesis about user behavior. For a web search interface showing ten blue links, $P@10$ is natural. For a "feeling lucky" or voice assistant returning one answer, $P@1$ is what matters. For a first stage retriever feeding a downstream reranker, $R@100$ or $R@1000$ is the quantity of interest, since the reranker can only reorder what the retriever surfaces. ```text ranks: 1 2 3 4 5 rel: 1 0 1 1 0 P@5 = (1+0+1+1+0)/5 = 0.60 R@5 = 3 / |R_q| (e.g. 3/4 = 0.75 if four documents are relevant) ``` Note an asymmetry. $P@k$ uses $k$ in its denominator and so is well defined and bounded in $[0,1]$ regardless of how many relevant documents exist. $R@k$ can be capped below $1$ when $|R_q| > k$, since no prefix of length $k$ can recover more than $k$ relevant documents. A subtle variant, $R\text{-precision}$, sets $k = |R_q|$ for each query, at which point precision and recall coincide, giving a single self normalizing number per query. ## 4. The Precision Recall View for Retrieval Sweeping $k$ from $1$ to $n$ traces out a sequence of $(\mathrm{Recall}, \mathrm{Precision})$ pairs. Plotting precision against recall yields the precision recall curve, the central diagnostic for ranked retrieval. Each time the document at the next rank is relevant, recall increases and precision tends to rise locally; each time it is non relevant, recall stays fixed and precision falls. The curve therefore has a characteristic sawtooth shape. To compare curves it is convention to interpolate. The interpolated precision at recall level $r$ is the highest precision observed at any recall greater than or equal to $r$: $$ P_{\mathrm{interp}}(r) = \max_{r' \ge r} P(r'). $$ This produces a monotonically non increasing curve and removes the sawtooth, reflecting the optimistic stance that a user willing to accept recall $r$ could in principle stop at the more favorable operating point further down the list. The eleven point interpolated average precision, computed at recall levels $\{0.0, 0.1, \ldots, 1.0\}$, was the classic TREC summary before MAP became dominant. The precision recall curve should not be confused with the receiver operating characteristic (ROC) curve, which plots true positive rate against false positive rate. In retrieval the non relevant class typically dwarfs the relevant class, and ROC's false positive rate is insensitive to this imbalance because its denominator is the huge count of non relevant documents. Precision, by contrast, has the retrieved set in its denominator and so remains sensitive to how many junk documents the user must wade through. This is why precision recall analysis is preferred for search and other heavily imbalanced retrieval problems. ## 5. Average Precision and Mean Average Precision The precision recall curve is informative but unwieldy as a single number. Average Precision (AP) collapses it into one scalar per query by averaging the precision values measured at exactly the ranks where relevant documents appear: $$ \mathrm{AP}(q) = \frac{1}{|R_q|} \sum_{i=1}^{n} P@i \cdot \mathrm{rel}(d_i). $$ Each relevant document contributes the precision of the prefix that ends at its rank. Because $P@i$ is large when relevant documents cluster near the top, AP rewards systems that rank relevant material early. Geometrically, AP approximates the area under the (uninterpolated) precision recall curve. Worked example. Suppose $|R_q| = 3$ and relevant documents appear at ranks $1$, $3$, and $6$: $$ P@1 = \tfrac{1}{1} = 1.00, \quad P@3 = \tfrac{2}{3} \approx 0.667, \quad P@6 = \tfrac{3}{6} = 0.500, $$ $$ \mathrm{AP} = \frac{1.00 + 0.667 + 0.500}{3} \approx 0.722. $$ Documents that are relevant but never retrieved contribute a precision of zero to the sum, so AP implicitly penalizes missing recall: the divisor remains $|R_q|$ even when fewer than $|R_q|$ relevant documents are found. Mean Average Precision is simply AP averaged over the query set: $$ \mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q). $$ MAP has been the workhorse of academic IR evaluation for decades. Its strengths are that it is a single number, it incorporates the full ranking rather than a fixed cutoff, and it has a clean interpretation as mean area under the precision recall curve. Its principal limitation is that it assumes binary relevance: a marginally useful document and a perfect answer count identically. When relevance is graded, a different metric is needed. ## 6. Discounted Cumulative Gain and NDCG NDCG addresses graded relevance directly and is the dominant metric in modern learning to rank and neural retrieval. It rests on two intuitions. First, more relevant documents are worth more, so we assign each document a gain that grows with its relevance grade. Second, a relevant document is worth less the further down the list it sits, because users are less likely to examine it, so we apply a position discount. Let $\mathrm{rel}_i$ be the graded relevance of the document at rank $i$. The Discounted Cumulative Gain at cutoff $k$ is $$ \mathrm{DCG}@k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)}. $$ The numerator $2^{\mathrm{rel}_i} - 1$ is the exponential gain, which sharply amplifies the value of highly relevant documents relative to merely acceptable ones; an older linear variant uses $\mathrm{rel}_i$ directly. The denominator $\log_2(i + 1)$ is the discount, equal to $1$ at rank $1$ and growing slowly thereafter, so that gains accrued late in the list are damped. DCG is not comparable across queries because queries differ in how many relevant documents they have and how relevant those documents are. A query with five perfect documents can accumulate far more DCG than a query with one marginal document, regardless of system quality. Normalization fixes this. Let the Ideal DCG, $\mathrm{IDCG}@k$, be the DCG of the best possible ranking, obtained by sorting all judged documents in decreasing order of relevance. Then $$ \mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}. $$ NDCG lies in $[0, 1]$, equals $1$ for an ideal ranking, and is averaged over queries to summarize a system. Because both numerator and denominator are computed at the same cutoff $k$, NDCG@$k$ measures how close the system's top $k$ comes to the best achievable top $k$. Worked example. Suppose the top three retrieved documents have grades $3, 2, 3$ and the ideal ordering of judged documents is $3, 3, 2$: ```text DCG@3 = (2^3-1)/log2(2) + (2^2-1)/log2(3) + (2^3-1)/log2(4) = 7/1.000 + 3/1.585 + 7/2.000 = 7.000 + 1.893 + 3.500 = 12.393 IDCG@3 = 7/1.000 + 7/1.585 + 3/2.000 = 7.000 + 4.416 + 1.500 = 12.916 NDCG@3 = 12.393 / 12.916 = 0.960 ``` NDCG's flexibility with graded judgments and its tunable cutoff have made it the default reporting metric for benchmarks such as MS MARCO and the BEIR suite, and the standard objective proxy when training learning to rank models. ## 7. Reciprocal Rank and Other Companions Two further metrics appear constantly. The Reciprocal Rank of a query is $1 / \mathrm{rank}_1$, the inverse of the position of the first relevant document, and Mean Reciprocal Rank (MRR) averages it over queries: $$ \mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_1(q)}. $$ MRR is the right metric for known item search and question answering, where the user wants one correct answer and cares only how quickly it appears. It ignores everything after the first relevant hit, which is appropriate for single answer tasks and inappropriate when coverage of multiple relevant documents matters. A useful mental map: use $P@k$ and $R@k$ to communicate operating points to stakeholders; use MAP when relevance is binary and full ranking quality matters; use NDCG when relevance is graded; use MRR when a single right answer is the goal. ## 8. Reference Implementation The metrics above are short to implement exactly, and doing so removes any ambiguity about edge cases (an empty relevant set, a cutoff $k$ larger than the list, a query with no relevant hit). The `aiinaction` companion library ships parity-tested implementations in Python, Julia, and Rust; all three agree on shared numeric fixtures to within $10^{-9}$. A few design choices are worth making explicit, because they are exactly the places where two well-meaning implementations silently disagree: - **Cutoff clamping.** When $k$ exceeds the list length, $P@k$ uses the available length as its denominator rather than $k$. This keeps $P@k \in [0, 1]$ and matches the natural reading "precision over the documents we actually returned." - **Recall's denominator.** $R@k$ divides by $|R_q|$, supplied separately as `num_relevant`, because relevant documents may live beyond the returned prefix and must still count against recall. The implementation rejects a `num_relevant` smaller than the number of relevant labels actually present, which would otherwise produce a recall above $1$. - **AP and missing recall.** Average precision divides the sum of prefix precisions at hits by $|R_q|$, so a relevant document that is never retrieved contributes nothing to the numerator while still inflating the denominator. With the binary ranking $\langle 1,0,1,1,0,1\rangle$ and $|R_q| = 5$, $$ \mathrm{AP} = \frac{P@1 + P@3 + P@4 + P@6}{5} = \frac{1 + \tfrac{2}{3} + \tfrac{3}{4} + \tfrac{4}{6}}{5} \approx 0.6167, $$ lower than the $0.7708$ it would score if the divisor were the four hits found, which is precisely the recall penalty. - **NDCG normalization.** The ideal DCG is computed by sorting the *judged pool* in decreasing grade order; when relevant documents exist beyond the returned list, pass that pool explicitly as `ideal_grades` so the denominator reflects the true best ranking rather than just the returned grades. ::: {.panel-tabset} ## Python ```{python} from aiinaction.ch163_ir_metrics import ( precision_at_k, recall_at_k, average_precision, dcg_at_k, ndcg_at_k, reciprocal_rank, mean_average_precision, mean_reciprocal_rank, ) # A single ranked query: binary relevance labels in rank order (top first). ranking = [1, 0, 1, 1, 0, 1] # relevant at ranks 1, 3, 4, 6 num_relevant = 5 # one relevant doc was never retrieved print(f"P@3 = {precision_at_k(ranking, 3):.4f}") print(f"R@3 = {recall_at_k(ranking, 3, num_relevant):.4f}") print(f"AP = {average_precision(ranking, num_relevant):.4f}") print(f"RR = {reciprocal_rank(ranking):.4f}") # Graded relevances for DCG / NDCG (and the full judged pool for the ideal). grades = [3, 2, 3, 0, 1, 2] ideal_pool = [3, 3, 2, 2, 1, 0] print(f"DCG@3 = {dcg_at_k(grades, 3):.4f}") print(f"NDCG@3 = {ndcg_at_k(grades, 3):.4f}") print(f"NDCG@6 (pool) = {ndcg_at_k(grades, 6, ideal_grades=ideal_pool):.4f}") # Aggregate over a small query set. query_set = [[1, 0, 1, 1, 0, 1], [0, 1, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1]] print(f"MAP = {mean_average_precision(query_set):.4f}") print(f"MRR = {mean_reciprocal_rank(query_set):.4f}") ``` ## Julia ```julia using AIInAction.Ch163IrMetrics ranking = [1, 0, 1, 1, 0, 1] # relevant at ranks 1, 3, 4, 6 num_relevant = 5 println("P@3 = ", precision_at_k(ranking, 3)) println("R@3 = ", recall_at_k(ranking, 3, num_relevant)) println("AP = ", average_precision(ranking; num_relevant=num_relevant)) println("RR = ", reciprocal_rank(ranking)) grades = [3, 2, 3, 0, 1, 2] ideal_pool = [3, 3, 2, 2, 1, 0] println("DCG@3 = ", dcg_at_k(grades, 3)) println("NDCG@3 = ", ndcg_at_k(grades, 3)) println("NDCG@6 (pool) = ", ndcg_at_k(grades, 6; ideal_grades=ideal_pool)) query_set = [[1, 0, 1, 1, 0, 1], [0, 1, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1]] println("MAP = ", mean_average_precision(query_set)) println("MRR = ", mean_reciprocal_rank(query_set)) ``` ## Rust ```rust use aiinaction::ch163_ir_metrics::*; fn main() { let ranking = [1, 0, 1, 1, 0, 1]; // relevant at ranks 1, 3, 4, 6 let num_relevant = 5; println!("P@3 = {:.4}", precision_at_k(&ranking, 3).unwrap()); println!("R@3 = {:.4}", recall_at_k(&ranking, 3, num_relevant).unwrap()); println!("AP = {:.4}", average_precision(&ranking, Some(num_relevant)).unwrap()); println!("RR = {:.4}", reciprocal_rank(&ranking).unwrap()); let grades = [3, 2, 3, 0, 1, 2]; let ideal_pool = [3, 3, 2, 2, 1, 0]; println!("DCG@3 = {:.4}", dcg_at_k(&grades, 3).unwrap()); println!("NDCG@3 = {:.4}", ndcg_at_k(&grades, 3, None).unwrap()); println!("NDCG@6 (pool) = {:.4}", ndcg_at_k(&grades, 6, Some(&ideal_pool)).unwrap()); let query_set = vec![ vec![1, 0, 1, 1, 0, 1], vec![0, 1, 0, 0, 1, 0], vec![0, 0, 0, 0, 0, 1], ]; println!("MAP = {:.4}", mean_average_precision(&query_set, None).unwrap()); println!("MRR = {:.4}", mean_reciprocal_rank(&query_set).unwrap()); } ``` ::: ## 9. Offline versus Online Evaluation Everything so far is offline evaluation: metrics computed against a fixed corpus, a fixed query set, and fixed relevance judgments. Offline evaluation is reproducible, cheap to rerun, and ideal for rapid iteration and model selection. Its weaknesses are equally clear. Judgments are incomplete, so unjudged but genuinely relevant documents are scored as non relevant, biasing comparisons against systems that surface novel documents. Judgments may also be stale, drifting away from current user intent as the world and the corpus change. Most fundamentally, offline metrics encode an assumed model of relevance and user behavior that may not match what real users actually do. Online evaluation measures behavior in the live system. The gold standard is the controlled experiment, or A/B test, in which incoming traffic is randomly split between a control system and a treatment system and outcomes are compared. The metrics are behavioral rather than judged: click through rate, time to first click, dwell time, query reformulation and abandonment rates, and ultimately task success or downstream business outcomes such as conversions. Because assignment is randomized, differences in these metrics support causal claims about which system is better for real users, something no offline metric can do. Online evaluation has its own hazards. Behavioral signals are noisy proxies for relevance and are confounded by presentation: position bias means users click top results regardless of quality, so naive click counts reward whatever ranking is already shown. Counterfactual estimators and click models attempt to correct for this. A/B tests also require substantial traffic to reach statistical significance, they expose real users to potentially worse experiences during the test, and they cannot evaluate systems that have not yet been built. Interleaving, which merges two rankings into one list and attributes clicks to the system that contributed each clicked document, is a more sensitive online design that detects differences with far less traffic than a classic A/B split. In practice the two regimes are complementary and used in sequence. Teams iterate quickly offline using MAP or NDCG to filter a large space of candidate models down to a few promising ones, validating that offline gains correlate with the online metrics that actually matter. They then run online A/B tests or interleaving experiments on the survivors to confirm that offline improvements translate into better real world behavior before shipping. The persistent risk is metric divergence, where offline gains fail to move online metrics; tracking and minimizing this gap is a core discipline of any serious search team. ## 10. Summary Retrieval evaluation moves from set based precision and recall, through the rank aware cutoff metrics $P@k$ and $R@k$, to the precision recall curve, and finally to the scalar summaries MAP for binary relevance and NDCG for graded relevance, with MRR serving single answer tasks. These offline metrics are indispensable for fast iteration but rest on assumed relevance models and incomplete judgments. Online evaluation through A/B testing and interleaving measures what users actually do and supports causal conclusions, at the cost of noise, position bias, and traffic requirements. Mature systems use both: offline metrics to search the model space efficiently, online experiments to validate that those gains are real. ## References 1. Manning, C. D., Raghavan, P., and Schütze, H. *Introduction to Information Retrieval*. Cambridge University Press, 2008. https://nlp.stanford.edu/IR-book/ 2. Järvelin, K. and Kekäläinen, J. "Cumulated gain based evaluation of IR techniques." *ACM Transactions on Information Systems*, 2002. https://dl.acm.org/doi/10.1145/582415.582418 3. Voorhees, E. M. and Harman, D. K. (eds.). *TREC: Experiment and Evaluation in Information Retrieval*. MIT Press, 2005. https://trec.nist.gov/ 4. Sanderson, M. "Test Collection Based Evaluation of Information Retrieval Systems." *Foundations and Trends in Information Retrieval*, 2010. https://www.nowpublishers.com/article/Details/INR-009 5. Kohavi, R., Tang, D., and Xu, Y. *Trustworthy Online Controlled Experiments*. Cambridge University Press, 2020. https://experimentguide.com/ 6. Chapelle, O., Joachims, T., Radlinski, F., and Yue, Y. "Large scale validation and analysis of interleaved search evaluation." *ACM Transactions on Information Systems*, 2012. https://dl.acm.org/doi/10.1145/2094072.2094078 7. Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., and Gurevych, I. "BEIR: A Heterogeneous Benchmark for Zero shot Evaluation of Information Retrieval Models." *NeurIPS Datasets and Benchmarks*, 2021. https://arxiv.org/abs/2104.08663 8. Nguyen, T., Rosenberg, M., Song, X., et al. "MS MARCO: A Human Generated MAchine Reading COmprehension Dataset." 2016. https://arxiv.org/abs/1611.09268