167 Ranking Metrics

Many machine learning systems do not predict a single label or value. Instead they produce an ordered list of items, and the quality of the system depends on whether the most relevant items appear near the top. Search engines, recommender systems, question answering pipelines, and retrieval components inside large language model applications all share this structure. Evaluating such systems requires metrics that respect order, reward relevance early, and remain interpretable across queries with different numbers of relevant items. This chapter develops the three workhorse ranking metrics in rigorous detail: mean reciprocal rank, mean average precision, and normalized discounted cumulative gain. We connect each to the practical concerns of search and recommendation.

167.1 1. The Ranking Evaluation Setup

A ranking system receives a query $q$ drawn from a distribution of queries and returns an ordered list of items $\pi = (i_1, i_2, \dots, i_n)$ from a candidate pool. The position of an item in this list is its rank, with rank $1$ being the top. Associated with each query is a set of relevance judgments. In the simplest binary case, each item is either relevant or not, encoded as $\mathrm{rel}(i) \in \{0, 1\}$. In the graded case, relevance takes ordinal values such as $\mathrm{rel}(i) \in \{0, 1, 2, 3, 4\}$, capturing degrees of usefulness from irrelevant to perfect.

The central difficulty is that accuracy style metrics ignore order. A list that places the one relevant document at rank $1$ and a list that places it at rank $100$ both contain the same relevant document, yet the user experience differs enormously. Ranking metrics encode the intuition that mistakes near the top of the list cost more than mistakes deep in the list, since users inspect results sequentially and rarely look past the first page.

We evaluate a ranking by computing a per query score and then averaging across a query set $Q$. Almost all ranking metrics share this two stage structure:

\[ \text{Metric} = \frac{1}{|Q|} \sum_{q \in Q} \text{score}(q). \]

The metrics differ in how they define the per query score. Throughout, we use $k$ to denote a cutoff, evaluating only the top $k$ results, which mirrors how users interact with paginated interfaces.

167.2 2. Mean Reciprocal Rank

167.2.1 2.1 Definition

Mean reciprocal rank (MRR) is the simplest order aware metric. It assumes a single notion of correctness per query, namely the position of the first relevant item. Let $\mathrm{rank}_q$ be the rank of the first relevant item in the list returned for query $q$. The reciprocal rank is $1 / \mathrm{rank}_q$, and MRR averages this across queries:

\[ \mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}. \]

If no relevant item appears (or none appears within the cutoff $k$), the reciprocal rank is defined as $0$. The metric ranges from $0$ to $1$, where $1$ means every query had a relevant item at the top.

167.2.2 2.2 Interpretation and Properties

The reciprocal weighting is steep. Moving the first relevant item from rank $1$ to rank $2$ drops the score from $1.0$ to $0.5$, while moving it from rank $9$ to rank $10$ changes the score only from $0.111$ to $0.100$. This convexity reflects a strong preference for getting the answer right at the very top and a relative indifference to the deep tail.

MRR is the natural metric when each query has essentially one correct answer. Known item search, where a user seeks a specific document they know exists, fits this model. So does factoid question answering, where the system retrieves a passage containing the answer, and navigational web search, where the user wants a particular homepage. Because MRR considers only the first hit, it is a poor choice when queries have many relevant items whose collective coverage matters, such as exploratory search or recommendation feeds.

query A: [rel, -, -]      RR = 1/1 = 1.000
query B: [-, -, rel]      RR = 1/3 = 0.333
query C: [-, rel, -]      RR = 1/2 = 0.500
MRR = (1.000 + 0.333 + 0.500) / 3 = 0.611

A subtle point concerns expected reciprocal rank, a related metric that models the probability a user stops at each position given graded relevance. Plain MRR ignores everything after the first relevant item, so it cannot distinguish a list with one relevant item from a list with many, as long as the first relevant position is the same.

167.3 3. Mean Average Precision

167.3.1 3.1 Precision, Recall, and Precision at k

To handle queries with multiple relevant items we need set based quantities adapted to ranking. Precision at cutoff $k$ counts the fraction of the top $k$ items that are relevant:

\[ P@k = \frac{1}{k} \sum_{j=1}^{k} \mathrm{rel}(i_j), \]

with $\mathrm{rel}(i_j) \in \{0,1\}$. Recall at $k$ is the fraction of all relevant items that appear in the top $k$. Precision at $k$ is intuitive and widely reported, but it is insensitive to ordering within the top $k$: shuffling the first $k$ items leaves $P@k$ unchanged. It also requires choosing $k$ arbitrarily.

167.3.2 3.2 Average Precision

Average precision (AP) repairs the order insensitivity of precision at $k$ by averaging the precision computed at each rank where a relevant item occurs. For a single query with $R$ relevant items in the candidate pool,

\[ \mathrm{AP} = \frac{1}{R} \sum_{k=1}^{n} P@k \cdot \mathrm{rel}(i_k), \]

where the indicator $\mathrm{rel}(i_k)$ ensures only positions holding a relevant item contribute. Each relevant item contributes the precision of the list up to and including its own position. Because precision is high when relevant items are densely packed near the top, AP rewards rankings that front load relevance.

Average precision has an elegant geometric meaning: it approximates the area under the precision recall curve for the query. As we walk down the ranked list, each relevant item we encounter raises recall by $1/R$ and produces a precision value; AP is the recall weighted average of those precision values, which traces the curve.

167.3.3 3.3 From AP to MAP

Mean average precision (MAP) averages AP over the query set:

\[ \mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}_q. \]

A worked example clarifies the mechanics. Suppose a query has $R = 3$ relevant items, and the returned list marks relevant positions at ranks $1$, $3$, and $6$:

rank:    1     2    3     4    5    6
rel:     1     0    1     0    0    1
P@k at hits:
  k=1: 1/1 = 1.000
  k=3: 2/3 = 0.667
  k=6: 3/6 = 0.500
AP = (1.000 + 0.667 + 0.500) / 3 = 0.722

Had the three relevant items appeared at ranks $1$, $2$, $3$ the AP would be $1.0$, the maximum. Pushing them deeper lowers AP smoothly, so MAP captures both how many relevant items are retrieved and how early they appear.

167.3.4 3.4 Strengths and Limitations

MAP is the standard single number summary in classical information retrieval evaluations, and it is stable because it integrates information from every relevant item rather than just the first. Its principal limitation is that it assumes binary relevance. Graded judgments must be collapsed to relevant or not, discarding the distinction between a perfect result and a merely acceptable one. When such gradations matter, the next metric is preferred.

167.4 4. Normalized Discounted Cumulative Gain

167.4.1 4.1 Gain and Discount

Normalized discounted cumulative gain (NDCG) is built for graded relevance. It rests on two ideas. First, each item carries a gain that grows with its relevance grade. A common choice is the exponential gain $2^{\mathrm{rel}(i)} - 1$, which sharply rewards highly relevant items; a linear gain equal to $\mathrm{rel}(i)$ is also used. Second, gain is discounted by position so that a relevant item lower in the list contributes less. The standard discount is logarithmic, $1 / \log_2(j + 1)$ for rank $j$.

Discounted cumulative gain at cutoff $k$ combines these:

\[ \mathrm{DCG}@k = \sum_{j=1}^{k} \frac{2^{\mathrm{rel}(i_j)} - 1}{\log_2(j + 1)}. \]

The logarithmic discount falls off more gently than the reciprocal discount of MRR, encoding the assumption that users have diminishing but nonzero patience as they scan downward.

167.4.2 4.2 Normalization

Raw DCG is not comparable across queries, because a query with many highly relevant items can accumulate more gain than a query with few. To normalize, we divide by the ideal DCG, denoted $\mathrm{IDCG}@k$, which is the DCG of the best possible ranking obtained by sorting all items in descending order of relevance grade. The normalized metric is

\[ \mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}. \]

NDCG ranges from $0$ to $1$, where $1$ means the system produced an ideally ordered list down to position $k$. Averaging over queries yields the reported value:

\[ \overline{\mathrm{NDCG}}@k = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{NDCG}_q@k. \]

167.4.3 4.3 Worked Example

Consider a query with five retrieved items carrying graded relevance $(3, 2, 0, 1, 2)$ at ranks $1$ through $5$, using exponential gain.

rank j   rel   gain = 2^rel - 1   discount = 1/log2(j+1)   contribution
  1       3          7                  1.000                 7.000
  2       2          3                  0.631                 1.893
  3       0          0                  0.500                 0.000
  4       1          1                  0.431                 0.431
  5       2          3                  0.387                 1.161
DCG@5 = 7.000 + 1.893 + 0.000 + 0.431 + 1.161 = 10.485

The ideal ordering sorts the grades as $(3, 2, 2, 1, 0)$. Computing IDCG with the same discounts gives approximately $7.000 + 1.893 + 1.500 + 0.431 + 0 = 10.824$. Therefore $\mathrm{NDCG}@5 \approx 10.485 / 10.824 \approx 0.969$. The high value reflects that the list is close to ideal, marred mainly by the irrelevant item at rank $3$ that displaces a relevant one.

167.4.4 4.4 Why NDCG Dominates Modern Practice

NDCG handles graded relevance natively, normalizes across heterogeneous queries, and supports a tunable cutoff that matches a page of results. These properties make it the default offline metric for web search and for learning to rank systems, where models are trained to optimize smoothed surrogates of NDCG such as the LambdaMART objective. One caution is that the choice of gain function and discount changes the numbers, so NDCG values are only comparable when the same formulation is used. The exponential gain with logarithmic discount is the most widely reported convention.

167.5 5. Choosing and Applying a Metric

167.5.1 5.1 Search Applications

In web and enterprise search, relevance is usually graded by human assessors on a multi point scale, and users scan a ranked page. NDCG at cutoffs such as $5$ or $10$ is the canonical offline metric because it respects both grade and position. For navigational queries with a single correct target, MRR is a sharper diagnostic, and many search teams report both. MAP remains valuable for recall oriented tasks such as legal or patent search, where finding all relevant documents matters and binary relevance is acceptable.

167.5.2 5.2 Recommender Applications

Recommender systems present ranked feeds where implicit feedback, such as clicks or purchases, defines relevance. Because such feedback is typically binary and sparse, MAP and precision at $k$ are common, alongside NDCG when engagement strength provides graded signal. A practical wrinkle is that offline relevance labels in recommendation are biased: items the user never saw are treated as irrelevant even though they might have been liked. This missing not at random structure means offline ranking metrics should be complemented by online experiments such as A/B tests measuring click through rate, dwell time, or conversion.

167.5.3 5.3 Retrieval for Language Models

Retrieval augmented generation has revived interest in ranking metrics. The retriever feeds a small number of passages to a language model, so what matters is whether a passage containing the answer appears in the top few. Recall at $k$ and MRR are the most reported retriever metrics here, since a single supporting passage at a high rank often suffices for the downstream model to answer correctly.

167.5.4 5.4 Statistical Care

A single averaged number hides variance. Per query scores are often highly skewed, so report confidence intervals or use paired significance tests, such as a paired bootstrap or a randomization test, when comparing two systems on the same query set. Cutoffs should be fixed in advance, and ties in scores should be broken by a deterministic rule so that results are reproducible. Finally, remember that offline metrics are proxies for user satisfaction; their value lies in correlating with the online outcomes the system ultimately serves.

167.6 6. Reference Implementation

The companion library ships a small, dependency-light implementation of all three metrics in Python, Julia, and Rust under the module ch162_ranking_metrics. Each function takes a per-query relevance list in ranked order, treats a strictly positive entry as relevant, and supports an optional top-$k$ cutoff. The three implementations are checked against a shared set of numeric fixtures so they agree to floating-point tolerance.

The derivations above translate directly to code. Reciprocal rank scans for the first positive entry and returns $1/\mathrm{rank}$. Average precision accumulates $\mathrm{hits}/k$ at every hit position and divides by the number of relevant items $R$. DCG sums $(2^{\mathrm{rel}} - 1)/\log_2(j+1)$ over positions, and NDCG normalizes by the DCG of the descending-sorted grades.

Code

from aiinaction.ch162_ranking_metrics import (
    mean_reciprocal_rank,
    average_precision,
    mean_average_precision,
    dcg,
    ndcg,
)

# MRR: three known-item queries, the single relevant item at ranks 1, 3, 2.
mrr_queries = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
print(f"MRR = {mean_reciprocal_rank(mrr_queries):.4f}")

# MAP: two queries under binary relevance.
ap_q1 = [1, 0, 1, 0, 0, 1]
ap_q2 = [0, 1, 1, 0]
print(f"AP(q1) = {average_precision(ap_q1):.4f}")
print(f"MAP    = {mean_average_precision([ap_q1, ap_q2]):.4f}")

# NDCG: graded relevance, exponential gain with logarithmic discount.
grades = [3, 2, 0, 1, 2]
print(f"DCG@5  = {dcg(grades):.4f}")
print(f"NDCG@5 = {ndcg(grades):.4f}")

MRR = 0.6111
AP(q1) = 0.7222
MAP    = 0.6528
DCG@5  = 10.4840
NDCG@5 = 0.9686

using AIInAction.Ch162RankingMetrics

# MRR: three known-item queries, the single relevant item at ranks 1, 3, 2.
mrr_queries = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
println("MRR = ", round(mean_reciprocal_rank(mrr_queries); digits=4))

# MAP: two queries under binary relevance.
ap_q1 = [1, 0, 1, 0, 0, 1]
ap_q2 = [0, 1, 1, 0]
println("AP(q1) = ", round(average_precision(ap_q1); digits=4))
println("MAP    = ", round(mean_average_precision([ap_q1, ap_q2]); digits=4))

# NDCG: graded relevance, exponential gain with logarithmic discount.
grades = [3, 2, 0, 1, 2]
println("DCG@5  = ", round(dcg(grades); digits=4))
println("NDCG@5 = ", round(ndcg(grades); digits=4))

use aiinaction::ch162_ranking_metrics::{
    average_precision, dcg, mean_average_precision, mean_reciprocal_rank, ndcg,
};

fn main() {
    // MRR: three known-item queries, the relevant item at ranks 1, 3, 2.
    let mrr_queries = vec![
        vec![1.0, 0.0, 0.0],
        vec![0.0, 0.0, 1.0],
        vec![0.0, 1.0, 0.0],
    ];
    println!("MRR = {:.4}", mean_reciprocal_rank(&mrr_queries, None).unwrap());

    // MAP: two queries under binary relevance.
    let ap_q1 = vec![1.0, 0.0, 1.0, 0.0, 0.0, 1.0];
    let ap_q2 = vec![0.0, 1.0, 1.0, 0.0];
    println!("AP(q1) = {:.4}", average_precision(&ap_q1, None, None).unwrap());
    let queries = vec![ap_q1.clone(), ap_q2];
    println!("MAP    = {:.4}", mean_average_precision(&queries, None, None).unwrap());

    // NDCG: graded relevance, exponential gain with logarithmic discount.
    let grades = vec![3.0, 2.0, 0.0, 1.0, 2.0];
    println!("DCG@5  = {:.4}", dcg(&grades, None, true).unwrap());
    println!("NDCG@5 = {:.4}", ndcg(&grades, None, true).unwrap());
}

167.7 7. Summary

Ranking metrics translate the intuition that early positions matter into precise, averageable scores. Mean reciprocal rank targets the single first relevant hit with a steep discount, ideal for known item and question answering tasks. Mean average precision integrates precision across all relevant items under binary relevance, serving recall sensitive retrieval. Normalized discounted cumulative gain accommodates graded relevance with a gentle logarithmic discount and per query normalization, making it the default for modern search and learning to rank. Selecting among them means matching the metric’s assumptions to the task: how many items are relevant, whether relevance is graded, how deep users look, and whether offline labels faithfully reflect real satisfaction.

167.8 References

Manning, C. D., Raghavan, P., and Schutze, H. Introduction to Information Retrieval. Cambridge University Press, 2008. https://nlp.stanford.edu/IR-book/
Jarvelin, K. and Kekalainen, J. Cumulated Gain Based Evaluation of IR Techniques. ACM Transactions on Information Systems, 2002. https://dl.acm.org/doi/10.1145/582415.582418
Voorhees, E. M. The TREC 8 Question Answering Track Report. TREC, 1999. https://trec.nist.gov/pubs/trec8/papers/qa_report.pdf
Burges, C. J. C. From RankNet to LambdaRank to LambdaMART: An Overview. Microsoft Research Technical Report, 2010. https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/
Wang, Y., Wang, L., Li, Y., He, D., and Liu, T. A Theoretical Analysis of NDCG Type Ranking Measures. COLT, 2013. https://proceedings.mlr.press/v30/Wang13.html
Chapelle, O., Metlzer, D., Zhang, Y., and Grinspan, P. Expected Reciprocal Rank for Graded Relevance. CIKM, 2009. https://dl.acm.org/doi/10.1145/1645953.1646033
Karpukhin, V., Oguz, B., Min, S., et al. Dense Passage Retrieval for Open Domain Question Answering. EMNLP, 2020. https://aclanthology.org/2020.emnlp-main.550/

# Ranking Metrics Many machine learning systems do not predict a single label or value. Instead they produce an ordered list of items, and the quality of the system depends on whether the most relevant items appear near the top. Search engines, recommender systems, question answering pipelines, and retrieval components inside large language model applications all share this structure. Evaluating such systems requires metrics that respect order, reward relevance early, and remain interpretable across queries with different numbers of relevant items. This chapter develops the three workhorse ranking metrics in rigorous detail: mean reciprocal rank, mean average precision, and normalized discounted cumulative gain. We connect each to the practical concerns of search and recommendation. ## 1. The Ranking Evaluation Setup A ranking system receives a query $q$ drawn from a distribution of queries and returns an ordered list of items $\pi = (i_1, i_2, \dots, i_n)$ from a candidate pool. The position of an item in this list is its rank, with rank $1$ being the top. Associated with each query is a set of relevance judgments. In the simplest binary case, each item is either relevant or not, encoded as $\mathrm{rel}(i) \in \{0, 1\}$. In the graded case, relevance takes ordinal values such as $\mathrm{rel}(i) \in \{0, 1, 2, 3, 4\}$, capturing degrees of usefulness from irrelevant to perfect. The central difficulty is that accuracy style metrics ignore order. A list that places the one relevant document at rank $1$ and a list that places it at rank $100$ both contain the same relevant document, yet the user experience differs enormously. Ranking metrics encode the intuition that mistakes near the top of the list cost more than mistakes deep in the list, since users inspect results sequentially and rarely look past the first page. We evaluate a ranking by computing a per query score and then averaging across a query set $Q$. Almost all ranking metrics share this two stage structure: $$ \text{Metric} = \frac{1}{|Q|} \sum_{q \in Q} \text{score}(q). $$ The metrics differ in how they define the per query score. Throughout, we use $k$ to denote a cutoff, evaluating only the top $k$ results, which mirrors how users interact with paginated interfaces. ## 2. Mean Reciprocal Rank ### 2.1 Definition Mean reciprocal rank (MRR) is the simplest order aware metric. It assumes a single notion of correctness per query, namely the position of the first relevant item. Let $\mathrm{rank}_q$ be the rank of the first relevant item in the list returned for query $q$. The reciprocal rank is $1 / \mathrm{rank}_q$, and MRR averages this across queries: $$ \mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}. $$ If no relevant item appears (or none appears within the cutoff $k$), the reciprocal rank is defined as $0$. The metric ranges from $0$ to $1$, where $1$ means every query had a relevant item at the top. ### 2.2 Interpretation and Properties The reciprocal weighting is steep. Moving the first relevant item from rank $1$ to rank $2$ drops the score from $1.0$ to $0.5$, while moving it from rank $9$ to rank $10$ changes the score only from $0.111$ to $0.100$. This convexity reflects a strong preference for getting the answer right at the very top and a relative indifference to the deep tail. MRR is the natural metric when each query has essentially one correct answer. Known item search, where a user seeks a specific document they know exists, fits this model. So does factoid question answering, where the system retrieves a passage containing the answer, and navigational web search, where the user wants a particular homepage. Because MRR considers only the first hit, it is a poor choice when queries have many relevant items whose collective coverage matters, such as exploratory search or recommendation feeds. ```text query A: [rel, -, -] RR = 1/1 = 1.000 query B: [-, -, rel] RR = 1/3 = 0.333 query C: [-, rel, -] RR = 1/2 = 0.500 MRR = (1.000 + 0.333 + 0.500) / 3 = 0.611 ``` A subtle point concerns expected reciprocal rank, a related metric that models the probability a user stops at each position given graded relevance. Plain MRR ignores everything after the first relevant item, so it cannot distinguish a list with one relevant item from a list with many, as long as the first relevant position is the same. ## 3. Mean Average Precision ### 3.1 Precision, Recall, and Precision at k To handle queries with multiple relevant items we need set based quantities adapted to ranking. Precision at cutoff $k$ counts the fraction of the top $k$ items that are relevant: $$ P@k = \frac{1}{k} \sum_{j=1}^{k} \mathrm{rel}(i_j), $$ with $\mathrm{rel}(i_j) \in \{0,1\}$. Recall at $k$ is the fraction of all relevant items that appear in the top $k$. Precision at $k$ is intuitive and widely reported, but it is insensitive to ordering within the top $k$: shuffling the first $k$ items leaves $P@k$ unchanged. It also requires choosing $k$ arbitrarily. ### 3.2 Average Precision Average precision (AP) repairs the order insensitivity of precision at $k$ by averaging the precision computed at each rank where a relevant item occurs. For a single query with $R$ relevant items in the candidate pool, $$ \mathrm{AP} = \frac{1}{R} \sum_{k=1}^{n} P@k \cdot \mathrm{rel}(i_k), $$ where the indicator $\mathrm{rel}(i_k)$ ensures only positions holding a relevant item contribute. Each relevant item contributes the precision of the list up to and including its own position. Because precision is high when relevant items are densely packed near the top, AP rewards rankings that front load relevance. Average precision has an elegant geometric meaning: it approximates the area under the precision recall curve for the query. As we walk down the ranked list, each relevant item we encounter raises recall by $1/R$ and produces a precision value; AP is the recall weighted average of those precision values, which traces the curve. ### 3.3 From AP to MAP Mean average precision (MAP) averages AP over the query set: $$ \mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}_q. $$ A worked example clarifies the mechanics. Suppose a query has $R = 3$ relevant items, and the returned list marks relevant positions at ranks $1$, $3$, and $6$: ```text rank: 1 2 3 4 5 6 rel: 1 0 1 0 0 1 P@k at hits: k=1: 1/1 = 1.000 k=3: 2/3 = 0.667 k=6: 3/6 = 0.500 AP = (1.000 + 0.667 + 0.500) / 3 = 0.722 ``` Had the three relevant items appeared at ranks $1$, $2$, $3$ the AP would be $1.0$, the maximum. Pushing them deeper lowers AP smoothly, so MAP captures both how many relevant items are retrieved and how early they appear. ### 3.4 Strengths and Limitations MAP is the standard single number summary in classical information retrieval evaluations, and it is stable because it integrates information from every relevant item rather than just the first. Its principal limitation is that it assumes binary relevance. Graded judgments must be collapsed to relevant or not, discarding the distinction between a perfect result and a merely acceptable one. When such gradations matter, the next metric is preferred. ## 4. Normalized Discounted Cumulative Gain ### 4.1 Gain and Discount Normalized discounted cumulative gain (NDCG) is built for graded relevance. It rests on two ideas. First, each item carries a gain that grows with its relevance grade. A common choice is the exponential gain $2^{\mathrm{rel}(i)} - 1$, which sharply rewards highly relevant items; a linear gain equal to $\mathrm{rel}(i)$ is also used. Second, gain is discounted by position so that a relevant item lower in the list contributes less. The standard discount is logarithmic, $1 / \log_2(j + 1)$ for rank $j$. Discounted cumulative gain at cutoff $k$ combines these: $$ \mathrm{DCG}@k = \sum_{j=1}^{k} \frac{2^{\mathrm{rel}(i_j)} - 1}{\log_2(j + 1)}. $$ The logarithmic discount falls off more gently than the reciprocal discount of MRR, encoding the assumption that users have diminishing but nonzero patience as they scan downward. ### 4.2 Normalization Raw DCG is not comparable across queries, because a query with many highly relevant items can accumulate more gain than a query with few. To normalize, we divide by the ideal DCG, denoted $\mathrm{IDCG}@k$, which is the DCG of the best possible ranking obtained by sorting all items in descending order of relevance grade. The normalized metric is $$ \mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}. $$ NDCG ranges from $0$ to $1$, where $1$ means the system produced an ideally ordered list down to position $k$. Averaging over queries yields the reported value: $$ \overline{\mathrm{NDCG}}@k = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{NDCG}_q@k. $$ ### 4.3 Worked Example Consider a query with five retrieved items carrying graded relevance $(3, 2, 0, 1, 2)$ at ranks $1$ through $5$, using exponential gain. ```text rank j rel gain = 2^rel - 1 discount = 1/log2(j+1) contribution 1 3 7 1.000 7.000 2 2 3 0.631 1.893 3 0 0 0.500 0.000 4 1 1 0.431 0.431 5 2 3 0.387 1.161 DCG@5 = 7.000 + 1.893 + 0.000 + 0.431 + 1.161 = 10.485 ``` The ideal ordering sorts the grades as $(3, 2, 2, 1, 0)$. Computing IDCG with the same discounts gives approximately $7.000 + 1.893 + 1.500 + 0.431 + 0 = 10.824$. Therefore $\mathrm{NDCG}@5 \approx 10.485 / 10.824 \approx 0.969$. The high value reflects that the list is close to ideal, marred mainly by the irrelevant item at rank $3$ that displaces a relevant one. ### 4.4 Why NDCG Dominates Modern Practice NDCG handles graded relevance natively, normalizes across heterogeneous queries, and supports a tunable cutoff that matches a page of results. These properties make it the default offline metric for web search and for learning to rank systems, where models are trained to optimize smoothed surrogates of NDCG such as the LambdaMART objective. One caution is that the choice of gain function and discount changes the numbers, so NDCG values are only comparable when the same formulation is used. The exponential gain with logarithmic discount is the most widely reported convention. ## 5. Choosing and Applying a Metric ### 5.1 Search Applications In web and enterprise search, relevance is usually graded by human assessors on a multi point scale, and users scan a ranked page. NDCG at cutoffs such as $5$ or $10$ is the canonical offline metric because it respects both grade and position. For navigational queries with a single correct target, MRR is a sharper diagnostic, and many search teams report both. MAP remains valuable for recall oriented tasks such as legal or patent search, where finding all relevant documents matters and binary relevance is acceptable. ### 5.2 Recommender Applications Recommender systems present ranked feeds where implicit feedback, such as clicks or purchases, defines relevance. Because such feedback is typically binary and sparse, MAP and precision at $k$ are common, alongside NDCG when engagement strength provides graded signal. A practical wrinkle is that offline relevance labels in recommendation are biased: items the user never saw are treated as irrelevant even though they might have been liked. This missing not at random structure means offline ranking metrics should be complemented by online experiments such as A/B tests measuring click through rate, dwell time, or conversion. ### 5.3 Retrieval for Language Models Retrieval augmented generation has revived interest in ranking metrics. The retriever feeds a small number of passages to a language model, so what matters is whether a passage containing the answer appears in the top few. Recall at $k$ and MRR are the most reported retriever metrics here, since a single supporting passage at a high rank often suffices for the downstream model to answer correctly. ### 5.4 Statistical Care A single averaged number hides variance. Per query scores are often highly skewed, so report confidence intervals or use paired significance tests, such as a paired bootstrap or a randomization test, when comparing two systems on the same query set. Cutoffs should be fixed in advance, and ties in scores should be broken by a deterministic rule so that results are reproducible. Finally, remember that offline metrics are proxies for user satisfaction; their value lies in correlating with the online outcomes the system ultimately serves. ## 6. Reference Implementation The companion library ships a small, dependency-light implementation of all three metrics in Python, Julia, and Rust under the module `ch162_ranking_metrics`. Each function takes a per-query relevance list in ranked order, treats a strictly positive entry as relevant, and supports an optional top-$k$ cutoff. The three implementations are checked against a shared set of numeric fixtures so they agree to floating-point tolerance. The derivations above translate directly to code. Reciprocal rank scans for the first positive entry and returns $1/\mathrm{rank}$. Average precision accumulates $\mathrm{hits}/k$ at every hit position and divides by the number of relevant items $R$. DCG sums $(2^{\mathrm{rel}} - 1)/\log_2(j+1)$ over positions, and NDCG normalizes by the DCG of the descending-sorted grades. ::: {.panel-tabset} ## Python ```{python} from aiinaction.ch162_ranking_metrics import ( mean_reciprocal_rank, average_precision, mean_average_precision, dcg, ndcg, ) # MRR: three known-item queries, the single relevant item at ranks 1, 3, 2. mrr_queries = [[1, 0, 0], [0, 0, 1], [0, 1, 0]] print(f"MRR = {mean_reciprocal_rank(mrr_queries):.4f}") # MAP: two queries under binary relevance. ap_q1 = [1, 0, 1, 0, 0, 1] ap_q2 = [0, 1, 1, 0] print(f"AP(q1) = {average_precision(ap_q1):.4f}") print(f"MAP = {mean_average_precision([ap_q1, ap_q2]):.4f}") # NDCG: graded relevance, exponential gain with logarithmic discount. grades = [3, 2, 0, 1, 2] print(f"DCG@5 = {dcg(grades):.4f}") print(f"NDCG@5 = {ndcg(grades):.4f}") ``` ## Julia ```julia using AIInAction.Ch162RankingMetrics # MRR: three known-item queries, the single relevant item at ranks 1, 3, 2. mrr_queries = [[1, 0, 0], [0, 0, 1], [0, 1, 0]] println("MRR = ", round(mean_reciprocal_rank(mrr_queries); digits=4)) # MAP: two queries under binary relevance. ap_q1 = [1, 0, 1, 0, 0, 1] ap_q2 = [0, 1, 1, 0] println("AP(q1) = ", round(average_precision(ap_q1); digits=4)) println("MAP = ", round(mean_average_precision([ap_q1, ap_q2]); digits=4)) # NDCG: graded relevance, exponential gain with logarithmic discount. grades = [3, 2, 0, 1, 2] println("DCG@5 = ", round(dcg(grades); digits=4)) println("NDCG@5 = ", round(ndcg(grades); digits=4)) ``` ## Rust ```rust use aiinaction::ch162_ranking_metrics::{ average_precision, dcg, mean_average_precision, mean_reciprocal_rank, ndcg, }; fn main() { // MRR: three known-item queries, the relevant item at ranks 1, 3, 2. let mrr_queries = vec![ vec![1.0, 0.0, 0.0], vec![0.0, 0.0, 1.0], vec![0.0, 1.0, 0.0], ]; println!("MRR = {:.4}", mean_reciprocal_rank(&mrr_queries, None).unwrap()); // MAP: two queries under binary relevance. let ap_q1 = vec![1.0, 0.0, 1.0, 0.0, 0.0, 1.0]; let ap_q2 = vec![0.0, 1.0, 1.0, 0.0]; println!("AP(q1) = {:.4}", average_precision(&ap_q1, None, None).unwrap()); let queries = vec![ap_q1.clone(), ap_q2]; println!("MAP = {:.4}", mean_average_precision(&queries, None, None).unwrap()); // NDCG: graded relevance, exponential gain with logarithmic discount. let grades = vec![3.0, 2.0, 0.0, 1.0, 2.0]; println!("DCG@5 = {:.4}", dcg(&grades, None, true).unwrap()); println!("NDCG@5 = {:.4}", ndcg(&grades, None, true).unwrap()); } ``` ::: ## 7. Summary Ranking metrics translate the intuition that early positions matter into precise, averageable scores. Mean reciprocal rank targets the single first relevant hit with a steep discount, ideal for known item and question answering tasks. Mean average precision integrates precision across all relevant items under binary relevance, serving recall sensitive retrieval. Normalized discounted cumulative gain accommodates graded relevance with a gentle logarithmic discount and per query normalization, making it the default for modern search and learning to rank. Selecting among them means matching the metric's assumptions to the task: how many items are relevant, whether relevance is graded, how deep users look, and whether offline labels faithfully reflect real satisfaction. ## References 1. Manning, C. D., Raghavan, P., and Schutze, H. Introduction to Information Retrieval. Cambridge University Press, 2008. https://nlp.stanford.edu/IR-book/ 2. Jarvelin, K. and Kekalainen, J. Cumulated Gain Based Evaluation of IR Techniques. ACM Transactions on Information Systems, 2002. https://dl.acm.org/doi/10.1145/582415.582418 3. Voorhees, E. M. The TREC 8 Question Answering Track Report. TREC, 1999. https://trec.nist.gov/pubs/trec8/papers/qa_report.pdf 4. Burges, C. J. C. From RankNet to LambdaRank to LambdaMART: An Overview. Microsoft Research Technical Report, 2010. https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/ 5. Wang, Y., Wang, L., Li, Y., He, D., and Liu, T. A Theoretical Analysis of NDCG Type Ranking Measures. COLT, 2013. https://proceedings.mlr.press/v30/Wang13.html 6. Chapelle, O., Metlzer, D., Zhang, Y., and Grinspan, P. Expected Reciprocal Rank for Graded Relevance. CIKM, 2009. https://dl.acm.org/doi/10.1145/1645953.1646033 7. Karpukhin, V., Oguz, B., Min, S., et al. Dense Passage Retrieval for Open Domain Question Answering. EMNLP, 2020. https://aclanthology.org/2020.emnlp-main.550/