167 Ranking Metrics
Many machine learning systems do not predict a single label or value. Instead they produce an ordered list of items, and the quality of the system depends on whether the most relevant items appear near the top. Search engines, recommender systems, question answering pipelines, and retrieval components inside large language model applications all share this structure. Evaluating such systems requires metrics that respect order, reward relevance early, and remain interpretable across queries with different numbers of relevant items. This chapter develops the three workhorse ranking metrics in rigorous detail: mean reciprocal rank, mean average precision, and normalized discounted cumulative gain. We connect each to the practical concerns of search and recommendation.
167.1 1. The Ranking Evaluation Setup
A ranking system receives a query \(q\) drawn from a distribution of queries and returns an ordered list of items \(\pi = (i_1, i_2, \dots, i_n)\) from a candidate pool. The position of an item in this list is its rank, with rank \(1\) being the top. Associated with each query is a set of relevance judgments. In the simplest binary case, each item is either relevant or not, encoded as \(\mathrm{rel}(i) \in \{0, 1\}\). In the graded case, relevance takes ordinal values such as \(\mathrm{rel}(i) \in \{0, 1, 2, 3, 4\}\), capturing degrees of usefulness from irrelevant to perfect.
The central difficulty is that accuracy style metrics ignore order. A list that places the one relevant document at rank \(1\) and a list that places it at rank \(100\) both contain the same relevant document, yet the user experience differs enormously. Ranking metrics encode the intuition that mistakes near the top of the list cost more than mistakes deep in the list, since users inspect results sequentially and rarely look past the first page.
We evaluate a ranking by computing a per query score and then averaging across a query set \(Q\). Almost all ranking metrics share this two stage structure:
\[ \text{Metric} = \frac{1}{|Q|} \sum_{q \in Q} \text{score}(q). \]
The metrics differ in how they define the per query score. Throughout, we use \(k\) to denote a cutoff, evaluating only the top \(k\) results, which mirrors how users interact with paginated interfaces.
167.2 2. Mean Reciprocal Rank
167.2.1 2.1 Definition
Mean reciprocal rank (MRR) is the simplest order aware metric. It assumes a single notion of correctness per query, namely the position of the first relevant item. Let \(\mathrm{rank}_q\) be the rank of the first relevant item in the list returned for query \(q\). The reciprocal rank is \(1 / \mathrm{rank}_q\), and MRR averages this across queries:
\[ \mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}. \]
If no relevant item appears (or none appears within the cutoff \(k\)), the reciprocal rank is defined as \(0\). The metric ranges from \(0\) to \(1\), where \(1\) means every query had a relevant item at the top.
167.2.2 2.2 Interpretation and Properties
The reciprocal weighting is steep. Moving the first relevant item from rank \(1\) to rank \(2\) drops the score from \(1.0\) to \(0.5\), while moving it from rank \(9\) to rank \(10\) changes the score only from \(0.111\) to \(0.100\). This convexity reflects a strong preference for getting the answer right at the very top and a relative indifference to the deep tail.
MRR is the natural metric when each query has essentially one correct answer. Known item search, where a user seeks a specific document they know exists, fits this model. So does factoid question answering, where the system retrieves a passage containing the answer, and navigational web search, where the user wants a particular homepage. Because MRR considers only the first hit, it is a poor choice when queries have many relevant items whose collective coverage matters, such as exploratory search or recommendation feeds.
query A: [rel, -, -] RR = 1/1 = 1.000
query B: [-, -, rel] RR = 1/3 = 0.333
query C: [-, rel, -] RR = 1/2 = 0.500
MRR = (1.000 + 0.333 + 0.500) / 3 = 0.611
A subtle point concerns expected reciprocal rank, a related metric that models the probability a user stops at each position given graded relevance. Plain MRR ignores everything after the first relevant item, so it cannot distinguish a list with one relevant item from a list with many, as long as the first relevant position is the same.
167.3 3. Mean Average Precision
167.3.1 3.1 Precision, Recall, and Precision at k
To handle queries with multiple relevant items we need set based quantities adapted to ranking. Precision at cutoff \(k\) counts the fraction of the top \(k\) items that are relevant:
\[ P@k = \frac{1}{k} \sum_{j=1}^{k} \mathrm{rel}(i_j), \]
with \(\mathrm{rel}(i_j) \in \{0,1\}\). Recall at \(k\) is the fraction of all relevant items that appear in the top \(k\). Precision at \(k\) is intuitive and widely reported, but it is insensitive to ordering within the top \(k\): shuffling the first \(k\) items leaves \(P@k\) unchanged. It also requires choosing \(k\) arbitrarily.
167.3.2 3.2 Average Precision
Average precision (AP) repairs the order insensitivity of precision at \(k\) by averaging the precision computed at each rank where a relevant item occurs. For a single query with \(R\) relevant items in the candidate pool,
\[ \mathrm{AP} = \frac{1}{R} \sum_{k=1}^{n} P@k \cdot \mathrm{rel}(i_k), \]
where the indicator \(\mathrm{rel}(i_k)\) ensures only positions holding a relevant item contribute. Each relevant item contributes the precision of the list up to and including its own position. Because precision is high when relevant items are densely packed near the top, AP rewards rankings that front load relevance.
Average precision has an elegant geometric meaning: it approximates the area under the precision recall curve for the query. As we walk down the ranked list, each relevant item we encounter raises recall by \(1/R\) and produces a precision value; AP is the recall weighted average of those precision values, which traces the curve.
167.3.3 3.3 From AP to MAP
Mean average precision (MAP) averages AP over the query set:
\[ \mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}_q. \]
A worked example clarifies the mechanics. Suppose a query has \(R = 3\) relevant items, and the returned list marks relevant positions at ranks \(1\), \(3\), and \(6\):
rank: 1 2 3 4 5 6
rel: 1 0 1 0 0 1
P@k at hits:
k=1: 1/1 = 1.000
k=3: 2/3 = 0.667
k=6: 3/6 = 0.500
AP = (1.000 + 0.667 + 0.500) / 3 = 0.722
Had the three relevant items appeared at ranks \(1\), \(2\), \(3\) the AP would be \(1.0\), the maximum. Pushing them deeper lowers AP smoothly, so MAP captures both how many relevant items are retrieved and how early they appear.
167.3.4 3.4 Strengths and Limitations
MAP is the standard single number summary in classical information retrieval evaluations, and it is stable because it integrates information from every relevant item rather than just the first. Its principal limitation is that it assumes binary relevance. Graded judgments must be collapsed to relevant or not, discarding the distinction between a perfect result and a merely acceptable one. When such gradations matter, the next metric is preferred.
167.4 4. Normalized Discounted Cumulative Gain
167.4.1 4.1 Gain and Discount
Normalized discounted cumulative gain (NDCG) is built for graded relevance. It rests on two ideas. First, each item carries a gain that grows with its relevance grade. A common choice is the exponential gain \(2^{\mathrm{rel}(i)} - 1\), which sharply rewards highly relevant items; a linear gain equal to \(\mathrm{rel}(i)\) is also used. Second, gain is discounted by position so that a relevant item lower in the list contributes less. The standard discount is logarithmic, \(1 / \log_2(j + 1)\) for rank \(j\).
Discounted cumulative gain at cutoff \(k\) combines these:
\[ \mathrm{DCG}@k = \sum_{j=1}^{k} \frac{2^{\mathrm{rel}(i_j)} - 1}{\log_2(j + 1)}. \]
The logarithmic discount falls off more gently than the reciprocal discount of MRR, encoding the assumption that users have diminishing but nonzero patience as they scan downward.
167.4.2 4.2 Normalization
Raw DCG is not comparable across queries, because a query with many highly relevant items can accumulate more gain than a query with few. To normalize, we divide by the ideal DCG, denoted \(\mathrm{IDCG}@k\), which is the DCG of the best possible ranking obtained by sorting all items in descending order of relevance grade. The normalized metric is
\[ \mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}. \]
NDCG ranges from \(0\) to \(1\), where \(1\) means the system produced an ideally ordered list down to position \(k\). Averaging over queries yields the reported value:
\[ \overline{\mathrm{NDCG}}@k = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{NDCG}_q@k. \]
167.4.3 4.3 Worked Example
Consider a query with five retrieved items carrying graded relevance \((3, 2, 0, 1, 2)\) at ranks \(1\) through \(5\), using exponential gain.
rank j rel gain = 2^rel - 1 discount = 1/log2(j+1) contribution
1 3 7 1.000 7.000
2 2 3 0.631 1.893
3 0 0 0.500 0.000
4 1 1 0.431 0.431
5 2 3 0.387 1.161
DCG@5 = 7.000 + 1.893 + 0.000 + 0.431 + 1.161 = 10.485
The ideal ordering sorts the grades as \((3, 2, 2, 1, 0)\). Computing IDCG with the same discounts gives approximately \(7.000 + 1.893 + 1.500 + 0.431 + 0 = 10.824\). Therefore \(\mathrm{NDCG}@5 \approx 10.485 / 10.824 \approx 0.969\). The high value reflects that the list is close to ideal, marred mainly by the irrelevant item at rank \(3\) that displaces a relevant one.
167.4.4 4.4 Why NDCG Dominates Modern Practice
NDCG handles graded relevance natively, normalizes across heterogeneous queries, and supports a tunable cutoff that matches a page of results. These properties make it the default offline metric for web search and for learning to rank systems, where models are trained to optimize smoothed surrogates of NDCG such as the LambdaMART objective. One caution is that the choice of gain function and discount changes the numbers, so NDCG values are only comparable when the same formulation is used. The exponential gain with logarithmic discount is the most widely reported convention.
167.5 5. Choosing and Applying a Metric
167.5.1 5.1 Search Applications
In web and enterprise search, relevance is usually graded by human assessors on a multi point scale, and users scan a ranked page. NDCG at cutoffs such as \(5\) or \(10\) is the canonical offline metric because it respects both grade and position. For navigational queries with a single correct target, MRR is a sharper diagnostic, and many search teams report both. MAP remains valuable for recall oriented tasks such as legal or patent search, where finding all relevant documents matters and binary relevance is acceptable.
167.5.2 5.2 Recommender Applications
Recommender systems present ranked feeds where implicit feedback, such as clicks or purchases, defines relevance. Because such feedback is typically binary and sparse, MAP and precision at \(k\) are common, alongside NDCG when engagement strength provides graded signal. A practical wrinkle is that offline relevance labels in recommendation are biased: items the user never saw are treated as irrelevant even though they might have been liked. This missing not at random structure means offline ranking metrics should be complemented by online experiments such as A/B tests measuring click through rate, dwell time, or conversion.
167.5.3 5.3 Retrieval for Language Models
Retrieval augmented generation has revived interest in ranking metrics. The retriever feeds a small number of passages to a language model, so what matters is whether a passage containing the answer appears in the top few. Recall at \(k\) and MRR are the most reported retriever metrics here, since a single supporting passage at a high rank often suffices for the downstream model to answer correctly.
167.5.4 5.4 Statistical Care
A single averaged number hides variance. Per query scores are often highly skewed, so report confidence intervals or use paired significance tests, such as a paired bootstrap or a randomization test, when comparing two systems on the same query set. Cutoffs should be fixed in advance, and ties in scores should be broken by a deterministic rule so that results are reproducible. Finally, remember that offline metrics are proxies for user satisfaction; their value lies in correlating with the online outcomes the system ultimately serves.
167.6 6. Summary
Ranking metrics translate the intuition that early positions matter into precise, averageable scores. Mean reciprocal rank targets the single first relevant hit with a steep discount, ideal for known item and question answering tasks. Mean average precision integrates precision across all relevant items under binary relevance, serving recall sensitive retrieval. Normalized discounted cumulative gain accommodates graded relevance with a gentle logarithmic discount and per query normalization, making it the default for modern search and learning to rank. Selecting among them means matching the metric’s assumptions to the task: how many items are relevant, whether relevance is graded, how deep users look, and whether offline labels faithfully reflect real satisfaction.
167.7 References
- Manning, C. D., Raghavan, P., and Schutze, H. Introduction to Information Retrieval. Cambridge University Press, 2008. https://nlp.stanford.edu/IR-book/
- Jarvelin, K. and Kekalainen, J. Cumulated Gain Based Evaluation of IR Techniques. ACM Transactions on Information Systems, 2002. https://dl.acm.org/doi/10.1145/582415.582418
- Voorhees, E. M. The TREC 8 Question Answering Track Report. TREC, 1999. https://trec.nist.gov/pubs/trec8/papers/qa_report.pdf
- Burges, C. J. C. From RankNet to LambdaRank to LambdaMART: An Overview. Microsoft Research Technical Report, 2010. https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/
- Wang, Y., Wang, L., Li, Y., He, D., and Liu, T. A Theoretical Analysis of NDCG Type Ranking Measures. COLT, 2013. https://proceedings.mlr.press/v30/Wang13.html
- Chapelle, O., Metlzer, D., Zhang, Y., and Grinspan, P. Expected Reciprocal Rank for Graded Relevance. CIKM, 2009. https://dl.acm.org/doi/10.1145/1645953.1646033
- Karpukhin, V., Oguz, B., Min, S., et al. Dense Passage Retrieval for Open Domain Question Answering. EMNLP, 2020. https://aclanthology.org/2020.emnlp-main.550/