124 Cost-Sensitive Learning

Most introductory treatments of classification assume that every mistake carries the same penalty. A false positive and a false negative are tallied together as a single error, and the learner is asked to minimize their count. This assumption is convenient and almost always wrong. Approving a fraudulent transaction does not cost the same as declining a legitimate one. Missing a malignant tumor is not interchangeable with ordering an unnecessary follow-up scan. Cost-sensitive learning is the body of theory and practice that takes these asymmetries seriously, replacing the goal of minimizing error count with the goal of minimizing expected cost. This chapter develops the cost matrix, derives the Bayes-optimal decision rule that follows from it, and surveys the methods that turn that rule into trained models. We close with worked considerations from fraud detection and clinical medicine, two domains where the cost of being wrong is measured in dollars and in lives.

124.1 1. From Error Rates to Expected Cost

124.1.1 1.1 The cost matrix

Consider a classification problem with classes indexed by $i, j \in \{1, \dots, K\}$. The fundamental object of cost-sensitive learning is the cost matrix $C$, whose entry $C(i, j)$ specifies the cost incurred when an instance whose true class is $j$ is predicted to be class $i$. By convention the rows index the predicted label and the columns index the true label. For a binary problem with a positive class (say, fraud or disease) and a negative class, the matrix has four entries:

\[ C = \begin{pmatrix} C(0,0) & C(0,1) \\ C(1,0) & C(1,1) \end{pmatrix} = \begin{pmatrix} C_{TN} & C_{FN} \\ C_{FP} & C_{TP} \end{pmatrix}. \]

Here $C_{FN} = C(0,1)$ is the cost of a false negative (predicting negative when the truth is positive), and $C_{FP} = C(1,0)$ is the cost of a false positive. Correct decisions usually carry small or zero cost, so $C_{TP}$ and $C_{TN}$ are often set to zero, though nothing in the theory requires this. In medicine a correct positive may still carry the cost of treatment, and in fraud a correct decline still annoys a misclassified-but-legitimate customer if the threshold is borderline.

A cost matrix is called reasonable when the cost of a correct classification never exceeds the cost of an error for the same true class. If labeling a true positive correctly cost more than labeling it as a negative, no rational learner would ever predict the positive class. The condition $C(i,j) > C(j,j)$ for $i \neq j$ keeps the problem well posed.

A subtle but important fact, established by Elkan, is that only differences between costs within a column matter for the decision, not their absolute values. Adding a constant to any column of $C$, or multiplying the whole matrix by a positive scalar, leaves the arg-min in the decision rule unchanged. This invariance means a four-entry binary cost matrix has only one degree of freedom that affects decisions, namely the ratio $(C_{FN} - C_{TP}) / (C_{FP} - C_{TN})$. A practitioner can therefore set the two correct-classification costs to zero without loss of generality, reducing the matrix to the single cost ratio that drives every downstream choice. We use this normalization throughout.

124.1.2 1.2 Conditional risk

Suppose a probabilistic classifier supplies, for an input $x$, the posterior class probabilities $P(j \mid x)$. The expected cost of predicting label $i$ for that input, known as the conditional risk, is

\[ R(i \mid x) = \sum_{j=1}^{K} C(i, j)\, P(j \mid x). \]

The Bayes-optimal decision is the label that minimizes this conditional risk:

\[ h^{*}(x) = \arg\min_{i} R(i \mid x) = \arg\min_{i} \sum_{j} C(i,j)\, P(j \mid x). \]

When the cost matrix is the zero-one matrix, $C(i,j) = 1 - \mathbb{1}[i = j]$, this rule collapses to predicting the most probable class, recovering the familiar maximum-a-posteriori classifier. The power of the framework is that any asymmetry in $C$ tilts the decision boundary away from the probability-maximizing choice in a precisely quantified way.

124.2 2. The Bayes-Optimal Threshold

124.2.1 2.1 Deriving the threshold for binary problems

For two classes the decision reduces to comparing two conditional risks. Write $p = P(1 \mid x)$ for the posterior probability of the positive class, so $P(0 \mid x) = 1 - p$. We predict positive when $R(1 \mid x) \le R(0 \mid x)$, that is when

\[ C_{TP}\, p + C_{FP}\,(1 - p) \;\le\; C_{FN}\, p + C_{TN}\,(1 - p). \]

Rearranging to isolate $p$ yields the optimal threshold $p^{*}$ above which we should predict positive:

\[ p^{*} = \frac{C_{FP} - C_{TN}}{(C_{FP} - C_{TN}) + (C_{FN} - C_{TP})}. \]

With the common simplification $C_{TP} = C_{TN} = 0$, this becomes the compact and memorable form

\[ p^{*} = \frac{C_{FP}}{C_{FP} + C_{FN}}. \]

The interpretation is direct. If a false negative is nine times as costly as a false positive, then $p^{*} = 1 / (1 + 9) = 0.1$, and we should flag the positive class whenever its probability exceeds ten percent rather than fifty percent. Raising the cost of misses pushes the threshold down and makes the classifier more aggressive about predicting the expensive class. The general formula for $p^{*}$ above is sometimes called the Elkan threshold, and it is the binary specialization of the conditional-risk minimization in Section 1.2.

124.2.2 2.2 A worked example

Suppose a fraud screen produces a calibrated probability of fraud for a transaction. A missed fraud costs the issuer the disputed amount, here $C_{FN} = 200$ dollars, while a false decline costs an estimated $C_{FP} = 8$ dollars in handling and customer friction. Correct decisions cost nothing, so the threshold is

\[ p^{*} = \frac{C_{FP}}{C_{FP} + C_{FN}} = \frac{8}{8 + 200} = \frac{8}{208} \approx 0.0385. \]

Any transaction whose fraud probability exceeds about 3.85 percent should be flagged. Consider a transaction scored at $p = 0.06$. The two conditional risks are

\[ R(\text{flag}) = C_{FP}\,(1 - p) = 8 \times 0.94 = 7.52, \qquad R(\text{approve}) = C_{FN}\, p = 200 \times 0.06 = 12.0. \]

Flagging has the lower expected cost, 7.52 against 12.0, so the Bayes rule flags the transaction even though fraud is far from the more probable outcome. A naive maximum-probability classifier, which approves whenever $p < 0.5$, would have approved it and accepted an expected loss of 12 dollars. The gap between 7.52 and 12.0 is the per-transaction saving the cost-sensitive rule captures, and over a large stream of transactions it compounds into the difference between a profitable and an unprofitable screen.

124.2.3 2.3 Why threshold moving requires calibration

The threshold formula assumes that $p$ is a genuine probability. A score is calibrated when, among all instances assigned probability $q$, the long-run fraction that are truly positive equals $q$. Many models output scores that rank instances correctly but are not calibrated, meaning the predicted $0.1$ does not correspond to a true positive rate of ten percent among instances scored at that level. Applying the Bayes threshold to uncalibrated scores produces decisions that are optimal in name only, because the input to the threshold formula is not the quantity the formula assumes.

Calibration quality is diagnosed with a reliability diagram, which bins predictions by score and plots observed frequency against mean predicted probability; a perfectly calibrated model lies on the diagonal. Before threshold moving, scores should be calibrated, for example with Platt scaling, which fits a one-dimensional logistic transform to the scores, or isotonic regression, which fits a monotone step function and is more flexible but needs more data to avoid overfitting. Tree ensembles and support vector machines are notorious for miscalibration, ensembles tending to be over-confident in the tails and margin-based methods pushing scores toward the extremes, so this step is rarely optional in practice. The open-source scikit-learn library provides both calibrators through CalibratedClassifierCV and a reliability-diagram helper, fit on a held-out split distinct from the data used to train the base model.

# Conceptual: calibrate, then apply the cost-based threshold.
calibrated = calibrate(model, validation_X, validation_y)  # Platt or isotonic
p = calibrated.predict_proba(X)[:, 1]
threshold = c_fp / (c_fp + c_fn)
y_hat = (p >= threshold).astype(int)

124.2.4 2.4 Threshold moving versus retraining

Threshold moving is attractive because it leaves the trained model untouched and adjusts only the final decision. It is provably optimal when the model produces correct posterior probabilities, and it lets a single model serve multiple cost regimes by swapping thresholds. Its limitation is that a model trained to minimize log loss on an imbalanced dataset may spend almost none of its capacity learning the minority region, so even a perfectly placed threshold operates on a poorly resolved probability surface. When that happens, the cost structure must enter training itself, which motivates the methods of the next section.

124.3 3. Cost-Sensitive Training

124.3.1 3.1 Rescaling and instance weighting

The most widely used training-time technique is to weight each instance by the cost of misclassifying it. If positives cost $C_{FN}$ to miss and negatives cost $C_{FP}$ to falsely flag, we assign weight $C_{FN}$ to positive examples and $C_{FP}$ to negative examples, then minimize the weighted empirical risk

\[ \hat{R}(h) = \frac{1}{n} \sum_{k=1}^{n} w_k\, \ell\big(h(x_k), y_k\big), \qquad w_k = C\big(\bar{y}_k, y_k\big), \]

where $\bar{y}_k$ denotes the class that the cost matrix penalizes for an instance of true class $y_k$ (the opposite class in the binary case), so positives receive weight $C_{FN}$ and negatives receive weight $C_{FP}$.

This is the rescaling principle, made precise by Elkan: a cost-sensitive problem with a given cost matrix is equivalent to a cost-insensitive problem on a dataset where the base rate of each class is changed by exactly the factor that reproduces the cost ratio. Concretely, to honor a threshold $p^{*}$ the negative class should be resampled by the factor that maps the original base rate to the target implied by $p^{*}$, which for the normalized matrix amounts to weighting positives and negatives in the ratio $C_{FN} : C_{FP}$. Most modern open-source libraries expose this directly through a class_weight parameter or per-sample weights, and gradient-based learners simply scale each example’s gradient contribution by $w_k$.

# Class-weighted training mirrors the cost ratio.
weights = {0: c_fp, 1: c_fn}
model.fit(X, y, class_weight=weights)

Resampling achieves a similar effect by oversampling the expensive class or undersampling the cheap one until the empirical class ratio matches the cost ratio. Resampling changes the data the model sees, whereas weighting changes the loss while keeping the data fixed; the two are theoretically interchangeable for losses that are linear in the per-instance term, but they differ in variance and in their interaction with stochastic optimization.

124.3.2 3.2 Cost-sensitive loss functions

A cleaner approach folds the cost matrix into the loss function so that the model optimizes expected cost end to end. For binary classification a weighted cross-entropy

\[ \ell(p, y) = -\,C_{FN}\, y \log p \;-\; C_{FP}\,(1 - y) \log(1 - p) \]

penalizes errors on positives and negatives asymmetrically. In deep learning, the focal loss extends this idea by adding a modulating factor $(1 - p_t)^{\gamma}$ that down-weights easy, well-classified examples and concentrates learning on the hard minority cases, which is why it became standard for dense object detection where background examples vastly outnumber objects.

When the full cost matrix varies per instance, as when each fraudulent transaction has a different dollar amount, the loss can carry example-specific costs rather than per-class constants. This example-dependent cost-sensitive learning is strictly more expressive than class weighting and matters whenever the cost of an error depends on the instance, not merely on its class.

124.3.3 3.3 Cost-sensitive trees and ensembles

Decision tree learners can be made cost sensitive at the splitting criterion, choosing splits that reduce expected cost rather than impurity, and at the leaf, labeling each leaf with the cost-minimizing class given the leaf’s class distribution and the cost matrix. Boosting admits an elegant cost-sensitive variant: AdaCost and related algorithms modify the weight-update rule of AdaBoost so that misclassifying a high-cost example increases its weight more sharply, steering successive weak learners toward the expensive mistakes.

124.4 4. Metalearning and Wrapper Methods

124.4.1 4.1 MetaCost

A central insight of cost-sensitive learning is that one can make an arbitrary error-based classifier cost sensitive without modifying its internals. MetaCost, introduced by Domingos, wraps any classifier in a procedure that relabels the training data according to the Bayes-optimal rule and then retrains on the relabeled data. The algorithm proceeds as follows. It learns an ensemble of classifiers on bootstrap resamples of the training set, uses the ensemble to estimate $P(j \mid x)$ for every training instance, then relabels each instance with the class that minimizes conditional risk $\sum_j C(i,j) P(j \mid x)$. A final model is trained on this relabeled set. The relabeling pushes class boundaries outward around expensive classes, so even a plain error-minimizing learner, trained on the new labels, behaves cost sensitively.

# MetaCost skeleton.
ensemble = [base.fit(resample(X, y)) for _ in range(m)]
P = average_proba(ensemble, X)          # estimate posteriors
relabel = argmin_over_i(C @ P.T)        # Bayes-optimal label per instance
final_model = base.fit(X, relabel)      # retrain on relabeled data

124.4.2 4.2 Empirical thresholding and the empirical-cost view

When calibration is doubtful, a robust alternative to the analytic threshold is empirical thresholding: sweep the decision threshold over a validation set and select the value that minimizes measured total cost. This sidesteps the calibration requirement because it optimizes the realized cost directly rather than trusting the probability values. The cost over a validation set of size $n$ is

\[ \text{Cost} = C_{FP}\cdot\text{FP} + C_{FN}\cdot\text{FN} + C_{TP}\cdot\text{TP} + C_{TN}\cdot\text{TN}, \]

and the chosen threshold is the one that minimizes this quantity. Empirical thresholding is the workhorse of production systems precisely because it is agnostic to how well the underlying scores are calibrated.

124.4.3 4.3 Choosing a strategy

These strategies form a rough hierarchy. If a model is well calibrated and costs are known, threshold moving is the simplest correct choice. If calibration is suspect, empirical thresholding recovers robustness at the price of a validation sweep. If the minority region is poorly resolved, training-time weighting or resampling improves the underlying probabilities. If costs vary per instance, example-dependent losses are required. And if the base learner cannot be modified, MetaCost provides a wrapper that achieves cost sensitivity through relabeling. In practice these are combined: weight during training to resolve the boundary, then tune the threshold empirically to lock in the operating point.

The following diagram summarizes the selection logic as a sequence of questions.

flowchart TD
    A["Costs known and asymmetric"] --> B{"Are scores well calibrated"}
    B -- "Yes" --> C{"Do costs vary per instance"}
    B -- "No" --> D["Empirical thresholding on validation set"]
    C -- "No" --> E["Apply Elkan threshold to calibrated scores"]
    C -- "Yes" --> F["Example dependent cost sensitive loss"]
    D --> G{"Is the minority region poorly resolved"}
    E --> G
    F --> G
    G -- "Yes" --> H["Add training time weighting or resampling"]
    G -- "No" --> I["Deploy and monitor realized cost"]
    H --> I

When the base learner is a black box that cannot be reweighted or have its loss changed, route instead through MetaCost, which sits outside this flow by relabeling the data so that an ordinary error-minimizing learner inherits the cost structure.

124.5 5. Applications

124.5.1 5.1 Fraud detection

Card fraud is the canonical example-dependent cost problem. The cost of a missed fraud is roughly the transaction amount the issuer must reimburse, so $C_{FN}$ is not a constant but a per-instance value equal to the disputed sum. The cost of a false positive is subtler: a wrongly declined transaction costs the small operational expense of a fraud alert plus the much larger and harder to quantify cost of customer friction and attrition. Because legitimate transactions outnumber fraudulent ones by a factor of hundreds or thousands, accuracy is a useless metric; a model that approves everything achieves over ninety-nine percent accuracy while reimbursing every fraud loss.

The practical pipeline calibrates the fraud score, then sets a per-transaction threshold using the transaction amount as $C_{FN}$, so that a five thousand dollar transaction is flagged at a far lower probability than a five dollar one. Because fraud distributions drift as adversaries adapt, both the model and the threshold are re-estimated frequently, and the operating point is chosen by minimizing realized monetary cost on recent labeled data rather than by any fixed probability cutoff.

124.5.2 5.2 Medical diagnosis and screening

Clinical classification is dominated by the asymmetry between false negatives and false positives. Missing a malignancy can be fatal, while a false positive typically triggers an additional test, with cost measured in money, radiation exposure, biopsy risk, and patient anxiety. The ratio $C_{FN} / C_{FP}$ in screening can easily reach ten or higher, which by the threshold formula pushes the positive-prediction threshold well below one half and makes screening tools deliberately oversensitive. This is a feature, not a defect: screening is designed to catch disease at the expense of follow-up workload, with a downstream confirmatory test absorbing the false positives.

Three cautions apply. First, costs are rarely scalar and rarely agreed upon; the cost of a false negative depends on disease stage and treatability, and different stakeholders, namely patients, clinicians, and payers, weight outcomes differently. Second, costs are often dynamic, since a missed early-stage cancer becomes far more expensive as it advances, which argues for example-dependent rather than class-constant costs. Third, eliciting credible cost values is itself a research problem, frequently approached by reasoning backward from a clinically acceptable sensitivity to the implied cost ratio. The cost-sensitive framework does not invent these numbers, but it makes the consequences of any chosen numbers explicit and auditable.

124.6 6. Evaluation Under Asymmetric Costs

Accuracy is meaningless when costs are asymmetric and classes are imbalanced, so cost-sensitive systems are evaluated with metrics aligned to the decision. The natural object is the entrywise product of the confusion matrix with the cost matrix. If the confusion matrix counts true negatives, false negatives, false positives, and true positives as $(TN, FN, FP, TP)$, the realized total cost on a held-out set is

\[ \text{Cost} = C_{TN}\,TN + C_{FN}\,FN + C_{FP}\,FP + C_{TP}\,TP, \]

and with the normalized matrix this reduces to $C_{FP}\,FP + C_{FN}\,FN$. Dividing by the number of instances gives the empirical conditional risk, which is the finite-sample estimate of the Bayes risk the whole framework seeks to minimize. Total cost computed with the operative cost matrix is therefore the most direct measure and should be reported whenever costs are known.

When a single threshold is not yet fixed, ranking metrics such as the area under the precision-recall curve summarize performance across operating points and are far more informative than the area under the ROC curve in the heavily imbalanced regime, because precision responds to the prevalence of the minority class while the false positive rate does not (reference 9). The decisive habit is to evaluate models on the cost they incur, not the errors they count, since two models with identical error rates can differ enormously in expected cost once the asymmetry is taken into account.

124.6.1 6.1 When to use cost-sensitive methods, and common pitfalls

Reach for these methods whenever the penalties for different errors genuinely differ, whenever classes are imbalanced enough that accuracy is dominated by the majority, or whenever the operating point must be tuned to a business or clinical constraint. If costs are truly symmetric and classes are balanced, the ordinary error-minimizing setup is already correct and the extra machinery adds only complexity.

Several pitfalls recur in practice. The first is applying the analytic threshold to uncalibrated scores, which silently violates the assumption behind the formula; calibrate first or use empirical thresholding. The second is tuning the threshold on the same data used to fit the model, which leaks information and produces an optimistic operating point; use a separate validation split. The third is treating costs as fixed and known when they are estimated, dynamic, or contested, as is typical in medicine; sensitivity analysis over a plausible range of cost ratios is more honest than a single point estimate. The fourth is conflating resampling with weighting under stochastic optimization, where they can behave differently even though they are equivalent in expectation. The final pitfall is distribution drift: a threshold tuned on last quarter’s data can drift out of optimality as the base rate or the cost of fraud changes, so the operating point must be re-estimated on recent labeled data rather than frozen at deployment.

124.7 References

Elkan, C. (2001). The Foundations of Cost-Sensitive Learning. Proceedings of IJCAI 2001. https://cseweb.ucsd.edu/~elkan/rescale.pdf
Domingos, P. (1999). MetaCost: A General Method for Making Classifiers Cost-Sensitive. Proceedings of KDD 1999. https://homes.cs.washington.edu/~pedrod/papers/kdd99.pdf
Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollar, P. (2017). Focal Loss for Dense Object Detection. Proceedings of ICCV 2017. https://arxiv.org/abs/1708.02002
Fan, W., Stolfo, S., Zhang, J., Chan, P. (1999). AdaCost: Misclassification Cost-Sensitive Boosting. Proceedings of ICML 1999. https://www.cs.columbia.edu/~sal/hpapers/adacost.pdf
Bahnsen, A. C., Aouada, D., Ottersten, B. (2015). Example-Dependent Cost-Sensitive Decision Trees. Expert Systems with Applications. https://albahnsen.github.io/files/Example-Dependent%20Cost-Sensitive%20Decision%20Trees.pdf
Provost, F., Fawcett, T. (2001). Robust Classification for Imprecise Environments. Machine Learning, 42. https://link.springer.com/article/10.1023/A:1007601015854
Niculescu-Mizil, A., Caruana, R. (2005). Predicting Good Probabilities with Supervised Learning. Proceedings of ICML 2005. https://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf
He, H., Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21(9). https://ieeexplore.ieee.org/document/5128907
Saito, T., Rehmsmeier, M. (2015). The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE, 10(3), e0118432. https://doi.org/10.1371/journal.pone.0118432

# Cost-Sensitive Learning Most introductory treatments of classification assume that every mistake carries the same penalty. A false positive and a false negative are tallied together as a single error, and the learner is asked to minimize their count. This assumption is convenient and almost always wrong. Approving a fraudulent transaction does not cost the same as declining a legitimate one. Missing a malignant tumor is not interchangeable with ordering an unnecessary follow-up scan. Cost-sensitive learning is the body of theory and practice that takes these asymmetries seriously, replacing the goal of minimizing error count with the goal of minimizing expected cost. This chapter develops the cost matrix, derives the Bayes-optimal decision rule that follows from it, and surveys the methods that turn that rule into trained models. We close with worked considerations from fraud detection and clinical medicine, two domains where the cost of being wrong is measured in dollars and in lives. ## 1. From Error Rates to Expected Cost ### 1.1 The cost matrix Consider a classification problem with classes indexed by $i, j \in \{1, \dots, K\}$. The fundamental object of cost-sensitive learning is the cost matrix $C$, whose entry $C(i, j)$ specifies the cost incurred when an instance whose true class is $j$ is predicted to be class $i$. By convention the rows index the predicted label and the columns index the true label. For a binary problem with a positive class (say, fraud or disease) and a negative class, the matrix has four entries: $$ C = \begin{pmatrix} C(0,0) & C(0,1) \\ C(1,0) & C(1,1) \end{pmatrix} = \begin{pmatrix} C_{TN} & C_{FN} \\ C_{FP} & C_{TP} \end{pmatrix}. $$ Here $C_{FN} = C(0,1)$ is the cost of a false negative (predicting negative when the truth is positive), and $C_{FP} = C(1,0)$ is the cost of a false positive. Correct decisions usually carry small or zero cost, so $C_{TP}$ and $C_{TN}$ are often set to zero, though nothing in the theory requires this. In medicine a correct positive may still carry the cost of treatment, and in fraud a correct decline still annoys a misclassified-but-legitimate customer if the threshold is borderline. A cost matrix is called reasonable when the cost of a correct classification never exceeds the cost of an error for the same true class. If labeling a true positive correctly cost more than labeling it as a negative, no rational learner would ever predict the positive class. The condition $C(i,j) > C(j,j)$ for $i \neq j$ keeps the problem well posed. A subtle but important fact, established by Elkan, is that only differences between costs within a column matter for the decision, not their absolute values. Adding a constant to any column of $C$, or multiplying the whole matrix by a positive scalar, leaves the arg-min in the decision rule unchanged. This invariance means a four-entry binary cost matrix has only one degree of freedom that affects decisions, namely the ratio $(C_{FN} - C_{TP}) / (C_{FP} - C_{TN})$. A practitioner can therefore set the two correct-classification costs to zero without loss of generality, reducing the matrix to the single cost ratio that drives every downstream choice. We use this normalization throughout. ### 1.2 Conditional risk Suppose a probabilistic classifier supplies, for an input $x$, the posterior class probabilities $P(j \mid x)$. The expected cost of predicting label $i$ for that input, known as the conditional risk, is $$ R(i \mid x) = \sum_{j=1}^{K} C(i, j)\, P(j \mid x). $$ The Bayes-optimal decision is the label that minimizes this conditional risk: $$ h^{*}(x) = \arg\min_{i} R(i \mid x) = \arg\min_{i} \sum_{j} C(i,j)\, P(j \mid x). $$ When the cost matrix is the zero-one matrix, $C(i,j) = 1 - \mathbb{1}[i = j]$, this rule collapses to predicting the most probable class, recovering the familiar maximum-a-posteriori classifier. The power of the framework is that any asymmetry in $C$ tilts the decision boundary away from the probability-maximizing choice in a precisely quantified way. ## 2. The Bayes-Optimal Threshold ### 2.1 Deriving the threshold for binary problems For two classes the decision reduces to comparing two conditional risks. Write $p = P(1 \mid x)$ for the posterior probability of the positive class, so $P(0 \mid x) = 1 - p$. We predict positive when $R(1 \mid x) \le R(0 \mid x)$, that is when $$ C_{TP}\, p + C_{FP}\,(1 - p) \;\le\; C_{FN}\, p + C_{TN}\,(1 - p). $$ Rearranging to isolate $p$ yields the optimal threshold $p^{*}$ above which we should predict positive: $$ p^{*} = \frac{C_{FP} - C_{TN}}{(C_{FP} - C_{TN}) + (C_{FN} - C_{TP})}. $$ With the common simplification $C_{TP} = C_{TN} = 0$, this becomes the compact and memorable form $$ p^{*} = \frac{C_{FP}}{C_{FP} + C_{FN}}. $$ The interpretation is direct. If a false negative is nine times as costly as a false positive, then $p^{*} = 1 / (1 + 9) = 0.1$, and we should flag the positive class whenever its probability exceeds ten percent rather than fifty percent. Raising the cost of misses pushes the threshold down and makes the classifier more aggressive about predicting the expensive class. The general formula for $p^{*}$ above is sometimes called the Elkan threshold, and it is the binary specialization of the conditional-risk minimization in Section 1.2. ### 2.2 A worked example Suppose a fraud screen produces a calibrated probability of fraud for a transaction. A missed fraud costs the issuer the disputed amount, here $C_{FN} = 200$ dollars, while a false decline costs an estimated $C_{FP} = 8$ dollars in handling and customer friction. Correct decisions cost nothing, so the threshold is $$ p^{*} = \frac{C_{FP}}{C_{FP} + C_{FN}} = \frac{8}{8 + 200} = \frac{8}{208} \approx 0.0385. $$ Any transaction whose fraud probability exceeds about 3.85 percent should be flagged. Consider a transaction scored at $p = 0.06$. The two conditional risks are $$ R(\text{flag}) = C_{FP}\,(1 - p) = 8 \times 0.94 = 7.52, \qquad R(\text{approve}) = C_{FN}\, p = 200 \times 0.06 = 12.0. $$ Flagging has the lower expected cost, 7.52 against 12.0, so the Bayes rule flags the transaction even though fraud is far from the more probable outcome. A naive maximum-probability classifier, which approves whenever $p < 0.5$, would have approved it and accepted an expected loss of 12 dollars. The gap between 7.52 and 12.0 is the per-transaction saving the cost-sensitive rule captures, and over a large stream of transactions it compounds into the difference between a profitable and an unprofitable screen. ### 2.3 Why threshold moving requires calibration The threshold formula assumes that $p$ is a genuine probability. A score is calibrated when, among all instances assigned probability $q$, the long-run fraction that are truly positive equals $q$. Many models output scores that rank instances correctly but are not calibrated, meaning the predicted $0.1$ does not correspond to a true positive rate of ten percent among instances scored at that level. Applying the Bayes threshold to uncalibrated scores produces decisions that are optimal in name only, because the input to the threshold formula is not the quantity the formula assumes. Calibration quality is diagnosed with a reliability diagram, which bins predictions by score and plots observed frequency against mean predicted probability; a perfectly calibrated model lies on the diagonal. Before threshold moving, scores should be calibrated, for example with Platt scaling, which fits a one-dimensional logistic transform to the scores, or isotonic regression, which fits a monotone step function and is more flexible but needs more data to avoid overfitting. Tree ensembles and support vector machines are notorious for miscalibration, ensembles tending to be over-confident in the tails and margin-based methods pushing scores toward the extremes, so this step is rarely optional in practice. The open-source `scikit-learn` library provides both calibrators through `CalibratedClassifierCV` and a reliability-diagram helper, fit on a held-out split distinct from the data used to train the base model. ```python # Conceptual: calibrate, then apply the cost-based threshold. calibrated = calibrate(model, validation_X, validation_y) # Platt or isotonic p = calibrated.predict_proba(X)[:, 1] threshold = c_fp / (c_fp + c_fn) y_hat = (p >= threshold).astype(int) ``` ### 2.4 Threshold moving versus retraining Threshold moving is attractive because it leaves the trained model untouched and adjusts only the final decision. It is provably optimal when the model produces correct posterior probabilities, and it lets a single model serve multiple cost regimes by swapping thresholds. Its limitation is that a model trained to minimize log loss on an imbalanced dataset may spend almost none of its capacity learning the minority region, so even a perfectly placed threshold operates on a poorly resolved probability surface. When that happens, the cost structure must enter training itself, which motivates the methods of the next section. ## 3. Cost-Sensitive Training ### 3.1 Rescaling and instance weighting The most widely used training-time technique is to weight each instance by the cost of misclassifying it. If positives cost $C_{FN}$ to miss and negatives cost $C_{FP}$ to falsely flag, we assign weight $C_{FN}$ to positive examples and $C_{FP}$ to negative examples, then minimize the weighted empirical risk $$ \hat{R}(h) = \frac{1}{n} \sum_{k=1}^{n} w_k\, \ell\big(h(x_k), y_k\big), \qquad w_k = C\big(\bar{y}_k, y_k\big), $$ where $\bar{y}_k$ denotes the class that the cost matrix penalizes for an instance of true class $y_k$ (the opposite class in the binary case), so positives receive weight $C_{FN}$ and negatives receive weight $C_{FP}$. This is the rescaling principle, made precise by Elkan: a cost-sensitive problem with a given cost matrix is equivalent to a cost-insensitive problem on a dataset where the base rate of each class is changed by exactly the factor that reproduces the cost ratio. Concretely, to honor a threshold $p^{*}$ the negative class should be resampled by the factor that maps the original base rate to the target implied by $p^{*}$, which for the normalized matrix amounts to weighting positives and negatives in the ratio $C_{FN} : C_{FP}$. Most modern open-source libraries expose this directly through a `class_weight` parameter or per-sample weights, and gradient-based learners simply scale each example's gradient contribution by $w_k$. ```python # Class-weighted training mirrors the cost ratio. weights = {0: c_fp, 1: c_fn} model.fit(X, y, class_weight=weights) ``` Resampling achieves a similar effect by oversampling the expensive class or undersampling the cheap one until the empirical class ratio matches the cost ratio. Resampling changes the data the model sees, whereas weighting changes the loss while keeping the data fixed; the two are theoretically interchangeable for losses that are linear in the per-instance term, but they differ in variance and in their interaction with stochastic optimization. ### 3.2 Cost-sensitive loss functions A cleaner approach folds the cost matrix into the loss function so that the model optimizes expected cost end to end. For binary classification a weighted cross-entropy $$ \ell(p, y) = -\,C_{FN}\, y \log p \;-\; C_{FP}\,(1 - y) \log(1 - p) $$ penalizes errors on positives and negatives asymmetrically. In deep learning, the focal loss extends this idea by adding a modulating factor $(1 - p_t)^{\gamma}$ that down-weights easy, well-classified examples and concentrates learning on the hard minority cases, which is why it became standard for dense object detection where background examples vastly outnumber objects. When the full cost matrix varies per instance, as when each fraudulent transaction has a different dollar amount, the loss can carry example-specific costs rather than per-class constants. This example-dependent cost-sensitive learning is strictly more expressive than class weighting and matters whenever the cost of an error depends on the instance, not merely on its class. ### 3.3 Cost-sensitive trees and ensembles Decision tree learners can be made cost sensitive at the splitting criterion, choosing splits that reduce expected cost rather than impurity, and at the leaf, labeling each leaf with the cost-minimizing class given the leaf's class distribution and the cost matrix. Boosting admits an elegant cost-sensitive variant: AdaCost and related algorithms modify the weight-update rule of AdaBoost so that misclassifying a high-cost example increases its weight more sharply, steering successive weak learners toward the expensive mistakes. ## 4. Metalearning and Wrapper Methods ### 4.1 MetaCost A central insight of cost-sensitive learning is that one can make an arbitrary error-based classifier cost sensitive without modifying its internals. MetaCost, introduced by Domingos, wraps any classifier in a procedure that relabels the training data according to the Bayes-optimal rule and then retrains on the relabeled data. The algorithm proceeds as follows. It learns an ensemble of classifiers on bootstrap resamples of the training set, uses the ensemble to estimate $P(j \mid x)$ for every training instance, then relabels each instance with the class that minimizes conditional risk $\sum_j C(i,j) P(j \mid x)$. A final model is trained on this relabeled set. The relabeling pushes class boundaries outward around expensive classes, so even a plain error-minimizing learner, trained on the new labels, behaves cost sensitively. ```python # MetaCost skeleton. ensemble = [base.fit(resample(X, y)) for _ in range(m)] P = average_proba(ensemble, X) # estimate posteriors relabel = argmin_over_i(C @ P.T) # Bayes-optimal label per instance final_model = base.fit(X, relabel) # retrain on relabeled data ``` ### 4.2 Empirical thresholding and the empirical-cost view When calibration is doubtful, a robust alternative to the analytic threshold is empirical thresholding: sweep the decision threshold over a validation set and select the value that minimizes measured total cost. This sidesteps the calibration requirement because it optimizes the realized cost directly rather than trusting the probability values. The cost over a validation set of size $n$ is $$ \text{Cost} = C_{FP}\cdot\text{FP} + C_{FN}\cdot\text{FN} + C_{TP}\cdot\text{TP} + C_{TN}\cdot\text{TN}, $$ and the chosen threshold is the one that minimizes this quantity. Empirical thresholding is the workhorse of production systems precisely because it is agnostic to how well the underlying scores are calibrated. ### 4.3 Choosing a strategy These strategies form a rough hierarchy. If a model is well calibrated and costs are known, threshold moving is the simplest correct choice. If calibration is suspect, empirical thresholding recovers robustness at the price of a validation sweep. If the minority region is poorly resolved, training-time weighting or resampling improves the underlying probabilities. If costs vary per instance, example-dependent losses are required. And if the base learner cannot be modified, MetaCost provides a wrapper that achieves cost sensitivity through relabeling. In practice these are combined: weight during training to resolve the boundary, then tune the threshold empirically to lock in the operating point. The following diagram summarizes the selection logic as a sequence of questions. ```{mermaid} flowchart TD A["Costs known and asymmetric"] --> B{"Are scores well calibrated"} B -- "Yes" --> C{"Do costs vary per instance"} B -- "No" --> D["Empirical thresholding on validation set"] C -- "No" --> E["Apply Elkan threshold to calibrated scores"] C -- "Yes" --> F["Example dependent cost sensitive loss"] D --> G{"Is the minority region poorly resolved"} E --> G F --> G G -- "Yes" --> H["Add training time weighting or resampling"] G -- "No" --> I["Deploy and monitor realized cost"] H --> I ``` When the base learner is a black box that cannot be reweighted or have its loss changed, route instead through MetaCost, which sits outside this flow by relabeling the data so that an ordinary error-minimizing learner inherits the cost structure. ## 5. Applications ### 5.1 Fraud detection Card fraud is the canonical example-dependent cost problem. The cost of a missed fraud is roughly the transaction amount the issuer must reimburse, so $C_{FN}$ is not a constant but a per-instance value equal to the disputed sum. The cost of a false positive is subtler: a wrongly declined transaction costs the small operational expense of a fraud alert plus the much larger and harder to quantify cost of customer friction and attrition. Because legitimate transactions outnumber fraudulent ones by a factor of hundreds or thousands, accuracy is a useless metric; a model that approves everything achieves over ninety-nine percent accuracy while reimbursing every fraud loss. The practical pipeline calibrates the fraud score, then sets a per-transaction threshold using the transaction amount as $C_{FN}$, so that a five thousand dollar transaction is flagged at a far lower probability than a five dollar one. Because fraud distributions drift as adversaries adapt, both the model and the threshold are re-estimated frequently, and the operating point is chosen by minimizing realized monetary cost on recent labeled data rather than by any fixed probability cutoff. ### 5.2 Medical diagnosis and screening Clinical classification is dominated by the asymmetry between false negatives and false positives. Missing a malignancy can be fatal, while a false positive typically triggers an additional test, with cost measured in money, radiation exposure, biopsy risk, and patient anxiety. The ratio $C_{FN} / C_{FP}$ in screening can easily reach ten or higher, which by the threshold formula pushes the positive-prediction threshold well below one half and makes screening tools deliberately oversensitive. This is a feature, not a defect: screening is designed to catch disease at the expense of follow-up workload, with a downstream confirmatory test absorbing the false positives. Three cautions apply. First, costs are rarely scalar and rarely agreed upon; the cost of a false negative depends on disease stage and treatability, and different stakeholders, namely patients, clinicians, and payers, weight outcomes differently. Second, costs are often dynamic, since a missed early-stage cancer becomes far more expensive as it advances, which argues for example-dependent rather than class-constant costs. Third, eliciting credible cost values is itself a research problem, frequently approached by reasoning backward from a clinically acceptable sensitivity to the implied cost ratio. The cost-sensitive framework does not invent these numbers, but it makes the consequences of any chosen numbers explicit and auditable. ## 6. Evaluation Under Asymmetric Costs Accuracy is meaningless when costs are asymmetric and classes are imbalanced, so cost-sensitive systems are evaluated with metrics aligned to the decision. The natural object is the entrywise product of the confusion matrix with the cost matrix. If the confusion matrix counts true negatives, false negatives, false positives, and true positives as $(TN, FN, FP, TP)$, the realized total cost on a held-out set is $$ \text{Cost} = C_{TN}\,TN + C_{FN}\,FN + C_{FP}\,FP + C_{TP}\,TP, $$ and with the normalized matrix this reduces to $C_{FP}\,FP + C_{FN}\,FN$. Dividing by the number of instances gives the empirical conditional risk, which is the finite-sample estimate of the Bayes risk the whole framework seeks to minimize. Total cost computed with the operative cost matrix is therefore the most direct measure and should be reported whenever costs are known. When a single threshold is not yet fixed, ranking metrics such as the area under the precision-recall curve summarize performance across operating points and are far more informative than the area under the ROC curve in the heavily imbalanced regime, because precision responds to the prevalence of the minority class while the false positive rate does not (reference 9). The decisive habit is to evaluate models on the cost they incur, not the errors they count, since two models with identical error rates can differ enormously in expected cost once the asymmetry is taken into account. ### 6.1 When to use cost-sensitive methods, and common pitfalls Reach for these methods whenever the penalties for different errors genuinely differ, whenever classes are imbalanced enough that accuracy is dominated by the majority, or whenever the operating point must be tuned to a business or clinical constraint. If costs are truly symmetric and classes are balanced, the ordinary error-minimizing setup is already correct and the extra machinery adds only complexity. Several pitfalls recur in practice. The first is applying the analytic threshold to uncalibrated scores, which silently violates the assumption behind the formula; calibrate first or use empirical thresholding. The second is tuning the threshold on the same data used to fit the model, which leaks information and produces an optimistic operating point; use a separate validation split. The third is treating costs as fixed and known when they are estimated, dynamic, or contested, as is typical in medicine; sensitivity analysis over a plausible range of cost ratios is more honest than a single point estimate. The fourth is conflating resampling with weighting under stochastic optimization, where they can behave differently even though they are equivalent in expectation. The final pitfall is distribution drift: a threshold tuned on last quarter's data can drift out of optimality as the base rate or the cost of fraud changes, so the operating point must be re-estimated on recent labeled data rather than frozen at deployment. ## References 1. Elkan, C. (2001). The Foundations of Cost-Sensitive Learning. Proceedings of IJCAI 2001. https://cseweb.ucsd.edu/~elkan/rescale.pdf 2. Domingos, P. (1999). MetaCost: A General Method for Making Classifiers Cost-Sensitive. Proceedings of KDD 1999. https://homes.cs.washington.edu/~pedrod/papers/kdd99.pdf 3. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollar, P. (2017). Focal Loss for Dense Object Detection. Proceedings of ICCV 2017. https://arxiv.org/abs/1708.02002 4. Fan, W., Stolfo, S., Zhang, J., Chan, P. (1999). AdaCost: Misclassification Cost-Sensitive Boosting. Proceedings of ICML 1999. https://www.cs.columbia.edu/~sal/hpapers/adacost.pdf 5. Bahnsen, A. C., Aouada, D., Ottersten, B. (2015). Example-Dependent Cost-Sensitive Decision Trees. Expert Systems with Applications. https://albahnsen.github.io/files/Example-Dependent%20Cost-Sensitive%20Decision%20Trees.pdf 6. Provost, F., Fawcett, T. (2001). Robust Classification for Imprecise Environments. Machine Learning, 42. https://link.springer.com/article/10.1023/A:1007601015854 7. Niculescu-Mizil, A., Caruana, R. (2005). Predicting Good Probabilities with Supervised Learning. Proceedings of ICML 2005. https://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf 8. He, H., Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21(9). https://ieeexplore.ieee.org/document/5128907 9. Saito, T., Rehmsmeier, M. (2015). The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE, 10(3), e0118432. https://doi.org/10.1371/journal.pone.0118432