125 Classical Machine Learning Best Practices

Classical machine learning still powers a large share of production systems. Fraud scoring, churn prediction, demand forecasting, credit risk, and ranking pipelines very often run on logistic regression, gradient boosted trees, or support vector machines rather than deep networks. These methods are fast to train, cheap to serve, easy to monitor, and frequently competitive with far more complex alternatives on tabular data. This chapter is a practical playbook. It walks through the modeling workflow as a disciplined process, explains how to build a baseline that you can trust, shows how leakage sneaks into projects and how to stop it, treats error analysis as a first class activity, gives concrete guidance on choosing among linear models, trees, and kernels, and closes with the reproducibility habits that separate a one off notebook from a durable system.

125.1 1. Start With a Baseline You Can Defend

125.1.1 1.1 Why the Baseline Comes First

The single most common mistake in applied machine learning is to reach for a sophisticated model before establishing a reference point. Without a baseline you cannot answer the two questions that matter: is this model good enough to be worth deploying, and is the added complexity buying real accuracy? A baseline turns a vague modeling effort into a measurable one. It also exposes data problems early, because a baseline that scores suspiciously high or suspiciously low is usually telling you something about the data rather than the algorithm.

Build two kinds of baselines. The first is a trivial baseline that requires no learning at all. For classification, predict the majority class or the empirical class prior. For regression, predict the mean or median of the target. For a time series, predict the last observed value or a seasonal lag. These cost almost nothing and define the floor that any real model must clear.

The second is a simple learned baseline. For tabular data this is usually a logistic or linear regression with light preprocessing, or a single gradient boosted tree model with default hyperparameters. The point is not to win; it is to set an honest reference and to verify that the full pipeline runs end to end.

# Trivial baseline first, then a simple learned one
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

DummyClassifier(strategy="most_frequent")   # the floor
LogisticRegression(max_iter=1000)           # the honest reference
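As a minimal sketch of how the two baselines might be compared, the following scores both on the same stratified folds. The synthetic data from `make_classification`, its 90/10 class weights, and the choice of average precision as the metric are assumptions made for illustration, not part of any real project.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

baselines = {
    "trivial": DummyClassifier(strategy="prior"),      # predicts the class prior
    "logistic": LogisticRegression(max_iter=1000),     # simple learned reference
}
scores = {}
for name, est in baselines.items():
    fold_scores = cross_val_score(est, X, y, cv=cv, scoring="average_precision")
    scores[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name:>8}: {scores[name][0]:.3f} +/- {scores[name][1]:.3f}")
```

Any later model is then judged against both rows, on the same folds and the same metric.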

125.1.2 1.2 Pick the Metric Before You Model

Choose and write down your evaluation metric before training anything, because the metric defines what “better” means and quietly shapes every later decision. Match the metric to the decision the model supports. Accuracy is misleading under class imbalance, where a model that always predicts the majority class can look excellent while being useless. Prefer area under the precision recall curve when positives are rare, log loss or Brier score when you need calibrated probabilities, and a cost weighted metric when false positives and false negatives carry different business costs. For regression, decide deliberately between mean absolute error, which is robust to outliers, and root mean squared error, which punishes large misses. Define the metric once, and report it consistently from baseline through final model so that comparisons stay meaningful.

It helps to keep the core definitions explicit. From the confusion matrix counts of true positives ($TP$), false positives ($FP$), true negatives ($TN$), and false negatives ($FN$), the standard classification quantities are

\[ \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. \]

Precision answers “when the model says positive, how often is it right,” recall answers “of the true positives, how many did the model catch,” and $F_1$ is their harmonic mean, which stays low unless both are high. When probabilities matter rather than hard labels, score the predicted probability $\hat{p}_i$ against the label $y_i \in \{0, 1\}$ with log loss or the Brier score,

\[ \text{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \big[\, y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \,\big], \qquad \text{Brier} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2 . \]

Both are proper scoring rules: in expectation each is minimized by reporting the true probabilities, so both reward calibration. Log loss penalizes confident mistakes far more harshly because it is unbounded: for a positive example ($y_i = 1$), $-\log \hat{p}_i \to \infty$ as $\hat{p}_i \to 0$, whereas the per example Brier penalty never exceeds 1. For regression with residuals $r_i = y_i - \hat{y}_i$,

\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |r_i|, \qquad \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} r_i^2}. \]

Because RMSE squares the residuals before averaging, it is dominated by the largest misses and is the right choice when a few big errors are far more costly than many small ones. MAE penalizes each error in proportion to its size, so no single miss dominates, and it is the better summary when the target has heavy tailed noise or outliers you do not want to chase. The general rule is that the loss you optimize and the metric you report should encode the real cost of being wrong, and when those costs are asymmetric a cost weighted metric, $C = c_{FP} \cdot FP + c_{FN} \cdot FN$ with explicit per error costs, beats any symmetric default.
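The definitions above can be checked by hand on a toy example. The sketch below computes each quantity directly with NumPy; the arrays, the 0.5 threshold, and the costs $c_{FP} = 1$ and $c_{FN} = 5$ are invented for illustration.

```python
import numpy as np

# Toy labels and predicted probabilities (made up for this sketch).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p_hat = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3])
y_pred = (p_hat >= 0.5).astype(int)

tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

log_loss = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))
brier = np.mean((p_hat - y_true) ** 2)

# Asymmetric costs: a missed positive is assumed five times worse than a false alarm.
c_fp, c_fn = 1, 5
cost = c_fp * fp + c_fn * fn

# Regression metrics on a separate toy target with one large miss.
y = np.array([3.0, 5.0, 2.0, 10.0])
y_hat = np.array([2.0, 5.0, 4.0, 4.0])
r = y - y_hat
mae = np.mean(np.abs(r))
rmse = np.sqrt(np.mean(r ** 2))

print(precision, recall, f1, brier, cost, mae, rmse)
```

Here the single residual of 6 pulls RMSE (about 3.2) well above MAE (2.25), which is the sensitivity to large misses described above.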

125.2 2. The Modeling Workflow

The workflow is a loop with one strict rule: the final test set is touched exactly once, at the very end, and every other decision is made on training and validation data only. The diagram below shows the order and the protected boundary around the test set.

flowchart TD
    A["Frame the problem and define the target"] --> B["Split data by time or group"]
    B --> C["Hold out final test set untouched"]
    B --> D["Build pipeline: impute, scale, encode, model"]
    D --> E["Cross validate on training data"]
    E --> F["Tune hyperparameters inside the loop"]
    F --> G["Error analysis and slicing"]
    G --> H{"Good enough"}
    H -->|"No"| D
    H -->|"Yes"| I["Evaluate once on test set"]
    I --> J["Reproducibility: pin, log, ship"]

125.2.1 2.1 Frame the Problem and Split the Data

Before any feature is engineered, state the prediction target precisely, the unit of prediction, and the moment in time at which a prediction must be made. This framing determines what information is legitimately available and therefore what counts as leakage later.

Split the data immediately, and split it the way the model will actually be used. A random split is fine for independent and identically distributed rows, but most real problems are not that clean. If the data has a time dimension, split by time so that training precedes validation, which precedes the test period. If rows cluster by entity, such as multiple transactions per customer, split by group so that no customer appears in both training and test. Carve out a final test set that you touch exactly once, at the very end. Everything else, including cross validation, happens inside the remaining data.
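A minimal sketch of the two non-random splits, using scikit-learn's `GroupShuffleSplit` and `TimeSeriesSplit`. The `customer_id` array is a hypothetical entity key, and the time split assumes the rows are already sorted by timestamp.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, TimeSeriesSplit

rng = np.random.default_rng(0)
n = 200
customer_id = rng.integers(0, 40, size=n)   # hypothetical entity key
X = rng.normal(size=(n, 3))

# Group split: every customer lands entirely in train or entirely in test.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, groups=customer_id))
shared = set(customer_id[train_idx]) & set(customer_id[test_idx])
print("customers on both sides:", len(shared))

# Time split: rows are assumed sorted by timestamp; each fold trains on the
# past and validates on the block that immediately follows it.
tscv = TimeSeriesSplit(n_splits=4)
time_folds = list(tscv.split(X))
for tr, va in time_folds:
    print(f"train rows 0..{tr.max()}, validate rows {va.min()}..{va.max()}")
```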

125.2.2 2.2 Build a Pipeline, Not a Pile of Steps

Preprocessing must live inside a single object that is fit on training folds only and applied to validation and test data. This is the difference between a method that generalizes and one that quietly cheats. Encapsulating imputation, scaling, encoding, and the estimator in one pipeline guarantees that every transformation learns its parameters from training data alone, and it makes cross validation correct by construction.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# num_cols and cat_cols are lists of numeric and categorical column names
pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer()), ("sc", StandardScaler())]), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
model = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])

125.2.3 2.3 Validate Honestly, Then Tune

Use cross validation to estimate performance and its variability, not just a single number. K fold cross validation with five or ten folds is the default for tabular data. Use stratified folds for classification to preserve class balance, grouped folds when rows cluster, and forward chaining splits for time series. Report the mean and the spread across folds, because a model with a high mean and a wide spread is fragile.

Tune hyperparameters inside this validation loop, never against the final test set. When the tuning search is large, prefer nested cross validation or a clean separation of a tuning set, otherwise the validation score becomes optimistic. Random search and Bayesian optimization usually find good configurations faster than exhaustive grid search. Resist the urge to tune endlessly. The largest gains almost always come from better features and cleaner data, not from the third decimal place of a learning rate.
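One way to keep tuning from contaminating the performance estimate is nested cross validation: an inner loop picks hyperparameters, an outer loop scores the whole tuning procedure. The sketch below assumes synthetic data, a logistic regression pipeline, and a log uniform search range for `C`; all three are illustrative choices.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=600, n_features=15, n_informative=5,
                           n_clusters_per_class=1, random_state=0)
pipe = Pipeline([("sc", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)   # tuning folds
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)   # scoring folds

# Random search over the regularization strength, run inside each outer fold.
search = RandomizedSearchCV(pipe, {"clf__C": loguniform(1e-3, 1e2)}, n_iter=10,
                            cv=inner, scoring="neg_log_loss", random_state=0)
outer_scores = cross_val_score(search, X, y, cv=outer, scoring="neg_log_loss")

# Report the mean and the spread across outer folds, not a single number.
print(f"log loss: {-outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

The outer score describes the procedure "tune, then fit", which is what will actually be rerun when the model is retrained.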

125.2.4 2.4 Iterate in Tight Loops

Treat modeling as a sequence of small, reversible experiments. Change one thing, measure against the same validation protocol, keep what helps, and discard what does not. Log every run. A modeling effort that drifts without a record of what was tried tends to rediscover its own dead ends.

125.3 3. Avoiding Data Leakage

125.3.1 3.1 What Leakage Is and Why It Is So Dangerous

Leakage occurs when information that would not be available at prediction time, or information from the validation and test sets, influences training. Its signature is a model that performs beautifully in offline evaluation and then collapses in production. Leakage is dangerous precisely because it hides inside good looking numbers, so the team ships a model that was never as good as it appeared.

A precise way to state the condition is in terms of the information set available at the prediction moment. Let $t_i$ be the timestamp at which a prediction for example $i$ must be made, and let $\mathcal{F}_{t_i}$ denote everything legitimately knowable at or before $t_i$. A feature vector $x_i = \phi(\mathcal{F}_{t_i})$ is leakage free when its construction function $\phi$ reads only from $\mathcal{F}_{t_i}$. Leakage is any dependence of $x_i$ on information outside $\mathcal{F}_{t_i}$, whether that is the future ($\mathcal{F}_{s}$ for some $s > t_i$), the label $y_i$ itself, or statistics estimated from validation and test rows. Kaufman and colleagues (Kaufman et al. 2012) formalize this as the requirement that every feature be a legitimate function of the information available before the target is determined, and they catalog how violations slip past standard evaluation.

125.3.2 3.2 The Common Leakage Patterns

Several patterns recur often enough to memorize. Preprocessing on the full dataset is the classic case: fitting a scaler, an imputer, or a feature selector on all rows before splitting lets statistics from the test set bleed into training. Target leakage happens when a feature is a proxy for the label or is computed using the label, such as a “number of late payments” field that is only populated after default is known. Temporal leakage uses future information to predict the past, for example aggregating a customer’s full year of activity to predict an event in the middle of that year. Group leakage places correlated rows from the same entity on both sides of the split. Duplicate rows that straddle the split inflate scores for the same reason.

125.3.3 3.3 How to Stop It

Fit every transformation inside the pipeline on training folds only, which the pipeline pattern in section 2.2 enforces automatically. Construct features using only data with a timestamp at or before the prediction moment, and prefer point in time joins for any aggregated feature. Split by group or by time whenever rows are not independent. Audit your top features after training: if one feature single handedly dominates and produces near perfect accuracy, treat it as a leakage suspect until proven innocent. A quick sanity check is to ask, for every feature, whether its value would truly be known at the instant the prediction is made.

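from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Leaky: scaler sees the whole dataset before the split
X_scaled = StandardScaler().fit_transform(X)   # do not do this

# Correct: scaling lives in the pipeline (model from section 2.2), fit on train folds only
cross_val_score(model, X_train, y_train, cv=5)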
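A point in time join can be sketched with pandas `merge_asof`, which matches each prediction row to the latest feature row at or before its timestamp. The tables, the column names (`predict_at`, `ts`, `spend_to_date`), and the values below are hypothetical.

```python
import pandas as pd

# Hypothetical tables: one row per prediction request, one row per transaction.
preds = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "predict_at": pd.to_datetime(["2024-03-01", "2024-06-01", "2024-03-01"]),
})
txns = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "ts": pd.to_datetime(["2024-01-10", "2024-04-15", "2024-07-01",
                          "2024-02-01", "2024-05-01"]),
    "amount": [10.0, 20.0, 40.0, 5.0, 7.0],
})

# Running total per customer, so each transaction row carries "spend so far".
txns = txns.sort_values("ts")
txns["spend_to_date"] = txns.groupby("customer_id")["amount"].cumsum()

# For each prediction, take the latest transaction at or before predict_at.
# Both frames must be sorted on their time keys for merge_asof.
features = pd.merge_asof(
    preds.sort_values("predict_at"), txns,
    left_on="predict_at", right_on="ts", by="customer_id", direction="backward",
)
print(features[["customer_id", "predict_at", "ts", "spend_to_date"]])
```

Customer 1's July transaction never reaches the June prediction, which is exactly the property a naive full year aggregate would violate.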

125.3.4 3.4 A Worked Example of Preprocessing Leakage

To see why fitting preprocessing before splitting inflates scores, consider feature selection by univariate correlation, a deceptively common mistake. Suppose the label $y$ is pure noise, independent of every one of $d$ candidate features, so no model can be expected to beat the majority class rate. If you select the $k$ features most correlated with $y$ using the entire dataset, then split, then evaluate, the selection step has already peeked at the test labels. With $d$ large and $k$ small, some features will correlate with $y$ by chance across the full sample, and because the same rows drive both selection and test scoring, that spurious correlation survives into the held out set. The reported accuracy rises well above the majority class rate even though no signal exists. The fix is structural rather than statistical: place the selector inside the pipeline so it is refit on the training folds alone, which the pattern in section 2.2 enforces automatically. The held out rows then have no influence on which features are chosen, and the score returns to the honest floor. The general lesson is that any step that estimates a parameter from data, including imputation means, scaling statistics, encoders, and feature selectors, must be fit inside the cross validation loop.
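The thought experiment is easy to run. The sketch below draws pure noise features and random labels, then compares selection on the full data against selection inside the pipeline. The sizes ($n = 100$, $d = 5000$, $k = 20$), the `f_classif` score, and the logistic regression classifier are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
n, d, k = 100, 5000, 20
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n)          # labels are independent of X
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: selection sees every label, including those later used for scoring.
X_leaky = SelectKBest(f_classif, k=k).fit_transform(X, y)
leaky_acc = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=cv).mean()

# Honest: the selector is refit on the training folds only.
honest = Pipeline([("sel", SelectKBest(f_classif, k=k)),
                   ("clf", LogisticRegression(max_iter=1000))])
honest_acc = cross_val_score(honest, X, y, cv=cv).mean()

print(f"leaky accuracy:  {leaky_acc:.2f}")
print(f"honest accuracy: {honest_acc:.2f}")
```

The only difference between the two runs is where the selector is fit, yet the leaky score sits well above chance while the honest one stays near it.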

125.4 4. Error Analysis

125.4.1 4.1 Look at the Mistakes, Not Just the Score

A single aggregate metric compresses thousands of decisions into one number and hides where the model fails. Error analysis is the practice of reopening that number and studying the individual mistakes. It is the highest leverage activity in applied machine learning, because it tells you what to fix next, whether the answer is a new feature, more data of a particular kind, a different model, or a relabeling effort.

Start by collecting the largest errors. For regression, sort by residual magnitude. For classification, sort by the confidence of wrong predictions, because a confident mistake is more informative than a borderline one. Read a few dozen of these cases by hand. Patterns emerge quickly: a particular category dominates the errors, a range of the target is systematically under predicted, or a cluster of inputs shares a missing field.
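For a binary classifier, ranking mistakes by confidence takes only a few lines. This is a sketch under assumed conventions: `p_hat` is the predicted probability of the positive class, the threshold is 0.5, and the helper name `most_confident_errors` is invented here.

```python
import numpy as np

def most_confident_errors(y_true, p_hat, top=20):
    """Return indices of misclassified rows, most confident mistakes first."""
    y_true = np.asarray(y_true)
    p_hat = np.asarray(p_hat, dtype=float)
    y_pred = (p_hat >= 0.5).astype(int)
    wrong = np.flatnonzero(y_pred != y_true)
    confidence = np.abs(p_hat[wrong] - 0.5)     # distance from the decision threshold
    order = np.argsort(-confidence)             # largest distance first
    return wrong[order][:top]

# Rows 0 to 3 are wrong; row 4 is a correct, confident prediction.
idx = most_confident_errors([1, 0, 1, 0, 1], [0.05, 0.60, 0.45, 0.80, 0.90])
print(idx)
```

The returned indices are the rows to pull up and read by hand, starting from the top.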

125.4.2 4.2 Slice the Performance

Aggregate metrics can hide serious failures in subpopulations. Compute your metric separately across meaningful slices: by region, by customer segment, by time period, by input source, by class. A model with strong overall accuracy may be failing badly on a slice that matters for fairness or for revenue. Slicing turns “the model is 92 percent accurate” into “the model is 96 percent accurate on returning customers and 71 percent on new ones,” which is a finding you can act on.
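Slicing is usually a one line group by once predictions sit next to the slice columns. The frame below is a made up example with a hypothetical `segment` column; the numbers are not from any real model.

```python
import pandas as pd

# Made-up predictions with a hypothetical slice column.
df = pd.DataFrame({
    "segment": ["returning"] * 5 + ["new"] * 5,
    "y_true":  [1, 0, 1, 0, 1, 1, 0, 1, 0, 1],
    "y_pred":  [1, 0, 1, 0, 1, 0, 0, 1, 1, 0],
})
df["correct"] = (df["y_true"] == df["y_pred"]).astype(int)

# Accuracy and row count per slice; always report n alongside the metric.
by_slice = df.groupby("segment")["correct"].agg(accuracy="mean", n="size")
overall = df["correct"].mean()

print(f"overall accuracy: {overall:.2f}")
print(by_slice)
```

Reporting the row count with each slice matters, because a dramatic number on a handful of rows is noise rather than a finding.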

125.4.3 4.3 Diagnose Bias Versus Variance

Use the gap between training and validation performance to decide your next move. When training and validation scores are both poor and close together, the model is underfitting, and the cure is a more expressive model, richer features, or less regularization. When training performance is high but validation lags well behind, the model is overfitting, and the cure is more data, stronger regularization, fewer features, or a simpler model. Learning curves, which plot performance against training set size, make this diagnosis visual: a wide persistent gap signals variance, while two low flat curves signal bias. This single distinction redirects more wasted effort than almost any other habit.

The distinction has a clean mathematical backbone. For squared error regression, suppose $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$. The expected test error of a model $\hat{f}_D$ trained on a random dataset $D$, at a fixed input $x$, decomposes into three nonnegative parts,

\[ \mathbb{E}_{D,\varepsilon}\!\left[(y - \hat{f}_D(x))^2\right] = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible}}, \]

where $f$ is the true function and $\sigma^2$ is the noise floor that no model can beat. Bias is the systematic error of the average model and typically falls as the model becomes more expressive. Variance is the sensitivity of the fit to the particular training sample and typically rises with expressiveness. Underfitting is the bias dominated regime, overfitting is the variance dominated regime, and the training to validation gap is a practical proxy for variance. Reading the gap this way tells you which term to attack: shrink bias with more capacity and richer features, or shrink variance with more data and stronger regularization. For a fixed model class, adding data lowers variance but does little to lower bias, which is exactly why two low flat learning curves mean more data will not help and only a more expressive model will.
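The decomposition can be estimated by simulation: refit the same model on many noisy datasets and measure how the predictions behave. The sketch below assumes a true function $f(x) = \sin(2\pi x)$, Gaussian noise with $\sigma = 0.3$, a fixed design of 30 inputs, and polynomial fits of varying degree; all of these are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)            # assumed true function

sigma, n_train, n_datasets = 0.3, 30, 500
x_train = np.linspace(0.0, 1.0, n_train)    # fixed design: only the noise is redrawn
x_grid = np.linspace(0.05, 0.95, 50)        # points where error is measured

def bias2_and_variance(degree):
    preds = np.empty((n_datasets, x_grid.size))
    for b in range(n_datasets):
        y = f(x_train) + rng.normal(0.0, sigma, n_train)
        fit = np.polynomial.Polynomial.fit(x_train, y, degree)
        preds[b] = fit(x_grid)
    mean_pred = preds.mean(axis=0)                   # the "average model"
    bias2 = np.mean((mean_pred - f(x_grid)) ** 2)    # averaged over the grid
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

results = {deg: bias2_and_variance(deg) for deg in (1, 3, 9)}
for deg, (b2, var) in results.items():
    print(f"degree {deg}: bias^2={b2:.4f}  variance={var:.4f}")
```

Under these assumptions the straight line is bias dominated and the degree 9 fit is variance dominated, matching the underfitting and overfitting regimes described above.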

125.5 5. Choosing the Model Family

125.5.1 5.1 Linear Models

Reach for linear and logistic regression when you want a strong, transparent, fast baseline, when the signal is roughly additive, and when interpretability or calibrated probabilities matter. Regularized variants are workhorses: ridge for correlated features, lasso when you want sparsity and implicit feature selection, and elastic net as a balance of the two. The three differ only in the penalty added to the fitting objective,

\[ \hat{\beta} = \arg\min_{\beta} \; \frac{1}{n} \sum_{i=1}^{n} L(y_i, x_i^\top \beta) \; + \; \lambda \Big[\, \alpha \lVert \beta \rVert_1 + (1 - \alpha)\, \tfrac{1}{2}\lVert \beta \rVert_2^2 \,\Big], \]

where $L$ is squared error for regression or logistic loss for classification, $\lambda \ge 0$ sets the overall penalty strength, and $\alpha \in [0, 1]$ mixes the two penalties. Ridge is the pure $\ell_2$ case ($\alpha = 0$), lasso the pure $\ell_1$ case ($\alpha = 1$), and elastic net any blend between. The geometry explains the behavior: the $\ell_1$ ball has corners on the axes, so its constraint surface meets the loss contours at points where some coefficients are exactly zero, which is why lasso performs selection, while the smooth $\ell_2$ ball merely shrinks correlated coefficients toward each other without zeroing them. Increasing $\lambda$ raises bias and lowers variance, so $\lambda$ is the single knob that walks the model along the bias variance tradeoff of section 4.3. Linear models shine in high dimensional sparse settings such as text with bag of words features, where they are hard to beat for the compute. Their weaknesses are real: they assume a linear decision boundary in the feature space, they need careful encoding and scaling, and they require you to engineer interactions and nonlinearities by hand.
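The sparsity claim is easy to verify empirically. In scikit-learn's `ElasticNet`, the argument `alpha` plays the role of $\lambda$ and `l1_ratio` the role of $\alpha$ in the objective above. The synthetic regression problem and the penalty strengths below are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# 50 features, only 5 of which carry signal (synthetic, for illustration).
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

models = {
    "ridge": Ridge(alpha=1.0),                           # pure l2 penalty
    "lasso": Lasso(alpha=1.0),                           # pure l1 penalty
    "elastic_net": ElasticNet(alpha=1.0, l1_ratio=0.5),  # blend of the two
}

# Count coefficients that are exactly zero after fitting.
n_zero = {name: int(np.sum(m.fit(X, y).coef_ == 0.0)) for name, m in models.items()}
print(n_zero)
```

Ridge shrinks every coefficient but leaves all of them nonzero, while the $\ell_1$ term drives many of the uninformative ones to exactly zero.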

125.5.2 5.2 Tree Ensembles

Gradient boosted trees, as implemented in XGBoost, LightGBM, and CatBoost, are the default first choice for tabular data and win a large fraction of practical problems. They capture nonlinearities and feature interactions automatically, they are invariant to monotonic transformations of the features, so no scaling is needed, they handle mixed numeric and categorical data gracefully, and they tolerate missing values. Random forests are a robust, lower maintenance alternative that is harder to overfit and needs less tuning, at some cost in peak accuracy. The trade offs to keep in mind are that boosted models need sensible tuning of tree depth, learning rate, and number of trees, that they can overfit if pushed too hard, and that they extrapolate poorly outside the range of the training data.

125.5.3 5.3 Kernel Methods

Support vector machines and kernel methods are most useful when you have a small to medium number of samples, a high or complex feature space, and a need for a flexible nonlinear boundary. A support vector machine with a radial basis function kernel can model intricate boundaries with strong margins and good generalization on clean data. The catch is cost: the kernel matrix has one entry per pair of rows, so training time grows at least quadratically with the number of samples and becomes impractical past roughly tens of thousands of rows. Kernel methods also demand careful scaling and tuning of the regularization and kernel parameters, and they do not produce calibrated probabilities without extra work. On large tabular datasets, tree ensembles usually dominate them.
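The "extra work" for scaling and calibration might look like the sketch below: the scaler and the SVM share one pipeline, and `CalibratedClassifierCV` fits a sigmoid (Platt) mapping on held out folds. The two moons data and the parameter values are assumptions for illustration.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic problem with a curved boundary.
X, y = make_moons(n_samples=600, noise=0.25, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, stratify=y,
                                          random_state=0)

# Scaling lives in the same pipeline as the SVM, so it is fit on training rows only.
svm = Pipeline([("sc", StandardScaler()),
                ("svc", SVC(kernel="rbf", C=1.0, gamma="scale"))])

# Sigmoid calibration learned on internal cross validation folds.
calibrated = CalibratedClassifierCV(svm, method="sigmoid", cv=5)
calibrated.fit(X_tr, y_tr)

proba = calibrated.predict_proba(X_va)
val_acc = calibrated.score(X_va, y_va)
print(f"validation accuracy: {val_acc:.3f}")
```

In practice `C` and `gamma` would be tuned inside the validation loop of section 2.3 rather than left at the values shown.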

125.5.4 5.4 A Decision Heuristic

For most tabular problems, the pragmatic order is simple. Start with a linear or logistic baseline to set the reference and check the pipeline. Move to gradient boosted trees as the likely production model. Consider kernel methods only when the dataset is modest in size and the boundary is genuinely nonlinear in a way trees handle awkwardly. Choose linear models when interpretability, latency, or calibrated probabilities are the binding constraint, even at a small accuracy cost. Always weigh accuracy against the operational cost of serving, monitoring, and explaining the model, because the most accurate model is not always the right one to deploy.

small data, nonlinear boundary        -> SVM / kernel
tabular, want best accuracy           -> gradient boosted trees
need interpretability or calibration  -> regularized linear
high dimensional sparse text          -> linear (lasso / elastic net)

125.6 6. Reproducibility

125.6.1 6.1 Pin Everything

A result you cannot reproduce is a result you cannot trust or improve. Set and record random seeds for every source of randomness, including data splitting, model initialization, and any sampling. Pin library versions in a lockfile, because a silent upgrade of a numerical library can shift results. Capture the exact data snapshot used, ideally by versioning the dataset or recording a content hash, since data drifts even when code does not.
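A minimal sketch of what pinning can look like in code: seed the random number generators, hash the data, and record library versions alongside the result. The helper names `set_seeds`, `content_hash`, and `run_manifest` are invented for this example, and the inline bytes stand in for a real data file.

```python
import hashlib
import json
import platform
import random

import numpy as np

def set_seeds(seed):
    """Seed the stdlib and NumPy global generators."""
    random.seed(seed)
    np.random.seed(seed)

def content_hash(data: bytes) -> str:
    """SHA-256 hex digest identifying an exact data snapshot."""
    return hashlib.sha256(data).hexdigest()

def run_manifest(seed, data_bytes):
    """Collect what is needed to regenerate a run."""
    return {
        "seed": seed,
        "python": platform.python_version(),
        "numpy": np.__version__,
        "data_sha256": content_hash(data_bytes),
    }

set_seeds(42)
# Inline bytes stand in for the contents of a real dataset file.
manifest = run_manifest(42, b"id,amount\n1,10.0\n2,12.5\n")
print(json.dumps(manifest, indent=2))
```

Estimators and splitters that accept a `random_state` argument should receive the same recorded seed explicitly rather than rely on global state.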

125.6.2 6.2 Separate Configuration From Code

Keep hyperparameters, file paths, feature lists, and split definitions in configuration files rather than scattered through the code. This makes an experiment a single artifact that can be rerun, compared, and shared. Configuration as data also makes it far easier to sweep settings and to diff two runs.

125.6.3 6.3 Track Experiments

Log each run with its configuration, its metrics across folds, the data version, and the resulting artifact. A lightweight experiment tracker, or even a disciplined append only log, prevents the common failure where the best model exists only as a forgotten notebook cell. The goal is that any past result can be located, explained, and regenerated.
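The "disciplined append only log" can be as small as a JSON Lines file. The sketch below is one possible shape; the field names, the `log_run` and `load_runs` helpers, and the scores are all hypothetical.

```python
import json
import tempfile
import time
from pathlib import Path

def log_run(log_path, config, fold_scores, data_version):
    """Append one run as a single JSON line and return the record."""
    record = {
        "timestamp": time.time(),
        "config": config,
        "fold_scores": list(fold_scores),
        "mean_score": sum(fold_scores) / len(fold_scores),
        "data_version": data_version,
    }
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record

def load_runs(log_path):
    """Read every logged run back as a list of dicts."""
    with Path(log_path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

# Two made-up runs written to a temporary file.
log_file = Path(tempfile.mkdtemp()) / "runs.jsonl"
log_run(log_file, {"model": "logreg", "C": 1.0}, [0.81, 0.79, 0.83], "v1")
log_run(log_file, {"model": "logreg", "C": 0.1}, [0.80, 0.80, 0.82], "v1")

best = max(load_runs(log_file), key=lambda r: r["mean_score"])
print(best["config"], round(best["mean_score"], 3))
```

Because records are only ever appended, the file doubles as a history of what was tried, including the runs that did not help.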

125.6.4 6.4 Make Training and Serving Match

Leakage and silent failures often hide in the gap between how features are computed during training and how they are computed in production. Share the feature code between both paths, or use a feature store that guarantees identical logic. Validate at serving time that incoming data matches the schema and distribution the model was trained on, and monitor for drift after deployment so that a degrading model is caught before it does damage.
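A serving time schema check might be sketched as follows. It compares column names and dtypes against what was recorded at training time; distribution and drift checks are deliberately left out. The `validate_schema` helper and the example frames are hypothetical.

```python
import pandas as pd

def validate_schema(df, expected_dtypes):
    """Return a list of problems; an empty list means the frame matches."""
    problems = []
    for col, dtype in expected_dtypes.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    extra = set(df.columns) - set(expected_dtypes)
    problems.extend(f"unexpected column: {col}" for col in sorted(extra))
    return problems

# Schema captured from the training frame and stored with the model artifact.
train = pd.DataFrame({"amount": [10.0, 12.5], "country": ["DE", "FR"]})
expected = {col: str(dtype) for col, dtype in train.dtypes.items()}

good = pd.DataFrame({"amount": [9.0], "country": ["IT"]})
bad = pd.DataFrame({"amount": ["9.0"], "device": ["ios"]})   # wrong type, wrong columns

print(validate_schema(good, expected))
print(validate_schema(bad, expected))
```

Rejecting or flagging a request at this point is far cheaper than discovering weeks later that a numeric field had been arriving as text.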

125.7 7. Putting It Together

The thread running through this chapter is discipline over cleverness. Establish a baseline so you know what better means. Wrap preprocessing and modeling in a pipeline so that validation is honest and leakage has nowhere to hide. Validate with a splitting scheme that mirrors how the model will be used. Spend real time reading the model’s mistakes, because that is where the next improvement lives. Choose the model family to fit the data and the operational constraints rather than the hype, defaulting to linear baselines and tree ensembles for tabular work. Finally, make every result reproducible, because a model that cannot be regenerated cannot be maintained. None of these practices is glamorous, but together they are what reliably turns data into a system that works in production and keeps working.

125.8 References

Pedregosa, F. et al. “Scikit-learn: Machine Learning in Python.” Journal of Machine Learning Research, 2011. https://jmlr.org/papers/v12/pedregosa11a.html
Scikit-learn developers. “Common pitfalls and recommended practices.” https://scikit-learn.org/stable/common_pitfalls.html
Scikit-learn developers. “Cross-validation: evaluating estimator performance.” https://scikit-learn.org/stable/modules/cross_validation.html
Kaufman, S., Rosset, S., Perlich, C. “Leakage in Data Mining: Formulation, Detection, and Avoidance.” ACM Transactions on Knowledge Discovery from Data, 2012. https://dl.acm.org/doi/10.1145/2382577.2382579
Chen, T., Guestrin, C. “XGBoost: A Scalable Tree Boosting System.” KDD 2016. https://arxiv.org/abs/1603.02754
Ke, G. et al. “LightGBM: A Highly Efficient Gradient Boosting Decision Tree.” NeurIPS 2017. https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree
Prokhorenkova, L. et al. “CatBoost: unbiased boosting with categorical features.” NeurIPS 2018. https://arxiv.org/abs/1706.09516
Breiman, L. “Random Forests.” Machine Learning, 2001. https://link.springer.com/article/10.1023/A:1010933404324
Cortes, C., Vapnik, V. “Support-Vector Networks.” Machine Learning, 1995. https://link.springer.com/article/10.1007/BF00994018
Ng, A. “Machine Learning Yearning.” https://info.deeplearning.ai/machine-learning-yearning-book
Grinsztajn, L., Oyallon, E., Varoquaux, G. “Why do tree-based models still outperform deep learning on tabular data?” NeurIPS 2022. https://arxiv.org/abs/2207.08815
Sculley, D. et al. “Hidden Technical Debt in Machine Learning Systems.” NeurIPS 2015. https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems

# Classical Machine Learning Best Practices Classical machine learning still powers a large share of production systems. Fraud scoring, churn prediction, demand forecasting, credit risk, and ranking pipelines very often run on logistic regression, gradient boosted trees, or support vector machines rather than deep networks. These methods are fast to train, cheap to serve, easy to monitor, and frequently competitive with far more complex alternatives on tabular data. This chapter is a practical playbook. It walks through the modeling workflow as a disciplined process, explains how to build a baseline that you can trust, shows how leakage sneaks into projects and how to stop it, treats error analysis as a first class activity, gives concrete guidance on choosing among linear models, trees, and kernels, and closes with the reproducibility habits that separate a one off notebook from a durable system. ## 1. Start With a Baseline You Can Defend ### 1.1 Why the Baseline Comes First The single most common mistake in applied machine learning is to reach for a sophisticated model before establishing a reference point. Without a baseline you cannot answer the only question that matters: is this model good enough to be worth deploying, and is the added complexity buying real accuracy. A baseline turns a vague modeling effort into a measurable one. It also exposes data problems early, because a baseline that scores suspiciously high or suspiciously low is usually telling you something about the data rather than the algorithm. Build two kinds of baselines. The first is a trivial baseline that requires no learning at all. For classification, predict the majority class or the empirical class prior. For regression, predict the mean or median of the target. For a time series, predict the last observed value or a seasonal lag. These cost almost nothing and define the floor that any real model must clear. The second is a simple learned baseline. 
For tabular data this is usually a logistic or linear regression with light preprocessing, or a single gradient boosted tree model with default hyperparameters. The point is not to win, it is to set an honest reference and to verify that the full pipeline runs end to end. ```python # Trivial baseline first, then a simple learned one from sklearn.dummy import DummyClassifier from sklearn.linear_model import LogisticRegression DummyClassifier(strategy="most_frequent") # the floor LogisticRegression(max_iter=1000) # the honest reference ``` ### 1.2 Pick the Metric Before You Model Choose and write down your evaluation metric before training anything, because the metric defines what "better" means and quietly shapes every later decision. Match the metric to the decision the model supports. Accuracy is misleading under class imbalance, where a model that always predicts the majority class can look excellent while being useless. Prefer area under the precision recall curve when positives are rare, log loss or Brier score when you need calibrated probabilities, and a cost weighted metric when false positives and false negatives carry different business costs. For regression, decide deliberately between mean absolute error, which is robust to outliers, and root mean squared error, which punishes large misses. Define the metric once, and report it consistently from baseline through final model so that comparisons stay meaningful. It helps to keep the core definitions explicit. From the confusion matrix counts of true positives ($TP$), false positives ($FP$), true negatives ($TN$), and false negatives ($FN$), the standard classification quantities are $$ \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. 
$$ Precision answers "when the model says positive, how often is it right," recall answers "of the true positives, how many did the model catch," and $F_1$ is their harmonic mean, which stays low unless both are high. When probabilities matter rather than hard labels, score the predicted probability $\hat{p}_i$ against the label $y_i \in \{0, 1\}$ with log loss or the Brier score, $$ \text{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \big[\, y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \,\big], \qquad \text{Brier} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2 . $$ Both are minimized by reporting true probabilities, so both reward calibration, but log loss penalizes confident mistakes far more harshly because $-\log \hat{p}_i \to \infty$ as a confident prediction approaches certainty on the wrong class. For regression with residuals $r_i = y_i - \hat{y}_i$, $$ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |r_i|, \qquad \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} r_i^2}. $$ Because RMSE squares the residuals before averaging, it is dominated by the largest misses and is the right choice when a few big errors are far more costly than many small ones. MAE weights every error equally and is the better summary when the target has heavy tailed noise or outliers you do not want to chase. The general rule is that the loss you optimize and the metric you report should encode the real cost of being wrong, and when those costs are asymmetric a cost weighted metric, $C = c_{FP} \cdot FP + c_{FN} \cdot FN$ with explicit per error costs, beats any symmetric default. ## 2. The Modeling Workflow The workflow is a loop with one strict rule: the final test set is touched exactly once, at the very end, and every other decision is made on training and validation data only. The diagram below shows the order and the protected boundary around the test set. 
```{mermaid} flowchart TD A["Frame the problem and define the target"] --> B["Split data by time or group"] B --> C["Hold out final test set untouched"] B --> D["Build pipeline: impute, scale, encode, model"] D --> E["Cross validate on training data"] E --> F["Tune hyperparameters inside the loop"] F --> G["Error analysis and slicing"] G --> H{"Good enough"} H -->|"No"| D H -->|"Yes"| I["Evaluate once on test set"] I --> J["Reproducibility: pin, log, ship"] ``` ### 2.1 Frame the Problem and Split the Data Before any feature is engineered, state the prediction target precisely, the unit of prediction, and the moment in time at which a prediction must be made. This framing determines what information is legitimately available and therefore what counts as leakage later. Split the data immediately, and split it the way the model will actually be used. A random split is fine for independent and identically distributed rows, but most real problems are not that clean. If the data has a time dimension, split by time so that training precedes validation, which precedes the test period. If rows cluster by entity, such as multiple transactions per customer, split by group so that no customer appears in both training and test. Carve out a final test set that you touch exactly once, at the very end. Everything else, including cross validation, happens inside the remaining data. ### 2.2 Build a Pipeline, Not a Pile of Steps Preprocessing must live inside a single object that is fit on training folds only and applied to validation and test data. This is the difference between a method that generalizes and one that quietly cheats. Encapsulating imputation, scaling, encoding, and the estimator in one pipeline guarantees that every transformation learns its parameters from training data alone, and it makes cross validation correct by construction. 
```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# num_cols and cat_cols are lists of numeric and categorical column names
# that you define for your own dataset.
pre = ColumnTransformer([
    # Numeric columns: impute missing values (mean by default), then standardize
    ("num", Pipeline([("imp", SimpleImputer()), ("sc", StandardScaler())]), num_cols),
    # Categorical columns: one hot encode, ignoring categories unseen in training
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# Preprocessing and estimator are fit together, on training data only
model = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
```

### 2.3 Validate Honestly, Then Tune

Use cross validation to estimate performance and its variability, not just a single number. K fold cross validation with five or ten folds is the default for tabular data. Use stratified folds for classification to preserve class balance, grouped folds when rows cluster, and forward chaining splits for time series. Report the mean and the spread across folds, because a model with a high mean and a wide spread is fragile.

Tune hyperparameters inside this validation loop, never against the final test set. When the tuning search is large, prefer nested cross validation or a clean separation of a tuning set, otherwise the validation score becomes optimistic. Random search and Bayesian optimization usually find good configurations faster than exhaustive grid search. Resist the urge to tune endlessly. The largest gains almost always come from better features and cleaner data, not from the third decimal place of a learning rate.

### 2.4 Iterate in Tight Loops

Treat modeling as a sequence of small, reversible experiments. Change one thing, measure against the same validation protocol, keep what helps, and discard what does not. Log every run. A modeling effort that drifts without a record of what was tried tends to rediscover its own dead ends.

## 3. Avoiding Data Leakage

### 3.1 What Leakage Is and Why It Is So Dangerous

Leakage occurs when information that would not be available at prediction time, or information from the validation and test sets, influences training.
Its signature is a model that performs beautifully in offline evaluation and then collapses in production. Leakage is dangerous precisely because it hides inside good looking numbers, so the team ships a model that was never as good as it appeared.

A precise way to state the condition is in terms of the information set available at the prediction moment. Let $t_i$ be the timestamp at which a prediction for example $i$ must be made, and let $\mathcal{F}_{t_i}$ denote everything legitimately knowable at or before $t_i$. A feature vector $x_i = \phi(\mathcal{F}_{t_i})$ is leakage free when its construction function $\phi$ reads only from $\mathcal{F}_{t_i}$. Leakage is any dependence of $x_i$ on information outside $\mathcal{F}_{t_i}$, whether that is the future ($\mathcal{F}_{s}$ for some $s > t_i$), the label $y_i$ itself, or statistics estimated from validation and test rows. Kaufman and colleagues [@kaufman2012leakage] formalize this as the requirement that every feature be a legitimate function of the information available before the target is determined, and they catalog how violations slip past standard evaluation.

### 3.2 The Common Leakage Patterns

Several patterns recur often enough to memorize. Preprocessing on the full dataset is the classic case: fitting a scaler, an imputer, or a feature selector on all rows before splitting lets statistics from the test set bleed into training. Target leakage happens when a feature is a proxy for the label or is computed using the label, such as a "number of late payments" field that is only populated after default is known. Temporal leakage uses future information to predict the past, for example aggregating a customer's full year of activity to predict an event in the middle of that year. Group leakage places correlated rows from the same entity on both sides of the split. Duplicate rows that straddle the split inflate scores for the same reason.
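
Group leakage in particular can be checked mechanically. The sketch below is a minimal illustration, assuming scikit-learn's `GroupKFold` and a synthetic `customer_id` array invented for this example; the helper `shared_groups` is hypothetical as well. It confirms that a grouped splitter never places one customer on both sides of a fold, while a naive positional split does.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 12 transactions from 4 customers, 3 rows each.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = rng.integers(0, 2, size=12)
customer_id = np.repeat([101, 102, 103, 104], 3)


def shared_groups(train_idx, test_idx, groups):
    """Return the group ids that appear on both sides of a split."""
    return set(groups[train_idx]) & set(groups[test_idx])


# GroupKFold keeps every row of a customer inside a single fold.
splitter = GroupKFold(n_splits=4)
overlaps = [
    shared_groups(train_idx, test_idx, customer_id)
    for train_idx, test_idx in splitter.split(X, y, groups=customer_id)
]

# A naive positional split cuts through the rows of customer 102.
naive_train, naive_test = np.arange(0, 5), np.arange(5, 12)
naive_overlap = shared_groups(naive_train, naive_test, customer_id)

print("grouped overlaps:", overlaps)
print("naive overlap:", naive_overlap)
```

The same helper can be run as an assertion inside a training script, so that a split with shared entities fails loudly rather than silently inflating the validation score.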
### 3.3 How to Stop It

Fit every transformation inside the pipeline on training folds only, which the pipeline pattern in section 2.2 enforces automatically. Construct features using only data with a timestamp at or before the prediction moment, and prefer point in time joins for any aggregated feature. Split by group or by time whenever rows are not independent. Audit your top features after training: if one feature single handedly dominates and produces near perfect accuracy, treat it as a leakage suspect until proven innocent. A quick sanity check is to ask, for every feature, whether its value would truly be known at the instant the prediction is made.

```python
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Leaky: the scaler sees the whole dataset before the split
X_scaled = StandardScaler().fit_transform(X)  # do not do this

# Correct: scaling lives in the pipeline from section 2.2, fit on train folds only
scores = cross_val_score(model, X_train, y_train, cv=5)
```

### 3.4 A Worked Example of Preprocessing Leakage

To see why fitting preprocessing before splitting inflates scores, consider feature selection by univariate correlation, a deceptively common mistake. Suppose the label $y$ is pure noise, independent of every one of $d$ candidate features, so the true accuracy of any model is the class prior. If you select the $k$ features most correlated with $y$ using the entire dataset, then split, then evaluate, the selection step has already peeked at the test labels. With $d$ large and $k$ small, some features will correlate with $y$ by chance across the full sample, and because the same rows drive both selection and test scoring, that spurious correlation survives into the held out set. The reported accuracy rises well above the prior even though no signal exists. The fix is structural rather than statistical: place the selector inside the pipeline so it is refit on the training folds alone, which the pattern in section 2.2 enforces automatically.
The held out rows then have no influence on which features are chosen, and the score returns to the honest floor. The general lesson is that any step that estimates a parameter from data, including imputation means, scaling statistics, encoders, and feature selectors, must be fit inside the cross validation loop.

## 4. Error Analysis

### 4.1 Look at the Mistakes, Not Just the Score

A single aggregate metric compresses thousands of decisions into one number and hides where the model fails. Error analysis is the practice of reopening that number and studying the individual mistakes. It is the highest leverage activity in applied machine learning, because it tells you what to fix next, whether the answer is a new feature, more data of a particular kind, a different model, or a relabeling effort.

Start by collecting the largest errors. For regression, sort by residual magnitude. For classification, sort by the confidence of wrong predictions, because a confident mistake is more informative than a borderline one. Read a few dozen of these cases by hand. Patterns emerge quickly: a particular category dominates the errors, a range of the target is systematically under predicted, or a cluster of inputs shares a missing field.

### 4.2 Slice the Performance

Aggregate metrics can hide serious failures in subpopulations. Compute your metric separately across meaningful slices: by region, by customer segment, by time period, by input source, by class. A model with strong overall accuracy may be failing badly on a slice that matters for fairness or for revenue. Slicing turns "the model is 92 percent accurate" into "the model is 96 percent accurate on returning customers and 71 percent on new ones," which is a finding you can act on.

### 4.3 Diagnose Bias Versus Variance

Use the gap between training and validation performance to decide your next move.
When training and validation scores are both poor and close together, the model is underfitting, and the cure is a more expressive model, richer features, or less regularization. When training performance is high but validation lags well behind, the model is overfitting, and the cure is more data, stronger regularization, fewer features, or a simpler model. Learning curves, which plot performance against training set size, make this diagnosis visual: a wide persistent gap signals variance, while two low flat curves signal bias. This single distinction redirects more wasted effort than almost any other habit.

The distinction has a clean mathematical backbone. For squared error regression with $y = f(x) + \varepsilon$, where the noise $\varepsilon$ has mean zero and variance $\sigma^2$, the expected test error of a model trained on a random dataset $D$, at a fixed input $x$, decomposes into three nonnegative parts,

$$
\mathbb{E}_{D,\varepsilon}\!\left[(y - \hat{f}_D(x))^2\right] = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible}},
$$

where $f$ is the true function and $\sigma^2$ is the noise floor that no model can beat. Bias is the systematic error of the average model and typically falls as the model becomes more expressive. Variance is the sensitivity of the fit to the particular training sample and typically rises with expressiveness. Underfitting is the bias dominated regime, overfitting is the variance dominated regime, and the training to validation gap is a practical proxy for variance. Reading the gap this way tells you which term to attack: shrink bias with more capacity and richer features, or shrink variance with more data and stronger regularization. Adding data lowers variance but does not lower the bias of a fixed model class, which is exactly why two low flat learning curves mean more data will not help and only a more expressive model will.

## 5. Choosing the Model Family

### 5.1 Linear Models

Reach for linear and logistic regression when you want a strong, transparent, fast baseline, when the signal is roughly additive, and when interpretability or calibrated probabilities matter. Regularized variants are workhorses: ridge for correlated features, lasso when you want sparsity and implicit feature selection, and elastic net as a balance of the two. The three differ only in the penalty added to the fitting objective,

$$
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{n} \sum_{i=1}^{n} L(y_i, x_i^\top \beta) \; + \; \lambda \Big[\, \alpha \lVert \beta \rVert_1 + (1 - \alpha)\, \tfrac{1}{2}\lVert \beta \rVert_2^2 \,\Big],
$$

where $L$ is squared error for regression or logistic loss for classification, $\lambda \ge 0$ sets the overall penalty strength, and $\alpha \in [0, 1]$ mixes the two penalties. Ridge is the pure $\ell_2$ case ($\alpha = 0$), lasso the pure $\ell_1$ case ($\alpha = 1$), and elastic net any blend between. The geometry explains the behavior: the $\ell_1$ ball has corners on the axes, so its constraint surface meets the loss contours at points where some coefficients are exactly zero, which is why lasso performs selection, while the smooth $\ell_2$ ball merely shrinks correlated coefficients toward each other without zeroing them. Increasing $\lambda$ raises bias and lowers variance, so $\lambda$ is the single knob that walks the model along the bias variance tradeoff of section 4.3.

Linear models shine in high dimensional sparse settings such as text with bag of words features, where they are hard to beat for the compute. Their weaknesses are real: they assume a linear decision boundary in the feature space, they need careful encoding and scaling, and they require you to engineer interactions and nonlinearities by hand.
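
The selection behavior of the $\ell_1$ penalty is easy to see on synthetic data. The sketch below is a minimal illustration under stated assumptions: the data, the penalty strengths, and the helper `n_zero` are all invented for the example. Note the naming clash: scikit-learn's `ElasticNet` calls the overall strength `alpha` and the mixing weight `l1_ratio`, so its `alpha` plays the role of $\lambda$ in the objective above and its `l1_ratio` the role of $\alpha$, with an extra factor of one half on its squared error term.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic regression: only the first 3 of 20 features carry signal.
rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
true_beta = np.zeros(d)
true_beta[:3] = [3.0, -2.0, 1.5]
y = X @ true_beta + rng.normal(scale=0.5, size=n)

# Penalty strengths are illustrative, not tuned. Ridge scales its penalty
# differently from Lasso and ElasticNet, so the alphas are not comparable.
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)


def n_zero(model):
    """Count coefficients that are exactly zero."""
    return int(np.sum(model.coef_ == 0.0))


zeros = {"ridge": n_zero(ridge), "lasso": n_zero(lasso), "enet": n_zero(enet)}
print(zeros)
```

Ridge should leave every coefficient nonzero while lasso drives most of the noise features to exactly zero, with elastic net in between. In practice the penalty strength would be chosen by cross validation inside the pipeline rather than fixed by hand.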
### 5.2 Tree Ensembles

Gradient boosted trees, as implemented in XGBoost, LightGBM, and CatBoost, are the default first choice for tabular data and win a large fraction of practical problems. They capture nonlinearities and feature interactions automatically, they are invariant to monotonic transformations of individual features and so need no scaling, they handle mixed numeric and categorical data gracefully, and they tolerate missing values. Random forests are a robust, lower maintenance alternative that is harder to overfit and needs less tuning, at some cost in peak accuracy. The trade offs to keep in mind are that boosted models need sensible tuning of tree depth, learning rate, and number of trees, that they can overfit if pushed too hard, and that they extrapolate poorly outside the range of the training data.

### 5.3 Kernel Methods

Support vector machines and kernel methods are most useful when you have a small to medium number of samples, a high or complex feature space, and a need for a flexible nonlinear boundary. A support vector machine with a radial basis function kernel can model intricate boundaries with strong margins and good generalization on clean data. The catch is cost: kernel methods scale poorly past roughly tens of thousands of rows because the kernel matrix grows quadratically with the number of rows and training time grows at least that fast, they demand careful scaling and tuning of the regularization and kernel parameters, and they do not produce calibrated probabilities without extra work. On large tabular datasets, tree ensembles usually dominate them.

### 5.4 A Decision Heuristic

For most tabular problems, the pragmatic order is simple. Start with a linear or logistic baseline to set the reference and check the pipeline. Move to gradient boosted trees as the likely production model. Consider kernel methods only when the dataset is modest in size and the boundary is genuinely nonlinear in a way trees handle awkwardly.
Choose linear models when interpretability, latency, or calibrated probabilities are the binding constraint, even at a small accuracy cost. Always weigh accuracy against the operational cost of serving, monitoring, and explaining the model, because the most accurate model is not always the right one to deploy.

```text
small data, nonlinear boundary       -> SVM / kernel
tabular, want best accuracy          -> gradient boosted trees
need interpretability or calibration -> regularized linear
high dimensional sparse text         -> linear (lasso / elastic net)
```

## 6. Reproducibility

### 6.1 Pin Everything

A result you cannot reproduce is a result you cannot trust or improve. Set and record random seeds for every source of randomness, including data splitting, model initialization, and any sampling. Pin library versions in a lockfile, because a silent upgrade of a numerical library can shift results. Capture the exact data snapshot used, ideally by versioning the dataset or recording a content hash, since data drifts even when code does not.

### 6.2 Separate Configuration From Code

Keep hyperparameters, file paths, feature lists, and split definitions in configuration files rather than scattered through the code. This makes an experiment a single artifact that can be rerun, compared, and shared. Configuration as data also makes it far easier to sweep settings and to diff two runs.

### 6.3 Track Experiments

Log each run with its configuration, its metrics across folds, the data version, and the resulting artifact. A lightweight experiment tracker, or even a disciplined append only log, prevents the common failure where the best model exists only as a forgotten notebook cell. The goal is that any past result can be located, explained, and regenerated.

### 6.4 Make Training and Serving Match

Leakage and silent failures often hide in the gap between how features are computed during training and how they are computed in production.
Share the feature code between both paths, or use a feature store that guarantees identical logic. Validate at serving time that incoming data matches the schema and distribution the model was trained on, and monitor for drift after deployment so that a degrading model is caught before it does damage.

## 7. Putting It Together

The thread running through this chapter is discipline over cleverness. Establish a baseline so you know what better means. Wrap preprocessing and modeling in a pipeline so that validation is honest and leakage has nowhere to hide. Validate with a splitting scheme that mirrors how the model will be used. Spend real time reading the model's mistakes, because that is where the next improvement lives. Choose the model family to fit the data and the operational constraints rather than the hype, defaulting to linear baselines and tree ensembles for tabular work. Finally, make every result reproducible, because a model that cannot be regenerated cannot be maintained. None of these practices is glamorous, and together they are what reliably turns data into a system that works in production and keeps working.

## References

1. Pedregosa, F. et al. "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research, 2011. https://jmlr.org/papers/v12/pedregosa11a.html
2. Scikit-learn developers. "Common pitfalls and recommended practices." https://scikit-learn.org/stable/common_pitfalls.html
3. Scikit-learn developers. "Cross validation: evaluating estimator performance." https://scikit-learn.org/stable/modules/cross_validation.html
4. Kaufman, S., Rosset, S., Perlich, C. "Leakage in Data Mining: Formulation, Detection, and Avoidance." ACM Transactions on Knowledge Discovery from Data, 2012. https://dl.acm.org/doi/10.1145/2382577.2382579
5. Chen, T., Guestrin, C. "XGBoost: A Scalable Tree Boosting System." KDD 2016. https://arxiv.org/abs/1603.02754
6. Ke, G. et al. "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." NeurIPS 2017. https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree
7. Prokhorenkova, L. et al. "CatBoost: unbiased boosting with categorical features." NeurIPS 2018. https://arxiv.org/abs/1706.09516
8. Breiman, L. "Random Forests." Machine Learning, 2001. https://link.springer.com/article/10.1023/A:1010933404324
9. Cortes, C., Vapnik, V. "Support Vector Networks." Machine Learning, 1995. https://link.springer.com/article/10.1007/BF00994018
10. Ng, A. "Machine Learning Yearning." https://info.deeplearning.ai/machine-learning-yearning-book
11. Grinsztajn, L., Oyallon, E., Varoquaux, G. "Why do tree based models still outperform deep learning on tabular data?" NeurIPS 2022. https://arxiv.org/abs/2207.08815
12. Sculley, D. et al. "Hidden Technical Debt in Machine Learning Systems." NeurIPS 2015. https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems