66 Feature Engineering for Numerical Data

Numerical features are the workhorses of applied machine learning. They appear as measurements, counts, monetary amounts, ratios, durations, and aggregates, and they arrive with distributions that rarely match the assumptions baked into our models. Feature engineering for numerical data is the disciplined practice of reshaping these raw quantities so that learning algorithms can extract signal efficiently, optimize stably, and generalize beyond the training sample. This chapter develops the core techniques: scaling and standardization, nonlinear transforms, binning, interactions and polynomial expansions, and the treatment of skew and outliers. Throughout, we emphasize both the mathematics that explains why a method works and the engineering discipline that keeps it from leaking information or silently degrading in production.

A useful framing is to view every transform in this chapter as a deterministic map $g: \mathbb{R} \to \mathbb{R}$ (or $g: \mathbb{R}^p \to \mathbb{R}^q$ for interactions) parameterized by statistics $\theta$ estimated from data. Three properties of $g$ recur and are worth naming up front. A transform is monotone (strictly increasing, in the sense used here) if $x < x'$ implies $g(x) < g(x')$, which means it preserves the rank ordering of values. It is invertible if $g^{-1}$ exists, which matters when predictions must be mapped back to the original units. And it is affine if $g(x) = a x + b$, the special case that scaling occupies and that cannot change distributional shape. With this vocabulary, the chapter’s logic is compact: scaling supplies affine maps that fix geometry but not shape, nonlinear transforms supply monotone maps that fix shape, binning supplies many-to-one maps that trade resolution for robustness, and interactions supply multi-input maps that add expressive power.

The following diagram orients the rest of the chapter around a single question: what does the model need, and what does the feature look like?

flowchart TD
    A["Numerical feature"] --> B{"Model scale sensitive"}
    B -->|"No, tree ensemble"| C["Skip scaling, build interactions"]
    B -->|"Yes"| D{"Distribution shape"}
    D -->|"Skewed or heavy tail"| E["Nonlinear transform first"]
    D -->|"Roughly symmetric"| F["Scale directly"]
    E --> G{"Outliers present"}
    F --> G
    G -->|"Yes"| H["Robust scaler or clip"]
    G -->|"No"| I["Standard or min-max scaler"]

66.1 1. Why Numerical Features Need Engineering

66.1.1 1.1 Scale Sensitivity and the Geometry of Models

Many algorithms operate on geometric notions of distance, magnitude, or curvature, and these notions are not invariant to the units in which features are recorded. Consider $k$ nearest neighbors with Euclidean distance. If one feature is income measured in dollars (values in the tens of thousands) and another is age measured in years (values in the tens), the squared distance

\[ d(\mathbf{x}, \mathbf{x}')^2 = \sum_{j=1}^{p} (x_j - x'_j)^2 \]

is dominated almost entirely by income. Age contributes almost nothing to the neighbor ranking, even if it is the better predictor. The same pathology afflicts $k$ means clustering, support vector machines with radial basis kernels, and principal component analysis, where the directions of maximal variance can become an artifact of measurement units rather than of structure.

Gradient based learners suffer a related problem. When features differ in scale, the loss surface becomes elongated and the condition number of the Hessian grows large. Gradient descent then zigzags across narrow valleys, forcing tiny learning rates and slow convergence. To make this precise, consider least squares with design matrix $\mathbf{X}$. The Hessian of the loss is proportional to $\mathbf{X}^{\top}\mathbf{X}$, and gradient descent converges at a rate governed by the condition number $\kappa = \lambda_{\max} / \lambda_{\min}$, the ratio of the largest to smallest eigenvalue of that matrix. The number of iterations to reach a fixed accuracy grows roughly linearly in $\kappa$, and with the best fixed step size the error contraction per step behaves like $(\kappa - 1)/(\kappa + 1)$. A feature measured in dollars contributes a diagonal entry many orders of magnitude larger than one measured in fractions, inflating $\lambda_{\max}$, blowing up $\kappa$, and crippling convergence. Standardizing features so that each contributes comparably to curvature removes the part of $\kappa$ that is due to units alone and accelerates optimization; $\kappa$ reaches one only when the standardized features are also uncorrelated. This is why neural network training, logistic regression with regularization, and any L1 or L2 penalized model expect inputs on comparable scales.
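The conditioning argument can be checked numerically. The sketch below is illustrative only: the income and age grids are invented values and the helper name `gram_condition_number` is hypothetical. It compares the condition number of $\mathbf{X}^{\top}\mathbf{X}$ before and after standardizing the columns.

```python
import numpy as np

# Hypothetical toy design: every combination of four incomes and three ages.
incomes = np.array([20_000.0, 40_000.0, 60_000.0, 80_000.0])
ages = np.array([20.0, 40.0, 60.0])
income_col, age_col = np.meshgrid(incomes, ages, indexing="ij")
X = np.column_stack([income_col.ravel(), age_col.ravel()])

def gram_condition_number(X):
    """Ratio of the largest to the smallest eigenvalue of X^T X."""
    eigvals = np.linalg.eigvalsh(X.T @ X)  # returned in ascending order
    return eigvals[-1] / eigvals[0]

# Standardize each column with statistics from this (training) matrix.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

kappa_raw = gram_condition_number(X)
kappa_std = gram_condition_number(X_std)
print(f"kappa raw: {kappa_raw:.3g}, kappa standardized: {kappa_std:.3g}")
```

In this balanced grid the two columns are uncorrelated by construction, so the standardized condition number is essentially one. With correlated features, standardization still removes the unit-driven part of the problem but leaves $\kappa$ above one.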

There is a second, distinct reason that penalized models demand a common scale. A penalty such as $\lambda \sum_j \beta_j^2$ (ridge) or $\lambda \sum_j |\beta_j|$ (lasso) applies the same budget $\lambda$ to every coefficient. But the size of a coefficient is inversely proportional to the numeric scale of its feature: if income is rescaled from dollars to thousands of dollars, its values shrink by a factor of a thousand and its coefficient must grow by a factor of a thousand to represent the same effect. A shared penalty therefore punishes features recorded on small numeric scales, which need large coefficients, far more than features recorded on large numeric scales, an arbitrary bias driven entirely by measurement choice rather than predictive value. Standardization removes this artifact and makes the penalty comparable across coefficients.

66.1.2 1.2 Models That Do and Do Not Care

Not every model is scale sensitive. Decision trees and their ensembles (random forests, gradient boosted trees) split on thresholds of individual features and are invariant to any strictly increasing transformation of a single feature. Rescaling or applying a log to one input leaves the induced partition of the training data unchanged because the order of values is preserved. For these models, scaling is largely wasted effort, though a transform can still matter when it is applied before a feature is combined with others, for example in a ratio or product, because the combined feature is no longer a monotone function of any single input.
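The invariance claim is easy to verify on a toy problem. The following is a minimal sketch, assuming scikit-learn is available; the data and the helper `fit_predict` are invented for illustration. It fits the same shallow tree on a feature, on its logarithm, and on a rescaled copy, and compares the training predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one positive feature and a nonlinear target.
x = np.arange(1.0, 51.0)
y = np.sin(x / 5.0) + 0.01 * x

def fit_predict(feature):
    """Fit a shallow tree on a single feature and predict the training rows."""
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(feature.reshape(-1, 1), y)
    return tree.predict(feature.reshape(-1, 1))

pred_raw = fit_predict(x)
pred_log = fit_predict(np.log(x))       # strictly increasing transform
pred_scaled = fit_predict(x / 1000.0)   # pure rescaling

same_partition = bool(
    np.allclose(pred_raw, pred_log) and np.allclose(pred_raw, pred_scaled)
)
print("identical training predictions:", same_partition)
```

The split thresholds differ between the three fits, but the rows that fall on each side of every split are the same, so the fitted values agree.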

The practical lesson is to match preprocessing to the inductive bias of the model. Distance based, kernel based, and gradient based linear or neural models benefit from scaling and from distributional transforms. Tree ensembles benefit instead from feature construction that exposes nonlinear relationships and interactions the splits would otherwise have to approximate piecewise.

66.2 2. Scaling and Standardization

66.2.1 2.1 Standardization (Z-score)

Standardization centers a feature at zero mean and rescales it to unit variance:

\[ z = \frac{x - \mu}{\sigma}, \]

where $\mu$ and $\sigma$ are the mean and standard deviation estimated on the training set. The transformed feature has mean $0$ and variance $1$. Standardization is an affine map, so it preserves the rank ordering and the shape of the distribution exactly; a skewed feature remains just as skewed after standardizing, only recentered and rescaled. It is the default choice for regularized linear models and neural networks because it places all coefficients on a comparable footing, which makes a shared penalty strength meaningful, and because it improves the conditioning of the optimization. It does not bound the range of values and does not remove outliers, which remain present as large positive or negative $z$ scores. A practical note: estimate $\sigma$ with a numerically stable one-pass or two-pass algorithm and guard against $\sigma = 0$ for constant features, which would divide by zero. Mature libraries such as the open-source scikit-learn StandardScaler handle both concerns and expose the learned $\mu$ and $\sigma$ so the transform can be frozen and reapplied.

66.2.2 2.2 Min-Max and Range Scaling

Min-max scaling maps a feature linearly onto a fixed interval, usually $[0, 1]$:

\[ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}. \]

This preserves the shape of the distribution while bounding the range on the training data, which is useful when an algorithm expects inputs in a specific interval. The cost is extreme sensitivity to outliers: a single anomalous maximum compresses all other values into a tiny sub interval. Max-abs scaling, $x' = x / \max |x|$, is the sparsity preserving sibling: it maps to $[-1, 1]$ without shifting the origin, so zero entries in sparse data remain zero.

66.2.3 2.3 Robust Scaling

When outliers are present, robust scaling replaces the mean and standard deviation with order statistics that resist contamination:

\[ x' = \frac{x - \text{median}(x)}{\text{IQR}(x)}, \]

where $\text{IQR} = Q_3 - Q_1$ is the interquartile range. Because the median and quartiles are insensitive to a small fraction of extreme values, the bulk of the distribution is scaled sensibly even when heavy tails are present. The contrast with standardization is sharp in the language of robust statistics: the mean has a breakdown point of zero, meaning a single sufficiently large value can drag it arbitrarily far, while the median has a breakdown point of fifty percent, tolerating contamination of up to half the sample before it can be forced to an arbitrary value. A similar asymmetry holds for scale: the standard deviation also has a breakdown point of zero, while the IQR tolerates contamination of up to twenty-five percent of the sample. Robust scaling inherits the high breakdown point of its order statistics and is a strong default for messy real world numerical data.

# Conceptual: choose a scaler by data characteristics
# StandardScaler   -> roughly symmetric, light tails
# RobustScaler     -> outliers present, heavy tails
# MinMaxScaler     -> bounded range required, few outliers
# MaxAbsScaler     -> sparse features, preserve zeros

66.2.3.1 Worked example: one outlier, three scalers

Take the eleven values $\{10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1000\}$, where the final entry is an extreme outlier. The first ten values sit tightly in $[10, 19]$. We compare how each scaler maps this clean bulk.

For standardization, the outlier pulls the mean up to about $104$ and inflates the standard deviation to roughly $283$ (the population form that divides by $n$, which is what the scikit-learn StandardScaler uses). The clean values $10$ through $19$ are therefore mapped to $z$ scores between about $-0.33$ and $-0.30$, all crushed into a sliver near $-0.3$, while the outlier sits at about $+3.2$. The single anomaly has flattened the entire informative range.

For min-max scaling to $[0,1]$, the range is $1000 - 10 = 990$, so the clean values $10$ through $19$ map to $0$ through about $0.009$. They are compressed into the bottom one percent of the unit interval, and any downstream distance computation will treat them as effectively identical.

For robust scaling, the median is $15$ and the interquartile range (with the standard linear interpolation used by the scikit-learn RobustScaler) is $Q_3 - Q_1 = 17.5 - 12.5 = 5$. The clean values map to $\{-1.0, -0.8, \ldots, +0.8\}$, a sensible spread of order one, and the outlier lands far out at about $197$ where it belongs, without contaminating the encoding of everything else. This is the concrete payoff of a high breakdown point: the bulk of the data is scaled as if the outlier were not there.
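The three calculations above can be reproduced directly with scikit-learn. This is a minimal sketch using the default settings of each scaler; the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

values = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1000], dtype=float)
X = values.reshape(-1, 1)  # scikit-learn expects a 2-D array

z = StandardScaler().fit_transform(X).ravel()
mm = MinMaxScaler().fit_transform(X).ravel()
rb = RobustScaler().fit_transform(X).ravel()

# Compare where the clean bulk and the outlier land under each scaler.
for name, out in [("standard", z), ("min-max", mm), ("robust", rb)]:
    bulk, outlier = out[:-1], out[-1]
    print(f"{name:9s} bulk in [{bulk.min():.3f}, {bulk.max():.3f}], "
          f"outlier at {outlier:.3f}")
```

Only the robust scaler spreads the ten clean values over an interval of order one; the other two squeeze them into a range narrower than $0.04$.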

66.2.4 2.4 Fit on Train, Transform Everywhere

The single most important operational rule is that scaling parameters must be estimated only on training data and then applied unchanged to validation, test, and production inputs. Estimating $\mu$, $\sigma$, the median, or the min and max using the full dataset leaks information about the held out distribution into the model and inflates measured performance. The correct pattern is to fit the transformer within a pipeline and let cross validation refit it on each training fold.

# Correct: parameters learned inside the resampling loop
pipeline = make_pipeline(RobustScaler(), Ridge(alpha=1.0))
scores = cross_val_score(pipeline, X, y, cv=5)

A subtle corollary concerns drift. Because the scaler freezes training statistics, a shift in the production distribution will push transformed values outside their training range. This is not a bug in the scaler; it is a signal worth monitoring, since it indicates the model is now extrapolating.

66.3 3. Nonlinear Transforms

Scaling is linear and cannot change the shape of a distribution. When a feature is strongly skewed or when its relationship with the target is multiplicative rather than additive, a nonlinear transform is the right tool.

66.3.1 3.1 The Log Transform

The logarithm compresses large values and expands small ones, which tames right skew and converts multiplicative structure into additive structure. If a target depends on a feature through a power law, $y \approx a x^b$, then $\log y \approx \log a + b \log x$ is linear and learnable by a linear model. Log transforms are natural for quantities that are inherently positive and span several orders of magnitude, such as income, prices, counts, populations, and durations.

The bare logarithm is undefined at zero and at negative values, so a common practice for nonnegative data is the log1p transform,

\[ x' = \log(1 + x), \]

which is defined at $x = 0$ and behaves like $x$ near the origin, since $\log(1 + x) = x - x^2/2 + \cdots$ by Taylor expansion. For data with a natural offset $c$, use $\log(x + c)$ with $c$ chosen to keep the argument positive. The log is strictly monotone on its domain, so it never reorders values, and it is invertible through $\exp$, which is what makes back-transformation possible.

The bias of back-transformation deserves care. By Jensen’s inequality, because $\exp$ is convex, $\mathbb{E}[\exp(\log Y)] \geq \exp(\mathbb{E}[\log Y])$, with equality only for a degenerate distribution. Concretely, predicting the mean of $\log y$ and exponentiating recovers the geometric mean of $y$, which is biased low for the arithmetic mean. When the original scale matters, apply a correction such as Duan’s smearing estimator, which multiplies the naive prediction by the sample mean of $\exp(\hat{r}_i)$ over the residuals $\hat{r}_i$ on the log scale. It requires no normality assumption, although it does assume that the log-scale errors are identically distributed across observations.
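A short sketch of the smearing correction follows. The data are simulated under an assumed log-linear model with invented parameters, and plain least squares on the log scale stands in for whatever model is actually used; nothing here is specific to a particular library beyond NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a multiplicative (log-linear) relationship with noise.
x = rng.uniform(1.0, 10.0, size=500)
y = 3.0 * x**1.5 * np.exp(rng.normal(0.0, 0.5, size=500))

# Ordinary least squares on the log scale: log y = a + b * log x.
A = np.column_stack([np.ones_like(x), np.log(x)])
coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
log_pred = A @ coef
residuals = np.log(y) - log_pred

naive = np.exp(log_pred)              # geometric-mean scale, biased low
smear = np.mean(np.exp(residuals))    # Duan's smearing factor
corrected = naive * smear             # estimate of the arithmetic mean

print(f"smearing factor: {smear:.3f}")
print(f"mean y: {y.mean():.2f}, naive: {naive.mean():.2f}, "
      f"corrected: {corrected.mean():.2f}")
```

Because least squares residuals with an intercept average to zero, Jensen's inequality guarantees a smearing factor of at least one, so the correction always moves the naive prediction upward.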

66.3.2 3.2 Box-Cox

The Box-Cox family generalizes the log into a continuum of power transforms indexed by a parameter $\lambda$:

\[ x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\\[2mm] \log x, & \lambda = 0. \end{cases} \]

The form is constructed so that the transform is continuous in $\lambda$ at zero, where it reduces smoothly to the log. To see this, write $x^{\lambda} = \exp(\lambda \log x) = 1 + \lambda \log x + O(\lambda^2)$, so $(x^{\lambda} - 1)/\lambda \to \log x$ as $\lambda \to 0$, which is exactly why the $\lambda = 0$ case is defined to be the log. Familiar special cases line up along the continuum: $\lambda = 1$ is an affine shift that leaves the shape unchanged, $\lambda = 0.5$ is a square-root transform appropriate for count data, and $\lambda = 0$ is the log.

The parameter is chosen by maximum likelihood to make the transformed feature as close to Gaussian as possible. Treating the transformed values as normal with mean $\mu$ and variance $\sigma^2$, the profile log likelihood after concentrating out $\mu$ and $\sigma^2$ is

\[ \ell(\lambda) = -\frac{n}{2} \log \hat{\sigma}^2_\lambda + (\lambda - 1) \sum_{i=1}^{n} \log x_i, \]

where $\hat{\sigma}^2_\lambda$ is the variance of the transformed data and the final term is the log Jacobian that prevents $\lambda$ from running off to extremes simply by shrinking the scale. Maximizing $\ell$ over $\lambda$, typically by a one-dimensional search, gives the estimate. Box-Cox requires strictly positive inputs, so it cannot be applied directly to data containing zeros or negative values without an offset.
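The profile likelihood is simple enough to implement directly. The sketch below assumes SciPy is available and uses an invented sample; `box_cox` and `profile_loglik` are hypothetical helper names. It evaluates $\ell(\lambda)$ on a grid and compares the result with SciPy's own estimate.

```python
import numpy as np
from scipy import stats

def box_cox(x, lam):
    """Box-Cox transform for strictly positive x."""
    log_x = np.log(x)
    if abs(lam) < 1e-12:
        return log_x
    # Equals (x**lam - 1) / lam, written with expm1 for stability near zero.
    return np.expm1(lam * log_x) / lam

def profile_loglik(lam, x):
    """-(n/2) log(sigma_hat^2) + (lam - 1) * sum(log x), constants dropped."""
    z = box_cox(x, lam)
    return -0.5 * len(x) * np.log(z.var()) + (lam - 1.0) * np.log(x).sum()

# Hypothetical right-skewed, strictly positive sample.
x = np.array([1.0, 1.5, 2.0, 3.5, 5.0, 8.0, 13.0, 20.0, 34.0, 55.0])

# One-dimensional grid search over lambda.
grid = np.linspace(-2.0, 2.0, 401)
lam_hat = grid[np.argmax([profile_loglik(lam, x) for lam in grid])]

_, lam_scipy = stats.boxcox(x)  # SciPy's maximum likelihood estimate
print(f"grid search: {lam_hat:.2f}, scipy: {lam_scipy:.2f}")
```

A grid search is the crudest possible optimizer; it is used here only to keep the likelihood itself in view.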

66.3.3 3.3 Yeo-Johnson

Yeo-Johnson extends the power transform to the whole real line, removing the positivity restriction:

\[ x^{(\lambda)} = \begin{cases} \dfrac{(x+1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\ x \geq 0,\\[2mm] \log(x + 1), & \lambda = 0,\ x \geq 0,\\[2mm] \dfrac{-\left[(-x+1)^{2-\lambda} - 1\right]}{2 - \lambda}, & \lambda \neq 2,\ x < 0,\\[2mm] -\log(-x + 1), & \lambda = 2,\ x < 0. \end{cases} \]

It handles zeros and negatives gracefully and reduces to Box-Cox applied to $x + 1$ on the nonnegative side (a shifted log when $\lambda = 0$), which makes it the most general purpose of the parametric power transforms. As with Box-Cox, $\lambda$ is estimated by maximum likelihood under approximate normality. In modern pipelines, Yeo-Johnson applied as a power transformer is a reliable first attempt at symmetrizing arbitrary numerical features. As always, $\lambda$ is fit on training data only.
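The four branches translate line by line into code. This is a sketch for a fixed $\lambda$, with the hypothetical helper name `yeo_johnson`, checked against the SciPy function `scipy.stats.yeojohnson`; estimating $\lambda$ itself is left to a library.

```python
import numpy as np
from scipy import stats

def yeo_johnson(x, lam):
    """Yeo-Johnson transform of a 1-D array for a fixed lambda."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # Nonnegative branch: Box-Cox applied to x + 1.
    if abs(lam) < 1e-12:
        out[pos] = np.log1p(x[pos])
    else:
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    # Negative branch: mirrored Box-Cox with exponent 2 - lambda.
    if abs(lam - 2.0) < 1e-12:
        out[~pos] = -np.log1p(-x[~pos])
    else:
        out[~pos] = -((1.0 - x[~pos]) ** (2.0 - lam) - 1.0) / (2.0 - lam)
    return out

x = np.array([-5.0, -1.0, -0.2, 0.0, 0.3, 2.0, 10.0])
matches = []
for lam in (0.0, 0.5, 1.0, 2.0):
    ref = stats.yeojohnson(x, lmbda=lam)  # SciPy reference for a fixed lambda
    matches.append(bool(np.allclose(yeo_johnson(x, lam), ref)))
print("agrees with SciPy for each lambda:", matches)
```

Setting $\lambda = 1$ returns the input unchanged on both branches, which is a convenient sanity check for any implementation.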

66.3.4 3.4 Quantile and Rank Transforms

A fully nonparametric alternative maps a feature through its empirical cumulative distribution function. The quantile transform sends each value to its rank percentile and then optionally pushes that uniform variable through the inverse normal CDF to produce a Gaussian output:

\[ x' = \Phi^{-1}\big(\hat{F}(x)\big). \]

Here $\hat{F}$ is the empirical CDF estimated on the training set and $\Phi^{-1}$ is the standard normal quantile function. The construction rests on the probability integral transform: if $X$ has continuous CDF $F$, then $F(X)$ is exactly uniform on $[0, 1]$, and passing a uniform variable through $\Phi^{-1}$ yields a standard normal. The quantile transform applies this identity empirically, so it can coerce essentially any continuous shape onto the target. It is also extremely robust to outliers, since extreme values are mapped to extreme but finite quantiles rather than to extreme magnitudes. The costs are threefold. It discards information about magnitude and the spacing between values, keeping only their order. It can overfit the empirical CDF when the sample is small, since each training point defines a step. And at inference time it must interpolate for values that fall between observed training quantiles and typically clips values that fall beyond them to the nearest training extreme. Rank based transforms are valuable when the precise scale is meaningless but the ordering is informative. The open-source scikit-learn QuantileTransformer implements this map and lets you choose a uniform or normal output distribution.
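To make the mechanics concrete, here is a deliberately simple rank-to-Gaussian sketch. It is not the algorithm used by the scikit-learn QuantileTransformer, which interpolates over a grid of quantiles; the helper names `fit_rank_gauss` and `rank_gauss` and the mid-rank convention are assumptions made for this illustration.

```python
import numpy as np
from scipy.stats import norm

def fit_rank_gauss(train):
    """Store the sorted training values; they define the empirical CDF."""
    return np.sort(np.asarray(train, dtype=float))

def rank_gauss(x, sorted_train):
    """Map x through the training ECDF, then through the normal quantile function."""
    n = len(sorted_train)
    # Mid-rank ECDF: average of the counts strictly below and at-or-below x.
    left = np.searchsorted(sorted_train, x, side="left")
    right = np.searchsorted(sorted_train, x, side="right")
    u = (left + right) / (2.0 * n)
    # Keep u strictly inside (0, 1) so the output stays finite.
    u = np.clip(u, 0.5 / n, 1.0 - 0.5 / n)
    return norm.ppf(u)

train = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1000], dtype=float)
ecdf = fit_rank_gauss(train)
out = rank_gauss(train, ecdf)
unseen = rank_gauss(np.array([5.0, 1e9]), ecdf)  # beyond the training range

print(np.round(out, 2))
print(np.round(unseen, 2))
```

The outlier at $1000$ lands less than two units from the center, and an unseen value of $10^9$ is clipped to the same place, which shows both the robustness and the loss of magnitude information.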

66.4 4. Binning and Discretization

Binning replaces a continuous feature with a small set of intervals, converting a real value into an ordinal or categorical code. Discretization is a deliberate loss of resolution traded for robustness, nonlinearity, and interpretability.

66.4.1 4.1 Equal-Width and Equal-Frequency Binning

Equal-width binning partitions the observed range into $k$ intervals of identical width $(x_{\max} - x_{\min}) / k$. It is simple but allocates bins poorly when the distribution is skewed, leaving some bins nearly empty and others overcrowded. Equal-frequency (quantile) binning instead chooses cut points at sample quantiles so that each bin holds roughly the same number of observations. Quantile binning adapts to the distribution and is generally preferable for skewed data, though its boundaries are data dependent and must be frozen from the training set.
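The difference between the two schemes shows up immediately on a skewed feature. The following sketch uses an invented feature (powers of two) and plain NumPy; the variable names are illustrative.

```python
import numpy as np

# Hypothetical right-skewed feature: powers of two from 1 to 2048.
x = 2.0 ** np.arange(12)
k = 4

# Equal-width: interior cut points evenly spaced over the observed range.
width_edges = np.linspace(x.min(), x.max(), k + 1)[1:-1]
# Equal-frequency: interior cut points at sample quantiles.
freq_edges = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])

# digitize assigns bin i to values with edges[i-1] <= value < edges[i].
width_counts = np.bincount(np.digitize(x, width_edges), minlength=k)
freq_counts = np.bincount(np.digitize(x, freq_edges), minlength=k)

print("equal-width counts:    ", width_counts)
print("equal-frequency counts:", freq_counts)
```

Equal-width binning puts ten of the twelve values in the first bin and leaves one bin empty, while the quantile edges give three values per bin. In a real pipeline the edges would be computed on the training set and stored.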

66.4.2 4.2 Supervised and Model-Based Binning

Unsupervised binning ignores the target. Supervised discretization chooses cut points to maximize a relationship with the outcome, for example by minimizing impurity the way a decision tree does when it selects a split. Tree based binning produces intervals that are predictive by construction and is essentially equivalent to fitting a shallow tree on a single feature. A related industrial technique is weight of evidence binning used in credit scoring, where each bin is encoded by

\[ \text{WoE} = \log\frac{P(x \in \text{bin} \mid y = 1)}{P(x \in \text{bin} \mid y = 0)}, \]

which linearizes the relationship between the feature and the log odds of the target and yields a monotone, interpretable encoding. The connection to a logistic model is direct: under a logistic regression, the log odds of $y = 1$ are additive in the features, and the weight of evidence of a bin is precisely the contribution that bin makes to those log odds relative to the population base rate. As a small illustration, suppose a bin contains $40$ of the $200$ positives and $10$ of the $800$ negatives in the training data. Then the bin’s share of positives is $0.20$ and its share of negatives is $0.0125$, so $\text{WoE} = \log(0.20 / 0.0125) = \log 16 \approx 2.77$, a strongly positive value flagging this interval as heavily enriched for the positive class. Replacing the raw feature within the bin by this scalar gives a model that is both monotone in risk and auditable, which is why the technique remains standard in regulated credit scoring.
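The weight of evidence calculation is a one-liner once the per-bin class counts are available. In the sketch below the first bin reproduces the numbers from the illustration above; the other two bins and the function name `weight_of_evidence` are made up so that the totals come to $200$ positives and $800$ negatives.

```python
import numpy as np

def weight_of_evidence(pos_counts, neg_counts):
    """WoE per bin: log of (bin's share of positives / bin's share of negatives)."""
    pos = np.asarray(pos_counts, dtype=float)
    neg = np.asarray(neg_counts, dtype=float)
    return np.log((pos / pos.sum()) / (neg / neg.sum()))

# Hypothetical three-bin table; only the first bin comes from the text.
pos_counts = [40, 60, 100]
neg_counts = [10, 290, 500]

woe = weight_of_evidence(pos_counts, neg_counts)
print(np.round(woe, 3))
```

A bin with zero positives or zero negatives would produce an infinite value, so practical implementations usually merge sparse bins or add a small smoothing count before taking the log.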

66.4.3 4.3 When Binning Helps and When It Hurts

Binning lets a linear model capture nonlinear and non monotone effects, since each bin can receive its own coefficient once expanded into indicator variables. It is robust to outliers because extreme values fall harmlessly into edge bins, and it can improve interpretability for stakeholders who reason in terms of ranges. The drawbacks are real: discretization discards within bin variation, introduces arbitrary boundary discontinuities where similar values land in different bins, and reduces statistical power if applied too aggressively. For tree ensembles, which already partition the space, manual binning is usually redundant. Reserve binning for linear and additive models where its expressive payoff is highest.

66.5 5. Interactions and Polynomial Features

66.5.1 5.1 Why Interactions Matter

A linear model assumes the effect of each feature on the target is additive and independent of every other feature. Many real relationships violate this. The effect of a drug dose may depend on body weight; the value of a house feature may depend on neighborhood. An interaction term encodes such dependence as a product:

\[ \hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2. \]

The coefficient $\beta_{12}$ measures how the slope on $x_1$ changes as $x_2$ varies, since the partial derivative $\partial \hat{y} / \partial x_1 = \beta_1 + \beta_{12} x_2$ now depends on $x_2$. Without the product term, that derivative is the constant $\beta_1$ and the model cannot represent any conditional structure. Constructing interaction features by hand, guided by domain knowledge, is often the highest leverage step in feature engineering for linear models.

Two cautions accompany interaction terms. First, the product $x_1 x_2$ is usually correlated with its parents $x_1$ and $x_2$, which inflates multicollinearity; centering each feature before forming the product reduces this correlation substantially and makes the main-effect coefficients interpretable at the mean. Second, an interaction term is not invariant to shifting the origin of its inputs, so the decision to center is a modeling choice with real consequences rather than a cosmetic one.
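The effect of centering on collinearity can be seen in a small constructed example. The grid below is an assumption chosen so that the arithmetic is exact; real features will behave less cleanly.

```python
import numpy as np

# Hypothetical balanced grid: every pairing of x1 and x2 in 1..5.
g1, g2 = np.meshgrid(np.arange(1.0, 6.0), np.arange(1.0, 6.0), indexing="ij")
x1, x2 = g1.ravel(), g2.ravel()

def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

raw_product = x1 * x2
centered_product = (x1 - x1.mean()) * (x2 - x2.mean())

corr_raw = corr(x1, raw_product)
corr_centered = corr(x1, centered_product)
print(f"corr(x1, product) raw:      {corr_raw:.3f}")
print(f"corr(x1, product) centered: {abs(corr_centered):.3f}")
```

On this symmetric grid the correlation drops from about $0.67$ to zero. With skewed or correlated inputs, centering usually reduces the correlation without eliminating it.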

66.5.2 5.2 Polynomial Expansions

Polynomial features generalize interactions to include powers of individual features. A degree two expansion of inputs $x_1, x_2$ produces

\[ \{1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2\}. \]

This lets a linear model fit curved response surfaces while remaining linear in its parameters, so it is still solved by ordinary least squares or its regularized variants. The danger is combinatorial explosion: for $p$ inputs, a degree $d$ expansion generates on the order of $\binom{p + d}{d}$ terms, which grows rapidly and invites overfitting and multicollinearity. Polynomial features are best used at low degree, on a curated subset of inputs, and almost always paired with regularization to control the inflated parameter space.

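# Degree-2 interaction-only features keep the count manageable
# (X is assumed to be an existing, already scaled training matrix)
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, interaction_only=True,
                          include_bias=False)
X_inter = poly.fit_transform(X)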
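The growth in the number of generated terms is worth computing before committing to an expansion. This sketch compares the binomial count for a full expansion (bias column included) with what scikit-learn reports; `n_poly_terms` is a hypothetical helper.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_terms(p, d):
    """Number of monomials of total degree <= d in p inputs, bias included."""
    return comb(p + d, d)

for p, d in [(2, 2), (10, 2), (10, 3), (100, 3)]:
    # Fitting on a single dummy row is enough to learn the output width.
    poly_full = PolynomialFeatures(degree=d).fit(np.zeros((1, p)))
    print(f"p={p:3d} d={d}: formula {n_poly_terms(p, d):6d}, "
          f"sklearn {poly_full.n_output_features_:6d}")
```

Ten inputs at degree three already produce several hundred columns, and one hundred inputs produce well over a hundred thousand, which is why the expansion is usually restricted to a curated subset.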

66.5.3 5.3 Practical Guidance

Standardize before generating polynomial terms, because raising large valued features to powers produces enormous numbers that wreck conditioning. Prefer interaction_only expansions when you suspect cross effects but not curvature, since this avoids the squared terms and keeps the feature count lower. Let regularization, especially L1, prune the expanded set rather than trusting every generated term. Finally, recall that gradient boosted trees discover interactions automatically through sequential splits, so explicit polynomial construction yields the largest gains for linear models that do not already rely on a kernel to supply nonlinearity.

66.6 6. Handling Skew and Outliers

66.6.1 6.1 Diagnosing Skew

Skewness measures asymmetry of a distribution. The standardized third moment,

\[ \gamma_1 = \mathbb{E}\!\left[\left(\frac{x - \mu}{\sigma}\right)^3\right], \]

is positive for a right tail and negative for a left tail. Heavy right skew is the most common pathology in business data, where counts and monetary amounts pile up near small values with a long tail of large ones. Skew degrades linear models because it often produces heteroscedastic residuals and lets the tail dominate squared error loss. The log, Box-Cox, and Yeo-Johnson transforms of the previous sections are the primary remedies, chosen by inspecting the transformed skewness or by maximizing a normality criterion.
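Inspecting the skewness before and after a candidate transform takes only a few lines. The sketch below implements the population form of $\gamma_1$ on an invented feature; the function name `skewness` is illustrative.

```python
import numpy as np

def skewness(x):
    """Standardized third moment (population version, dividing by n)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z**3))

# Hypothetical right-skewed feature spanning several orders of magnitude.
x = 2.0 ** np.arange(12)

skew_raw = skewness(x)
skew_log = skewness(np.log(x))
print(f"raw skewness: {skew_raw:.2f}")
print(f"log skewness: {abs(skew_log):.2f}")
```

This feature is exactly geometric, so its logarithm is evenly spaced and the skewness collapses to zero. Real data will rarely symmetrize this cleanly, but the same before-and-after comparison applies.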

66.6.2 6.2 Detecting Outliers

An outlier is an observation far from the bulk of the data, and its influence depends entirely on the model. Two classical univariate rules are the $z$ score rule, which flags $|z| > 3$, and the more robust IQR rule, which flags points outside $[Q_1 - 1.5\,\text{IQR},\ Q_3 + 1.5\,\text{IQR}]$. The IQR rule is preferred because the $z$ score itself is computed from a mean and standard deviation that the outliers contaminate. Multivariate outliers, which look normal on every axis but anomalous jointly, require methods such as Mahalanobis distance,

\[ D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}, \]

or model based detectors like isolation forests and local outlier factor.
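The weakness of the $z$ score rule is easiest to see in a small sample. The following sketch applies both univariate rules to ten invented values; the function names are hypothetical and the thresholds are the conventional defaults.

```python
import numpy as np

def z_score_flags(x, threshold=3.0):
    """Flag points whose |z| exceeds the threshold (population std)."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def iqr_flags(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

# Hypothetical sample of ten values with one gross outlier.
x = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 1000], dtype=float)

print("z-score rule flags:", x[z_score_flags(x)])
print("IQR rule flags:    ", x[iqr_flags(x)])
```

The outlier inflates the standard deviation so much that its own $z$ score stays just under three. In fact, with the population standard deviation no point in a sample of size $n$ can have $|z|$ above $\sqrt{n - 1}$, so the rule cannot fire at all when $n \leq 10$.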

66.6.3 6.3 Treating Outliers

There is no universal correct treatment; the right action depends on whether the extreme value is an error or a genuine rare event. Common strategies include the following. Removal deletes the offending rows, appropriate only when you are confident they are measurement errors, because discarding genuine extremes biases the model. Winsorizing (clipping) caps values at chosen percentiles, for example the 1st and 99th, retaining the row while bounding its leverage. Transformation via the log or a power transform pulls in the tail so that extremes become unexceptional in the transformed space, often the most principled option. Robust modeling sidesteps the problem with loss functions such as the Huber loss that downweight large residuals rather than altering the data.

\[ L_\delta(r) = \begin{cases} \tfrac{1}{2} r^2, & |r| \leq \delta,\\[1mm] \delta\left(|r| - \tfrac{1}{2}\delta\right), & |r| > \delta. \end{cases} \]

The Huber loss behaves quadratically for small residuals and linearly for large ones, which limits the influence any single outlier exerts on the fit.
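The piecewise definition maps directly onto a vectorized function. This is a minimal NumPy sketch with an arbitrary choice of $\delta = 1$ and invented residuals.

```python
import numpy as np

def huber_loss(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it."""
    r = np.asarray(r, dtype=float)
    quadratic = 0.5 * r**2
    linear = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quadratic, linear)

residuals = np.array([0.5, 1.0, 3.0, 100.0])
squared = 0.5 * residuals**2
huber = huber_loss(residuals, delta=1.0)

print("squared:", squared)
print("huber:  ", huber)
```

For the residual of $100$, the squared loss is $5000$ while the Huber loss is $99.5$, which is the sense in which a single outlier's influence is limited.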

66.6.4 6.4 Documenting and Monitoring

Whatever treatment you choose, record the thresholds and percentiles as part of the model artifact, derive them from training data only, and apply them unchanged at inference. Production monitoring should track the rate at which incoming values trigger clipping or fall outside training quantiles, because a rising rate is an early warning of distribution drift that may demand retraining.
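One way to combine frozen clipping bounds with a drift signal is sketched below. The function names, the 1st and 99th percentile defaults, and the toy arrays are all assumptions for illustration; a production system would persist the bounds alongside the model and log the rate over time.

```python
import numpy as np

def fit_clip_bounds(train, lower_pct=1.0, upper_pct=99.0):
    """Learn winsorizing bounds from training data only."""
    lo, hi = np.percentile(train, [lower_pct, upper_pct])
    return float(lo), float(hi)

def clip_and_report(x, bounds):
    """Apply frozen bounds; return clipped values and the fraction clipped."""
    lo, hi = bounds
    x = np.asarray(x, dtype=float)
    clip_rate = float(np.mean((x < lo) | (x > hi)))
    return np.clip(x, lo, hi), clip_rate

train = np.arange(0.0, 101.0)                  # hypothetical training feature
bounds = fit_clip_bounds(train)                # stored with the model artifact
incoming = np.array([-5.0, 50.0, 150.0, 98.0]) # hypothetical production batch
clipped, rate = clip_and_report(incoming, bounds)

print("bounds:", bounds)
print("clipped:", clipped, "clip rate:", rate)
```

By construction about two percent of training values fall outside these bounds, so a production clip rate persistently well above that level is the kind of signal the paragraph above recommends tracking.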

66.7 7. When to Use What, and the Common Pitfalls

The techniques in this chapter are not interchangeable, and most production failures trace to applying the right method in the wrong place or fitting it on the wrong data. The following table consolidates the guidance scattered through the chapter into a quick reference.

Technique	Use when	Avoid or be careful when	Classic pitfall
Standardization	Regularized linear, neural, distance, or kernel models with light tails	Strong outliers dominate the mean and variance	Treating it as outlier removal; it only recenters
Min-max scaling	A bounded input range is required and outliers are rare	Heavy tails compress the bulk into a sliver	One extreme value flattening the informative range
Robust scaling	Messy data with heavy tails or contamination	The distribution is genuinely symmetric and clean	Forgetting it does not symmetrize, only rescales
Log or log1p	Positive, right-skewed, multiplicative quantities	Zeros or negatives without an offset	Naive back-transform giving a biased mean
Box-Cox	Strictly positive features needing near-normality	Any nonpositive value is present	Refitting lambda on test data, leaking information
Yeo-Johnson	Arbitrary-sign features needing symmetry	Interpretability of units matters	Assuming it fixes outliers; it tames but keeps them
Quantile transform	Order matters but scale is meaningless	Small samples or magnitude carries signal	Overfitting the empirical CDF
Binning	Linear or additive models needing nonlinearity	Tree ensembles that already partition	Discarding power through over-aggressive bins
Polynomial features	Suspected curvature in a few curated inputs	Many inputs or high degree	Combinatorial blowup and multicollinearity

Three failure modes cut across every row and are worth stating plainly. The first is data leakage: estimating any parameter, a mean, a quantile, a lambda, a bin edge, on data that includes the validation or test split. The second is silent extrapolation: a frozen transform receiving production values outside the training support, where its behavior is undefined or unstable. The third is mismatched bias: applying a scale-sensitive remedy to a tree ensemble, or a tree-friendly construction to a linear model, so the effort is wasted or even harmful.

66.8 8. Putting It Together

A coherent numerical feature pipeline composes these steps in a sensible order. Diagnose the distribution of each feature and the inductive bias of the chosen model. Apply a distributional transform such as Yeo-Johnson to symmetrize skewed inputs. Scale the result with a standard or robust scaler appropriate to the remaining tail behavior. Construct interactions and low degree polynomial terms where domain knowledge or diagnostics suggest conditional or curved effects. Bin selectively when a linear model must express non monotone structure. Treat outliers explicitly through clipping, transformation, or a robust loss. Critically, encapsulate every estimated parameter inside a pipeline so it is fit on training folds alone, preventing leakage and ensuring that cross validation reflects genuine generalization.

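from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, PowerTransformer, RobustScaler

pipe = make_pipeline(
    PowerTransformer(method="yeo-johnson"),  # symmetrize, fit on train
    RobustScaler(),                          # scale, resist residual tails
    PolynomialFeatures(degree=2, interaction_only=True,
                       include_bias=False),  # Ridge fits its own intercept
    Ridge(alpha=1.0),                        # regularize the expansion
)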

Feature engineering for numerical data is not a mechanical preprocessing chore but a modeling decision in its own right. The transforms in this chapter encode hypotheses about the structure of the data: that effects are multiplicative, that tails are noise rather than signal, that two features interact. Chosen well and validated honestly, they often deliver larger gains than swapping one learning algorithm for another.

66.9 References

Kuhn, M., and Johnson, K. Feature Engineering and Selection: A Practical Approach for Predictive Models. http://www.feat.engineering/
Zheng, A., and Casari, A. Feature Engineering for Machine Learning. O’Reilly Media. https://www.oreilly.com/library/view/feature-engineering-for/9781491953235/
Box, G. E. P., and Cox, D. R. “An Analysis of Transformations.” Journal of the Royal Statistical Society, Series B, 1964. https://www.jstor.org/stable/2984418
Yeo, I., and Johnson, R. A. “A New Family of Power Transformations to Improve Normality or Symmetry.” Biometrika, 2000. https://doi.org/10.1093/biomet/87.4.954
Scikit-learn Developers. “Preprocessing data.” Scikit-learn User Guide. https://scikit-learn.org/stable/modules/preprocessing.html
Scikit-learn Developers. “Compare the effect of different scalers on data with outliers.” https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
Huber, P. J. “Robust Estimation of a Location Parameter.” Annals of Mathematical Statistics, 1964. https://doi.org/10.1214/aoms/1177703732
Duan, N. “Smearing Estimate: A Nonparametric Retransformation Method.” Journal of the American Statistical Association, 1983. https://doi.org/10.1080/01621459.1983.10478017
Liu, F. T., Ting, K. M., and Zhou, Z. “Isolation Forest.” IEEE ICDM, 2008. https://doi.org/10.1109/ICDM.2008.17
Siddiqi, N. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. Wiley. https://www.wiley.com/en-us/Credit+Risk+Scorecards-p-9780471754510
Maronna, R. A., Martin, R. D., Yohai, V. J., and Salibian-Barrera, M. Robust Statistics: Theory and Methods (with R), 2nd ed. Wiley, 2019. https://doi.org/10.1002/9781119214656
