67 Feature Engineering for Categorical Data

Categorical variables are pervasive in applied machine learning. Customer identifiers, product SKUs, postal codes, device types, diagnosis codes, and merchant names all carry predictive signal, yet almost no learning algorithm consumes raw strings directly. The work of converting a category into a numeric representation that a model can exploit is among the most consequential decisions in a tabular modeling pipeline. A poor encoding inflates dimensionality, leaks the target, or discards the structure that made the variable useful in the first place. A good encoding preserves signal, controls variance, and respects the boundary between training and evaluation. This chapter develops the major encoding families with attention to both their statistical justification and their practical failure modes.

67.1 1. The Categorical Encoding Problem

67.1.1 1.1 What a Categorical Variable Is

A categorical variable takes values from a finite or countably infinite set of discrete levels $\mathcal{C} = \{c_1, c_2, \dots, c_K\}$, where the labels themselves carry no inherent arithmetic meaning. We distinguish nominal variables, whose levels have no natural order (color, country, payment method), from ordinal variables, whose levels admit a meaningful ranking (education level, satisfaction rating, T-shirt size). A third practically important axis is cardinality. A variable with a handful of levels (blood type) behaves very differently from one with millions of levels (user identifier). High cardinality is where naive encodings break down, and most of the interesting engineering lives there.

Formally, an encoder is a map $\psi : \mathcal{C} \to \mathbb{R}^d$ that sends each level to a numeric vector. The art of categorical feature engineering is choosing $\psi$ so that the geometry it imposes on $\mathcal{C}$ matches the structure the downstream model can exploit. The relevant scale of measurement is worth naming precisely. A nominal variable supports only the equality relation, so any valid encoding must be invariant to a relabeling of the levels (a permutation of $\mathcal{C}$). An ordinal variable additionally supports an order relation $\preceq$, so a valid encoding should be monotone in that order, but differences between codes carry no guaranteed meaning. This taxonomy, due to Stevens, is the formal reason that integer codes are appropriate for ordinal data and dangerous for nominal data: an integer code silently asserts an interval scale that the data does not support.

67.1.2 1.2 Why Encoding Choice Matters

Encoding is not a neutral preprocessing step. It interacts with the model family, the cardinality of the variable, the size of the training set, and the risk of information leakage. Three tensions recur throughout this chapter.

The first is dimensionality versus expressiveness. One-hot encoding is faithful but explodes the feature count for high cardinality variables. The second is bias versus variance. Encodings that summarize a category by its relationship to the target reduce dimensionality dramatically but introduce variance for rare categories and bias if estimated carelessly. The third is leakage versus fidelity. The most powerful supervised encodings use the target, which means they can silently leak label information into features unless the estimation respects strict separation between the rows used to fit the encoder and the rows used to fit the model.

A useful mental model is that every encoder is a small statistical estimator embedded inside your pipeline, and it inherits all the usual concerns about overfitting, regularization, and validation.

The diagram below summarizes the decision that the rest of the chapter develops in detail.

flowchart TD
    A["Categorical variable"] --> B{"Genuine order?"}
    B -->|"Yes"| C["Integer or contrast encoding"]
    B -->|"No"| D{"Cardinality"}
    D -->|"Low, under 50 levels"| E["One-hot or dummy encoding"]
    D -->|"High"| F{"Model and constraints"}
    F -->|"Linear or tree, batch"| G["Leakage-safe target encoding"]
    F -->|"Streaming or tight memory"| H["Feature hashing"]
    F -->|"Neural net, ample data"| I["Learned entity embeddings"]

67.2 2. One-Hot and Dummy Encoding

67.2.1 2.1 The Construction

One-hot encoding maps a categorical variable with $K$ levels to a binary vector of length $K$. For an observation whose category is $c_k$, the encoding is the indicator vector $\mathbf{e}_k$ that is $1$ in position $k$ and $0$ elsewhere. Formally, the encoded feature for level $j$ is

\[ x_j = \mathbb{1}[c = c_j], \quad j = 1, \dots, K. \]

This representation is sometimes called a one-of-$K$ or indicator coding. It makes no assumption about ordering and treats every level as equidistant from every other level, which is exactly what we want for nominal variables.

# conceptual, not optimized
def one_hot(value, levels):
    return [1 if value == c else 0 for c in levels]

67.2.2 2.2 Dummy Encoding and Collinearity

Dummy encoding is a close variant that uses only $K-1$ columns, designating one level as the reference category and representing it implicitly by the all zero vector. The distinction matters for linear models. With one-hot encoding plus an intercept term, the $K$ indicator columns sum to the constant $1$, which equals the intercept column. This perfect linear dependency is the dummy variable trap. The design matrix becomes rank deficient, the ordinary least squares solution is not unique, and coefficient interpretation collapses. Dropping one level restores full rank.

For regularized linear models the calculus changes. Under $L_2$ penalization the solution is again unique even with the redundant column, and dropping a reference level introduces an asymmetry because the penalty now treats the reference category differently from the others. For tree based models, neither collinearity nor the reference level choice affects predictions, since trees split on individual indicators without inverting a design matrix. The practical rule is to drop a level for unregularized linear models and to keep all levels for regularized linear models and trees.

67.2.3 2.3 Strengths, Costs, and When to Stop

One-hot encoding is lossless, interpretable, and model agnostic, which makes it the correct default for low cardinality variables. Its weakness is purely dimensional. A variable with $K$ levels adds $K$ sparse columns, and several such variables interact multiplicatively in memory and training time. Beyond roughly fifty levels the sparsity becomes a liability for many model families, the per level sample size shrinks, and rare levels contribute noisy, weakly identified coefficients. At that point one of the dimensionality reducing encodings below becomes preferable. A common compromise is to one-hot only the top $m$ most frequent levels and collapse the remaining tail into a single “other” bucket.

67.3 3. Ordinal and Contrast Encoding

67.3.1 3.1 Encoding Genuine Order

When a variable is genuinely ordinal, integer encoding that respects the ranking is both compact and informative. We map levels to integers $0, 1, 2, \dots, K-1$ following the domain order, so that “low”, “medium”, “high” becomes $0, 1, 2$. This single column communicates monotonic structure to the model, which a one-hot encoding would discard entirely. Tree based models exploit this cleanly because a single threshold split such as $x \le 1$ separates the lower levels from the upper ones.

The serious hazard is imposing a false ordering on a nominal variable. Encoding {red, green, blue} as $0, 1, 2$ asserts that green sits “between” red and blue and that the distance from red to blue is twice the distance from red to green. Linear models will take this fabricated geometry literally and may fit spurious trends. Reserve integer encoding for variables with a defensible order, and prefer one-hot or a target based scheme for nominal variables.

67.3.2 3.2 Spacing and Contrast Coding

Plain integer codes assume equal spacing between adjacent levels, which is rarely exactly true. If domain knowledge supplies meaningful numeric anchors, for example mapping satisfaction levels to survey midpoints, use those values directly rather than consecutive integers. The classical statistics literature also offers contrast coding schemes such as Helmert and polynomial contrasts that let a linear model test specific hypotheses about how successive levels differ. These are most valuable in inferential settings where the coefficients themselves are the object of interest. In predictive machine learning they are used less often, since flexible models recover the relevant structure from simpler codings, but they remain useful when interpretability of level to level differences matters.

67.4 4. Target and Mean Encoding

67.4.1 4.1 The Core Idea

Target encoding, also called mean encoding or likelihood encoding, replaces each category with a statistic of the target computed over the rows belonging to that category. For a regression target the natural choice is the conditional mean,

\[ \hat{\mu}_k = \frac{1}{n_k} \sum_{i : c_i = c_k} y_i, \]

where $n_k$ is the number of training rows in category $k$. For binary classification the same formula yields the empirical positive rate, an estimate of $P(y = 1 \mid c = c_k)$. The appeal is immediate. A variable with a million levels collapses to a single numeric column that directly encodes each level’s relationship to the target, sidestepping the dimensionality explosion of one-hot encoding while retaining far more signal than an arbitrary integer code.

67.4.2 4.2 Smoothing Toward the Prior

Raw category means are dangerously high variance for rare levels. A category observed twice, both times positive, yields an estimate of $1.0$ that the model will trust as strongly as a category observed ten thousand times. To see this quantitatively, note that for a binary target with true rate $p_k$ the raw mean has variance $\operatorname{Var}(\hat{\mu}_k) = p_k(1-p_k)/n_k$, which blows up as $n_k \to 1$. The standard remedy is empirical Bayes shrinkage toward the global mean $\mu$, blending the category estimate with the prior in proportion to how much evidence the category provides,

\[ \tilde{\mu}_k = \frac{n_k \, \hat{\mu}_k + m \, \mu}{n_k + m}, \]

where $m$ is a smoothing strength expressed in pseudo counts. When $n_k$ is large the estimate is dominated by the category’s own data, and when $n_k$ is small it is pulled toward the global mean. The hyperparameter $m$ controls the crossover point and is best chosen by cross validation.

This formula is not an ad hoc convex combination. It is the posterior mean of a conjugate Bayesian model. Place a Beta$(\alpha, \beta)$ prior on the category rate $p_k$ and observe $s_k = n_k \hat{\mu}_k$ successes in $n_k$ Bernoulli trials. The posterior is Beta$(\alpha + s_k, \beta + n_k - s_k)$, whose mean is

\[ \mathbb{E}[p_k \mid \text{data}] = \frac{\alpha + s_k}{\alpha + \beta + n_k} = \frac{n_k \hat{\mu}_k + (\alpha + \beta)\,\tfrac{\alpha}{\alpha+\beta}}{n_k + (\alpha + \beta)}. \]

Identifying $m = \alpha + \beta$ as the prior pseudo count and $\mu = \alpha/(\alpha+\beta)$ as the prior mean recovers the smoothing formula exactly. The prior strength $m$ thus has a clean interpretation: it is the number of phantom observations, drawn at the global rate, that we mix into every category before trusting its own data. Setting $m$ near the typical per category count makes the encoder skeptical of small samples and confident only when a level is well populated. A common alternative parameterization uses a logistic weighting

\[ \lambda(n_k) = \frac{1}{1 + e^{-(n_k - k_0)/f}}, \]

with $\tilde{\mu}_k = \lambda(n_k)\,\hat{\mu}_k + (1 - \lambda(n_k))\,\mu$, which gives a smooth sigmoid transition governed by an inflection point $k_0$ and a slope $f$. Either way, smoothing is not optional for high cardinality target encoding. It is what makes the estimator usable on the long tail.

Worked example. Suppose a binary churn target has a global positive rate $\mu = 0.20$ and we use smoothing strength $m = 10$. Consider three merchant levels. Merchant A appears $n_A = 2$ times, both churned, so $\hat{\mu}_A = 1.0$. Merchant B appears $n_B = 50$ times with $30$ churns, so $\hat{\mu}_B = 0.60$. Merchant C appears $n_C = 5000$ times with $1500$ churns, so $\hat{\mu}_C = 0.30$. The smoothed codes are

\[ \tilde{\mu}_A = \frac{2(1.0) + 10(0.2)}{2 + 10} = 0.333, \quad \tilde{\mu}_B = \frac{50(0.6) + 10(0.2)}{50 + 10} = 0.533, \quad \tilde{\mu}_C = \frac{5000(0.3) + 10(0.2)}{5000 + 10} = 0.300. \]

The rare merchant A, whose raw estimate of $1.0$ rested on a single decisive observation, is pulled most of the way back to the prior, while the well populated merchant C is essentially unchanged. This is exactly the behavior we want: the encoder expresses confidence in proportion to evidence.

67.4.3 4.3 The Leakage Problem

Target encoding uses the label, so it can leak. If you compute $\hat{\mu}_k$ from the same rows you then train on, each row’s encoded feature contains a contribution from that row’s own target. The model discovers that the encoded column is a slightly noisy copy of the answer, training metrics look spectacular, and validation collapses. The effect is most severe precisely for rare categories, where a single row dominates its own category mean. This is one of the most common and most costly mistakes in applied tabular modeling, and naive in place target encoding should be treated as a bug rather than a technique.

67.4.4 4.4 Leakage-Safe Cross-Fitting

The principled fix is to ensure that the encoding for any given row is estimated only from other rows. Out of fold target encoding, also called cross-fitting, achieves this. Partition the training data into $K$ folds. For each fold, compute the category statistics using only the other $K-1$ folds and apply them to the held out fold. Every training row thus receives an encoding that never saw its own target. At inference time, and for the validation and test sets, you encode using statistics fit on the entire training set, since those rows are disjoint from the data being encoded.

# out-of-fold target encoding, schematic
for train_idx, oof_idx in kfold.split(X):
    stats = smoothed_means(X[train_idx], y[train_idx])
    encoded[oof_idx] = map_with_prior(X[oof_idx], stats, global_mean)
# final encoder for test/inference: fit on all training rows
full_stats = smoothed_means(X, y)

A finer grained variant computes a leave one out or running estimate. The gradient boosting library CatBoost popularized ordered target statistics, which impose a random permutation on the rows and encode each row using only the targets of rows that precede it in the permutation, averaged over several permutations to reduce variance. This online formulation eliminates leakage without sacrificing data to fixed folds and is one reason CatBoost performs strongly on high cardinality categorical data out of the box.

Two further safeguards matter. First, combine cross-fitting with smoothing, since the two address different problems, leakage and small sample variance respectively. Second, add a small amount of Gaussian noise to the out of fold encodings during training when categories remain nearly separable, which further weakens any residual ability of the model to memorize through the encoded column.

67.4.5 4.5 Multiclass and Count Variants

For a multiclass target with $C$ classes, target encoding produces $C-1$ columns per categorical variable, one conditional probability per class with the usual reference dropped. This partially reintroduces dimensionality, though far below one-hot for high cardinality inputs. A lighter alternative is count or frequency encoding, which replaces each level with its number of occurrences or its relative frequency. Frequency encoding ignores the target entirely, so it cannot leak, and it captures the often predictive notion of how common a category is, but it conflates distinct categories that happen to share a frequency and should be viewed as a complement to rather than a substitute for target encoding.

67.5 5. Feature Hashing

67.5.1 5.1 The Hashing Trick

Feature hashing, or the hashing trick, sidesteps the need to enumerate categories at all. A hash function $h$ maps each category string to an index in a fixed range $\{0, 1, \dots, d-1\}$, and the category is encoded as the corresponding position in a $d$ dimensional vector. The output dimension $d$ is chosen in advance and is independent of the true cardinality $K$, which may be unknown or unbounded. To reduce the bias introduced by collisions, a signed variant uses a second hash $\xi$ mapping to $\{-1, +1\}$ and writes $\pm 1$ into the bucket,

\[ \phi_j(c) = \sum_{c : h(c) = j} \xi(c), \]

so that collisions tend to cancel in expectation rather than always adding. This signed hashing is unbiased for inner products, which is the property that makes it attractive for linear models.

67.5.2 5.2 Properties and Trade-offs

Hashing is stateless and streaming friendly. It requires no fitting pass, no stored vocabulary, and no special handling for categories unseen at training time, since any string simply hashes to some bucket. This makes it well suited to online learning, to extremely high cardinality features such as raw text n-grams, and to memory constrained or low latency production systems. The cost is collisions. When $K \gg d$, distinct categories share buckets and their signals superimpose, which injects noise and destroys interpretability, since you can no longer invert a bucket back to a category. The collision rate is governed by the load factor $K/d$. Under the idealized assumption that a good hash distributes categories uniformly and independently over the $d$ buckets, the probability that a particular category collides with at least one of the other $K-1$ categories is

\[ P(\text{collision}) = 1 - \left(1 - \frac{1}{d}\right)^{K-1} \approx 1 - e^{-(K-1)/d}, \]

which is the same birthday paradox arithmetic that governs hash tables. The practical reading is that collisions become common once $K$ approaches $d$, so to keep the expected fraction of collided categories small one chooses $d$ several times larger than $K$. The practical knob is to enlarge $d$ until collisions are tolerable for the task. Hashing pairs naturally with linear models and is a staple of large scale click through rate prediction, where billions of sparse categorical features must be handled within a fixed memory budget.

67.6 6. Learned Entity Embeddings

67.6.1 6.1 From Lookup Table to Dense Vector

Entity embeddings treat each categorical level as an index into a trainable lookup table. A variable with $K$ levels is associated with an embedding matrix $\mathbf{E} \in \mathbb{R}^{K \times d}$, and level $k$ is represented by the dense row vector $\mathbf{E}_k \in \mathbb{R}^d$, with $d$ typically far smaller than $K$. These vectors are parameters of the model and are learned by gradient descent jointly with the rest of the network, so the representation is optimized directly for the predictive task. This is the same mechanism that underlies word embeddings, applied to arbitrary categorical entities such as stores, products, or postal codes.

# embedding lookup, schematic
embedding = EmbeddingTable(num_levels=K, dim=d)  # K x d trainable
vec = embedding[category_index]                  # d-dim dense vector

67.6.2 6.2 Why Embeddings Help

Unlike one-hot encoding, which forces every category to be orthogonal and equidistant, an embedding places categories in a continuous space where the geometry can express similarity. During training the network is free to position related levels close together, so that two postal codes with similar customer behavior acquire nearby vectors and the model generalizes across them. This learned similarity structure is the central advantage. It lets the model share statistical strength among related categories rather than treating each as an isolated parameter, which is especially valuable for high cardinality variables where many levels individually have little data.

A practical heuristic for choosing the dimension is to scale it sublinearly with cardinality, for example $d = \min(50, \lceil K/2 \rceil)$ or $d \approx K^{0.25}$ rounded up. The dimension trades capacity against overfitting and memory and is worth tuning, but the precise value is usually not critical within a reasonable range. Embeddings shine when the cardinality is high, when the categorical variable interacts with other features in complex ways, and when enough data exists to fit the additional parameters, since each level’s vector must be learned.

67.6.3 6.3 Reuse, Transfer, and Limits

Because embeddings are dense vectors learned for a task, they can be inspected, visualized, and reused. Plotting store embeddings with a dimensionality reduction method often reveals interpretable clusters that mirror geography or product mix, and embeddings trained on a large task can be transferred as fixed features into a smaller downstream model. The classic demonstration came from a Kaggle forecasting competition in which entity embeddings of stores and dates substantially improved a neural network and, once extracted, lifted the performance of gradient boosted trees fed the same vectors.

The limitations mirror the strengths. Embeddings require a model that can train them, typically a neural network, which is heavier than fitting a target encoder. They need sufficient data per level to learn meaningful positions, and a level seen only a handful of times will have a poorly determined vector. Unseen categories at inference time require an explicit out of vocabulary slot or a fallback. For small low cardinality problems the added machinery rarely pays off, and a simple encoding is both faster and more robust.

67.7 7. Practical Guidance

67.7.1 7.1 Choosing an Encoding

The decision is driven primarily by cardinality and model family. For low cardinality variables, one-hot or dummy encoding is the safe default, with the reference level dropped only for unregularized linear models. For genuinely ordinal variables, integer encoding that respects the order is compact and effective. For high cardinality variables the choice widens. Leakage-safe target encoding is the strongest general purpose option for both linear and tree based models, frequency encoding is a cheap leak free complement, hashing is preferred under streaming or tight memory constraints, and learned embeddings are the right tool when a neural network is already in play and data is plentiful.

67.7.2 7.2 Cross-Cutting Concerns

Several issues span every encoder. Unseen categories must be handled explicitly, whether through an “other” bucket, a hashing fallback, or a dedicated out of vocabulary embedding, because production data will contain levels absent from training. Missing values should be treated as their own category rather than silently imputed, since missingness is frequently informative. All target based encoders must be fit strictly inside the cross validation loop or via out of fold estimation, never on the full data before splitting, or the validation estimate of generalization will be optimistic. Finally, encoding choices should be evaluated empirically. The interactions among cardinality, sample size, smoothing strength, and model family are difficult to predict in advance, so the reliable path is to treat the encoder as a tunable component of the pipeline and let leakage-safe cross validation decide.

67.7.3 7.3 Encoders at a Glance

The table consolidates the trade-offs developed above. The output dimension column gives the number of columns added per categorical variable, $K$ is the cardinality, $C$ the number of target classes, and $d$ a chosen output width.

Encoder	Output dim	Uses target	Leakage risk	Best when
One-hot or dummy	$K$ or $K-1$	No	None	Low cardinality, any model
Ordinal or contrast	$1$	No	None	Genuine order in the levels
Frequency or count	$1$	No	None	Popularity is predictive
Target or mean	$1$ (or $C-1$)	Yes	High if naive	High cardinality, batch training
Feature hashing	$d$	No	None	Streaming, huge or unbounded $K$
Entity embeddings	$d$	Yes (jointly)	Low	Neural net, high $K$, ample data

The recurring pitfalls are worth restating as a checklist. Do not impose a fabricated order on a nominal variable with integer codes. Do not drop a reference level for a regularized linear model or a tree. Do not fit a target encoder on data the model will also train on. Do not forget to smooth high cardinality target codes, and do not forget an out of vocabulary path for categories that appear only in production. Each of these is a quiet failure that passes unit tests and surfaces only as a gap between offline and online performance.

67.8 References

Micci-Barreca, D. (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. ACM SIGKDD Explorations. https://dl.acm.org/doi/10.1145/507533.507538
Weinberger, K., Dasgupta, A., Langford, J., Smola, A., Attenberg, J. (2009). Feature Hashing for Large Scale Multitask Learning. ICML. https://arxiv.org/abs/0902.2206
Guo, C., Berkhahn, F. (2016). Entity Embeddings of Categorical Variables. https://arxiv.org/abs/1604.06737
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., Gulin, A. (2018). CatBoost: Unbiased Boosting with Categorical Features. NeurIPS. https://arxiv.org/abs/1706.09516
Pargent, F., Pfisterer, F., Thomas, J., Bischl, B. (2022). Regularized Target Encoding Outperforms Traditional Methods in Supervised Machine Learning with High Cardinality Features. Computational Statistics. https://arxiv.org/abs/2104.00629
scikit-learn Developers. Encoding Categorical Features (User Guide). https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
scikit-learn Developers. TargetEncoder and Cross-Fitting. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html
Category Encoders Contributors. Category Encoders Documentation. https://contrib.scikit-learn.org/category_encoders/

# Feature Engineering for Categorical Data Categorical variables are pervasive in applied machine learning. Customer identifiers, product SKUs, postal codes, device types, diagnosis codes, and merchant names all carry predictive signal, yet almost no learning algorithm consumes raw strings directly. The work of converting a category into a numeric representation that a model can exploit is among the most consequential decisions in a tabular modeling pipeline. A poor encoding inflates dimensionality, leaks the target, or discards the structure that made the variable useful in the first place. A good encoding preserves signal, controls variance, and respects the boundary between training and evaluation. This chapter develops the major encoding families with attention to both their statistical justification and their practical failure modes. ## 1. The Categorical Encoding Problem ### 1.1 What a Categorical Variable Is A categorical variable takes values from a finite or countably infinite set of discrete levels $\mathcal{C} = \{c_1, c_2, \dots, c_K\}$, where the labels themselves carry no inherent arithmetic meaning. We distinguish nominal variables, whose levels have no natural order (color, country, payment method), from ordinal variables, whose levels admit a meaningful ranking (education level, satisfaction rating, T-shirt size). A third practically important axis is cardinality. A variable with a handful of levels (blood type) behaves very differently from one with millions of levels (user identifier). High cardinality is where naive encodings break down, and most of the interesting engineering lives there. Formally, an encoder is a map $\psi : \mathcal{C} \to \mathbb{R}^d$ that sends each level to a numeric vector. The art of categorical feature engineering is choosing $\psi$ so that the geometry it imposes on $\mathcal{C}$ matches the structure the downstream model can exploit. The relevant scale of measurement is worth naming precisely. A nominal variable supports only the equality relation, so any valid encoding must be invariant to a relabeling of the levels (a permutation of $\mathcal{C}$). An ordinal variable additionally supports an order relation $\preceq$, so a valid encoding should be monotone in that order, but differences between codes carry no guaranteed meaning. This taxonomy, due to Stevens, is the formal reason that integer codes are appropriate for ordinal data and dangerous for nominal data: an integer code silently asserts an interval scale that the data does not support. ### 1.2 Why Encoding Choice Matters Encoding is not a neutral preprocessing step. It interacts with the model family, the cardinality of the variable, the size of the training set, and the risk of information leakage. Three tensions recur throughout this chapter. The first is dimensionality versus expressiveness. One-hot encoding is faithful but explodes the feature count for high cardinality variables. The second is bias versus variance. Encodings that summarize a category by its relationship to the target reduce dimensionality dramatically but introduce variance for rare categories and bias if estimated carelessly. The third is leakage versus fidelity. The most powerful supervised encodings use the target, which means they can silently leak label information into features unless the estimation respects strict separation between the rows used to fit the encoder and the rows used to fit the model. A useful mental model is that every encoder is a small statistical estimator embedded inside your pipeline, and it inherits all the usual concerns about overfitting, regularization, and validation. The diagram below summarizes the decision that the rest of the chapter develops in detail. ```{mermaid} flowchart TD A["Categorical variable"] --> B{"Genuine order?"} B -->|"Yes"| C["Integer or contrast encoding"] B -->|"No"| D{"Cardinality"} D -->|"Low, under 50 levels"| E["One-hot or dummy encoding"] D -->|"High"| F{"Model and constraints"} F -->|"Linear or tree, batch"| G["Leakage-safe target encoding"] F -->|"Streaming or tight memory"| H["Feature hashing"] F -->|"Neural net, ample data"| I["Learned entity embeddings"] ``` ## 2. One-Hot and Dummy Encoding ### 2.1 The Construction One-hot encoding maps a categorical variable with $K$ levels to a binary vector of length $K$. For an observation whose category is $c_k$, the encoding is the indicator vector $\mathbf{e}_k$ that is $1$ in position $k$ and $0$ elsewhere. Formally, the encoded feature for level $j$ is $$ x_j = \mathbb{1}[c = c_j], \quad j = 1, \dots, K. $$ This representation is sometimes called a one-of-$K$ or indicator coding. It makes no assumption about ordering and treats every level as equidistant from every other level, which is exactly what we want for nominal variables. ```python # conceptual, not optimized def one_hot(value, levels): return [1 if value == c else 0 for c in levels] ``` ### 2.2 Dummy Encoding and Collinearity Dummy encoding is a close variant that uses only $K-1$ columns, designating one level as the reference category and representing it implicitly by the all zero vector. The distinction matters for linear models. With one-hot encoding plus an intercept term, the $K$ indicator columns sum to the constant $1$, which equals the intercept column. This perfect linear dependency is the dummy variable trap. The design matrix becomes rank deficient, the ordinary least squares solution is not unique, and coefficient interpretation collapses. Dropping one level restores full rank. For regularized linear models the calculus changes. Under $L_2$ penalization the solution is again unique even with the redundant column, and dropping a reference level introduces an asymmetry because the penalty now treats the reference category differently from the others. For tree based models, neither collinearity nor the reference level choice affects predictions, since trees split on individual indicators without inverting a design matrix. The practical rule is to drop a level for unregularized linear models and to keep all levels for regularized linear models and trees. ### 2.3 Strengths, Costs, and When to Stop One-hot encoding is lossless, interpretable, and model agnostic, which makes it the correct default for low cardinality variables. Its weakness is purely dimensional. A variable with $K$ levels adds $K$ sparse columns, and several such variables interact multiplicatively in memory and training time. Beyond roughly fifty levels the sparsity becomes a liability for many model families, the per level sample size shrinks, and rare levels contribute noisy, weakly identified coefficients. At that point one of the dimensionality reducing encodings below becomes preferable. A common compromise is to one-hot only the top $m$ most frequent levels and collapse the remaining tail into a single "other" bucket. ## 3. Ordinal and Contrast Encoding ### 3.1 Encoding Genuine Order When a variable is genuinely ordinal, integer encoding that respects the ranking is both compact and informative. We map levels to integers $0, 1, 2, \dots, K-1$ following the domain order, so that "low", "medium", "high" becomes $0, 1, 2$. This single column communicates monotonic structure to the model, which a one-hot encoding would discard entirely. Tree based models exploit this cleanly because a single threshold split such as $x \le 1$ separates the lower levels from the upper ones. The serious hazard is imposing a false ordering on a nominal variable. Encoding {red, green, blue} as $0, 1, 2$ asserts that green sits "between" red and blue and that the distance from red to blue is twice the distance from red to green. Linear models will take this fabricated geometry literally and may fit spurious trends. Reserve integer encoding for variables with a defensible order, and prefer one-hot or a target based scheme for nominal variables. ### 3.2 Spacing and Contrast Coding Plain integer codes assume equal spacing between adjacent levels, which is rarely exactly true. If domain knowledge supplies meaningful numeric anchors, for example mapping satisfaction levels to survey midpoints, use those values directly rather than consecutive integers. The classical statistics literature also offers contrast coding schemes such as Helmert and polynomial contrasts that let a linear model test specific hypotheses about how successive levels differ. These are most valuable in inferential settings where the coefficients themselves are the object of interest. In predictive machine learning they are used less often, since flexible models recover the relevant structure from simpler codings, but they remain useful when interpretability of level to level differences matters. ## 4. Target and Mean Encoding ### 4.1 The Core Idea Target encoding, also called mean encoding or likelihood encoding, replaces each category with a statistic of the target computed over the rows belonging to that category. For a regression target the natural choice is the conditional mean, $$ \hat{\mu}_k = \frac{1}{n_k} \sum_{i : c_i = c_k} y_i, $$ where $n_k$ is the number of training rows in category $k$. For binary classification the same formula yields the empirical positive rate, an estimate of $P(y = 1 \mid c = c_k)$. The appeal is immediate. A variable with a million levels collapses to a single numeric column that directly encodes each level's relationship to the target, sidestepping the dimensionality explosion of one-hot encoding while retaining far more signal than an arbitrary integer code. ### 4.2 Smoothing Toward the Prior Raw category means are dangerously high variance for rare levels. A category observed twice, both times positive, yields an estimate of $1.0$ that the model will trust as strongly as a category observed ten thousand times. To see this quantitatively, note that for a binary target with true rate $p_k$ the raw mean has variance $\operatorname{Var}(\hat{\mu}_k) = p_k(1-p_k)/n_k$, which blows up as $n_k \to 1$. The standard remedy is empirical Bayes shrinkage toward the global mean $\mu$, blending the category estimate with the prior in proportion to how much evidence the category provides, $$ \tilde{\mu}_k = \frac{n_k \, \hat{\mu}_k + m \, \mu}{n_k + m}, $$ where $m$ is a smoothing strength expressed in pseudo counts. When $n_k$ is large the estimate is dominated by the category's own data, and when $n_k$ is small it is pulled toward the global mean. The hyperparameter $m$ controls the crossover point and is best chosen by cross validation. This formula is not an ad hoc convex combination. It is the posterior mean of a conjugate Bayesian model. Place a Beta$(\alpha, \beta)$ prior on the category rate $p_k$ and observe $s_k = n_k \hat{\mu}_k$ successes in $n_k$ Bernoulli trials. The posterior is Beta$(\alpha + s_k, \beta + n_k - s_k)$, whose mean is $$ \mathbb{E}[p_k \mid \text{data}] = \frac{\alpha + s_k}{\alpha + \beta + n_k} = \frac{n_k \hat{\mu}_k + (\alpha + \beta)\,\tfrac{\alpha}{\alpha+\beta}}{n_k + (\alpha + \beta)}. $$ Identifying $m = \alpha + \beta$ as the prior pseudo count and $\mu = \alpha/(\alpha+\beta)$ as the prior mean recovers the smoothing formula exactly. The prior strength $m$ thus has a clean interpretation: it is the number of phantom observations, drawn at the global rate, that we mix into every category before trusting its own data. Setting $m$ near the typical per category count makes the encoder skeptical of small samples and confident only when a level is well populated. A common alternative parameterization uses a logistic weighting $$ \lambda(n_k) = \frac{1}{1 + e^{-(n_k - k_0)/f}}, $$ with $\tilde{\mu}_k = \lambda(n_k)\,\hat{\mu}_k + (1 - \lambda(n_k))\,\mu$, which gives a smooth sigmoid transition governed by an inflection point $k_0$ and a slope $f$. Either way, smoothing is not optional for high cardinality target encoding. It is what makes the estimator usable on the long tail. **Worked example.** Suppose a binary churn target has a global positive rate $\mu = 0.20$ and we use smoothing strength $m = 10$. Consider three merchant levels. Merchant A appears $n_A = 2$ times, both churned, so $\hat{\mu}_A = 1.0$. Merchant B appears $n_B = 50$ times with $30$ churns, so $\hat{\mu}_B = 0.60$. Merchant C appears $n_C = 5000$ times with $1500$ churns, so $\hat{\mu}_C = 0.30$. The smoothed codes are $$ \tilde{\mu}_A = \frac{2(1.0) + 10(0.2)}{2 + 10} = 0.333, \quad \tilde{\mu}_B = \frac{50(0.6) + 10(0.2)}{50 + 10} = 0.533, \quad \tilde{\mu}_C = \frac{5000(0.3) + 10(0.2)}{5000 + 10} = 0.300. $$ The rare merchant A, whose raw estimate of $1.0$ rested on a single decisive observation, is pulled most of the way back to the prior, while the well populated merchant C is essentially unchanged. This is exactly the behavior we want: the encoder expresses confidence in proportion to evidence. ### 4.3 The Leakage Problem Target encoding uses the label, so it can leak. If you compute $\hat{\mu}_k$ from the same rows you then train on, each row's encoded feature contains a contribution from that row's own target. The model discovers that the encoded column is a slightly noisy copy of the answer, training metrics look spectacular, and validation collapses. The effect is most severe precisely for rare categories, where a single row dominates its own category mean. This is one of the most common and most costly mistakes in applied tabular modeling, and naive in place target encoding should be treated as a bug rather than a technique. ### 4.4 Leakage-Safe Cross-Fitting The principled fix is to ensure that the encoding for any given row is estimated only from other rows. Out of fold target encoding, also called cross-fitting, achieves this. Partition the training data into $K$ folds. For each fold, compute the category statistics using only the other $K-1$ folds and apply them to the held out fold. Every training row thus receives an encoding that never saw its own target. At inference time, and for the validation and test sets, you encode using statistics fit on the entire training set, since those rows are disjoint from the data being encoded. ```python # out-of-fold target encoding, schematic for train_idx, oof_idx in kfold.split(X): stats = smoothed_means(X[train_idx], y[train_idx]) encoded[oof_idx] = map_with_prior(X[oof_idx], stats, global_mean) # final encoder for test/inference: fit on all training rows full_stats = smoothed_means(X, y) ``` A finer grained variant computes a leave one out or running estimate. The gradient boosting library CatBoost popularized ordered target statistics, which impose a random permutation on the rows and encode each row using only the targets of rows that precede it in the permutation, averaged over several permutations to reduce variance. This online formulation eliminates leakage without sacrificing data to fixed folds and is one reason CatBoost performs strongly on high cardinality categorical data out of the box. Two further safeguards matter. First, combine cross-fitting with smoothing, since the two address different problems, leakage and small sample variance respectively. Second, add a small amount of Gaussian noise to the out of fold encodings during training when categories remain nearly separable, which further weakens any residual ability of the model to memorize through the encoded column. ### 4.5 Multiclass and Count Variants For a multiclass target with $C$ classes, target encoding produces $C-1$ columns per categorical variable, one conditional probability per class with the usual reference dropped. This partially reintroduces dimensionality, though far below one-hot for high cardinality inputs. A lighter alternative is count or frequency encoding, which replaces each level with its number of occurrences or its relative frequency. Frequency encoding ignores the target entirely, so it cannot leak, and it captures the often predictive notion of how common a category is, but it conflates distinct categories that happen to share a frequency and should be viewed as a complement to rather than a substitute for target encoding. ## 5. Feature Hashing ### 5.1 The Hashing Trick Feature hashing, or the hashing trick, sidesteps the need to enumerate categories at all. A hash function $h$ maps each category string to an index in a fixed range $\{0, 1, \dots, d-1\}$, and the category is encoded as the corresponding position in a $d$ dimensional vector. The output dimension $d$ is chosen in advance and is independent of the true cardinality $K$, which may be unknown or unbounded. To reduce the bias introduced by collisions, a signed variant uses a second hash $\xi$ mapping to $\{-1, +1\}$ and writes $\pm 1$ into the bucket, $$ \phi_j(c) = \sum_{c : h(c) = j} \xi(c), $$ so that collisions tend to cancel in expectation rather than always adding. This signed hashing is unbiased for inner products, which is the property that makes it attractive for linear models. ### 5.2 Properties and Trade-offs Hashing is stateless and streaming friendly. It requires no fitting pass, no stored vocabulary, and no special handling for categories unseen at training time, since any string simply hashes to some bucket. This makes it well suited to online learning, to extremely high cardinality features such as raw text n-grams, and to memory constrained or low latency production systems. The cost is collisions. When $K \gg d$, distinct categories share buckets and their signals superimpose, which injects noise and destroys interpretability, since you can no longer invert a bucket back to a category. The collision rate is governed by the load factor $K/d$. Under the idealized assumption that a good hash distributes categories uniformly and independently over the $d$ buckets, the probability that a particular category collides with at least one of the other $K-1$ categories is $$ P(\text{collision}) = 1 - \left(1 - \frac{1}{d}\right)^{K-1} \approx 1 - e^{-(K-1)/d}, $$ which is the same birthday paradox arithmetic that governs hash tables. The practical reading is that collisions become common once $K$ approaches $d$, so to keep the expected fraction of collided categories small one chooses $d$ several times larger than $K$. The practical knob is to enlarge $d$ until collisions are tolerable for the task. Hashing pairs naturally with linear models and is a staple of large scale click through rate prediction, where billions of sparse categorical features must be handled within a fixed memory budget. ## 6. Learned Entity Embeddings ### 6.1 From Lookup Table to Dense Vector Entity embeddings treat each categorical level as an index into a trainable lookup table. A variable with $K$ levels is associated with an embedding matrix $\mathbf{E} \in \mathbb{R}^{K \times d}$, and level $k$ is represented by the dense row vector $\mathbf{E}_k \in \mathbb{R}^d$, with $d$ typically far smaller than $K$. These vectors are parameters of the model and are learned by gradient descent jointly with the rest of the network, so the representation is optimized directly for the predictive task. This is the same mechanism that underlies word embeddings, applied to arbitrary categorical entities such as stores, products, or postal codes. ```python # embedding lookup, schematic embedding = EmbeddingTable(num_levels=K, dim=d) # K x d trainable vec = embedding[category_index] # d-dim dense vector ``` ### 6.2 Why Embeddings Help Unlike one-hot encoding, which forces every category to be orthogonal and equidistant, an embedding places categories in a continuous space where the geometry can express similarity. During training the network is free to position related levels close together, so that two postal codes with similar customer behavior acquire nearby vectors and the model generalizes across them. This learned similarity structure is the central advantage. It lets the model share statistical strength among related categories rather than treating each as an isolated parameter, which is especially valuable for high cardinality variables where many levels individually have little data. A practical heuristic for choosing the dimension is to scale it sublinearly with cardinality, for example $d = \min(50, \lceil K/2 \rceil)$ or $d \approx K^{0.25}$ rounded up. The dimension trades capacity against overfitting and memory and is worth tuning, but the precise value is usually not critical within a reasonable range. Embeddings shine when the cardinality is high, when the categorical variable interacts with other features in complex ways, and when enough data exists to fit the additional parameters, since each level's vector must be learned. ### 6.3 Reuse, Transfer, and Limits Because embeddings are dense vectors learned for a task, they can be inspected, visualized, and reused. Plotting store embeddings with a dimensionality reduction method often reveals interpretable clusters that mirror geography or product mix, and embeddings trained on a large task can be transferred as fixed features into a smaller downstream model. The classic demonstration came from a Kaggle forecasting competition in which entity embeddings of stores and dates substantially improved a neural network and, once extracted, lifted the performance of gradient boosted trees fed the same vectors. The limitations mirror the strengths. Embeddings require a model that can train them, typically a neural network, which is heavier than fitting a target encoder. They need sufficient data per level to learn meaningful positions, and a level seen only a handful of times will have a poorly determined vector. Unseen categories at inference time require an explicit out of vocabulary slot or a fallback. For small low cardinality problems the added machinery rarely pays off, and a simple encoding is both faster and more robust. ## 7. Practical Guidance ### 7.1 Choosing an Encoding The decision is driven primarily by cardinality and model family. For low cardinality variables, one-hot or dummy encoding is the safe default, with the reference level dropped only for unregularized linear models. For genuinely ordinal variables, integer encoding that respects the order is compact and effective. For high cardinality variables the choice widens. Leakage-safe target encoding is the strongest general purpose option for both linear and tree based models, frequency encoding is a cheap leak free complement, hashing is preferred under streaming or tight memory constraints, and learned embeddings are the right tool when a neural network is already in play and data is plentiful. ### 7.2 Cross-Cutting Concerns Several issues span every encoder. Unseen categories must be handled explicitly, whether through an "other" bucket, a hashing fallback, or a dedicated out of vocabulary embedding, because production data will contain levels absent from training. Missing values should be treated as their own category rather than silently imputed, since missingness is frequently informative. All target based encoders must be fit strictly inside the cross validation loop or via out of fold estimation, never on the full data before splitting, or the validation estimate of generalization will be optimistic. Finally, encoding choices should be evaluated empirically. The interactions among cardinality, sample size, smoothing strength, and model family are difficult to predict in advance, so the reliable path is to treat the encoder as a tunable component of the pipeline and let leakage-safe cross validation decide. ### 7.3 Encoders at a Glance The table consolidates the trade-offs developed above. The output dimension column gives the number of columns added per categorical variable, $K$ is the cardinality, $C$ the number of target classes, and $d$ a chosen output width. | Encoder | Output dim | Uses target | Leakage risk | Best when | |---|---|---|---|---| | One-hot or dummy | $K$ or $K-1$ | No | None | Low cardinality, any model | | Ordinal or contrast | $1$ | No | None | Genuine order in the levels | | Frequency or count | $1$ | No | None | Popularity is predictive | | Target or mean | $1$ (or $C-1$) | Yes | High if naive | High cardinality, batch training | | Feature hashing | $d$ | No | None | Streaming, huge or unbounded $K$ | | Entity embeddings | $d$ | Yes (jointly) | Low | Neural net, high $K$, ample data | The recurring pitfalls are worth restating as a checklist. Do not impose a fabricated order on a nominal variable with integer codes. Do not drop a reference level for a regularized linear model or a tree. Do not fit a target encoder on data the model will also train on. Do not forget to smooth high cardinality target codes, and do not forget an out of vocabulary path for categories that appear only in production. Each of these is a quiet failure that passes unit tests and surfaces only as a gap between offline and online performance. ## References 1. Micci-Barreca, D. (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. ACM SIGKDD Explorations. https://dl.acm.org/doi/10.1145/507533.507538 2. Weinberger, K., Dasgupta, A., Langford, J., Smola, A., Attenberg, J. (2009). Feature Hashing for Large Scale Multitask Learning. ICML. https://arxiv.org/abs/0902.2206 3. Guo, C., Berkhahn, F. (2016). Entity Embeddings of Categorical Variables. https://arxiv.org/abs/1604.06737 4. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., Gulin, A. (2018). CatBoost: Unbiased Boosting with Categorical Features. NeurIPS. https://arxiv.org/abs/1706.09516 5. Pargent, F., Pfisterer, F., Thomas, J., Bischl, B. (2022). Regularized Target Encoding Outperforms Traditional Methods in Supervised Machine Learning with High Cardinality Features. Computational Statistics. https://arxiv.org/abs/2104.00629 6. scikit-learn Developers. Encoding Categorical Features (User Guide). https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features 7. scikit-learn Developers. TargetEncoder and Cross-Fitting. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html 8. Category Encoders Contributors. Category Encoders Documentation. https://contrib.scikit-learn.org/category_encoders/