67 Feature Engineering for Categorical Data
Categorical variables are pervasive in applied machine learning. Customer identifiers, product SKUs, postal codes, device types, diagnosis codes, and merchant names all carry predictive signal, yet almost no learning algorithm consumes raw strings directly. The work of converting a category into a numeric representation that a model can exploit is among the most consequential decisions in a tabular modeling pipeline. A poor encoding inflates dimensionality, leaks the target, or discards the structure that made the variable useful in the first place. A good encoding preserves signal, controls variance, and respects the boundary between training and evaluation. This chapter develops the major encoding families with attention to both their statistical justification and their practical failure modes.
67.1 1. The Categorical Encoding Problem
67.1.1 1.1 What a Categorical Variable Is
A categorical variable takes values from a finite or countably infinite set of discrete levels \(\mathcal{C} = \{c_1, c_2, \dots, c_K\}\), where the labels themselves carry no inherent arithmetic meaning. We distinguish nominal variables, whose levels have no natural order (color, country, payment method), from ordinal variables, whose levels admit a meaningful ranking (education level, satisfaction rating, T-shirt size). A third practically important axis is cardinality. A variable with a handful of levels (blood type) behaves very differently from one with millions of levels (user identifier). High cardinality is where naive encodings break down, and most of the interesting engineering lives there.
67.1.2 1.2 Why Encoding Choice Matters
Encoding is not a neutral preprocessing step. It interacts with the model family, the cardinality of the variable, the size of the training set, and the risk of information leakage. Three tensions recur throughout this chapter.
The first is dimensionality versus expressiveness. One-hot encoding is faithful but explodes the feature count for high cardinality variables. The second is bias versus variance. Encodings that summarize a category by its relationship to the target reduce dimensionality dramatically but introduce variance for rare categories and bias if estimated carelessly. The third is leakage versus fidelity. The most powerful supervised encodings use the target, which means they can silently leak label information into features unless the estimation respects strict separation between the rows used to fit the encoder and the rows used to fit the model.
A useful mental model is that every encoder is a small statistical estimator embedded inside your pipeline, and it inherits all the usual concerns about overfitting, regularization, and validation.
67.2 2. One-Hot and Dummy Encoding
67.2.1 2.1 The Construction
One-hot encoding maps a categorical variable with \(K\) levels to a binary vector of length \(K\). For an observation whose category is \(c_k\), the encoding is the indicator vector \(\mathbf{e}_k\) that is \(1\) in position \(k\) and \(0\) elsewhere. Formally, the encoded feature for level \(j\) is
\[ x_j = \mathbb{1}[c = c_j], \quad j = 1, \dots, K. \]
This representation is sometimes called a one-of-\(K\) or indicator coding. It makes no assumption about ordering and treats every level as equidistant from every other level, which is exactly what we want for nominal variables.
# conceptual, not optimized
def one_hot(value, levels):
return [1 if value == c else 0 for c in levels]67.2.2 2.2 Dummy Encoding and Collinearity
Dummy encoding is a close variant that uses only \(K-1\) columns, designating one level as the reference category and representing it implicitly by the all zero vector. The distinction matters for linear models. With one-hot encoding plus an intercept term, the \(K\) indicator columns sum to the constant \(1\), which equals the intercept column. This perfect linear dependency is the dummy variable trap. The design matrix becomes rank deficient, the ordinary least squares solution is not unique, and coefficient interpretation collapses. Dropping one level restores full rank.
For regularized linear models the calculus changes. Under \(L_2\) penalization the solution is again unique even with the redundant column, and dropping a reference level introduces an asymmetry because the penalty now treats the reference category differently from the others. For tree based models, neither collinearity nor the reference level choice affects predictions, since trees split on individual indicators without inverting a design matrix. The practical rule is to drop a level for unregularized linear models and to keep all levels for regularized linear models and trees.
67.2.3 2.3 Strengths, Costs, and When to Stop
One-hot encoding is lossless, interpretable, and model agnostic, which makes it the correct default for low cardinality variables. Its weakness is purely dimensional. A variable with \(K\) levels adds \(K\) sparse columns, and several such variables interact multiplicatively in memory and training time. Beyond roughly fifty levels the sparsity becomes a liability for many model families, the per level sample size shrinks, and rare levels contribute noisy, weakly identified coefficients. At that point one of the dimensionality reducing encodings below becomes preferable. A common compromise is to one-hot only the top \(m\) most frequent levels and collapse the remaining tail into a single “other” bucket.
67.3 3. Ordinal and Contrast Encoding
67.3.1 3.1 Encoding Genuine Order
When a variable is genuinely ordinal, integer encoding that respects the ranking is both compact and informative. We map levels to integers \(0, 1, 2, \dots, K-1\) following the domain order, so that “low”, “medium”, “high” becomes \(0, 1, 2\). This single column communicates monotonic structure to the model, which a one-hot encoding would discard entirely. Tree based models exploit this cleanly because a single threshold split such as \(x \le 1\) separates the lower levels from the upper ones.
The serious hazard is imposing a false ordering on a nominal variable. Encoding {red, green, blue} as \(0, 1, 2\) asserts that green sits “between” red and blue and that the distance from red to blue is twice the distance from red to green. Linear models will take this fabricated geometry literally and may fit spurious trends. Reserve integer encoding for variables with a defensible order, and prefer one-hot or a target based scheme for nominal variables.
67.3.2 3.2 Spacing and Contrast Coding
Plain integer codes assume equal spacing between adjacent levels, which is rarely exactly true. If domain knowledge supplies meaningful numeric anchors, for example mapping satisfaction levels to survey midpoints, use those values directly rather than consecutive integers. The classical statistics literature also offers contrast coding schemes such as Helmert and polynomial contrasts that let a linear model test specific hypotheses about how successive levels differ. These are most valuable in inferential settings where the coefficients themselves are the object of interest. In predictive machine learning they are used less often, since flexible models recover the relevant structure from simpler codings, but they remain useful when interpretability of level to level differences matters.
67.4 4. Target and Mean Encoding
67.4.1 4.1 The Core Idea
Target encoding, also called mean encoding or likelihood encoding, replaces each category with a statistic of the target computed over the rows belonging to that category. For a regression target the natural choice is the conditional mean,
\[ \hat{\mu}_k = \frac{1}{n_k} \sum_{i : c_i = c_k} y_i, \]
where \(n_k\) is the number of training rows in category \(k\). For binary classification the same formula yields the empirical positive rate, an estimate of \(P(y = 1 \mid c = c_k)\). The appeal is immediate. A variable with a million levels collapses to a single numeric column that directly encodes each level’s relationship to the target, sidestepping the dimensionality explosion of one-hot encoding while retaining far more signal than an arbitrary integer code.
67.4.2 4.2 Smoothing Toward the Prior
Raw category means are dangerously high variance for rare levels. A category observed twice, both times positive, yields an estimate of \(1.0\) that the model will trust as strongly as a category observed ten thousand times. The standard remedy is empirical Bayes shrinkage toward the global mean \(\mu\), blending the category estimate with the prior in proportion to how much evidence the category provides,
\[ \tilde{\mu}_k = \frac{n_k \, \hat{\mu}_k + m \, \mu}{n_k + m}, \]
where \(m\) is a smoothing strength expressed in pseudo counts. When \(n_k\) is large the estimate is dominated by the category’s own data, and when \(n_k\) is small it is pulled toward the global mean. The hyperparameter \(m\) controls the crossover point and is best chosen by cross validation. A common alternative parameterization uses a logistic weighting
\[ \lambda(n_k) = \frac{1}{1 + e^{-(n_k - k_0)/f}}, \]
with \(\tilde{\mu}_k = \lambda(n_k)\,\hat{\mu}_k + (1 - \lambda(n_k))\,\mu\), which gives a smooth sigmoid transition governed by an inflection point \(k_0\) and a slope \(f\). Either way, smoothing is not optional for high cardinality target encoding. It is what makes the estimator usable on the long tail.
67.4.3 4.3 The Leakage Problem
Target encoding uses the label, so it can leak. If you compute \(\hat{\mu}_k\) from the same rows you then train on, each row’s encoded feature contains a contribution from that row’s own target. The model discovers that the encoded column is a slightly noisy copy of the answer, training metrics look spectacular, and validation collapses. The effect is most severe precisely for rare categories, where a single row dominates its own category mean. This is one of the most common and most costly mistakes in applied tabular modeling, and naive in place target encoding should be treated as a bug rather than a technique.
67.4.4 4.4 Leakage-Safe Cross-Fitting
The principled fix is to ensure that the encoding for any given row is estimated only from other rows. Out of fold target encoding, also called cross-fitting, achieves this. Partition the training data into \(K\) folds. For each fold, compute the category statistics using only the other \(K-1\) folds and apply them to the held out fold. Every training row thus receives an encoding that never saw its own target. At inference time, and for the validation and test sets, you encode using statistics fit on the entire training set, since those rows are disjoint from the data being encoded.
# out-of-fold target encoding, schematic
for train_idx, oof_idx in kfold.split(X):
stats = smoothed_means(X[train_idx], y[train_idx])
encoded[oof_idx] = map_with_prior(X[oof_idx], stats, global_mean)
# final encoder for test/inference: fit on all training rows
full_stats = smoothed_means(X, y)A finer grained variant computes a leave one out or running estimate. The gradient boosting library CatBoost popularized ordered target statistics, which impose a random permutation on the rows and encode each row using only the targets of rows that precede it in the permutation, averaged over several permutations to reduce variance. This online formulation eliminates leakage without sacrificing data to fixed folds and is one reason CatBoost performs strongly on high cardinality categorical data out of the box.
Two further safeguards matter. First, combine cross-fitting with smoothing, since the two address different problems, leakage and small sample variance respectively. Second, add a small amount of Gaussian noise to the out of fold encodings during training when categories remain nearly separable, which further weakens any residual ability of the model to memorize through the encoded column.
67.4.5 4.5 Multiclass and Count Variants
For a multiclass target with \(C\) classes, target encoding produces \(C-1\) columns per categorical variable, one conditional probability per class with the usual reference dropped. This partially reintroduces dimensionality, though far below one-hot for high cardinality inputs. A lighter alternative is count or frequency encoding, which replaces each level with its number of occurrences or its relative frequency. Frequency encoding ignores the target entirely, so it cannot leak, and it captures the often predictive notion of how common a category is, but it conflates distinct categories that happen to share a frequency and should be viewed as a complement to rather than a substitute for target encoding.
67.5 5. Feature Hashing
67.5.1 5.1 The Hashing Trick
Feature hashing, or the hashing trick, sidesteps the need to enumerate categories at all. A hash function \(h\) maps each category string to an index in a fixed range \(\{0, 1, \dots, d-1\}\), and the category is encoded as the corresponding position in a \(d\) dimensional vector. The output dimension \(d\) is chosen in advance and is independent of the true cardinality \(K\), which may be unknown or unbounded. To reduce the bias introduced by collisions, a signed variant uses a second hash \(\xi\) mapping to \(\{-1, +1\}\) and writes \(\pm 1\) into the bucket,
\[ \phi_j(c) = \sum_{c : h(c) = j} \xi(c), \]
so that collisions tend to cancel in expectation rather than always adding. This signed hashing is unbiased for inner products, which is the property that makes it attractive for linear models.
67.5.2 5.2 Properties and Trade-offs
Hashing is stateless and streaming friendly. It requires no fitting pass, no stored vocabulary, and no special handling for categories unseen at training time, since any string simply hashes to some bucket. This makes it well suited to online learning, to extremely high cardinality features such as raw text n-grams, and to memory constrained or low latency production systems. The cost is collisions. When \(K \gg d\), distinct categories share buckets and their signals superimpose, which injects noise and destroys interpretability, since you can no longer invert a bucket back to a category. The collision rate is governed by the load factor \(K/d\), and the practical knob is to enlarge \(d\) until collisions are tolerable for the task. Hashing pairs naturally with linear models and is a staple of large scale click through rate prediction, where billions of sparse categorical features must be handled within a fixed memory budget.
67.6 6. Learned Entity Embeddings
67.6.1 6.1 From Lookup Table to Dense Vector
Entity embeddings treat each categorical level as an index into a trainable lookup table. A variable with \(K\) levels is associated with an embedding matrix \(\mathbf{E} \in \mathbb{R}^{K \times d}\), and level \(k\) is represented by the dense row vector \(\mathbf{E}_k \in \mathbb{R}^d\), with \(d\) typically far smaller than \(K\). These vectors are parameters of the model and are learned by gradient descent jointly with the rest of the network, so the representation is optimized directly for the predictive task. This is the same mechanism that underlies word embeddings, applied to arbitrary categorical entities such as stores, products, or postal codes.
# embedding lookup, schematic
embedding = EmbeddingTable(num_levels=K, dim=d) # K x d trainable
vec = embedding[category_index] # d-dim dense vector67.6.2 6.2 Why Embeddings Help
Unlike one-hot encoding, which forces every category to be orthogonal and equidistant, an embedding places categories in a continuous space where the geometry can express similarity. During training the network is free to position related levels close together, so that two postal codes with similar customer behavior acquire nearby vectors and the model generalizes across them. This learned similarity structure is the central advantage. It lets the model share statistical strength among related categories rather than treating each as an isolated parameter, which is especially valuable for high cardinality variables where many levels individually have little data.
A practical heuristic for choosing the dimension is to scale it sublinearly with cardinality, for example \(d = \min(50, \lceil K/2 \rceil)\) or \(d \approx K^{0.25}\) rounded up. The dimension trades capacity against overfitting and memory and is worth tuning, but the precise value is usually not critical within a reasonable range. Embeddings shine when the cardinality is high, when the categorical variable interacts with other features in complex ways, and when enough data exists to fit the additional parameters, since each level’s vector must be learned.
67.6.3 6.3 Reuse, Transfer, and Limits
Because embeddings are dense vectors learned for a task, they can be inspected, visualized, and reused. Plotting store embeddings with a dimensionality reduction method often reveals interpretable clusters that mirror geography or product mix, and embeddings trained on a large task can be transferred as fixed features into a smaller downstream model. The classic demonstration came from a Kaggle forecasting competition in which entity embeddings of stores and dates substantially improved a neural network and, once extracted, lifted the performance of gradient boosted trees fed the same vectors.
The limitations mirror the strengths. Embeddings require a model that can train them, typically a neural network, which is heavier than fitting a target encoder. They need sufficient data per level to learn meaningful positions, and a level seen only a handful of times will have a poorly determined vector. Unseen categories at inference time require an explicit out of vocabulary slot or a fallback. For small low cardinality problems the added machinery rarely pays off, and a simple encoding is both faster and more robust.
67.7 7. Practical Guidance
67.7.1 7.1 Choosing an Encoding
The decision is driven primarily by cardinality and model family. For low cardinality variables, one-hot or dummy encoding is the safe default, with the reference level dropped only for unregularized linear models. For genuinely ordinal variables, integer encoding that respects the order is compact and effective. For high cardinality variables the choice widens. Leakage-safe target encoding is the strongest general purpose option for both linear and tree based models, frequency encoding is a cheap leak free complement, hashing is preferred under streaming or tight memory constraints, and learned embeddings are the right tool when a neural network is already in play and data is plentiful.
67.7.2 7.2 Cross-Cutting Concerns
Several issues span every encoder. Unseen categories must be handled explicitly, whether through an “other” bucket, a hashing fallback, or a dedicated out of vocabulary embedding, because production data will contain levels absent from training. Missing values should be treated as their own category rather than silently imputed, since missingness is frequently informative. All target based encoders must be fit strictly inside the cross validation loop or via out of fold estimation, never on the full data before splitting, or the validation estimate of generalization will be optimistic. Finally, encoding choices should be evaluated empirically. The interactions among cardinality, sample size, smoothing strength, and model family are difficult to predict in advance, so the reliable path is to treat the encoder as a tunable component of the pipeline and let leakage-safe cross validation decide.
67.8 References
- Micci-Barreca, D. (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. ACM SIGKDD Explorations. https://dl.acm.org/doi/10.1145/507533.507538
- Weinberger, K., Dasgupta, A., Langford, J., Smola, A., Attenberg, J. (2009). Feature Hashing for Large Scale Multitask Learning. ICML. https://arxiv.org/abs/0902.2206
- Guo, C., Berkhahn, F. (2016). Entity Embeddings of Categorical Variables. https://arxiv.org/abs/1604.06737
- Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., Gulin, A. (2018). CatBoost: Unbiased Boosting with Categorical Features. NeurIPS. https://arxiv.org/abs/1706.09516
- Pargent, F., Pfisterer, F., Thomas, J., Bischl, B. (2022). Regularized Target Encoding Outperforms Traditional Methods in Supervised Machine Learning with High Cardinality Features. Computational Statistics. https://arxiv.org/abs/2104.00629
- scikit-learn Developers. Encoding Categorical Features (User Guide). https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
- scikit-learn Developers. TargetEncoder and Cross-Fitting. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html
- Category Encoders Contributors. Category Encoders Documentation. https://contrib.scikit-learn.org/category_encoders/